* [RFC] respect the referenced bit of KVM guest pages?
From: Wu Fengguang @ 2009-08-05 2:40 UTC
To: Rik van Riel
Cc: Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Andrea Arcangeli,
Avi Kivity, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
Greetings,
Jeff Dike found that many KVM pages are being refaulted in 2.6.29:
"Lots of pages being discarded due to memory pressure only to be
faulted back in soon after. These pages are nearly all stack pages.
This is not consistent - sometimes there are relatively few such pages
and they are spread out between processes."
The refaults can be drastically reduced by the following patch, which
respects the referenced bit of all anonymous pages (including the KVM
pages).
However it risks reintroducing the problem addressed by commit 7e9cd4842
(fix reclaim scalability problem by ignoring the referenced bit,
mainly the pte young bit). I wonder if there are better solutions?
Thanks,
Fengguang
---
mm/vmscan.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
--- linux.orig/mm/vmscan.c
+++ linux/mm/vmscan.c
@@ -1288,12 +1288,12 @@ static void shrink_active_list(unsigned
* Identify referenced, file-backed active pages and
* give them one more trip around the active list. So
* that executable code get better chances to stay in
- * memory under moderate memory pressure. Anon pages
- * are not likely to be evicted by use-once streaming
- * IO, plus JVM can create lots of anon VM_EXEC pages,
- * so we ignore them here.
+ * memory under moderate memory pressure.
+ *
+ * Also protect anon pages: swapping could be costly,
+ * and KVM guest's referenced bit is helpful.
*/
- if ((vm_flags & VM_EXEC) && !PageAnon(page)) {
+ if ((vm_flags & VM_EXEC) || PageAnon(page)) {
list_add(&page->lru, &l_active);
continue;
}
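For readers skimming the diff, the behavioral change boils down to a truth table. The sketch below is a plain userspace illustration (hypothetical helper names, not kernel code) of which referenced active pages get another trip around the active list before and after the patch:

```c
#include <stdbool.h>

/* before the patch: only referenced executable *file* pages are
 * given another round on the active list */
static bool keep_active_before(bool vm_exec, bool page_anon)
{
    return vm_exec && !page_anon;
}

/* after the patch: any executable page or any anon page is kept */
static bool keep_active_after(bool vm_exec, bool page_anon)
{
    return vm_exec || page_anon;
}
```

The interesting row is the anonymous, non-VM_EXEC page (e.g. a KVM guest page): previously demoted to the inactive list regardless of its referenced bit, now kept on the active list.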
* Re: [RFC] respect the referenced bit of KVM guest pages?
From: KOSAKI Motohiro @ 2009-08-05 4:15 UTC
To: Wu Fengguang
Cc: kosaki.motohiro, Rik van Riel, Dike, Jeffrey G, Yu, Wilfred,
Kleen, Andi, Andrea Arcangeli, Avi Kivity, Hugh Dickins,
Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm
Hi
> Greetings,
>
> Jeff Dike found that many KVM pages are being refaulted in 2.6.29:
>
> "Lots of pages being discarded due to memory pressure only to be
> faulted back in soon after. These pages are nearly all stack pages.
> This is not consistent - sometimes there are relatively few such pages
> and they are spread out between processes."
This result really surprises me.
- Why does this issue happen only on KVM?
- Why can't shrink_inactive_list() find the pte young bit?
Is this really unused stack?
>
> The refaults can be drastically reduced by the following patch, which
> respects the referenced bit of all anonymous pages (including the KVM
> pages).
>
> However it risks reintroducing the problem addressed by commit 7e9cd4842
> (fix reclaim scalability problem by ignoring the referenced bit,
> mainly the pte young bit). I wonder if there are better solutions?
>
> Thanks,
> Fengguang
>
> ---
> mm/vmscan.c | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
>
> --- linux.orig/mm/vmscan.c
> +++ linux/mm/vmscan.c
> @@ -1288,12 +1288,12 @@ static void shrink_active_list(unsigned
> * Identify referenced, file-backed active pages and
> * give them one more trip around the active list. So
> * that executable code get better chances to stay in
> - * memory under moderate memory pressure. Anon pages
> - * are not likely to be evicted by use-once streaming
> - * IO, plus JVM can create lots of anon VM_EXEC pages,
> - * so we ignore them here.
> + * memory under moderate memory pressure.
> + *
> + * Also protect anon pages: swapping could be costly,
> + * and KVM guest's referenced bit is helpful.
> */
> - if ((vm_flags & VM_EXEC) && !PageAnon(page)) {
> + if ((vm_flags & VM_EXEC) || PageAnon(page)) {
> list_add(&page->lru, &l_active);
> continue;
> }
* Re: [RFC] respect the referenced bit of KVM guest pages?
From: Wu Fengguang @ 2009-08-05 4:41 UTC
To: KOSAKI Motohiro
Cc: Rik van Riel, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
Andrea Arcangeli, Avi Kivity, Hugh Dickins, Andrew Morton,
Christoph Lameter, Mel Gorman, LKML, linux-mm
On Wed, Aug 05, 2009 at 12:15:40PM +0800, KOSAKI Motohiro wrote:
> Hi
>
> > Greetings,
> >
> > Jeff Dike found that many KVM pages are being refaulted in 2.6.29:
> >
> > "Lots of pages being discarded due to memory pressure only to be
> > faulted back in soon after. These pages are nearly all stack pages.
> > This is not consistent - sometimes there are relatively few such pages
> > and they are spread out between processes."
>
> This result really surprises me.
>
> - Why does this issue happen only on KVM?
Maybe because
- they take up a large portion of memory
- their access patterns/frequencies vary a lot
> - Why can't shrink_inactive_list() find the pte young bit?
It can, but I guess the grace period would be much shorter than with
this patch.
> Is this really unused stack?
They were actually being refaulted, so they should be pages that
are neither too hot nor too cold.
Thanks,
Fengguang
> >
> > The refaults can be drastically reduced by the following patch, which
> > respects the referenced bit of all anonymous pages (including the KVM
> > pages).
> >
> > However it risks reintroducing the problem addressed by commit 7e9cd4842
> > (fix reclaim scalability problem by ignoring the referenced bit,
> > mainly the pte young bit). I wonder if there are better solutions?
> >
> > Thanks,
> > Fengguang
> >
> > ---
> > mm/vmscan.c | 10 +++++-----
> > 1 file changed, 5 insertions(+), 5 deletions(-)
> >
> > --- linux.orig/mm/vmscan.c
> > +++ linux/mm/vmscan.c
> > @@ -1288,12 +1288,12 @@ static void shrink_active_list(unsigned
> > * Identify referenced, file-backed active pages and
> > * give them one more trip around the active list. So
> > * that executable code get better chances to stay in
> > - * memory under moderate memory pressure. Anon pages
> > - * are not likely to be evicted by use-once streaming
> > - * IO, plus JVM can create lots of anon VM_EXEC pages,
> > - * so we ignore them here.
> > + * memory under moderate memory pressure.
> > + *
> > + * Also protect anon pages: swapping could be costly,
> > + * and KVM guest's referenced bit is helpful.
> > */
> > - if ((vm_flags & VM_EXEC) && !PageAnon(page)) {
> > + if ((vm_flags & VM_EXEC) || PageAnon(page)) {
> > list_add(&page->lru, &l_active);
> > continue;
> > }
>
>
* Re: [RFC] respect the referenced bit of KVM guest pages?
From: Avi Kivity @ 2009-08-05 7:58 UTC
To: Wu Fengguang
Cc: Rik van Riel, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
On 08/05/2009 05:40 AM, Wu Fengguang wrote:
> Greetings,
>
> Jeff Dike found that many KVM pages are being refaulted in 2.6.29:
>
> "Lots of pages being discarded due to memory pressure only to be
> faulted back in soon after. These pages are nearly all stack pages.
> This is not consistent - sometimes there are relatively few such pages
> and they are spread out between processes."
>
> The refaults can be drastically reduced by the following patch, which
> respects the referenced bit of all anonymous pages (including the KVM
> pages).
>
> However it risks reintroducing the problem addressed by commit 7e9cd4842
> (fix reclaim scalability problem by ignoring the referenced bit,
> mainly the pte young bit). I wonder if there are better solutions?
>
How do you distinguish between kvm pages and non-kvm anonymous pages?
More importantly, why should you?
Jeff, do you see the refaults on Nehalem systems? If so, that's likely
due to the lack of an accessed bit on EPT pagetables. It would be
interesting to compare with Barcelona (which does).
If that's indeed the case, we can have the EPT ageing mechanism give
pages a bit more time around by using an available bit in the EPT PTEs
to return accessed on the first pass and not-accessed on the second.
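A minimal userspace sketch of that two-pass idea (hypothetical names and bit layout, not the actual KVM code): a fresh mapping carries a software "accessed" bit in a position ignored by the EPT hardware; the first ageing pass reports the page young and clears the bit, and subsequent passes report it old:

```c
#include <stdbool.h>
#include <stdint.h>

#define SPTE_PRESENT      (1ull << 0)
#define SPTE_SW_ACCESSED  (1ull << 63)  /* available (hardware-ignored) bit */

/* hypothetical: install a mapping with the software accessed bit set */
static uint64_t ept_make_spte(uint64_t pfn_bits)
{
    return pfn_bits | SPTE_PRESENT | SPTE_SW_ACCESSED;
}

/* hypothetical ageing pass: young on the first call after install,
 * old on every later call until the page is (re)mapped */
static bool ept_age_spte(uint64_t *spte)
{
    if (!(*spte & SPTE_PRESENT) || !(*spte & SPTE_SW_ACCESSED))
        return false;
    *spte &= ~SPTE_SW_ACCESSED;
    return true;
}
```

Since the hardware never sets the bit again on guest access, every page survives exactly one extra ageing pass regardless of use.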
--
error compiling committee.c: too many arguments to function
* Re: [RFC] respect the referenced bit of KVM guest pages?
From: Avi Kivity @ 2009-08-05 8:17 UTC
To: Wu Fengguang
Cc: Rik van Riel, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm, Andrea Arcangeli,
KVM list
[-- Attachment #1: Type: text/plain, Size: 1455 bytes --]
On 08/05/2009 10:58 AM, Avi Kivity wrote:
> On 08/05/2009 05:40 AM, Wu Fengguang wrote:
>> Greetings,
>>
>> Jeff Dike found that many KVM pages are being refaulted in 2.6.29:
>>
>> "Lots of pages being discarded due to memory pressure only to be
>> faulted back in soon after. These pages are nearly all stack pages.
>> This is not consistent - sometimes there are relatively few such pages
>> and they are spread out between processes."
>>
>> The refaults can be drastically reduced by the following patch, which
>> respects the referenced bit of all anonymous pages (including the KVM
>> pages).
>>
>> However it risks reintroducing the problem addressed by commit 7e9cd4842
>> (fix reclaim scalability problem by ignoring the referenced bit,
>> mainly the pte young bit). I wonder if there are better solutions?
>
> How do you distinguish between kvm pages and non-kvm anonymous pages?
> More importantly, why should you?
>
> Jeff, do you see the refaults on Nehalem systems? If so, that's
> likely due to the lack of an accessed bit on EPT pagetables. It would
> be interesting to compare with Barcelona (which does).
>
> If that's indeed the case, we can have the EPT ageing mechanism give
> pages a bit more time around by using an available bit in the EPT PTEs
> to return accessed on the first pass and not-accessed on the second.
>
The attached patch implements this.
--
error compiling committee.c: too many arguments to function
[-- Attachment #2: ept-emulate-accessed-bit.patch --]
[-- Type: text/x-patch, Size: 2115 bytes --]
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 7b53614..310938a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -195,6 +195,7 @@ static u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
static u64 __read_mostly shadow_user_mask;
static u64 __read_mostly shadow_accessed_mask;
static u64 __read_mostly shadow_dirty_mask;
+static int __read_mostly shadow_accessed_shift;
static inline u64 rsvd_bits(int s, int e)
{
@@ -219,6 +220,8 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
{
shadow_user_mask = user_mask;
shadow_accessed_mask = accessed_mask;
+ shadow_accessed_shift
+ = find_first_bit((void *)&shadow_accessed_mask, 64);
shadow_dirty_mask = dirty_mask;
shadow_nx_mask = nx_mask;
shadow_x_mask = x_mask;
@@ -817,11 +820,11 @@ static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp)
while (spte) {
int _young;
u64 _spte = *spte;
- BUG_ON(!(_spte & PT_PRESENT_MASK));
- _young = _spte & PT_ACCESSED_MASK;
+ BUG_ON(!(_spte & shadow_accessed_mask));
+ _young = _spte & shadow_accessed_mask;
if (_young) {
young = 1;
- clear_bit(PT_ACCESSED_SHIFT, (unsigned long *)spte);
+ clear_bit(shadow_accessed_shift, (unsigned long *)spte);
}
spte = rmap_next(kvm, rmapp, spte);
}
@@ -2572,7 +2575,7 @@ static void kvm_mmu_access_page(struct kvm_vcpu *vcpu, gfn_t gfn)
&& shadow_accessed_mask
&& !(*spte & shadow_accessed_mask)
&& is_shadow_present_pte(*spte))
- set_bit(PT_ACCESSED_SHIFT, (unsigned long *)spte);
+ set_bit(shadow_accessed_shift, (unsigned long *)spte);
}
void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 0ba706e..bc99367 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4029,7 +4029,7 @@ static int __init vmx_init(void)
bypass_guest_pf = 0;
kvm_mmu_set_base_ptes(VMX_EPT_READABLE_MASK |
VMX_EPT_WRITABLE_MASK);
- kvm_mmu_set_mask_ptes(0ull, 0ull, 0ull, 0ull,
+ kvm_mmu_set_mask_ptes(0ull, 1ull << 63, 0ull, 0ull,
VMX_EPT_EXECUTABLE_MASK);
kvm_enable_tdp();
} else
* Re: [RFC] respect the referenced bit of KVM guest pages?
From: Rik van Riel @ 2009-08-05 14:15 UTC
To: Avi Kivity
Cc: Wu Fengguang, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
Avi Kivity wrote:
>> However it risks reintroducing the problem addressed by commit 7e9cd4842
>> (fix reclaim scalability problem by ignoring the referenced bit,
>> mainly the pte young bit). I wonder if there are better solutions?
Agreed, we need to figure out what the real problem is,
and how to solve it better.
> Jeff, do you see the refaults on Nehalem systems? If so, that's likely
> due to the lack of an accessed bit on EPT pagetables. It would be
> interesting to compare with Barcelona (which does).
Not having a hardware accessed bit would explain why
the VM is not reactivating the pages that were accessed
while on the inactive list.
> If that's indeed the case, we can have the EPT ageing mechanism give
> pages a bit more time around by using an available bit in the EPT PTEs
> to return accessed on the first pass and not-accessed on the second.
Can we find out which pages are EPT pages?
If so, we could unmap them when they get moved from the
active to the inactive list, and soft fault them back in
on access, emulating the referenced bit for EPT pages and
making page replacement on them work like it should.
Your approximation of pretending the page is accessed the
first time and pretending it's not the second time sounds
like it will just lead to less efficient FIFO replacement,
not to anything even vaguely approximating LRU.
--
All rights reversed.
* Re: [RFC] respect the referenced bit of KVM guest pages?
From: Rik van Riel @ 2009-08-05 14:33 UTC
To: Avi Kivity
Cc: Wu Fengguang, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm, KVM list
Avi Kivity wrote:
> The attached patch implements this.
The attached patch requires each page to go around twice
before it is evicted, but they will still get evicted in
the order in which they were made present.
FIFO page replacement was shown to be a bad idea in the
1960's and it is still a terrible idea today.
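The difference is easy to demonstrate with a toy replacement simulator (illustrative only; "pages" are plain integers and there are three frames). On a trace with reuse, strict FIFO takes more faults than LRU because it evicts a hot page merely for having been mapped first:

```c
#include <stddef.h>

#define NFRAMES 3

static int find(const int *frames, int n, int page)
{
    for (int i = 0; i < n; i++)
        if (frames[i] == page)
            return i;
    return -1;
}

/* FIFO: on a fault, evict the longest-resident page */
static int faults_fifo(const int *trace, int len)
{
    int frames[NFRAMES], used = 0, hand = 0, faults = 0;
    for (int i = 0; i < len; i++) {
        if (find(frames, used, trace[i]) >= 0)
            continue;                     /* hit */
        faults++;
        if (used < NFRAMES) {
            frames[used++] = trace[i];
        } else {
            frames[hand] = trace[i];      /* evict oldest resident */
            hand = (hand + 1) % NFRAMES;
        }
    }
    return faults;
}

/* LRU: frames kept ordered least-recent (index 0) to most-recent */
static int faults_lru(const int *trace, int len)
{
    int frames[NFRAMES], used = 0, faults = 0;
    for (int i = 0; i < len; i++) {
        int pos = find(frames, used, trace[i]);
        if (pos < 0) {                    /* fault */
            faults++;
            if (used < NFRAMES)
                pos = used++;
            else
                pos = 0;                  /* victim: least recent */
        }
        for (int j = pos; j < used - 1; j++)   /* close the gap */
            frames[j] = frames[j + 1];
        frames[used - 1] = trace[i];      /* most recent at the end */
    }
    return faults;
}
```

With the trace {1, 2, 3, 1, 4, 1, 5}, FIFO takes 6 faults while LRU takes 5: FIFO evicts the repeatedly used page 1 even though it was just touched.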
--
All rights reversed.
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-05 14:15 ` Rik van Riel
@ 2009-08-05 15:12 ` Avi Kivity
-1 siblings, 0 replies; 243+ messages in thread
From: Avi Kivity @ 2009-08-05 15:12 UTC (permalink / raw)
To: Rik van Riel
Cc: Wu Fengguang, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
On 08/05/2009 05:15 PM, Rik van Riel wrote:
>> If that's indeed the case, we can have the EPT ageing mechanism give
>> pages a bit more time around by using an available bit in the EPT
>> PTEs to return accessed on the first pass and not-accessed on the
>> second.
>
> Can we find out which pages are EPT pages?
>
No need to (see below).
> If so, we could unmap them when they get moved from the
> active to the inactive list, and soft fault them back in
> on access, emulating the referenced bit for EPT pages and
> making page replacement on them work like it should.
It should be easy to implement via the mmu notifier callback: when the
mm calls clear_flush_young(), mark it as young, and unmap it from the
EPT pagetable.
> Your approximation of pretending the page is accessed the
> first time and pretending it's not the second time sounds
> like it will just lead to less efficient FIFO replacement,
> not to anything even vaguely approximating LRU.
Right, it's just a hack that gives EPT pages higher priority, like the
original patch suggested. Note that LRU for VMs is not a good
algorithm, since the VM will also reference the least recently used
page, leading to thrashing.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-05 15:12 ` Avi Kivity
@ 2009-08-05 15:15 ` Rik van Riel
-1 siblings, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2009-08-05 15:15 UTC (permalink / raw)
To: Avi Kivity
Cc: Wu Fengguang, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
Avi Kivity wrote:
>> If so, we could unmap them when they get moved from the
>> active to the inactive list, and soft fault them back in
>> on access, emulating the referenced bit for EPT pages and
>> making page replacement on them work like it should.
>
> It should be easy to implement via the mmu notifier callback: when the
> mm calls clear_flush_young(), mark it as young, and unmap it from the
> EPT pagetable.
You mean "mark it as old"?
>> Your approximation of pretending the page is accessed the
>> first time and pretending it's not the second time sounds
>> like it will just lead to less efficient FIFO replacement,
>> not to anything even vaguely approximating LRU.
>
> Right, it's just a hack that gives EPT pages higher priority, like the
> original patch suggested. Note that LRU for VMs is not a good
> algorithm, since the VM will also reference the least recently used
> page, leading to thrashing.
That is one of the reasons we use a very coarse two
handed clock algorithm instead of true LRU.
LRU has more overhead and more artifacts :)
--
All rights reversed.
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-05 15:15 ` Rik van Riel
@ 2009-08-05 15:25 ` Avi Kivity
-1 siblings, 0 replies; 243+ messages in thread
From: Avi Kivity @ 2009-08-05 15:25 UTC (permalink / raw)
To: Rik van Riel
Cc: Wu Fengguang, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
On 08/05/2009 06:15 PM, Rik van Riel wrote:
> Avi Kivity wrote:
>
>>> If so, we could unmap them when they get moved from the
>>> active to the inactive list, and soft fault them back in
>>> on access, emulating the referenced bit for EPT pages and
>>> making page replacement on them work like it should.
>>
>> It should be easy to implement via the mmu notifier callback: when
>> the mm calls clear_flush_young(), mark it as young, and unmap it from
>> the EPT pagetable.
>
> You mean "mark it as old"?
I meant 'return young, and drop it from the EPT pagetable'.
If we use the present bit as a replacement for the accessed bit, present
means young, and clear_flush_young means "if present, return young and
unmap, otherwise return old".
See kvm_age_rmapp() in arch/x86/kvm/mmu.c.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-05 14:33 ` Rik van Riel
@ 2009-08-05 15:37 ` Avi Kivity
-1 siblings, 0 replies; 243+ messages in thread
From: Avi Kivity @ 2009-08-05 15:37 UTC (permalink / raw)
To: Rik van Riel
Cc: Wu Fengguang, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm, KVM list
On 08/05/2009 05:33 PM, Rik van Riel wrote:
> Avi Kivity wrote:
>
>> The attached patch implements this.
>
> The attached patch requires each page to go around twice
> before it is evicted, but they will still get evicted in
> the order in which they were made present.
>
> FIFO page replacement was shown to be a bad idea in the
> 1960's and it is still a terrible idea today.
>
Which is why we have accessed bits in page tables... but emulating the
accessed bit via RWX (note no present bit in EPT) is better than
ignoring it.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 243+ messages in thread
* RE: [RFC] respect the referenced bit of KVM guest pages?
2009-08-05 7:58 ` Avi Kivity
@ 2009-08-05 15:45 ` Dike, Jeffrey G
-1 siblings, 0 replies; 243+ messages in thread
From: Dike, Jeffrey G @ 2009-08-05 15:45 UTC (permalink / raw)
To: Avi Kivity, Wu, Fengguang
Cc: Rik van Riel, Yu, Wilfred, Kleen, Andi, Andrea Arcangeli,
Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro,
Mel Gorman, LKML, linux-mm
> Jeff, do you see the refaults on Nehalem systems?
My test box is pre-Nehalem - no EPT.
Jeff
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-05 2:40 ` Wu Fengguang
@ 2009-08-05 15:58 ` Andrea Arcangeli
-1 siblings, 0 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2009-08-05 15:58 UTC (permalink / raw)
To: Wu Fengguang
Cc: Rik van Riel, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
Avi Kivity, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
On Wed, Aug 05, 2009 at 10:40:58AM +0800, Wu Fengguang wrote:
> */
> - if ((vm_flags & VM_EXEC) && !PageAnon(page)) {
> + if ((vm_flags & VM_EXEC) || PageAnon(page)) {
> list_add(&page->lru, &l_active);
> continue;
> }
>
Please nuke the whole check and do an unconditional list_add;
continue; there.
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-05 7:58 ` Avi Kivity
@ 2009-08-05 16:05 ` Andrea Arcangeli
-1 siblings, 0 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2009-08-05 16:05 UTC (permalink / raw)
To: Avi Kivity
Cc: Wu Fengguang, Rik van Riel, Dike, Jeffrey G, Yu, Wilfred, Kleen,
Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
On Wed, Aug 05, 2009 at 10:58:10AM +0300, Avi Kivity wrote:
> How do you distinguish between kvm pages and non-kvm anonymous pages?
> More importantly, why should you?
It can't distinguish. Besides, the pages being refaulted (as minor
faults) implies they weren't collected yet. So whether they are
allowed to stay on the active list can't matter or alter the
refaulting issue.
> Jeff, do you see the refaults on Nehalem systems? If so, that's likely
> due to the lack of an accessed bit on EPT pagetables. It would be
> interesting to compare with Barcelona (which does).
It seems it wasn't using EPT.
Refaulting as minor faults is still possible with or without EPT and
the young bit... when the young bit is found not set, we just unmap the
spte/pte and leave the page on the lru for a while until it is
collected. So it can be refaulted even with the young bit perfectly
functional in the spte and pte.
But the _whole_ point of the NPT young bit (shame on EPT), and of the
young bit in the pte, is to try not to unmap the pagetables to get the
aging information. So there's one more pass with the young bit
functional compared to without it, but that doesn't mean that when the
young bit is clear at the second pass we immediately free the page; we
just go into the "refaulting lru cache waiting to be collected". And if
the page isn't actually collected, it doesn't matter whether it's on
the active or inactive list, so the patch can't matter if it's "minor"
refaults we're talking about here :).
^ permalink raw reply [flat|nested] 243+ messages in thread
* RE: [RFC] respect the referenced bit of KVM guest pages?
2009-08-05 16:05 ` Andrea Arcangeli
@ 2009-08-05 16:12 ` Dike, Jeffrey G
-1 siblings, 0 replies; 243+ messages in thread
From: Dike, Jeffrey G @ 2009-08-05 16:12 UTC (permalink / raw)
To: Andrea Arcangeli, Avi Kivity
Cc: Wu, Fengguang, Rik van Riel, Yu, Wilfred, Kleen, Andi,
Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro,
Mel Gorman, LKML, linux-mm
> It can't distinguish. Besides the pages being refaulted (as minor
> faults) implies they weren't collected yet. So the fact they are
> allowed to stay on active list or not can't matter or alter the
> refaulting issue.
Sounds like there's some terminology confusion. A refault is a page being discarded due to memory pressure and subsequently faulted back in. I was counting the number of faults between the discard and the fault-in for each affected page. For a large number of predominantly stack pages, that number was very small.
Jeff
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-05 16:12 ` Dike, Jeffrey G
@ 2009-08-05 16:19 ` Andrea Arcangeli
-1 siblings, 0 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2009-08-05 16:19 UTC (permalink / raw)
To: Dike, Jeffrey G
Cc: Avi Kivity, Wu, Fengguang, Rik van Riel, Yu, Wilfred, Kleen,
Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
On Wed, Aug 05, 2009 at 09:12:39AM -0700, Dike, Jeffrey G wrote:
> Sounds like there's some terminology confusion. A refault is a page
> being discarded due to memory pressure and subsequently being
> faulted back in. I was counting the number of faults between the
> discard and faulting back in for each affected page. For a large
> number of predominately stack pages, that number was very small.
Hmm ok, but if it's anonymous pages we're talking about here (I see
KVM in the equation so it has to be!), we normally call that thing
swapin, to imply I/O is involved, not refault... Refault to me means a
minor fault from swapcache (clean or dirty), and that's about it...
An anon page becomes swapcache, it is unmapped if the young bit
permits, and then it's eventually collected from the lru; if it is
collected, I/O will be generated as swapin during the next page fault.
If it's too much swapin, then yes, it could be that patch that
prevents the young bit from keeping the anon pages on the active list.
But the fix is to remove the whole check, not just to enable list_add
for anon pages.
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-05 14:15 ` Rik van Riel
@ 2009-08-05 16:31 ` Andrea Arcangeli
-1 siblings, 0 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2009-08-05 16:31 UTC (permalink / raw)
To: Rik van Riel
Cc: Avi Kivity, Wu Fengguang, Dike, Jeffrey G, Yu, Wilfred, Kleen,
Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
On Wed, Aug 05, 2009 at 10:15:16AM -0400, Rik van Riel wrote:
> Not having a hardware accessed bit would explain why
> the VM is not reactivating the pages that were accessed
> while on the inactive list.
Problem is, even with the young bit functional the VM isn't
reactivating those pages anyway because of that broken check... That
check should be nuked entirely in my view, as it fundamentally thinks
it can outsmart the VM intelligence by checking a bit in the vma...
quite absurd.
> Can we find out which pages are EPT pages?
>
> If so, we could unmap them when they get moved from the
> active to the inactive list, and soft fault them back in
> on access, emulating the referenced bit for EPT pages and
> making page replacement on them work like it should.
>
> Your approximation of pretending the page is accessed the
> first time and pretending it's not the second time sounds
> like it will just lead to less efficient FIFO replacement,
> not to anything even vaguely approximating LRU.
I think it'll still be better than the current situation, as the young
bit is always set for ptes. Otherwise EPT pages are too penalized; we
need them to stay one more round in the active list like everything
else. They are too penalized anyway, because at the second pass they'll
be forced out of the active list and unmapped.
This is what alpha and all other archs without a hardware-set young
bit have to do. They set the young bit in software, clear it in
software, and then set it again in software if there's a page fault
(hopefully a minor fault). Returning "not young" the first time sounds
worse to me.
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-05 15:25 ` Avi Kivity
@ 2009-08-05 16:35 ` Andrea Arcangeli
-1 siblings, 0 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2009-08-05 16:35 UTC (permalink / raw)
To: Avi Kivity
Cc: Rik van Riel, Wu Fengguang, Dike, Jeffrey G, Yu, Wilfred, Kleen,
Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
On Wed, Aug 05, 2009 at 06:25:28PM +0300, Avi Kivity wrote:
> On 08/05/2009 06:15 PM, Rik van Riel wrote:
> > Avi Kivity wrote:
> >
> >>> If so, we could unmap them when they get moved from the
> >>> active to the inactive list, and soft fault them back in
> >>> on access, emulating the referenced bit for EPT pages and
> >>> making page replacement on them work like it should.
> >>
> >> It should be easy to implement via the mmu notifier callback: when
> >> the mm calls clear_flush_young(), mark it as young, and unmap it from
> >> the EPT pagetable.
> >
> > You mean "mark it as old"?
>
> I meant 'return young, and drop it from the EPT pagetable'.
>
> If we use the present bit as a replacement for the accessed bit, present
> means young, and clear_flush_young means "if present, return young and
> unmap, otherwise return old'.
This is the only way to provide accurate information, and it's still a
minor fault, so not very different from returning young the first time
around and old the second time around without invalidating the spte...
but the reason I like it more is that it is done at the right time,
like for the ptes, so it's probably best to implement it this way to
ensure total fairness of the VM regardless of whether it's the guest
or qemu-kvm touching the virtual memory.
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-05 15:58 ` Andrea Arcangeli
@ 2009-08-05 17:20 ` Rik van Riel
-1 siblings, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2009-08-05 17:20 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Wu Fengguang, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
Avi Kivity, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
Andrea Arcangeli wrote:
> On Wed, Aug 05, 2009 at 10:40:58AM +0800, Wu Fengguang wrote:
>> */
>> - if ((vm_flags & VM_EXEC) && !PageAnon(page)) {
>> + if ((vm_flags & VM_EXEC) || PageAnon(page)) {
>> list_add(&page->lru, &l_active);
>> continue;
>> }
>>
>
> Please nuke the whole check and do an unconditional list_add;
> continue; there.
That would reinstate the bug where the VM has no pages
available for eviction. There are very good reasons
why only referenced VM_EXEC file pages get moved to
the back of the active list, and nothing else.
--
All rights reversed.
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-05 16:31 ` Andrea Arcangeli
@ 2009-08-05 17:25 ` Rik van Riel
-1 siblings, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2009-08-05 17:25 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Avi Kivity, Wu Fengguang, Dike, Jeffrey G, Yu, Wilfred, Kleen,
Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
Andrea Arcangeli wrote:
> On Wed, Aug 05, 2009 at 10:15:16AM -0400, Rik van Riel wrote:
>> Not having a hardware accessed bit would explain why
>> the VM is not reactivating the pages that were accessed
>> while on the inactive list.
>
> Problem is, even with young bit functional the VM isn't reactivating
> those pages anyway because of that broken check...
That check is only done when active pages are moved to the
inactive list! Inactive pages that were referenced always
get moved to the active list (except for unmapped file pages).
> I think it'll still be better than the current situation, as the
> young bit is always set for ptes. Otherwise EPT pages are too
> penalized; we need them to stay one more round in the active list
> like everything else.
NOTHING ELSE stays on the active anon list for two rounds,
for very good reasons. Please read up on what has changed
in the VM since 2.6.27.
--
All rights reversed.
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-05 15:58 ` Andrea Arcangeli
@ 2009-08-05 17:42 ` Rik van Riel
-1 siblings, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2009-08-05 17:42 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Wu Fengguang, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
Avi Kivity, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
Andrea Arcangeli wrote:
> On Wed, Aug 05, 2009 at 10:40:58AM +0800, Wu Fengguang wrote:
>> */
>> - if ((vm_flags & VM_EXEC) && !PageAnon(page)) {
>> + if ((vm_flags & VM_EXEC) || PageAnon(page)) {
>> list_add(&page->lru, &l_active);
>> continue;
>> }
>>
>
> Please nuke the whole check and do an unconditional list_add;
> continue; there.
<riel> aa: so you're saying we should _never_ add pages to the active
list at this point in the code
<aa> right
<riel> aa: and remove the list_add and continue completely
<aa> yes
<riel> aa: your email says the opposite :)
--
All rights reversed.
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-05 2:40 ` Wu Fengguang
@ 2009-08-05 17:53 ` Rik van Riel
-1 siblings, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2009-08-05 17:53 UTC (permalink / raw)
To: Wu Fengguang
Cc: Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Andrea Arcangeli,
Avi Kivity, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
Wu Fengguang wrote:
> The refaults can be drastically reduced by the following patch, which
> respects the referenced bit of all anonymous pages (including the KVM
> pages).
The big question is, which referenced bit?
All anonymous pages get the referenced bit set when they are
initially created. Acting on that bit is pretty useless, since
it does not add any information at all.
> However it risks reintroducing the problem addressed by commit 7e9cd4842
> (fix reclaim scalability problem by ignoring the referenced bit,
> mainly the pte young bit). I wonder if there are better solutions?
Reintroducing that problem is disastrous for large systems
running e.g. JVMs or certain scientific computing workloads.
When you have a 256GB system that is low on memory, you need
to be able to find a page to swap out soon. If all 64 million
pages in your system are "recently referenced", you run into
BIG trouble.
I do not believe we can afford to reintroduce that problem.
Also, the inactive list (where references to anonymous pages
_do_ count) is pretty big. Is it not big enough in Jeff's
test case?
Jeff, what kind of workloads are you running in the guests?
--
All rights reversed.
^ permalink raw reply [flat|nested] 243+ messages in thread
* RE: [RFC] respect the referenced bit of KVM guest pages?
2009-08-05 17:53 ` Rik van Riel
@ 2009-08-05 19:00 ` Dike, Jeffrey G
-1 siblings, 0 replies; 243+ messages in thread
From: Dike, Jeffrey G @ 2009-08-05 19:00 UTC (permalink / raw)
To: Rik van Riel, Wu, Fengguang
Cc: Yu, Wilfred, Kleen, Andi, Andrea Arcangeli, Avi Kivity,
Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro,
Mel Gorman, LKML, linux-mm
> Also, the inactive list (where references to anonymous pages
> _do_ count) is pretty big. Is it not big enough in Jeff's
> test case?
> Jeff, what kind of workloads are you running in the guests?
I'm looking at KVM on small systems. My "small system" is a 128M memory compartment on a 4G server.
The workload is boot up the instance, start Firefox and another app (whatever editor comes by default with Moblin), close them, and shut down the instance.
Jeff
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-05 19:00 ` Dike, Jeffrey G
@ 2009-08-05 19:07 ` Rik van Riel
-1 siblings, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2009-08-05 19:07 UTC (permalink / raw)
To: Dike, Jeffrey G
Cc: Wu, Fengguang, Yu, Wilfred, Kleen, Andi, Andrea Arcangeli,
Avi Kivity, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
Dike, Jeffrey G wrote:
>> Also, the inactive list (where references to anonymous pages
>> _do_ count) is pretty big. Is it not big enough in Jeff's
>> test case?
>
>> Jeff, what kind of workloads are you running in the guests?
>
> I'm looking at KVM on small systems. My "small system" is a 128M memory compartment on a 4G server.
How did you create that 128M memory compartment?
Did you use cgroups on the host system?
> The workload is boot up the instance, start Firefox and another app (whatever editor comes by default with Moblin), close them, and shut down the instance.
How much memory do you give your virtual machine?
That is, how much memory does it think it has?
--
All rights reversed.
^ permalink raw reply [flat|nested] 243+ messages in thread
* RE: [RFC] respect the referenced bit of KVM guest pages?
2009-08-05 19:07 ` Rik van Riel
@ 2009-08-05 19:18 ` Dike, Jeffrey G
-1 siblings, 0 replies; 243+ messages in thread
From: Dike, Jeffrey G @ 2009-08-05 19:18 UTC (permalink / raw)
To: Rik van Riel
Cc: Wu, Fengguang, Yu, Wilfred, Kleen, Andi, Andrea Arcangeli,
Avi Kivity, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
> How did you create that 128M memory compartment?
>
> Did you use cgroups on the host system?
Yup.
> How much memory do you give your virtual machine?
>
> That is, how much memory does it think it has?
256M.
Jeff
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-05 19:18 ` Dike, Jeffrey G
@ 2009-08-06 9:22 ` Avi Kivity
-1 siblings, 0 replies; 243+ messages in thread
From: Avi Kivity @ 2009-08-06 9:22 UTC (permalink / raw)
To: Dike, Jeffrey G
Cc: Rik van Riel, Wu, Fengguang, Yu, Wilfred, Kleen, Andi,
Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
On 08/05/2009 10:18 PM, Dike, Jeffrey G wrote:
>> How did you create that 128M memory compartment?
>>
>> Did you use cgroups on the host system?
>>
>
> Yup.
>
>
>> How much memory do you give your virtual machine?
>>
>> That is, how much memory does it think it has?
>>
>
> 256M.
>
So you're effectively running a 256M guest on a 128M host?
Do cgroups have private active/inactive lists?
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-06 9:22 ` Avi Kivity
@ 2009-08-06 9:25 ` Wu Fengguang
-1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-06 9:25 UTC (permalink / raw)
To: Avi Kivity
Cc: Dike, Jeffrey G, Rik van Riel, Yu, Wilfred, Kleen, Andi,
Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
On Thu, Aug 06, 2009 at 05:22:23PM +0800, Avi Kivity wrote:
> On 08/05/2009 10:18 PM, Dike, Jeffrey G wrote:
> >> How did you create that 128M memory compartment?
> >>
> >> Did you use cgroups on the host system?
> >>
> >
> > Yup.
> >
> >
> >> How much memory do you give your virtual machine?
> >>
> >> That is, how much memory does it think it has?
> >>
> >
> > 256M.
> >
>
> So you're effectively running a 256M guest on a 128M host?
>
> Do cgroups have private active/inactive lists?
Yes, and they reuse the same page reclaim routines as the
global LRU lists.
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-06 9:35 ` Avi Kivity
@ 2009-08-06 9:35 ` Wu Fengguang
-1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-06 9:35 UTC (permalink / raw)
To: Avi Kivity
Cc: Dike, Jeffrey G, Rik van Riel, Yu, Wilfred, Kleen, Andi,
Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
On Thu, Aug 06, 2009 at 05:35:59PM +0800, Avi Kivity wrote:
> On 08/06/2009 12:25 PM, Wu Fengguang wrote:
> >> So you're effectively running a 256M guest on a 128M host?
> >>
> >> Do cgroups have private active/inactive lists?
> >>
> >
> > Yes, and they reuse the same page reclaim routines with the global
> > LRU lists.
> >
>
> Then this looks like a bug in the shadow accessed bit handling.
Yes. One question is: why do only stack pages hurt if it is a
general page reclaim problem?
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-06 9:25 ` Wu Fengguang
@ 2009-08-06 9:35 ` Avi Kivity
-1 siblings, 0 replies; 243+ messages in thread
From: Avi Kivity @ 2009-08-06 9:35 UTC (permalink / raw)
To: Wu Fengguang
Cc: Dike, Jeffrey G, Rik van Riel, Yu, Wilfred, Kleen, Andi,
Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
On 08/06/2009 12:25 PM, Wu Fengguang wrote:
>> So you're effectively running a 256M guest on a 128M host?
>>
>> Do cgroups have private active/inactive lists?
>>
>
> Yes, and they reuse the same page reclaim routines with the global
> LRU lists.
>
Then this looks like a bug in the shadow accessed bit handling.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-06 9:59 ` Avi Kivity
@ 2009-08-06 9:59 ` Wu Fengguang
-1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-06 9:59 UTC (permalink / raw)
To: Avi Kivity
Cc: Dike, Jeffrey G, Rik van Riel, Yu, Wilfred, Kleen, Andi,
Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
On Thu, Aug 06, 2009 at 05:59:53PM +0800, Avi Kivity wrote:
> On 08/06/2009 12:35 PM, Wu Fengguang wrote:
> > On Thu, Aug 06, 2009 at 05:35:59PM +0800, Avi Kivity wrote:
> >
> >> On 08/06/2009 12:25 PM, Wu Fengguang wrote:
> >>
> >>>> So you're effectively running a 256M guest on a 128M host?
> >>>>
> >>>> Do cgroups have private active/inactive lists?
> >>>>
> >>>>
> >>> Yes, and they reuse the same page reclaim routines with the global
> >>> LRU lists.
> >>>
> >>>
> >> Then this looks like a bug in the shadow accessed bit handling.
> >>
> >
> > Yes. One question is: why only stack pages hurts if it is a
> > general page reclaim problem?
> >
>
> Do we know for a fact that only stack pages suffer, or is it what has
> been noticed?
It should be the first case: as Jeff said, "These pages are
nearly all stack pages."
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-06 9:35 ` Wu Fengguang
@ 2009-08-06 9:59 ` Avi Kivity
-1 siblings, 0 replies; 243+ messages in thread
From: Avi Kivity @ 2009-08-06 9:59 UTC (permalink / raw)
To: Wu Fengguang
Cc: Dike, Jeffrey G, Rik van Riel, Yu, Wilfred, Kleen, Andi,
Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
On 08/06/2009 12:35 PM, Wu Fengguang wrote:
> On Thu, Aug 06, 2009 at 05:35:59PM +0800, Avi Kivity wrote:
>
>> On 08/06/2009 12:25 PM, Wu Fengguang wrote:
>>
>>>> So you're effectively running a 256M guest on a 128M host?
>>>>
>>>> Do cgroups have private active/inactive lists?
>>>>
>>>>
>>> Yes, and they reuse the same page reclaim routines with the global
>>> LRU lists.
>>>
>>>
>> Then this looks like a bug in the shadow accessed bit handling.
>>
>
> Yes. One question is: why only stack pages hurts if it is a
> general page reclaim problem?
>
Do we know for a fact that only stack pages suffer, or is it what has
been noticed?
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-05 15:58 ` Andrea Arcangeli
@ 2009-08-06 10:08 ` Andrea Arcangeli
-1 siblings, 0 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2009-08-06 10:08 UTC (permalink / raw)
To: Wu Fengguang
Cc: Rik van Riel, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
Avi Kivity, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
On Wed, Aug 05, 2009 at 05:58:05PM +0200, Andrea Arcangeli wrote:
> On Wed, Aug 05, 2009 at 10:40:58AM +0800, Wu Fengguang wrote:
> > */
> > - if ((vm_flags & VM_EXEC) && !PageAnon(page)) {
> > + if ((vm_flags & VM_EXEC) || PageAnon(page)) {
> > list_add(&page->lru, &l_active);
> > continue;
> > }
> >
>
> Please nuke the whole check and do an unconditional list_add;
> continue; there.
After some conversation it seems that reactivating on large systems
causes trouble for the VM, as young bits have excessive time to be
set again, making it hard to shrink the active list. I see that, so
the check should still be nuked, but the unconditional
deactivation should happen instead. Otherwise it's trivial to bring the
VM to its knees and DoS it with a simple mmap of a file with MAP_EXEC
passed to mmap. My whole point is that the decision to activate or
deactivate pages can't be a function of VM_EXEC. Clearly it
helps on desktops, but that is probably a signal that the VM isn't
good enough by itself to identify the important working set using
young bits on desktop systems, and if there's a good reason
not to activate, we shouldn't activate VM_EXEC pages either, as anything
and anybody can generate a file mapping with VM_EXEC set...
Likely we need a cut-off point: if we detect it takes more than X
seconds to scan the whole active list, we start ignoring young bits,
as young bits don't provide any meaningful information then and they
just hang the VM, preventing it from shrinking the active list and looping
over it endlessly with millions of pages inside that list. But on small
systems, if the inactive list is short, it may be too quick to just clear
the young bit and only give it time to be set again while on the inactive
list. That may be the source of the problem. Actually I'm speculating
here, because I barely understood that this is swapin... not sure
exactly what this regression is about, but testing the posted patch is a
good idea and it will tell us if we just need to dynamically
differentiate the algorithm between large and small systems and start
ignoring young bits only at some point.
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-06 9:59 ` Wu Fengguang
@ 2009-08-06 10:14 ` Avi Kivity
-1 siblings, 0 replies; 243+ messages in thread
From: Avi Kivity @ 2009-08-06 10:14 UTC (permalink / raw)
To: Wu Fengguang
Cc: Dike, Jeffrey G, Rik van Riel, Yu, Wilfred, Kleen, Andi,
Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
On 08/06/2009 12:59 PM, Wu Fengguang wrote:
>> Do we know for a fact that only stack pages suffer, or is it what has
>> been noticed?
>>
>
> It should be the first case: "These pages are nearly all stack pages,"
> as Jeff said.
>
Ok. I can't explain it. There's no special treatment for guest stack
pages. The accessed bit should be maintained for them exactly like all
other pages.
Are they kernel-mode stack pages, or user-mode stack pages (the
difference being that kernel-mode stack pages are accessed through large
ptes, whereas user-mode stack pages are accessed through normal ptes)?
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-05 17:42 ` Rik van Riel
@ 2009-08-06 10:15 ` Andrea Arcangeli
-1 siblings, 0 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2009-08-06 10:15 UTC (permalink / raw)
To: Rik van Riel
Cc: Wu Fengguang, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
Avi Kivity, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
On Wed, Aug 05, 2009 at 01:42:30PM -0400, Rik van Riel wrote:
> Andrea Arcangeli wrote:
> > On Wed, Aug 05, 2009 at 10:40:58AM +0800, Wu Fengguang wrote:
> >> */
> >> - if ((vm_flags & VM_EXEC) && !PageAnon(page)) {
> >> + if ((vm_flags & VM_EXEC) || PageAnon(page)) {
> >> list_add(&page->lru, &l_active);
> >> continue;
> >> }
> >>
> >
> > Please nuke the whole check and do an unconditional list_add;
> > continue; there.
>
> <riel> aa: so you're saying we should _never_ add pages to the active
> list at this point in the code
> <aa> right
> <riel> aa: and remove the list_add and continue completely
> <aa> yes
> <riel> aa: your email says the opposite :)
Posted a more meaningful explanation in a self-reply to the email that
said the opposite, which tries to explain why I changed my mind (well,
my focus really was on VM_EXEC and I haven't changed my mind about that
yet, but I'm flexible, so I'm listening if somebody thinks it's a good
thing to keep it). The irc quote was greatly out of context and it
missed all the previous conversation... I hope my mail explains my
point in more detail than the above.
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-06 10:08 ` Andrea Arcangeli
@ 2009-08-06 10:18 ` Avi Kivity
-1 siblings, 0 replies; 243+ messages in thread
From: Avi Kivity @ 2009-08-06 10:18 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Wu Fengguang, Rik van Riel, Dike, Jeffrey G, Yu, Wilfred, Kleen,
Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
On 08/06/2009 01:08 PM, Andrea Arcangeli wrote:
> After some conversation it seems reactivating on large systems
> generates troubles to the VM as young bits have excessive time to be
> reactivated, giving troubles to shrink active list. I see that, so
> then the check should be still nuked, but the unconditional
> deactivation should happen instead. Otherwise it's trivial to put the
> VM to its knees and DoS it with a simple mmap of a file with MAP_EXEC
> as parameter of mmap. My whole point is that deciding if activating or
> deactivating pages can't be in function of VM_EXEC, and clearly it
> helps on desktops but then it probably is a signal that the VM isn't
> good enough by itself to identify the important working set using
> young bits and stuff on desktop systems, and if there's a good reason
> to not activate, we shouldn't activate the VM_EXEC either as anything
> and anybody can generate a file mapping with VM_EXEC set...
>
Reasonable; if you depend on a hint from userspace, that hint can be
used against you.
> Likely we need a cut-off point, if we detect it takes more than X
> seconds to scan the whole active list, we start ignoring young bits,
> as young bits don't provide any meaningful information then and they
> just hang the VM in preventing it to shrink active list and looping
> over it endlessly with millions of pages inside that list. But on small
> systems if inactive list is short it may be too quick to just clear
> the young bit and only giving it time to be re-enabled in inactive
> list. That may be the source of the problem. Actually I'm speculating
> here, because I barely understood that this is swapin... not sure
> exactly what this regression is about but testing the patch posted is
> good idea and it will tell us if we just need to dynamically
> differentiate the algorithm between large and small systems and start
> ignoring young bits only at some point.
>
How about, for every N pages that you scan, evict at least 1 page,
regardless of young bit status? That limits overscanning to an N:1
ratio. With N=250 we'll spend at most 25 usec in order to locate one
page to evict.
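The N:1 bound proposed above can be sketched in isolation. This is a hypothetical illustration, not kernel code; `SCAN_RATIO`, `struct scan_state`, and `must_evict_anyway` are invented names for the sketch:

```c
#include <assert.h>

#define SCAN_RATIO 250 /* N: active pages scanned per forced eviction */

struct scan_state {
    unsigned long scanned_since_evict; /* pages scanned without an eviction */
};

/* Returns nonzero when the current page should be deactivated even
 * though its referenced/young bit is set, guaranteeing at least one
 * eviction candidate per SCAN_RATIO pages scanned. */
int must_evict_anyway(struct scan_state *s)
{
    if (++s->scanned_since_evict >= SCAN_RATIO) {
        s->scanned_since_evict = 0;
        return 1; /* ignore the young bit for this page */
    }
    return 0;
}
```

With N=250, a pass over 1000 fully referenced pages still yields four eviction candidates, which is the bounded-overscan behavior being proposed.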
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-06 10:18 ` Avi Kivity
@ 2009-08-06 10:20 ` Andrea Arcangeli
-1 siblings, 0 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2009-08-06 10:20 UTC (permalink / raw)
To: Avi Kivity
Cc: Wu Fengguang, Rik van Riel, Dike, Jeffrey G, Yu, Wilfred, Kleen,
Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
On Thu, Aug 06, 2009 at 01:18:47PM +0300, Avi Kivity wrote:
> Reasonable; if you depend on a hint from userspace, that hint can be
> used against you.
Correct, that is my whole point. Also we never know if applications
are mmapping huge files with MAP_EXEC just because they might need to
trampoline once in a while, or do some little JIT thing once in a
while. Sometimes people open files with O_RDWR even if they only need
O_RDONLY. It's not a bug, but radically altering VM behavior because
of a bitflag doesn't sound good to me.
I certainly see this tends to help, as it will reactivate all
.text. But it signals that current VM behavior is not ok for small
systems IMHO if such a hack is required. I prefer a dynamic algorithm
that, when the active list grows too much, stops reactivating pages and
reduces the window for young bit activation to only the time the page
sits on the inactive list. And if the active list is small (like a 128M
system) we fully trust the young bit, and if it is set we don't allow
the page to go onto the inactive list, as it's quick enough to scan the
whole active list, and the young bit is meaningful there.
The issue I can see is with a huge system and millions of pages in the
active list: by the time we scan it all, too much time has passed and we
don't get any meaningful information out of the young bit. Things are
radically different on regular workstations, and frankly regular
workstations are very important too, as I suspect there are more users
running on <64G systems than on >64G systems.
> How about, for every N pages that you scan, evict at least 1 page,
> regardless of young bit status? That limits overscanning to a N:1
> ratio. With N=250 we'll spend at most 25 usec in order to locate one
> page to evict.
Yes, exactly, something like that would be dynamic, and then we can
drop the VM_EXEC check and solve the issues on large systems without
almost totally ignoring the young bit on small systems.
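The size-dependent policy described here could look roughly like the following sketch. Nothing in it is existing kernel code; the threshold value and all names (`ACTIVE_TRUST_LIMIT`, `should_reactivate`) are assumptions made for illustration:

```c
#include <assert.h>

/* Assumed cut-off, in pages: below this the whole active list can be
 * rescanned quickly, so young bits are still fresh and meaningful. */
#define ACTIVE_TRUST_LIMIT (32UL * 1024)

/* Returns 1 when a referenced page should be reactivated. */
int should_reactivate(unsigned long nr_active_pages, int referenced)
{
    if (!referenced)
        return 0;
    /* Small active list: trust the young bit and keep the page active. */
    if (nr_active_pages <= ACTIVE_TRUST_LIMIT)
        return 1;
    /* Huge active list: by the time it is rescanned the young bit no
     * longer carries useful information; deactivate and let the page
     * prove itself on the inactive list instead. */
    return 0;
}
```

The point of the sketch is only the shape of the decision: the same referenced page is kept active on a small system and deactivated on a large one.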
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-06 10:20 ` Andrea Arcangeli
@ 2009-08-06 10:59 ` Wu Fengguang
-1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-06 10:59 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Avi Kivity, Rik van Riel, Dike, Jeffrey G, Yu, Wilfred, Kleen,
Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
On Thu, Aug 06, 2009 at 06:20:57PM +0800, Andrea Arcangeli wrote:
> On Thu, Aug 06, 2009 at 01:18:47PM +0300, Avi Kivity wrote:
> > Reasonable; if you depend on a hint from userspace, that hint can be
> > used against you.
>
> Correct, that is my whole point. Also we never know if applications
> are mmapping huge files with MAP_EXEC just because they might need to
> trampoline once in a while, or do some little JIT thing once in a
> while. Sometimes people open files with O_RDWR even if they only need
> O_RDONLY. It's not a bug, but radically altering VM behavior because
> of a bitflag doesn't sound good to me.
>
> I certainly see this tends to help as it will reactivate all
> .text. But this signals current VM behavior is not ok for small
> systems IMHO if such an hack is required. I prefer a dynamic algorithm
> that when active list grow too much stop reactivating pages and
> reduces the time for young bit activation only to the time the page
> sits on the inactive list. And if active list is small (like 128M
> system) we fully trust young bit and if it set, we don't allow it to
> go in inactive list as it's quick enough to scan the whole active
> list, and young bit is meaningful there.
>
> The issue I can see is with huge system and million pages in active
> list: by the time we scan it all, too much time has passed and we don't
> get any meaningful information out of young bit. Things are radically
> different on all regular workstations, and frankly regular
> workstations are very important too, as I suspect there are more users
> running on <64G systems than on >64G systems.
>
> > How about, for every N pages that you scan, evict at least 1 page,
> > regardless of young bit status? That limits overscanning to an N:1
> > ratio. With N=250 we'll spend at most 25 usec in order to locate one
> > page to evict.
>
> Yes exactly, something like that I think will be dynamic, and then we
> can drop VM_EXEC check and solve the issues on large systems while
> still not almost totally ignoring young bit on small systems.
This is a quick hack to materialize the idea. It remembers roughly
the last 32*SWAP_CLUSTER_MAX=1024 active (mapped) pages scanned,
and if _all of them_ are referenced, then the referenced bit is
probably meaningless and should not be taken seriously.
As a refinement, the static variable 'recent_all_referenced' could be
moved to struct zone or made a per-cpu variable.
Thanks,
Fengguang
---
mm/vmscan.c | 29 ++++++++++++++++-------------
1 file changed, 16 insertions(+), 13 deletions(-)
--- linux.orig/mm/vmscan.c 2009-08-06 18:31:20.000000000 +0800
+++ linux/mm/vmscan.c 2009-08-06 18:51:58.000000000 +0800
@@ -1239,6 +1239,10 static void move_active_pages_to_lru(str
static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
struct scan_control *sc, int priority, int file)
{
+ static unsigned int recent_all_referenced;
+ int all_referenced = 1;
+ int referenced_bit_ok;
+ int referenced;
unsigned long pgmoved;
unsigned long pgscanned;
unsigned long vm_flags;
@@ -1267,6 +1270,8 @@ static void shrink_active_list(unsigned
__mod_zone_page_state(zone, NR_ACTIVE_FILE, -pgmoved);
else
__mod_zone_page_state(zone, NR_ACTIVE_ANON, -pgmoved);
+
+ referenced_bit_ok = !recent_all_referenced;
spin_unlock_irq(&zone->lru_lock);
pgmoved = 0; /* count referenced (mapping) mapped pages */
@@ -1281,19 +1286,15 @@ static void shrink_active_list(unsigned
}
/* page_referenced clears PageReferenced */
- if (page_mapping_inuse(page) &&
- page_referenced(page, 0, sc->mem_cgroup, &vm_flags)) {
- pgmoved++;
- /*
- * Identify referenced, file-backed active pages and
- * give them one more trip around the active list. So
- * that executable code get better chances to stay in
- * memory under moderate memory pressure.
- *
- * Also protect anon pages: swapping could be costly,
- * and KVM guest's referenced bit is helpful.
- */
- if ((vm_flags & VM_EXEC) || PageAnon(page)) {
+ if (page_mapping_inuse(page)) {
+ referenced = page_referenced(page, 0, sc->mem_cgroup,
+ &vm_flags);
+ if (referenced)
+ pgmoved++;
+ else
+ all_referenced = 0;
+
+ if (referenced && referenced_bit_ok) {
list_add(&page->lru, &l_active);
continue;
}
@@ -1319,6 +1320,7 @@ static void shrink_active_list(unsigned
move_active_pages_to_lru(zone, &l_inactive,
LRU_BASE + file * LRU_FILE);
+ recent_all_referenced = (recent_all_referenced << 1) | all_referenced;
spin_unlock_irq(&zone->lru_lock);
}
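Read in isolation, the history tracking in the patch behaves like a 32-bit shift register: each pass over the active list shifts in one bit recording whether every mapped page scanned was referenced, and young bits are trusted only while none of the last 32 passes came back fully referenced. A standalone sketch of just that mechanism (function names are illustrative, mirroring the patch's variables):

```c
#include <assert.h>
#include <stdint.h>

/* One bit per recent shrink_active_list() pass; bit set means that
 * pass found *every* mapped page referenced. */
static uint32_t recent_all_referenced;

/* Record the outcome of one active-list pass. */
void record_pass(int all_referenced)
{
    recent_all_referenced = (recent_all_referenced << 1) | (all_referenced & 1);
}

/* The referenced bit carries information only if at least one of the
 * last 32 passes saw some unreferenced pages; if every recent pass was
 * fully referenced, the bit is treated as noise and ignored. */
int referenced_bit_ok(void)
{
    return recent_all_referenced == 0;
}
```

One "all referenced" pass therefore suppresses reactivation for the next 32 passes, after which the history bit falls off the end of the register and young bits are trusted again.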
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-06 10:59 ` Wu Fengguang
@ 2009-08-06 11:44 ` Avi Kivity
-1 siblings, 0 replies; 243+ messages in thread
From: Avi Kivity @ 2009-08-06 11:44 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrea Arcangeli, Rik van Riel, Dike, Jeffrey G, Yu, Wilfred,
Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
On 08/06/2009 01:59 PM, Wu Fengguang wrote:
>
> This is a quick hack to materialize the idea. It remembers roughly
> the last 32*SWAP_CLUSTER_MAX=1024 active (mapped) pages scanned,
> and if _all of them_ are referenced, then the referenced bit is
> probably meaningless and should not be taken seriously.
>
>
I don't think we should ignore the referenced bit. There could still be
a large batch of unreferenced pages later on that we should
preferentially swap. If we swap at least 1 page for every 250 scanned,
after 4K swaps we will have traversed 1M pages, enough to find them.
> As a refinement, the static variable 'recent_all_referenced' could be
> moved to struct zone or made a per-cpu variable.
>
>
Definitely this should be made part of the zone structure, consider the
original report where the problem occurs in a 128MB zone (where we can
expect many pages to have their referenced bit set).
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-06 11:44 ` Avi Kivity
@ 2009-08-06 13:06 ` Wu Fengguang
-1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-06 13:06 UTC (permalink / raw)
To: Avi Kivity
Cc: Andrea Arcangeli, Rik van Riel, Dike, Jeffrey G, Yu, Wilfred,
Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
On Thu, Aug 06, 2009 at 07:44:01PM +0800, Avi Kivity wrote:
> On 08/06/2009 01:59 PM, Wu Fengguang wrote:
scheme KEEP_MOST:
>> How about, for every N pages that you scan, evict at least 1 page,
>> regardless of young bit status? That limits overscanning to a N:1
>> ratio. With N=250 we'll spend at most 25 usec in order to locate one
>> page to evict.
scheme DROP_CONTINUOUS:
> > This is a quick hack to materialize the idea. It remembers roughly
> > the last 32*SWAP_CLUSTER_MAX=1024 active (mapped) pages scanned,
> > and if _all of them_ are referenced, then the referenced bit is
> > probably meaningless and should not be taken seriously.
> I don't think we should ignore the referenced bit. There could still be
> a large batch of unreferenced pages later on that we should
> preferentially swap. If we swap at least 1 page for every 250 scanned,
> after 4K swaps we will have traversed 1M pages, enough to find them.
I guess both schemes have unacceptable flaws.
For JVM/BIGMEM workload, most pages would be found referenced _all the time_.
So the KEEP_MOST scheme could increase reclaim overheads by N=250 times;
while the DROP_CONTINUOUS scheme is effectively zero cost.
However, the DROP_CONTINUOUS scheme does bring more _indeterminacy_.
It can behave vastly differently with a single active task versus
multiple ones. It is short-sighted and can be cheated by bursty activity.
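The KEEP_MOST scheme discussed above can be sketched as a simple rate limiter: however many referenced pages the scan encounters, force at least one eviction per N pages scanned, so overscanning is bounded at N:1. This is a hedged illustration of the idea, assuming N=250 from Avi's proposal; the helper and its counter argument are hypothetical, not kernel API.

```c
#define KEEP_MOST_N 250

/*
 * Called once per scanned active page.  Returns 1 when this page
 * should be evicted regardless of its young/referenced bit, which
 * caps the scan:evict ratio at KEEP_MOST_N : 1.
 */
int must_evict(unsigned long *scanned_since_evict)
{
    if (++*scanned_since_evict >= KEEP_MOST_N) {
        *scanned_since_evict = 0;
        return 1;
    }
    return 0;
}
```

With this cap, even a workload where every page is referenced (the JVM/BIGMEM case above) makes steady forward progress, at the price of sometimes evicting a genuinely hot page.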
> > As a refinement, the static variable 'recent_all_referenced' could be
> > moved to struct zone or made a per-cpu variable.
>
> Definitely this should be made part of the zone structure, consider the
> original report where the problem occurs in a 128MB zone (where we can
> expect many pages to have their referenced bit set).
Good point. Here the cgroup list is highly stressed, while the global
zones are idling.
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-06 10:08 ` Andrea Arcangeli
@ 2009-08-06 13:08 ` Rik van Riel
-1 siblings, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2009-08-06 13:08 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Wu Fengguang, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
Avi Kivity, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
Andrea Arcangeli wrote:
> Likely we need a cut-off point, if we detect it takes more than X
> seconds to scan the whole active list, we start ignoring young bits,
We could just make this depend on the calculated inactive_ratio,
which depends on the size of the list.
For small systems, it may make sense to make every accessed bit
count, because the working set will often approach the size of
memory.
On very large systems, the working set may also approach the
size of memory, but the inactive list only contains a small
percentage of the pages, so there is enough space for everything.
Say, if the inactive_ratio is 3 or less, make the accessed bit
on the active lists count.
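The kernel of this era derives inactive_ratio from zone size, roughly as int_sqrt(10 * zone size in gigabytes), clamped to at least 1. A hedged sketch of Rik's proposed rule, with illustrative helper names and the threshold of 3 taken from the paragraph above:

```c
/* Simple integer square root, standing in for the kernel's int_sqrt(). */
static unsigned int isqrt(unsigned long x)
{
    unsigned long r = 0;
    while ((r + 1) * (r + 1) <= x)
        r++;
    return (unsigned int)r;
}

/* Approximation of the kernel's per-zone inactive_ratio calculation,
 * assuming 4K pages (so 1 GB = 1 << 18 pages). */
unsigned int inactive_ratio(unsigned long zone_pages)
{
    unsigned long gigabytes = zone_pages >> 18;
    unsigned int ratio = isqrt(10 * gigabytes);
    return ratio ? ratio : 1;
}

/* Rik's rule: honor the accessed bit on the active list only when
 * the inactive list is a large enough fraction of the zone. */
int accessed_bit_counts(unsigned long zone_pages)
{
    return inactive_ratio(zone_pages) <= 3;
}
```

Under these assumptions, a 128 MB or 1 GB zone would keep honoring the accessed bit, while a 16 GB zone (inactive_ratio around 12) would ignore it, matching the small-system/large-system split described above.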
--
All rights reversed.
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-06 10:59 ` Wu Fengguang
@ 2009-08-06 13:11 ` Rik van Riel
-1 siblings, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2009-08-06 13:11 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrea Arcangeli, Avi Kivity, Dike, Jeffrey G, Yu, Wilfred,
Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
Wu Fengguang wrote:
> This is a quick hack to materialize the idea. It remembers roughly
> the last 32*SWAP_CLUSTER_MAX=1024 active (mapped) pages scanned,
> and if _all of them_ are referenced, then the referenced bit is
> probably meaningless and should not be taken seriously.
This has the potential to increase the number of active
pages scanned by almost a factor 1024. Let me whip up an
alternative idea when I get to the office later today.
--
All rights reversed.
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-06 11:44 ` Avi Kivity
@ 2009-08-06 13:13 ` Rik van Riel
-1 siblings, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2009-08-06 13:13 UTC (permalink / raw)
To: Avi Kivity
Cc: Wu Fengguang, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred,
Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
Avi Kivity wrote:
> On 08/06/2009 01:59 PM, Wu Fengguang wrote:
>> As a refinement, the static variable 'recent_all_referenced' could be
>> moved to struct zone or made a per-cpu variable.
>
> Definitely this should be made part of the zone structure, consider the
> original report where the problem occurs in a 128MB zone (where we can
> expect many pages to have their referenced bit set).
The problem did not occur in a 128MB zone, but in a 128MB cgroup.
Putting it in the zone means that the cgroup, which may have
different behaviour from the rest of the zone, due to excessive
memory pressure inside the cgroup, does not get the right
statistics.
--
All rights reversed.
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-06 13:06 ` Wu Fengguang
@ 2009-08-06 13:16 ` Rik van Riel
-1 siblings, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2009-08-06 13:16 UTC (permalink / raw)
To: Wu Fengguang
Cc: Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred,
Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
Wu Fengguang wrote:
> I guess both schemes have unacceptable flaws.
>
> For JVM/BIGMEM workload, most pages would be found referenced _all the time_.
> So the KEEP_MOST scheme could increase reclaim overheads by N=250 times;
> while the DROP_CONTINUOUS scheme is effectively zero cost.
The higher overhead may not be an issue on smaller systems,
or inside smaller cgroups inside large systems, when doing
cgroup reclaim.
> However, the DROP_CONTINUOUS scheme does bring more _indeterminacy_.
> It can behave vastly different on single active task and multi ones.
> It is short sighted and can be cheated by bursty activities.
The split LRU VM tries to avoid the bursty page aging as
much as possible, by doing background deactivating of
anonymous pages whenever we reclaim page cache pages and
the number of anonymous pages in the zone (or cgroup) is
low.
--
All rights reversed.
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-06 13:06 ` Wu Fengguang
@ 2009-08-06 13:46 ` Avi Kivity
-1 siblings, 0 replies; 243+ messages in thread
From: Avi Kivity @ 2009-08-06 13:46 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrea Arcangeli, Rik van Riel, Dike, Jeffrey G, Yu, Wilfred,
Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
On 08/06/2009 04:06 PM, Wu Fengguang wrote:
> On Thu, Aug 06, 2009 at 07:44:01PM +0800, Avi Kivity wrote:
>
>> On 08/06/2009 01:59 PM, Wu Fengguang wrote:
>>
>
> scheme KEEP_MOST:
>
>
>>> How about, for every N pages that you scan, evict at least 1 page,
>>> regardless of young bit status? That limits overscanning to a N:1
>>> ratio. With N=250 we'll spend at most 25 usec in order to locate one
>>> page to evict.
>>>
>
> scheme DROP_CONTINUOUS:
>
>
>>> This is a quick hack to materialize the idea. It remembers roughly
>>> the last 32*SWAP_CLUSTER_MAX=1024 active (mapped) pages scanned,
>>> and if _all of them_ are referenced, then the referenced bit is
>>> probably meaningless and should not be taken seriously.
>>>
>
>
Or one scheme, with N=parameter.
>> I don't think we should ignore the referenced bit. There could still be
>> a large batch of unreferenced pages later on that we should
>> preferentially swap. If we swap at least 1 page for every 250 scanned,
>> after 4K swaps we will have traversed 1M pages, enough to find them.
>>
>
> I guess both schemes have unacceptable flaws.
>
> For JVM/BIGMEM workload, most pages would be found referenced _all the time_.
> So the KEEP_MOST scheme could increase reclaim overheads by N=250 times;
> while the DROP_CONTINUOUS scheme is effectively zero cost.
>
Maybe 250 is an exaggeration. But even with 250, the cost is still
pretty low compared to the cpu cost of evicting a page (with IPIs and
tlb flushes).
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-06 13:13 ` Rik van Riel
@ 2009-08-06 13:49 ` Avi Kivity
-1 siblings, 0 replies; 243+ messages in thread
From: Avi Kivity @ 2009-08-06 13:49 UTC (permalink / raw)
To: Rik van Riel
Cc: Wu Fengguang, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred,
Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
On 08/06/2009 04:13 PM, Rik van Riel wrote:
> Avi Kivity wrote:
>> On 08/06/2009 01:59 PM, Wu Fengguang wrote:
>
>>> As a refinement, the static variable 'recent_all_referenced' could be
>>> moved to struct zone or made a per-cpu variable.
>>
>> Definitely this should be made part of the zone structure, consider
>> the original report where the problem occurs in a 128MB zone (where
>> we can expect many pages to have their referenced bit set).
>
> The problem did not occur in a 128MB zone, but in a 128MB cgroup.
>
> Putting it in the zone means that the cgroup, which may have
> different behaviour from the rest of the zone, due to excessive
> memory pressure inside the cgroup, does not get the right
> statistics.
>
Well, it should be per inactive list, whether it's a zone or a cgroup.
What's the name of this thing? ("inactive list"?)
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-06 13:06 ` Wu Fengguang
@ 2009-08-06 21:09 ` Jeff Dike
-1 siblings, 0 replies; 243+ messages in thread
From: Jeff Dike @ 2009-08-06 21:09 UTC (permalink / raw)
To: Wu Fengguang
Cc: Avi Kivity, Andrea Arcangeli, Rik van Riel, Yu, Wilfred, Kleen,
Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
Side question -
Is there a good reason for this to be in shrink_active_list()
as opposed to __isolate_lru_page?
if (unlikely(!page_evictable(page, NULL))) {
putback_lru_page(page);
continue;
}
Maybe we want to minimize the amount of code under the lru lock or
avoid duplicate logic in the isolate_page functions.
But if there are important mlock-heavy workloads, this could make the
scan come up empty, or at least emptier than we might like.
Jeff
--
Work email - jdike at linux dot intel dot com
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-06 10:14 ` Avi Kivity
@ 2009-08-07 1:25 ` KAMEZAWA Hiroyuki
-1 siblings, 0 replies; 243+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-07 1:25 UTC (permalink / raw)
To: Avi Kivity
Cc: Wu Fengguang, Dike, Jeffrey G, Rik van Riel, Yu, Wilfred, Kleen,
Andi, Andrea Arcangeli, Hugh Dickins, Andrew Morton,
Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
On Thu, 06 Aug 2009 13:14:09 +0300
Avi Kivity <avi@redhat.com> wrote:
> On 08/06/2009 12:59 PM, Wu Fengguang wrote:
> >> Do we know for a fact that only stack pages suffer, or is it what has
> >> been noticed?
> >>
> >
> > It shall be the first case: "These pages are nearly all stack pages.",
> > Jeff said.
> >
>
> Ok. I can't explain it. There's no special treatment for guest stack
> pages. The accessed bit should be maintained for them exactly like all
> other pages.
>
> Are they kernel-mode stack pages, or user-mode stack pages (the
> difference being that kernel mode stack pages are accessed through large
> ptes, whereas user mode stack pages are accessed through normal ptes).
>
Hmm, finally, a memcg problem?
Just as an experiment, how does the following work?
- memory.limit_in_bytes = 128MB
- memory.memsw.limit_in_bytes = 160MB
With this, once memory+swap usage hits 160MB, there is no more swapping.
But please take care of OOM.
Thanks,
-Kame
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-06 13:13 ` Rik van Riel
@ 2009-08-07 3:11 ` KOSAKI Motohiro
-1 siblings, 0 replies; 243+ messages in thread
From: KOSAKI Motohiro @ 2009-08-07 3:11 UTC (permalink / raw)
To: Rik van Riel
Cc: kosaki.motohiro, Avi Kivity, Wu Fengguang, Andrea Arcangeli,
Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm,
KAMEZAWA Hiroyuki, Balbir Singh
(cc to memcg folks)
> Avi Kivity wrote:
> > On 08/06/2009 01:59 PM, Wu Fengguang wrote:
>
> >> As a refinement, the static variable 'recent_all_referenced' could be
> >> moved to struct zone or made a per-cpu variable.
> >
> > Definitely this should be made part of the zone structure, consider the
> > original report where the problem occurs in a 128MB zone (where we can
> > expect many pages to have their referenced bit set).
>
> The problem did not occur in a 128MB zone, but in a 128MB cgroup.
>
> Putting it in the zone means that the cgroup, which may have
> different behaviour from the rest of the zone, due to excessive
> memory pressure inside the cgroup, does not get the right
> statistics.
Maybe I haven't caught your point.
The current memcg logic also uses recent_scan/recent_rotate statistics.
Isn't that enough?
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-06 13:08 ` Rik van Riel
@ 2009-08-07 3:17 ` KOSAKI Motohiro
0 siblings, 0 replies; 243+ messages in thread
From: KOSAKI Motohiro @ 2009-08-07 3:17 UTC (permalink / raw)
To: Rik van Riel
Cc: kosaki.motohiro, Andrea Arcangeli, Wu Fengguang, Dike, Jeffrey G,
Yu, Wilfred, Kleen, Andi, Avi Kivity, Hugh Dickins,
Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm
> Andrea Arcangeli wrote:
>
> > Likely we need a cut-off point, if we detect it takes more than X
> > seconds to scan the whole active list, we start ignoring young bits,
>
> We could just make this depend on the calculated inactive_ratio,
> which depends on the size of the list.
>
> For small systems, it may make sense to make every accessed bit
> count, because the working set will often approach the size of
> memory.
>
> On very large systems, the working set may also approach the
> size of memory, but the inactive list only contains a small
> percentage of the pages, so there is enough space for everything.
>
> Say, if the inactive_ratio is 3 or less, make the accessed bit
> on the active lists count.
Sounds reasonable. How do we confirm the correctness of this idea?
Wu, is your X focus-switching benchmark a sufficient test?
^ permalink raw reply [flat|nested] 243+ messages in thread
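Rik's threshold idea above can be sketched as a tiny predicate. This is illustrative only: the helper name and the cut-off of 3 are taken from the discussion, not from an actual kernel patch.

```c
#include <assert.h>

/*
 * Sketch of the heuristic discussed above: honor the referenced bit of
 * active pages only when the zone's inactive_ratio is small (3 or
 * less), i.e. when the inactive list is not already large enough to
 * give the working set a second chance on its own. Illustrative only;
 * not actual shrink_active_list() code.
 */
static int honor_active_referenced_bit(unsigned int inactive_ratio)
{
	return inactive_ratio <= 3;
}
```

In shrink_active_list() such a predicate would gate whether a set referenced bit rotates an anon page back onto the active list instead of deactivating it.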
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-07 3:11 ` KOSAKI Motohiro
@ 2009-08-07 7:54 ` Balbir Singh
0 siblings, 0 replies; 243+ messages in thread
From: Balbir Singh @ 2009-08-07 7:54 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Rik van Riel, Avi Kivity, Wu Fengguang, Andrea Arcangeli, Dike,
Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton,
Christoph Lameter, Mel Gorman, LKML, linux-mm, KAMEZAWA Hiroyuki
On Fri, Aug 7, 2009 at 8:41 AM, KOSAKI
Motohiro<kosaki.motohiro@jp.fujitsu.com> wrote:
> (cc to memcg folks)
>
>> Avi Kivity wrote:
>> > On 08/06/2009 01:59 PM, Wu Fengguang wrote:
>>
>> >> As a refinement, the static variable 'recent_all_referenced' could be
>> >> moved to struct zone or made a per-cpu variable.
>> >
>> > Definitely this should be made part of the zone structure, consider the
>> > original report where the problem occurs in a 128MB zone (where we can
>> > expect many pages to have their referenced bit set).
>>
>> The problem did not occur in a 128MB zone, but in a 128MB cgroup.
>>
>> Putting it in the zone means that the cgroup, which may have
>> different behaviour from the rest of the zone, due to excessive
>> memory pressure inside the cgroup, does not get the right
>> statistics.
>
> Maybe I haven't caught your point.
>
> The current memcg logic also uses the recent_scanned/recent_rotated statistics.
> Isn't that enough?
I don't understand the context; I'll look at the problem when I am
back (I am away from work for the next few days).
Balbir
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-07 7:54 ` Balbir Singh
@ 2009-08-07 8:24 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 243+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-07 8:24 UTC (permalink / raw)
To: Balbir Singh
Cc: KOSAKI Motohiro, Rik van Riel, Avi Kivity, Wu Fengguang,
Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML,
linux-mm
On Fri, 7 Aug 2009 13:24:34 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> On Fri, Aug 7, 2009 at 8:41 AM, KOSAKI
> Motohiro<kosaki.motohiro@jp.fujitsu.com> wrote:
> > The current memcg logic also uses the recent_scanned/recent_rotated statistics.
> > Isn't that enough?
>
> I don't understand the context, I'll look at the problem when I am
> back (I am away from work for the next few days).
>
Brief summary (please point out if this is not correct):
Prepare a memcg with
memory.limit_in_bytes=128MB
Run KVM in it, and use apps whose working set is close to 256MB (hence heavy swapping).
In this case:
- Anon memory is swapped out even while there are file caches.
In particular, a frequently accessed stack page can easily be
swapped out, again and again.
One of the reasons is a recent change:
"a page mapped with VM_EXEC is not paged out even if it has no reference"
Without memcg, a user can use gigabytes of memory, and the above
change is very welcome.
So the current question is "how can we handle this case without bad side effects".
One possibility I wonder about is whether this is simply a configuration mistake:
setting memory.memsw.limit_in_bytes to a proper value may change the behavior.
But that seems like just a workaround.
Can't we find an algorithmic/heuristic way to avoid too much swap-out?
I think memcg can check the # of swap-ins, but right now we have no tag
to detect the "recently swapped-out page is reused" case, or that
there are too many executable file pages.
I wonder whether we can compare
# of paged-out file caches vs. # of swapped-out anon pages,
keeping "# of paged-out file caches < # of swapped-out anon pages" (or use swappiness).
This state can be checked via reclaim_stat (per memcg).
Hmm?
I'm sorry, I'll be on a trip Aug/11-Aug/17 and my response will be delayed.
Thanks,
-Kame
^ permalink raw reply [flat|nested] 243+ messages in thread
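The balance check KAMEZAWA proposes could look roughly like the following. The struct and function names are hypothetical, loosely modeled on the per-memcg reclaim statistics he mentions; this is a sketch of the idea, not kernel code.

```c
#include <assert.h>

/*
 * Hypothetical sketch of the balance proposed above: keep reclaiming
 * file cache while "# of paged-out file caches < # of swapped-out
 * anon pages" holds. The counters stand in for the per-memcg reclaim
 * statistics mentioned in the mail; all names here are illustrative.
 */
struct reclaim_counters {
	unsigned long pageout_file;	/* file pages paged out */
	unsigned long swapout_anon;	/* anon pages swapped out */
};

static int prefer_file_reclaim(const struct reclaim_counters *rc)
{
	return rc->pageout_file < rc->swapout_anon;
}
```

A swappiness-style weighting, as suggested, would replace the plain `<` comparison with a scaled one.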
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-07 3:17 ` KOSAKI Motohiro
(?)
@ 2009-08-12 7:48 ` Wu Fengguang
2009-08-12 14:31 ` Rik van Riel
0 siblings, 1 reply; 243+ messages in thread
From: Wu Fengguang @ 2009-08-12 7:48 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Rik van Riel, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred,
Kleen, Andi, Avi Kivity, Hugh Dickins, Andrew Morton,
Christoph Lameter, Mel Gorman, LKML, linux-mm
[-- Attachment #1: Type: text/plain, Size: 28582 bytes --]
On Fri, Aug 07, 2009 at 11:17:22AM +0800, KOSAKI Motohiro wrote:
> > Andrea Arcangeli wrote:
> >
> > > Likely we need a cut-off point, if we detect it takes more than X
> > > seconds to scan the whole active list, we start ignoring young bits,
> >
> > We could just make this depend on the calculated inactive_ratio,
> > which depends on the size of the list.
> >
> > For small systems, it may make sense to make every accessed bit
> > count, because the working set will often approach the size of
> > memory.
> >
> > On very large systems, the working set may also approach the
> > size of memory, but the inactive list only contains a small
> > percentage of the pages, so there is enough space for everything.
> >
> > Say, if the inactive_ratio is 3 or less, make the accessed bit
> > on the active lists count.
>
> Sounds reasonable.
Yes, this kind of global measurement would be much better.
> How do we confirm the correctness of this idea?
In general the active list tends to grow large on an under-scanned LRU.
I guess Rik is pretty familiar with typical inactive_ratio values on
large-memory systems and may even have some real numbers :)
> Wu, is your X focus-switching benchmark a sufficient test?
It is a major test case for memory-tight desktops. Jeff presents
another interesting one for KVM, hehe.
Anyway, I collected the active/inactive list sizes, and the numbers
show that the inactive_ratio is roughly 1 when the LRU is scanned
actively and may go very high when it is under-scanned.
Thanks,
Fengguang
4GB desktop, kernel 2.6.30
--------------------------
1) fresh startup:
nr_inactive_anon nr_active_anon nr_inactive_file nr_active_file
0 80255 68932 24066
2) read 10GB sparse file:
nr_inactive_anon nr_active_anon nr_inactive_file nr_active_file
48096 52312 830142 10971
3) kvm -m 512M:
nr_inactive_anon nr_active_anon nr_inactive_file nr_active_file
82606 155588 684375 15380
4) exit kvm:
nr_inactive_anon nr_active_anon nr_inactive_file nr_active_file
66364 35275 679033 17009
512MB desktop, kernel 2.6.31
----------------------------
1) fresh startup, console:
nr_inactive_anon nr_active_anon nr_inactive_file nr_active_file
0 1870 7082 2075
2) fresh startx:
nr_inactive_anon nr_active_anon nr_inactive_file nr_active_file
0 30021 31551 6893
3) start many x apps, no swap:
(script attached)
nr_inactive_anon nr_active_anon nr_inactive_file nr_active_file
0 56475 29608 9707
4074 54886 27431 9743
5452 54025 26685 9950
7417 53428 25394 9963
8522 52388 24717 10553
10684 51955 22055 11384
11644 51597 21329 11342
12341 51221 20822 11513
13874 49738 19916 11516
13874 50494 19916 11517
15284 48778 19739 12127
15668 49037 19196 12380
16821 48571 17661 13133
18329 49175 14470 14490
18961 49652 13081 14432
18961 49608 13236 14414
20563 51379 11171 13823
21044 50281 10311 13948
21426 49906 10268 13984
21771 50479 9734 14019
23246 49062 9672 13431
23984 49490 10083 12763
24479 49373 10332 12446
25782 49053 9655 12101
nr_inactive_anon nr_active_anon nr_inactive_file nr_active_file
26970 48078 9891 11415
28041 47873 9617 11079
29485 51183 8445 8293
30484 50140 8441 7997
31841 50578 6904 7413
32579 49873 6937 7804
34117 49336 6447 7440
35380 48300 5816 7471
38055 46486 4778 7546
39528 45227 5043 7417
40777 44681 4148 7325
41902 44468 3967 6534
43107 43378 4630 5846
43418 43538 5019 5698
43563 43514 4839 5514
43660 43587 5228 5431
43645 43315 4919 5886
43618 43555 4531 5704
43751 43646 4584 5600
43839 43703 4507 5569
44015 44057 4757 5378
44115 44089 4707 4724
44331 44184 4710 4701
44577 44554 4221 4265
[...]
nr_inactive_anon nr_active_anon nr_inactive_file nr_active_file
47945 47876 1594 1547
47944 47888 1944 1494
47944 47888 1351 1392
47974 47844 1976 1498
47974 47858 1411 1549
47974 47857 1482 1423
47973 47874 2105 1435
47969 47349 1884 1592
47966 47353 1993 1700
47966 47343 1913 1882
47965 47306 1683 1746
47960 47373 1598 1583
47960 47375 1808 1677
47960 47004 2444 1625
47959 47060 2017 1825
47956 47047 1866 1742
47955 47080 2039 1987
47954 47072 1734 1822
47954 47092 1963 1867
47954 47130 1851 1846
47954 47154 2134 1813
47954 47181 1952 1813
47953 47138 1678 1810
47951 47125 1848 1951
4) start many x apps, with swap:
nr_inactive_anon nr_active_anon nr_inactive_file nr_active_file
0 6571 13646 3251
0 6823 14637 3900
0 7426 17187 3935
0 8188 19989 3959
0 9994 21582 4148
0 12556 21889 5157
0 13846 23764 5249
0 20383 25393 5546
0 21830 26019 5696
0 22856 26608 5972
0 28651 28128 6146
0 28058 28482 6309
0 27726 28595 6312
0 27634 28775 6471
0 27636 28774 6464
0 31299 28848 6834
0 35102 29539 6886
0 39561 29980 6915
0 41573 30008 6917
0 47562 30041 6917
0 54603 30041 6917
3040 55528 29273 6945
16937 44916 23406 7675
16937 44932 23416 7670
nr_inactive_anon nr_active_anon nr_inactive_file nr_active_file
16937 44961 23416 7670
16937 40583 23416 7670
16937 40596 23417 7670
16937 40607 23417 7670
16937 40139 23404 7668
12181 11794 22932 8144
12181 11794 22932 8144
12181 11794 22946 8144
12181 11794 22946 8144
12147 13063 23148 8280
12146 15994 22842 8565
12146 17491 22654 8718
12146 17488 22654 8718
12146 17653 22634 8733
12146 18656 21030 10513
12146 19717 20778 10770
12146 20341 20859 10846
12146 21134 21096 11133
12146 22692 21129 11453
12144 24698 22225 11476
12144 27726 22609 11536
12144 27774 22648 11555
12144 28447 22844 11564
12144 30286 23238 11567
nr_inactive_anon nr_active_anon nr_inactive_file nr_active_file
12121 31489 23350 11761
12099 33117 23336 11779
12099 33632 23555 11787
12099 35393 23566 11806
12099 35828 23490 11882
12099 35879 23486 11887
12099 35889 23486 11888
12099 36078 24124 11890
12099 36449 25079 11895
12099 37782 25334 11898
12099 39494 25564 11904
12099 40620 25657 11905
14200 41298 25399 12069
21555 35228 22969 12495
22829 33097 22703 12617
25519 31496 22115 12552
28590 28947 21617 13051
28940 29076 19806 13270
29430 29344 19153 13825
30183 30399 17643 13418
32242 32203 13535 13969
33319 33294 12236 13659
33154 33085 11431 13482
33572 33569 11315 13102
nr_inactive_anon nr_active_anon nr_inactive_file nr_active_file
36246 35355 8033 9355
35659 35558 8491 8394
35330 35142 8233 8278
35788 35561 8460 8454
36129 36359 8413 8627
36727 36365 8311 8509
36672 36870 8437 8479
36772 36656 8090 8354
36754 36614 8237 8378
37591 36065 8352 8470
36530 36383 7607 7611
36113 35992 7271 7296
36149 35667 7092 7052
36014 35350 7408 7206
36409 35890 8027 7396
36300 35418 7892 7704
36369 36589 7723 7838
36243 36168 7576 7793
35804 35622 7422 7726
35498 35435 7443 7557
35078 35159 7542 7243
35478 35415 8199 7552
35143 35025 7828 7763
35312 34754 7745 7545
nr_inactive_anon nr_active_anon nr_inactive_file nr_active_file
35093 34933 7166 7748
36253 36236 7171 7408
36225 36929 8236 7532
36197 36169 7562 7632
35711 35647 7312 7471
35210 35144 7202 7227
35052 35021 7073 7084
35263 35047 7128 6963
35359 35177 7572 7048
35665 35523 7927 7025
34988 34788 7279 7340
34678 34438 7352 7141
34352 34270 7033 6980
34307 34175 6881 6809
34038 34469 7603 6700
34169 33854 7105 6868
34048 34124 7051 6869
33630 33445 6821 6875
33047 32992 6617 6554
33232 33012 7114 6659
33442 33217 7408 6700
32942 32707 6830 7257
32672 32593 6801 7207
32406 32142 6656 6960
nr_inactive_anon nr_active_anon nr_inactive_file nr_active_file
32127 32036 6641 6798
31929 31769 6567 6664
31786 31968 7532 6670
32208 31228 7448 6859
30904 30835 6503 6774
30543 30559 6345 6709
30394 30278 6235 6288
30541 30239 6470 6243
30463 30656 6959 6587
31020 29794 7393 6897
30169 30128 6295 6905
29755 29644 6236 6598
29765 29617 6342 6475
29874 29748 6215 6335
29654 29491 6355 6358
29972 29853 7079 6607
29437 29267 6670 7205
29160 28956 6602 6982
29411 29017 6578 6937
29069 28952 6539 6717
29570 29342 6982 6850
28882 28927 6912 6809
29326 28731 6928 6814
28883 28817 6762 6819
nr_inactive_anon nr_active_anon nr_inactive_file nr_active_file
29072 28696 6756 6803
29296 29120 6993 6972
28426 28167 6238 7182
28071 27862 6197 6953
27944 27767 6872 6780
28141 27819 6839 6654
27547 27309 6285 6209
27578 27842 7730 6465
27741 27470 7180 6665
27481 27217 7566 6919
27568 27405 7696 7027
27274 27004 7416 7120
27110 26920 7111 7303
27282 27056 7476 7046
27549 27044 7779 7074
27325 26968 7290 6972
27665 27528 8465 7058
27093 26974 7662 7243
27155 27068 7299 7344
26638 26553 6925 7325
26718 26383 6571 7425
26264 26150 6470 6960
26463 26176 6590 6803
27155 26396 7387 6709
nr_inactive_anon nr_active_anon nr_inactive_file nr_active_file
26532 26408 6722 6702
26491 26421 6789 6731
26783 26950 8389 6849
27129 26584 7713 6991
26791 26228 7316 7202
26208 26168 7115 7172
26031 25907 6957 7118
25980 25675 6764 7216
25608 25535 6779 7042
25571 25501 6520 6943
26068 25287 6574 6948
25734 25305 6778 6776
25442 25134 6629 6556
25514 25217 7469 6543
25659 25552 8561 6620
26082 24784 8494 6676
25312 25194 7052 7026
25386 25267 7422 6973
25070 24965 6716 6886
25143 24801 6597 6785
24971 24866 6643 6786
25223 25212 6829 6757
25504 24778 7589 6840
25531 24786 8068 6896
nr_inactive_anon nr_active_anon nr_inactive_file nr_active_file
25343 25169 7227 7042
25195 25129 6804 7149
25355 25071 6958 6941
25294 25202 6676 6850
25688 25050 6743 6694
25736 25268 6910 6580
25750 25530 7299 6557
25401 25273 6622 6810
25672 25525 6798 6770
25192 25067 6226 6486
26011 25360 6540 6466
26673 25768 6444 6411
27211 26326 6370 6423
27527 26764 6615 6534
27355 26820 6337 6467
27385 26962 6098 6446
27528 27431 5832 6303
26955 26918 6016 6015
26816 26469 5847 5894
26961 26390 6077 5866
26781 26664 5625 5815
26755 26454 6114 5806
26994 26552 6016 5784
27482 26910 5945 5714
nr_inactive_anon nr_active_anon nr_inactive_file nr_active_file
27609 27143 5929 5715
27885 27443 7168 5947
27684 27635 8231 6145
27320 27205 7359 6415
27679 27265 7898 6445
27655 26342 7651 6574
27033 26853 7385 6831
26696 26468 6533 6721
26464 26310 6374 6465
26192 26084 6261 6417
26182 25801 6511 6367
26010 25880 6251 6266
26130 26032 5974 6280
26417 26175 5830 6610
26558 26450 6002 6623
26758 26016 6141 6526
26481 26363 5911 6356
26765 26622 6401 6266
27022 26534 6593 6210
27587 26515 6560 6193
27156 27029 6123 6109
27284 26926 6159 5776
27153 26996 5698 5642
26712 26603 6151 5541
nr_inactive_anon nr_active_anon nr_inactive_file nr_active_file
26697 27024 7919 5565
26651 26471 7134 5965
27021 26479 7617 5996
26323 26024 7091 6273
26081 25894 6267 6527
25605 25487 5814 6407
25564 25447 5613 6422
25460 25406 5630 6374
25380 25380 5776 6358
25661 25653 6045 6037
25790 24706 6069 6045
25512 25043 6024 5982
25440 25102 6067 5807
25802 25181 5953 5838
25864 25314 5694 5711
26022 25737 5592 5510
25964 25741 6784 5376
26092 25952 7929 5537
26110 25990 7120 5789
25311 25252 6157 6146
25432 25379 6658 6197
25552 25390 6176 6357
25388 25237 5742 6303
25841 25173 5932 6325
nr_inactive_anon nr_active_anon nr_inactive_file nr_active_file
25054 24965 5532 6072
24754 25191 7503 5941
25529 25133 7319 6000
25350 24510 6993 6078
24672 24541 6027 6068
24610 24492 5811 5804
24819 24674 5841 5820
24775 24394 5719 5696
24991 25179 6639 5816
25282 24538 6870 6088
25172 24727 6628 6090
25363 24721 6644 6091
25676 24705 6672 6102
24998 24909 5683 5957
24762 24736 6034 5869
24965 24890 6374 6614
25050 24895 6436 6616
25087 24932 6436 6617
25139 24860 6435 6619
25159 24903 6435 6620
25168 25362 6004 7051
25209 25524 6004 7052
25209 25504 6004 7053
25262 25447 6011 7054
[-- Attachment #2: run-many-x-apps.sh --]
[-- Type: application/x-sh, Size: 1735 bytes --]
^ permalink raw reply [flat|nested] 243+ messages in thread
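Wu's observation that the runtime active:inactive ratio stays near 1 under active scanning can be eyeballed directly against the samples above with a small helper. This helper is hypothetical, for reading the data only, and is distinct from the kernel's fixed per-zone inactive_ratio.

```c
#include <assert.h>
#include <limits.h>

/*
 * Runtime active:inactive ratio, rounded to the nearest integer,
 * computed from the nr_active_* / nr_inactive_* counters sampled
 * above. Hypothetical helper for inspecting the data; not kernel
 * code, and not the kernel's fixed per-zone inactive_ratio.
 */
static unsigned long runtime_inactive_ratio(unsigned long nr_active,
					    unsigned long nr_inactive)
{
	if (nr_inactive == 0)
		return ULONG_MAX;	/* under-scanned: ratio unbounded */
	return (nr_active + nr_inactive / 2) / nr_inactive;
}
```

Applied to the heavy-swap samples (e.g. 47888 active vs 47944 inactive anon) it yields 1; applied to lightly scanned samples (e.g. 55528 active vs 3040 inactive anon) it yields 18, illustrating how high the ratio drifts when the list is under-scanned.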
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-12 7:48 ` Wu Fengguang
@ 2009-08-12 14:31 ` Rik van Riel
0 siblings, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2009-08-12 14:31 UTC (permalink / raw)
To: Wu Fengguang
Cc: KOSAKI Motohiro, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred,
Kleen, Andi, Avi Kivity, Hugh Dickins, Andrew Morton,
Christoph Lameter, Mel Gorman, LKML, linux-mm
Wu Fengguang wrote:
> On Fri, Aug 07, 2009 at 11:17:22AM +0800, KOSAKI Motohiro wrote:
>>> Andrea Arcangeli wrote:
>>>
>>>> Likely we need a cut-off point, if we detect it takes more than X
>>>> seconds to scan the whole active list, we start ignoring young bits,
>>> We could just make this depend on the calculated inactive_ratio,
>>> which depends on the size of the list.
>>>
>>> For small systems, it may make sense to make every accessed bit
>>> count, because the working set will often approach the size of
>>> memory.
>>>
>>> On very large systems, the working set may also approach the
>>> size of memory, but the inactive list only contains a small
>>> percentage of the pages, so there is enough space for everything.
>>>
>>> Say, if the inactive_ratio is 3 or less, make the accessed bit
>>> on the active lists count.
>> Sounds reasonable.
>
> Yes, this kind of global measurement would be much better.
>
>> How do we confirm the correctness of this idea?
>
> In general the active list tends to grow large on an under-scanned LRU.
> I guess Rik is pretty familiar with typical inactive_ratio values on
> large-memory systems and may even have some real numbers :)
>
>> Wu, is your X focus-switching benchmark a sufficient test?
>
> It is a major test case for memory-tight desktops. Jeff presents
> another interesting one for KVM, hehe.
>
> Anyway, I collected the active/inactive list sizes, and the numbers
> show that the inactive_ratio is roughly 1 when the LRU is scanned
> actively and may go very high when it is under-scanned.
inactive_ratio is based on the zone (or cgroup) size.
For zones it is a fixed value, which is available in
/proc/zoneinfo
--
All rights reversed.
^ permalink raw reply [flat|nested] 243+ messages in thread
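For context, the fixed per-zone value Rik refers to was derived from the zone size: in kernels of this era, setup_per_zone_inactive_ratio() computed roughly the integer square root of 10x the zone size in gigabytes, with a minimum of 1. The formula below is quoted from memory of that code, so treat it as an approximation.

```c
#include <assert.h>

/* Integer square root, mimicking the kernel's int_sqrt(). */
static unsigned long isqrt(unsigned long x)
{
	unsigned long r = 0;

	while ((r + 1) * (r + 1) <= x)
		r++;
	return r;
}

/*
 * Approximation of how the fixed per-zone inactive_ratio was derived
 * from the zone size: ratio = int_sqrt(10 * size_in_GB), at least 1.
 */
static unsigned int zone_inactive_ratio(unsigned long zone_gigabytes)
{
	unsigned int ratio = 1;

	if (zone_gigabytes)
		ratio = (unsigned int)isqrt(10 * zone_gigabytes);
	return ratio ? ratio : 1;
}
```

This is broadly consistent with the /proc/zoneinfo readings Wu posts below: sub-GB zones report 1, and a DMA32 zone of roughly 2GB reports 4.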
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-12 14:31 ` Rik van Riel
@ 2009-08-13 1:03 ` Wu Fengguang
0 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-13 1:03 UTC (permalink / raw)
To: Rik van Riel
Cc: KOSAKI Motohiro, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred,
Kleen, Andi, Avi Kivity, Hugh Dickins, Andrew Morton,
Christoph Lameter, Mel Gorman, LKML, linux-mm
On Wed, Aug 12, 2009 at 10:31:41PM +0800, Rik van Riel wrote:
> Wu Fengguang wrote:
> > On Fri, Aug 07, 2009 at 11:17:22AM +0800, KOSAKI Motohiro wrote:
> >>> Andrea Arcangeli wrote:
> >>>
> >>>> Likely we need a cut-off point, if we detect it takes more than X
> >>>> seconds to scan the whole active list, we start ignoring young bits,
> >>> We could just make this depend on the calculated inactive_ratio,
> >>> which depends on the size of the list.
> >>>
> >>> For small systems, it may make sense to make every accessed bit
> >>> count, because the working set will often approach the size of
> >>> memory.
> >>>
> >>> On very large systems, the working set may also approach the
> >>> size of memory, but the inactive list only contains a small
> >>> percentage of the pages, so there is enough space for everything.
> >>>
> >>> Say, if the inactive_ratio is 3 or less, make the accessed bit
> >>> on the active lists count.
> >> Sounds reasonable.
> >
> > Yes, this kind of global measurement would be much better.
> >
> >> How do we confirm the correctness of this idea?
> >
> > In general the active list tends to grow large on an under-scanned LRU.
> > I guess Rik is pretty familiar with typical inactive_ratio values on
> > large-memory systems and may even have some real numbers :)
> >
> >> Wu, is your X focus-switching benchmark a sufficient test?
> >
> > It is a major test case for memory-tight desktops. Jeff presents
> > another interesting one for KVM, hehe.
> >
> > Anyway, I collected the active/inactive list sizes, and the numbers
> > show that the inactive_ratio is roughly 1 when the LRU is scanned
> > actively and may go very high when it is under-scanned.
>
> inactive_ratio is based on the zone (or cgroup) size.
Ah, sorry: by 'inactive_ratio' I meant the runtime active:inactive ratio.
> For zones it is a fixed value, which is available in
> /proc/zoneinfo
On my 64bit desktop with 4GB memory:
DMA inactive_ratio: 1
DMA32 inactive_ratio: 4
Normal inactive_ratio: 1
The biggest zone, DMA32, has inactive_ratio=4. But I guess the
referenced bit should not be ignored in this typical desktop
configuration?
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
@ 2009-08-13 1:03 ` Wu Fengguang
0 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-13 1:03 UTC (permalink / raw)
To: Rik van Riel
Cc: KOSAKI Motohiro, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred,
Kleen, Andi, Avi Kivity, Hugh Dickins, Andrew Morton,
Christoph Lameter, Mel Gorman, LKML, linux-mm
On Wed, Aug 12, 2009 at 10:31:41PM +0800, Rik van Riel wrote:
> Wu Fengguang wrote:
> > On Fri, Aug 07, 2009 at 11:17:22AM +0800, KOSAKI Motohiro wrote:
> >>> Andrea Arcangeli wrote:
> >>>
> >>>> Likely we need a cut-off point, if we detect it takes more than X
> >>>> seconds to scan the whole active list, we start ignoring young bits,
> >>> We could just make this depend on the calculated inactive_ratio,
> >>> which depends on the size of the list.
> >>>
> >>> For small systems, it may make sense to make every accessed bit
> >>> count, because the working set will often approach the size of
> >>> memory.
> >>>
> >>> On very large systems, the working set may also approach the
> >>> size of memory, but the inactive list only contains a small
> >>> percentage of the pages, so there is enough space for everything.
> >>>
> >>> Say, if the inactive_ratio is 3 or less, make the accessed bit
> >>> on the active lists count.
> >> Sound reasonable.
> >
> > Yes, such kind of global measurements would be much better.
> >
> >> How do we confirm the idea correctness?
> >
> > In general the active list tends to grow large on under-scanned LRU.
> > I guess Rik is pretty familiar with typical inactive_ratio values of
> > the large memory systems and may even have some real numbers :)
> >
> >> Wu, your X focus switching benchmark is sufficient test?
> >
> > It is a major test case for memory tight desktop. Jeff presents
> > another interesting one for KVM, hehe.
> >
> > Anyway I collected the active/inactive list sizes, and the numbers
> > show that the inactive_ratio is roughly 1 when the LRU is scanned
> > actively and may go very high when it is under-scanned.
>
> inactive_ratio is based on the zone (or cgroup) size.
Ah sorry, by 'inactive_ratio' I meant the runtime active:inactive ratio.
> For zones it is a fixed value, which is available in
> /proc/zoneinfo
On my 64bit desktop with 4GB memory:
DMA inactive_ratio: 1
DMA32 inactive_ratio: 4
Normal inactive_ratio: 1
The biggest zone DMA32 has inactive_ratio=4. But I guess the
referenced bit should not be ignored on this typical desktop
configuration?
Thanks,
Fengguang
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-13 1:03 ` Wu Fengguang
@ 2009-08-13 15:46 ` Rik van Riel
-1 siblings, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2009-08-13 15:46 UTC (permalink / raw)
To: Wu Fengguang
Cc: KOSAKI Motohiro, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred,
Kleen, Andi, Avi Kivity, Hugh Dickins, Andrew Morton,
Christoph Lameter, Mel Gorman, LKML, linux-mm
Wu Fengguang wrote:
> On Wed, Aug 12, 2009 at 10:31:41PM +0800, Rik van Riel wrote:
>> For zones it is a fixed value, which is available in
>> /proc/zoneinfo
>
> On my 64bit desktop with 4GB memory:
>
> DMA inactive_ratio: 1
> DMA32 inactive_ratio: 4
> Normal inactive_ratio: 1
>
> The biggest zone DMA32 has inactive_ratio=4. But I guess the
> referenced bit should not be ignored on this typical desktop
> configuration?
We need to ignore the referenced bit on active anon pages
on very large systems, but it could indeed be helpful to
respect the referenced bit on smaller systems.
I have no idea where the cut-off between them would be.
Maybe at inactive_ratio <= 4 ?
--
All rights reversed.
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-13 15:46 ` Rik van Riel
@ 2009-08-13 16:12 ` Avi Kivity
-1 siblings, 0 replies; 243+ messages in thread
From: Avi Kivity @ 2009-08-13 16:12 UTC (permalink / raw)
To: Rik van Riel
Cc: Wu Fengguang, KOSAKI Motohiro, Andrea Arcangeli, Dike, Jeffrey G,
Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton,
Christoph Lameter, Mel Gorman, LKML, linux-mm
On 08/13/2009 06:46 PM, Rik van Riel wrote:
> We need to ignore the referenced bit on active anon pages
> on very large systems, but it could indeed be helpful to
> respect the referenced bit on smaller systems.
>
> I have no idea where the cut-off between them would be.
>
> Maybe at inactive_ratio <= 4 ?
Why do we need to ignore the referenced bit in such cases? To avoid
overscanning?
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-13 16:12 ` Avi Kivity
@ 2009-08-13 16:26 ` Rik van Riel
-1 siblings, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2009-08-13 16:26 UTC (permalink / raw)
To: Avi Kivity
Cc: Wu Fengguang, KOSAKI Motohiro, Andrea Arcangeli, Dike, Jeffrey G,
Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton,
Christoph Lameter, Mel Gorman, LKML, linux-mm
Avi Kivity wrote:
> On 08/13/2009 06:46 PM, Rik van Riel wrote:
>> We need to ignore the referenced bit on active anon pages
>> on very large systems, but it could indeed be helpful to
>> respect the referenced bit on smaller systems.
>>
>> I have no idea where the cut-off between them would be.
>>
>> Maybe at inactive_ratio <= 4 ?
>
> Why do we need to ignore the referenced bit in such cases? To avoid
> overscanning?
Because swapping out anonymous pages tends to be a relatively
rare operation, we'll have many gigabytes of anonymous pages
that all have the referenced bit set (because there was lots
of time between swapout bursts).
Ignoring the referenced bit on active anon pages makes no
difference on these systems, because all active anon pages
have the referenced bit set, anyway.
All we need to do is put the pages on the inactive list and
give them a chance to get referenced.
However, on smaller systems (and cgroups!), the speed at
which we can do pageout IO is larger, compared to the amount
of memory. This means we can cycle through the pages more
quickly and we may want to count references on the active
list, too.
Yes, on smaller systems we'll also often end up with bursty
swapout loads and all pages referenced - but since we have
fewer pages to begin with, it won't hurt as much.
I suspect that an inactive_ratio of 3 or 4 might make a
good cutoff value.
--
All rights reversed.
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-13 16:26 ` Rik van Riel
@ 2009-08-13 19:12 ` Avi Kivity
-1 siblings, 0 replies; 243+ messages in thread
From: Avi Kivity @ 2009-08-13 19:12 UTC (permalink / raw)
To: Rik van Riel
Cc: Wu Fengguang, KOSAKI Motohiro, Andrea Arcangeli, Dike, Jeffrey G,
Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton,
Christoph Lameter, Mel Gorman, LKML, linux-mm
On 08/13/2009 07:26 PM, Rik van Riel wrote:
>> Why do we need to ignore the referenced bit in such cases? To avoid
>> overscanning?
>
>
> Because swapping out anonymous pages tends to be a relatively
> rare operation, we'll have many gigabytes of anonymous pages
> that all have the referenced bit set (because there was lots
> of time between swapout bursts).
>
> Ignoring the referenced bit on active anon pages makes no
> difference on these systems, because all active anon pages
> have the referenced bit set, anyway.
>
> All we need to do is put the pages on the inactive list and
> give them a chance to get referenced.
>
> However, on smaller systems (and cgroups!), the speed at
> which we can do pageout IO is larger, compared to the amount
> of memory. This means we can cycle through the pages more
> quickly and we may want to count references on the active
> list, too.
>
> Yes, on smaller systems we'll also often end up with bursty
> swapout loads and all pages referenced - but since we have
> fewer pages to begin with, it won't hurt as much.
>
> I suspect that an inactive_ratio of 3 or 4 might make a
> good cutoff value.
>
Thanks for the explanation. I think my earlier idea of
- do not ignore the referenced bit
- if you see a run of N pages which all have the referenced bit set, do
swap one
has merit. It means we cycle more quickly (by a factor of N) through
the list, looking for unreferenced pages. If we don't find any we've
spent some more CPU, but if we do find an unreferenced page, we win by
swapping a truly unneeded page.
Cycling faster also means reducing the time between examinations of any
particular page, so it increases the meaningfulness of the check on
large systems (otherwise even rarely used pages will always show up as
referenced).
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-13 19:12 ` Avi Kivity
@ 2009-08-13 21:16 ` Johannes Weiner
-1 siblings, 0 replies; 243+ messages in thread
From: Johannes Weiner @ 2009-08-13 21:16 UTC (permalink / raw)
To: Avi Kivity
Cc: Rik van Riel, Wu Fengguang, KOSAKI Motohiro, Andrea Arcangeli,
Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm
On Thu, Aug 13, 2009 at 10:12:01PM +0300, Avi Kivity wrote:
> On 08/13/2009 07:26 PM, Rik van Riel wrote:
> >>Why do we need to ignore the referenced bit in such cases? To avoid
> >>overscanning?
> >
> >
> >Because swapping out anonymous pages tends to be a relatively
> >rare operation, we'll have many gigabytes of anonymous pages
> >that all have the referenced bit set (because there was lots
> >of time between swapout bursts).
> >
> >Ignoring the referenced bit on active anon pages makes no
> >difference on these systems, because all active anon pages
> >have the referenced bit set, anyway.
> >
> >All we need to do is put the pages on the inactive list and
> >give them a chance to get referenced.
> >
> >However, on smaller systems (and cgroups!), the speed at
> >which we can do pageout IO is larger, compared to the amount
> >of memory. This means we can cycle through the pages more
> >quickly and we may want to count references on the active
> >list, too.
> >
> >Yes, on smaller systems we'll also often end up with bursty
> >swapout loads and all pages referenced - but since we have
> >fewer pages to begin with, it won't hurt as much.
> >
> >I suspect that an inactive_ratio of 3 or 4 might make a
> >good cutoff value.
> >
>
> Thanks for the explanation. I think my earlier idea of
>
> - do not ignore the referenced bit
> - if you see a run of N pages which all have the referenced bit set, do
> swap one
>
> has merit. It means we cycle more quickly (by a factor of N) through
> the list, looking for unreferenced pages. If we don't find any we've
> spent some more CPU, but if we do find an unreferenced page, we win by
> swapping a truly unneeded page.
But it also means destroying the LRU order.
Okay, we ignore the referenced bit, but we keep LRU buddies together
which then get reactivated together as well, if they are indeed in
active use.
I could imagine the VM going nuts when you separate them by a
predicate that is rather unrelated to the pages' actual usage
patterns.
After all, the list order is the primary input to selecting pages for
eviction.
It would need actual testing, of course, but I bet Rik's idea of using
the referenced bit always or never is going to show better results.
Hannes
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-13 21:16 ` Johannes Weiner
@ 2009-08-14 7:16 ` Avi Kivity
-1 siblings, 0 replies; 243+ messages in thread
From: Avi Kivity @ 2009-08-14 7:16 UTC (permalink / raw)
To: Johannes Weiner
Cc: Rik van Riel, Wu Fengguang, KOSAKI Motohiro, Andrea Arcangeli,
Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm
On 08/14/2009 12:16 AM, Johannes Weiner wrote:
>
>> - do not ignore the referenced bit
>> - if you see a run of N pages which all have the referenced bit set, do
>> swap one
>>
>>
>
> But it also means destroying the LRU order.
>
>
True, it would, but if we ignore the referenced bit, LRU order is really
FIFO.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-14 7:16 ` Avi Kivity
@ 2009-08-14 9:10 ` Johannes Weiner
-1 siblings, 0 replies; 243+ messages in thread
From: Johannes Weiner @ 2009-08-14 9:10 UTC (permalink / raw)
To: Avi Kivity
Cc: Rik van Riel, Wu Fengguang, KOSAKI Motohiro, Andrea Arcangeli,
Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm
On Fri, Aug 14, 2009 at 10:16:26AM +0300, Avi Kivity wrote:
> On 08/14/2009 12:16 AM, Johannes Weiner wrote:
> >
> >>- do not ignore the referenced bit
> >>- if you see a run of N pages which all have the referenced bit set, do
> >>swap one
> >>
> >>
> >
> >But it also means destroying the LRU order.
> >
> >
>
> True, it would, but if we ignore the referenced bit, LRU order is really
> FIFO.
For the active list, yes. But it's not that we degrade to First Fault
First Out in a global scope, we still update the order from
mark_page_accessed() and by activating referenced pages in
shrink_page_list() etc.
So even with the active list being a FIFO, we keep usage information
gathered from the inactive list. If we deactivate pages in arbitrary
list intervals, we throw this away.
And while global FIFO-based reclaim does not work too well, initial
fault order is a valuable hint in the aspect of referential locality
as the pages get used in groups and thus move around the lists in
groups.
Our granularity for regrouping decisions is pretty coarse, for
non-filecache pages it's basically 'referenced or not referenced in the
last list round-trip', so it will take quite some time to regroup
pages that are used in truly similar intervals.
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-14 9:10 ` Johannes Weiner
@ 2009-08-14 9:51 ` Wu Fengguang
-1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-14 9:51 UTC (permalink / raw)
To: Johannes Weiner
Cc: Avi Kivity, Rik van Riel, KOSAKI Motohiro, Andrea Arcangeli,
Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm
On Fri, Aug 14, 2009 at 05:10:55PM +0800, Johannes Weiner wrote:
> On Fri, Aug 14, 2009 at 10:16:26AM +0300, Avi Kivity wrote:
> > On 08/14/2009 12:16 AM, Johannes Weiner wrote:
> > >
> > >>- do not ignore the referenced bit
> > >>- if you see a run of N pages which all have the referenced bit set, do
> > >>swap one
> > >>
> > >>
> > >
> > >But it also means destroying the LRU order.
> > >
> > >
> >
> > True, it would, but if we ignore the referenced bit, LRU order is really
> > FIFO.
>
> For the active list, yes. But it's not that we degrade to First Fault
> First Out in a global scope, we still update the order from
> mark_page_accessed() and by activating referenced pages in
> shrink_page_list() etc.
>
> So even with the active list being a FIFO, we keep usage information
> gathered from the inactive list. If we deactivate pages in arbitrary
> list intervals, we throw this away.
We do have the danger of FIFO if the inactive list is small enough
that (unconditionally) deactivated pages quickly get reclaimed and
their life window on the inactive list is too small to be useful.
> And while global FIFO-based reclaim does not work too well, initial
> fault order is a valuable hint in the aspect of referential locality
> as the pages get used in groups and thus move around the lists in
> groups.
>
> Our granularity for regrouping decisions is pretty coarse, for
> non-filecache pages it's basically 'referenced or not referenced in the
> last list round-trip', so it will take quite some time to regroup
> pages that are used in truly similar intervals.
Agreed.
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-14 9:51 ` Wu Fengguang
@ 2009-08-14 13:19 ` Rik van Riel
-1 siblings, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2009-08-14 13:19 UTC (permalink / raw)
To: Wu Fengguang
Cc: Johannes Weiner, Avi Kivity, KOSAKI Motohiro, Andrea Arcangeli,
Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm
Wu Fengguang wrote:
> On Fri, Aug 14, 2009 at 05:10:55PM +0800, Johannes Weiner wrote:
>> So even with the active list being a FIFO, we keep usage information
>> gathered from the inactive list. If we deactivate pages in arbitrary
>> list intervals, we throw this away.
>
> We do have the danger of FIFO, if inactive list is small enough, so
> that (unconditionally) deactivated pages quickly get reclaimed and
> their life window in inactive list is too small to be useful.
This is one of the reasons why we unconditionally deactivate
the active anon pages, and do background scanning of the
active anon list when reclaiming page cache pages.
We want to always move some pages to the inactive anon
list, so it does not get too small.
--
All rights reversed.
^ permalink raw reply [flat|nested] 243+ messages in thread
* RE: [RFC] respect the referenced bit of KVM guest pages?
2009-08-14 9:51 ` Wu Fengguang
@ 2009-08-14 21:42 ` Dike, Jeffrey G
-1 siblings, 0 replies; 243+ messages in thread
From: Dike, Jeffrey G @ 2009-08-14 21:42 UTC (permalink / raw)
To: Wu, Fengguang, Johannes Weiner
Cc: Avi Kivity, Rik van Riel, KOSAKI Motohiro, Andrea Arcangeli, Yu,
Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton,
Christoph Lameter, Mel Gorman, LKML, linux-mm
A side note - I've been doing some tracing and shrink_active_list is called a humongous number of times (25000-ish during a ~90 kvm run), with a net result of zero pages moved nearly all the time. Your test is rescuing essentially all candidate pages from the inactive list. Right now, I have the VM_EXEC || PageAnon version of your test.
Jeff
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-14 21:42 ` Dike, Jeffrey G
@ 2009-08-14 22:37 ` Rik van Riel
-1 siblings, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2009-08-14 22:37 UTC (permalink / raw)
To: Dike, Jeffrey G
Cc: Wu, Fengguang, Johannes Weiner, Avi Kivity, KOSAKI Motohiro,
Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm
Dike, Jeffrey G wrote:
> A side note - I've been doing some tracing and shrink_active_list is called a humongous number of times (25000-ish during a ~90 kvm run), with a net result of zero pages moved nearly all the time. Your test is rescuing essentially all candidate pages from the inactive list. Right now, I have the VM_EXEC || PageAnon version of your test.
That is exactly why the split LRU VM does an unconditional
deactivation of active anon pages :)
--
All rights reversed.
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-14 22:37 ` Rik van Riel
@ 2009-08-15 5:32 ` Wu Fengguang
-1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-15 5:32 UTC (permalink / raw)
To: Rik van Riel
Cc: Dike, Jeffrey G, Johannes Weiner, Avi Kivity, KOSAKI Motohiro,
Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm
On Sat, Aug 15, 2009 at 06:37:22AM +0800, Rik van Riel wrote:
> Dike, Jeffrey G wrote:
> > A side note - I've been doing some tracing and shrink_active_list
> > is called a humongous number of times (25000-ish during a ~90 kvm
> > run), with a net result of zero pages moved nearly all the time.
You mean "no pages get deactivated at all in most invocations"?
This is possible in the steady (thrashing) state of a memory-tight
system (the working set is bigger than memory size).
> > Your test is rescuing essentially all candidate pages from the
> > inactive list. Right now, I have the VM_EXEC || PageAnon version
> > of your test.
>
> That is exactly why the split LRU VM does an unconditional
> deactivation of active anon pages :)
In general it is :) However in Jeff's small-memory case, there
will be many refaults without the "PageAnon" protection. But the
patch does not imply that I'm happy with the "PageAnon" test ;)
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-14 13:19 ` Rik van Riel
@ 2009-08-15 5:45 ` Wu Fengguang
-1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-15 5:45 UTC (permalink / raw)
To: Rik van Riel
Cc: Johannes Weiner, Avi Kivity, KOSAKI Motohiro, Andrea Arcangeli,
Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm
On Fri, Aug 14, 2009 at 09:19:35PM +0800, Rik van Riel wrote:
> Wu Fengguang wrote:
> > On Fri, Aug 14, 2009 at 05:10:55PM +0800, Johannes Weiner wrote:
>
> >> So even with the active list being a FIFO, we keep usage information
> >> gathered from the inactive list. If we deactivate pages in arbitrary
> >> list intervals, we throw this away.
> >
> > We do have the danger of FIFO, if inactive list is small enough, so
> > that (unconditionally) deactivated pages quickly get reclaimed and
> > their life window in inactive list is too small to be useful.
>
> This one of the reasons why we unconditionally deactivate
> the active anon pages, and do background scanning of the
> active anon list when reclaiming page cache pages.
>
> We want to always move some pages to the inactive anon
> list, so it does not get too small.
Right, the current code tries to pull the inactive list out of
its smallish-size state as long as there is vmscan activity.
However there is a possible (and tricky) hole: mem cgroups
don't do batched vmscan. shrink_zone() may call shrink_list()
with nr_to_scan=1, in which case shrink_list() _still_ calls
isolate_pages() with the much larger SWAP_CLUSTER_MAX.
It effectively scales up the inactive list scan rate by 10 times while
the list is still small, and may thus prevent it from ever growing.
In that case, the LRU becomes a FIFO.
Jeff, can you confirm if the mem cgroup's inactive list is small?
If so, this patch should help.
Thanks,
Fengguang
---
mm: do batched scans for mem_cgroup
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/linux/memcontrol.h | 3 +++
mm/memcontrol.c | 12 ++++++++++++
mm/vmscan.c | 9 +++++----
3 files changed, 20 insertions(+), 4 deletions(-)
--- linux.orig/include/linux/memcontrol.h 2009-08-15 13:12:49.000000000 +0800
+++ linux/include/linux/memcontrol.h 2009-08-15 13:18:13.000000000 +0800
@@ -98,6 +98,9 @@ int mem_cgroup_inactive_file_is_low(stru
unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
struct zone *zone,
enum lru_list lru);
+unsigned long *mem_cgroup_get_saved_scan(struct mem_cgroup *memcg,
+ struct zone *zone,
+ enum lru_list lru);
struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
struct zone *zone);
struct zone_reclaim_stat*
--- linux.orig/mm/memcontrol.c 2009-08-15 13:07:34.000000000 +0800
+++ linux/mm/memcontrol.c 2009-08-15 13:17:56.000000000 +0800
@@ -115,6 +115,7 @@ struct mem_cgroup_per_zone {
*/
struct list_head lists[NR_LRU_LISTS];
unsigned long count[NR_LRU_LISTS];
+ unsigned long nr_saved_scan[NR_LRU_LISTS];
struct zone_reclaim_stat reclaim_stat;
};
@@ -597,6 +598,17 @@ unsigned long mem_cgroup_zone_nr_pages(s
return MEM_CGROUP_ZSTAT(mz, lru);
}
+unsigned long *mem_cgroup_get_saved_scan(struct mem_cgroup *memcg,
+ struct zone *zone,
+ enum lru_list lru)
+{
+ int nid = zone->zone_pgdat->node_id;
+ int zid = zone_idx(zone);
+ struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
+
+ return &mz->nr_saved_scan[lru];
+}
+
struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
struct zone *zone)
{
--- linux.orig/mm/vmscan.c 2009-08-15 13:04:54.000000000 +0800
+++ linux/mm/vmscan.c 2009-08-15 13:19:03.000000000 +0800
@@ -1534,6 +1534,7 @@ static void shrink_zone(int priority, st
for_each_evictable_lru(l) {
int file = is_file_lru(l);
unsigned long scan;
+ unsigned long *saved_scan;
scan = zone_nr_pages(zone, sc, l);
if (priority || noswap) {
@@ -1541,11 +1542,11 @@ static void shrink_zone(int priority, st
scan = (scan * percent[file]) / 100;
}
if (scanning_global_lru(sc))
- nr[l] = nr_scan_try_batch(scan,
- &zone->lru[l].nr_saved_scan,
- swap_cluster_max);
+ saved_scan = &zone->lru[l].nr_saved_scan;
else
- nr[l] = scan;
+ saved_scan = mem_cgroup_get_saved_scan(sc->mem_cgroup,
+ zone, l);
+ nr[l] = nr_scan_try_batch(scan, saved_scan, swap_cluster_max);
}
while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-06 21:09 ` Jeff Dike
@ 2009-08-16 3:18 ` Wu Fengguang
-1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-16 3:18 UTC (permalink / raw)
To: Jeff Dike
Cc: Avi Kivity, Andrea Arcangeli, Rik van Riel, Yu, Wilfred, Kleen,
Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
On Fri, Aug 07, 2009 at 05:09:55AM +0800, Jeff Dike wrote:
> Side question -
> Is there a good reason for this to be in shrink_active_list()
> as opposed to __isolate_lru_page?
>
> if (unlikely(!page_evictable(page, NULL))) {
> putback_lru_page(page);
> continue;
> }
>
> Maybe we want to minimize the amount of code under the lru lock or
> avoid duplicate logic in the isolate_page functions.
I guess the quick test means to avoid the expensive page_referenced()
call that follows it. But that should be a mostly one-shot cost -
unevictable pages are unlikely to cycle through the active/inactive
lists again and again.
> But if there are important mlock-heavy workloads, this could make the
> scan come up empty, or at least emptier than we might like.
Yes, if the above 'if' block is removed, the inactive lists might get
more expensive to reclaim.
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-06 13:16 ` Rik van Riel
@ 2009-08-16 3:28 ` Wu Fengguang
-1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-16 3:28 UTC (permalink / raw)
To: Rik van Riel
Cc: Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred,
Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
On Thu, Aug 06, 2009 at 09:16:14PM +0800, Rik van Riel wrote:
> Wu Fengguang wrote:
>
> > I guess both schemes have unacceptable flaws.
> >
> > For JVM/BIGMEM workload, most pages would be found referenced _all the time_.
> > So the KEEP_MOST scheme could increase reclaim overheads by N=250 times;
> > while the DROP_CONTINUOUS scheme is effectively zero cost.
>
> The higher overhead may not be an issue on smaller systems,
> or inside smaller cgroups inside large systems, when doing
> cgroup reclaim.
Right.
> > However, the DROP_CONTINUOUS scheme does bring more _indeterminacy_.
> > It can behave vastly different on single active task and multi ones.
> > It is short sighted and can be cheated by bursty activities.
>
> The split LRU VM tries to avoid the bursty page aging as
> much as possible, by doing background deactivating of
> anonymous pages whenever we reclaim page cache pages and
> the number of anonymous pages in the zone (or cgroup) is
> low.
Right, but I meant bursty page allocations and the accesses on them,
which can make a large contiguous segment of referenced pages in the
LRU list, say 50MB. They may or may not be valuable as a whole, however
a local algorithm may keep the first 4MB and drop the remaining 46MB.
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-16 3:18 ` Wu Fengguang
@ 2009-08-16 3:53 ` Rik van Riel
-1 siblings, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2009-08-16 3:53 UTC (permalink / raw)
To: Wu Fengguang
Cc: Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu, Wilfred, Kleen,
Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
Wu Fengguang wrote:
> On Fri, Aug 07, 2009 at 05:09:55AM +0800, Jeff Dike wrote:
>> Side question -
>> Is there a good reason for this to be in shrink_active_list()
>> as opposed to __isolate_lru_page?
>>
>> if (unlikely(!page_evictable(page, NULL))) {
>> putback_lru_page(page);
>> continue;
>> }
>>
>> Maybe we want to minimize the amount of code under the lru lock or
>> avoid duplicate logic in the isolate_page functions.
>
> I guess the quick test means to avoid the expensive page_referenced()
> call that follows it. But that should be mostly one shot cost - the
> unevictable pages are unlikely to cycle in active/inactive list again
> and again.
Please read what putback_lru_page does.
It moves the page onto the unevictable list, so that
it will not end up in this scan again.
>> But if there are important mlock-heavy workloads, this could make the
>> scan come up empty, or at least emptier than we might like.
>
> Yes, if the above 'if' block is removed, the inactive lists might get
> more expensive to reclaim.
Why?
--
All rights reversed.
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-16 3:28 ` Wu Fengguang
@ 2009-08-16 3:56 ` Rik van Riel
-1 siblings, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2009-08-16 3:56 UTC (permalink / raw)
To: Wu Fengguang
Cc: Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred,
Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
Wu Fengguang wrote:
> Right, but I meant bursty page allocations and accesses on them, which
> can make a large continuous segment of referenced pages in LRU list,
> say 50MB. They may or may not be valuable as a whole, however a local
> algorithm may keep the first 4MB and drop the remaining 46MB.
I wonder if the problem is that we simply do not keep a large
enough inactive list in Jeff's test. If we do not, pages do
not have a chance to be referenced again before the reclaim
code comes in.
The cgroup stats should show how many active anon and inactive
anon pages there are in the cgroup.
--
All rights reversed.
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-16 3:56 ` Rik van Riel
@ 2009-08-16 4:43 ` Balbir Singh
-1 siblings, 0 replies; 243+ messages in thread
From: Balbir Singh @ 2009-08-16 4:43 UTC (permalink / raw)
To: Rik van Riel
Cc: Wu Fengguang, Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G, Yu,
Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton,
Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
* Rik van Riel <riel@redhat.com> [2009-08-15 23:56:39]:
> Wu Fengguang wrote:
>
>> Right, but I meant bursty page allocations and accesses on them, which
>> can make a large continuous segment of referenced pages in LRU list,
>> say 50MB. They may or may not be valuable as a whole, however a local
>> algorithm may keep the first 4MB and drop the remaining 46MB.
>
> I wonder if the problem is that we simply do not keep a large
> enough inactive list in Jeff's test. If we do not, pages do
> not have a chance to be referenced again before the reclaim
> code comes in.
>
> The cgroup stats should show how many active anon and inactive
> anon pages there are in the cgroup.
>
Yes, we do show active and inactive anon pages in the mem cgroup
controller in the memory.stat file.
--
Balbir
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
@ 2009-08-16 4:43 ` Balbir Singh
0 siblings, 0 replies; 243+ messages in thread
From: Balbir Singh @ 2009-08-16 4:43 UTC (permalink / raw)
To: Rik van Riel
Cc: Wu Fengguang, Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G, Yu,
Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton,
Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
* Rik van Riel <riel@redhat.com> [2009-08-15 23:56:39]:
> Wu Fengguang wrote:
>
>> Right, but I meant busty page allocations and accesses on them, which
>> can make a large continuous segment of referenced pages in LRU list,
>> say 50MB. They may or may not be valuable as a whole, however a local
>> algorithm may keep the first 4MB and drop the remaining 46MB.
>
> I wonder if the problem is that we simply do not keep a large
> enough inactive list in Jeff's test. If we do not, pages do
> not have a chance to be referenced again before the reclaim
> code comes in.
>
> The cgroup stats should show how many active anon and inactive
> anon pages there are in the cgroup.
>
Yes, we do show active and inactive anon pages in the mem cgroup
controller in the memory.stat file.
--
Balbir
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-16 3:56 ` Rik van Riel
@ 2009-08-16 4:55 ` Wu Fengguang
-1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-16 4:55 UTC (permalink / raw)
To: Rik van Riel
Cc: Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred,
Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
On Sun, Aug 16, 2009 at 11:56:39AM +0800, Rik van Riel wrote:
> Wu Fengguang wrote:
>
> > Right, but I meant bursty page allocations and accesses to them, which
> > can make a large continuous segment of referenced pages in LRU list,
> > say 50MB. They may or may not be valuable as a whole, however a local
> > algorithm may keep the first 4MB and drop the remaining 46MB.
>
> I wonder if the problem is that we simply do not keep a large
> enough inactive list in Jeff's test. If we do not, pages do
> not have a chance to be referenced again before the reclaim
> code comes in.
Exactly, that's the case where the LRU list degrades into a FIFO.
> The cgroup stats should show how many active anon and inactive
> anon pages there are in the cgroup.
Jeff, can you have a look at these stats? Thanks!
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-15 5:45 ` Wu Fengguang
@ 2009-08-16 5:09 ` Balbir Singh
-1 siblings, 0 replies; 243+ messages in thread
From: Balbir Singh @ 2009-08-16 5:09 UTC (permalink / raw)
To: Wu Fengguang
Cc: Rik van Riel, Johannes Weiner, Avi Kivity, KOSAKI Motohiro,
Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML,
linux-mm
* Wu Fengguang <fengguang.wu@intel.com> [2009-08-15 13:45:24]:
> On Fri, Aug 14, 2009 at 09:19:35PM +0800, Rik van Riel wrote:
> > Wu Fengguang wrote:
> > > On Fri, Aug 14, 2009 at 05:10:55PM +0800, Johannes Weiner wrote:
> >
> > >> So even with the active list being a FIFO, we keep usage information
> > >> gathered from the inactive list. If we deactivate pages in arbitrary
> > >> list intervals, we throw this away.
> > >
> > > We do have the danger of FIFO, if inactive list is small enough, so
> > > that (unconditionally) deactivated pages quickly get reclaimed and
> > > their life window in inactive list is too small to be useful.
> >
> > > This is one of the reasons why we unconditionally deactivate
> > the active anon pages, and do background scanning of the
> > active anon list when reclaiming page cache pages.
> >
> > We want to always move some pages to the inactive anon
> > list, so it does not get too small.
>
> Right, the current code tries to pull the inactive list out of
> its smallish-size state as long as there is vmscan activity.
>
> However there is a possible (and tricky) hole: mem cgroups
> don't do batched vmscan. shrink_zone() may call shrink_list()
> with nr_to_scan=1, in which case shrink_list() _still_ calls
> isolate_pages() with the much larger SWAP_CLUSTER_MAX.
>
> It effectively scales up the inactive list scan rate by 10 times when
> it is still small, and may thus prevent it from ever growing.
>
I think we possibly need to export some scanning data under DEBUG_VM
to cross-verify.
> In that case, LRU becomes FIFO.
>
> Jeff, can you confirm if the mem cgroup's inactive list is small?
> If so, this patch should help.
>
> Thanks,
> Fengguang
> ---
>
> mm: do batched scans for mem_cgroup
>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
> include/linux/memcontrol.h | 3 +++
> mm/memcontrol.c | 12 ++++++++++++
> mm/vmscan.c | 9 +++++----
> 3 files changed, 20 insertions(+), 4 deletions(-)
>
> --- linux.orig/include/linux/memcontrol.h 2009-08-15 13:12:49.000000000 +0800
> +++ linux/include/linux/memcontrol.h 2009-08-15 13:18:13.000000000 +0800
> @@ -98,6 +98,9 @@ int mem_cgroup_inactive_file_is_low(stru
> unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
> struct zone *zone,
> enum lru_list lru);
> +unsigned long *mem_cgroup_get_saved_scan(struct mem_cgroup *memcg,
> + struct zone *zone,
> + enum lru_list lru);
> struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
> struct zone *zone);
> struct zone_reclaim_stat*
> --- linux.orig/mm/memcontrol.c 2009-08-15 13:07:34.000000000 +0800
> +++ linux/mm/memcontrol.c 2009-08-15 13:17:56.000000000 +0800
> @@ -115,6 +115,7 @@ struct mem_cgroup_per_zone {
> */
> struct list_head lists[NR_LRU_LISTS];
> unsigned long count[NR_LRU_LISTS];
> + unsigned long nr_saved_scan[NR_LRU_LISTS];
>
> struct zone_reclaim_stat reclaim_stat;
> };
> @@ -597,6 +598,17 @@ unsigned long mem_cgroup_zone_nr_pages(s
> return MEM_CGROUP_ZSTAT(mz, lru);
> }
>
> +unsigned long *mem_cgroup_get_saved_scan(struct mem_cgroup *memcg,
> + struct zone *zone,
> + enum lru_list lru)
> +{
> + int nid = zone->zone_pgdat->node_id;
> + int zid = zone_idx(zone);
> + struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
> +
> + return &mz->nr_saved_scan[lru];
> +}
> +
> struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
> struct zone *zone)
> {
> --- linux.orig/mm/vmscan.c 2009-08-15 13:04:54.000000000 +0800
> +++ linux/mm/vmscan.c 2009-08-15 13:19:03.000000000 +0800
> @@ -1534,6 +1534,7 @@ static void shrink_zone(int priority, st
> for_each_evictable_lru(l) {
> int file = is_file_lru(l);
> unsigned long scan;
> + unsigned long *saved_scan;
>
> scan = zone_nr_pages(zone, sc, l);
> if (priority || noswap) {
> @@ -1541,11 +1542,11 @@ static void shrink_zone(int priority, st
> scan = (scan * percent[file]) / 100;
> }
> if (scanning_global_lru(sc))
> - nr[l] = nr_scan_try_batch(scan,
> - &zone->lru[l].nr_saved_scan,
> - swap_cluster_max);
> + saved_scan = &zone->lru[l].nr_saved_scan;
> else
> - nr[l] = scan;
> + saved_scan = mem_cgroup_get_saved_scan(sc->mem_cgroup,
> + zone, l);
> + nr[l] = nr_scan_try_batch(scan, saved_scan, swap_cluster_max);
> }
>
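For reference, nr_scan_try_batch() — the helper the patch routes memcg scans through — simply saves up sub-batch scan requests until they add up to swap_cluster_max, and only then releases them as one batch. A rough userspace sketch of its logic (the real version lives in mm/vmscan.c; treat the details as approximate):

```c
/* Rough sketch of nr_scan_try_batch() from mm/vmscan.c (circa
 * 2.6.30): small scan requests are accumulated in *nr_saved_scan
 * and only released as one batch once the total reaches
 * swap_cluster_max. */
unsigned long nr_scan_try_batch(unsigned long nr_to_scan,
				unsigned long *nr_saved_scan,
				unsigned long swap_cluster_max)
{
	unsigned long nr;

	*nr_saved_scan += nr_to_scan;
	nr = *nr_saved_scan;

	if (nr >= swap_cluster_max)
		*nr_saved_scan = 0;	/* big enough: scan the batch now */
	else
		nr = 0;			/* too small: defer, keep the debt */

	return nr;
}
```

So a caller that keeps asking for nr_to_scan=1 gets 0 back until the saved debt reaches swap_cluster_max, at which point the whole batch is returned at once; the patch above gives each mem_cgroup its own nr_saved_scan so memcg scans get the same treatment as global ones.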
This might be a concern (although not a big one ATM), since we can't
afford to miss limits by much: if a cgroup is near its limit and we
drop scanning it, we'll have to work out what this means for the end
user. Maybe a more fundamental look at the priority-based logic for
deciding how much to scan is required, I don't know.
--
Balbir
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-16 3:53 ` Rik van Riel
@ 2009-08-16 5:15 ` Wu Fengguang
-1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-16 5:15 UTC (permalink / raw)
To: Rik van Riel
Cc: Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu, Wilfred, Kleen,
Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
On Sun, Aug 16, 2009 at 11:53:00AM +0800, Rik van Riel wrote:
> Wu Fengguang wrote:
> > On Fri, Aug 07, 2009 at 05:09:55AM +0800, Jeff Dike wrote:
> >> Side question -
> >> Is there a good reason for this to be in shrink_active_list()
> >> as opposed to __isolate_lru_page?
> >>
> >> if (unlikely(!page_evictable(page, NULL))) {
> >> putback_lru_page(page);
> >> continue;
> >> }
> >>
> >> Maybe we want to minimize the amount of code under the lru lock or
> >> avoid duplicate logic in the isolate_page functions.
> >
> > I guess the quick test is meant to avoid the expensive page_referenced()
> > call that follows it. But that should be mostly one shot cost - the
> > unevictable pages are unlikely to cycle in active/inactive list again
> > and again.
>
> Please read what putback_lru_page does.
>
> It moves the page onto the unevictable list, so that
> it will not end up in this scan again.
Yes it does. I said 'mostly' because there is a small hole where an
unevictable page may be scanned but still not moved to the unevictable
list: when a page is mapped in two places, the first pte has the
referenced bit set, and the _second_ VMA has VM_LOCKED set,
page_referenced() returns 1 and shrink_page_list() moves it onto the
active list instead of the unevictable list. Shall we fix this rare
case?
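The two-mapping scenario can be condensed into a toy decision model (illustrative userspace code, not the kernel's; the function and parameter names, and the flattened ordering, are assumptions based on the description above):

```c
#include <stdbool.h>

/* Toy model (NOT kernel code) of the decision order described
 * above: in shrink_page_list(), the page_referenced() check fires
 * before the mlocked-VMA case is detected, so a page that is
 * referenced through one mapping and VM_LOCKED through another is
 * reactivated instead of being parked on the unevictable list. */
enum page_dest { DEST_RECLAIM, DEST_ACTIVE, DEST_UNEVICTABLE };

enum page_dest classify_page(bool first_pte_referenced,
			     bool second_vma_locked)
{
	if (first_pte_referenced)	/* page_referenced() returns 1 */
		return DEST_ACTIVE;	/* reactivated: the hole */
	if (second_vma_locked)		/* mlocked mapping noticed */
		return DEST_UNEVICTABLE;
	return DEST_RECLAIM;
}
```

In this model a page that is both referenced via one pte and mlocked via another lands on the active list, matching the rare case described above.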
> >> But if there are important mlock-heavy workloads, this could make the
> >> scan come up empty, or at least emptier than we might like.
> >
> > Yes, if the above 'if' block is removed, the inactive lists might get
> > more expensive to reclaim.
>
> Why?
Without the 'if' block, an unevictable page may well be deactivated
onto the inactive list (and some time later moved to the unevictable
list from there), increasing the inactive list's scanned:reclaimed
ratio.
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-16 5:09 ` Balbir Singh
@ 2009-08-16 5:41 ` Wu Fengguang
-1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-16 5:41 UTC (permalink / raw)
To: Balbir Singh
Cc: Rik van Riel, Johannes Weiner, Avi Kivity, KOSAKI Motohiro,
Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML,
linux-mm
On Sun, Aug 16, 2009 at 01:09:03PM +0800, Balbir Singh wrote:
> * Wu Fengguang <fengguang.wu@intel.com> [2009-08-15 13:45:24]:
>
> > On Fri, Aug 14, 2009 at 09:19:35PM +0800, Rik van Riel wrote:
> > > Wu Fengguang wrote:
> > > > On Fri, Aug 14, 2009 at 05:10:55PM +0800, Johannes Weiner wrote:
> > >
> > > >> So even with the active list being a FIFO, we keep usage information
> > > >> gathered from the inactive list. If we deactivate pages in arbitrary
> > > >> list intervals, we throw this away.
> > > >
> > > > We do have the danger of FIFO, if inactive list is small enough, so
> > > > that (unconditionally) deactivated pages quickly get reclaimed and
> > > > their life window in inactive list is too small to be useful.
> > >
> > > This is one of the reasons why we unconditionally deactivate
> > > the active anon pages, and do background scanning of the
> > > active anon list when reclaiming page cache pages.
> > >
> > > We want to always move some pages to the inactive anon
> > > list, so it does not get too small.
> >
> > Right, the current code tries to pull the inactive list out of
> > its smallish-size state as long as there is vmscan activity.
> >
> > However there is a possible (and tricky) hole: mem cgroups
> > don't do batched vmscan. shrink_zone() may call shrink_list()
> > with nr_to_scan=1, in which case shrink_list() _still_ calls
> > isolate_pages() with the much larger SWAP_CLUSTER_MAX.
> >
> > It effectively scales up the inactive list scan rate by 10 times when
> > it is still small, and may thus prevent it from ever growing.
> >
>
> I think we possibly need to export some scanning data under DEBUG_VM
> to cross-verify.
Maybe we can add more general debugging code, but here is a quick patch
for examining the cgroup case. Note that even for the global zones,
max_scan may well not be a multiple of SWAP_CLUSTER_MAX, so
shrink_inactive_list() will scan a little more in its last loop.
---
mm/vmscan.c | 7 +++++++
1 file changed, 7 insertions(+)
--- linux.orig/mm/vmscan.c 2009-08-16 13:24:25.000000000 +0800
+++ linux/mm/vmscan.c 2009-08-16 13:38:32.000000000 +0800
@@ -1043,6 +1043,13 @@ static unsigned long shrink_inactive_lis
struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
int lumpy_reclaim = 0;
+ if (!scanning_global_lru(sc))
+ printk("shrink inactive %s count=%lu scan=%lu\n",
+ file ? "file" : "anon",
+ mem_cgroup_zone_nr_pages(sc->mem_cgroup, zone,
+ LRU_INACTIVE_ANON + !!file),
+ max_scan);
+
/*
* If we need a large contiguous chunk of memory, or have
* trouble getting a small set of contiguous pages, we
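The "scan a little more in its last loop" point is plain rounding: the scan loop in shrink_inactive_list() isolates pages in SWAP_CLUSTER_MAX-sized chunks until max_scan is met, so the work done is effectively max_scan rounded up to the next chunk. A minimal model (assuming a full cluster is isolated on every pass, which a short list need not satisfy):

```c
#define SWAP_CLUSTER_MAX 32UL

/* Models the while (nr_scanned < max_scan) loop in
 * shrink_inactive_list(): each pass isolates (up to) a full
 * cluster of pages, so the final pass can overshoot max_scan. */
unsigned long pages_actually_scanned(unsigned long max_scan)
{
	unsigned long nr_scanned = 0;

	while (nr_scanned < max_scan)
		nr_scanned += SWAP_CLUSTER_MAX;

	return nr_scanned;
}
```

For example, a max_scan of 50 ends up scanning two full clusters, i.e. 64 pages, overshooting by 14.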
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-16 5:09 ` Balbir Singh
@ 2009-08-16 5:50 ` Wu Fengguang
-1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-16 5:50 UTC (permalink / raw)
To: Balbir Singh
Cc: Rik van Riel, Johannes Weiner, Avi Kivity, KOSAKI Motohiro,
Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML,
linux-mm
On Sun, Aug 16, 2009 at 01:09:03PM +0800, Balbir Singh wrote:
> * Wu Fengguang <fengguang.wu@intel.com> [2009-08-15 13:45:24]:
>
> > On Fri, Aug 14, 2009 at 09:19:35PM +0800, Rik van Riel wrote:
> > > Wu Fengguang wrote:
> > > > On Fri, Aug 14, 2009 at 05:10:55PM +0800, Johannes Weiner wrote:
> > >
> > @@ -1541,11 +1542,11 @@ static void shrink_zone(int priority, st
> > scan = (scan * percent[file]) / 100;
> > }
> > if (scanning_global_lru(sc))
> > - nr[l] = nr_scan_try_batch(scan,
> > - &zone->lru[l].nr_saved_scan,
> > - swap_cluster_max);
> > + saved_scan = &zone->lru[l].nr_saved_scan;
> > else
> > - nr[l] = scan;
> > + saved_scan = mem_cgroup_get_saved_scan(sc->mem_cgroup,
> > + zone, l);
> > + nr[l] = nr_scan_try_batch(scan, saved_scan, swap_cluster_max);
> > }
> >
>
> This might be a concern (although not a big one ATM), since we can't
> afford to miss limits by much: if a cgroup is near its limit and we
> drop scanning it, we'll have to work out what this means for the end
> user. Maybe a more fundamental look at the priority-based logic for
> deciding how much to scan is required, I don't know.
I also had this worry at first, then dismissed it because page
reclaim should be driven by "pages reclaimed" rather than "pages
scanned". So when shrink_zone() decides to cancel one smallish scan,
it may well be called again and accumulate nr_saved_scan.
So it should only be a problem for a very small mem_cgroup (which may
be _full_ scanned too many times while nr_saved_scan accumulates).
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-16 4:55 ` Wu Fengguang
@ 2009-08-16 5:59 ` Balbir Singh
-1 siblings, 0 replies; 243+ messages in thread
From: Balbir Singh @ 2009-08-16 5:59 UTC (permalink / raw)
To: Wu Fengguang
Cc: Rik van Riel, Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G, Yu,
Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton,
Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
* Wu Fengguang <fengguang.wu@intel.com> [2009-08-16 12:55:22]:
> On Sun, Aug 16, 2009 at 11:56:39AM +0800, Rik van Riel wrote:
> > Wu Fengguang wrote:
> >
> > > Right, but I meant bursty page allocations and accesses to them, which
> > > can make a large continuous segment of referenced pages in LRU list,
> > > say 50MB. They may or may not be valuable as a whole, however a local
> > > algorithm may keep the first 4MB and drop the remaining 46MB.
> >
> > I wonder if the problem is that we simply do not keep a large
> > enough inactive list in Jeff's test. If we do not, pages do
> > not have a chance to be referenced again before the reclaim
> > code comes in.
>
> Exactly, that's the case where the LRU list degrades into a FIFO.
>
> > The cgroup stats should show how many active anon and inactive
> > anon pages there are in the cgroup.
>
> Jeff, can you have a look at these stats? Thanks!
Another experiment would be to toy with memory.swappiness (although
defaults should work well). Could you compare the in-guest values of
nr_*active* with the cgroup values as seen by the host?
--
Balbir
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-16 5:15 ` Wu Fengguang
@ 2009-08-16 11:29 ` Wu Fengguang
-1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-16 11:29 UTC (permalink / raw)
To: Rik van Riel
Cc: Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu, Wilfred, Kleen,
Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
On Sun, Aug 16, 2009 at 01:15:02PM +0800, Wu Fengguang wrote:
> On Sun, Aug 16, 2009 at 11:53:00AM +0800, Rik van Riel wrote:
> > Wu Fengguang wrote:
> > > On Fri, Aug 07, 2009 at 05:09:55AM +0800, Jeff Dike wrote:
> > >> Side question -
> > >> Is there a good reason for this to be in shrink_active_list()
> > >> as opposed to __isolate_lru_page?
> > >>
> > >> if (unlikely(!page_evictable(page, NULL))) {
> > >> putback_lru_page(page);
> > >> continue;
> > >> }
> > >>
> > >> Maybe we want to minimize the amount of code under the lru lock or
> > >> avoid duplicate logic in the isolate_page functions.
> > >
> > > I guess the quick test means to avoid the expensive page_referenced()
> > > call that follows it. But that should be mostly one shot cost - the
> > > unevictable pages are unlikely to cycle in active/inactive list again
> > > and again.
> >
> > Please read what putback_lru_page does.
> >
> > It moves the page onto the unevictable list, so that
> > it will not end up in this scan again.
>
> Yes it does. I said 'mostly' because there is a small hole that an
> unevictable page may be scanned but still not moved to unevictable
> list: when a page is mapped in two places, the first pte has the
> referenced bit set, the _second_ VMA has VM_LOCKED bit set, then
> page_referenced() will return 1 and shrink_page_list() will move it
> into active list instead of unevictable list. Shall we fix this rare
> case?
How about this fix?
---
mm: stop circulating of referenced mlocked pages
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
--- linux.orig/mm/rmap.c 2009-08-16 19:11:13.000000000 +0800
+++ linux/mm/rmap.c 2009-08-16 19:22:46.000000000 +0800
@@ -358,6 +358,7 @@ static int page_referenced_one(struct pa
*/
if (vma->vm_flags & VM_LOCKED) {
*mapcount = 1; /* break early from loop */
+ *vm_flags |= VM_LOCKED;
goto out_unmap;
}
@@ -482,6 +483,8 @@ static int page_referenced_file(struct p
}
spin_unlock(&mapping->i_mmap_lock);
+ if (*vm_flags & VM_LOCKED)
+ referenced = 0;
return referenced;
}
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-16 11:29 ` Wu Fengguang
@ 2009-08-17 14:33 ` Minchan Kim
-1 siblings, 0 replies; 243+ messages in thread
From: Minchan Kim @ 2009-08-17 14:33 UTC (permalink / raw)
To: Wu Fengguang
Cc: Rik van Riel, Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu,
Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton,
Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
Hi, Wu.
On Sun, Aug 16, 2009 at 8:29 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> On Sun, Aug 16, 2009 at 01:15:02PM +0800, Wu Fengguang wrote:
>> On Sun, Aug 16, 2009 at 11:53:00AM +0800, Rik van Riel wrote:
>> > Wu Fengguang wrote:
>> > > On Fri, Aug 07, 2009 at 05:09:55AM +0800, Jeff Dike wrote:
>> > >> Side question -
>> > >> Is there a good reason for this to be in shrink_active_list()
>> > >> as opposed to __isolate_lru_page?
>> > >>
>> > >> if (unlikely(!page_evictable(page, NULL))) {
>> > >> putback_lru_page(page);
>> > >> continue;
>> > >> }
>> > >>
>> > >> Maybe we want to minimize the amount of code under the lru lock or
>> > >> avoid duplicate logic in the isolate_page functions.
>> > >
>> > > I guess the quick test means to avoid the expensive page_referenced()
>> > > call that follows it. But that should be mostly one shot cost - the
>> > > unevictable pages are unlikely to cycle in active/inactive list again
>> > > and again.
>> >
>> > Please read what putback_lru_page does.
>> >
>> > It moves the page onto the unevictable list, so that
>> > it will not end up in this scan again.
>>
>> Yes it does. I said 'mostly' because there is a small hole that an
>> unevictable page may be scanned but still not moved to unevictable
>> list: when a page is mapped in two places, the first pte has the
>> referenced bit set, the _second_ VMA has VM_LOCKED bit set, then
>> page_referenced() will return 1 and shrink_page_list() will move it
>> into active list instead of unevictable list. Shall we fix this rare
>> case?
I think it's not a big deal.
As you mentioned, it's a rare case, so there would be only a few pages
on the active list instead of the unevictable list.
The next time a scan comes around, we can try to move those pages onto
the unevictable list again.
As far as I know, mlocked pages already have some race conditions;
they will be rescued in the same way.
>
> How about this fix?
>
> ---
> mm: stop circulating of referenced mlocked pages
>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>
> --- linux.orig/mm/rmap.c 2009-08-16 19:11:13.000000000 +0800
> +++ linux/mm/rmap.c 2009-08-16 19:22:46.000000000 +0800
> @@ -358,6 +358,7 @@ static int page_referenced_one(struct pa
> */
> if (vma->vm_flags & VM_LOCKED) {
> *mapcount = 1; /* break early from loop */
> + *vm_flags |= VM_LOCKED;
> goto out_unmap;
> }
>
> @@ -482,6 +483,8 @@ static int page_referenced_file(struct p
> }
>
> spin_unlock(&mapping->i_mmap_lock);
> + if (*vm_flags & VM_LOCKED)
> + referenced = 0;
> return referenced;
> }
>
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>
--
Kind regards,
Minchan Kim
^ permalink raw reply [flat|nested] 243+ messages in thread
* RE: [RFC] respect the referenced bit of KVM guest pages?
2009-08-15 5:45 ` Wu Fengguang
@ 2009-08-17 18:04 ` Dike, Jeffrey G
-1 siblings, 0 replies; 243+ messages in thread
From: Dike, Jeffrey G @ 2009-08-17 18:04 UTC (permalink / raw)
To: Wu, Fengguang, Rik van Riel
Cc: Johannes Weiner, Avi Kivity, KOSAKI Motohiro, Andrea Arcangeli,
Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton,
Christoph Lameter, Mel Gorman, LKML, linux-mm
> Jeff, can you confirm if the mem cgroup's inactive list is small?
Nope. I have plenty on the inactive anon list, between 13K and 16K pages (i.e. 52M to 64M).
The inactive mapped list is much smaller - 0 to ~700 pages.
The active lists are comparable in size, but larger - 16K - 19K pages for anon and 60 - 450 pages for mapped.
Jeff
^ permalink raw reply [flat|nested] 243+ messages in thread
* RE: [RFC] respect the referenced bit of KVM guest pages?
2009-08-16 4:55 ` Wu Fengguang
@ 2009-08-17 19:47 ` Dike, Jeffrey G
-1 siblings, 0 replies; 243+ messages in thread
From: Dike, Jeffrey G @ 2009-08-17 19:47 UTC (permalink / raw)
To: Wu, Fengguang, Rik van Riel
Cc: Avi Kivity, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi,
Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro,
Mel Gorman, LKML, linux-mm
> Jeff, can you have a look at these stats? Thanks!
Yeah, I just did, after adding some tracing which dumped out the same data. It looks pretty much the same. Inactive anon and active anon are pretty similar. Inactive file and active file are smaller and fluctuate more, but don't look horribly unbalanced.
Below are the stats from memory.stat - inactive_anon, active_anon, inactive_file, active_file, plus some commentary on what's happening.
Jeff
114688 0 1978368 630784
114688 0 1978368 630784
114688 0 1978368 630784
114688 0 1978368 630784
114688 0 1978368 630784
114688 0 1978368 630784
114688 0 1978368 630784
# Fire up instance
20480 4403200 2699264 647168
20480 4411392 2740224 647168
20480 4411392 2740224 647168
20480 11436032 3289088 651264
20480 11587584 3313664 651264
20480 12558336 3313664 651264
20480 12558336 3313664 651264
20480 12558336 3313664 651264
20480 12558336 3313664 651264
20480 12558336 3313664 651264
20480 12558336 3313664 651264
20480 12558336 3313664 651264
20480 12558336 3313664 651264
20480 25387008 4263936 872448
20480 25387008 4263936 872448
20480 25387008 4198400 937984
20480 25387008 4198400 937984
20480 25387008 4198400 937984
20480 25387008 4198400 937984
20480 25387008 4198400 937984
20480 25387008 4198400 937984
20480 25387008 4198400 937984
20480 25387008 4198400 937984
20480 25411584 4198400 937984
20480 25411584 4198400 937984
20480 40665088 7573504 946176
20480 43606016 7573504 946176
20480 45346816 7581696 946176
20480 45752320 7581696 946176
20480 46575616 7581696 946176
20480 46682112 7581696 946176
20480 48975872 9920512 1073152
20480 64536576 38457344 1826816
# Booted, X is starting
# Run a browser and editor, then shut them down and halt the instance
10964992 72454144 47714304 3067904
16797696 71151616 42893312 3244032
16797696 73035776 41037824 3272704
16797696 73547776 40525824 3272704
16797696 73609216 40402944 3276800
16797696 73719808 40337408 3289088
16797696 73920512 40079360 3289088
16797696 78016512 36036608 3297280
16797696 78016512 36036608 3297280
16797696 80203776 33755136 3387392
16797696 86904832 26972160 3526656
16797696 93523968 19927040 3837952
29011968 90546176 10276864 4308992
45670400 83685376 991232 3854336
66400256 66416640 368640 933888
66715648 66654208 376832 471040
64811008 64802816 3416064 1114112
65236992 65085440 2535424 1228800
65212416 65011712 2519040 1343488
64626688 64610304 3534848 1429504
63807488 63758336 4829184 1695744
63975424 63946752 4419584 1744896
63975424 63946752 4419584 1744896
63975424 63946752 4419584 1744896
63975424 63946752 4423680 1744896
63975424 63946752 4423680 1744896
63975424 63946752 4423680 1744896
64045056 63946752 4440064 1757184
64069632 63946752 4440064 1757184
64077824 63946752 4403200 1757184
64077824 63946752 4403200 1757184
64077824 63946752 4403200 1757184
64147456 64016384 4222976 1757184
64638976 64733184 2801664 1900544
65208320 65605632 1310720 1892352
64946176 65863680 1335296 1998848
62701568 68599808 843776 1945600
66068480 66023424 778240 978944
65568768 65511424 2093056 1044480
66183168 66056192 966656 1011712
66478080 66555904 241664 864256
66912256 66899968 135168 270336
66646016 66539520 577536 262144
66134016 66179072 1421312 319488
66125824 66183168 1273856 380928
66330624 66445312 933888 475136
65970176 65966080 1581056 548864
66158592 66158592 1175552 708608
65781760 66084864 1503232 806912
66084864 66048000 1118208 843776
66420736 66449408 376832 851968
66351104 66138112 757760 864256
66285568 66138112 921600 868352
65945600 65847296 1495040 888832
66002944 65839104 1363968 888832
66002944 65839104 1363968 888832
66039808 65839104 1363968 888832
66043904 65839104 1363968 888832
66043904 65839104 1372160 888832
65523712 65490944 2224128 929792
66031616 66297856 827392 946176
64913408 68141056 188416 933888
64770048 68325376 73728 917504
65216512 67932160 81920 909312
65470464 67678208 81920 909312
65036288 67973120 356352 790528
63492096 69877760 110592 647168
63111168 70508544 73728 413696
66895872 66883584 16384 348160
66650112 67203072 20480 344064
66830336 67002368 28672 335872
66785280 67002368 32768 331776
67084288 66736128 45056 331776
67104768 66736128 45056 331776
66916352 66801664 45056 331776
66883584 66863104 45056 331776
66883584 66863104 45056 331776
66891776 66863104 45056 331776
66899968 66863104 45056 331776
66904064 66863104 45056 331776
66904064 66863104 45056 331776
66904064 66863104 45056 331776
66715648 66641920 385024 339968
66617344 66629632 237568 364544
66633728 66629632 237568 364544
66641920 66629632 237568 364544
66641920 66629632 237568 364544
66662400 66461696 589824 389120
66588672 66527232 659456 389120
66252800 66252800 1105920 413696
66277376 66297856 983040 413696
66498560 66285568 884736 413696
66129920 66183168 1163264 442368
66138112 66183168 1163264 442368
66256896 66465792 921600 442368
66560000 66465792 606208 442368
66662400 66445312 589824 446464
66560000 66490368 708608 446464
66629632 66441216 577536 446464
66711552 66441216 520192 462848
66617344 66531328 577536 466944
66412544 66605056 606208 466944
66605056 66637824 475136 471040
66297856 67018752 344064 483328
65863680 67588096 212992 483328
66334720 67129344 159744 516096
65204224 68337664 159744 516096
61399040 72212480 98304 499712
62427136 71114752 98304 499712
62775296 70680576 155648 503808
62595072 70209536 823296 565248
66490368 66576384 458752 577536
66818048 66625536 196608 577536
66699264 66793472 135168 507904
66707456 66736128 151552 512000
66314240 67293184 53248 483328
56832000 76705792 180224 495616
59273216 73998336 368640 499712
59351040 73928704 368640 499712
59400192 73928704 368640 499712
59678720 73478144 241664 495616
59875328 73531392 241664 495616
60690432 72830976 110592 503808
61919232 71634944 110592 503808
65523712 68096000 49152 483328
66748416 66801664 102400 487424
65994752 67555328 114688 487424
65994752 67555328 114688 487424
65994752 67555328 114688 487424
66285568 66195456 1093632 520192
66232320 65978368 1335296 557056
66232320 65978368 1335296 557056
66232320 65978368 1335296 557056
66232320 65978368 1335296 557056
66060288 65994752 1548288 557056
# Instance halted
118784 0 1953792 569344
118784 0 1953792 569344
118784 0 1953792 569344
118784 0 1953792 569344
118784 0 1953792 569344
118784 0 1953792 569344
118784 0 1953792 569344
118784 0 1953792 569344
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-17 18:04 ` Dike, Jeffrey G
@ 2009-08-18 2:26 ` Wu Fengguang
-1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-18 2:26 UTC (permalink / raw)
To: Dike, Jeffrey G
Cc: Rik van Riel, Johannes Weiner, Avi Kivity, KOSAKI Motohiro,
Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm
On Tue, Aug 18, 2009 at 02:04:46AM +0800, Dike, Jeffrey G wrote:
> > Jeff, can you confirm if the mem cgroup's inactive list is small?
>
> Nope. I have plenty on the inactive anon list, between 13K and 16K pages (i.e. 52M to 64M).
>
> The inactive mapped list is much smaller - 0 to ~700 pages.
>
> The active lists are comparable in size, but larger - 16K - 19K pages for anon and 60 - 450 pages for mapped.
The anon inactive list is "over scanned". Take 16k pages for example:
with DEF_PRIORITY=12, (16k >> 12) = 4. So when shrink_zone() expects
to scan 4 pages in the active/inactive list, it will in effect scan
SWAP_CLUSTER_MAX=32 pages.
This triggers the background aging of active anon list because
inactive_anon_is_low() is found to be true, which keeps the
active:inactive ratio in balance.
So anon inactive list over scanned => anon active list over scanned =>
anon lists over scanned relative to file lists. (The inactive file list
may or may not be over scanned, depending on whether its size is above
or below (1 << prio) pages.)
Anyway, this is not how vmscan is expected to work, and batching up
the cgroup vmscan could get rid of the mess.
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-17 14:33 ` Minchan Kim
@ 2009-08-18 2:34 ` Wu Fengguang
-1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-18 2:34 UTC (permalink / raw)
To: Minchan Kim
Cc: Rik van Riel, Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu,
Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton,
Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
Minchan,
On Mon, Aug 17, 2009 at 10:33:54PM +0800, Minchan Kim wrote:
> On Sun, Aug 16, 2009 at 8:29 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> > On Sun, Aug 16, 2009 at 01:15:02PM +0800, Wu Fengguang wrote:
> >> On Sun, Aug 16, 2009 at 11:53:00AM +0800, Rik van Riel wrote:
> >> > Wu Fengguang wrote:
> >> > > On Fri, Aug 07, 2009 at 05:09:55AM +0800, Jeff Dike wrote:
> >> > >> Side question -
> >> > >> Is there a good reason for this to be in shrink_active_list()
> >> > >> as opposed to __isolate_lru_page?
> >> > >>
> >> > >> if (unlikely(!page_evictable(page, NULL))) {
> >> > >> putback_lru_page(page);
> >> > >> continue;
> >> > >> }
> >> > >>
> >> > >> Maybe we want to minimize the amount of code under the lru lock or
> >> > >> avoid duplicate logic in the isolate_page functions.
> >> > >
> >> > > I guess the quick test means to avoid the expensive page_referenced()
> >> > > call that follows it. But that should be mostly one shot cost - the
> >> > > unevictable pages are unlikely to cycle in active/inactive list again
> >> > > and again.
> >> >
> >> > Please read what putback_lru_page does.
> >> >
> >> > It moves the page onto the unevictable list, so that
> >> > it will not end up in this scan again.
> >>
> >> Yes it does. I said 'mostly' because there is a small hole that an
> >> unevictable page may be scanned but still not moved to unevictable
> >> list: when a page is mapped in two places, the first pte has the
> >> referenced bit set, the _second_ VMA has VM_LOCKED bit set, then
> >> page_referenced() will return 1 and shrink_page_list() will move it
> >> into active list instead of unevictable list. Shall we fix this rare
> >> case?
>
> I think it's not a big deal.
Maybe, otherwise I should have brought this issue up long ago :)
> As you mentioned, it's rare case so there would be few pages in active
> list instead of unevictable list.
Yes.
> When next time to scan comes, we can try to move the pages into
> unevictable list, again.
Will PG_mlocked be set by then? Otherwise the situation is not likely
to change, and the VM_LOCKED pages may circulate in the active/inactive
lists indefinitely.
> As I know about mlock pages, we already had some races condition.
> They will be rescued like above.
Thanks,
Fengguang
> >
> > How about this fix?
> >
> > ---
> > mm: stop circulating of referenced mlocked pages
> >
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >
> > --- linux.orig/mm/rmap.c 2009-08-16 19:11:13.000000000 +0800
> > +++ linux/mm/rmap.c 2009-08-16 19:22:46.000000000 +0800
> > @@ -358,6 +358,7 @@ static int page_referenced_one(struct pa
> > */
> > if (vma->vm_flags & VM_LOCKED) {
> > *mapcount = 1; /* break early from loop */
> > + *vm_flags |= VM_LOCKED;
> > goto out_unmap;
> > }
> >
> > @@ -482,6 +483,8 @@ static int page_referenced_file(struct p
> > }
> >
> > spin_unlock(&mapping->i_mmap_lock);
> > + if (*vm_flags & VM_LOCKED)
> > + referenced = 0;
> > return referenced;
> > }
> >
> >
> >
>
>
>
> --
> Kind regards,
> Minchan Kim
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-18 2:34 ` Wu Fengguang
@ 2009-08-18 4:17 ` Minchan Kim
-1 siblings, 0 replies; 243+ messages in thread
From: Minchan Kim @ 2009-08-18 4:17 UTC (permalink / raw)
To: Wu Fengguang
Cc: Minchan Kim, Rik van Riel, Jeff Dike, Avi Kivity,
Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman,
LKML, linux-mm
On Tue, 18 Aug 2009 10:34:38 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:
> Minchan,
>
> On Mon, Aug 17, 2009 at 10:33:54PM +0800, Minchan Kim wrote:
> > On Sun, Aug 16, 2009 at 8:29 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> > > On Sun, Aug 16, 2009 at 01:15:02PM +0800, Wu Fengguang wrote:
> > >> On Sun, Aug 16, 2009 at 11:53:00AM +0800, Rik van Riel wrote:
> > >> > Wu Fengguang wrote:
> > >> > > On Fri, Aug 07, 2009 at 05:09:55AM +0800, Jeff Dike wrote:
> > >> > >> Side question -
> > >> > >> Is there a good reason for this to be in shrink_active_list()
> > >> > >> as opposed to __isolate_lru_page?
> > >> > >>
> > >> > >> if (unlikely(!page_evictable(page, NULL))) {
> > >> > >> putback_lru_page(page);
> > >> > >> continue;
> > >> > >> }
> > >> > >>
> > >> > >> Maybe we want to minimize the amount of code under the lru lock or
> > >> > >> avoid duplicate logic in the isolate_page functions.
> > >> > >
> > >> > > I guess the quick test means to avoid the expensive page_referenced()
> > >> > > call that follows it. But that should be mostly one shot cost - the
> > >> > > unevictable pages are unlikely to cycle in active/inactive list again
> > >> > > and again.
> > >> >
> > >> > Please read what putback_lru_page does.
> > >> >
> > >> > It moves the page onto the unevictable list, so that
> > >> > it will not end up in this scan again.
> > >>
> > >> Yes it does. I said 'mostly' because there is a small hole that an
> > >> unevictable page may be scanned but still not moved to unevictable
> > >> list: when a page is mapped in two places, the first pte has the
> > >> referenced bit set, the _second_ VMA has VM_LOCKED bit set, then
> > >> page_referenced() will return 1 and shrink_page_list() will move it
> > >> into active list instead of unevictable list. Shall we fix this rare
> > >> case?
> >
> > I think it's not a big deal.
>
> Maybe, otherwise I should bring up this issue long time before :)
>
> > As you mentioned, it's rare case so there would be few pages in active
> > list instead of unevictable list.
>
> Yes.
>
> > When next time to scan comes, we can try to move the pages into
> > unevictable list, again.
>
> Will PG_mlocked be set by then? Otherwise the situation is not likely
> to change and the VM_LOCKED pages may circulate in active/inactive
> list for countless times.
PG_mlocked is not important in that case.
The important thing is the VM_LOCKED vma.
I think the annotation below can help you understand my point. :)
----
/*
* called from munlock()/munmap() path with page supposedly on the LRU.
*
* Note: unlike mlock_vma_page(), we can't just clear the PageMlocked
* [in try_to_munlock()] and then attempt to isolate the page. We must
* isolate the page to keep others from messing with its unevictable
* and mlocked state while trying to munlock. However, we pre-clear the
* mlocked state anyway as we might lose the isolation race and we might
* not get another chance to clear PageMlocked. If we successfully
* isolate the page and try_to_munlock() detects other VM_LOCKED vmas
* mapping the page, it will restore the PageMlocked state, unless the page
* is mapped in a non-linear vma. So, we go ahead and SetPageMlocked(),
* perhaps redundantly.
* If we lose the isolation race, and the page is mapped by other VM_LOCKED
* vmas, we'll detect this in vmscan--via try_to_munlock() or try_to_unmap()
* either of which will restore the PageMlocked state by calling
* mlock_vma_page() above, if it can grab the vma's mmap sem.
*/
static void munlock_vma_page(struct page *page)
{
...
--
Kind regards,
Minchan Kim
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-18 4:17 ` Minchan Kim
@ 2009-08-18 9:31 ` Wu Fengguang
-1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-18 9:31 UTC (permalink / raw)
To: Minchan Kim
Cc: Rik van Riel, Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu,
Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton,
Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
On Tue, Aug 18, 2009 at 12:17:34PM +0800, Minchan Kim wrote:
> On Tue, 18 Aug 2009 10:34:38 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
>
> > Minchan,
> >
> > On Mon, Aug 17, 2009 at 10:33:54PM +0800, Minchan Kim wrote:
> > > On Sun, Aug 16, 2009 at 8:29 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> > > > On Sun, Aug 16, 2009 at 01:15:02PM +0800, Wu Fengguang wrote:
> > > >> On Sun, Aug 16, 2009 at 11:53:00AM +0800, Rik van Riel wrote:
> > > >> > Wu Fengguang wrote:
> > > >> > > On Fri, Aug 07, 2009 at 05:09:55AM +0800, Jeff Dike wrote:
> > > >> > >> Side question -
> > > >> > >> Is there a good reason for this to be in shrink_active_list()
> > > >> > >> as opposed to __isolate_lru_page?
> > > >> > >>
> > > >> > >> if (unlikely(!page_evictable(page, NULL))) {
> > > >> > >> putback_lru_page(page);
> > > >> > >> continue;
> > > >> > >> }
> > > >> > >>
> > > >> > >> Maybe we want to minimize the amount of code under the lru lock or
> > > >> > >> avoid duplicate logic in the isolate_page functions.
> > > >> > >
> > > >> > > I guess the quick test means to avoid the expensive page_referenced()
> > > >> > > call that follows it. But that should be mostly one shot cost - the
> > > >> > > unevictable pages are unlikely to cycle in active/inactive list again
> > > >> > > and again.
> > > >> >
> > > >> > Please read what putback_lru_page does.
> > > >> >
> > > >> > It moves the page onto the unevictable list, so that
> > > >> > it will not end up in this scan again.
> > > >>
> > > >> Yes it does. I said 'mostly' because there is a small hole that an
> > > >> unevictable page may be scanned but still not moved to unevictable
> > > >> list: when a page is mapped in two places, the first pte has the
> > > >> referenced bit set, the _second_ VMA has VM_LOCKED bit set, then
> > > >> page_referenced() will return 1 and shrink_page_list() will move it
> > > >> into active list instead of unevictable list. Shall we fix this rare
> > > >> case?
> > >
> > > I think it's not a big deal.
> >
> > Maybe, otherwise I should bring up this issue long time before :)
> >
> > > As you mentioned, it's rare case so there would be few pages in active
> > > list instead of unevictable list.
> >
> > Yes.
> >
> > > When next time to scan comes, we can try to move the pages into
> > > unevictable list, again.
> >
> > Will PG_mlocked be set by then? Otherwise the situation is not likely
> > to change and the VM_LOCKED pages may circulate in active/inactive
> > list for countless times.
>
> PG_mlocked is not important in that case.
> Important thing is VM_LOCKED vma.
> I think below annotaion can help you to understand my point. :)
Hmm, it looks like pages under a VM_LOCKED vma are guaranteed to have
PG_mlocked set, and so will be caught by page_evictable(). Is that right?
Then I was worrying about a non-existent problem. Sorry for the confusion!
Thanks,
Fengguang
> ----
>
> /*
> * called from munlock()/munmap() path with page supposedly on the LRU.
> *
> * Note: unlike mlock_vma_page(), we can't just clear the PageMlocked
> * [in try_to_munlock()] and then attempt to isolate the page. We must
> * isolate the page to keep others from messing with its unevictable
> * and mlocked state while trying to munlock. However, we pre-clear the
> * mlocked state anyway as we might lose the isolation race and we might
> * not get another chance to clear PageMlocked. If we successfully
> * isolate the page and try_to_munlock() detects other VM_LOCKED vmas
> * mapping the page, it will restore the PageMlocked state, unless the page
> * is mapped in a non-linear vma. So, we go ahead and SetPageMlocked(),
> * perhaps redundantly.
> * If we lose the isolation race, and the page is mapped by other VM_LOCKED
> * vmas, we'll detect this in vmscan--via try_to_munlock() or try_to_unmap()
> * either of which will restore the PageMlocked state by calling
> * mlock_vma_page() above, if it can grab the vma's mmap sem.
> */
> static void munlock_vma_page(struct page *page)
> {
> ...
>
> --
> Kind regards,
> Minchan Kim
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-18 9:31 ` Wu Fengguang
@ 2009-08-18 9:52 ` Minchan Kim
-1 siblings, 0 replies; 243+ messages in thread
From: Minchan Kim @ 2009-08-18 9:52 UTC (permalink / raw)
To: Wu Fengguang
Cc: Minchan Kim, Rik van Riel, Jeff Dike, Avi Kivity,
Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman,
LKML, linux-mm
On Tue, 18 Aug 2009 17:31:19 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:
> On Tue, Aug 18, 2009 at 12:17:34PM +0800, Minchan Kim wrote:
> > On Tue, 18 Aug 2009 10:34:38 +0800
> > Wu Fengguang <fengguang.wu@intel.com> wrote:
> >
> > > Minchan,
> > >
> > > On Mon, Aug 17, 2009 at 10:33:54PM +0800, Minchan Kim wrote:
> > > > On Sun, Aug 16, 2009 at 8:29 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> > > > > On Sun, Aug 16, 2009 at 01:15:02PM +0800, Wu Fengguang wrote:
> > > > >> On Sun, Aug 16, 2009 at 11:53:00AM +0800, Rik van Riel wrote:
> > > > >> > Wu Fengguang wrote:
> > > > >> > > On Fri, Aug 07, 2009 at 05:09:55AM +0800, Jeff Dike wrote:
> > > > >> > >> Side question -
> > > > >> > >> Is there a good reason for this to be in shrink_active_list()
> > > > >> > >> as opposed to __isolate_lru_page?
> > > > >> > >>
> > > > >> > >> if (unlikely(!page_evictable(page, NULL))) {
> > > > >> > >> putback_lru_page(page);
> > > > >> > >> continue;
> > > > >> > >> }
> > > > >> > >>
> > > > >> > >> Maybe we want to minimize the amount of code under the lru lock or
> > > > >> > >> avoid duplicate logic in the isolate_page functions.
> > > > >> > >
> > > > >> > > I guess the quick test means to avoid the expensive page_referenced()
> > > > >> > > call that follows it. But that should be mostly one shot cost - the
> > > > >> > > unevictable pages are unlikely to cycle in active/inactive list again
> > > > >> > > and again.
> > > > >> >
> > > > >> > Please read what putback_lru_page does.
> > > > >> >
> > > > >> > It moves the page onto the unevictable list, so that
> > > > >> > it will not end up in this scan again.
> > > > >>
> > > > >> Yes it does. I said 'mostly' because there is a small hole that an
> > > > >> unevictable page may be scanned but still not moved to unevictable
> > > > >> list: when a page is mapped in two places, the first pte has the
> > > > >> referenced bit set, the _second_ VMA has VM_LOCKED bit set, then
> > > > >> page_referenced() will return 1 and shrink_page_list() will move it
> > > > >> into active list instead of unevictable list. Shall we fix this rare
> > > > >> case?
> > > >
> > > > I think it's not a big deal.
> > >
> > > Maybe, otherwise I should bring up this issue long time before :)
> > >
> > > > As you mentioned, it's rare case so there would be few pages in active
> > > > list instead of unevictable list.
> > >
> > > Yes.
> > >
> > > > When next time to scan comes, we can try to move the pages into
> > > > unevictable list, again.
> > >
> > > Will PG_mlocked be set by then? Otherwise the situation is not likely
> > > to change and the VM_LOCKED pages may circulate in active/inactive
> > > list for countless times.
> >
> > PG_mlocked is not important in that case.
> > Important thing is VM_LOCKED vma.
> > I think below annotaion can help you to understand my point. :)
>
> Hmm, it looks like pages under VM_LOCKED vma is guaranteed to have
> PG_mlocked set, and so will be caught by page_evictable(). Is it?
No. I am sorry for not making my point clear.
I meant the following.
At the next scan:
shrink_page_list
-> try_to_unmap
-> try_to_unmap_xxx
-> if (vma->vm_flags & VM_LOCKED)
-> try_to_mlock_page
-> TestSetPageMlocked
-> putback_lru_page
So in the end, the page will be placed on the unevictable list.
> Then I was worrying about a null problem. Sorry for the confusion!
>
> Thanks,
> Fengguang
>
> > ----
> >
> > /*
> > * called from munlock()/munmap() path with page supposedly on the LRU.
> > *
> > * Note: unlike mlock_vma_page(), we can't just clear the PageMlocked
> > * [in try_to_munlock()] and then attempt to isolate the page. We must
> > * isolate the page to keep others from messing with its unevictable
> > * and mlocked state while trying to munlock. However, we pre-clear the
> > * mlocked state anyway as we might lose the isolation race and we might
> > * not get another chance to clear PageMlocked. If we successfully
> > * isolate the page and try_to_munlock() detects other VM_LOCKED vmas
> > * mapping the page, it will restore the PageMlocked state, unless the page
> > * is mapped in a non-linear vma. So, we go ahead and SetPageMlocked(),
> > * perhaps redundantly.
> > * If we lose the isolation race, and the page is mapped by other VM_LOCKED
> > * vmas, we'll detect this in vmscan--via try_to_munlock() or try_to_unmap()
> > * either of which will restore the PageMlocked state by calling
> > * mlock_vma_page() above, if it can grab the vma's mmap sem.
> > */
> > static void munlock_vma_page(struct page *page)
> > {
> > ...
> >
> > --
> > Kind regards,
> > Minchan Kim
--
Kind regards,
Minchan Kim
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
@ 2009-08-18 9:52 ` Minchan Kim
0 siblings, 0 replies; 243+ messages in thread
From: Minchan Kim @ 2009-08-18 9:52 UTC (permalink / raw)
To: Wu Fengguang
Cc: Minchan Kim, Rik van Riel, Jeff Dike, Avi Kivity,
Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman,
LKML, linux-mm
On Tue, 18 Aug 2009 17:31:19 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:
> On Tue, Aug 18, 2009 at 12:17:34PM +0800, Minchan Kim wrote:
> > On Tue, 18 Aug 2009 10:34:38 +0800
> > Wu Fengguang <fengguang.wu@intel.com> wrote:
> >
> > > Minchan,
> > >
> > > On Mon, Aug 17, 2009 at 10:33:54PM +0800, Minchan Kim wrote:
> > > > On Sun, Aug 16, 2009 at 8:29 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> > > > > On Sun, Aug 16, 2009 at 01:15:02PM +0800, Wu Fengguang wrote:
> > > > >> On Sun, Aug 16, 2009 at 11:53:00AM +0800, Rik van Riel wrote:
> > > > >> > Wu Fengguang wrote:
> > > > >> > > On Fri, Aug 07, 2009 at 05:09:55AM +0800, Jeff Dike wrote:
> > > > >> > >> Side question -
> > > > >> > >> A Is there a good reason for this to be in shrink_active_list()
> > > > >> > >> as opposed to __isolate_lru_page?
> > > > >> > >>
> > > > >> > >> A A A A A if (unlikely(!page_evictable(page, NULL))) {
> > > > >> > >> A A A A A A A A A putback_lru_page(page);
> > > > >> > >> A A A A A A A A A continue;
> > > > >> > >> A A A A A }
> > > > >> > >>
> > > > >> > >> Maybe we want to minimize the amount of code under the lru lock or
> > > > >> > >> avoid duplicate logic in the isolate_page functions.
> > > > >> > >
> > > > >> > > I guess the quick test means to avoid the expensive page_referenced()
> > > > >> > > call that follows it. But that should be mostly one shot cost - the
> > > > >> > > unevictable pages are unlikely to cycle in active/inactive list again
> > > > >> > > and again.
> > > > >> >
> > > > >> > Please read what putback_lru_page does.
> > > > >> >
> > > > >> > It moves the page onto the unevictable list, so that
> > > > >> > it will not end up in this scan again.
> > > > >>
> > > > >> Yes it does. I said 'mostly' because there is a small hole that an
> > > > >> unevictable page may be scanned but still not moved to unevictable
> > > > >> list: when a page is mapped in two places, the first pte has the
> > > > >> referenced bit set, the _second_ VMA has VM_LOCKED bit set, then
> > > > >> page_referenced() will return 1 and shrink_page_list() will move it
> > > > >> into active list instead of unevictable list. Shall we fix this rare
> > > > >> case?
> > > >
> > > > I think it's not a big deal.
> > >
> > > Maybe, otherwise I should bring up this issue long time before :)
> > >
> > > > As you mentioned, it's rare case so there would be few pages in active
> > > > list instead of unevictable list.
> > >
> > > Yes.
> > >
> > > > When next time to scan comes, we can try to move the pages into
> > > > unevictable list, again.
> > >
> > > Will PG_mlocked be set by then? Otherwise the situation is not likely
> > > to change and the VM_LOCKED pages may circulate in active/inactive
> > > list for countless times.
> >
> > PG_mlocked is not important in that case.
> > Important thing is VM_LOCKED vma.
> > I think the annotation below can help you to understand my point. :)
>
> Hmm, it looks like pages under VM_LOCKED vma is guaranteed to have
> PG_mlocked set, and so will be caught by page_evictable(). Is it?
No. I am sorry for not making my point clear.
I meant the following.
The next time the page is scanned:
shrink_page_list
-> try_to_unmap
-> try_to_unmap_xxx
-> if (vma->vm_flags & VM_LOCKED)
-> try_to_mlock_page
-> TestSetPageMlocked
-> putback_lru_page
So in the end, the page will be moved to the unevictable list.
> Then I was worrying about a null problem. Sorry for the confusion!
>
> Thanks,
> Fengguang
>
> > ----
> >
> > /*
> > * called from munlock()/munmap() path with page supposedly on the LRU.
> > *
> > * Note: unlike mlock_vma_page(), we can't just clear the PageMlocked
> > * [in try_to_munlock()] and then attempt to isolate the page. We must
> > * isolate the page to keep others from messing with its unevictable
> > * and mlocked state while trying to munlock. However, we pre-clear the
> > * mlocked state anyway as we might lose the isolation race and we might
> > * not get another chance to clear PageMlocked. If we successfully
> > * isolate the page and try_to_munlock() detects other VM_LOCKED vmas
> > * mapping the page, it will restore the PageMlocked state, unless the page
> > * is mapped in a non-linear vma. So, we go ahead and SetPageMlocked(),
> > * perhaps redundantly.
> > * If we lose the isolation race, and the page is mapped by other VM_LOCKED
> > * vmas, we'll detect this in vmscan--via try_to_munlock() or try_to_unmap()
> > * either of which will restore the PageMlocked state by calling
> > * mlock_vma_page() above, if it can grab the vma's mmap sem.
> > */
> > static void munlock_vma_page(struct page *page)
> > {
> > ...
> >
> > --
> > Kind regards,
> > Minchan Kim
--
Kind regards,
Minchan Kim
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-18 9:52 ` Minchan Kim
@ 2009-08-18 10:00 ` Wu Fengguang
-1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-18 10:00 UTC (permalink / raw)
To: Minchan Kim
Cc: Rik van Riel, Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu,
Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton,
Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
On Tue, Aug 18, 2009 at 05:52:47PM +0800, Minchan Kim wrote:
> On Tue, 18 Aug 2009 17:31:19 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
>
> > On Tue, Aug 18, 2009 at 12:17:34PM +0800, Minchan Kim wrote:
> > > On Tue, 18 Aug 2009 10:34:38 +0800
> > > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > >
> > > > Minchan,
> > > >
> > > > On Mon, Aug 17, 2009 at 10:33:54PM +0800, Minchan Kim wrote:
> > > > > On Sun, Aug 16, 2009 at 8:29 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> > > > > > On Sun, Aug 16, 2009 at 01:15:02PM +0800, Wu Fengguang wrote:
> > > > > >> On Sun, Aug 16, 2009 at 11:53:00AM +0800, Rik van Riel wrote:
> > > > > >> > Wu Fengguang wrote:
> > > > > >> > > On Fri, Aug 07, 2009 at 05:09:55AM +0800, Jeff Dike wrote:
> > > > > >> > >> Side question -
> > > > > >> > >> Is there a good reason for this to be in shrink_active_list()
> > > > > >> > >> as opposed to __isolate_lru_page?
> > > > > >> > >>
> > > > > >> > >> if (unlikely(!page_evictable(page, NULL))) {
> > > > > >> > >> putback_lru_page(page);
> > > > > >> > >> continue;
> > > > > >> > >> }
> > > > > >> > >>
> > > > > >> > >> Maybe we want to minimize the amount of code under the lru lock or
> > > > > >> > >> avoid duplicate logic in the isolate_page functions.
> > > > > >> > >
> > > > > >> > > I guess the quick test means to avoid the expensive page_referenced()
> > > > > >> > > call that follows it. But that should be mostly one shot cost - the
> > > > > >> > > unevictable pages are unlikely to cycle in active/inactive list again
> > > > > >> > > and again.
> > > > > >> >
> > > > > >> > Please read what putback_lru_page does.
> > > > > >> >
> > > > > >> > It moves the page onto the unevictable list, so that
> > > > > >> > it will not end up in this scan again.
> > > > > >>
> > > > > >> Yes it does. I said 'mostly' because there is a small hole that an
> > > > > >> unevictable page may be scanned but still not moved to unevictable
> > > > > >> list: when a page is mapped in two places, the first pte has the
> > > > > >> referenced bit set, the _second_ VMA has VM_LOCKED bit set, then
> > > > > >> page_referenced() will return 1 and shrink_page_list() will move it
> > > > > >> into active list instead of unevictable list. Shall we fix this rare
> > > > > >> case?
> > > > >
> > > > > I think it's not a big deal.
> > > >
> > > > Maybe, otherwise I should bring up this issue long time before :)
> > > >
> > > > > As you mentioned, it's rare case so there would be few pages in active
> > > > > list instead of unevictable list.
> > > >
> > > > Yes.
> > > >
> > > > > When next time to scan comes, we can try to move the pages into
> > > > > unevictable list, again.
> > > >
> > > > Will PG_mlocked be set by then? Otherwise the situation is not likely
> > > > to change and the VM_LOCKED pages may circulate in active/inactive
> > > > list for countless times.
> > >
> > > PG_mlocked is not important in that case.
> > > Important thing is VM_LOCKED vma.
> > > I think the annotation below can help you to understand my point. :)
> >
> > Hmm, it looks like pages under VM_LOCKED vma is guaranteed to have
> > PG_mlocked set, and so will be caught by page_evictable(). Is it?
>
> No. I am sorry for making my point not clear.
> I meant following as.
> When the next time to scan,
>
> shrink_page_list
->
referenced = page_referenced(page, 1,
sc->mem_cgroup, &vm_flags);
/* In active use or really unfreeable? Activate it. */
if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
referenced && page_mapping_inuse(page))
goto activate_locked;
> -> try_to_unmap
~~~~~~~~~~~~ this line won't be reached if page is found to be
referenced in the above lines?
Thanks,
Fengguang
> -> try_to_unmap_xxx
> -> if (vma->vm_flags & VM_LOCKED)
> -> try_to_mlock_page
> -> TestSetPageMlocked
> -> putback_lru_page
>
> So at last, the page will be located in unevictable list.
>
> > Then I was worrying about a null problem. Sorry for the confusion!
> >
> > Thanks,
> > Fengguang
> >
> > > ----
> > >
> > > /*
> > > * called from munlock()/munmap() path with page supposedly on the LRU.
> > > *
> > > * Note: unlike mlock_vma_page(), we can't just clear the PageMlocked
> > > * [in try_to_munlock()] and then attempt to isolate the page. We must
> > > * isolate the page to keep others from messing with its unevictable
> > > * and mlocked state while trying to munlock. However, we pre-clear the
> > > * mlocked state anyway as we might lose the isolation race and we might
> > > * not get another chance to clear PageMlocked. If we successfully
> > > * isolate the page and try_to_munlock() detects other VM_LOCKED vmas
> > > * mapping the page, it will restore the PageMlocked state, unless the page
> > > * is mapped in a non-linear vma. So, we go ahead and SetPageMlocked(),
> > > * perhaps redundantly.
> > > * If we lose the isolation race, and the page is mapped by other VM_LOCKED
> > > * vmas, we'll detect this in vmscan--via try_to_munlock() or try_to_unmap()
> > > * either of which will restore the PageMlocked state by calling
> > > * mlock_vma_page() above, if it can grab the vma's mmap sem.
> > > */
> > > static void munlock_vma_page(struct page *page)
> > > {
> > > ...
> > >
> > > --
> > > Kind regards,
> > > Minchan Kim
>
>
> --
> Kind regards,
> Minchan Kim
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-18 10:00 ` Wu Fengguang
@ 2009-08-18 11:00 ` Minchan Kim
-1 siblings, 0 replies; 243+ messages in thread
From: Minchan Kim @ 2009-08-18 11:00 UTC (permalink / raw)
To: Wu Fengguang, Lee Schermerhorn
Cc: Rik van Riel, Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu,
Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton,
Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
On Tue, Aug 18, 2009 at 7:00 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> On Tue, Aug 18, 2009 at 05:52:47PM +0800, Minchan Kim wrote:
>> On Tue, 18 Aug 2009 17:31:19 +0800
>> Wu Fengguang <fengguang.wu@intel.com> wrote:
>>
>> > On Tue, Aug 18, 2009 at 12:17:34PM +0800, Minchan Kim wrote:
>> > > On Tue, 18 Aug 2009 10:34:38 +0800
>> > > Wu Fengguang <fengguang.wu@intel.com> wrote:
>> > >
>> > > > Minchan,
>> > > >
>> > > > On Mon, Aug 17, 2009 at 10:33:54PM +0800, Minchan Kim wrote:
>> > > > > On Sun, Aug 16, 2009 at 8:29 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
>> > > > > > On Sun, Aug 16, 2009 at 01:15:02PM +0800, Wu Fengguang wrote:
>> > > > > >> On Sun, Aug 16, 2009 at 11:53:00AM +0800, Rik van Riel wrote:
>> > > > > >> > Wu Fengguang wrote:
>> > > > > >> > > On Fri, Aug 07, 2009 at 05:09:55AM +0800, Jeff Dike wrote:
>> > > > > >> > >> Side question -
>> > > > > >> > >> Is there a good reason for this to be in shrink_active_list()
>> > > > > >> > >> as opposed to __isolate_lru_page?
>> > > > > >> > >>
>> > > > > >> > >> if (unlikely(!page_evictable(page, NULL))) {
>> > > > > >> > >> putback_lru_page(page);
>> > > > > >> > >> continue;
>> > > > > >> > >> }
>> > > > > >> > >>
>> > > > > >> > >> Maybe we want to minimize the amount of code under the lru lock or
>> > > > > >> > >> avoid duplicate logic in the isolate_page functions.
>> > > > > >> > >
>> > > > > >> > > I guess the quick test means to avoid the expensive page_referenced()
>> > > > > >> > > call that follows it. But that should be mostly one shot cost - the
>> > > > > >> > > unevictable pages are unlikely to cycle in active/inactive list again
>> > > > > >> > > and again.
>> > > > > >> >
>> > > > > >> > Please read what putback_lru_page does.
>> > > > > >> >
>> > > > > >> > It moves the page onto the unevictable list, so that
>> > > > > >> > it will not end up in this scan again.
>> > > > > >>
>> > > > > >> Yes it does. I said 'mostly' because there is a small hole that an
>> > > > > >> unevictable page may be scanned but still not moved to unevictable
>> > > > > >> list: when a page is mapped in two places, the first pte has the
>> > > > > >> referenced bit set, the _second_ VMA has VM_LOCKED bit set, then
>> > > > > >> page_referenced() will return 1 and shrink_page_list() will move it
>> > > > > >> into active list instead of unevictable list. Shall we fix this rare
>> > > > > >> case?
>> > > > >
>> > > > > I think it's not a big deal.
>> > > >
>> > > > Maybe, otherwise I should bring up this issue long time before :)
>> > > >
>> > > > > As you mentioned, it's rare case so there would be few pages in active
>> > > > > list instead of unevictable list.
>> > > >
>> > > > Yes.
>> > > >
>> > > > > When next time to scan comes, we can try to move the pages into
>> > > > > unevictable list, again.
>> > > >
>> > > > Will PG_mlocked be set by then? Otherwise the situation is not likely
>> > > > to change and the VM_LOCKED pages may circulate in active/inactive
>> > > > list for countless times.
>> > >
>> > > PG_mlocked is not important in that case.
>> > > Important thing is VM_LOCKED vma.
>> > > I think the annotation below can help you to understand my point. :)
>> >
>> > Hmm, it looks like pages under VM_LOCKED vma is guaranteed to have
>> > PG_mlocked set, and so will be caught by page_evictable(). Is it?
>>
>> No. I am sorry for making my point not clear.
>> I meant following as.
>> When the next time to scan,
>>
>> shrink_page_list
> ->
> referenced = page_referenced(page, 1,
> sc->mem_cgroup, &vm_flags);
> /* In active use or really unfreeable? Activate it. */
> if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
> referenced && page_mapping_inuse(page))
> goto activate_locked;
>
>> -> try_to_unmap
> ~~~~~~~~~~~~ this line won't be reached if page is found to be
> referenced in the above lines?
Indeed! In fact, I was worried about that.
It looks like a livelock problem.
But I think it's a very small race window, so there haven't been any reports until now.
Let's Cc Lee.
If we have to fix it, how about this?
This version has less overhead than yours, since
shrink_page_list is called less often than page_referenced.
diff --git a/mm/rmap.c b/mm/rmap.c
index ed63894..283266c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -358,6 +358,7 @@ static int page_referenced_one(struct page *page,
*/
if (vma->vm_flags & VM_LOCKED) {
*mapcount = 1; /* break early from loop */
+ *vm_flags |= VM_LOCKED;
goto out_unmap;
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d224b28..d156e1d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -632,7 +632,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
sc->mem_cgroup, &vm_flags);
/* In active use or really unfreeable? Activate it. */
if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
- referenced && page_mapping_inuse(page))
+ referenced && page_mapping_inuse(page)
+ && !(vm_flags & VM_LOCKED))
goto activate_locked;
>
> Thanks,
> Fengguang
>
>> -> try_to_unmap_xxx
>> -> if (vma->vm_flags & VM_LOCKED)
>> -> try_to_mlock_page
>> -> TestSetPageMlocked
>> -> putback_lru_page
>>
>> So at last, the page will be located in unevictable list.
>>
>> > Then I was worrying about a null problem. Sorry for the confusion!
>> >
>> > Thanks,
>> > Fengguang
>> >
>> > > ----
>> > >
>> > > /*
>> > > * called from munlock()/munmap() path with page supposedly on the LRU.
>> > > *
>> > > * Note: unlike mlock_vma_page(), we can't just clear the PageMlocked
>> > > * [in try_to_munlock()] and then attempt to isolate the page. We must
>> > > * isolate the page to keep others from messing with its unevictable
>> > > * and mlocked state while trying to munlock. However, we pre-clear the
>> > > * mlocked state anyway as we might lose the isolation race and we might
>> > > * not get another chance to clear PageMlocked. If we successfully
>> > > * isolate the page and try_to_munlock() detects other VM_LOCKED vmas
>> > > * mapping the page, it will restore the PageMlocked state, unless the page
>> > > * is mapped in a non-linear vma. So, we go ahead and SetPageMlocked(),
>> > > * perhaps redundantly.
>> > > * If we lose the isolation race, and the page is mapped by other VM_LOCKED
>> > > * vmas, we'll detect this in vmscan--via try_to_munlock() or try_to_unmap()
>> > > * either of which will restore the PageMlocked state by calling
>> > > * mlock_vma_page() above, if it can grab the vma's mmap sem.
>> > > */
>> > > static void munlock_vma_page(struct page *page)
>> > > {
>> > > ...
>> > >
>> > > --
>> > > Kind regards,
>> > > Minchan Kim
>>
>>
>> --
>> Kind regards,
>> Minchan Kim
>
--
Kind regards,
Minchan Kim
^ permalink raw reply related [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-18 11:00 ` Minchan Kim
@ 2009-08-18 11:11 ` Wu Fengguang
-1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-18 11:11 UTC (permalink / raw)
To: Minchan Kim
Cc: Lee Schermerhorn, Rik van Riel, Jeff Dike, Avi Kivity,
Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman,
LKML, linux-mm
On Tue, Aug 18, 2009 at 07:00:48PM +0800, Minchan Kim wrote:
> On Tue, Aug 18, 2009 at 7:00 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> > On Tue, Aug 18, 2009 at 05:52:47PM +0800, Minchan Kim wrote:
> >> On Tue, 18 Aug 2009 17:31:19 +0800
> >> Wu Fengguang <fengguang.wu@intel.com> wrote:
> >>
> >> > On Tue, Aug 18, 2009 at 12:17:34PM +0800, Minchan Kim wrote:
> >> > > On Tue, 18 Aug 2009 10:34:38 +0800
> >> > > Wu Fengguang <fengguang.wu@intel.com> wrote:
> >> > >
> >> > > > Minchan,
> >> > > >
> >> > > > On Mon, Aug 17, 2009 at 10:33:54PM +0800, Minchan Kim wrote:
> >> > > > > On Sun, Aug 16, 2009 at 8:29 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> >> > > > > > On Sun, Aug 16, 2009 at 01:15:02PM +0800, Wu Fengguang wrote:
> >> > > > > >> On Sun, Aug 16, 2009 at 11:53:00AM +0800, Rik van Riel wrote:
> >> > > > > >> > Wu Fengguang wrote:
> >> > > > > >> > > On Fri, Aug 07, 2009 at 05:09:55AM +0800, Jeff Dike wrote:
> >> > > > > >> > >> Side question -
> >> > > > > >> > >> Is there a good reason for this to be in shrink_active_list()
> >> > > > > >> > >> as opposed to __isolate_lru_page?
> >> > > > > >> > >>
> >> > > > > >> > >> if (unlikely(!page_evictable(page, NULL))) {
> >> > > > > >> > >> putback_lru_page(page);
> >> > > > > >> > >> continue;
> >> > > > > >> > >> }
> >> > > > > >> > >>
> >> > > > > >> > >> Maybe we want to minimize the amount of code under the lru lock or
> >> > > > > >> > >> avoid duplicate logic in the isolate_page functions.
> >> > > > > >> > >
> >> > > > > >> > > I guess the quick test means to avoid the expensive page_referenced()
> >> > > > > >> > > call that follows it. But that should be mostly one shot cost - the
> >> > > > > >> > > unevictable pages are unlikely to cycle in active/inactive list again
> >> > > > > >> > > and again.
> >> > > > > >> >
> >> > > > > >> > Please read what putback_lru_page does.
> >> > > > > >> >
> >> > > > > >> > It moves the page onto the unevictable list, so that
> >> > > > > >> > it will not end up in this scan again.
> >> > > > > >>
> >> > > > > >> Yes it does. I said 'mostly' because there is a small hole that an
> >> > > > > >> unevictable page may be scanned but still not moved to unevictable
> >> > > > > >> list: when a page is mapped in two places, the first pte has the
> >> > > > > >> referenced bit set, the _second_ VMA has VM_LOCKED bit set, then
> >> > > > > >> page_referenced() will return 1 and shrink_page_list() will move it
> >> > > > > >> into active list instead of unevictable list. Shall we fix this rare
> >> > > > > >> case?
> >> > > > >
> >> > > > > I think it's not a big deal.
> >> > > >
> >> > > > Maybe, otherwise I should have brought this issue up long ago :)
> >> > > >
> >> > > > > As you mentioned, it's rare case so there would be few pages in active
> >> > > > > list instead of unevictable list.
> >> > > >
> >> > > > Yes.
> >> > > >
> >> > > > > When next time to scan comes, we can try to move the pages into
> >> > > > > unevictable list, again.
> >> > > >
> >> > > > Will PG_mlocked be set by then? Otherwise the situation is not likely
> >> > > > to change and the VM_LOCKED pages may circulate in active/inactive
> >> > > > list for countless times.
> >> > >
> >> > > PG_mlocked is not important in that case.
> >> > > Important thing is VM_LOCKED vma.
> >> > > I think the annotation below can help you understand my point. :)
> >> >
> >> > Hmm, it looks like pages under a VM_LOCKED vma are guaranteed to have
> >> > PG_mlocked set, and so will be caught by page_evictable(). Is that right?
> >>
> >> No. I am sorry for not making my point clear.
> >> I meant the following.
> >> The next time the page is scanned:
> >>
> >> shrink_page_list
> > ->
> > referenced = page_referenced(page, 1,
> > sc->mem_cgroup, &vm_flags);
> > /* In active use or really unfreeable? Activate it. */
> > if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
> > referenced && page_mapping_inuse(page))
> > goto activate_locked;
> >
> >> -> try_to_unmap
> > ~~~~~~~~~~~~ this line won't be reached if page is found to be
> > referenced in the above lines?
>
> Indeed! In fact, I was worried about that.
> It looks like a livelock problem.
> But I think the race window is so small that no one has reported it until now.
> Let's Cc Lee.
>
> If we have to fix it, how about this?
> This version has smaller overhead than yours, since
> shrink_page_list is called less often than page_referenced.
Yeah, it looks better. However I still wonder if (VM_LOCKED && !PG_mlocked)
is possible and somehow persistent. Does anyone have the answer? Thanks!
Thanks,
Fengguang
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index ed63894..283266c 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -358,6 +358,7 @@ static int page_referenced_one(struct page *page,
> */
> if (vma->vm_flags & VM_LOCKED) {
> *mapcount = 1; /* break early from loop */
> + *vm_flags |= VM_LOCKED;
> goto out_unmap;
> }
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d224b28..d156e1d 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -632,7 +632,8 @@ static unsigned long shrink_page_list(struct
> list_head *page_list,
> sc->mem_cgroup, &vm_flags);
> /* In active use or really unfreeable? Activate it. */
> if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
> - referenced && page_mapping_inuse(page))
> + referenced && page_mapping_inuse(page)
> + && !(vm_flags & VM_LOCKED))
> goto activate_locked;
>
>
>
> >
> > Thanks,
> > Fengguang
> >
> >> -> try_to_unmap_xxx
> >> -> if (vma->vm_flags & VM_LOCKED)
> >> -> try_to_mlock_page
> >> -> TestSetPageMlocked
> >> -> putback_lru_page
> >>
> >> So at last, the page will be located in unevictable list.
> >>
> >> > Then I was worrying about a null problem. Sorry for the confusion!
> >> >
> >> > Thanks,
> >> > Fengguang
> >> >
> >> > > ----
> >> > >
> >> > > /*
> >> > > * called from munlock()/munmap() path with page supposedly on the LRU.
> >> > > *
> >> > > * Note: unlike mlock_vma_page(), we can't just clear the PageMlocked
> >> > > * [in try_to_munlock()] and then attempt to isolate the page. We must
> >> > > * isolate the page to keep others from messing with its unevictable
> >> > > * and mlocked state while trying to munlock. However, we pre-clear the
> >> > > * mlocked state anyway as we might lose the isolation race and we might
> >> > > * not get another chance to clear PageMlocked. If we successfully
> >> > > * isolate the page and try_to_munlock() detects other VM_LOCKED vmas
> >> > > * mapping the page, it will restore the PageMlocked state, unless the page
> >> > > * is mapped in a non-linear vma. So, we go ahead and SetPageMlocked(),
> >> > > * perhaps redundantly.
> >> > > * If we lose the isolation race, and the page is mapped by other VM_LOCKED
> >> > > * vmas, we'll detect this in vmscan--via try_to_munlock() or try_to_unmap()
> >> > > * either of which will restore the PageMlocked state by calling
> >> > > * mlock_vma_page() above, if it can grab the vma's mmap sem.
> >> > > */
> >> > > static void munlock_vma_page(struct page *page)
> >> > > {
> >> > > ...
> >> > >
> >> > > --
> >> > > Kind regards,
> >> > > Minchan Kim
> >>
> >>
> >> --
> >> Kind regards,
> >> Minchan Kim
> >
>
>
>
> --
> Kind regards,
> Minchan Kim
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-18 11:11 ` Wu Fengguang
@ 2009-08-18 14:03 ` Minchan Kim
-1 siblings, 0 replies; 243+ messages in thread
From: Minchan Kim @ 2009-08-18 14:03 UTC (permalink / raw)
To: Wu Fengguang
Cc: Lee Schermerhorn, Rik van Riel, Jeff Dike, Avi Kivity,
Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman,
LKML, linux-mm
On Tue, Aug 18, 2009 at 8:11 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> On Tue, Aug 18, 2009 at 07:00:48PM +0800, Minchan Kim wrote:
>> On Tue, Aug 18, 2009 at 7:00 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
>> > On Tue, Aug 18, 2009 at 05:52:47PM +0800, Minchan Kim wrote:
>> >> On Tue, 18 Aug 2009 17:31:19 +0800
>> >> Wu Fengguang <fengguang.wu@intel.com> wrote:
>> >>
>> >> > On Tue, Aug 18, 2009 at 12:17:34PM +0800, Minchan Kim wrote:
>> >> > > On Tue, 18 Aug 2009 10:34:38 +0800
>> >> > > Wu Fengguang <fengguang.wu@intel.com> wrote:
>> >> > >
>> >> > > > Minchan,
>> >> > > >
>> >> > > > On Mon, Aug 17, 2009 at 10:33:54PM +0800, Minchan Kim wrote:
>> >> > > > > On Sun, Aug 16, 2009 at 8:29 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
>> >> > > > > > On Sun, Aug 16, 2009 at 01:15:02PM +0800, Wu Fengguang wrote:
>> >> > > > > >> On Sun, Aug 16, 2009 at 11:53:00AM +0800, Rik van Riel wrote:
>> >> > > > > >> > Wu Fengguang wrote:
>> >> > > > > >> > > On Fri, Aug 07, 2009 at 05:09:55AM +0800, Jeff Dike wrote:
>> >> > > > > >> > >> Side question -
>> >> > > > > >> > >> Is there a good reason for this to be in shrink_active_list()
>> >> > > > > >> > >> as opposed to __isolate_lru_page?
>> >> > > > > >> > >>
>> >> > > > > >> > >> if (unlikely(!page_evictable(page, NULL))) {
>> >> > > > > >> > >> putback_lru_page(page);
>> >> > > > > >> > >> continue;
>> >> > > > > >> > >> }
>> >> > > > > >> > >>
>> >> > > > > >> > >> Maybe we want to minimize the amount of code under the lru lock or
>> >> > > > > >> > >> avoid duplicate logic in the isolate_page functions.
>> >> > > > > >> > >
>> >> > > > > >> > > I guess the quick test means to avoid the expensive page_referenced()
>> >> > > > > >> > > call that follows it. But that should be mostly one shot cost - the
>> >> > > > > >> > > unevictable pages are unlikely to cycle in active/inactive list again
>> >> > > > > >> > > and again.
>> >> > > > > >> >
>> >> > > > > >> > Please read what putback_lru_page does.
>> >> > > > > >> >
>> >> > > > > >> > It moves the page onto the unevictable list, so that
>> >> > > > > >> > it will not end up in this scan again.
>> >> > > > > >>
>> >> > > > > >> Yes it does. I said 'mostly' because there is a small hole that an
>> >> > > > > >> unevictable page may be scanned but still not moved to unevictable
>> >> > > > > >> list: when a page is mapped in two places, the first pte has the
>> >> > > > > >> referenced bit set, the _second_ VMA has VM_LOCKED bit set, then
>> >> > > > > >> page_referenced() will return 1 and shrink_page_list() will move it
>> >> > > > > >> into active list instead of unevictable list. Shall we fix this rare
>> >> > > > > >> case?
>> >> > > > >
>> >> > > > > I think it's not a big deal.
>> >> > > >
>> >> > > > Maybe, otherwise I should have brought this issue up long ago :)
>> >> > > >
>> >> > > > > As you mentioned, it's rare case so there would be few pages in active
>> >> > > > > list instead of unevictable list.
>> >> > > >
>> >> > > > Yes.
>> >> > > >
>> >> > > > > When next time to scan comes, we can try to move the pages into
>> >> > > > > unevictable list, again.
>> >> > > >
>> >> > > > Will PG_mlocked be set by then? Otherwise the situation is not likely
>> >> > > > to change and the VM_LOCKED pages may circulate in active/inactive
>> >> > > > list for countless times.
>> >> > >
>> >> > > PG_mlocked is not important in that case.
>> >> > > Important thing is VM_LOCKED vma.
>> >> > > I think the annotation below can help you understand my point. :)
>> >> >
>> >> > Hmm, it looks like pages under a VM_LOCKED vma are guaranteed to have
>> >> > PG_mlocked set, and so will be caught by page_evictable(). Is that right?
>> >>
>> >> No. I am sorry for not making my point clear.
>> >> I meant the following.
>> >> The next time the page is scanned:
>> >>
>> >> shrink_page_list
>> > ->
>> > referenced = page_referenced(page, 1,
>> > sc->mem_cgroup, &vm_flags);
>> > /* In active use or really unfreeable? Activate it. */
>> > if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
>> > referenced && page_mapping_inuse(page))
>> > goto activate_locked;
>> >
>> >> -> try_to_unmap
>> > ~~~~~~~~~~~~ this line won't be reached if page is found to be
>> > referenced in the above lines?
>>
>> Indeed! In fact, I was worried about that.
>> It looks like a livelock problem.
>> But I think the race window is so small that no one has reported it until now.
>> Let's Cc Lee.
>>
>> If we have to fix it, how about this?
>> This version has smaller overhead than yours, since
>> shrink_page_list is called less often than page_referenced.
>
> Yeah, it looks better. However I still wonder if (VM_LOCKED && !PG_mlocked)
> is possible and somehow persistent. Does anyone have the answer? Thanks!
I think it's possible.
munlock_vma_page pre-clears PG_mlocked on the page.
Then, if isolate_lru_page fails, the page is left with no PG_mlocked while
still being mapped by a VM_LOCKED vma.
As munlock_vma_page's annotation says, we hope the page will be rescued by
try_to_unmap. But as you pointed out, if the page has PG_referenced set, it can't
reach try_to_unmap, so it will go into the active list instead.
What do others think?
> Thanks,
> Fengguang
--
Kind regards,
Minchan Kim
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-16 5:09 ` Balbir Singh
@ 2009-08-18 15:57 ` KOSAKI Motohiro
-1 siblings, 0 replies; 243+ messages in thread
From: KOSAKI Motohiro @ 2009-08-18 15:57 UTC (permalink / raw)
To: balbir
Cc: kosaki.motohiro, Wu Fengguang, Rik van Riel, Johannes Weiner,
Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred,
Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
Mel Gorman, LKML, linux-mm
> * Wu Fengguang <fengguang.wu@intel.com> [2009-08-15 13:45:24]:
>
> > On Fri, Aug 14, 2009 at 09:19:35PM +0800, Rik van Riel wrote:
> > > Wu Fengguang wrote:
> > > > On Fri, Aug 14, 2009 at 05:10:55PM +0800, Johannes Weiner wrote:
> > >
> > > >> So even with the active list being a FIFO, we keep usage information
> > > >> gathered from the inactive list. If we deactivate pages in arbitrary
> > > >> list intervals, we throw this away.
> > > >
> > > > We do have the danger of FIFO, if inactive list is small enough, so
> > > > that (unconditionally) deactivated pages quickly get reclaimed and
> > > > their life window in inactive list is too small to be useful.
> > >
> > > This is one of the reasons why we unconditionally deactivate
> > > the active anon pages, and do background scanning of the
> > > active anon list when reclaiming page cache pages.
> > >
> > > We want to always move some pages to the inactive anon
> > > list, so it does not get too small.
> >
> > Right, the current code tries to pull inactive list out of
> > smallish-size state as long as there are vmscan activities.
> >
> > However there is a possible (and tricky) hole: mem cgroups
> > don't do batched vmscan. shrink_zone() may call shrink_list()
> > with nr_to_scan=1, in which case shrink_list() _still_ calls
> > isolate_pages() with the much larger SWAP_CLUSTER_MAX.
> >
> > It effectively scales up the inactive list scan rate by 10 times when
> > it is still small, and may thus prevent it from ever growing.
> >
>
> I think we need to possibly export some scanning data under DEBUG_VM
> to cross verify.
Sorry for the delay.
How about this?
=======================================
Subject: [PATCH] vmscan: show recent_scanned/rotated stat
In a recent discussion, Balbir Singh pointed out that VM developers should be
able to see the recent_scanned/rotated statistics.
This patch exports them.
output example
--------------------
% cat /proc/zoneinfo
Node 0, zone DMA32
pages free 347590
min 613
low 766
high 919
(snip)
inactive_ratio: 3
recent_rotated_anon: 127305
recent_rotated_file: 67439
recent_scanned_anon: 135591
recent_scanned_file: 180399
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
mm/vmstat.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
Index: b/mm/vmstat.c
===================================================================
--- a/mm/vmstat.c 2009-08-08 14:16:53.000000000 +0900
+++ b/mm/vmstat.c 2009-08-18 22:07:25.000000000 +0900
@@ -762,6 +762,20 @@ static void zoneinfo_show_print(struct s
zone->prev_priority,
zone->zone_start_pfn,
zone->inactive_ratio);
+
+#ifdef CONFIG_DEBUG_VM
+ seq_printf(m,
+ "\n recent_rotated_anon: %lu"
+ "\n recent_rotated_file: %lu"
+ "\n recent_scanned_anon: %lu"
+ "\n recent_scanned_file: %lu",
+ zone->reclaim_stat.recent_rotated[0],
+ zone->reclaim_stat.recent_rotated[1],
+ zone->reclaim_stat.recent_scanned[0],
+ zone->reclaim_stat.recent_scanned[1]
+ );
+#endif
+
seq_putc(m, '\n');
}
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-15 5:45 ` Wu Fengguang
@ 2009-08-18 15:57 ` KOSAKI Motohiro
-1 siblings, 0 replies; 243+ messages in thread
From: KOSAKI Motohiro @ 2009-08-18 15:57 UTC (permalink / raw)
To: Wu Fengguang
Cc: kosaki.motohiro, Rik van Riel, Johannes Weiner, Avi Kivity,
Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML,
linux-mm
> > This is one of the reasons why we unconditionally deactivate
> > the active anon pages, and do background scanning of the
> > active anon list when reclaiming page cache pages.
> >
> > We want to always move some pages to the inactive anon
> > list, so it does not get too small.
>
> Right, the current code tries to pull inactive list out of
> smallish-size state as long as there are vmscan activities.
>
> However there is a possible (and tricky) hole: mem cgroups
> don't do batched vmscan. shrink_zone() may call shrink_list()
> with nr_to_scan=1, in which case shrink_list() _still_ calls
> isolate_pages() with the much larger SWAP_CLUSTER_MAX.
>
> It effectively scales up the inactive list scan rate by 10 times when
> it is still small, and may thus prevent it from growing up for ever.
>
> In that case, LRU becomes FIFO.
>
> Jeff, can you confirm if the mem cgroup's inactive list is small?
> If so, this patch should help.
This patch does the right thing.
However, let me explain why the memcg folks and I didn't do this in the past.
Some memcg struct declarations are hidden in *.c files, so we can't make
these helpers inline functions, and we hesitated to introduce the extra
function-call overhead.
So, can we move some memcg structure declarations into *.h and make
mem_cgroup_get_saved_scan() an inline function?
>
> Thanks,
> Fengguang
> ---
>
> mm: do batched scans for mem_cgroup
>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
> include/linux/memcontrol.h | 3 +++
> mm/memcontrol.c | 12 ++++++++++++
> mm/vmscan.c | 9 +++++----
> 3 files changed, 20 insertions(+), 4 deletions(-)
>
> --- linux.orig/include/linux/memcontrol.h 2009-08-15 13:12:49.000000000 +0800
> +++ linux/include/linux/memcontrol.h 2009-08-15 13:18:13.000000000 +0800
> @@ -98,6 +98,9 @@ int mem_cgroup_inactive_file_is_low(stru
> unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
> struct zone *zone,
> enum lru_list lru);
> +unsigned long *mem_cgroup_get_saved_scan(struct mem_cgroup *memcg,
> + struct zone *zone,
> + enum lru_list lru);
> struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
> struct zone *zone);
> struct zone_reclaim_stat*
> --- linux.orig/mm/memcontrol.c 2009-08-15 13:07:34.000000000 +0800
> +++ linux/mm/memcontrol.c 2009-08-15 13:17:56.000000000 +0800
> @@ -115,6 +115,7 @@ struct mem_cgroup_per_zone {
> */
> struct list_head lists[NR_LRU_LISTS];
> unsigned long count[NR_LRU_LISTS];
> + unsigned long nr_saved_scan[NR_LRU_LISTS];
>
> struct zone_reclaim_stat reclaim_stat;
> };
> @@ -597,6 +598,17 @@ unsigned long mem_cgroup_zone_nr_pages(s
> return MEM_CGROUP_ZSTAT(mz, lru);
> }
>
> +unsigned long *mem_cgroup_get_saved_scan(struct mem_cgroup *memcg,
> + struct zone *zone,
> + enum lru_list lru)
> +{
> + int nid = zone->zone_pgdat->node_id;
> + int zid = zone_idx(zone);
> + struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
> +
> + return &mz->nr_saved_scan[lru];
> +}
I think this function is a bit strange.
shrink_zone() doesn't hold any lock, so shouldn't we care about the race with
memcg removal?
> +
> struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
> struct zone *zone)
> {
> --- linux.orig/mm/vmscan.c 2009-08-15 13:04:54.000000000 +0800
> +++ linux/mm/vmscan.c 2009-08-15 13:19:03.000000000 +0800
> @@ -1534,6 +1534,7 @@ static void shrink_zone(int priority, st
> for_each_evictable_lru(l) {
> int file = is_file_lru(l);
> unsigned long scan;
> + unsigned long *saved_scan;
>
> scan = zone_nr_pages(zone, sc, l);
> if (priority || noswap) {
> @@ -1541,11 +1542,11 @@ static void shrink_zone(int priority, st
> scan = (scan * percent[file]) / 100;
> }
> if (scanning_global_lru(sc))
> - nr[l] = nr_scan_try_batch(scan,
> - &zone->lru[l].nr_saved_scan,
> - swap_cluster_max);
> + saved_scan = &zone->lru[l].nr_saved_scan;
> else
> - nr[l] = scan;
> + saved_scan = mem_cgroup_get_saved_scan(sc->mem_cgroup,
> + zone, l);
> + nr[l] = nr_scan_try_batch(scan, saved_scan, swap_cluster_max);
> }
>
> while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-16 11:29 ` Wu Fengguang
@ 2009-08-18 15:57 ` KOSAKI Motohiro
-1 siblings, 0 replies; 243+ messages in thread
From: KOSAKI Motohiro @ 2009-08-18 15:57 UTC (permalink / raw)
To: Wu Fengguang
Cc: kosaki.motohiro, Rik van Riel, Jeff Dike, Avi Kivity,
Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm
> > Yes it does. I said 'mostly' because there is a small hole that an
> > unevictable page may be scanned but still not moved to unevictable
> > list: when a page is mapped in two places, the first pte has the
> > referenced bit set, the _second_ VMA has VM_LOCKED bit set, then
> > page_referenced() will return 1 and shrink_page_list() will move it
> > into active list instead of unevictable list. Shall we fix this rare
> > case?
>
> How about this fix?
Good spotting.
Yes, this is a rare case, but I also don't think your patch introduces a
performance regression.
However, I think your patch has one bug.
>
> ---
> mm: stop circulating of referenced mlocked pages
>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>
> --- linux.orig/mm/rmap.c 2009-08-16 19:11:13.000000000 +0800
> +++ linux/mm/rmap.c 2009-08-16 19:22:46.000000000 +0800
> @@ -358,6 +358,7 @@ static int page_referenced_one(struct pa
> */
> if (vma->vm_flags & VM_LOCKED) {
> *mapcount = 1; /* break early from loop */
> + *vm_flags |= VM_LOCKED;
> goto out_unmap;
> }
>
> @@ -482,6 +483,8 @@ static int page_referenced_file(struct p
> }
>
> spin_unlock(&mapping->i_mmap_lock);
> + if (*vm_flags & VM_LOCKED)
> + referenced = 0;
> return referenced;
> }
>
page_referenced_file()?
I think we should change page_referenced() instead.
How about this?
==============================================
Subject: [PATCH] mm: stop circulating of referenced mlocked pages
Currently, the mlock() system call doesn't guarantee that the page is marked
PG_Mlocked, because some races prevent grabbing the page.
In that case, vmscan moves the page to the unevictable LRU instead.
However, Wu Fengguang recently pointed out that the current vmscan logic
isn't very efficient here: an mlocked page can circulate between the active
and inactive lists, because vmscan checks whether the page is referenced
_before_ culling mlocked pages.
Plus, vmscan should mark PG_Mlocked when it culls an mlocked page;
otherwise the VM statistics show strange numbers.
This patch does that.
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
mm/internal.h | 5 +++--
mm/rmap.c | 8 +++++++-
mm/vmscan.c | 2 +-
3 files changed, 11 insertions(+), 4 deletions(-)
Index: b/mm/internal.h
===================================================================
--- a/mm/internal.h 2009-06-26 21:06:43.000000000 +0900
+++ b/mm/internal.h 2009-08-18 23:31:11.000000000 +0900
@@ -91,7 +91,8 @@ static inline void unevictable_migrate_p
* to determine if it's being mapped into a LOCKED vma.
* If so, mark page as mlocked.
*/
-static inline int is_mlocked_vma(struct vm_area_struct *vma, struct page *page)
+static inline int try_set_page_mlocked(struct vm_area_struct *vma,
+ struct page *page)
{
VM_BUG_ON(PageLRU(page));
@@ -144,7 +145,7 @@ static inline void mlock_migrate_page(st
}
#else /* CONFIG_HAVE_MLOCKED_PAGE_BIT */
-static inline int is_mlocked_vma(struct vm_area_struct *v, struct page *p)
+static inline int try_set_page_mlocked(struct vm_area_struct *v, struct page *p)
{
return 0;
}
Index: b/mm/rmap.c
===================================================================
--- a/mm/rmap.c 2009-08-18 19:48:14.000000000 +0900
+++ b/mm/rmap.c 2009-08-18 23:47:34.000000000 +0900
@@ -362,7 +362,9 @@ static int page_referenced_one(struct pa
* unevictable list.
*/
if (vma->vm_flags & VM_LOCKED) {
- *mapcount = 1; /* break early from loop */
+ *mapcount = 1; /* break early from loop */
+ *vm_flags |= VM_LOCKED; /* for prevent to move active list */
+ try_set_page_mlocked(vma, page);
goto out_unmap;
}
@@ -531,6 +533,9 @@ int page_referenced(struct page *page,
if (page_test_and_clear_young(page))
referenced++;
+ if (unlikely(*vm_flags & VM_LOCKED))
+ referenced = 0;
+
return referenced;
}
@@ -784,6 +789,7 @@ static int try_to_unmap_one(struct page
*/
if (!(flags & TTU_IGNORE_MLOCK)) {
if (vma->vm_flags & VM_LOCKED) {
+ try_set_page_mlocked(vma, page);
ret = SWAP_MLOCK;
goto out_unmap;
}
Index: b/mm/vmscan.c
===================================================================
--- a/mm/vmscan.c 2009-08-18 19:48:14.000000000 +0900
+++ b/mm/vmscan.c 2009-08-18 23:30:51.000000000 +0900
@@ -2666,7 +2666,7 @@ int page_evictable(struct page *page, st
if (mapping_unevictable(page_mapping(page)))
return 0;
- if (PageMlocked(page) || (vma && is_mlocked_vma(vma, page)))
+ if (PageMlocked(page) || (vma && try_set_page_mlocked(vma, page)))
return 0;
return 1;
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-18 11:11 ` Wu Fengguang
@ 2009-08-18 16:27 ` KOSAKI Motohiro
-1 siblings, 0 replies; 243+ messages in thread
From: KOSAKI Motohiro @ 2009-08-18 16:27 UTC (permalink / raw)
To: Wu Fengguang
Cc: kosaki.motohiro, Minchan Kim, Lee Schermerhorn, Rik van Riel,
Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu, Wilfred, Kleen,
Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman,
LKML, linux-mm
> On Tue, Aug 18, 2009 at 07:00:48PM +0800, Minchan Kim wrote:
> > On Tue, Aug 18, 2009 at 7:00 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> > > On Tue, Aug 18, 2009 at 05:52:47PM +0800, Minchan Kim wrote:
> > >> On Tue, 18 Aug 2009 17:31:19 +0800
> > >> Wu Fengguang <fengguang.wu@intel.com> wrote:
> > >>
> > >> > On Tue, Aug 18, 2009 at 12:17:34PM +0800, Minchan Kim wrote:
> > >> > > On Tue, 18 Aug 2009 10:34:38 +0800
> > >> > > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > >> > >
> > >> > > > Minchan,
> > >> > > >
> > >> > > > On Mon, Aug 17, 2009 at 10:33:54PM +0800, Minchan Kim wrote:
> > >> > > > > On Sun, Aug 16, 2009 at 8:29 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> > >> > > > > > On Sun, Aug 16, 2009 at 01:15:02PM +0800, Wu Fengguang wrote:
> > >> > > > > >> On Sun, Aug 16, 2009 at 11:53:00AM +0800, Rik van Riel wrote:
> > >> > > > > >> > Wu Fengguang wrote:
> > >> > > > > >> > > On Fri, Aug 07, 2009 at 05:09:55AM +0800, Jeff Dike wrote:
> > >> > > > > >> > >> Side question -
> > >> > > > > >> > >> Is there a good reason for this to be in shrink_active_list()
> > >> > > > > >> > >> as opposed to __isolate_lru_page?
> > >> > > > > >> > >>
> > >> > > > > >> > >> if (unlikely(!page_evictable(page, NULL))) {
> > >> > > > > >> > >> putback_lru_page(page);
> > >> > > > > >> > >> continue;
> > >> > > > > >> > >> }
> > >> > > > > >> > >>
> > >> > > > > >> > >> Maybe we want to minimize the amount of code under the lru lock or
> > >> > > > > >> > >> avoid duplicate logic in the isolate_page functions.
> > >> > > > > >> > >
> > >> > > > > >> > > I guess the quick test means to avoid the expensive page_referenced()
> > >> > > > > >> > > call that follows it. But that should be mostly one shot cost - the
> > >> > > > > >> > > unevictable pages are unlikely to cycle in active/inactive list again
> > >> > > > > >> > > and again.
> > >> > > > > >> >
> > >> > > > > >> > Please read what putback_lru_page does.
> > >> > > > > >> >
> > >> > > > > >> > It moves the page onto the unevictable list, so that
> > >> > > > > >> > it will not end up in this scan again.
> > >> > > > > >>
> > >> > > > > >> Yes it does. I said 'mostly' because there is a small hole that an
> > >> > > > > >> unevictable page may be scanned but still not moved to unevictable
> > >> > > > > >> list: when a page is mapped in two places, the first pte has the
> > >> > > > > >> referenced bit set, the _second_ VMA has VM_LOCKED bit set, then
> > >> > > > > >> page_referenced() will return 1 and shrink_page_list() will move it
> > >> > > > > >> into active list instead of unevictable list. Shall we fix this rare
> > >> > > > > >> case?
> > >> > > > >
> > >> > > > > I think it's not a big deal.
> > >> > > >
> > >> > > > Maybe, otherwise I should bring up this issue long time before :)
> > >> > > >
> > >> > > > > As you mentioned, it's rare case so there would be few pages in active
> > >> > > > > list instead of unevictable list.
> > >> > > >
> > >> > > > Yes.
> > >> > > >
> > >> > > > > When next time to scan comes, we can try to move the pages into
> > >> > > > > unevictable list, again.
> > >> > > >
> > >> > > > Will PG_mlocked be set by then? Otherwise the situation is not likely
> > >> > > > to change and the VM_LOCKED pages may circulate in active/inactive
> > >> > > > list for countless times.
> > >> > >
> > >> > > PG_mlocked is not important in that case.
> > >> > > Important thing is VM_LOCKED vma.
> > >> > > I think the annotation below can help you understand my point. :)
> > >> >
> > >> > Hmm, it looks like pages under a VM_LOCKED vma are guaranteed to have
> > >> > PG_mlocked set, and so will be caught by page_evictable(). Is that right?
> > >>
> > >> No. I am sorry for making my point not clear.
> > >> I meant following as.
> > >> When the next time to scan,
> > >>
> > >> shrink_page_list
> > > ->
> > > referenced = page_referenced(page, 1,
> > > sc->mem_cgroup, &vm_flags);
> > > /* In active use or really unfreeable? Activate it. */
> > > if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
> > > referenced && page_mapping_inuse(page))
> > > goto activate_locked;
> > >
> > >> -> try_to_unmap
> > > ~~~~~~~~~~~~ this line won't be reached if page is found to be
> > > referenced in the above lines?
> >
> > Indeed! In fact, I was worried about that.
> > It looks like a livelock problem.
> > But I think it's a very small race window, so there haven't been any reports until now.
> > Let's Cc Lee.
> >
> > If we have to fix it, how about this?
> > This version has smaller overhead than yours, since
> > shrink_page_list() is called less often than page_referenced().
>
> Yeah, it looks better. However I still wonder if (VM_LOCKED && !PG_mlocked)
> is possible and somehow persistent. Does anyone have the answer? Thanks!
Hehe, that's a bug. You spotted a very good thing IMHO ;)
I posted a fixed patch. Can you take a look?
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-18 15:57 ` KOSAKI Motohiro
@ 2009-08-19 12:01 ` Wu Fengguang
-1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-19 12:01 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Rik van Riel, Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu,
Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton,
Christoph Lameter, Mel Gorman, LKML, linux-mm
On Tue, Aug 18, 2009 at 11:57:54PM +0800, KOSAKI Motohiro wrote:
> > > Yes it does. I said 'mostly' because there is a small hole that an
> > > unevictable page may be scanned but still not moved to unevictable
> > > list: when a page is mapped in two places, the first pte has the
> > > referenced bit set, the _second_ VMA has VM_LOCKED bit set, then
> > > page_referenced() will return 1 and shrink_page_list() will move it
> > > into active list instead of unevictable list. Shall we fix this rare
> > > case?
> >
> > How about this fix?
>
> Good spotting.
> Yes, this is rare case. but I also don't think your patch introduce
> performance degression.
Thanks.
> However, I think your patch have one bug.
Hehe, sorry for being careless :)
> >
> > ---
> > mm: stop circulating of referenced mlocked pages
> >
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >
> > --- linux.orig/mm/rmap.c 2009-08-16 19:11:13.000000000 +0800
> > +++ linux/mm/rmap.c 2009-08-16 19:22:46.000000000 +0800
> > @@ -358,6 +358,7 @@ static int page_referenced_one(struct pa
> > */
> > if (vma->vm_flags & VM_LOCKED) {
> > *mapcount = 1; /* break early from loop */
> > + *vm_flags |= VM_LOCKED;
> > goto out_unmap;
> > }
> >
> > @@ -482,6 +483,8 @@ static int page_referenced_file(struct p
> > }
> >
> > spin_unlock(&mapping->i_mmap_lock);
> > + if (*vm_flags & VM_LOCKED)
> > + referenced = 0;
> > return referenced;
> > }
> >
>
> page_referenced_file?
> I think we should change page_referenced().
Yeah, good catch.
>
> Instead, How about this?
> ==============================================
>
> Subject: [PATCH] mm: stop circulating of referenced mlocked pages
>
> Currently, mlock() systemcall doesn't gurantee to mark the page PG_Mlocked
mark PG_mlocked
> because some race prevent page grabbing.
> In that case, instead vmscan move the page to unevictable lru.
>
> However, Recently Wu Fengguang pointed out current vmscan logic isn't so
> efficient.
> mlocked page can move circulatly active and inactive list because
> vmscan check the page is referenced _before_ cull mlocked page.
>
> Plus, vmscan should mark PG_Mlocked when cull mlocked page.
PG_mlocked
> Otherwise vm stastics show strange number.
>
> This patch does that.
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
> Reported-by: Wu Fengguang <fengguang.wu@intel.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> ---
> mm/internal.h | 5 +++--
> mm/rmap.c | 8 +++++++-
> mm/vmscan.c | 2 +-
> 3 files changed, 11 insertions(+), 4 deletions(-)
>
> Index: b/mm/internal.h
> ===================================================================
> --- a/mm/internal.h 2009-06-26 21:06:43.000000000 +0900
> +++ b/mm/internal.h 2009-08-18 23:31:11.000000000 +0900
> @@ -91,7 +91,8 @@ static inline void unevictable_migrate_p
> * to determine if it's being mapped into a LOCKED vma.
> * If so, mark page as mlocked.
> */
> -static inline int is_mlocked_vma(struct vm_area_struct *vma, struct page *page)
> +static inline int try_set_page_mlocked(struct vm_area_struct *vma,
> + struct page *page)
> {
> VM_BUG_ON(PageLRU(page));
>
> @@ -144,7 +145,7 @@ static inline void mlock_migrate_page(st
> }
>
> #else /* CONFIG_HAVE_MLOCKED_PAGE_BIT */
> -static inline int is_mlocked_vma(struct vm_area_struct *v, struct page *p)
> +static inline int try_set_page_mlocked(struct vm_area_struct *v, struct page *p)
> {
> return 0;
> }
> Index: b/mm/rmap.c
> ===================================================================
> --- a/mm/rmap.c 2009-08-18 19:48:14.000000000 +0900
> +++ b/mm/rmap.c 2009-08-18 23:47:34.000000000 +0900
> @@ -362,7 +362,9 @@ static int page_referenced_one(struct pa
> * unevictable list.
> */
> if (vma->vm_flags & VM_LOCKED) {
> - *mapcount = 1; /* break early from loop */
> + *mapcount = 1; /* break early from loop */
> + *vm_flags |= VM_LOCKED; /* for prevent to move active list */
> + try_set_page_mlocked(vma, page);
Is that call really necessary?
Thanks,
Fengguang
> goto out_unmap;
> }
>
> @@ -531,6 +533,9 @@ int page_referenced(struct page *page,
> if (page_test_and_clear_young(page))
> referenced++;
>
> + if (unlikely(*vm_flags & VM_LOCKED))
> + referenced = 0;
> +
> return referenced;
> }
>
> @@ -784,6 +789,7 @@ static int try_to_unmap_one(struct page
> */
> if (!(flags & TTU_IGNORE_MLOCK)) {
> if (vma->vm_flags & VM_LOCKED) {
> + try_set_page_mlocked(vma, page);
> ret = SWAP_MLOCK;
> goto out_unmap;
> }
> Index: b/mm/vmscan.c
> ===================================================================
> --- a/mm/vmscan.c 2009-08-18 19:48:14.000000000 +0900
> +++ b/mm/vmscan.c 2009-08-18 23:30:51.000000000 +0900
> @@ -2666,7 +2666,7 @@ int page_evictable(struct page *page, st
> if (mapping_unevictable(page_mapping(page)))
> return 0;
>
> - if (PageMlocked(page) || (vma && is_mlocked_vma(vma, page)))
> + if (PageMlocked(page) || (vma && try_set_page_mlocked(vma, page)))
> return 0;
>
> return 1;
>
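The rare case discussed in this thread can be illustrated with a toy model (plain C, not the actual kernel code; the struct and function names below are invented for the sketch): once any mapping of a page sits in a VM_LOCKED vma, page_referenced() should report the page as unreferenced, so vmscan can cull it to the unevictable list even if another pte has the young bit set.

```c
#include <assert.h>

/* Toy model of the patched page_referenced() path; names are invented. */
#define VM_LOCKED_SKETCH 0x2000UL

struct vma_sketch {
	unsigned long vm_flags;
	int pte_young;		/* referenced bit of this mapping's pte */
};

/*
 * Walk all mappings of a page: count young ptes, but record VM_LOCKED
 * and report "not referenced" as soon as one mapping is mlocked, so the
 * caller culls the page instead of circulating it on the active list.
 */
int page_referenced_sketch(const struct vma_sketch *maps, int n,
			   unsigned long *vm_flags)
{
	int referenced = 0;
	int i;

	*vm_flags = 0;
	for (i = 0; i < n; i++) {
		if (maps[i].vm_flags & VM_LOCKED_SKETCH) {
			*vm_flags |= VM_LOCKED_SKETCH;
			break;	/* break early from loop */
		}
		if (maps[i].pte_young)
			referenced++;
		*vm_flags |= maps[i].vm_flags;
	}
	if (*vm_flags & VM_LOCKED_SKETCH)
		referenced = 0;	/* close the hole: mlocked never looks referenced */
	return referenced;
}
```

With two mappings, the first with a young pte and the second mlocked, the sketch returns 0, which is exactly the rare case described above.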
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-19 12:01 ` Wu Fengguang
@ 2009-08-19 12:05 ` KOSAKI Motohiro
-1 siblings, 0 replies; 243+ messages in thread
From: KOSAKI Motohiro @ 2009-08-19 12:05 UTC (permalink / raw)
To: Wu Fengguang
Cc: Rik van Riel, Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu,
Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton,
Christoph Lameter, Mel Gorman, LKML, linux-mm
>> page_referenced_file?
>> I think we should change page_referenced().
>
> Yeah, good catch.
>
>>
>> Instead, How about this?
>> ==============================================
>>
>> Subject: [PATCH] mm: stop circulating of referenced mlocked pages
>>
>> Currently, mlock() systemcall doesn't gurantee to mark the page PG_Mlocked
>
> mark PG_mlocked
>
>> because some race prevent page grabbing.
>> In that case, instead vmscan move the page to unevictable lru.
>>
>> However, Recently Wu Fengguang pointed out current vmscan logic isn't so
>> efficient.
>> mlocked page can move circulatly active and inactive list because
>> vmscan check the page is referenced _before_ cull mlocked page.
>>
>> Plus, vmscan should mark PG_Mlocked when cull mlocked page.
>
> PG_mlocked
>
>> Otherwise vm stastics show strange number.
>>
>> This patch does that.
>
> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Thanks.
>> Index: b/mm/rmap.c
>> ===================================================================
>> --- a/mm/rmap.c 2009-08-18 19:48:14.000000000 +0900
>> +++ b/mm/rmap.c 2009-08-18 23:47:34.000000000 +0900
>> @@ -362,7 +362,9 @@ static int page_referenced_one(struct pa
>> * unevictable list.
>> */
>> if (vma->vm_flags & VM_LOCKED) {
>> - *mapcount = 1; /* break early from loop */
>> + *mapcount = 1; /* break early from loop */
>> + *vm_flags |= VM_LOCKED; /* for prevent to move active list */
>
>> + try_set_page_mlocked(vma, page);
>
> Is that call really necessary?
Why? I didn't catch your point.
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-18 15:57 ` KOSAKI Motohiro
@ 2009-08-19 12:08 ` Wu Fengguang
-1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-19 12:08 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Rik van Riel, Johannes Weiner, Avi Kivity, Andrea Arcangeli,
Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm
On Tue, Aug 18, 2009 at 11:57:52PM +0800, KOSAKI Motohiro wrote:
>
> > > This one of the reasons why we unconditionally deactivate
> > > the active anon pages, and do background scanning of the
> > > active anon list when reclaiming page cache pages.
> > >
> > > We want to always move some pages to the inactive anon
> > > list, so it does not get too small.
> >
> > Right, the current code tries to pull inactive list out of
> > smallish-size state as long as there are vmscan activities.
> >
> > However there is a possible (and tricky) hole: mem cgroups
> > don't do batched vmscan. shrink_zone() may call shrink_list()
> > with nr_to_scan=1, in which case shrink_list() _still_ calls
> > isolate_pages() with the much larger SWAP_CLUSTER_MAX.
> >
> > It effectively scales up the inactive list scan rate by 10 times when
> > it is still small, and may thus prevent it from ever growing.
> >
> > In that case, LRU becomes FIFO.
> >
> > Jeff, can you confirm if the mem cgroup's inactive list is small?
> > If so, this patch should help.
>
> This patch does the right thing.
> However, let me explain why I and the memcg folks didn't do that in the past.
>
> Strangely, some memcg struct declarations are hidden in *.c, so we can't
> make inline functions, and we hesitated to introduce so much function-call
> overhead.
>
> So, can we move some memcg structure declarations to *.h and make
> mem_cgroup_get_saved_scan() an inline function?
Good idea, I'll do that btw.
>
> >
> > Thanks,
> > Fengguang
> > ---
> >
> > mm: do batched scans for mem_cgroup
> >
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> > include/linux/memcontrol.h | 3 +++
> > mm/memcontrol.c | 12 ++++++++++++
> > mm/vmscan.c | 9 +++++----
> > 3 files changed, 20 insertions(+), 4 deletions(-)
> >
> > --- linux.orig/include/linux/memcontrol.h 2009-08-15 13:12:49.000000000 +0800
> > +++ linux/include/linux/memcontrol.h 2009-08-15 13:18:13.000000000 +0800
> > @@ -98,6 +98,9 @@ int mem_cgroup_inactive_file_is_low(stru
> > unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
> > struct zone *zone,
> > enum lru_list lru);
> > +unsigned long *mem_cgroup_get_saved_scan(struct mem_cgroup *memcg,
> > + struct zone *zone,
> > + enum lru_list lru);
> > struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
> > struct zone *zone);
> > struct zone_reclaim_stat*
> > --- linux.orig/mm/memcontrol.c 2009-08-15 13:07:34.000000000 +0800
> > +++ linux/mm/memcontrol.c 2009-08-15 13:17:56.000000000 +0800
> > @@ -115,6 +115,7 @@ struct mem_cgroup_per_zone {
> > */
> > struct list_head lists[NR_LRU_LISTS];
> > unsigned long count[NR_LRU_LISTS];
> > + unsigned long nr_saved_scan[NR_LRU_LISTS];
> >
> > struct zone_reclaim_stat reclaim_stat;
> > };
> > @@ -597,6 +598,17 @@ unsigned long mem_cgroup_zone_nr_pages(s
> > return MEM_CGROUP_ZSTAT(mz, lru);
> > }
> >
> > +unsigned long *mem_cgroup_get_saved_scan(struct mem_cgroup *memcg,
> > + struct zone *zone,
> > + enum lru_list lru)
> > +{
> > + int nid = zone->zone_pgdat->node_id;
> > + int zid = zone_idx(zone);
> > + struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
> > +
> > + return &mz->nr_saved_scan[lru];
> > +}
>
> I think this function is a bit strange.
> shrink_zone doesn't hold any lock, so shouldn't we care about the memcg removal race?
We've been doing that racy computation for a long time. It may hurt
balancing a bit, but balanced vmscan was never perfect, nor ever
required to be. So let's just go with it?
Thanks,
Fengguang
>
> > +
> > struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
> > struct zone *zone)
> > {
> > --- linux.orig/mm/vmscan.c 2009-08-15 13:04:54.000000000 +0800
> > +++ linux/mm/vmscan.c 2009-08-15 13:19:03.000000000 +0800
> > @@ -1534,6 +1534,7 @@ static void shrink_zone(int priority, st
> > for_each_evictable_lru(l) {
> > int file = is_file_lru(l);
> > unsigned long scan;
> > + unsigned long *saved_scan;
> >
> > scan = zone_nr_pages(zone, sc, l);
> > if (priority || noswap) {
> > @@ -1541,11 +1542,11 @@ static void shrink_zone(int priority, st
> > scan = (scan * percent[file]) / 100;
> > }
> > if (scanning_global_lru(sc))
> > - nr[l] = nr_scan_try_batch(scan,
> > - &zone->lru[l].nr_saved_scan,
> > - swap_cluster_max);
> > + saved_scan = &zone->lru[l].nr_saved_scan;
> > else
> > - nr[l] = scan;
> > + saved_scan = mem_cgroup_get_saved_scan(sc->mem_cgroup,
> > + zone, l);
> > + nr[l] = nr_scan_try_batch(scan, saved_scan, swap_cluster_max);
> > }
> >
> > while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
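The batching being extended to memcg above relies on the existing nr_scan_try_batch() helper, which can be sketched roughly as follows (simplified from the mm/vmscan.c of that era, compiled standalone): small scan requests accumulate in *nr_saved_scan, and work is released only in full swap_cluster_max batches, so a memcg round asking for nr_to_scan=1 no longer forces a SWAP_CLUSTER_MAX-sized isolation from a small inactive list.

```c
#include <assert.h>

/*
 * Sketch of nr_scan_try_batch(): accumulate small scan requests and
 * release them only in full batches. With the patch above, memcg
 * reclaim also gets a per-zone per-lru nr_saved_scan, so tiny
 * nr_to_scan values no longer over-scan a small inactive list.
 */
unsigned long nr_scan_try_batch(unsigned long nr_to_scan,
				unsigned long *nr_saved_scan,
				unsigned long swap_cluster_max)
{
	unsigned long nr = *nr_saved_scan + nr_to_scan;

	if (nr >= swap_cluster_max)
		*nr_saved_scan = 0;	/* release the accumulated batch */
	else {
		*nr_saved_scan = nr;	/* keep saving, scan nothing yet */
		nr = 0;
	}
	return nr;
}
```

For example, 31 consecutive requests of 1 page with swap_cluster_max=32 all return 0 while the deficit accumulates; the 32nd request releases the whole batch of 32 at once.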
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-19 12:05 ` KOSAKI Motohiro
@ 2009-08-19 12:10 ` Wu Fengguang
-1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-19 12:10 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Rik van Riel, Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu,
Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton,
Christoph Lameter, Mel Gorman, LKML, linux-mm
On Wed, Aug 19, 2009 at 08:05:19PM +0800, KOSAKI Motohiro wrote:
> >> page_referenced_file?
> >> I think we should change page_referenced().
> >
> > Yeah, good catch.
> >
> >>
> >> Instead, How about this?
> >> ==============================================
> >>
> >> Subject: [PATCH] mm: stop circulating of referenced mlocked pages
> >>
> >> Currently, mlock() systemcall doesn't gurantee to mark the page PG_Mlocked
> >
> > mark PG_mlocked
> >
> >> because some race prevent page grabbing.
> >> In that case, instead vmscan move the page to unevictable lru.
> >>
> >> However, Recently Wu Fengguang pointed out current vmscan logic isn't so
> >> efficient.
> >> mlocked page can move circulatly active and inactive list because
> >> vmscan check the page is referenced _before_ cull mlocked page.
> >>
> >> Plus, vmscan should mark PG_Mlocked when cull mlocked page.
> >
> > PG_mlocked
> >
> >> Otherwise vm stastics show strange number.
> >>
> >> This patch does that.
> >
> > Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
>
> Thanks.
>
>
>
> >> Index: b/mm/rmap.c
> >> ===================================================================
> >> --- a/mm/rmap.c 2009-08-18 19:48:14.000000000 +0900
> >> +++ b/mm/rmap.c 2009-08-18 23:47:34.000000000 +0900
> >> @@ -362,7 +362,9 @@ static int page_referenced_one(struct pa
> >> * unevictable list.
> >> */
> >> if (vma->vm_flags & VM_LOCKED) {
> >> - *mapcount = 1; /* break early from loop */
> >> + *mapcount = 1; /* break early from loop */
> >> + *vm_flags |= VM_LOCKED; /* for prevent to move active list */
> >
> >> + try_set_page_mlocked(vma, page);
> >
> > Is that call really necessary?
>
> Why? I didn't catch your point.
Because we'll eventually hit another try_set_page_mlocked() when
trying to unmap the page, i.e. it duplicates the other call you added
in this patch.
Thanks,
Fengguang
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-19 12:10 ` Wu Fengguang
@ 2009-08-19 12:25 ` Minchan Kim
-1 siblings, 0 replies; 243+ messages in thread
From: Minchan Kim @ 2009-08-19 12:25 UTC (permalink / raw)
To: Wu Fengguang
Cc: KOSAKI Motohiro, Rik van Riel, Jeff Dike, Avi Kivity,
Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm
On Wed, Aug 19, 2009 at 9:10 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> On Wed, Aug 19, 2009 at 08:05:19PM +0800, KOSAKI Motohiro wrote:
>> >> page_referenced_file?
>> >> I think we should change page_referenced().
>> >
>> > Yeah, good catch.
>> >
>> >>
>> >> Instead, How about this?
>> >> ==============================================
>> >>
>> >> Subject: [PATCH] mm: stop circulating of referenced mlocked pages
>> >>
>> >> Currently, mlock() systemcall doesn't gurantee to mark the page PG_Mlocked
>> >
>> > mark PG_mlocked
>> >
>> >> because some race prevent page grabbing.
>> >> In that case, instead vmscan move the page to unevictable lru.
>> >>
>> >> However, Recently Wu Fengguang pointed out current vmscan logic isn't so
>> >> efficient.
>> >> mlocked page can move circulatly active and inactive list because
>> >> vmscan check the page is referenced _before_ cull mlocked page.
>> >>
>> >> Plus, vmscan should mark PG_Mlocked when cull mlocked page.
>> >
>> > PG_mlocked
>> >
>> >> Otherwise vm stastics show strange number.
>> >>
>> >> This patch does that.
>> >
>> > Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
>>
>> Thanks.
>>
>>
>>
>> >> Index: b/mm/rmap.c
>> >> ===================================================================
>> >> --- a/mm/rmap.c 2009-08-18 19:48:14.000000000 +0900
>> >> +++ b/mm/rmap.c 2009-08-18 23:47:34.000000000 +0900
>> >> @@ -362,7 +362,9 @@ static int page_referenced_one(struct pa
>> >> * unevictable list.
>> >> */
>> >> if (vma->vm_flags & VM_LOCKED) {
>> >> - *mapcount = 1; /* break early from loop */
>> >> + *mapcount = 1; /* break early from loop */
>> >> + *vm_flags |= VM_LOCKED; /* for prevent to move active list */
>> >
>> >> + try_set_page_mlocked(vma, page);
>> >
>> > Is that call really necessary?
>>
>> Why? I didn't catch your point.
>
> Because we'll eventually hit another try_set_page_mlocked() when
> trying to unmap the page. Ie. duplicated with another call you added
> in this patch.
Yes, we don't have to call it, and that makes the patch simpler.
I already sent a patch yesterday:
http://marc.info/?l=linux-mm&m=125059325722370&w=2
I think it's simpler than KOSAKI's idea.
Is there any problem with my patch?
>
> Thanks,
> Fengguang
--
Kind regards,
Minchan Kim
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-19 12:25 ` Minchan Kim
@ 2009-08-19 13:19 ` KOSAKI Motohiro
-1 siblings, 0 replies; 243+ messages in thread
From: KOSAKI Motohiro @ 2009-08-19 13:19 UTC (permalink / raw)
To: Minchan Kim
Cc: Wu Fengguang, Rik van Riel, Jeff Dike, Avi Kivity,
Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm
2009/8/19 Minchan Kim <minchan.kim@gmail.com>:
> On Wed, Aug 19, 2009 at 9:10 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
>> On Wed, Aug 19, 2009 at 08:05:19PM +0800, KOSAKI Motohiro wrote:
>>> >> page_referenced_file?
>>> >> I think we should change page_referenced().
>>> >
>>> > Yeah, good catch.
>>> >
>>> >>
>>> >> Instead, How about this?
>>> >> ==============================================
>>> >>
>>> >> Subject: [PATCH] mm: stop circulating of referenced mlocked pages
>>> >>
>>> >> Currently, the mlock() system call doesn't guarantee to mark the page PG_Mlocked
>>> >
>>> > mark PG_mlocked
>>> >
>>> >> because a race can prevent grabbing the page.
>>> >> In that case, vmscan moves the page to the unevictable LRU instead.
>>> >>
>>> >> However, Wu Fengguang recently pointed out that the current vmscan
>>> >> logic isn't very efficient: an mlocked page can circulate between the
>>> >> active and inactive lists, because vmscan checks whether the page is
>>> >> referenced _before_ culling mlocked pages.
>>> >>
>>> >> Plus, vmscan should mark PG_Mlocked when it culls an mlocked page.
>>> >
>>> > PG_mlocked
>>> >
>>> >> Otherwise the VM statistics show strange numbers.
>>> >>
>>> >> This patch does that.
>>> >
>>> > Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
>>>
>>> Thanks.
>>>
>>>
>>>
>>> >> Index: b/mm/rmap.c
>>> >> ===================================================================
>>> >> --- a/mm/rmap.c 2009-08-18 19:48:14.000000000 +0900
>>> >> +++ b/mm/rmap.c 2009-08-18 23:47:34.000000000 +0900
>>> >> @@ -362,7 +362,9 @@ static int page_referenced_one(struct pa
>>> >> * unevictable list.
>>> >> */
>>> >> if (vma->vm_flags & VM_LOCKED) {
>>> >> - *mapcount = 1; /* break early from loop */
>>> >> + *mapcount = 1; /* break early from loop */
>>> >> + *vm_flags |= VM_LOCKED; /* for prevent to move active list */
>>> >
>>> >> + try_set_page_mlocked(vma, page);
>>> >
>>> > That call is not absolutely necessary?
>>>
>>> Why? I haven't caught your point.
>>
>> Because we'll eventually hit another try_set_page_mlocked() when
>> trying to unmap the page. I.e. it duplicates the other call you added
>> in this patch.
Correct.
> Yes, we don't have to call it, which keeps the patch simpler.
> I already sent a patch yesterday:
>
> http://marc.info/?l=linux-mm&m=125059325722370&w=2
>
> I think it's simpler than KOSAKI's idea.
> Is there any problem with my patch?
Hmm, I think:

1. In any case, we need to turn on PG_mlocked.
2. PG_mlocked prevents a livelock, because the page_evictable() check
   runs very early in shrink_page_list().
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-19 12:25 ` Minchan Kim
@ 2009-08-19 13:24 ` Wu Fengguang
-1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-19 13:24 UTC (permalink / raw)
To: Minchan Kim
Cc: KOSAKI Motohiro, Rik van Riel, Jeff Dike, Avi Kivity,
Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm
On Wed, Aug 19, 2009 at 08:25:56PM +0800, Minchan Kim wrote:
> On Wed, Aug 19, 2009 at 9:10 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> > On Wed, Aug 19, 2009 at 08:05:19PM +0800, KOSAKI Motohiro wrote:
> >> >> page_referenced_file?
> >> >> I think we should change page_referenced().
> >> >
> >> > Yeah, good catch.
> >> >
> >> >>
> >> >> Instead, How about this?
> >> >> ==============================================
> >> >>
> >> >> Subject: [PATCH] mm: stop circulating of referenced mlocked pages
> >> >>
> >> >> Currently, the mlock() system call doesn't guarantee to mark the page PG_Mlocked
> >> >
> >> > mark PG_mlocked
> >> >
> >> >> because a race can prevent grabbing the page.
> >> >> In that case, vmscan moves the page to the unevictable LRU instead.
> >> >>
> >> >> However, Wu Fengguang recently pointed out that the current vmscan
> >> >> logic isn't very efficient: an mlocked page can circulate between the
> >> >> active and inactive lists, because vmscan checks whether the page is
> >> >> referenced _before_ culling mlocked pages.
> >> >>
> >> >> Plus, vmscan should mark PG_Mlocked when it culls an mlocked page.
> >> >
> >> > PG_mlocked
> >> >
> >> >> Otherwise the VM statistics show strange numbers.
> >> >>
> >> >> This patch does that.
> >> >
> >> > Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
> >>
> >> Thanks.
> >>
> >>
> >>
> >> >> Index: b/mm/rmap.c
> >> >> ===================================================================
> >> >> --- a/mm/rmap.c 2009-08-18 19:48:14.000000000 +0900
> >> >> +++ b/mm/rmap.c 2009-08-18 23:47:34.000000000 +0900
> >> >> @@ -362,7 +362,9 @@ static int page_referenced_one(struct pa
> >> >> * unevictable list.
> >> >> */
> >> >> if (vma->vm_flags & VM_LOCKED) {
> >> >> - *mapcount = 1; /* break early from loop */
> >> >> + *mapcount = 1; /* break early from loop */
> >> >> + *vm_flags |= VM_LOCKED; /* for prevent to move active list */
> >> >
> >> >> + try_set_page_mlocked(vma, page);
> >> >
> >> > That call is not absolutely necessary?
> >>
> >> Why? I haven't caught your point.
> >
> > Because we'll eventually hit another try_set_page_mlocked() when
> > trying to unmap the page. I.e. it duplicates the other call you added
> > in this patch.
>
> Yes, we don't have to call it, which keeps the patch simpler.
> I already sent a patch yesterday:
>
> http://marc.info/?l=linux-mm&m=125059325722370&w=2
>
> I think it's simpler than KOSAKI's idea.
> Is there any problem with my patch?
No, IMHO your patch is simple and good, while KOSAKI's is more
complete :)
- the try_set_page_mlocked() rename is suitable
- the call to try_set_page_mlocked() is necessary in try_to_unmap()
- the "if (VM_LOCKED) referenced = 0" in page_referenced() could
  cover both the active and inactive vmscan paths
I did like your proposed
if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
- referenced && page_mapping_inuse(page))
+ referenced && page_mapping_inuse(page)
+ && !(vm_flags & VM_LOCKED))
goto activate_locked;
which looks more intuitive and less confusing.
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-19 13:19 ` KOSAKI Motohiro
@ 2009-08-19 13:28 ` Minchan Kim
-1 siblings, 0 replies; 243+ messages in thread
From: Minchan Kim @ 2009-08-19 13:28 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Wu Fengguang, Rik van Riel, Jeff Dike, Avi Kivity,
Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm
On Wed, Aug 19, 2009 at 10:19 PM, KOSAKI Motohiro<kosaki.motohiro@jp.fujitsu.com> wrote:
> 2009/8/19 Minchan Kim <minchan.kim@gmail.com>:
>> On Wed, Aug 19, 2009 at 9:10 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
>>> On Wed, Aug 19, 2009 at 08:05:19PM +0800, KOSAKI Motohiro wrote:
>>>> >> page_referenced_file?
>>>> >> I think we should change page_referenced().
>>>> >
>>>> > Yeah, good catch.
>>>> >
>>>> >>
>>>> >> Instead, How about this?
>>>> >> ==============================================
>>>> >>
>>>> >> Subject: [PATCH] mm: stop circulating of referenced mlocked pages
>>>> >>
>>>> >> Currently, the mlock() system call doesn't guarantee to mark the page PG_Mlocked
>>>> >
>>>> > mark PG_mlocked
>>>> >
>>>> >> because a race can prevent grabbing the page.
>>>> >> In that case, vmscan moves the page to the unevictable LRU instead.
>>>> >>
>>>> >> However, Wu Fengguang recently pointed out that the current vmscan
>>>> >> logic isn't very efficient: an mlocked page can circulate between the
>>>> >> active and inactive lists, because vmscan checks whether the page is
>>>> >> referenced _before_ culling mlocked pages.
>>>> >>
>>>> >> Plus, vmscan should mark PG_Mlocked when it culls an mlocked page.
>>>> >
>>>> > PG_mlocked
>>>> >
>>>> >> Otherwise the VM statistics show strange numbers.
>>>> >>
>>>> >> This patch does that.
>>>> >
>>>> > Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
>>>>
>>>> Thanks.
>>>>
>>>>
>>>>
>>>> >> Index: b/mm/rmap.c
>>>> >> ===================================================================
>>>> >> --- a/mm/rmap.c 2009-08-18 19:48:14.000000000 +0900
>>>> >> +++ b/mm/rmap.c 2009-08-18 23:47:34.000000000 +0900
>>>> >> @@ -362,7 +362,9 @@ static int page_referenced_one(struct pa
>>>> >> * unevictable list.
>>>> >> */
>>>> >> if (vma->vm_flags & VM_LOCKED) {
>>>> >> - *mapcount = 1; /* break early from loop */
>>>> >> + *mapcount = 1; /* break early from loop */
>>>> >> + *vm_flags |= VM_LOCKED; /* for prevent to move active list */
>>>> >
>>>> >> + try_set_page_mlocked(vma, page);
>>>> >
>>>> > That call is not absolutely necessary?
>>>>
>>>> Why? I haven't caught your point.
>>>
>>> Because we'll eventually hit another try_set_page_mlocked() when
>>> trying to unmap the page. I.e. it duplicates the other call you added
>>> in this patch.
>
> Correct.
>
>
>> Yes, we don't have to call it, which keeps the patch simpler.
>> I already sent a patch yesterday:
>>
>> http://marc.info/?l=linux-mm&m=125059325722370&w=2
>>
>> I think it's simpler than KOSAKI's idea.
>> Is there any problem with my patch?
>
> Hmm, I think
>
> 1. In any case, we need to turn on PG_mlocked.
Let me add my patch again, to explain.
diff --git a/mm/rmap.c b/mm/rmap.c
index ed63894..283266c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -358,6 +358,7 @@ static int page_referenced_one(struct page *page,
*/
if (vma->vm_flags & VM_LOCKED) {
*mapcount = 1; /* break early from loop */
+ *vm_flags |= VM_LOCKED;
goto out_unmap;
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d224b28..d156e1d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -632,7 +632,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
sc->mem_cgroup, &vm_flags);
/* In active use or really unfreeable? Activate it. */
if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
- referenced && page_mapping_inuse(page))
+ referenced && page_mapping_inuse(page)
+ && !(vm_flags & VM_LOCKED))
goto activate_locked;
With this check, the page can reach try_to_unmap() after
page_referenced() in shrink_page_list(). At that point PG_mlocked will
be set.
> 2. PG_mlocked prevents a livelock, because the page_evictable() check
> runs very early in shrink_page_list().
--
Kind regards,
Minchan Kim
^ permalink raw reply related [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-19 13:24 ` Wu Fengguang
@ 2009-08-19 13:38 ` Minchan Kim
-1 siblings, 0 replies; 243+ messages in thread
From: Minchan Kim @ 2009-08-19 13:38 UTC (permalink / raw)
To: Wu Fengguang
Cc: KOSAKI Motohiro, Rik van Riel, Jeff Dike, Avi Kivity,
Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm
On Wed, Aug 19, 2009 at 10:24 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> On Wed, Aug 19, 2009 at 08:25:56PM +0800, Minchan Kim wrote:
>> On Wed, Aug 19, 2009 at 9:10 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
>> > On Wed, Aug 19, 2009 at 08:05:19PM +0800, KOSAKI Motohiro wrote:
>> >> >> page_referenced_file?
>> >> >> I think we should change page_referenced().
>> >> >
>> >> > Yeah, good catch.
>> >> >
>> >> >>
>> >> >> Instead, How about this?
>> >> >> ==============================================
>> >> >>
>> >> >> Subject: [PATCH] mm: stop circulating of referenced mlocked pages
>> >> >>
>> >> >> Currently, the mlock() system call doesn't guarantee to mark the page PG_Mlocked
>> >> >
>> >> > mark PG_mlocked
>> >> >
>> >> >> because a race can prevent grabbing the page.
>> >> >> In that case, vmscan moves the page to the unevictable LRU instead.
>> >> >>
>> >> >> However, Wu Fengguang recently pointed out that the current vmscan
>> >> >> logic isn't very efficient: an mlocked page can circulate between the
>> >> >> active and inactive lists, because vmscan checks whether the page is
>> >> >> referenced _before_ culling mlocked pages.
>> >> >>
>> >> >> Plus, vmscan should mark PG_Mlocked when it culls an mlocked page.
>> >> >
>> >> > PG_mlocked
>> >> >
>> >> >> Otherwise the VM statistics show strange numbers.
>> >> >>
>> >> >> This patch does that.
>> >> >
>> >> > Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
>> >>
>> >> Thanks.
>> >>
>> >>
>> >>
>> >> >> Index: b/mm/rmap.c
>> >> >> ===================================================================
>> >> >> --- a/mm/rmap.c 2009-08-18 19:48:14.000000000 +0900
>> >> >> +++ b/mm/rmap.c 2009-08-18 23:47:34.000000000 +0900
>> >> >> @@ -362,7 +362,9 @@ static int page_referenced_one(struct pa
>> >> >> * unevictable list.
>> >> >> */
>> >> >> if (vma->vm_flags & VM_LOCKED) {
>> >> >> - *mapcount = 1; /* break early from loop */
>> >> >> + *mapcount = 1; /* break early from loop */
>> >> >> + *vm_flags |= VM_LOCKED; /* for prevent to move active list */
>> >> >
>> >> >> + try_set_page_mlocked(vma, page);
>> >> >
>> >> > That call is not absolutely necessary?
>> >>
>> >> Why? I haven't caught your point.
>> >
>> > Because we'll eventually hit another try_set_page_mlocked() when
>> > trying to unmap the page. I.e. it duplicates the other call you added
>> > in this patch.
>>
>> Yes, we don't have to call it, which keeps the patch simpler.
>> I already sent a patch yesterday:
>>
>> http://marc.info/?l=linux-mm&m=125059325722370&w=2
>>
>> I think it's simpler than KOSAKI's idea.
>> Is there any problem with my patch?
>
> No, IMHO your patch is simple and good, while KOSAKI's is more
> complete :)
>
> - the try_set_page_mlocked() rename is suitable
> - the call to try_set_page_mlocked() is necessary in try_to_unmap()
We don't need the try_set_page_mlocked() call in try_to_unmap().
That's because try_to_unmap_xxx() will call try_to_mlock_page() if the
page belongs to any VM_LOCKED vma. Eventually, the page can move to
the unevictable list.
> - the "if (VM_LOCKED) referenced = 0" in page_referenced() could
>   cover both the active and inactive vmscan paths
The sooner we set PG_mlocked on the page, the more unnecessary vmscan
cost we save moving it from the active list to the inactive list. But
I think that's a rare case, so there would be few such pages, and the
overhead will not be big.
As far as I know, having vmscan rescue pages that lost the isolation
race was Lee's design.
But as you pointed out, it has a bug: vmscan can't rescue the page
because it never reaches try_to_unmap().
So I think this approach is proper. :)
> I did like your proposed
>
> if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
> - referenced && page_mapping_inuse(page))
> + referenced && page_mapping_inuse(page)
> + && !(vm_flags & VM_LOCKED))
> goto activate_locked;
>
> which looks more intuitive and less confusing.
>
> Thanks,
> Fengguang
>
--
Kind regards,
Minchan Kim
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
@ 2009-08-19 13:38 ` Minchan Kim
0 siblings, 0 replies; 243+ messages in thread
From: Minchan Kim @ 2009-08-19 13:38 UTC (permalink / raw)
To: Wu Fengguang
Cc: KOSAKI Motohiro, Rik van Riel, Jeff Dike, Avi Kivity,
Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm
On Wed, Aug 19, 2009 at 10:24 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> On Wed, Aug 19, 2009 at 08:25:56PM +0800, Minchan Kim wrote:
>> On Wed, Aug 19, 2009 at 9:10 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
>> > On Wed, Aug 19, 2009 at 08:05:19PM +0800, KOSAKI Motohiro wrote:
>> >> >> page_referenced_file?
>> >> >> I think we should change page_referenced().
>> >> >
>> >> > Yeah, good catch.
>> >> >
>> >> >>
>> >> >> Instead, How about this?
>> >> >> ==============================================
>> >> >>
>> >> >> Subject: [PATCH] mm: stop circulating of referenced mlocked pages
>> >> >>
>> >> >> Currently, mlock() systemcall doesn't gurantee to mark the page PG_Mlocked
>> >> >
>> >> > mark PG_mlocked
>> >> >
>> >> >> because some race prevent page grabbing.
>> >> >> In that case, instead vmscan move the page to unevictable lru.
>> >> >>
>> >> >> However, Recently Wu Fengguang pointed out current vmscan logic isn't so
>> >> >> efficient.
>> >> >> mlocked page can move circulatly active and inactive list because
>> >> >> vmscan check the page is referenced _before_ cull mlocked page.
>> >> >>
>> >> >> Plus, vmscan should mark PG_Mlocked when cull mlocked page.
>> >> >
>> >> > PG_mlocked
>> >> >
>> >> >> Otherwise vm stastics show strange number.
>> >> >>
>> >> >> This patch does that.
>> >> >
>> >> > Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
>> >>
>> >> Thanks.
>> >>
>> >>
>> >>
>> >> >> Index: b/mm/rmap.c
>> >> >> ===================================================================
>> >> >> --- a/mm/rmap.c 2009-08-18 19:48:14.000000000 +0900
>> >> >> +++ b/mm/rmap.c 2009-08-18 23:47:34.000000000 +0900
>> >> >> @@ -362,7 +362,9 @@ static int page_referenced_one(struct pa
>> >> >> * unevictable list.
>> >> >> */
>> >> >> if (vma->vm_flags & VM_LOCKED) {
>> >> >> - *mapcount = 1; /* break early from loop */
>> >> >> + *mapcount = 1; /* break early from loop */
>> >> >> + *vm_flags |= VM_LOCKED; /* for prevent to move active list */
>> >> >
>> >> >> + try_set_page_mlocked(vma, page);
>> >> >
>> >> > That call is not absolutely necessary?
>> >>
>> >> Why? I haven't catch your point.
>> >
>> > Because we'll eventually hit another try_set_page_mlocked() when
>> > trying to unmap the page. Ie. duplicated with another call you added
>> > in this patch.
>>
>> Yes. we don't have to call it and we can make patch simple.
>> I already sent patch on yesterday.
>>
>> http://marc.info/?l=linux-mm&m=125059325722370&w=2
>>
>> I think It's more simple than KOSAKI's idea.
>> Is any problem in my patch ?
>
> No, IMHO your patch is simple and good, while KOSAKI's is more
> complete :)
>
> - the try_set_page_mlocked() rename is suitable
> - the call to try_set_page_mlocked() is necessary on try_to_unmap()
We don't need the try_set_page_mlocked() call in try_to_unmap().
That's because try_to_unmap_xxx() will call try_to_mlock_page() if the
page is included in any VM_LOCKED vma. Eventually, it can move the
page to the unevictable list.
> - the "if (VM_LOCKED) referenced = 0" in page_referenced() could
> cover both active/inactive vmscan
The sooner we set PG_mlocked on the page, the more unnecessary vmscan
cost we save moving it between the active and inactive lists. But I
think it's a rare case, so there would be few such pages.
So I think that will not be a big overhead.
As far as I know, having vmscan rescue pages that lose the isolation
race was Lee's design.
But as you pointed out, it has a bug: vmscan can't rescue the page
before reaching try_to_unmap().
So I think this approach is proper. :)
> I did like your proposed
>
> if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
> - referenced && page_mapping_inuse(page))
> + referenced && page_mapping_inuse(page)
> + && !(vm_flags & VM_LOCKED))
> goto activate_locked;
>
> which looks more intuitive and less confusing.
>
> Thanks,
> Fengguang
>
--
Kind regards,
Minchan Kim
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href="mailto:dont@kvack.org">email@kvack.org</a>
^ permalink raw reply [flat|nested] 243+ messages in thread
* [RFC] memcg: move definitions to .h and inline some functions
2009-08-18 15:57 ` KOSAKI Motohiro
@ 2009-08-19 13:40 ` Wu Fengguang
-1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-19 13:40 UTC (permalink / raw)
To: KOSAKI Motohiro, Balbir Singh, KAMEZAWA Hiroyuki
Cc: Rik van Riel, Johannes Weiner, Avi Kivity, Andrea Arcangeli,
Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm,
nishimura, lizf, menage
On Tue, Aug 18, 2009 at 11:57:52PM +0800, KOSAKI Motohiro wrote:
>
> > > This is one of the reasons why we unconditionally deactivate
> > > the active anon pages, and do background scanning of the
> > > active anon list when reclaiming page cache pages.
> > >
> > > We want to always move some pages to the inactive anon
> > > list, so it does not get too small.
> >
> > Right, the current code tries to pull inactive list out of
> > smallish-size state as long as there are vmscan activities.
> >
> > However there is a possible (and tricky) hole: mem cgroups
> > don't do batched vmscan. shrink_zone() may call shrink_list()
> > with nr_to_scan=1, in which case shrink_list() _still_ calls
> > isolate_pages() with the much larger SWAP_CLUSTER_MAX.
> >
> > It effectively scales up the inactive list scan rate by 10 times when
> > it is still small, and may thus prevent it from ever growing.
> >
> > In that case, LRU becomes FIFO.
> >
> > Jeff, can you confirm if the mem cgroup's inactive list is small?
> > If so, this patch should help.
>
> This patch does the right thing.
> However, let me explain why I and the memcg folks didn't do this earlier.
>
> Strangely, some memcg struct declarations are hidden in *.c. Thus we can't
> make inline functions, and we hesitated to introduce the overhead of many
> function calls.
>
> So, can we move some memcg structure declarations to *.h and make
> mem_cgroup_get_saved_scan() an inline function?
OK here it is. I had to move big chunks to make it compile, and it
did reduce a dozen lines of code :)
Is this big copy&paste acceptable? (memcg developers CCed).
Thanks,
Fengguang
---
memcg: move definitions to .h and inline some functions
Move these definitions so the accessors can become inline functions.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/linux/memcontrol.h | 154 ++++++++++++++++++++++++++++++-----
mm/memcontrol.c | 131 -----------------------------
2 files changed, 134 insertions(+), 151 deletions(-)
--- linux.orig/include/linux/memcontrol.h 2009-08-19 20:18:55.000000000 +0800
+++ linux/include/linux/memcontrol.h 2009-08-19 20:51:06.000000000 +0800
@@ -20,11 +20,144 @@
#ifndef _LINUX_MEMCONTROL_H
#define _LINUX_MEMCONTROL_H
#include <linux/cgroup.h>
-struct mem_cgroup;
+#include <linux/res_counter.h>
struct page_cgroup;
struct page;
struct mm_struct;
+/*
+ * Statistics for memory cgroup.
+ */
+enum mem_cgroup_stat_index {
+ /*
+ * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
+ */
+ MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
+ MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */
+ MEM_CGROUP_STAT_MAPPED_FILE, /* # of pages charged as file rss */
+ MEM_CGROUP_STAT_PGPGIN_COUNT, /* # of pages paged in */
+ MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */
+
+ MEM_CGROUP_STAT_NSTATS,
+};
+
+struct mem_cgroup_stat_cpu {
+ s64 count[MEM_CGROUP_STAT_NSTATS];
+} ____cacheline_aligned_in_smp;
+
+struct mem_cgroup_stat {
+ struct mem_cgroup_stat_cpu cpustat[0];
+};
+
+/*
+ * per-zone information in memory controller.
+ */
+struct mem_cgroup_per_zone {
+ /*
+ * spin_lock to protect the per cgroup LRU
+ */
+ struct list_head lists[NR_LRU_LISTS];
+ unsigned long count[NR_LRU_LISTS];
+
+ struct zone_reclaim_stat reclaim_stat;
+};
+/* Macro for accessing counter */
+#define MEM_CGROUP_ZSTAT(mz, idx) ((mz)->count[(idx)])
+
+struct mem_cgroup_per_node {
+ struct mem_cgroup_per_zone zoneinfo[MAX_NR_ZONES];
+};
+
+struct mem_cgroup_lru_info {
+ struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
+};
+
+/*
+ * The memory controller data structure. The memory controller controls both
+ * page cache and RSS per cgroup. We would eventually like to provide
+ * statistics based on the statistics developed by Rik Van Riel for clock-pro,
+ * to help the administrator determine what knobs to tune.
+ *
+ * TODO: Add a water mark for the memory controller. Reclaim will begin when
+ * we hit the water mark. May be even add a low water mark, such that
+ * no reclaim occurs from a cgroup at it's low water mark, this is
+ * a feature that will be implemented much later in the future.
+ */
+struct mem_cgroup {
+ struct cgroup_subsys_state css;
+ /*
+ * the counter to account for memory usage
+ */
+ struct res_counter res;
+ /*
+ * the counter to account for mem+swap usage.
+ */
+ struct res_counter memsw;
+ /*
+ * Per cgroup active and inactive list, similar to the
+ * per zone LRU lists.
+ */
+ struct mem_cgroup_lru_info info;
+
+ /*
+ protect against reclaim related member.
+ */
+ spinlock_t reclaim_param_lock;
+
+ int prev_priority; /* for recording reclaim priority */
+
+ /*
+ * While reclaiming in a hiearchy, we cache the last child we
+ * reclaimed from.
+ */
+ int last_scanned_child;
+ /*
+ * Should the accounting and control be hierarchical, per subtree?
+ */
+ bool use_hierarchy;
+ unsigned long last_oom_jiffies;
+ atomic_t refcnt;
+
+ unsigned int swappiness;
+
+ /* set when res.limit == memsw.limit */
+ bool memsw_is_minimum;
+
+ /*
+ * statistics. This must be placed at the end of memcg.
+ */
+ struct mem_cgroup_stat stat;
+};
+
+static inline struct mem_cgroup_per_zone *
+mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
+{
+ return &mem->info.nodeinfo[nid]->zoneinfo[zid];
+}
+
+static inline unsigned long
+mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
+ struct zone *zone,
+ enum lru_list lru)
+{
+ int nid = zone->zone_pgdat->node_id;
+ int zid = zone_idx(zone);
+ struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
+
+ return MEM_CGROUP_ZSTAT(mz, lru);
+}
+
+static inline struct zone_reclaim_stat *
+mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg, struct zone *zone)
+{
+ int nid = zone->zone_pgdat->node_id;
+ int zid = zone_idx(zone);
+ struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
+
+ return &mz->reclaim_stat;
+}
+
+
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
/*
* All "charge" functions with gfp_mask should use GFP_KERNEL or
@@ -95,11 +228,6 @@ extern void mem_cgroup_record_reclaim_pr
int priority);
int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
-unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
- struct zone *zone,
- enum lru_list lru);
-struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
- struct zone *zone);
struct zone_reclaim_stat*
mem_cgroup_get_reclaim_stat_from_page(struct page *page);
extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
@@ -246,20 +374,6 @@ mem_cgroup_inactive_file_is_low(struct m
return 1;
}
-static inline unsigned long
-mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg, struct zone *zone,
- enum lru_list lru)
-{
- return 0;
-}
-
-
-static inline struct zone_reclaim_stat*
-mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg, struct zone *zone)
-{
- return NULL;
-}
-
static inline struct zone_reclaim_stat*
mem_cgroup_get_reclaim_stat_from_page(struct page *page)
{
--- linux.orig/mm/memcontrol.c 2009-08-19 20:14:56.000000000 +0800
+++ linux/mm/memcontrol.c 2009-08-19 20:46:50.000000000 +0800
@@ -55,30 +55,6 @@ static int really_do_swap_account __init
static DEFINE_MUTEX(memcg_tasklist); /* can be hold under cgroup_mutex */
/*
- * Statistics for memory cgroup.
- */
-enum mem_cgroup_stat_index {
- /*
- * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
- */
- MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
- MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */
- MEM_CGROUP_STAT_MAPPED_FILE, /* # of pages charged as file rss */
- MEM_CGROUP_STAT_PGPGIN_COUNT, /* # of pages paged in */
- MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */
-
- MEM_CGROUP_STAT_NSTATS,
-};
-
-struct mem_cgroup_stat_cpu {
- s64 count[MEM_CGROUP_STAT_NSTATS];
-} ____cacheline_aligned_in_smp;
-
-struct mem_cgroup_stat {
- struct mem_cgroup_stat_cpu cpustat[0];
-};
-
-/*
* For accounting under irq disable, no need for increment preempt count.
*/
static inline void __mem_cgroup_stat_add_safe(struct mem_cgroup_stat_cpu *stat,
@@ -106,86 +82,6 @@ static s64 mem_cgroup_local_usage(struct
return ret;
}
-/*
- * per-zone information in memory controller.
- */
-struct mem_cgroup_per_zone {
- /*
- * spin_lock to protect the per cgroup LRU
- */
- struct list_head lists[NR_LRU_LISTS];
- unsigned long count[NR_LRU_LISTS];
-
- struct zone_reclaim_stat reclaim_stat;
-};
-/* Macro for accessing counter */
-#define MEM_CGROUP_ZSTAT(mz, idx) ((mz)->count[(idx)])
-
-struct mem_cgroup_per_node {
- struct mem_cgroup_per_zone zoneinfo[MAX_NR_ZONES];
-};
-
-struct mem_cgroup_lru_info {
- struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
-};
-
-/*
- * The memory controller data structure. The memory controller controls both
- * page cache and RSS per cgroup. We would eventually like to provide
- * statistics based on the statistics developed by Rik Van Riel for clock-pro,
- * to help the administrator determine what knobs to tune.
- *
- * TODO: Add a water mark for the memory controller. Reclaim will begin when
- * we hit the water mark. May be even add a low water mark, such that
- * no reclaim occurs from a cgroup at it's low water mark, this is
- * a feature that will be implemented much later in the future.
- */
-struct mem_cgroup {
- struct cgroup_subsys_state css;
- /*
- * the counter to account for memory usage
- */
- struct res_counter res;
- /*
- * the counter to account for mem+swap usage.
- */
- struct res_counter memsw;
- /*
- * Per cgroup active and inactive list, similar to the
- * per zone LRU lists.
- */
- struct mem_cgroup_lru_info info;
-
- /*
- protect against reclaim related member.
- */
- spinlock_t reclaim_param_lock;
-
- int prev_priority; /* for recording reclaim priority */
-
- /*
- * While reclaiming in a hiearchy, we cache the last child we
- * reclaimed from.
- */
- int last_scanned_child;
- /*
- * Should the accounting and control be hierarchical, per subtree?
- */
- bool use_hierarchy;
- unsigned long last_oom_jiffies;
- atomic_t refcnt;
-
- unsigned int swappiness;
-
- /* set when res.limit == memsw.limit */
- bool memsw_is_minimum;
-
- /*
- * statistics. This must be placed at the end of memcg.
- */
- struct mem_cgroup_stat stat;
-};
-
enum charge_type {
MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
MEM_CGROUP_CHARGE_TYPE_MAPPED,
@@ -244,12 +140,6 @@ static void mem_cgroup_charge_statistics
}
static struct mem_cgroup_per_zone *
-mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
-{
- return &mem->info.nodeinfo[nid]->zoneinfo[zid];
-}
-
-static struct mem_cgroup_per_zone *
page_cgroup_zoneinfo(struct page_cgroup *pc)
{
struct mem_cgroup *mem = pc->mem_cgroup;
@@ -586,27 +476,6 @@ int mem_cgroup_inactive_file_is_low(stru
return (active > inactive);
}
-unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
- struct zone *zone,
- enum lru_list lru)
-{
- int nid = zone->zone_pgdat->node_id;
- int zid = zone_idx(zone);
- struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
-
- return MEM_CGROUP_ZSTAT(mz, lru);
-}
-
-struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
- struct zone *zone)
-{
- int nid = zone->zone_pgdat->node_id;
- int zid = zone_idx(zone);
- struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
-
- return &mz->reclaim_stat;
-}
-
struct zone_reclaim_stat *
mem_cgroup_get_reclaim_stat_from_page(struct page *page)
{
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-19 13:38 ` Minchan Kim
@ 2009-08-19 14:00 ` Wu Fengguang
-1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-19 14:00 UTC (permalink / raw)
To: Minchan Kim
Cc: KOSAKI Motohiro, Rik van Riel, Jeff Dike, Avi Kivity,
Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm
On Wed, Aug 19, 2009 at 09:38:05PM +0800, Minchan Kim wrote:
> On Wed, Aug 19, 2009 at 10:24 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> > On Wed, Aug 19, 2009 at 08:25:56PM +0800, Minchan Kim wrote:
> >> On Wed, Aug 19, 2009 at 9:10 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> >> > On Wed, Aug 19, 2009 at 08:05:19PM +0800, KOSAKI Motohiro wrote:
> >> >> >> page_referenced_file?
> >> >> >> I think we should change page_referenced().
> >> >> >
> >> >> > Yeah, good catch.
> >> >> >
> >> >> >>
> >> >> >> Instead, How about this?
> >> >> >> ==============================================
> >> >> >>
> >> >> >> Subject: [PATCH] mm: stop circulating of referenced mlocked pages
> >> >> >>
> >> >> >> Currently, mlock() systemcall doesn't gurantee to mark the page PG_Mlocked
> >> >> >
> >> >> > mark PG_mlocked
> >> >> >
> >> >> >> because some race prevent page grabbing.
> >> >> >> In that case, instead vmscan move the page to unevictable lru.
> >> >> >>
> >> >> >> However, Recently Wu Fengguang pointed out current vmscan logic isn't so
> >> >> >> efficient.
> >> >> >> mlocked page can move circulatly active and inactive list because
> >> >> >> vmscan check the page is referenced _before_ cull mlocked page.
> >> >> >>
> >> >> >> Plus, vmscan should mark PG_Mlocked when cull mlocked page.
> >> >> >
> >> >> > PG_mlocked
> >> >> >
> >> >> >> Otherwise vm stastics show strange number.
> >> >> >>
> >> >> >> This patch does that.
> >> >> >
> >> >> > Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
> >> >>
> >> >> Thanks.
> >> >>
> >> >>
> >> >>
> >> >> >> Index: b/mm/rmap.c
> >> >> >> ===================================================================
> >> >> >> --- a/mm/rmap.c 2009-08-18 19:48:14.000000000 +0900
> >> >> >> +++ b/mm/rmap.c 2009-08-18 23:47:34.000000000 +0900
> >> >> >> @@ -362,7 +362,9 @@ static int page_referenced_one(struct pa
> >> >> >> * unevictable list.
> >> >> >> */
> >> >> >> if (vma->vm_flags & VM_LOCKED) {
> >> >> >> - *mapcount = 1; /* break early from loop */
> >> >> >> + *mapcount = 1; /* break early from loop */
> >> >> >> + *vm_flags |= VM_LOCKED; /* for prevent to move active list */
> >> >> >
> >> >> >> + try_set_page_mlocked(vma, page);
> >> >> >
> >> >> > That call is not absolutely necessary?
> >> >>
> >> >> Why? I haven't catch your point.
> >> >
> >> > Because we'll eventually hit another try_set_page_mlocked() when
> >> > trying to unmap the page. Ie. duplicated with another call you added
> >> > in this patch.
> >>
> >> Yes. we don't have to call it and we can make patch simple.
> >> I already sent patch on yesterday.
> >>
> >> http://marc.info/?l=linux-mm&m=125059325722370&w=2
> >>
> >> I think It's more simple than KOSAKI's idea.
> >> Is any problem in my patch ?
> >
> > No, IMHO your patch is simple and good, while KOSAKI's is more
> > complete :)
> >
> > - the try_set_page_mlocked() rename is suitable
> > - the call to try_set_page_mlocked() is necessary on try_to_unmap()
>
> We don't need the try_set_page_mlocked() call in try_to_unmap().
> That's because try_to_unmap_xxx() will call try_to_mlock_page() if the
> page is included in any VM_LOCKED vma. Eventually, it can move the
> page to the unevictable list.
Yes, indeed!
> > - the "if (VM_LOCKED) referenced = 0" in page_referenced() could
> > cover both active/inactive vmscan
>
> The sooner we set PG_mlocked on the page, the more unnecessary vmscan
> cost we save moving it between the active and inactive lists. But I
> think it's a rare case, so there would be few such pages.
> So I think that will not be a big overhead.
The active list case can persist when the mlocked (but without
PG_mlocked) page is executable and referenced by 2+ processes. But I
admit that executable pages are relatively rare.
> As far as I know, having vmscan rescue pages that lose the isolation
> race was Lee's design.
> But as you pointed out, it has a bug: vmscan can't rescue the page
> before reaching try_to_unmap().
>
> So I think this approach is proper. :)
Now you decide :)
Thanks,
Fengguang
> > I did like your proposed
> >
> > if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
> > - referenced && page_mapping_inuse(page))
> > + referenced && page_mapping_inuse(page)
> > + && !(vm_flags & VM_LOCKED))
> > goto activate_locked;
> >
> > which looks more intuitive and less confusing.
> >
> > Thanks,
> > Fengguang
> >
>
>
>
> --
> Kind regards,
> Minchan Kim
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
@ 2009-08-19 14:00 ` Wu Fengguang
0 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-19 14:00 UTC (permalink / raw)
To: Minchan Kim
Cc: KOSAKI Motohiro, Rik van Riel, Jeff Dike, Avi Kivity,
Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm
On Wed, Aug 19, 2009 at 09:38:05PM +0800, Minchan Kim wrote:
> On Wed, Aug 19, 2009 at 10:24 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> > On Wed, Aug 19, 2009 at 08:25:56PM +0800, Minchan Kim wrote:
> >> On Wed, Aug 19, 2009 at 9:10 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> >> > On Wed, Aug 19, 2009 at 08:05:19PM +0800, KOSAKI Motohiro wrote:
> >> >> >> page_referenced_file?
> >> >> >> I think we should change page_referenced().
> >> >> >
> >> >> > Yeah, good catch.
> >> >> >
> >> >> >>
> >> >> >> Instead, How about this?
> >> >> >> ==============================================
> >> >> >>
> >> >> >> Subject: [PATCH] mm: stop circulating of referenced mlocked pages
> >> >> >>
> >> >> >> Currently, mlock() systemcall doesn't gurantee to mark the page PG_Mlocked
> >> >> >
> >> >> > A A A A A A A A A A A A A A A A A A A A A A A A A A mark PG_mlocked
> >> >> >
> >> >> >> because some race prevent page grabbing.
> >> >> >> In that case, instead vmscan move the page to unevictable lru.
> >> >> >>
> >> >> >> However, Wu Fengguang recently pointed out that the current vmscan
> >> >> >> logic isn't so efficient:
> >> >> >> an mlocked page can circulate between the active and inactive lists
> >> >> >> because vmscan checks whether the page is referenced _before_ culling
> >> >> >> mlocked pages.
> >> >> >>
> >> >> >> Plus, vmscan should mark PG_Mlocked when culling an mlocked page.
> >> >> >
> >> >> >                                  PG_mlocked
> >> >> >
> >> >> >> Otherwise the vm statistics show strange numbers.
> >> >> >>
> >> >> >> This patch does that.
> >> >> >
> >> >> > Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
> >> >>
> >> >> Thanks.
> >> >>
> >> >>
> >> >>
> >> >> >> Index: b/mm/rmap.c
> >> >> >> ===================================================================
> >> >> >> --- a/mm/rmap.c	2009-08-18 19:48:14.000000000 +0900
> >> >> >> +++ b/mm/rmap.c	2009-08-18 23:47:34.000000000 +0900
> >> >> >> @@ -362,7 +362,9 @@ static int page_referenced_one(struct pa
> >> >> >>  	 * unevictable list.
> >> >> >>  	 */
> >> >> >>  	if (vma->vm_flags & VM_LOCKED) {
> >> >> >> -		*mapcount = 1;	/* break early from loop */
> >> >> >> +		*mapcount = 1;		/* break early from loop */
> >> >> >> +		*vm_flags |= VM_LOCKED; /* for prevent to move active list */
> >> >> >
> >> >> >> +		try_set_page_mlocked(vma, page);
> >> >> >
> >> >> > That call is not absolutely necessary?
> >> >>
> >> >> Why? I haven't catch your point.
> >> >
> >> > Because we'll eventually hit another try_set_page_mlocked() when
> >> > trying to unmap the page, i.e. it duplicates the other call you added
> >> > in this patch.
> >>
> >> Yes, we don't have to call it, and we can make the patch simpler.
> >> I already sent a patch yesterday.
> >>
> >> http://marc.info/?l=linux-mm&m=125059325722370&w=2
> >>
> >> I think it's simpler than KOSAKI's idea.
> >> Is there any problem with my patch?
> >
> > No, IMHO your patch is simple and good, while KOSAKI's is more
> > complete :)
> >
> > - the try_set_page_mlocked() rename is suitable
> > - the call to try_set_page_mlocked() is necessary on try_to_unmap()
>
> We don't need the try_set_page_mlocked() call in try_to_unmap.
> That's because try_to_unmap_xxx will call try_to_mlock_page if the
> page is included in any VM_LOCKED vma. Eventually, it can move to the
> unevictable list.
Yes, indeed!
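[Editor's note] The control flow Minchan describes above — the unmap path culling a page mapped by a VM_LOCKED vma to the unevictable list, which is why the extra try_set_page_mlocked() call in the page_referenced() path is not strictly necessary — can be sketched as a tiny user-space model. All types, flag values, and helper names below are simplified stand-ins for illustration, not the real kernel code:

```c
#include <stdbool.h>

/* Invented, simplified stand-ins; real kernel values differ. */
#define VM_LOCKED 0x00002000UL

enum lru_kind { LRU_INACTIVE, LRU_ACTIVE, LRU_UNEVICTABLE };

struct page { enum lru_kind lru; bool pg_mlocked; };
struct vma  { unsigned long vm_flags; };

/* Toy try_set_page_mlocked(): mark the page and park it on the
 * unevictable list. */
static void try_set_page_mlocked(struct page *page)
{
	page->pg_mlocked = true;
	page->lru = LRU_UNEVICTABLE;
}

/* Toy unmap path: a page mapped by a VM_LOCKED vma is never unmapped;
 * it is culled to the unevictable list instead.  Because every mlocked
 * page eventually reaches this point, an earlier duplicate call in the
 * page_referenced() path adds nothing. */
static bool try_to_unmap_one(struct vma *vma, struct page *page)
{
	if (vma->vm_flags & VM_LOCKED) {
		try_set_page_mlocked(page);
		return false;	/* page stays mapped */
	}
	return true;		/* page unmapped */
}
```

Under this model, culling a VM_LOCKED-mapped page both sets the mlocked mark and moves it to the unevictable list in one place, matching Minchan's argument.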
> > - the "if (VM_LOCKED) referenced = 0" in page_referenced() could
> > A cover both active/inactive vmscan
>
> The sooner we set PG_mlocked on the page, the more unnecessary vmscan
> cost we save moving it from the active list to the inactive list. But I
> think it's a rare case, so there would be few such pages.
> So I think that will not be a big overhead.
The active list case can be persistent, when the mlocked (but without
PG_mlocked) page is executable and referenced by 2+ processes. But I
admit that executable pages are relatively rare.
> As far as I know, having vmscan rescue the pages that lose the
> isolation race was Lee's design.
> But as you pointed out, it has a bug: vmscan can't rescue the page
> before it reaches try_to_unmap.
>
> So I think this approach is proper. :)
Now you decide :)
Thanks,
Fengguang
> > I did like your proposed
> >
> > 		if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
> > -				referenced && page_mapping_inuse(page))
> > +				referenced && page_mapping_inuse(page)
> > +				&& !(vm_flags & VM_LOCKED))
> > 			goto activate_locked;
> >
> > which looks more intuitive and less confusing.
> >
> > Thanks,
> > Fengguang
> >
>
>
>
> --
> Kind regards,
> Minchan Kim
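[Editor's note] The activation test in the hunk Wu prefers can be modeled as a small predicate. This is a user-space sketch with invented constant values, not the actual shrink_page_list() code: the point is only that adding the !(vm_flags & VM_LOCKED) clause stops a referenced mlocked page from being re-activated, letting it fall through to the mlock cull in the unmap path.

```c
#include <stdbool.h>

/* Invented stand-in values for illustration only. */
#define PAGE_ALLOC_COSTLY_ORDER 3
#define VM_LOCKED 0x00002000UL

/* Toy model of the quoted check: activate a referenced, in-use page
 * only when it is not mapped by a VM_LOCKED vma. */
static bool should_activate(int order, bool referenced,
			    bool mapping_inuse, unsigned long vm_flags)
{
	return order <= PAGE_ALLOC_COSTLY_ORDER &&
	       referenced && mapping_inuse &&
	       !(vm_flags & VM_LOCKED);
}
```

With this form, the "circulating" pattern from the earlier discussion disappears: a referenced page in a VM_LOCKED vma no longer bounces back to the active list.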
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] memcg: move definitions to .h and inline some functions
2009-08-19 13:40 ` Wu Fengguang
@ 2009-08-19 14:18 ` KAMEZAWA Hiroyuki
-1 siblings, 0 replies; 243+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-19 14:18 UTC (permalink / raw)
To: Wu Fengguang
Cc: KOSAKI Motohiro, Balbir Singh, KAMEZAWA Hiroyuki, Rik van Riel,
Johannes Weiner, Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G,
Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton,
Christoph Lameter, Mel Gorman, LKML, linux-mm, nishimura, lizf,
menage
Wu Fengguang wrote:
> On Tue, Aug 18, 2009 at 11:57:52PM +0800, KOSAKI Motohiro wrote:
>>
>> > > This one of the reasons why we unconditionally deactivate
>> > > the active anon pages, and do background scanning of the
>> > > active anon list when reclaiming page cache pages.
>> > >
>> > > We want to always move some pages to the inactive anon
>> > > list, so it does not get too small.
>> >
>> > Right, the current code tries to pull inactive list out of
>> > smallish-size state as long as there are vmscan activities.
>> >
>> > However there is a possible (and tricky) hole: mem cgroups
>> > don't do batched vmscan. shrink_zone() may call shrink_list()
>> > with nr_to_scan=1, in which case shrink_list() _still_ calls
>> > isolate_pages() with the much larger SWAP_CLUSTER_MAX.
>> >
>> > It effectively scales up the inactive list scan rate by 10 times while
>> > the list is still small, and may thus prevent it from ever growing.
>> >
>> > In that case, LRU becomes FIFO.
>> >
>> > Jeff, can you confirm if the mem cgroup's inactive list is small?
>> > If so, this patch should help.
>>
>> This patch does the right thing.
>> However, let me explain why I and the memcg folks didn't do that in the
>> past.
>>
>> Strangely, some memcg struct declarations are hidden in *.c. Thus we
>> can't make them inline functions, and we hesitated to introduce a lot of
>> function-call overhead.
>>
>> So, can we move some memcg structure declarations to *.h and make
>> mem_cgroup_get_saved_scan() an inline function?
>
> OK, here it is. I had to move big chunks to make it compile, and it
> does reduce a dozen lines of code :)
>
> Is this big copy&paste acceptable? (memcg developers CCed).
>
> Thanks,
> Fengguang
I don't like this. Please add hooks to the necessary places at this stage.
This will be too big for an inline function, anyway.
Please move this only after you find the overhead is too big.
Thanks,
-Kame
> ---
>
> memcg: move definitions to .h and inline some functions
>
> So as to make inline functions.
>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
> include/linux/memcontrol.h | 154 ++++++++++++++++++++++++++++++-----
> mm/memcontrol.c | 131 -----------------------------
> 2 files changed, 134 insertions(+), 151 deletions(-)
>
> --- linux.orig/include/linux/memcontrol.h 2009-08-19 20:18:55.000000000
> +0800
> +++ linux/include/linux/memcontrol.h 2009-08-19 20:51:06.000000000 +0800
> @@ -20,11 +20,144 @@
> #ifndef _LINUX_MEMCONTROL_H
> #define _LINUX_MEMCONTROL_H
> #include <linux/cgroup.h>
> -struct mem_cgroup;
> +#include <linux/res_counter.h>
> struct page_cgroup;
> struct page;
> struct mm_struct;
>
> +/*
> + * Statistics for memory cgroup.
> + */
> +enum mem_cgroup_stat_index {
> + /*
> + * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> + */
> + MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
> + MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */
> + MEM_CGROUP_STAT_MAPPED_FILE, /* # of pages charged as file rss */
> + MEM_CGROUP_STAT_PGPGIN_COUNT, /* # of pages paged in */
> + MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */
> +
> + MEM_CGROUP_STAT_NSTATS,
> +};
> +
> +struct mem_cgroup_stat_cpu {
> + s64 count[MEM_CGROUP_STAT_NSTATS];
> +} ____cacheline_aligned_in_smp;
> +
> +struct mem_cgroup_stat {
> + struct mem_cgroup_stat_cpu cpustat[0];
> +};
> +
> +/*
> + * per-zone information in memory controller.
> + */
> +struct mem_cgroup_per_zone {
> + /*
> + * spin_lock to protect the per cgroup LRU
> + */
> + struct list_head lists[NR_LRU_LISTS];
> + unsigned long count[NR_LRU_LISTS];
> +
> + struct zone_reclaim_stat reclaim_stat;
> +};
> +/* Macro for accessing counter */
> +#define MEM_CGROUP_ZSTAT(mz, idx) ((mz)->count[(idx)])
> +
> +struct mem_cgroup_per_node {
> + struct mem_cgroup_per_zone zoneinfo[MAX_NR_ZONES];
> +};
> +
> +struct mem_cgroup_lru_info {
> + struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
> +};
> +
> +/*
> + * The memory controller data structure. The memory controller controls
> both
> + * page cache and RSS per cgroup. We would eventually like to provide
> + * statistics based on the statistics developed by Rik Van Riel for
> clock-pro,
> + * to help the administrator determine what knobs to tune.
> + *
> + * TODO: Add a water mark for the memory controller. Reclaim will begin
> when
> + * we hit the water mark. May be even add a low water mark, such that
> + * no reclaim occurs from a cgroup at it's low water mark, this is
> + * a feature that will be implemented much later in the future.
> + */
> +struct mem_cgroup {
> + struct cgroup_subsys_state css;
> + /*
> + * the counter to account for memory usage
> + */
> + struct res_counter res;
> + /*
> + * the counter to account for mem+swap usage.
> + */
> + struct res_counter memsw;
> + /*
> + * Per cgroup active and inactive list, similar to the
> + * per zone LRU lists.
> + */
> + struct mem_cgroup_lru_info info;
> +
> + /*
> + protect against reclaim related member.
> + */
> + spinlock_t reclaim_param_lock;
> +
> + int prev_priority; /* for recording reclaim priority */
> +
> + /*
> + * While reclaiming in a hiearchy, we cache the last child we
> + * reclaimed from.
> + */
> + int last_scanned_child;
> + /*
> + * Should the accounting and control be hierarchical, per subtree?
> + */
> + bool use_hierarchy;
> + unsigned long last_oom_jiffies;
> + atomic_t refcnt;
> +
> + unsigned int swappiness;
> +
> + /* set when res.limit == memsw.limit */
> + bool memsw_is_minimum;
> +
> + /*
> + * statistics. This must be placed at the end of memcg.
> + */
> + struct mem_cgroup_stat stat;
> +};
> +
> +static inline struct mem_cgroup_per_zone *
> +mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
> +{
> + return &mem->info.nodeinfo[nid]->zoneinfo[zid];
> +}
> +
> +static inline unsigned long
> +mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
> + struct zone *zone,
> + enum lru_list lru)
> +{
> + int nid = zone->zone_pgdat->node_id;
> + int zid = zone_idx(zone);
> + struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
> +
> + return MEM_CGROUP_ZSTAT(mz, lru);
> +}
> +
> +static inline struct zone_reclaim_stat *
> +mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg, struct zone *zone)
> +{
> + int nid = zone->zone_pgdat->node_id;
> + int zid = zone_idx(zone);
> + struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
> +
> + return &mz->reclaim_stat;
> +}
> +
> +
> #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> /*
> * All "charge" functions with gfp_mask should use GFP_KERNEL or
> @@ -95,11 +228,6 @@ extern void mem_cgroup_record_reclaim_pr
> int priority);
> int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
> int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
> -unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
> - struct zone *zone,
> - enum lru_list lru);
> -struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup
> *memcg,
> - struct zone *zone);
> struct zone_reclaim_stat*
> mem_cgroup_get_reclaim_stat_from_page(struct page *page);
> extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
> @@ -246,20 +374,6 @@ mem_cgroup_inactive_file_is_low(struct m
> return 1;
> }
>
> -static inline unsigned long
> -mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg, struct zone *zone,
> - enum lru_list lru)
> -{
> - return 0;
> -}
> -
> -
> -static inline struct zone_reclaim_stat*
> -mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg, struct zone *zone)
> -{
> - return NULL;
> -}
> -
> static inline struct zone_reclaim_stat*
> mem_cgroup_get_reclaim_stat_from_page(struct page *page)
> {
> --- linux.orig/mm/memcontrol.c 2009-08-19 20:14:56.000000000 +0800
> +++ linux/mm/memcontrol.c 2009-08-19 20:46:50.000000000 +0800
> @@ -55,30 +55,6 @@ static int really_do_swap_account __init
> static DEFINE_MUTEX(memcg_tasklist); /* can be hold under cgroup_mutex */
>
> /*
> - * Statistics for memory cgroup.
> - */
> -enum mem_cgroup_stat_index {
> - /*
> - * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> - */
> - MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
> - MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */
> - MEM_CGROUP_STAT_MAPPED_FILE, /* # of pages charged as file rss */
> - MEM_CGROUP_STAT_PGPGIN_COUNT, /* # of pages paged in */
> - MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */
> -
> - MEM_CGROUP_STAT_NSTATS,
> -};
> -
> -struct mem_cgroup_stat_cpu {
> - s64 count[MEM_CGROUP_STAT_NSTATS];
> -} ____cacheline_aligned_in_smp;
> -
> -struct mem_cgroup_stat {
> - struct mem_cgroup_stat_cpu cpustat[0];
> -};
> -
> -/*
> * For accounting under irq disable, no need for increment preempt count.
> */
> static inline void __mem_cgroup_stat_add_safe(struct mem_cgroup_stat_cpu
> *stat,
> @@ -106,86 +82,6 @@ static s64 mem_cgroup_local_usage(struct
> return ret;
> }
>
> -/*
> - * per-zone information in memory controller.
> - */
> -struct mem_cgroup_per_zone {
> - /*
> - * spin_lock to protect the per cgroup LRU
> - */
> - struct list_head lists[NR_LRU_LISTS];
> - unsigned long count[NR_LRU_LISTS];
> -
> - struct zone_reclaim_stat reclaim_stat;
> -};
> -/* Macro for accessing counter */
> -#define MEM_CGROUP_ZSTAT(mz, idx) ((mz)->count[(idx)])
> -
> -struct mem_cgroup_per_node {
> - struct mem_cgroup_per_zone zoneinfo[MAX_NR_ZONES];
> -};
> -
> -struct mem_cgroup_lru_info {
> - struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
> -};
> -
> -/*
> - * The memory controller data structure. The memory controller controls
> both
> - * page cache and RSS per cgroup. We would eventually like to provide
> - * statistics based on the statistics developed by Rik Van Riel for
> clock-pro,
> - * to help the administrator determine what knobs to tune.
> - *
> - * TODO: Add a water mark for the memory controller. Reclaim will begin
> when
> - * we hit the water mark. May be even add a low water mark, such that
> - * no reclaim occurs from a cgroup at it's low water mark, this is
> - * a feature that will be implemented much later in the future.
> - */
> -struct mem_cgroup {
> - struct cgroup_subsys_state css;
> - /*
> - * the counter to account for memory usage
> - */
> - struct res_counter res;
> - /*
> - * the counter to account for mem+swap usage.
> - */
> - struct res_counter memsw;
> - /*
> - * Per cgroup active and inactive list, similar to the
> - * per zone LRU lists.
> - */
> - struct mem_cgroup_lru_info info;
> -
> - /*
> - protect against reclaim related member.
> - */
> - spinlock_t reclaim_param_lock;
> -
> - int prev_priority; /* for recording reclaim priority */
> -
> - /*
> - * While reclaiming in a hiearchy, we cache the last child we
> - * reclaimed from.
> - */
> - int last_scanned_child;
> - /*
> - * Should the accounting and control be hierarchical, per subtree?
> - */
> - bool use_hierarchy;
> - unsigned long last_oom_jiffies;
> - atomic_t refcnt;
> -
> - unsigned int swappiness;
> -
> - /* set when res.limit == memsw.limit */
> - bool memsw_is_minimum;
> -
> - /*
> - * statistics. This must be placed at the end of memcg.
> - */
> - struct mem_cgroup_stat stat;
> -};
> -
> enum charge_type {
> MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
> MEM_CGROUP_CHARGE_TYPE_MAPPED,
> @@ -244,12 +140,6 @@ static void mem_cgroup_charge_statistics
> }
>
> static struct mem_cgroup_per_zone *
> -mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
> -{
> - return &mem->info.nodeinfo[nid]->zoneinfo[zid];
> -}
> -
> -static struct mem_cgroup_per_zone *
> page_cgroup_zoneinfo(struct page_cgroup *pc)
> {
> struct mem_cgroup *mem = pc->mem_cgroup;
> @@ -586,27 +476,6 @@ int mem_cgroup_inactive_file_is_low(stru
> return (active > inactive);
> }
>
> -unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
> - struct zone *zone,
> - enum lru_list lru)
> -{
> - int nid = zone->zone_pgdat->node_id;
> - int zid = zone_idx(zone);
> - struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
> -
> - return MEM_CGROUP_ZSTAT(mz, lru);
> -}
> -
> -struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup
> *memcg,
> - struct zone *zone)
> -{
> - int nid = zone->zone_pgdat->node_id;
> - int zid = zone_idx(zone);
> - struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
> -
> - return &mz->reclaim_stat;
> -}
> -
> struct zone_reclaim_stat *
> mem_cgroup_get_reclaim_stat_from_page(struct page *page)
> {
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
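[Editor's note] The batching mismatch described in the quoted discussion — a mem cgroup calling shrink_list() with nr_to_scan=1 while isolate_pages() still grabs a full SWAP_CLUSTER_MAX batch — can be illustrated with a toy model. The function below is illustrative, not kernel code; SWAP_CLUSTER_MAX is 32 in kernels of this era, and the "10 times" in the quoted text is an approximate effect of this rounding-up:

```c
/* SWAP_CLUSTER_MAX as in kernels of this era. */
#define SWAP_CLUSTER_MAX 32UL

/* Toy model: pages actually isolated for one shrink_list() call.
 * Even a request to scan a single page is amplified to a full
 * SWAP_CLUSTER_MAX batch, inflating the effective scan rate of a
 * small inactive list and keeping it from growing. */
static unsigned long effective_scan(unsigned long nr_to_scan)
{
	if (nr_to_scan == 0)
		return 0;
	return nr_to_scan < SWAP_CLUSTER_MAX ? SWAP_CLUSTER_MAX : nr_to_scan;
}
```

Under this model a request for 1 page isolates 32, which is the over-scanning of small memcg inactive lists that the discussed patch is meant to avoid.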
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] memcg: move definitions to .h and inline some functions
2009-08-19 14:18 ` KAMEZAWA Hiroyuki
@ 2009-08-19 14:27 ` Balbir Singh
-1 siblings, 0 replies; 243+ messages in thread
From: Balbir Singh @ 2009-08-19 14:27 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Wu Fengguang, KOSAKI Motohiro, Rik van Riel, Johannes Weiner,
Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred,
Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
Mel Gorman, LKML, linux-mm, nishimura, lizf, menage
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-19 23:18:01]:
> Wu Fengguang wrote:
> > On Tue, Aug 18, 2009 at 11:57:52PM +0800, KOSAKI Motohiro wrote:
> >>
> >> > > This one of the reasons why we unconditionally deactivate
> >> > > the active anon pages, and do background scanning of the
> >> > > active anon list when reclaiming page cache pages.
> >> > >
> >> > > We want to always move some pages to the inactive anon
> >> > > list, so it does not get too small.
> >> >
> >> > Right, the current code tries to pull inactive list out of
> >> > smallish-size state as long as there are vmscan activities.
> >> >
> >> > However there is a possible (and tricky) hole: mem cgroups
> >> > don't do batched vmscan. shrink_zone() may call shrink_list()
> >> > with nr_to_scan=1, in which case shrink_list() _still_ calls
> >> > isolate_pages() with the much larger SWAP_CLUSTER_MAX.
> >> >
> >> > It effectively scales up the inactive list scan rate by 10 times when
> >> > it is still small, and may thus prevent it from growing up for ever.
> >> >
> >> > In that case, LRU becomes FIFO.
> >> >
> >> > Jeff, can you confirm if the mem cgroup's inactive list is small?
> >> > If so, this patch should help.
> >>
> >> This patch does the right thing.
> >> However, let me explain why the memcg folks and I didn't do that in the
> >> past.
> >>
> >> Strangely, some memcg struct declarations are hidden in *.c. Thus we can't
> >> make inline functions, and we hesitated to introduce much function-call
> >> overhead.
> >>
> >> So, can we move some memcg structure declarations to *.h and make
> >> mem_cgroup_get_saved_scan() an inline function?
> >
> > OK, here it is. I had to move big chunks to make it compile, and it
> > did reduce the code by a dozen lines :)
> >
> > Is this big copy&paste acceptable? (memcg developers CCed).
> >
> > Thanks,
> > Fengguang
>
> I don't like this. plz add hooks to the necessary places at this stage.
> This will be too big for an inline function, anyway.
> plz do this move only after you find the overhead is too big.
Me too. To be honest, I want to keep the implementation abstracted
within memcontrol.c (I am concerned that someone might include
memcontrol.h and access its structure members, which scares me). Hiding
it within memcontrol.c provides the right level of abstraction.
Could you please explain your motivation for this change? I got cc'ed
on a few emails; is this for the patch that exports the nr_save_scanned
approach?
--
Balbir
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] memcg: move definitions to .h and inline some functions
2009-08-19 14:27 ` Balbir Singh
@ 2009-08-20 1:34 ` Wu Fengguang
-1 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-08-20 1:34 UTC (permalink / raw)
To: Balbir Singh
Cc: KAMEZAWA Hiroyuki, KOSAKI Motohiro, Rik van Riel,
Johannes Weiner, Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G,
Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton,
Christoph Lameter, Mel Gorman, LKML, linux-mm, nishimura, lizf,
menage
On Wed, Aug 19, 2009 at 10:27:05PM +0800, Balbir Singh wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-19 23:18:01]:
>
> > Wu Fengguang wrote:
> > > On Tue, Aug 18, 2009 at 11:57:52PM +0800, KOSAKI Motohiro wrote:
> > >>
> > >> > > This is one of the reasons why we unconditionally deactivate
> > >> > > the active anon pages, and do background scanning of the
> > >> > > active anon list when reclaiming page cache pages.
> > >> > >
> > >> > > We want to always move some pages to the inactive anon
> > >> > > list, so it does not get too small.
> > >> >
> > >> > Right, the current code tries to pull the inactive list out of its
> > >> > smallish-size state as long as there is vmscan activity.
> > >> >
> > >> > However there is a possible (and tricky) hole: mem cgroups
> > >> > don't do batched vmscan. shrink_zone() may call shrink_list()
> > >> > with nr_to_scan=1, in which case shrink_list() _still_ calls
> > >> > isolate_pages() with the much larger SWAP_CLUSTER_MAX.
> > >> >
> > >> > It effectively scales up the inactive list scan rate by 10 times while
> > >> > the list is still small, and may thus prevent it from ever growing.
> > >> >
> > >> > In that case, LRU becomes FIFO.
> > >> >
> > >> > Jeff, can you confirm if the mem cgroup's inactive list is small?
> > >> > If so, this patch should help.
> > >>
> > >> This patch does the right thing.
> > >> However, let me explain why the memcg folks and I didn't do that in the
> > >> past.
> > >>
> > >> Strangely, some memcg struct declarations are hidden in *.c. Thus we can't
> > >> make inline functions, and we hesitated to introduce much function-call
> > >> overhead.
> > >>
> > >> So, can we move some memcg structure declarations to *.h and make
> > >> mem_cgroup_get_saved_scan() an inline function?
> > >
> > > OK, here it is. I had to move big chunks to make it compile, and it
> > > did reduce the code by a dozen lines :)
> > >
> > > Is this big copy&paste acceptable? (memcg developers CCed).
> > >
> > > Thanks,
> > > Fengguang
> >
> > I don't like this. plz add hooks to the necessary places at this stage.
> > This will be too big for an inline function, anyway.
> > plz do this move only after you find the overhead is too big.
It should not be a performance regression, since the text size is slightly
smaller with the patch:
           text     data      bss      dec     hex filename
before  8732148  2771858 11048432 22552438 1581f76 vmlinux
after   8731972  2771858 11048432 22552262 1581ec6 vmlinux
> Me too.. I want to abstract the implementation within memcontrol.c to
> be honest (I am concerned that someone might include memcontrol.h and
> access its structure members, which scares me). Hiding it within
> memcontrol.c provides the right level of abstraction.
Yeah quite reasonable.
> Could you please explain your motivation for this change? I got cc'ed
> on to a few emails, is this for the patch that export nr_save_scanned
> approach?
Yes, KOSAKI proposed to inline the mem_cgroup_get_saved_scan() function
introduced in that patch, which requires moving the structs into the .h file.
I'll submit the original (un-inlined) patch.
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-19 13:28 ` Minchan Kim
@ 2009-08-21 11:17 ` KOSAKI Motohiro
-1 siblings, 0 replies; 243+ messages in thread
From: KOSAKI Motohiro @ 2009-08-21 11:17 UTC (permalink / raw)
To: Minchan Kim
Cc: Wu Fengguang, Rik van Riel, Jeff Dike, Avi Kivity,
Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm
>> Hmm, I think
>>
>> 1. Anyway, we need to turn on PG_mlocked.
>
> I'm adding my patch again to explain.
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index ed63894..283266c 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -358,6 +358,7 @@ static int page_referenced_one(struct page *page,
> */
> if (vma->vm_flags & VM_LOCKED) {
> *mapcount = 1; /* break early from loop */
> + *vm_flags |= VM_LOCKED;
> goto out_unmap;
> }
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d224b28..d156e1d 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -632,7 +632,8 @@ static unsigned long shrink_page_list(struct
> list_head *page_list,
> sc->mem_cgroup, &vm_flags);
> /* In active use or really unfreeable? Activate it. */
> if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
> - referenced && page_mapping_inuse(page))
> + referenced && page_mapping_inuse(page)
> + && !(vm_flags & VM_LOCKED))
> goto activate_locked;
>
> With this check, the page can reach try_to_unmap() after
> page_referenced() in shrink_page_list(). At that point, PG_mlocked
> will be set.
You are right.
Please add my Reviewed-by tag to your patch.
Thanks.
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-17 19:47 ` Dike, Jeffrey G
@ 2009-08-21 18:24 ` Balbir Singh
-1 siblings, 0 replies; 243+ messages in thread
From: Balbir Singh @ 2009-08-21 18:24 UTC (permalink / raw)
To: Dike, Jeffrey G
Cc: Wu, Fengguang, Rik van Riel, Avi Kivity, Andrea Arcangeli, Yu,
Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton,
Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
* Dike, Jeffrey G <jeffrey.g.dike@intel.com> [2009-08-17 12:47:29]:
> > Jeff, can you have a look at these stats? Thanks!
>
> Yeah, I just did, after adding some tracing which dumped out the same data. It looks pretty much the same. Inactive anon and active anon are pretty similar. Inactive file and active file are smaller and fluctuate more, but they don't look horribly unbalanced.
>
> Below are the stats from memory.stat - inactive_anon, active_anon, inactive_file, active_file, plus some commentary on what's happening.
>
Interesting... there seems to be a sufficient amount of inactive memory,
specifically inactive_file. My biggest suspicion now is the passing of
reference info from the shadow page tables to the host (although, to be
honest, I've never looked at that code).
What do the stats for / from within kvm look like?
--
Balbir
^ permalink raw reply [flat|nested] 243+ messages in thread
* RE: [RFC] respect the referenced bit of KVM guest pages?
2009-08-21 18:24 ` Balbir Singh
@ 2009-08-31 19:43 ` Dike, Jeffrey G
-1 siblings, 0 replies; 243+ messages in thread
From: Dike, Jeffrey G @ 2009-08-31 19:43 UTC (permalink / raw)
To: balbir
Cc: Wu, Fengguang, Rik van Riel, Avi Kivity, Andrea Arcangeli, Yu,
Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton,
Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
> What do the stats for / from within kvm look like?
Interesting - what they show is that inactive_anon is always zero. Details below - I took the host numbers at the same time, and they are similar to what I reported before.
Jeff
The fields are inactive_anon, active_anon, inactive_file, active_file - shortly after the data started being collected, I started firefox and an editor thingy. The data continues as far into the shutdown as it could.
0 10858 13516 3279
0 10872 13516 3286
0 10867 13513 3286
0 11455 13268 3552
0 13068 12871 3949
0 13281 12810 4012
0 13701 12719 4103
0 14133 12631 4191
0 10878 11742 5087
0 10878 11741 5085
0 10878 11741 5085
0 10877 11741 5085
0 10877 11741 5085
0 10878 11741 5085
0 10878 11741 5085
0 10877 11741 5085
0 10905 11741 5085
0 11118 11776 5106
0 11594 14541 5169
0 11084 15314 5248
0 12022 15686 5300
0 12813 16379 5608
0 13614 16744 5915
0 14230 16849 5936
0 14461 16943 5953
0 14706 17412 5967
0 15574 17445 6011
0 15623 17459 6011
0 15596 17461 6015
0 15941 17523 6048
0 16508 17684 6048
0 17095 18154 6056
0 18635 18175 6056
0 18867 18195 6060
0 18972 18195 6060
0 18975 18185 6073
0 19220 18234 6073
0 19809 18276 6076
0 19571 18276 6076
0 19567 18276 6076
0 19588 18276 6076
0 19588 18276 6076
0 19588 18276 6076
0 19589 18276 6076
0 19603 18276 6076
0 19607 18277 6077
0 19600 18277 6077
0 19034 18235 6119
0 19041 18235 6119
0 19040 18233 6121
0 19040 18233 6121
0 18724 18240 6121
0 11674 16376 7977
0 11674 16376 7977
0 11673 16376 7977
0 11708 16376 7977
0 11703 16374 7979
0 11703 16374 7979
0 11702 16374 7979
0 11702 16374 7979
0 11716 16374 7979
0 11716 16374 7979
0 11718 16374 7979
0 11711 16374 7979
0 11811 16413 7986
0 11811 16413 7986
0 11897 16413 7986
0 12247 16434 7986
0 12495 16457 7990
0 12495 16457 7990
0 12491 16457 7990
0 12491 16457 7990
0 12737 16457 7990
0 11844 16457 7990
0 10969 16436 8011
0 9586 16140 8328
0 9209 16253 8333
0 8467 16120 8550
0 7857 16504 8592
0 7215 16467 8681
0 7194 16481 8723
0 7155 16475 8730
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-31 19:43 ` Dike, Jeffrey G
@ 2009-08-31 19:52 ` Rik van Riel
-1 siblings, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2009-08-31 19:52 UTC (permalink / raw)
To: Dike, Jeffrey G
Cc: balbir, Wu, Fengguang, Avi Kivity, Andrea Arcangeli, Yu, Wilfred,
Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
Dike, Jeffrey G wrote:
>> What do the stats for / from within kvm look like?
>
> Interesting - what they look like is inactive_anon is always zero.
This will be because the VM does not start aging pages
from the active to the inactive list unless there is
some memory pressure.
^ permalink raw reply [flat|nested] 243+ messages in thread
* RE: [RFC] respect the referenced bit of KVM guest pages?
2009-08-31 19:52 ` Rik van Riel
@ 2009-08-31 20:06 ` Dike, Jeffrey G
-1 siblings, 0 replies; 243+ messages in thread
From: Dike, Jeffrey G @ 2009-08-31 20:06 UTC (permalink / raw)
To: Rik van Riel
Cc: balbir, Wu, Fengguang, Avi Kivity, Andrea Arcangeli, Yu, Wilfred,
Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
> This will be because the VM does not start aging pages
> from the active to the inactive list unless there is
> some memory pressure.
Which is the reason I gave the VM a puny amount of memory. We know the thing is under memory pressure because I've been complaining about page discards. I didn't collect that data on this run, but I'll do it again to make sure.
Jeff
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-31 20:06 ` Dike, Jeffrey G
@ 2009-08-31 20:09 ` Rik van Riel
-1 siblings, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2009-08-31 20:09 UTC (permalink / raw)
To: Dike, Jeffrey G
Cc: balbir, Wu, Fengguang, Avi Kivity, Andrea Arcangeli, Yu, Wilfred,
Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
Dike, Jeffrey G wrote:
>> This will be because the VM does not start aging pages
>> from the active to the inactive list unless there is
>> some memory pressure.
>
> Which is the reason I gave the VM a puny amount of memory.
> We know the thing is under memory pressure because I've been
> complaining about page discards.
Page discards by the host, which are invisible to the guest OS.
The guest OS thinks it has enough pages. The host disagrees
and swaps out some guest memory.
^ permalink raw reply [flat|nested] 243+ messages in thread
* RE: [RFC] respect the referenced bit of KVM guest pages?
2009-08-31 20:09 ` Rik van Riel
@ 2009-08-31 20:11 ` Dike, Jeffrey G
-1 siblings, 0 replies; 243+ messages in thread
From: Dike, Jeffrey G @ 2009-08-31 20:11 UTC (permalink / raw)
To: Rik van Riel
Cc: balbir, Wu, Fengguang, Avi Kivity, Andrea Arcangeli, Yu, Wilfred,
Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
> Page discards by the host, which are invisible to the guest
> OS.
Duh. Right - I can't keep my VM systems straight...
Jeff
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-31 20:11 ` Dike, Jeffrey G
@ 2009-08-31 20:42 ` Balbir Singh
-1 siblings, 0 replies; 243+ messages in thread
From: Balbir Singh @ 2009-08-31 20:42 UTC (permalink / raw)
To: Dike, Jeffrey G
Cc: Rik van Riel, Wu, Fengguang, Avi Kivity, Andrea Arcangeli, Yu,
Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton,
Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm
On Tue, Sep 1, 2009 at 1:41 AM, Dike, Jeffrey G <jeffrey.g.dike@intel.com> wrote:
>> Page discards by the host, which are invisible to the guest
>> OS.
>
> Duh. Right - I can't keep my VM systems straight...
>
Sounds like we need a way of indicating reference information. Guest
page hinting (cough; cough) anyone? Maybe a simpler version?
Balbir Singh.
^ permalink raw reply [flat|nested] 243+ messages in thread
* RE: [RFC] respect the referenced bit of KVM guest pages?
2009-08-18 2:26 ` Wu Fengguang
@ 2009-09-02 19:30 ` Dike, Jeffrey G
0 siblings, 0 replies; 243+ messages in thread
From: Dike, Jeffrey G @ 2009-09-02 19:30 UTC (permalink / raw)
To: Wu, Fengguang
Cc: Rik van Riel, Johannes Weiner, Avi Kivity, KOSAKI Motohiro,
Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm
I'm trying to better understand the motivation for your make-mapped-exec-pages-first-class-citizens patch. As I read your (very detailed!) description, you are diagnosing a threshold effect from Rik's evict-use-once-pages-first patch where if the inactive list is slightly smaller than the active list, the active list will start being scanned, pushing text (and other) pages onto the inactive list where they will be quickly kicked out to swap.
As I read Rik's patch, if the active list is one page larger than the inactive list, then a batch of pages will get moved from one to the other. For this to have a noticeable effect on the system once the streaming is done, there must be something continuing to keep the active list larger than the inactive list. Maybe there is a consistent percentage of the streamed pages which are use-twice.
So, we have a threshold effect where a small change in input (the size of the streaming file vs the number of active pages) causes a large change in output (lots of text pages suddenly start getting thrown out). My immediate reaction to that is that there shouldn't be this sudden change in behavior, and that maybe there should only be enough scanning in shrink_active_list to bring the two lists back to parity. However, if there's something keeping the active list bigger than the inactive list, this will just put off the inevitable required scanning.
As for your patch, it seems like we have a problem with scanning I/O, and instead of looking at those pages, you are looking to protect some other set of pages (mapped text). That, in turn, increases pressure on anonymous pages (which is where I came in). Wouldn't it be a better idea to keep looking at those streaming pages and figure out how to get them out of memory quickly?
Jeff
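The one-batch-over-parity behavior described above can be sketched as a toy model (Python, purely illustrative; the names `should_scan_active` and `rebalance` are invented here, and the real logic in mm/vmscan.c is considerably more involved):

```python
def should_scan_active(nr_active, nr_inactive):
    # Simplified analogue of the list-balance check: the active list
    # is scanned only once it outgrows the inactive list.
    return nr_active > nr_inactive

def rebalance(nr_active, nr_inactive, batch=32):
    """Move one batch of pages from active to inactive when the
    active list is larger; a one-page change in input can thus
    trigger a whole batch of deactivations."""
    if should_scan_active(nr_active, nr_inactive):
        moved = min(batch, nr_active)
        nr_active -= moved
        nr_inactive += moved
    return nr_active, nr_inactive
```

With the lists at parity nothing moves; one extra active page moves a full batch, which is the threshold effect under discussion.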
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-09-02 19:30 ` Dike, Jeffrey G
@ 2009-09-03 2:04 ` Wu Fengguang
0 siblings, 0 replies; 243+ messages in thread
From: Wu Fengguang @ 2009-09-03 2:04 UTC (permalink / raw)
To: Dike, Jeffrey G
Cc: Rik van Riel, Johannes Weiner, Avi Kivity, KOSAKI Motohiro,
Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm
Jeff,
On Thu, Sep 03, 2009 at 03:30:59AM +0800, Dike, Jeffrey G wrote:
> I'm trying to better understand the motivation for your
> make-mapped-exec-pages-first-class-citizens patch. As I read your
> (very detailed!) description, you are diagnosing a threshold effect
> from Rik's evict-use-once-pages-first patch where if the inactive
> list is slightly smaller than the active list, the active list will
> start being scanned, pushing text (and other) pages onto the
> inactive list where they will be quickly kicked out to swap.
Right.
> As I read Rik's patch, if the active list is one page larger than
> the inactive list, then a batch of pages will get moved from one to
> the other. For this to have a noticeable effect on the system once
> the streaming is done, there must be something continuing to keep
> the active list larger than the inactive list. Maybe there is a
> consistent percentage of the streamed pages which are use-twice.
Right. Besides the use-twice case, I also explored the
desktop-working-set-cannot-fit-in-memory case in the patch.
> So, we have a threshold effect where a small change in input (the size of
> the streaming file vs the number of active pages) causes a large
> change in output (lots of text pages suddenly start getting thrown
> out). My immediate reaction to that is that there shouldn't be
> this sudden change in behavior, and that maybe there should only be
> enough scanning in shrink_active_list to bring the two lists back to
> parity. However, if there's something keeping the active list
> bigger than the inactive list, this will just put off the inevitable
> required scanning.
Yes, there will be a sudden "behavior change" as soon as the active list
grows larger than the inactive list. However, the "output change" is
bounded and not as large, because the extra scanning of the active
list stops as soon as the two lists are back to parity.
> As for your patch, it seems like we have a problem with scanning
> I/O, and instead of looking at those pages, you are looking to
> protect some other set of pages (mapped text). That, in turn,
> increases pressure on anonymous pages (which is where I came in).
> Wouldn't it be a better idea to keep looking at those streaming
> pages and figure out how to get them out of memory quickly?
The scanning I/O problem has been largely addressed by Rik's patch.
It is not optimal (which is hard), but fair enough for common cases.
Your kvm test case sounds like desktop-working-set-cannot-fit-in-memory.
In that case, it obviously pays to protect the exec-mapped pages,
and there are not too many kvm exec-mapped pages to put pressure on anon pages.
I ran a kvm instance and collected its exec-mapped pages as follows.
They sum up to ~3MB, which is not big enough to add much thrashing pressure.
Thanks,
Fengguang
---
Rss of kvm:
% grep -A2 x /proc/7640/smaps | grep -v Size
00400000-005fe000 r-xp 00000000 08:02 1890389 /usr/bin/kvm
Rss: 680 kB
--
7f6c029f9000-7f6c02a0f000 r-xp 00000000 08:02 458771 /lib/libgcc_s.so.1
Rss: 16 kB
--
7f6c02c10000-7f6c02d00000 r-xp 00000000 08:02 1885409 /usr/lib/libstdc++.so.6.0.10
Rss: 364 kB
--
7f6c03bf2000-7f6c03bf7000 r-xp 00000000 08:02 1887873 /usr/lib/libXdmcp.so.6.0.0
Rss: 12 kB
--
7f6c03df7000-7f6c03df9000 r-xp 00000000 08:02 1887871 /usr/lib/libXau.so.6.0.0
Rss: 8 kB
--
7f6c03ff9000-7f6c04019000 r-xp 00000000 08:02 458890 /lib/libx86.so.1
Rss: 36 kB
--
7f6c04019000-7f6c04218000 ---p 00020000 08:02 458890 /lib/libx86.so.1
Rss: 0 kB
--
7f6c04218000-7f6c0421a000 rw-p 0001f000 08:02 458890 /lib/libx86.so.1
Rss: 8 kB
--
7f6c0421b000-7f6c0421f000 r-xp 00000000 08:02 458861 /lib/libattr.so.1.1.0
Rss: 12 kB
--
7f6c0441f000-7f6c04434000 r-xp 00000000 08:02 460739 /lib/libnsl-2.9.so
Rss: 24 kB
--
7f6c04637000-7f6c04647000 r-xp 00000000 08:02 1897259 /usr/lib/libXext.so.6.4.0
Rss: 20 kB
--
7f6c04647000-7f6c04847000 ---p 00010000 08:02 1897259 /usr/lib/libXext.so.6.4.0
Rss: 0 kB
--
7f6c04847000-7f6c04848000 rw-p 00010000 08:02 1897259 /usr/lib/libXext.so.6.4.0
Rss: 4 kB
--
7f6c04848000-7f6c04978000 r-xp 00000000 08:02 1889103 /usr/lib/libicuuc.so.38.1
Rss: 244 kB
--
7f6c04b89000-7f6c04ba4000 r-xp 00000000 08:02 1886322 /usr/lib/libxcb.so.1.1.0
Rss: 28 kB
--
7f6c04ba4000-7f6c04da4000 ---p 0001b000 08:02 1886322 /usr/lib/libxcb.so.1.1.0
Rss: 0 kB
--
7f6c04da4000-7f6c04da5000 rw-p 0001b000 08:02 1886322 /usr/lib/libxcb.so.1.1.0
Rss: 4 kB
--
7f6c04da5000-7f6c04df3000 r-xp 00000000 08:02 1899899 /usr/lib/libvga.so.1.4.3
Rss: 68 kB
--
7f6c05004000-7f6c05018000 r-xp 00000000 08:02 1896343 /usr/lib/libdirect-1.0.so.0.1.0
Rss: 24 kB
--
7f6c05219000-7f6c05221000 r-xp 00000000 08:02 1892825 /usr/lib/libfusion-1.0.so.0.1.0
Rss: 16 kB
--
7f6c05421000-7f6c0548d000 r-xp 00000000 08:02 1892826 /usr/lib/libdirectfb-1.0.so.0.1.0
Rss: 64 kB
--
7f6c05691000-7f6c056a4000 r-xp 00000000 08:02 460720 /lib/libresolv-2.9.so
Rss: 20 kB
--
7f6c058a7000-7f6c058aa000 r-xp 00000000 08:02 1895568 /usr/lib/libgpg-error.so.0.4.0
Rss: 4 kB
--
7f6c05aaa000-7f6c05b1d000 r-xp 00000000 08:02 1890493 /usr/lib/libgcrypt.so.11.5.2
Rss: 36 kB
--
7f6c05d20000-7f6c05d30000 r-xp 00000000 08:02 2081187 /usr/lib/libtasn1.so.3.1.2
Rss: 12 kB
--
7f6c05f30000-7f6c05f34000 r-xp 00000000 08:02 458960 /lib/libcap.so.2.11
Rss: 12 kB
--
7f6c06134000-7f6c06170000 r-xp 00000000 08:02 1889247 /usr/lib/libdbus-1.so.3.4.0
Rss: 36 kB
--
7f6c06372000-7f6c06376000 r-xp 00000000 08:02 1891735 /usr/lib/libasyncns.so.0.1.0
Rss: 12 kB
--
7f6c06576000-7f6c0657e000 r-xp 00000000 08:02 458834 /lib/libwrap.so.0.7.6
Rss: 20 kB
--
7f6c0677f000-7f6c06784000 r-xp 00000000 08:02 1888031 /usr/lib/libXtst.so.6.1.0
Rss: 12 kB
--
7f6c06985000-7f6c0698d000 r-xp 00000000 08:02 1885331 /usr/lib/libSM.so.6.0.0
Rss: 12 kB
--
7f6c06b8d000-7f6c06ba3000 r-xp 00000000 08:02 1897238 /usr/lib/libICE.so.6.3.0
Rss: 28 kB
--
7f6c06da8000-7f6c06dee000 r-xp 00000000 08:02 2147381 /usr/lib/libpulsecommon-0.9.15.so
Rss: 56 kB
--
7f6c06fef000-7f6c06ff1000 r-xp 00000000 08:02 460735 /lib/libdl-2.9.so
Rss: 8 kB
--
7f6c071f3000-7f6c0722d000 r-xp 00000000 08:02 2147380 /usr/lib/libpulse.so.0.8.0
Rss: 44 kB
--
7f6c0742f000-7f6c07578000 r-xp 00000000 08:02 460721 /lib/libc-2.9.so
Rss: 464 kB
--
7f6c07782000-7f6c07786000 r-xp 00000000 08:02 1892933 /usr/lib/libvdeplug.so.2.1.0
Rss: 12 kB
--
7f6c07986000-7f6c0798f000 r-xp 00000000 08:02 459991 /lib/libbrlapi.so.0.5.1
Rss: 20 kB
--
7f6c07b91000-7f6c07bcb000 r-xp 00000000 08:02 458804 /lib/libncurses.so.5.6
Rss: 76 kB
--
7f6c07dd0000-7f6c07f05000 r-xp 00000000 08:02 1886324 /usr/lib/libX11.so.6.2.0
Rss: 120 kB
--
7f6c0810b000-7f6c08173000 r-xp 00000000 08:02 1892212 /usr/lib/libSDL-1.2.so.0.11.1
Rss: 36 kB
--
7f6c083c1000-7f6c083c3000 r-xp 00000000 08:02 460719 /lib/libutil-2.9.so
Rss: 8 kB
--
7f6c085c4000-7f6c085cb000 r-xp 00000000 08:02 460727 /lib/librt-2.9.so
Rss: 24 kB
--
7f6c087cc000-7f6c087e2000 r-xp 00000000 08:02 460725 /lib/libpthread-2.9.so
Rss: 68 kB
--
7f6c089e7000-7f6c089f2000 r-xp 00000000 08:02 1889339 /usr/lib/libpci.so.3.1.2
Rss: 16 kB
--
7f6c08bf2000-7f6c08c0c000 r-xp 00000000 08:02 2146929 /usr/lib/libbluetooth.so.3.2.5
Rss: 32 kB
--
7f6c08e0d000-7f6c08eb4000 r-xp 00000000 08:02 1896347 /usr/lib/libgnutls.so.26.11.5
Rss: 136 kB
--
7f6c090bf000-7f6c090c2000 r-xp 00000000 08:02 2147382 /usr/lib/libpulse-simple.so.0.0.2
Rss: 12 kB
--
7f6c092c3000-7f6c093a0000 r-xp 00000000 08:02 1885450 /usr/lib/libasound.so.2.0.0
Rss: 176 kB
--
7f6c095a7000-7f6c095bd000 r-xp 00000000 08:02 1885377 /usr/lib/libz.so.1.2.3.3
Rss: 16 kB
--
7f6c097be000-7f6c09840000 r-xp 00000000 08:02 460724 /lib/libm-2.9.so
Rss: 20 kB
--
7f6c09a41000-7f6c09a5e000 r-xp 00000000 08:02 459078 /lib/ld-2.9.so
Rss: 96 kB
--
7f6c09b2c000-7f6c09b31000 r-xp 00000000 08:02 1886943 /usr/lib/libgdbm.so.3.0.0
Rss: 12 kB
--
7fff54d4f000-7fff54d50000 r-xp 00000000 00:00 0 [vdso]
Rss: 4 kB
--
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
Rss: 0 kB
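The grep pipeline above can also be done programmatically. Here is a small Python sketch (illustrative only, assuming the smaps layout shown above) that sums the Rss of the executable (r-xp) mappings:

```python
import re

def exec_rss_kb(smaps_text):
    """Sum Rss (in kB) over mappings whose permission bits include 'x'."""
    total = 0
    in_exec_vma = False
    for line in smaps_text.splitlines():
        # Mapping header lines look like: "00400000-005fe000 r-xp ..."
        header = re.match(r'^[0-9a-f]+-[0-9a-f]+\s+(\S{4})', line)
        if header:
            in_exec_vma = 'x' in header.group(1)
        elif in_exec_vma and line.startswith('Rss:'):
            total += int(line.split()[1])
            in_exec_vma = False  # one Rss line per mapping
    return total

sample = """00400000-005fe000 r-xp 00000000 08:02 1890389 /usr/bin/kvm
Rss: 680 kB
7f6c04019000-7f6c04218000 ---p 00020000 08:02 458890 /lib/libx86.so.1
Rss: 0 kB"""
print(exec_rss_kb(sample))  # 680
```

Feeding the full /proc/<pid>/smaps of the kvm process through this yields the ~3MB total quoted above.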
^ permalink raw reply [flat|nested] 243+ messages in thread
* RE: [RFC] respect the referenced bit of KVM guest pages?
2009-09-03 2:04 ` Wu Fengguang
@ 2009-09-04 20:06 ` Dike, Jeffrey G
0 siblings, 0 replies; 243+ messages in thread
From: Dike, Jeffrey G @ 2009-09-04 20:06 UTC (permalink / raw)
To: Wu, Fengguang
Cc: Rik van Riel, Johannes Weiner, Avi Kivity, KOSAKI Motohiro,
Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm
Stupid question - what in your patch allows a text page to get kicked out to the inactive list after you've given it an extra pass through the active list?
Jeff
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-09-04 20:06 ` Dike, Jeffrey G
@ 2009-09-04 20:57 ` Rik van Riel
0 siblings, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2009-09-04 20:57 UTC (permalink / raw)
To: Dike, Jeffrey G
Cc: Wu, Fengguang, Johannes Weiner, Avi Kivity, KOSAKI Motohiro,
Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm
Dike, Jeffrey G wrote:
> Stupid question - what in your patch allows a text page to get kicked out to the inactive list after you've given it an extra pass through the active list?
If it did not get referenced during its second pass through
the active list, it will get deactivated.
--
All rights reversed.
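Rik's answer can be sketched as a toy state machine (Python, illustrative only; the actual code in shrink_active_list checks VM_EXEC / PageAnon together with the referenced bit):

```python
def scan_active(page):
    """One pass of a simplified active-list scan: a referenced page
    gets one more trip around the active list (its referenced bit is
    cleared in the process); an unreferenced page is deactivated."""
    if page['referenced']:
        page['referenced'] = False  # cleared when the scan observes it
        return 'active'
    return 'inactive'

page = {'referenced': True}
print(scan_active(page))  # first pass: rescued, prints 'active'
print(scan_active(page))  # second pass: deactivated, prints 'inactive'
```

So the rescue only buys the page one extra round trip unless it is touched again in the meantime.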
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
@ 2009-09-04 20:57 ` Rik van Riel
0 siblings, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2009-09-04 20:57 UTC (permalink / raw)
To: Dike, Jeffrey G
Cc: Wu, Fengguang, Johannes Weiner, Avi Kivity, KOSAKI Motohiro,
Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm
Dike, Jeffrey G wrote:
> Stupid question - what in your patch allows a text page get kicked out to the inactive list after you've given it an extra pass through the active list?
If it did not get referenced during its second pass through
the active list, it will get deactivated.
--
All rights reversed.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages?
2009-08-14 21:42 ` Dike, Jeffrey G
@ 2009-09-13 16:23 ` KOSAKI Motohiro
0 siblings, 0 replies; 243+ messages in thread
From: KOSAKI Motohiro @ 2009-09-13 16:23 UTC (permalink / raw)
To: Dike, Jeffrey G
Cc: Wu, Fengguang, Johannes Weiner, Avi Kivity, Rik van Riel,
Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins,
Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm
Hi Jeff,
> A side note - I've been doing some tracing and shrink_active_list is called a humongous number of times (25000-ish during a ~90 kvm run), with a net result of zero pages moved nearly all the time. Your test is rescuing essentially all candidate pages from the inactive list. Right now, I have the VM_EXEC || PageAnon version of your test.
Sorry for the long-delayed reply.
I set up a reproduction environment today, but no luck: I couldn't
reproduce the stack refault issue.
Could you please describe your reproduction steps and your analysis method in detail?
My environment is,
x86_64 CPUx4 MEM 6G
userland: fedora11
kernel: latest mmotm
cgroup size: 128M
guest mem: 256M
CONFIG_KSM=n
My results:
- plenty of anon and file faults happen, but that is expected; they are
caused by demand paging.
- do_anonymous_page handles almost no stack faults, on both host and guest.
^ permalink raw reply [flat|nested] 243+ messages in thread
end of thread, other threads:[~2009-09-13 16:30 UTC | newest]
Thread overview: 243+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-08-05 2:40 [RFC] respect the referenced bit of KVM guest pages? Wu Fengguang
2009-08-05 2:40 ` Wu Fengguang
2009-08-05 4:15 ` KOSAKI Motohiro
2009-08-05 4:15 ` KOSAKI Motohiro
2009-08-05 4:41 ` Wu Fengguang
2009-08-05 4:41 ` Wu Fengguang
2009-08-05 7:58 ` Avi Kivity
2009-08-05 7:58 ` Avi Kivity
2009-08-05 8:17 ` Avi Kivity
2009-08-05 8:17 ` Avi Kivity
2009-08-05 14:33 ` Rik van Riel
2009-08-05 14:33 ` Rik van Riel
2009-08-05 15:37 ` Avi Kivity
2009-08-05 15:37 ` Avi Kivity
2009-08-05 14:15 ` Rik van Riel
2009-08-05 14:15 ` Rik van Riel
2009-08-05 15:12 ` Avi Kivity
2009-08-05 15:12 ` Avi Kivity
2009-08-05 15:15 ` Rik van Riel
2009-08-05 15:15 ` Rik van Riel
2009-08-05 15:25 ` Avi Kivity
2009-08-05 15:25 ` Avi Kivity
2009-08-05 16:35 ` Andrea Arcangeli
2009-08-05 16:35 ` Andrea Arcangeli
2009-08-05 16:31 ` Andrea Arcangeli
2009-08-05 16:31 ` Andrea Arcangeli
2009-08-05 17:25 ` Rik van Riel
2009-08-05 17:25 ` Rik van Riel
2009-08-05 15:45 ` Dike, Jeffrey G
2009-08-05 15:45 ` Dike, Jeffrey G
2009-08-05 16:05 ` Andrea Arcangeli
2009-08-05 16:05 ` Andrea Arcangeli
2009-08-05 16:12 ` Dike, Jeffrey G
2009-08-05 16:12 ` Dike, Jeffrey G
2009-08-05 16:19 ` Andrea Arcangeli
2009-08-05 16:19 ` Andrea Arcangeli
2009-08-05 15:58 ` Andrea Arcangeli
2009-08-05 15:58 ` Andrea Arcangeli
2009-08-05 17:20 ` Rik van Riel
2009-08-05 17:20 ` Rik van Riel
2009-08-05 17:42 ` Rik van Riel
2009-08-05 17:42 ` Rik van Riel
2009-08-06 10:15 ` Andrea Arcangeli
2009-08-06 10:15 ` Andrea Arcangeli
2009-08-06 10:08 ` Andrea Arcangeli
2009-08-06 10:08 ` Andrea Arcangeli
2009-08-06 10:18 ` Avi Kivity
2009-08-06 10:18 ` Avi Kivity
2009-08-06 10:20 ` Andrea Arcangeli
2009-08-06 10:20 ` Andrea Arcangeli
2009-08-06 10:59 ` Wu Fengguang
2009-08-06 10:59 ` Wu Fengguang
2009-08-06 11:44 ` Avi Kivity
2009-08-06 11:44 ` Avi Kivity
2009-08-06 13:06 ` Wu Fengguang
2009-08-06 13:06 ` Wu Fengguang
2009-08-06 13:16 ` Rik van Riel
2009-08-06 13:16 ` Rik van Riel
2009-08-16 3:28 ` Wu Fengguang
2009-08-16 3:28 ` Wu Fengguang
2009-08-16 3:56 ` Rik van Riel
2009-08-16 3:56 ` Rik van Riel
2009-08-16 4:43 ` Balbir Singh
2009-08-16 4:43 ` Balbir Singh
2009-08-16 4:55 ` Wu Fengguang
2009-08-16 4:55 ` Wu Fengguang
2009-08-16 5:59 ` Balbir Singh
2009-08-16 5:59 ` Balbir Singh
2009-08-17 19:47 ` Dike, Jeffrey G
2009-08-17 19:47 ` Dike, Jeffrey G
2009-08-21 18:24 ` Balbir Singh
2009-08-21 18:24 ` Balbir Singh
2009-08-31 19:43 ` Dike, Jeffrey G
2009-08-31 19:43 ` Dike, Jeffrey G
2009-08-31 19:52 ` Rik van Riel
2009-08-31 19:52 ` Rik van Riel
2009-08-31 20:06 ` Dike, Jeffrey G
2009-08-31 20:06 ` Dike, Jeffrey G
2009-08-31 20:09 ` Rik van Riel
2009-08-31 20:09 ` Rik van Riel
2009-08-31 20:11 ` Dike, Jeffrey G
2009-08-31 20:11 ` Dike, Jeffrey G
2009-08-31 20:42 ` Balbir Singh
2009-08-31 20:42 ` Balbir Singh
2009-08-06 13:46 ` Avi Kivity
2009-08-06 13:46 ` Avi Kivity
2009-08-06 21:09 ` Jeff Dike
2009-08-06 21:09 ` Jeff Dike
2009-08-16 3:18 ` Wu Fengguang
2009-08-16 3:18 ` Wu Fengguang
2009-08-16 3:53 ` Rik van Riel
2009-08-16 3:53 ` Rik van Riel
2009-08-16 5:15 ` Wu Fengguang
2009-08-16 5:15 ` Wu Fengguang
2009-08-16 11:29 ` Wu Fengguang
2009-08-16 11:29 ` Wu Fengguang
2009-08-17 14:33 ` Minchan Kim
2009-08-17 14:33 ` Minchan Kim
2009-08-18 2:34 ` Wu Fengguang
2009-08-18 2:34 ` Wu Fengguang
2009-08-18 4:17 ` Minchan Kim
2009-08-18 4:17 ` Minchan Kim
2009-08-18 9:31 ` Wu Fengguang
2009-08-18 9:31 ` Wu Fengguang
2009-08-18 9:52 ` Minchan Kim
2009-08-18 10:00 ` Wu Fengguang
2009-08-18 11:00 ` Minchan Kim
2009-08-18 11:11 ` Wu Fengguang
2009-08-18 14:03 ` Minchan Kim
2009-08-18 16:27 ` KOSAKI Motohiro
2009-08-18 15:57 ` KOSAKI Motohiro
2009-08-19 12:01 ` Wu Fengguang
2009-08-19 12:05 ` KOSAKI Motohiro
2009-08-19 12:10 ` Wu Fengguang
2009-08-19 12:25 ` Minchan Kim
2009-08-19 13:19 ` KOSAKI Motohiro
2009-08-19 13:28 ` Minchan Kim
2009-08-21 11:17 ` KOSAKI Motohiro
2009-08-19 13:24 ` Wu Fengguang
2009-08-19 13:38 ` Minchan Kim
2009-08-19 14:00 ` Wu Fengguang
2009-08-06 13:13 ` Rik van Riel
2009-08-06 13:49 ` Avi Kivity
2009-08-07 3:11 ` KOSAKI Motohiro
2009-08-07 7:54 ` Balbir Singh
2009-08-07 8:24 ` KAMEZAWA Hiroyuki
2009-08-06 13:11 ` Rik van Riel
2009-08-06 13:08 ` Rik van Riel
2009-08-07 3:17 ` KOSAKI Motohiro
2009-08-12 7:48 ` Wu Fengguang
2009-08-12 14:31 ` Rik van Riel
2009-08-13 1:03 ` Wu Fengguang
2009-08-13 15:46 ` Rik van Riel
2009-08-13 16:12 ` Avi Kivity
2009-08-13 16:26 ` Rik van Riel
2009-08-13 19:12 ` Avi Kivity
2009-08-13 21:16 ` Johannes Weiner
2009-08-14 7:16 ` Avi Kivity
2009-08-14 9:10 ` Johannes Weiner
2009-08-14 9:51 ` Wu Fengguang
2009-08-14 13:19 ` Rik van Riel
2009-08-15 5:45 ` Wu Fengguang
2009-08-16 5:09 ` Balbir Singh
2009-08-16 5:41 ` Wu Fengguang
2009-08-16 5:50 ` Wu Fengguang
2009-08-18 15:57 ` KOSAKI Motohiro
2009-08-17 18:04 ` Dike, Jeffrey G
2009-08-18 2:26 ` Wu Fengguang
2009-09-02 19:30 ` Dike, Jeffrey G
2009-09-03 2:04 ` Wu Fengguang
2009-09-04 20:06 ` Dike, Jeffrey G
2009-09-04 20:57 ` Rik van Riel
2009-08-18 15:57 ` KOSAKI Motohiro
2009-08-19 12:08 ` Wu Fengguang
2009-08-19 13:40 ` [RFC] memcg: move definitions to .h and inline some functions Wu Fengguang
2009-08-19 14:18 ` KAMEZAWA Hiroyuki
2009-08-19 14:27 ` Balbir Singh
2009-08-20 1:34 ` Wu Fengguang
2009-08-14 21:42 ` [RFC] respect the referenced bit of KVM guest pages? Dike, Jeffrey G
2009-08-14 22:37 ` Rik van Riel
2009-08-15 5:32 ` Wu Fengguang
2009-09-13 16:23 ` KOSAKI Motohiro
2009-08-05 17:53 ` Rik van Riel
2009-08-05 19:00 ` Dike, Jeffrey G
2009-08-05 19:07 ` Rik van Riel
2009-08-05 19:18 ` Dike, Jeffrey G
2009-08-06 9:22 ` Avi Kivity
2009-08-06 9:25 ` Wu Fengguang
2009-08-06 9:35 ` Avi Kivity
2009-08-06 9:35 ` Wu Fengguang
2009-08-06 9:59 ` Avi Kivity
2009-08-06 9:59 ` Wu Fengguang
2009-08-06 10:14 ` Avi Kivity
2009-08-07 1:25 ` KAMEZAWA Hiroyuki