* [PATCH 0/2 V2] ksm: take dirty bit as reference to avoid volatile pages scanning
@ 2011-06-21 12:55 ` Nai Xia
  0 siblings, 0 replies; 96+ messages in thread
From: Nai Xia @ 2011-06-21 12:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Izik Eidus, Andrea Arcangeli, Hugh Dickins, Chris Wright,
	Rik van Riel, linux-mm, Johannes Weiner, linux-kernel

Compared to the first version, this patch set addresses the problem of
dirty bit updating for virtual machines by adding two mmu_notifier
interfaces, so KSM can now track the volatile working set inside a KVM
guest OS.
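
For reference, these are the two new callbacks as they are added to
struct mmu_notifier_ops by the mmu_notifier/kvm patch in this series
(shown here only as a summary of the interface):

  int (*dirty_update)(struct mmu_notifier *mn,
                      struct mm_struct *mm);

  int (*test_and_clear_dirty)(struct mmu_notifier *mn,
                              struct mm_struct *mm,
                              unsigned long address);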

V1 log:
Currently, ksm uses a page checksum to detect volatile pages. Izik Eidus
suggested that the pte dirty bit could be used instead as an optimization.
This patch series adds that logic.
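
To illustrate why the dirty bit is cheaper than the checksum: the old scheme
hashes all 4096 bytes of a page on every scan, while the new scheme only
tests and clears one bit that the hardware already maintains. The sketch
below is purely illustrative (user-space C with made-up names, not the
kernel implementation):

  #include <stddef.h>
  #include <stdint.h>

  #define PAGE_SIZE 4096

  /* Old scheme: hash the whole page and compare with the previous scan. */
  static int changed_by_checksum(const unsigned char *page, uint32_t *old_sum)
  {
          uint32_t sum = 0;
          size_t i;

          for (i = 0; i < PAGE_SIZE; i++)     /* touches 4096 bytes */
                  sum = sum * 31 + page[i];
          if (sum != *old_sum) {
                  *old_sum = sum;
                  return 1;                   /* volatile: skip merging */
          }
          return 0;
  }

  /* New scheme: test and clear a single bit the MMU keeps up to date. */
  struct fake_pte { unsigned dirty : 1; };

  static int changed_by_dirty_bit(struct fake_pte *pte)
  {
          int dirty = pte->dirty;

          pte->dirty = 0;                     /* cleared without a TLB flush */
          return dirty;
  }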

Preliminary benchmarks show that the scan speed is improved by up to 16 
times on volatile transparent huge pages and up to 8 times on volatile 
regular pages.

The following test program demonstrates this peak speedup (you need to make
ksmd consume more than about 90% of a CPU and watch ksm/full_scans).

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/mman.h>

  #ifndef MADV_MERGEABLE
  #define MADV_MERGEABLE   12
  #endif

  #define SIZE (2000UL*1024*1024)
  #define PAGE_SIZE 4096

  int main(int argc, char **argv)
  {
        unsigned char *p;
        unsigned long j;
        int ret;

        p = mmap(NULL, SIZE, PROT_WRITE|PROT_READ,
                 MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        ret = madvise(p, SIZE, MADV_MERGEABLE);
        if (ret == -1) {
                perror("madvise");
                return 1;
        }

        memset(p, 1, SIZE);

        /* Keep dirtying the last word of every page forever. */
        while (1) {
                for (j = 0; j < SIZE; j += PAGE_SIZE)
                        *((long *)(p + j + PAGE_SIZE - sizeof(long))) = random();
        }

        return 0;
  }

* [PATCH 1/2 V2] ksm: take dirty bit as reference to avoid volatile pages scanning
  2011-06-21 12:55 ` Nai Xia
@ 2011-06-21 13:26   ` Nai Xia
  -1 siblings, 0 replies; 96+ messages in thread
From: Nai Xia @ 2011-06-21 13:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Izik Eidus, Andrea Arcangeli, Hugh Dickins, Chris Wright,
	Rik van Riel, linux-mm, Johannes Weiner, linux-kernel

This patch lets page_check_address() validate whether a subpage is at its
expected position within the huge page mapped at the given address. This is
useful because ksm does not split huge pages while looking up the subpages
one by one.

It also fixes two potential bugs at the same time:

As far as I understand, there is a bug in __page_check_address() that may
trigger a rare schedule-in-atomic on huge pages if CONFIG_HIGHPTE is enabled:
when a hugetlb page is validated by this function, the returned pte_t * is
actually a pmd_t * that was never mapped with kmap_atomic(), yet it will
later be passed to kunmap_atomic(). This can unbalance the preempt count.
This patch adds a new parameter, "need_pte_unmap", so the function can tell
its callers whether the entry refers to a huge page and therefore must not
be pte_unmap()'d. All call sites have been converted to a new uniform
helper, page_check_address_unmap_unlock(ptl, pte, need_pte_unmap), to
finalize page_check_address().

Another possible minor issue is in huge_pte_offset(): when it is called from
__page_check_address(), there is no solid guarantee that the "address"
passed in is really mapped by a huge page, even if PageHuge(page) is true,
so it is too early to return a pmd without checking its _PAGE_PSE bit.

I am not an expert in this area and there may be no bug reports concerning
the two issues above, but I believe the risk is real and the reasoning is
simple, so please help me confirm them.
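
For reference, the new calling convention looks roughly like this (a sketch
of a caller, not a complete function; error handling and the surrounding
rmap logic are omitted):

  	pte_t *pte;
  	spinlock_t *ptl;
  	int need_pte_unmap;

  	/*
  	 * need_pte_unmap reports whether the returned entry is a real pte
  	 * (mapped with pte_offset_map()) or a pmd standing in for a huge
  	 * page, which must not be pte_unmap()'d.
  	 */
  	pte = page_check_address(page, mm, address, &ptl, 0, &need_pte_unmap);
  	if (!pte)
  		goto out;	/* page is not mapped at this address */

  	/* ... inspect or modify *pte while holding ptl ... */

  	page_check_address_unmap_unlock(ptl, pte, need_pte_unmap);
  out:
  	return;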

---
 arch/x86/mm/hugetlbpage.c |    2 +
 include/linux/rmap.h      |   26 +++++++++++++++---
 mm/filemap_xip.c          |    6 +++-
 mm/rmap.c                 |   61 +++++++++++++++++++++++++++++++++------------
 4 files changed, 72 insertions(+), 23 deletions(-)

diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index f581a18..132e84b 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -164,6 +164,8 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
 			if (pud_large(*pud))
 				return (pte_t *)pud;
 			pmd = pmd_offset(pud, addr);
+			if (!pmd_huge(*pmd))
+				pmd = NULL;
 		}
 	}
 	return (pte_t *) pmd;
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 2148b12..3c4ead9 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -9,6 +9,7 @@
 #include <linux/mm.h>
 #include <linux/mutex.h>
 #include <linux/memcontrol.h>
+#include <linux/highmem.h>
 
 /*
  * The anon_vma heads a list of private "related" vmas, to scan if
@@ -183,20 +184,35 @@ int try_to_unmap_one(struct page *, struct vm_area_struct *,
  * Called from mm/filemap_xip.c to unmap empty zero page
  */
 pte_t *__page_check_address(struct page *, struct mm_struct *,
-				unsigned long, spinlock_t **, int);
+			    unsigned long, spinlock_t **, int, int *);
 
-static inline pte_t *page_check_address(struct page *page, struct mm_struct *mm,
-					unsigned long address,
-					spinlock_t **ptlp, int sync)
+static inline
+pte_t *page_check_address(struct page *page, struct mm_struct *mm,
+			  unsigned long address, spinlock_t **ptlp,
+			  int sync, int *need_pte_unmap)
 {
 	pte_t *ptep;
 
 	__cond_lock(*ptlp, ptep = __page_check_address(page, mm, address,
-						       ptlp, sync));
+						       ptlp, sync,
+						       need_pte_unmap));
 	return ptep;
 }
 
 /*
+ * After a successful page_check_address() call this is the way to finalize
+ */
+static inline
+void page_check_address_unmap_unlock(spinlock_t *ptl, pte_t *pte,
+				     int need_pte_unmap)
+{
+	if (need_pte_unmap)
+		pte_unmap(pte);
+
+	spin_unlock(ptl);
+}
+
+/*
  * Used by swapoff to help locate where page is expected in vma.
  */
 unsigned long page_address_in_vma(struct page *, struct vm_area_struct *);
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index 93356cd..01b6454 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -175,6 +175,7 @@ __xip_unmap (struct address_space * mapping,
 	struct page *page;
 	unsigned count;
 	int locked = 0;
+	int need_pte_unmap;
 
 	count = read_seqcount_begin(&xip_sparse_seq);
 
@@ -189,7 +190,8 @@ retry:
 		address = vma->vm_start +
 			((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
 		BUG_ON(address < vma->vm_start || address >= vma->vm_end);
-		pte = page_check_address(page, mm, address, &ptl, 1);
+		pte = page_check_address(page, mm, address, &ptl, 1,
+					 &need_pte_unmap);
 		if (pte) {
 			/* Nuke the page table entry. */
 			flush_cache_page(vma, address, pte_pfn(*pte));
@@ -197,7 +199,7 @@ retry:
 			page_remove_rmap(page);
 			dec_mm_counter(mm, MM_FILEPAGES);
 			BUG_ON(pte_dirty(pteval));
-			pte_unmap_unlock(pte, ptl);
+			page_check_address_unmap_unlock(ptl, pte, need_pte_unmap);
 			page_cache_release(page);
 		}
 	}
diff --git a/mm/rmap.c b/mm/rmap.c
index 27dfd3b..815adc9 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -573,17 +573,25 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
  * On success returns with pte mapped and locked.
  */
 pte_t *__page_check_address(struct page *page, struct mm_struct *mm,
-			  unsigned long address, spinlock_t **ptlp, int sync)
+			    unsigned long address, spinlock_t **ptlp,
+			    int sync, int *need_pte_unmap)
 {
 	pgd_t *pgd;
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
 	spinlock_t *ptl;
+	unsigned long sub_pfn;
+
+	*need_pte_unmap = 1;
 
 	if (unlikely(PageHuge(page))) {
 		pte = huge_pte_offset(mm, address);
+		if (!pte_present(*pte))
+			return NULL;
+
 		ptl = &mm->page_table_lock;
+		*need_pte_unmap = 0;
 		goto check;
 	}
 
@@ -598,8 +606,12 @@ pte_t *__page_check_address(struct page *page, struct mm_struct *mm,
 	pmd = pmd_offset(pud, address);
 	if (!pmd_present(*pmd))
 		return NULL;
-	if (pmd_trans_huge(*pmd))
-		return NULL;
+	if (pmd_trans_huge(*pmd)) {
+		pte = (pte_t *) pmd;
+		ptl = &mm->page_table_lock;
+		*need_pte_unmap = 0;
+		goto check;
+	}
 
 	pte = pte_offset_map(pmd, address);
 	/* Make a quick check before getting the lock */
@@ -611,11 +623,23 @@ pte_t *__page_check_address(struct page *page, struct mm_struct *mm,
 	ptl = pte_lockptr(mm, pmd);
 check:
 	spin_lock(ptl);
-	if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) {
+	if (!*need_pte_unmap) {
+		sub_pfn = pte_pfn(*pte) +
+			((address & ~HPAGE_PMD_MASK) >> PAGE_SHIFT);
+
+		if (pte_present(*pte) && page_to_pfn(page) == sub_pfn) {
+			*ptlp = ptl;
+			return pte;
+		}
+	} else if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) {
 		*ptlp = ptl;
 		return pte;
 	}
-	pte_unmap_unlock(pte, ptl);
+
+	if (*need_pte_unmap)
+		pte_unmap(pte);
+
+	spin_unlock(ptl);
 	return NULL;
 }
 
@@ -633,14 +657,15 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
 	unsigned long address;
 	pte_t *pte;
 	spinlock_t *ptl;
+	int need_pte_unmap;
 
 	address = vma_address(page, vma);
 	if (address == -EFAULT)		/* out of vma range */
 		return 0;
-	pte = page_check_address(page, vma->vm_mm, address, &ptl, 1);
+	pte = page_check_address(page, vma->vm_mm, address, &ptl, 1, &need_pte_unmap);
 	if (!pte)			/* the page is not in this mm */
 		return 0;
-	pte_unmap_unlock(pte, ptl);
+	page_check_address_unmap_unlock(ptl, pte, need_pte_unmap);
 
 	return 1;
 }
@@ -685,12 +710,14 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 	} else {
 		pte_t *pte;
 		spinlock_t *ptl;
+		int need_pte_unmap;
 
 		/*
 		 * rmap might return false positives; we must filter
 		 * these out using page_check_address().
 		 */
-		pte = page_check_address(page, mm, address, &ptl, 0);
+		pte = page_check_address(page, mm, address, &ptl, 0,
+					 &need_pte_unmap);
 		if (!pte)
 			goto out;
 
@@ -712,7 +739,7 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 			if (likely(!VM_SequentialReadHint(vma)))
 				referenced++;
 		}
-		pte_unmap_unlock(pte, ptl);
+		page_check_address_unmap_unlock(ptl, pte, need_pte_unmap);
 	}
 
 	/* Pretend the page is referenced if the task has the
@@ -886,8 +913,9 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 	pte_t *pte;
 	spinlock_t *ptl;
 	int ret = 0;
+	int need_pte_unmap;
 
-	pte = page_check_address(page, mm, address, &ptl, 1);
+	pte = page_check_address(page, mm, address, &ptl, 1, &need_pte_unmap);
 	if (!pte)
 		goto out;
 
@@ -902,7 +930,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 		ret = 1;
 	}
 
-	pte_unmap_unlock(pte, ptl);
+	page_check_address_unmap_unlock(ptl, pte, need_pte_unmap);
 out:
 	return ret;
 }
@@ -974,9 +1002,9 @@ void page_move_anon_rmap(struct page *page,
 
 /**
  * __page_set_anon_rmap - set up new anonymous rmap
- * @page:	Page to add to rmap	
+ * @page:	Page to add to rmap
  * @vma:	VM area to add page to.
- * @address:	User virtual address of the mapping	
+ * @address:	User virtual address of the mapping
  * @exclusive:	the page is exclusively owned by the current process
  */
 static void __page_set_anon_rmap(struct page *page,
@@ -1176,8 +1204,9 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	pte_t pteval;
 	spinlock_t *ptl;
 	int ret = SWAP_AGAIN;
+	int need_pte_unmap;
 
-	pte = page_check_address(page, mm, address, &ptl, 0);
+	pte = page_check_address(page, mm, address, &ptl, 0, &need_pte_unmap);
 	if (!pte)
 		goto out;
 
@@ -1262,12 +1291,12 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	page_cache_release(page);
 
 out_unmap:
-	pte_unmap_unlock(pte, ptl);
+	page_check_address_unmap_unlock(ptl, pte, need_pte_unmap);
 out:
 	return ret;
 
 out_mlock:
-	pte_unmap_unlock(pte, ptl);
+	page_check_address_unmap_unlock(ptl, pte, need_pte_unmap);
 
 
 	/*

* [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
  2011-06-21 12:55 ` Nai Xia
@ 2011-06-21 13:32   ` Nai Xia
  -1 siblings, 0 replies; 96+ messages in thread
From: Nai Xia @ 2011-06-21 13:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Izik Eidus, Andrea Arcangeli, Hugh Dickins, Chris Wright,
	Rik van Riel, linux-mm, Johannes Weiner, linux-kernel, kvm

This patch introduces kvm_mmu_notifier_test_and_clear_dirty(),
kvm_mmu_notifier_dirty_update() and their mmu_notifier interfaces to support
KSM dirty bit tracking, which brings a significant performance gain to
volatile page scanning in KSM.

Currently, kvm_mmu_notifier_dirty_update() returns 0 if and only if Intel
EPT is enabled, to indicate that the dirty bits of the underlying sptes are
not updated by hardware.
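
As a simplified sketch of the intended mm-side usage (the helpers marked
"hypothetical" are made-up names for this example; the real consumer is the
KSM patch in this series):

  	/*
  	 * If the secondary MMU (e.g. EPT) does not maintain dirty bits in
  	 * its sptes, the caller must fall back to another volatility test,
  	 * such as the old checksum.
  	 */
  	if (!mmu_notifier_dirty_update(mm))
  		return page_changed_by_checksum(page);	/* hypothetical */

  	/* Combine the host pte dirty bit with the sptes' dirty bits. */
  	dirty = host_ptep_test_and_clear_dirty(ptep);	/* hypothetical */
  	dirty |= mmu_notifier_test_and_clear_dirty(mm, address);
  	if (dirty)
  		set_page_dirty(page);	/* callers must do this themselves */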

Signed-off-by: Nai Xia <nai.xia@gmail.com>
Acked-by: Izik Eidus <izik.eidus@ravellosystems.com>
---
 arch/x86/include/asm/kvm_host.h |    1 +
 arch/x86/kvm/mmu.c              |   36 +++++++++++++++++++++++++++++
 arch/x86/kvm/mmu.h              |    3 +-
 arch/x86/kvm/vmx.c              |    1 +
 include/linux/kvm_host.h        |    2 +-
 include/linux/mmu_notifier.h    |   48 +++++++++++++++++++++++++++++++++++++++
 mm/mmu_notifier.c               |   33 ++++++++++++++++++++++++++
 virt/kvm/kvm_main.c             |   27 ++++++++++++++++++++++
 8 files changed, 149 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d2ac8e2..f0d7aa0 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -848,6 +848,7 @@ extern bool kvm_rebooting;
 int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
 int kvm_age_hva(struct kvm *kvm, unsigned long hva);
 int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
+int kvm_test_and_clear_dirty_hva(struct kvm *kvm, unsigned long hva);
 void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
 int cpuid_maxphyaddr(struct kvm_vcpu *vcpu);
 int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index aee3862..a5a0c51 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -979,6 +979,37 @@ out:
 	return young;
 }
 
+/*
+ * The caller is expected to call SetPageDirty(); it is not done in here.
+ */
+static
+int kvm_test_and_clear_dirty_rmapp(struct kvm *kvm, unsigned long *rmapp,
+				   unsigned long data)
+{
+	u64 *spte;
+	int dirty = 0;
+
+	if (!shadow_dirty_mask) {
+		WARN(1, "KVM: do NOT try to test dirty bit in EPT\n");
+		goto out;
+	}
+
+	spte = rmap_next(kvm, rmapp, NULL);
+	while (spte) {
+		int _dirty;
+		u64 _spte = *spte;
+		BUG_ON(!(_spte & PT_PRESENT_MASK));
+		_dirty = _spte & PT_DIRTY_MASK;
+		if (_dirty) {
+			dirty = 1;
+			clear_bit(PT_DIRTY_SHIFT, (unsigned long *)spte);
+		}
+		spte = rmap_next(kvm, rmapp, spte);
+	}
+out:
+	return dirty;
+}
+
 #define RMAP_RECYCLE_THRESHOLD 1000
 
 static void rmap_recycle(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)
@@ -1004,6 +1035,11 @@ int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
 	return kvm_handle_hva(kvm, hva, 0, kvm_test_age_rmapp);
 }
 
+int kvm_test_and_clear_dirty_hva(struct kvm *kvm, unsigned long hva)
+{
+	return kvm_handle_hva(kvm, hva, 0, kvm_test_and_clear_dirty_rmapp);
+}
+
 #ifdef MMU_DEBUG
 static int is_empty_shadow_page(u64 *spt)
 {
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 7086ca8..b8d01c3 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -18,7 +18,8 @@
 #define PT_PCD_MASK (1ULL << 4)
 #define PT_ACCESSED_SHIFT 5
 #define PT_ACCESSED_MASK (1ULL << PT_ACCESSED_SHIFT)
-#define PT_DIRTY_MASK (1ULL << 6)
+#define PT_DIRTY_SHIFT 6
+#define PT_DIRTY_MASK (1ULL << PT_DIRTY_SHIFT)
 #define PT_PAGE_SIZE_MASK (1ULL << 7)
 #define PT_PAT_MASK (1ULL << 7)
 #define PT_GLOBAL_MASK (1ULL << 8)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index d48ec60..b407a69 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4674,6 +4674,7 @@ static int __init vmx_init(void)
 		kvm_mmu_set_mask_ptes(0ull, 0ull, 0ull, 0ull,
 				VMX_EPT_EXECUTABLE_MASK);
 		kvm_enable_tdp();
+		kvm_dirty_update = 0;
 	} else
 		kvm_disable_tdp();
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 31ebb59..2036bae 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -53,7 +53,7 @@
 struct kvm;
 struct kvm_vcpu;
 extern struct kmem_cache *kvm_vcpu_cache;
-
+extern int kvm_dirty_update;
 /*
  * It would be nice to use something smarter than a linear search, TBD...
  * Thankfully we dont expect many devices to register (famous last words :),
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 1d1b1e1..bd6ba2d 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -24,6 +24,9 @@ struct mmu_notifier_mm {
 };
 
 struct mmu_notifier_ops {
+	int (*dirty_update)(struct mmu_notifier *mn,
+			     struct mm_struct *mm);
+
 	/*
 	 * Called either by mmu_notifier_unregister or when the mm is
 	 * being destroyed by exit_mmap, always before all pages are
@@ -72,6 +75,16 @@ struct mmu_notifier_ops {
 			  unsigned long address);
 
 	/*
+	 * test_and_clear_dirty is called when the VM is
+	 * test-and-clearing the dirty/modified bit in the
+	 * pte. This way the VM provides proper volatile page
+	 * detection to ksm.
+	 */
+	int (*test_and_clear_dirty)(struct mmu_notifier *mn,
+				    struct mm_struct *mm,
+				    unsigned long address);
+
+	/*
 	 * change_pte is called in cases that pte mapping to page is changed:
 	 * for example, when ksm remaps pte to point to a new shared page.
 	 */
@@ -170,11 +183,14 @@ extern int __mmu_notifier_register(struct mmu_notifier *mn,
 extern void mmu_notifier_unregister(struct mmu_notifier *mn,
 				    struct mm_struct *mm);
 extern void __mmu_notifier_mm_destroy(struct mm_struct *mm);
+extern int __mmu_notifier_dirty_update(struct mm_struct *mm);
 extern void __mmu_notifier_release(struct mm_struct *mm);
 extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
 					  unsigned long address);
 extern int __mmu_notifier_test_young(struct mm_struct *mm,
 				     unsigned long address);
+extern int __mmu_notifier_test_and_clear_dirty(struct mm_struct *mm,
+					       unsigned long address);
 extern void __mmu_notifier_change_pte(struct mm_struct *mm,
 				      unsigned long address, pte_t pte);
 extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
@@ -184,6 +200,19 @@ extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 				  unsigned long start, unsigned long end);
 
+/*
+ * For ksm to make use of the dirty bit, it needs to be sure that the dirty
+ * bits in the sptes really carry dirty information. Currently only Intel EPT
+ * does not support ksm dirty bit tracking.
+ */
+static inline int mmu_notifier_dirty_update(struct mm_struct *mm)
+{
+	if (mm_has_notifiers(mm))
+		return __mmu_notifier_dirty_update(mm);
+
+	return 1;
+}
+
 static inline void mmu_notifier_release(struct mm_struct *mm)
 {
 	if (mm_has_notifiers(mm))
@@ -206,6 +235,14 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
 	return 0;
 }
 
+static inline int mmu_notifier_test_and_clear_dirty(struct mm_struct *mm,
+						    unsigned long address)
+{
+	if (mm_has_notifiers(mm))
+		return __mmu_notifier_test_and_clear_dirty(mm, address);
+	return 0;
+}
+
 static inline void mmu_notifier_change_pte(struct mm_struct *mm,
 					   unsigned long address, pte_t pte)
 {
@@ -323,6 +360,11 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
 
 #else /* CONFIG_MMU_NOTIFIER */
 
+static inline int mmu_notifier_dirty_update(struct mm_struct *mm)
+{
+	return 1;
+}
+
 static inline void mmu_notifier_release(struct mm_struct *mm)
 {
 }
@@ -339,6 +381,12 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
 	return 0;
 }
 
+static inline int mmu_notifier_test_and_clear_dirty(struct mm_struct *mm,
+						    unsigned long address)
+{
+	return 0;
+}
+
 static inline void mmu_notifier_change_pte(struct mm_struct *mm,
 					   unsigned long address, pte_t pte)
 {
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 8d032de..a4a1467 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -18,6 +18,22 @@
 #include <linux/sched.h>
 #include <linux/slab.h>
 
+int __mmu_notifier_dirty_update(struct mm_struct *mm)
+{
+	struct mmu_notifier *mn;
+	struct hlist_node *n;
+	int dirty_update = 0;
+
+	rcu_read_lock();
+	hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) {
+		if (mn->ops->dirty_update)
+			dirty_update |= mn->ops->dirty_update(mn, mm);
+	}
+	rcu_read_unlock();
+
+	return dirty_update;
+}
+
 /*
  * This function can't run concurrently against mmu_notifier_register
  * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap
@@ -120,6 +136,23 @@ int __mmu_notifier_test_young(struct mm_struct *mm,
 	return young;
 }
 
+int __mmu_notifier_test_and_clear_dirty(struct mm_struct *mm,
+					unsigned long address)
+{
+	struct mmu_notifier *mn;
+	struct hlist_node *n;
+	int dirty = 0;
+
+	rcu_read_lock();
+	hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) {
+		if (mn->ops->test_and_clear_dirty)
+			dirty |= mn->ops->test_and_clear_dirty(mn, mm, address);
+	}
+	rcu_read_unlock();
+
+	return dirty;
+}
+
 void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
 			       pte_t pte)
 {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 96ebc06..22967c8 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -78,6 +78,8 @@ static atomic_t hardware_enable_failed;
 struct kmem_cache *kvm_vcpu_cache;
 EXPORT_SYMBOL_GPL(kvm_vcpu_cache);
 
+int kvm_dirty_update = 1;
+
 static __read_mostly struct preempt_ops kvm_preempt_ops;
 
 struct dentry *kvm_debugfs_dir;
@@ -398,6 +400,23 @@ static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
 	return young;
 }
 
+/* Caller should SetPageDirty(), no need to flush tlb */
+static int kvm_mmu_notifier_test_and_clear_dirty(struct mmu_notifier *mn,
+						 struct mm_struct *mm,
+						 unsigned long address)
+{
+	struct kvm *kvm = mmu_notifier_to_kvm(mn);
+	int dirty, idx;
+
+	idx = srcu_read_lock(&kvm->srcu);
+	spin_lock(&kvm->mmu_lock);
+	dirty = kvm_test_and_clear_dirty_hva(kvm, address);
+	spin_unlock(&kvm->mmu_lock);
+	srcu_read_unlock(&kvm->srcu, idx);
+
+	return dirty;
+}
+
 static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
 				     struct mm_struct *mm)
 {
@@ -409,14 +428,22 @@ static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
 	srcu_read_unlock(&kvm->srcu, idx);
 }
 
+static int kvm_mmu_notifier_dirty_update(struct mmu_notifier *mn,
+					 struct mm_struct *mm)
+{
+	return kvm_dirty_update;
+}
+
 static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
 	.invalidate_page	= kvm_mmu_notifier_invalidate_page,
 	.invalidate_range_start	= kvm_mmu_notifier_invalidate_range_start,
 	.invalidate_range_end	= kvm_mmu_notifier_invalidate_range_end,
 	.clear_flush_young	= kvm_mmu_notifier_clear_flush_young,
 	.test_young		= kvm_mmu_notifier_test_young,
+	.test_and_clear_dirty	= kvm_mmu_notifier_test_and_clear_dirty,
 	.change_pte		= kvm_mmu_notifier_change_pte,
 	.release		= kvm_mmu_notifier_release,
+	.dirty_update		= kvm_mmu_notifier_dirty_update,
 };
 
 static int kvm_init_mmu_notifier(struct kvm *kvm)

* [PATCH 2/2 V2] ksm: take dirty bit as reference to avoid volatile pages scanning
  2011-06-21 12:55 ` Nai Xia
@ 2011-06-21 13:36   ` Nai Xia
  -1 siblings, 0 replies; 96+ messages in thread
From: Nai Xia @ 2011-06-21 13:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Izik Eidus, Andrea Arcangeli, Hugh Dickins, Chris Wright,
	Rik van Riel, linux-mm, Johannes Weiner, linux-kernel

This patch introduces ksm_page_changed(), which consults the dirty bit of a
pte as a volatility hint. We clear the dirty bit for each pte scanned but do
not flush the TLB. For a huge page, if one of the subpages has changed, we
try to skip the whole huge page, assuming (true for now) that ksmd scans the
address space linearly.

A NEW_FLAG status bit is also added to rmap_item so that ksmd scans new VMAs
more aggressively, skipping only the pages that the dirty bits mark as
volatile. This behaviour can be enabled and disabled through KSM's sysfs
interface.
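
To make the huge page skip concrete, here is a worked example assuming x86
values of 2MB huge pages over 4kB base pages (the constants below are
illustrative; the kernel code uses HPAGE_PMD_MASK and HPAGE_SIZE directly):

  	#define EX_HPAGE_SIZE		(2UL << 20)		/* 2MB */
  	#define EX_HPAGE_PMD_MASK	(~(EX_HPAGE_SIZE - 1))

  	unsigned long address   = 0x7f0000123000UL;	/* dirty subpage */
  	unsigned long huge_skip = (address & EX_HPAGE_PMD_MASK) + EX_HPAGE_SIZE;

  	/*
  	 * huge_skip == 0x7f0000200000: while ksmd keeps scanning the same
  	 * mm linearly, every tail subpage below this address is treated as
  	 * changed without its pte being checked again.
  	 */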

Signed-off-by: Nai Xia <nai.xia@gmail.com>
Acked-by: Izik Eidus <izik.eidus@ravellosystems.com>
---
 mm/ksm.c |  189 ++++++++++++++++++++++++++++++++++++++++++++++++++-----------
 1 files changed, 155 insertions(+), 34 deletions(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index 9a68b0c..021ae6f 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -108,6 +108,7 @@ struct ksm_scan {
 	unsigned long address;
 	struct rmap_item **rmap_list;
 	unsigned long seqnr;
+	unsigned long huge_skip; /* if a huge pte is dirty, skip subpages */
 };
 
 /**
@@ -151,6 +152,7 @@ struct rmap_item {
 #define SEQNR_MASK	0x0ff	/* low bits of unstable tree seqnr */
 #define UNSTABLE_FLAG	0x100	/* is a node of the unstable tree */
 #define STABLE_FLAG	0x200	/* is listed from the stable tree */
+#define NEW_FLAG	0x400	/* this rmap_item is new */
 
 /* The stable and unstable tree heads */
 static struct rb_root root_stable_tree = RB_ROOT;
@@ -189,6 +191,13 @@ static unsigned int ksm_thread_pages_to_scan = 100;
 /* Milliseconds ksmd should sleep between batches */
 static unsigned int ksm_thread_sleep_millisecs = 20;
 
+/*
+ * Skip page changed test and merge pages the first time we scan a page, this
+ * is useful for speeding up the merging of very large VMAs, since the
+ * scanning also allocs memory.
+ */
+static unsigned int ksm_merge_at_once = 0;
+
 #define KSM_RUN_STOP	0
 #define KSM_RUN_MERGE	1
 #define KSM_RUN_UNMERGE	2
@@ -374,10 +383,15 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
 	return (ret & VM_FAULT_OOM) ? -ENOMEM : 0;
 }
 
+static inline unsigned long get_address(struct rmap_item *rmap_item)
+{
+	return rmap_item->address & PAGE_MASK;
+}
+
 static void break_cow(struct rmap_item *rmap_item)
 {
 	struct mm_struct *mm = rmap_item->mm;
-	unsigned long addr = rmap_item->address;
+	unsigned long addr = get_address(rmap_item);
 	struct vm_area_struct *vma;
 
 	/*
@@ -416,7 +430,7 @@ static struct page *page_trans_compound_anon(struct page *page)
 static struct page *get_mergeable_page(struct rmap_item *rmap_item)
 {
 	struct mm_struct *mm = rmap_item->mm;
-	unsigned long addr = rmap_item->address;
+	unsigned long addr = get_address(rmap_item);
 	struct vm_area_struct *vma;
 	struct page *page;
 
@@ -454,7 +468,7 @@ static void remove_node_from_stable_tree(struct stable_node *stable_node)
 		else
 			ksm_pages_shared--;
 		put_anon_vma(rmap_item->anon_vma);
-		rmap_item->address &= PAGE_MASK;
+		rmap_item->address &= ~STABLE_FLAG;
 		cond_resched();
 	}
 
@@ -542,7 +556,7 @@ static void remove_rmap_item_from_tree(struct rmap_item *rmap_item)
 			ksm_pages_shared--;
 
 		put_anon_vma(rmap_item->anon_vma);
-		rmap_item->address &= PAGE_MASK;
+		rmap_item->address &= ~STABLE_FLAG;
 
 	} else if (rmap_item->address & UNSTABLE_FLAG) {
 		unsigned char age;
@@ -554,12 +568,14 @@ static void remove_rmap_item_from_tree(struct rmap_item *rmap_item)
 		 * than left over from before.
 		 */
 		age = (unsigned char)(ksm_scan.seqnr - rmap_item->address);
-		BUG_ON(age > 1);
+		BUG_ON (age > 1);
+
 		if (!age)
 			rb_erase(&rmap_item->node, &root_unstable_tree);
 
 		ksm_pages_unshared--;
-		rmap_item->address &= PAGE_MASK;
+		rmap_item->address &= ~UNSTABLE_FLAG;
+		rmap_item->address &= ~SEQNR_MASK;
 	}
 out:
 	cond_resched();		/* we're called from many long loops */
@@ -705,13 +721,14 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 	spinlock_t *ptl;
 	int swapped;
 	int err = -EFAULT;
+	int need_pte_unmap;
 
 	addr = page_address_in_vma(page, vma);
 	if (addr == -EFAULT)
 		goto out;
 
 	BUG_ON(PageTransCompound(page));
-	ptep = page_check_address(page, mm, addr, &ptl, 0);
+	ptep = page_check_address(page, mm, addr, &ptl, 0, &need_pte_unmap);
 	if (!ptep)
 		goto out;
 
@@ -747,7 +764,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 	err = 0;
 
 out_unlock:
-	pte_unmap_unlock(ptep, ptl);
+	page_check_address_unmap_unlock(ptl, ptep, need_pte_unmap);
 out:
 	return err;
 }
@@ -923,12 +940,13 @@ static int try_to_merge_with_ksm_page(struct rmap_item *rmap_item,
 	struct mm_struct *mm = rmap_item->mm;
 	struct vm_area_struct *vma;
 	int err = -EFAULT;
+	unsigned long address = get_address(rmap_item);
 
 	down_read(&mm->mmap_sem);
 	if (ksm_test_exit(mm))
 		goto out;
-	vma = find_vma(mm, rmap_item->address);
-	if (!vma || vma->vm_start > rmap_item->address)
+	vma = find_vma(mm, address);
+	if (!vma || vma->vm_start > address)
 		goto out;
 
 	err = try_to_merge_one_page(vma, page, kpage);
@@ -1159,6 +1177,94 @@ static void stable_tree_append(struct rmap_item *rmap_item,
 		ksm_pages_shared++;
 }
 
+static inline unsigned long get_huge_end_addr(unsigned long address)
+{
+	return (address & HPAGE_PMD_MASK) + HPAGE_SIZE;
+}
+
+static inline int ksm_ptep_test_and_clear_dirty(pte_t *ptep)
+{
+	int ret = 0;
+
+	if (pte_dirty(*ptep))
+		ret = test_and_clear_bit(_PAGE_BIT_DIRTY,
+					 (unsigned long *) &ptep->pte);
+
+	return ret;
+}
+
+#define ksm_ptep_test_and_clear_dirty_notify(__mm, __address, __ptep)	\
+({									\
+	int __dirty;							\
+	struct mm_struct *___mm = __mm;					\
+	unsigned long ___address = __address;				\
+	__dirty = ksm_ptep_test_and_clear_dirty(__ptep);		\
+	__dirty |= mmu_notifier_test_and_clear_dirty(___mm,		\
+						     ___address);	\
+	__dirty;							\
+})
+
+/*
+ * ksm_page_changed - take the dirty bit of the pte as a hint for volatile
+ * pages. We clear the dirty bit for each pte scanned but don't flush the
+ * tlb. For huge pages, if one of the subpage has changed, we try to skip
+ * the whole huge page.
+ */
+static int ksm_page_changed(struct page *page, struct rmap_item *rmap_item)
+{
+	int ret = 1;
+	unsigned long address = get_address(rmap_item);
+	struct mm_struct *mm = rmap_item->mm;
+	pte_t *ptep;
+	spinlock_t *ptl;
+	int need_pte_unmap;
+	unsigned int checksum;
+
+	/* If the ptes are not updated by the guest OS, we rely on the checksum. */
+	if (!mmu_notifier_dirty_update(mm)) {
+		checksum = calc_checksum(page);
+		if (rmap_item->oldchecksum != checksum)
+			rmap_item->oldchecksum = checksum;
+		else
+			ret = 0;
+		goto out;
+	}
+
+	if (ksm_scan.huge_skip) {
+		/* in process of skipping a huge page */
+		if (ksm_scan.mm_slot->mm == rmap_item->mm &&
+		    PageTail(page) && address < ksm_scan.huge_skip) {
+			ret = 1;
+			goto out;
+		} else {
+			ksm_scan.huge_skip = 0;
+		}
+	}
+
+	ptep = page_check_address(page, mm, address, &ptl, 0, &need_pte_unmap);
+	if (!ptep)
+		goto out;
+
+	if (ksm_ptep_test_and_clear_dirty_notify(mm, address, ptep)) {
+		set_page_dirty(page);
+		if (PageTransCompound(page))
+			ksm_scan.huge_skip = get_huge_end_addr(address);
+	} else {
+		ret = 0;
+	}
+
+	page_check_address_unmap_unlock(ptl, ptep, need_pte_unmap);
+
+out:
+	/* This is simply to speed up merging in the first scan. */
+	if (ksm_merge_at_once && rmap_item->address & NEW_FLAG) {
+		rmap_item->address &= ~NEW_FLAG;
+		ret = 0;
+	}
+
+	return ret;
+}
+
 /*
  * cmp_and_merge_page - first see if page can be merged into the stable tree;
  * if not, compare checksum to previous and if it's the same, see if page can
@@ -1174,7 +1280,6 @@ static void cmp_and_merge_page(struct page *page, struct rmap_item *rmap_item)
 	struct page *tree_page = NULL;
 	struct stable_node *stable_node;
 	struct page *kpage;
-	unsigned int checksum;
 	int err;
 
 	remove_rmap_item_from_tree(rmap_item);
@@ -1196,17 +1301,8 @@ static void cmp_and_merge_page(struct page *page, struct rmap_item *rmap_item)
 		return;
 	}
 
-	/*
-	 * If the hash value of the page has changed from the last time
-	 * we calculated it, this page is changing frequently: therefore we
-	 * don't want to insert it in the unstable tree, and we don't want
-	 * to waste our time searching for something identical to it there.
-	 */
-	checksum = calc_checksum(page);
-	if (rmap_item->oldchecksum != checksum) {
-		rmap_item->oldchecksum = checksum;
+	if (ksm_page_changed(page, rmap_item))
 		return;
-	}
 
 	tree_rmap_item =
 		unstable_tree_search_insert(rmap_item, page, &tree_page);
@@ -1252,9 +1348,9 @@ static struct rmap_item *get_next_rmap_item(struct mm_slot *mm_slot,
 
 	while (*rmap_list) {
 		rmap_item = *rmap_list;
-		if ((rmap_item->address & PAGE_MASK) == addr)
+		if (get_address(rmap_item) == addr)
 			return rmap_item;
-		if (rmap_item->address > addr)
+		if (get_address(rmap_item) > addr)
 			break;
 		*rmap_list = rmap_item->rmap_list;
 		remove_rmap_item_from_tree(rmap_item);
@@ -1266,6 +1362,7 @@ static struct rmap_item *get_next_rmap_item(struct mm_slot *mm_slot,
 		/* It has already been zeroed */
 		rmap_item->mm = mm_slot->mm;
 		rmap_item->address = addr;
+		rmap_item->address |= NEW_FLAG;
 		rmap_item->rmap_list = *rmap_list;
 		*rmap_list = rmap_item;
 	}
@@ -1608,12 +1705,12 @@ again:
 		struct anon_vma *anon_vma = rmap_item->anon_vma;
 		struct anon_vma_chain *vmac;
 		struct vm_area_struct *vma;
+		unsigned long address = get_address(rmap_item);
 
 		anon_vma_lock(anon_vma);
 		list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
 			vma = vmac->vma;
-			if (rmap_item->address < vma->vm_start ||
-			    rmap_item->address >= vma->vm_end)
+			if (address < vma->vm_start || address >= vma->vm_end)
 				continue;
 			/*
 			 * Initially we examine only the vma which covers this
@@ -1627,8 +1724,8 @@ again:
 			if (memcg && !mm_match_cgroup(vma->vm_mm, memcg))
 				continue;
 
-			referenced += page_referenced_one(page, vma,
-				rmap_item->address, &mapcount, vm_flags);
+			referenced += page_referenced_one(page, vma, address,
+						&mapcount, vm_flags);
 			if (!search_new_forks || !mapcount)
 				break;
 		}
@@ -1661,12 +1758,12 @@ again:
 		struct anon_vma *anon_vma = rmap_item->anon_vma;
 		struct anon_vma_chain *vmac;
 		struct vm_area_struct *vma;
+		unsigned long address = get_address(rmap_item);
 
 		anon_vma_lock(anon_vma);
 		list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
 			vma = vmac->vma;
-			if (rmap_item->address < vma->vm_start ||
-			    rmap_item->address >= vma->vm_end)
+			if (address < vma->vm_start || address >= vma->vm_end)
 				continue;
 			/*
 			 * Initially we examine only the vma which covers this
@@ -1677,8 +1774,7 @@ again:
 			if ((rmap_item->mm == vma->vm_mm) == search_new_forks)
 				continue;
 
-			ret = try_to_unmap_one(page, vma,
-					rmap_item->address, flags);
+			ret = try_to_unmap_one(page, vma, address, flags);
 			if (ret != SWAP_AGAIN || !page_mapped(page)) {
 				anon_vma_unlock(anon_vma);
 				goto out;
@@ -1713,12 +1809,12 @@ again:
 		struct anon_vma *anon_vma = rmap_item->anon_vma;
 		struct anon_vma_chain *vmac;
 		struct vm_area_struct *vma;
+		unsigned long address = get_address(rmap_item);
 
 		anon_vma_lock(anon_vma);
 		list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
 			vma = vmac->vma;
-			if (rmap_item->address < vma->vm_start ||
-			    rmap_item->address >= vma->vm_end)
+			if (address < vma->vm_start || address >= vma->vm_end)
 				continue;
 			/*
 			 * Initially we examine only the vma which covers this
@@ -1729,7 +1825,7 @@ again:
 			if ((rmap_item->mm == vma->vm_mm) == search_new_forks)
 				continue;
 
-			ret = rmap_one(page, vma, rmap_item->address, arg);
+			ret = rmap_one(page, vma, address, arg);
 			if (ret != SWAP_AGAIN) {
 				anon_vma_unlock(anon_vma);
 				goto out;
@@ -1872,6 +1968,30 @@ static ssize_t pages_to_scan_store(struct kobject *kobj,
 }
 KSM_ATTR(pages_to_scan);
 
+static ssize_t merge_at_once_show(struct kobject *kobj,
+				  struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%u\n", ksm_merge_at_once);
+}
+
+static ssize_t merge_at_once_store(struct kobject *kobj,
+				   struct kobj_attribute *attr,
+				   const char *buf, size_t count)
+{
+	int err;
+	unsigned long merge_at_once;
+
+	err = strict_strtoul(buf, 10, &merge_at_once);
+	if (err || merge_at_once > UINT_MAX)
+		return -EINVAL;
+
+	ksm_merge_at_once = merge_at_once;
+
+	return count;
+}
+KSM_ATTR(merge_at_once);
+
+
 static ssize_t run_show(struct kobject *kobj, struct kobj_attribute *attr,
 			char *buf)
 {
@@ -1975,6 +2095,7 @@ static struct attribute *ksm_attrs[] = {
 	&pages_unshared_attr.attr,
 	&pages_volatile_attr.attr,
 	&full_scans_attr.attr,
+	&merge_at_once_attr.attr,
 	NULL,
 };
 

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* Re: [PATCH 1/2 V2] ksm: take dirty bit as reference to avoid volatile pages scanning
  2011-06-21 13:26   ` Nai Xia
@ 2011-06-21 21:42     ` Chris Wright
  -1 siblings, 0 replies; 96+ messages in thread
From: Chris Wright @ 2011-06-21 21:42 UTC (permalink / raw)
  To: Nai Xia
  Cc: Andrew Morton, Izik Eidus, Andrea Arcangeli, Hugh Dickins,
	Chris Wright, Rik van Riel, linux-mm, Johannes Weiner,
	linux-kernel

* Nai Xia (nai.xia@gmail.com) wrote:
> This patch makes the page_check_address() can validate if a subpage is
> in its place in a huge page pointed by the address. This can be useful when
> ksm does not split huge pages when looking up the subpages one by one.

Just a quick heads up... this patch does not compile by itself.  Could you
do a little patch cleanup?  Start with making sure the Subject: is
correct for each patch.  Then make sure the 3 are part of the same series.
And finally, make sure each one stands alone and compiles on its own.

thanks,
-chris

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 2/2 V2] ksm: take dirty bit as reference to avoid volatile pages scanning
  2011-06-21 13:36   ` Nai Xia
@ 2011-06-21 22:38     ` Chris Wright
  -1 siblings, 0 replies; 96+ messages in thread
From: Chris Wright @ 2011-06-21 22:38 UTC (permalink / raw)
  To: Nai Xia
  Cc: Andrew Morton, Izik Eidus, Andrea Arcangeli, Hugh Dickins,
	Chris Wright, Rik van Riel, linux-mm, Johannes Weiner,
	linux-kernel

* Nai Xia (nai.xia@gmail.com) wrote:
> Introduced ksm_page_changed() to reference the dirty bit of a pte. We clear 
> the dirty bit for each pte scanned but don't flush the tlb. For a huge page, 
> if one of the subpage has changed, we try to skip the whole huge page 
> assuming(this is true by now) that ksmd linearly scans the address space.

This doesn't build w/ kvm as a module.

> A NEW_FLAG is also introduced as a status of rmap_item to make ksmd scan
> more aggressively for new VMAs - only skip the pages considered to be volatile
> by the dirty bits. This can be enabled/disabled through KSM's sysfs interface.

This seems like it should be separated out.  And while it might be useful
to be able to enable/disable it for testing, I don't think it's worth
supporting for the long term.  It would also be useful to see what value
this flag adds on its own.

> @@ -454,7 +468,7 @@ static void remove_node_from_stable_tree(struct stable_node *stable_node)
>  		else
>  			ksm_pages_shared--;
>  		put_anon_vma(rmap_item->anon_vma);
> -		rmap_item->address &= PAGE_MASK;
> +		rmap_item->address &= ~STABLE_FLAG;
>  		cond_resched();
>  	}
>  
> @@ -542,7 +556,7 @@ static void remove_rmap_item_from_tree(struct rmap_item *rmap_item)
>  			ksm_pages_shared--;
>  
>  		put_anon_vma(rmap_item->anon_vma);
> -		rmap_item->address &= PAGE_MASK;
> +		rmap_item->address &= ~STABLE_FLAG;
>  
>  	} else if (rmap_item->address & UNSTABLE_FLAG) {
>  		unsigned char age;
> @@ -554,12 +568,14 @@ static void remove_rmap_item_from_tree(struct rmap_item *rmap_item)
>  		 * than left over from before.
>  		 */
>  		age = (unsigned char)(ksm_scan.seqnr - rmap_item->address);
> -		BUG_ON(age > 1);
> +		BUG_ON (age > 1);

No need to add a space after BUG_ON() there

> +
>  		if (!age)
>  			rb_erase(&rmap_item->node, &root_unstable_tree);
>  
>  		ksm_pages_unshared--;
> -		rmap_item->address &= PAGE_MASK;
> +		rmap_item->address &= ~UNSTABLE_FLAG;
> +		rmap_item->address &= ~SEQNR_MASK;

None of these changes are needed AFAICT.  &= PAGE_MASK clears all
relevant bits.  How could it be in a tree, have NEW_FLAG set, and
while removing from tree want to preserve NEW_FLAG?
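
(For reference, a quick userspace illustration of why the single mask is
enough -- not from the patch, and it assumes 4K pages and the flag values
defined at the top of ksm.c:)

#include <assert.h>
#include <stdio.h>

#define PAGE_MASK	(~0xfffUL)	/* assuming 4K pages */
#define SEQNR_MASK	0x0ff
#define UNSTABLE_FLAG	0x100
#define STABLE_FLAG	0x200
#define NEW_FLAG	0x400

int main(void)
{
	/* an address with seqnr bits, UNSTABLE_FLAG and NEW_FLAG set */
	unsigned long address = 0x7f0012345000UL | 0x3 | UNSTABLE_FLAG | NEW_FLAG;

	address &= PAGE_MASK;

	/* every low-bit flag is gone in one go */
	assert(!(address & (SEQNR_MASK | UNSTABLE_FLAG | STABLE_FLAG | NEW_FLAG)));
	printf("0x%lx\n", address);	/* prints 0x7f0012345000 */
	return 0;
}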

thanks,
-chris

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 1/2 V2] ksm: take dirty bit as reference to avoid volatile pages scanning
  2011-06-21 21:42     ` Chris Wright
@ 2011-06-22  0:02       ` Nai Xia
  -1 siblings, 0 replies; 96+ messages in thread
From: Nai Xia @ 2011-06-22  0:02 UTC (permalink / raw)
  To: Chris Wright
  Cc: Andrew Morton, Izik Eidus, Andrea Arcangeli, Hugh Dickins,
	Chris Wright, Rik van Riel, linux-mm, Johannes Weiner,
	linux-kernel

On Wednesday 22 June 2011 05:42:33 you wrote:
> * Nai Xia (nai.xia@gmail.com) wrote:
> > This patch makes the page_check_address() can validate if a subpage is
> > in its place in a huge page pointed by the address. This can be useful when
> > ksm does not split huge pages when looking up the subpages one by one.
> 
> Just a quick heads up...this patch does not compile by itself.  Could you
> do a little patch cleanup?  Start with just making sure the Subject: is
> correct for each patch.  Then make sure the 3 are part of same series.
> And finally, make sure each is stand alone and complilable on its own.

Oh, indeed, there is a kvm & mmu_notifier related patch that was not named as
part of the series.  It is in the same email thread, though, so I had thought
that was OK...  I'll reformat this patch set to fulfill these requirements.
Thanks for reviewing.
(Sorry for the repeated mail, I forgot to Cc the list.)


Nai

> 
> thanks,
> -chris
> 

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 2/2 V2] ksm: take dirty bit as reference to avoid volatile pages scanning
  2011-06-21 22:38     ` Chris Wright
@ 2011-06-22  0:04       ` Nai Xia
  -1 siblings, 0 replies; 96+ messages in thread
From: Nai Xia @ 2011-06-22  0:04 UTC (permalink / raw)
  To: Chris Wright
  Cc: Andrew Morton, Izik Eidus, Andrea Arcangeli, Hugh Dickins,
	Chris Wright, Rik van Riel, linux-mm, Johannes Weiner,
	linux-kernel

(Sorry for the repeated mail, I forgot to Cc the list.)

On Wednesday 22 June 2011 06:38:00 you wrote:
> * Nai Xia (nai.xia@gmail.com) wrote:
> > Introduced ksm_page_changed() to reference the dirty bit of a pte. We clear 
> > the dirty bit for each pte scanned but don't flush the tlb. For a huge page, 
> > if one of the subpage has changed, we try to skip the whole huge page 
> > assuming(this is true by now) that ksmd linearly scans the address space.
> 
> This doesn't build w/ kvm as a module.

I think it's because of the naming error of a related kvm patch, which I only
sent in the same email thread: http://marc.info/?l=linux-mm&m=130866318804277&w=2
The patch split is not clean... I'll redo it.

> 
> > A NEW_FLAG is also introduced as a status of rmap_item to make ksmd scan
> > more aggressively for new VMAs - only skip the pages considered to be volatile
> > by the dirty bits. This can be enabled/disabled through KSM's sysfs interface.
> 
> This seems like it should be separated out.  And while it might be useful
> to enable/disable for testing, I don't think it's worth supporting for
> the long term.  Would also be useful to see the value of this flag.

I think it may be useful for users who want to turn this scan policy on/off
explicitly according to their working sets?

> 
> > @@ -454,7 +468,7 @@ static void remove_node_from_stable_tree(struct stable_node *stable_node)
> >  		else
> >  			ksm_pages_shared--;
> >  		put_anon_vma(rmap_item->anon_vma);
> > -		rmap_item->address &= PAGE_MASK;
> > +		rmap_item->address &= ~STABLE_FLAG;
> >  		cond_resched();
> >  	}
> >  
> > @@ -542,7 +556,7 @@ static void remove_rmap_item_from_tree(struct rmap_item *rmap_item)
> >  			ksm_pages_shared--;
> >  
> >  		put_anon_vma(rmap_item->anon_vma);
> > -		rmap_item->address &= PAGE_MASK;
> > +		rmap_item->address &= ~STABLE_FLAG;
> >  
> >  	} else if (rmap_item->address & UNSTABLE_FLAG) {
> >  		unsigned char age;
> > @@ -554,12 +568,14 @@ static void remove_rmap_item_from_tree(struct rmap_item *rmap_item)
> >  		 * than left over from before.
> >  		 */
> >  		age = (unsigned char)(ksm_scan.seqnr - rmap_item->address);
> > -		BUG_ON(age > 1);
> > +		BUG_ON (age > 1);
> 
> No need to add space after BUG_ON() there
> 
> > +
> >  		if (!age)
> >  			rb_erase(&rmap_item->node, &root_unstable_tree);
> >  
> >  		ksm_pages_unshared--;
> > -		rmap_item->address &= PAGE_MASK;
> > +		rmap_item->address &= ~UNSTABLE_FLAG;
> > +		rmap_item->address &= ~SEQNR_MASK;
> 
> None of these changes are needed AFAICT.  &= PAGE_MASK clears all
> relevant bits.  How could it be in a tree, have NEW_FLAG set, and
> while removing from tree want to preserve NEW_FLAG?

You are right, it's meaningless to preserve NEW_FLAG after it goes 
through the trees. I'll revert the lines.

Thanks!

Nai

> 
> thanks,
> -chris
> 

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
  2011-06-21 13:32   ` Nai Xia
@ 2011-06-22  0:21     ` Chris Wright
  -1 siblings, 0 replies; 96+ messages in thread
From: Chris Wright @ 2011-06-22  0:21 UTC (permalink / raw)
  To: Nai Xia
  Cc: Andrew Morton, Izik Eidus, Andrea Arcangeli, Hugh Dickins,
	Chris Wright, Rik van Riel, linux-mm, Johannes Weiner,
	linux-kernel, kvm, mtosatti

* Nai Xia (nai.xia@gmail.com) wrote:
> Introduced kvm_mmu_notifier_test_and_clear_dirty(), kvm_mmu_notifier_dirty_update()
> and their mmu_notifier interfaces to support KSM dirty bit tracking, which brings
> significant performance gain in volatile pages scanning in KSM.
> Currently, kvm_mmu_notifier_dirty_update() returns 0 if and only if intel EPT is
> enabled to indicate that the dirty bits of underlying sptes are not updated by
> hardware.

Did you test with each of EPT, NPT and shadow?

> Signed-off-by: Nai Xia <nai.xia@gmail.com>
> Acked-by: Izik Eidus <izik.eidus@ravellosystems.com>
> ---
>  arch/x86/include/asm/kvm_host.h |    1 +
>  arch/x86/kvm/mmu.c              |   36 +++++++++++++++++++++++++++++
>  arch/x86/kvm/mmu.h              |    3 +-
>  arch/x86/kvm/vmx.c              |    1 +
>  include/linux/kvm_host.h        |    2 +-
>  include/linux/mmu_notifier.h    |   48 +++++++++++++++++++++++++++++++++++++++
>  mm/mmu_notifier.c               |   33 ++++++++++++++++++++++++++
>  virt/kvm/kvm_main.c             |   27 ++++++++++++++++++++++
>  8 files changed, 149 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index d2ac8e2..f0d7aa0 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -848,6 +848,7 @@ extern bool kvm_rebooting;
>  int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
>  int kvm_age_hva(struct kvm *kvm, unsigned long hva);
>  int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
> +int kvm_test_and_clear_dirty_hva(struct kvm *kvm, unsigned long hva);
>  void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
>  int cpuid_maxphyaddr(struct kvm_vcpu *vcpu);
>  int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index aee3862..a5a0c51 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -979,6 +979,37 @@ out:
>  	return young;
>  }
>  
> +/*
> + * Caller is supposed to SetPageDirty(), it's not done inside this.
> + */
> +static
> +int kvm_test_and_clear_dirty_rmapp(struct kvm *kvm, unsigned long *rmapp,
> +				   unsigned long data)
> +{
> +	u64 *spte;
> +	int dirty = 0;
> +
> +	if (!shadow_dirty_mask) {
> +		WARN(1, "KVM: do NOT try to test dirty bit in EPT\n");
> +		goto out;
> +	}

This should never fire with the dirty_update() notifier test, right?
And that means that this whole optimization is for the shadow mmu case,
arguably the legacy case.

> +
> +	spte = rmap_next(kvm, rmapp, NULL);
> +	while (spte) {
> +		int _dirty;
> +		u64 _spte = *spte;
> +		BUG_ON(!(_spte & PT_PRESENT_MASK));
> +		_dirty = _spte & PT_DIRTY_MASK;
> +		if (_dirty) {
> +			dirty = 1;
> +			clear_bit(PT_DIRTY_SHIFT, (unsigned long *)spte);

Is this sufficient (not losing dirty state ever)?

> +		}
> +		spte = rmap_next(kvm, rmapp, spte);
> +	}
> +out:
> +	return dirty;
> +}
> +
>  #define RMAP_RECYCLE_THRESHOLD 1000
>  
>  static void rmap_recycle(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)
> @@ -1004,6 +1035,11 @@ int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
>  	return kvm_handle_hva(kvm, hva, 0, kvm_test_age_rmapp);
>  
>  
> +int kvm_test_and_clear_dirty_hva(struct kvm *kvm, unsigned long hva)
> +{
> +	return kvm_handle_hva(kvm, hva, 0, kvm_test_and_clear_dirty_rmapp);
> +}
> +
>  #ifdef MMU_DEBUG
>  static int is_empty_shadow_page(u64 *spt)
>  {
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 7086ca8..b8d01c3 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -18,7 +18,8 @@
>  #define PT_PCD_MASK (1ULL << 4)
>  #define PT_ACCESSED_SHIFT 5
>  #define PT_ACCESSED_MASK (1ULL << PT_ACCESSED_SHIFT)
> -#define PT_DIRTY_MASK (1ULL << 6)
> +#define PT_DIRTY_SHIFT 6
> +#define PT_DIRTY_MASK (1ULL << PT_DIRTY_SHIFT)
>  #define PT_PAGE_SIZE_MASK (1ULL << 7)
>  #define PT_PAT_MASK (1ULL << 7)
>  #define PT_GLOBAL_MASK (1ULL << 8)
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index d48ec60..b407a69 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -4674,6 +4674,7 @@ static int __init vmx_init(void)
>  		kvm_mmu_set_mask_ptes(0ull, 0ull, 0ull, 0ull,
>  				VMX_EPT_EXECUTABLE_MASK);
>  		kvm_enable_tdp();
> +		kvm_dirty_update = 0;

Doesn't the above shadow_dirty_mask==0ull tell us this same info?
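
(i.e. on the KVM side something along these lines could replace the extra
kvm_dirty_update global -- only a sketch, and it assumes shadow_dirty_mask is
visible from wherever the notifier op lives:)

/* sketch only: report whether sptes carry usable dirty bits by looking
 * at shadow_dirty_mask directly (it is 0 when EPT is enabled) */
static int kvm_mmu_notifier_dirty_update(struct mmu_notifier *mn,
					 struct mm_struct *mm)
{
	return shadow_dirty_mask != 0;
}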

>  	} else
>  		kvm_disable_tdp();
>  
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 31ebb59..2036bae 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -53,7 +53,7 @@
>  struct kvm;
>  struct kvm_vcpu;
>  extern struct kmem_cache *kvm_vcpu_cache;
> -
> +extern int kvm_dirty_update;
>  /*
>   * It would be nice to use something smarter than a linear search, TBD...
>   * Thankfully we dont expect many devices to register (famous last words :),
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 1d1b1e1..bd6ba2d 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -24,6 +24,9 @@ struct mmu_notifier_mm {
>  };
>  
>  struct mmu_notifier_ops {

This needs a comment to describe it.  And why is it not next to
test_and_clear_dirty()?  I see how it's used, but it seems as if the
test_and_clear_dirty() code could return -1 (as in "dirty state unknown")
for the case where it can't track the dirty bit, and the caller could then
fall back to the checksum.
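
(Roughly, on the ksm side -- a hypothetical sketch of that convention, not
code from either patch; the local 'dirty' variable is made up for
illustration:)

/* hypothetical: a negative return means "dirty state unknown", so fall
 * back to the checksum instead of trusting the (absent) dirty bit */
dirty = ksm_ptep_test_and_clear_dirty_notify(mm, address, ptep);
if (dirty < 0) {
	checksum = calc_checksum(page);
	if (rmap_item->oldchecksum != checksum) {
		rmap_item->oldchecksum = checksum;
		ret = 1;
	} else {
		ret = 0;
	}
} else {
	ret = dirty;
}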

> +	int (*dirty_update)(struct mmu_notifier *mn,
> +			     struct mm_struct *mm);
> +
>  	/*
>  	 * Called either by mmu_notifier_unregister or when the mm is
>  	 * being destroyed by exit_mmap, always before all pages are
> @@ -72,6 +75,16 @@ struct mmu_notifier_ops {
>  			  unsigned long address);
>  
>  	/*
> +	 * clear_flush_dirty is called after the VM is
> +	 * test-and-clearing the dirty/modified bitflag in the
> +	 * pte. This way the VM will provide proper volatile page
> +	 * testing to ksm.
> +	 */
> +	int (*test_and_clear_dirty)(struct mmu_notifier *mn,
> +				    struct mm_struct *mm,
> +				    unsigned long address);
> +
> +	/*
>  	 * change_pte is called in cases that pte mapping to page is changed:
>  	 * for example, when ksm remaps pte to point to a new shared page.
>  	 */
> @@ -170,11 +183,14 @@ extern int __mmu_notifier_register(struct mmu_notifier *mn,
>  extern void mmu_notifier_unregister(struct mmu_notifier *mn,
>  				    struct mm_struct *mm);
>  extern void __mmu_notifier_mm_destroy(struct mm_struct *mm);
> +extern int __mmu_notifier_dirty_update(struct mm_struct *mm);
>  extern void __mmu_notifier_release(struct mm_struct *mm);
>  extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
>  					  unsigned long address);
>  extern int __mmu_notifier_test_young(struct mm_struct *mm,
>  				     unsigned long address);
> +extern int __mmu_notifier_test_and_clear_dirty(struct mm_struct *mm,
> +					       unsigned long address);
>  extern void __mmu_notifier_change_pte(struct mm_struct *mm,
>  				      unsigned long address, pte_t pte);
>  extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
> @@ -184,6 +200,19 @@ extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
>  extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
>  				  unsigned long start, unsigned long end);
>  
> +/*
> + * For ksm to make use of dirty bit, it wants to make sure that the dirty bits
> + * in sptes really carry the dirty information. Currently only intel EPT is
> + * not for ksm dirty bit tracking.
> + */
> +static inline int mmu_notifier_dirty_update(struct mm_struct *mm)
> +{
> +	if (mm_has_notifiers(mm))
> +		return __mmu_notifier_dirty_update(mm);
> +

No need for extra newline.

> +	return 1;
> +}
> +

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 96ebc06..22967c8 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -78,6 +78,8 @@ static atomic_t hardware_enable_failed;
>  struct kmem_cache *kvm_vcpu_cache;
>  EXPORT_SYMBOL_GPL(kvm_vcpu_cache);
>  
> +int kvm_dirty_update = 1;
> +
>  static __read_mostly struct preempt_ops kvm_preempt_ops;
>  
>  struct dentry *kvm_debugfs_dir;
> @@ -398,6 +400,23 @@ static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
>  	return young;
>  }
>  
> +/* Caller should SetPageDirty(), no need to flush tlb */
> +static int kvm_mmu_notifier_test_and_clear_dirty(struct mmu_notifier *mn,
> +						 struct mm_struct *mm,
> +						 unsigned long address)
> +{
> +	struct kvm *kvm = mmu_notifier_to_kvm(mn);
> +	int dirty, idx;

Perhaps something like:

	if (!shadow_dirty_mask)
		return -1;

And adjust caller logic accordingly?

> +     idx = srcu_read_lock(&kvm->srcu);
> +     spin_lock(&kvm->mmu_lock);
> +     dirty = kvm_test_and_clear_dirty_hva(kvm, address);
> +     spin_unlock(&kvm->mmu_lock);
> +     srcu_read_unlock(&kvm->srcu, idx);
> +
> +     return dirty;
> +}

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 2/2 V2] ksm: take dirty bit as reference to avoid volatile pages scanning
  2011-06-22  0:04       ` Nai Xia
@ 2011-06-22  0:35         ` Chris Wright
  -1 siblings, 0 replies; 96+ messages in thread
From: Chris Wright @ 2011-06-22  0:35 UTC (permalink / raw)
  To: Nai Xia
  Cc: Chris Wright, Andrew Morton, Izik Eidus, Andrea Arcangeli,
	Hugh Dickins, Rik van Riel, linux-mm, Johannes Weiner,
	linux-kernel

* Nai Xia (nai.xia@gmail.com) wrote:
> (Sorry for repeated mail, I forgot to Cc the list..)
> 
> On Wednesday 22 June 2011 06:38:00 you wrote:
> > * Nai Xia (nai.xia@gmail.com) wrote:
> > > Introduced ksm_page_changed() to reference the dirty bit of a pte. We clear 
> > > the dirty bit for each pte scanned but don't flush the tlb. For a huge page, 
> > > if one of the subpage has changed, we try to skip the whole huge page 
> > > assuming(this is true by now) that ksmd linearly scans the address space.
> > 
> > This doesn't build w/ kvm as a module.
> 
> I think it's because of the name-error of a related kvm patch, which I only sent
> in a same email thread. http://marc.info/?l=linux-mm&m=130866318804277&w=2
> The patch split is not clean...I'll redo it.
> 

It needs an export as it is.
ERROR: "kvm_dirty_update" [arch/x86/kvm/kvm-intel.ko] undefined!

Although perhaps it could be done w/out that dirty_update altogether (as I
mentioned in the other email)?

> > 
> > > A NEW_FLAG is also introduced as a status of rmap_item to make ksmd scan
> > > more aggressively for new VMAs - only skip the pages considered to be volatile
> > > by the dirty bits. This can be enabled/disabled through KSM's sysfs interface.
> > 
> > This seems like it should be separated out.  And while it might be useful
> > to enable/disable for testing, I don't think it's worth supporting for
> > the long term.  Would also be useful to see the value of this flag.
> 
> I think it maybe useful for uses who want to turn on/off this scan policy explicitly
> according to their working sets? 

Can you split it out, and show the benefit of it directly?  I think it
only benefits:

p = mmap()
memset(p, $value, entire buffer);
...
very slowly (w.r.t scan times) touch bits of buffer and trigger cow to
break sharing.

Would you agree?
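
Concretely, I'm thinking of a workload shaped roughly like this (untested
sketch, only to make the pattern explicit):

  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/mman.h>

  #ifndef MADV_MERGEABLE
  #define MADV_MERGEABLE 12
  #endif

  #define SIZE (512*1024*1024)

  int main(void)
  {
        char *p = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
                       MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
                return 1;

        madvise(p, SIZE, MADV_MERGEABLE);
        memset(p, 0x5a, SIZE);          /* whole buffer becomes mergeable */

        while (1) {
                /* rare, sparse COW breaks, much slower than the scan rate */
                p[rand() % SIZE] = 1;
                sleep(60);
        }

        return 0;
  }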

thanks,
-chris

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 2/2 V2] ksm: take dirty bit as reference to avoid volatile pages scanning
@ 2011-06-22  0:35         ` Chris Wright
  0 siblings, 0 replies; 96+ messages in thread
From: Chris Wright @ 2011-06-22  0:35 UTC (permalink / raw)
  To: Nai Xia
  Cc: Chris Wright, Andrew Morton, Izik Eidus, Andrea Arcangeli,
	Hugh Dickins, Rik van Riel, linux-mm, Johannes Weiner,
	linux-kernel

* Nai Xia (nai.xia@gmail.com) wrote:
> (Sorry for repeated mail, I forgot to Cc the list..)
> 
> On Wednesday 22 June 2011 06:38:00 you wrote:
> > * Nai Xia (nai.xia@gmail.com) wrote:
> > > Introduced ksm_page_changed() to reference the dirty bit of a pte. We clear 
> > > the dirty bit for each pte scanned but don't flush the tlb. For a huge page, 
> > > if one of the subpages has changed, we try to skip the whole huge page, 
> > > assuming (this is true by now) that ksmd linearly scans the address space.
> > 
> > This doesn't build w/ kvm as a module.
> 
> I think it's because of a naming error in a related kvm patch, which I only sent
> in the same email thread: http://marc.info/?l=linux-mm&m=130866318804277&w=2
> The patch split is not clean... I'll redo it.
> 

It needs an export as it is.
ERROR: "kvm_dirty_update" [arch/x86/kvm/kvm-intel.ko] undefined!

Although perhaps could be done w/out that dirty_update altogether (as I
mentioned in other email)?

> > 
> > > A NEW_FLAG is also introduced as a status of rmap_item to make ksmd scan
> > > more aggressively for new VMAs - only skip the pages considered to be volatile
> > > by the dirty bits. This can be enabled/disabled through KSM's sysfs interface.
> > 
> > This seems like it should be separated out.  And while it might be useful
> > to enable/disable for testing, I don't think it's worth supporting for
> > the long term.  Would also be useful to see the value of this flag.
> 
> I think it may be useful for users who want to turn on/off this scan policy explicitly
> according to their working sets?

Can you split it out, and show the benefit of it directly?  I think it
only benefits:

p = mmap()
memset(p, $value, entire buffer);
...
very slowly (w.r.t scan times) touch bits of buffer and trigger cow to
break sharing.

Would you agree?

thanks,
-chris


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 1/2 V2] ksm: take dirty bit as reference to avoid volatile pages scanning
  2011-06-22  0:02       ` Nai Xia
@ 2011-06-22  0:42         ` Chris Wright
  -1 siblings, 0 replies; 96+ messages in thread
From: Chris Wright @ 2011-06-22  0:42 UTC (permalink / raw)
  To: Nai Xia
  Cc: Chris Wright, Andrew Morton, Izik Eidus, Andrea Arcangeli,
	Hugh Dickins, Rik van Riel, linux-mm, Johannes Weiner,
	linux-kernel

* Nai Xia (nai.xia@gmail.com) wrote:
> On Wednesday 22 June 2011 05:42:33 you wrote:
> > * Nai Xia (nai.xia@gmail.com) wrote:
> > > This patch makes page_check_address() able to validate whether a subpage is
> > > in its place within a huge page pointed to by the address. This can be useful when
> > > ksm does not split huge pages while looking up the subpages one by one.
> > 
> > Just a quick heads up...this patch does not compile by itself.  Could you
> > do a little patch cleanup?  Start with just making sure the Subject: is
> > correct for each patch.  Then make sure the 3 are part of same series.
> > And finally, make sure each is stand alone and compilable on its own.
> 
> Oh, indeed, there is a kvm & mmu_notifier related patch that is not named as part of the series,
> but it has the same email thread ID, I think.

Right, in same thread, but it ends up with:

[PATCH 1/2] ksm: take dirty bit as reference to avoid volatile pages...
[PATCH] mmu_notifier, kvm: Introduce dirty bit...
[PATCH 2/2] ksm: take dirty bit as reference to avoid volatile pages...

Not a big deal, but also easy to fix up ;)

> .... I had thought it was ok...
> I'll reformat this patch set to fulfill these requirements. 

thanks,
-chris

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 1/2 V2] ksm: take dirty bit as reference to avoid volatile pages scanning
@ 2011-06-22  0:42         ` Chris Wright
  0 siblings, 0 replies; 96+ messages in thread
From: Chris Wright @ 2011-06-22  0:42 UTC (permalink / raw)
  To: Nai Xia
  Cc: Chris Wright, Andrew Morton, Izik Eidus, Andrea Arcangeli,
	Hugh Dickins, Rik van Riel, linux-mm, Johannes Weiner,
	linux-kernel

* Nai Xia (nai.xia@gmail.com) wrote:
> On Wednesday 22 June 2011 05:42:33 you wrote:
> > * Nai Xia (nai.xia@gmail.com) wrote:
> > > This patch makes page_check_address() able to validate whether a subpage is
> > > in its place within a huge page pointed to by the address. This can be useful when
> > > ksm does not split huge pages while looking up the subpages one by one.
> > 
> > Just a quick heads up...this patch does not compile by itself.  Could you
> > do a little patch cleanup?  Start with just making sure the Subject: is
> > correct for each patch.  Then make sure the 3 are part of same series.
> > And finally, make sure each is stand alone and compilable on its own.
> 
> Oh, indeed, there is a kvm & mmu_notifier related patch that is not named as part of the series,
> but it has the same email thread ID, I think.

Right, in same thread, but it ends up with:

[PATCH 1/2] ksm: take dirty bit as reference to avoid volatile pages...
[PATCH] mmu_notifier, kvm: Introduce dirty bit...
[PATCH 2/2] ksm: take dirty bit as reference to avoid volatile pages...

Not a big deal, but also easy to fix up ;)

> .... I had thought it was ok...
> I'll reformat this patch set to fulfill these requirements. 

thanks,
-chris


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 0/2 V2] ksm: take dirty bit as reference to avoid volatile pages scanning
  2011-06-21 12:55 ` Nai Xia
@ 2011-06-22  0:46   ` Chris Wright
  -1 siblings, 0 replies; 96+ messages in thread
From: Chris Wright @ 2011-06-22  0:46 UTC (permalink / raw)
  To: Nai Xia
  Cc: Andrew Morton, Izik Eidus, Andrea Arcangeli, Hugh Dickins,
	Chris Wright, Rik van Riel, linux-mm, Johannes Weiner,
	linux-kernel

* Nai Xia (nai.xia@gmail.com) wrote:
> Compared to the first version, this patch set addresses the problem of
> dirty bit updating of virtual machines, by adding two mmu_notifier interfaces.
> So it can now track the volatile working set inside KVM guest OS.
> 
> V1 log:
> Currently, ksm uses page checksum to detect volatile pages. Izik Eidus 
> suggested that we could use pte dirty bit to optimize. This patch series
> adds this new logic.
> 
> Preliminary benchmarks show that the scan speed is improved by up to 16 
> times on volatile transparent huge pages and up to 8 times on volatile 
> regular pages.

Did you run this only in the host (which would not trigger the notifiers
to kvm), or also run your test program in a guest?

thanks,
-chris

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 0/2 V2] ksm: take dirty bit as reference to avoid volatile pages scanning
@ 2011-06-22  0:46   ` Chris Wright
  0 siblings, 0 replies; 96+ messages in thread
From: Chris Wright @ 2011-06-22  0:46 UTC (permalink / raw)
  To: Nai Xia
  Cc: Andrew Morton, Izik Eidus, Andrea Arcangeli, Hugh Dickins,
	Chris Wright, Rik van Riel, linux-mm, Johannes Weiner,
	linux-kernel

* Nai Xia (nai.xia@gmail.com) wrote:
> Compared to the first version, this patch set addresses the problem of
> dirty bit updating of virtual machines, by adding two mmu_notifier interfaces.
> So it can now track the volatile working set inside KVM guest OS.
> 
> V1 log:
> Currently, ksm uses page checksum to detect volatile pages. Izik Eidus 
> suggested that we could use pte dirty bit to optimize. This patch series
> adds this new logic.
> 
> Preliminary benchmarks show that the scan speed is improved by up to 16 
> times on volatile transparent huge pages and up to 8 times on volatile 
> regular pages.

Did you run this only in the host (which would not trigger the notifiers
to kvm), or also run your test program in a guest?

thanks,
-chris


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 0/2 V2] ksm: take dirty bit as reference to avoid volatile pages scanning
  2011-06-22  0:46   ` Chris Wright
  (?)
@ 2011-06-22  4:15   ` Nai Xia
  -1 siblings, 0 replies; 96+ messages in thread
From: Nai Xia @ 2011-06-22  4:15 UTC (permalink / raw)
  To: Undisclosed.Recipients:
  Cc: Andrew Morton, Izik Eidus, Andrea Arcangeli, Hugh Dickins,
	Chris Wright, Rik van Riel, linux-mm, Johannes Weiner,
	linux-kernel

On Wednesday 22 June 2011 08:46:08 you wrote:
> * Nai Xia (nai.xia@gmail.com) wrote:
> > Compared to the first version, this patch set addresses the problem of
> > dirty bit updating of virtual machines, by adding two mmu_notifier interfaces.
> > So it can now track the volatile working set inside KVM guest OS.
> > 
> > V1 log:
> > Currently, ksm uses page checksum to detect volatile pages. Izik Eidus 
> > suggested that we could use pte dirty bit to optimize. This patch series
> > adds this new logic.
> > 
> > Preliminary benchmarks show that the scan speed is improved by up to 16 
> > times on volatile transparent huge pages and up to 8 times on volatile 
> > regular pages.
> 
> Did you run this only in the host (which would not trigger the notifiers
> to kvm), or also run your test program in a guest?

Yeah, I did run the test program in a guest, but the top speed I quoted was measured on the
host. That is, I confirmed that ksmd can skip the pages of this test running in the guest OS,
but I did not measure the speedup for the guest case.

Thanks, 

Nai
> 
> thanks,
> -chris
> 


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
  2011-06-22  0:21     ` Chris Wright
@ 2011-06-22  4:43       ` Nai Xia
  -1 siblings, 0 replies; 96+ messages in thread
From: Nai Xia @ 2011-06-22  4:43 UTC (permalink / raw)
  To: Chris Wright
  Cc: Andrew Morton, Izik Eidus, Andrea Arcangeli, Hugh Dickins,
	Rik van Riel, linux-mm, Johannes Weiner, linux-kernel, kvm,
	mtosatti

On Wednesday 22 June 2011 08:21:23 Chris Wright wrote:
> * Nai Xia (nai.xia@gmail.com) wrote:
> > Introduced kvm_mmu_notifier_test_and_clear_dirty(), kvm_mmu_notifier_dirty_update()
> > and their mmu_notifier interfaces to support KSM dirty bit tracking, which brings
> > significant performance gain in volatile pages scanning in KSM.
> > Currently, kvm_mmu_notifier_dirty_update() returns 0 if and only if intel EPT is
> > enabled to indicate that the dirty bits of underlying sptes are not updated by
> > hardware.
> 
> Did you test with each of EPT, NPT and shadow?

I tested in EPT and pure softmmu. I have no NPT box and Izik told me that he 
did not have one either, so help me ... :D

> 
> > Signed-off-by: Nai Xia <nai.xia@gmail.com>
> > Acked-by: Izik Eidus <izik.eidus@ravellosystems.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |    1 +
> >  arch/x86/kvm/mmu.c              |   36 +++++++++++++++++++++++++++++
> >  arch/x86/kvm/mmu.h              |    3 +-
> >  arch/x86/kvm/vmx.c              |    1 +
> >  include/linux/kvm_host.h        |    2 +-
> >  include/linux/mmu_notifier.h    |   48 +++++++++++++++++++++++++++++++++++++++
> >  mm/mmu_notifier.c               |   33 ++++++++++++++++++++++++++
> >  virt/kvm/kvm_main.c             |   27 ++++++++++++++++++++++
> >  8 files changed, 149 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index d2ac8e2..f0d7aa0 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -848,6 +848,7 @@ extern bool kvm_rebooting;
> >  int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
> >  int kvm_age_hva(struct kvm *kvm, unsigned long hva);
> >  int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
> > +int kvm_test_and_clear_dirty_hva(struct kvm *kvm, unsigned long hva);
> >  void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
> >  int cpuid_maxphyaddr(struct kvm_vcpu *vcpu);
> >  int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu);
> > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > index aee3862..a5a0c51 100644
> > --- a/arch/x86/kvm/mmu.c
> > +++ b/arch/x86/kvm/mmu.c
> > @@ -979,6 +979,37 @@ out:
> >  	return young;
> >  }
> >  
> > +/*
> > + * Caller is supposed to SetPageDirty(), it's not done inside this.
> > + */
> > +static
> > +int kvm_test_and_clear_dirty_rmapp(struct kvm *kvm, unsigned long *rmapp,
> > +				   unsigned long data)
> > +{
> > +	u64 *spte;
> > +	int dirty = 0;
> > +
> > +	if (!shadow_dirty_mask) {
> > +		WARN(1, "KVM: do NOT try to test dirty bit in EPT\n");
> > +		goto out;
> > +	}
> 
> This should never fire with the dirty_update() notifier test, right?
> And that means that this whole optimization is for the shadow mmu case,
> arguably the legacy case.

Yes, right. Actually I wrote this to guard against potential abuse of this interface,
since the name alone does not make that obvious. It could be turned into a comment
to save some .text and to compete with the "10k/3lines optimization"
in the list :P

> 
> > +
> > +	spte = rmap_next(kvm, rmapp, NULL);
> > +	while (spte) {
> > +		int _dirty;
> > +		u64 _spte = *spte;
> > +		BUG_ON(!(_spte & PT_PRESENT_MASK));
> > +		_dirty = _spte & PT_DIRTY_MASK;
> > +		if (_dirty) {
> > +			dirty = 1;
> > +			clear_bit(PT_DIRTY_SHIFT, (unsigned long *)spte);
> 
> Is this sufficient (not losing dirty state ever)?

This does lose some dirty state. Not flushing the TLB may prevent the CPU from writing
the dirty bit back to the spte (I checked Intel's manual; x86 does not update it 
in this case). But we (Izik & me) think that is ok, because currently the 
only user of the dirty bit information is KSM, and it is not critical to lose some 
of it. If we do find a problem with it in the future, we can add the
flushing. What do you think?

> 
> > +		}
> > +		spte = rmap_next(kvm, rmapp, spte);
> > +	}
> > +out:
> > +	return dirty;
> > +}
> > +
> >  #define RMAP_RECYCLE_THRESHOLD 1000
> >  
> >  static void rmap_recycle(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)
> > @@ -1004,6 +1035,11 @@ int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
> >  	return kvm_handle_hva(kvm, hva, 0, kvm_test_age_rmapp);
> >  
> >  
> > +int kvm_test_and_clear_dirty_hva(struct kvm *kvm, unsigned long hva)
> > +{
> > +	return kvm_handle_hva(kvm, hva, 0, kvm_test_and_clear_dirty_rmapp);
> > +}
> > +
> >  #ifdef MMU_DEBUG
> >  static int is_empty_shadow_page(u64 *spt)
> >  {
> > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > index 7086ca8..b8d01c3 100644
> > --- a/arch/x86/kvm/mmu.h
> > +++ b/arch/x86/kvm/mmu.h
> > @@ -18,7 +18,8 @@
> >  #define PT_PCD_MASK (1ULL << 4)
> >  #define PT_ACCESSED_SHIFT 5
> >  #define PT_ACCESSED_MASK (1ULL << PT_ACCESSED_SHIFT)
> > -#define PT_DIRTY_MASK (1ULL << 6)
> > +#define PT_DIRTY_SHIFT 6
> > +#define PT_DIRTY_MASK (1ULL << PT_DIRTY_SHIFT)
> >  #define PT_PAGE_SIZE_MASK (1ULL << 7)
> >  #define PT_PAT_MASK (1ULL << 7)
> >  #define PT_GLOBAL_MASK (1ULL << 8)
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index d48ec60..b407a69 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -4674,6 +4674,7 @@ static int __init vmx_init(void)
> >  		kvm_mmu_set_mask_ptes(0ull, 0ull, 0ull, 0ull,
> >  				VMX_EPT_EXECUTABLE_MASK);
> >  		kvm_enable_tdp();
> > +		kvm_dirty_update = 0;
> 
> Doesn't the above shadow_dirty_mask==0ull tell us this same info?

Yes, it's nasty. I am actually not sure about making shadow_dirty_mask global,
since all the other similar state variables are static. 

> 
> >  	} else
> >  		kvm_disable_tdp();
> >  
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 31ebb59..2036bae 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -53,7 +53,7 @@
> >  struct kvm;
> >  struct kvm_vcpu;
> >  extern struct kmem_cache *kvm_vcpu_cache;
> > -
> > +extern int kvm_dirty_update;
> >  /*
> >   * It would be nice to use something smarter than a linear search, TBD...
> >   * Thankfully we dont expect many devices to register (famous last words :),
> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > index 1d1b1e1..bd6ba2d 100644
> > --- a/include/linux/mmu_notifier.h
> > +++ b/include/linux/mmu_notifier.h
> > @@ -24,6 +24,9 @@ struct mmu_notifier_mm {
> >  };
> >  
> >  struct mmu_notifier_ops {
> 
> Need to add a comment to describe it.  And why is it not next to
> test_and_clear_dirty()?  I see how it's used, but seems as if the
> test_and_clear_dirty() code could return -1 (as in dirty state unknown)
> for the case where it can't track dirty bit and fall back to checksum.

Actually I did consider this option. But I thought it would be weird for a function
whose name says it tests a bit to return -1 as a result. Would it be the 
first one in human history to do so? :D

Thanks,
Nai

> 
> > +	int (*dirty_update)(struct mmu_notifier *mn,
> > +			     struct mm_struct *mm);
> > +
> >  	/*

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
@ 2011-06-22  4:43       ` Nai Xia
  0 siblings, 0 replies; 96+ messages in thread
From: Nai Xia @ 2011-06-22  4:43 UTC (permalink / raw)
  To: Chris Wright
  Cc: Andrew Morton, Izik Eidus, Andrea Arcangeli, Hugh Dickins,
	Rik van Riel, linux-mm, Johannes Weiner, linux-kernel, kvm,
	mtosatti

On Wednesday 22 June 2011 08:21:23 Chris Wright wrote:
> * Nai Xia (nai.xia@gmail.com) wrote:
> > Introduced kvm_mmu_notifier_test_and_clear_dirty(), kvm_mmu_notifier_dirty_update()
> > and their mmu_notifier interfaces to support KSM dirty bit tracking, which brings
> > significant performance gain in volatile pages scanning in KSM.
> > Currently, kvm_mmu_notifier_dirty_update() returns 0 if and only if intel EPT is
> > enabled to indicate that the dirty bits of underlying sptes are not updated by
> > hardware.
> 
> Did you test with each of EPT, NPT and shadow?

I tested in EPT and pure softmmu. I have no NPT box and Izik told me that he 
did not have one either, so help me ... :D

> 
> > Signed-off-by: Nai Xia <nai.xia@gmail.com>
> > Acked-by: Izik Eidus <izik.eidus@ravellosystems.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |    1 +
> >  arch/x86/kvm/mmu.c              |   36 +++++++++++++++++++++++++++++
> >  arch/x86/kvm/mmu.h              |    3 +-
> >  arch/x86/kvm/vmx.c              |    1 +
> >  include/linux/kvm_host.h        |    2 +-
> >  include/linux/mmu_notifier.h    |   48 +++++++++++++++++++++++++++++++++++++++
> >  mm/mmu_notifier.c               |   33 ++++++++++++++++++++++++++
> >  virt/kvm/kvm_main.c             |   27 ++++++++++++++++++++++
> >  8 files changed, 149 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index d2ac8e2..f0d7aa0 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -848,6 +848,7 @@ extern bool kvm_rebooting;
> >  int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
> >  int kvm_age_hva(struct kvm *kvm, unsigned long hva);
> >  int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
> > +int kvm_test_and_clear_dirty_hva(struct kvm *kvm, unsigned long hva);
> >  void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
> >  int cpuid_maxphyaddr(struct kvm_vcpu *vcpu);
> >  int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu);
> > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > index aee3862..a5a0c51 100644
> > --- a/arch/x86/kvm/mmu.c
> > +++ b/arch/x86/kvm/mmu.c
> > @@ -979,6 +979,37 @@ out:
> >  	return young;
> >  }
> >  
> > +/*
> > + * Caller is supposed to SetPageDirty(), it's not done inside this.
> > + */
> > +static
> > +int kvm_test_and_clear_dirty_rmapp(struct kvm *kvm, unsigned long *rmapp,
> > +				   unsigned long data)
> > +{
> > +	u64 *spte;
> > +	int dirty = 0;
> > +
> > +	if (!shadow_dirty_mask) {
> > +		WARN(1, "KVM: do NOT try to test dirty bit in EPT\n");
> > +		goto out;
> > +	}
> 
> This should never fire with the dirty_update() notifier test, right?
> And that means that this whole optimization is for the shadow mmu case,
> arguably the legacy case.

Yes, right. Actually I wrote this to guard against potential abuse of this interface,
since the name alone does not make that obvious. It could be turned into a comment
to save some .text and to compete with the "10k/3lines optimization"
in the list :P

> 
> > +
> > +	spte = rmap_next(kvm, rmapp, NULL);
> > +	while (spte) {
> > +		int _dirty;
> > +		u64 _spte = *spte;
> > +		BUG_ON(!(_spte & PT_PRESENT_MASK));
> > +		_dirty = _spte & PT_DIRTY_MASK;
> > +		if (_dirty) {
> > +			dirty = 1;
> > +			clear_bit(PT_DIRTY_SHIFT, (unsigned long *)spte);
> 
> Is this sufficient (not losing dirty state ever)?

This does lose some dirty state. Not flushing the TLB may prevent the CPU from writing
the dirty bit back to the spte (I checked Intel's manual; x86 does not update it 
in this case). But we (Izik & me) think that is ok, because currently the 
only user of the dirty bit information is KSM, and it is not critical to lose some 
of it. If we do find a problem with it in the future, we can add the
flushing. What do you think?

> 
> > +		}
> > +		spte = rmap_next(kvm, rmapp, spte);
> > +	}
> > +out:
> > +	return dirty;
> > +}
> > +
> >  #define RMAP_RECYCLE_THRESHOLD 1000
> >  
> >  static void rmap_recycle(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)
> > @@ -1004,6 +1035,11 @@ int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
> >  	return kvm_handle_hva(kvm, hva, 0, kvm_test_age_rmapp);
> >  
> >  
> > +int kvm_test_and_clear_dirty_hva(struct kvm *kvm, unsigned long hva)
> > +{
> > +	return kvm_handle_hva(kvm, hva, 0, kvm_test_and_clear_dirty_rmapp);
> > +}
> > +
> >  #ifdef MMU_DEBUG
> >  static int is_empty_shadow_page(u64 *spt)
> >  {
> > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > index 7086ca8..b8d01c3 100644
> > --- a/arch/x86/kvm/mmu.h
> > +++ b/arch/x86/kvm/mmu.h
> > @@ -18,7 +18,8 @@
> >  #define PT_PCD_MASK (1ULL << 4)
> >  #define PT_ACCESSED_SHIFT 5
> >  #define PT_ACCESSED_MASK (1ULL << PT_ACCESSED_SHIFT)
> > -#define PT_DIRTY_MASK (1ULL << 6)
> > +#define PT_DIRTY_SHIFT 6
> > +#define PT_DIRTY_MASK (1ULL << PT_DIRTY_SHIFT)
> >  #define PT_PAGE_SIZE_MASK (1ULL << 7)
> >  #define PT_PAT_MASK (1ULL << 7)
> >  #define PT_GLOBAL_MASK (1ULL << 8)
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index d48ec60..b407a69 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -4674,6 +4674,7 @@ static int __init vmx_init(void)
> >  		kvm_mmu_set_mask_ptes(0ull, 0ull, 0ull, 0ull,
> >  				VMX_EPT_EXECUTABLE_MASK);
> >  		kvm_enable_tdp();
> > +		kvm_dirty_update = 0;
> 
> Doesn't the above shadow_dirty_mask==0ull tell us this same info?

Yes, it's nasty. I am actually not sure about making shadow_dirty_mask global,
since all the other similar state variables are static. 

> 
> >  	} else
> >  		kvm_disable_tdp();
> >  
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 31ebb59..2036bae 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -53,7 +53,7 @@
> >  struct kvm;
> >  struct kvm_vcpu;
> >  extern struct kmem_cache *kvm_vcpu_cache;
> > -
> > +extern int kvm_dirty_update;
> >  /*
> >   * It would be nice to use something smarter than a linear search, TBD...
> >   * Thankfully we dont expect many devices to register (famous last words :),
> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > index 1d1b1e1..bd6ba2d 100644
> > --- a/include/linux/mmu_notifier.h
> > +++ b/include/linux/mmu_notifier.h
> > @@ -24,6 +24,9 @@ struct mmu_notifier_mm {
> >  };
> >  
> >  struct mmu_notifier_ops {
> 
> Need to add a comment to describe it.  And why is it not next to
> test_and_clear_dirty()?  I see how it's used, but seems as if the
> test_and_clear_dirty() code could return -1 (as in dirty state unknown)
> for the case where it can't track dirty bit and fall back to checksum.

Actually I did consider this option. But I thought it would be weird for a function
whose name says it tests a bit to return -1 as a result. Would it be the 
first one in human history to do so? :D

Thanks,
Nai

> 
> > +	int (*dirty_update)(struct mmu_notifier *mn,
> > +			     struct mm_struct *mm);
> > +
> >  	/*


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 2/2 V2] ksm: take dirty bit as reference to avoid volatile pages scanning
  2011-06-22  0:35         ` Chris Wright
@ 2011-06-22  4:47           ` Nai Xia
  -1 siblings, 0 replies; 96+ messages in thread
From: Nai Xia @ 2011-06-22  4:47 UTC (permalink / raw)
  To: Chris Wright
  Cc: Andrew Morton, Izik Eidus, Andrea Arcangeli, Hugh Dickins,
	Rik van Riel, linux-mm, Johannes Weiner, linux-kernel

On Wednesday 22 June 2011 08:35:36 Chris Wright wrote:
> * Nai Xia (nai.xia@gmail.com) wrote:
> > (Sorry for repeated mail, I forgot to Cc the list..)
> > 
> > On Wednesday 22 June 2011 06:38:00 you wrote:
> > > * Nai Xia (nai.xia@gmail.com) wrote:
> > > > Introduced ksm_page_changed() to reference the dirty bit of a pte. We clear 
> > > > the dirty bit for each pte scanned but don't flush the tlb. For a huge page, 
> > > > if one of the subpages has changed, we try to skip the whole huge page, 
> > > > assuming (this is true by now) that ksmd linearly scans the address space.
> > > 
> > > This doesn't build w/ kvm as a module.
> > 
> > I think it's because of a naming error in a related kvm patch, which I only sent
> > in the same email thread: http://marc.info/?l=linux-mm&m=130866318804277&w=2
> > The patch split is not clean... I'll redo it.
> > 
> 
> It needs an export as it is.
> ERROR: "kvm_dirty_update" [arch/x86/kvm/kvm-intel.ko] undefined!

Oops, yes, I forgot to do that! I'll correct it in the next submission.

Thanks,
Nai

> 
> Although perhaps could be done w/out that dirty_update altogether (as I
> mentioned in other email)?
> 
> > > 
> > > > A NEW_FLAG is also introduced as a status of rmap_item to make ksmd scan
> > > > more aggressively for new VMAs - only skip the pages considered to be volatile
> > > > by the dirty bits. This can be enabled/disabled through KSM's sysfs interface.
> > > 
> > > This seems like it should be separated out.  And while it might be useful
> > > to enable/disable for testing, I don't think it's worth supporting for
> > > the long term.  Would also be useful to see the value of this flag.
> > 
> > I think it may be useful for users who want to turn on/off this scan policy explicitly
> > according to their working sets? 
> 
> Can you split it out, and show the benefit of it directly?  I think it
> only benefits:
> 
> p = mmap()
> memset(p, $value, entire buffer);
> ...
> very slowly (w.r.t scan times) touch bits of buffer and trigger cow to
> break sharing.
> 
> Would you agree?
> 
> thanks,
> -chris
> 

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 2/2 V2] ksm: take dirty bit as reference to avoid volatile pages scanning
@ 2011-06-22  4:47           ` Nai Xia
  0 siblings, 0 replies; 96+ messages in thread
From: Nai Xia @ 2011-06-22  4:47 UTC (permalink / raw)
  To: Chris Wright
  Cc: Andrew Morton, Izik Eidus, Andrea Arcangeli, Hugh Dickins,
	Rik van Riel, linux-mm, Johannes Weiner, linux-kernel

On Wednesday 22 June 2011 08:35:36 Chris Wright wrote:
> * Nai Xia (nai.xia@gmail.com) wrote:
> > (Sorry for repeated mail, I forgot to Cc the list..)
> > 
> > On Wednesday 22 June 2011 06:38:00 you wrote:
> > > * Nai Xia (nai.xia@gmail.com) wrote:
> > > > Introduced ksm_page_changed() to reference the dirty bit of a pte. We clear 
> > > > the dirty bit for each pte scanned but don't flush the tlb. For a huge page, 
> > > > if one of the subpages has changed, we try to skip the whole huge page, 
> > > > assuming (this is true by now) that ksmd linearly scans the address space.
> > > 
> > > This doesn't build w/ kvm as a module.
> > 
> > I think it's because of a naming error in a related kvm patch, which I only sent
> > in the same email thread: http://marc.info/?l=linux-mm&m=130866318804277&w=2
> > The patch split is not clean... I'll redo it.
> > 
> 
> It needs an export as it is.
> ERROR: "kvm_dirty_update" [arch/x86/kvm/kvm-intel.ko] undefined!

Oops, yes, I forgot to do that! I'll correct it in the next submission.

Thanks,
Nai

> 
> Although perhaps could be done w/out that dirty_update altogether (as I
> mentioned in other email)?
> 
> > > 
> > > > A NEW_FLAG is also introduced as a status of rmap_item to make ksmd scan
> > > > more aggressively for new VMAs - only skip the pages considered to be volatile
> > > > by the dirty bits. This can be enabled/disabled through KSM's sysfs interface.
> > > 
> > > This seems like it should be separated out.  And while it might be useful
> > > to enable/disable for testing, I don't think it's worth supporting for
> > > the long term.  Would also be useful to see the value of this flag.
> > 
> > I think it may be useful for users who want to turn on/off this scan policy explicitly
> > according to their working sets? 
> 
> Can you split it out, and show the benefit of it directly?  I think it
> only benefits:
> 
> p = mmap()
> memset(p, $value, entire buffer);
> ...
> very slowly (w.r.t scan times) touch bits of buffer and trigger cow to
> break sharing.
> 
> Would you agree?
> 
> thanks,
> -chris
> 


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
  2011-06-22  0:21     ` Chris Wright
@ 2011-06-22  6:15       ` Izik Eidus
  -1 siblings, 0 replies; 96+ messages in thread
From: Izik Eidus @ 2011-06-22  6:15 UTC (permalink / raw)
  To: Chris Wright
  Cc: Nai Xia, Andrew Morton, Andrea Arcangeli, Hugh Dickins,
	Rik van Riel, linux-mm, Johannes Weiner, linux-kernel, kvm,
	mtosatti

On 6/22/2011 3:21 AM, Chris Wright wrote:
> * Nai Xia (nai.xia@gmail.com) wrote:
>> Introduced kvm_mmu_notifier_test_and_clear_dirty(), kvm_mmu_notifier_dirty_update()
>> and their mmu_notifier interfaces to support KSM dirty bit tracking, which brings
>> significant performance gain in volatile pages scanning in KSM.
>> Currently, kvm_mmu_notifier_dirty_update() returns 0 if and only if intel EPT is
>> enabled to indicate that the dirty bits of underlying sptes are not updated by
>> hardware.
> Did you test with each of EPT, NPT and shadow?
>
>> Signed-off-by: Nai Xia<nai.xia@gmail.com>
>> Acked-by: Izik Eidus<izik.eidus@ravellosystems.com>
>> ---
>>   arch/x86/include/asm/kvm_host.h |    1 +
>>   arch/x86/kvm/mmu.c              |   36 +++++++++++++++++++++++++++++
>>   arch/x86/kvm/mmu.h              |    3 +-
>>   arch/x86/kvm/vmx.c              |    1 +
>>   include/linux/kvm_host.h        |    2 +-
>>   include/linux/mmu_notifier.h    |   48 +++++++++++++++++++++++++++++++++++++++
>>   mm/mmu_notifier.c               |   33 ++++++++++++++++++++++++++
>>   virt/kvm/kvm_main.c             |   27 ++++++++++++++++++++++
>>   8 files changed, 149 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index d2ac8e2..f0d7aa0 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -848,6 +848,7 @@ extern bool kvm_rebooting;
>>   int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
>>   int kvm_age_hva(struct kvm *kvm, unsigned long hva);
>>   int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
>> +int kvm_test_and_clear_dirty_hva(struct kvm *kvm, unsigned long hva);
>>   void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
>>   int cpuid_maxphyaddr(struct kvm_vcpu *vcpu);
>>   int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu);
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index aee3862..a5a0c51 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -979,6 +979,37 @@ out:
>>   	return young;
>>   }
>>
>> +/*
>> + * Caller is supposed to SetPageDirty(), it's not done inside this.
>> + */
>> +static
>> +int kvm_test_and_clear_dirty_rmapp(struct kvm *kvm, unsigned long *rmapp,
>> +				   unsigned long data)
>> +{
>> +	u64 *spte;
>> +	int dirty = 0;
>> +
>> +	if (!shadow_dirty_mask) {
>> +		WARN(1, "KVM: do NOT try to test dirty bit in EPT\n");
>> +		goto out;
>> +	}
> This should never fire with the dirty_update() notifier test, right?
> And that means that this whole optimization is for the shadow mmu case,
> arguably the legacy case.
>

Hi Chris,
AMD npt does track the dirty bit in the nested page tables,
so the shadow_dirty_mask should not be 0 in that case...

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
@ 2011-06-22  6:15       ` Izik Eidus
  0 siblings, 0 replies; 96+ messages in thread
From: Izik Eidus @ 2011-06-22  6:15 UTC (permalink / raw)
  To: Chris Wright
  Cc: Nai Xia, Andrew Morton, Andrea Arcangeli, Hugh Dickins,
	Rik van Riel, linux-mm, Johannes Weiner, linux-kernel, kvm,
	mtosatti

On 6/22/2011 3:21 AM, Chris Wright wrote:
> * Nai Xia (nai.xia@gmail.com) wrote:
>> Introduced kvm_mmu_notifier_test_and_clear_dirty(), kvm_mmu_notifier_dirty_update()
>> and their mmu_notifier interfaces to support KSM dirty bit tracking, which brings
>> significant performance gain in volatile pages scanning in KSM.
>> Currently, kvm_mmu_notifier_dirty_update() returns 0 if and only if intel EPT is
>> enabled to indicate that the dirty bits of underlying sptes are not updated by
>> hardware.
> Did you test with each of EPT, NPT and shadow?
>
>> Signed-off-by: Nai Xia<nai.xia@gmail.com>
>> Acked-by: Izik Eidus<izik.eidus@ravellosystems.com>
>> ---
>>   arch/x86/include/asm/kvm_host.h |    1 +
>>   arch/x86/kvm/mmu.c              |   36 +++++++++++++++++++++++++++++
>>   arch/x86/kvm/mmu.h              |    3 +-
>>   arch/x86/kvm/vmx.c              |    1 +
>>   include/linux/kvm_host.h        |    2 +-
>>   include/linux/mmu_notifier.h    |   48 +++++++++++++++++++++++++++++++++++++++
>>   mm/mmu_notifier.c               |   33 ++++++++++++++++++++++++++
>>   virt/kvm/kvm_main.c             |   27 ++++++++++++++++++++++
>>   8 files changed, 149 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index d2ac8e2..f0d7aa0 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -848,6 +848,7 @@ extern bool kvm_rebooting;
>>   int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
>>   int kvm_age_hva(struct kvm *kvm, unsigned long hva);
>>   int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
>> +int kvm_test_and_clear_dirty_hva(struct kvm *kvm, unsigned long hva);
>>   void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
>>   int cpuid_maxphyaddr(struct kvm_vcpu *vcpu);
>>   int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu);
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index aee3862..a5a0c51 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -979,6 +979,37 @@ out:
>>   	return young;
>>   }
>>
>> +/*
>> + * Caller is supposed to SetPageDirty(), it's not done inside this.
>> + */
>> +static
>> +int kvm_test_and_clear_dirty_rmapp(struct kvm *kvm, unsigned long *rmapp,
>> +				   unsigned long data)
>> +{
>> +	u64 *spte;
>> +	int dirty = 0;
>> +
>> +	if (!shadow_dirty_mask) {
>> +		WARN(1, "KVM: do NOT try to test dirty bit in EPT\n");
>> +		goto out;
>> +	}
> This should never fire with the dirty_update() notifier test, right?
> And that means that this whole optimization is for the shadow mmu case,
> arguably the legacy case.
>

Hi Chris,
AMD npt does track the dirty bit in the nested page tables,
so the shadow_dirty_mask should not be 0 in that case...


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
  2011-06-22  6:15       ` Izik Eidus
@ 2011-06-22  6:38         ` Nai Xia
  -1 siblings, 0 replies; 96+ messages in thread
From: Nai Xia @ 2011-06-22  6:38 UTC (permalink / raw)
  To: Izik Eidus
  Cc: Chris Wright, Andrew Morton, Andrea Arcangeli, Hugh Dickins,
	Rik van Riel, linux-mm, Johannes Weiner, linux-kernel, kvm,
	mtosatti

On Wednesday 22 June 2011 14:15:51 Izik Eidus wrote:
> On 6/22/2011 3:21 AM, Chris Wright wrote:
> > * Nai Xia (nai.xia@gmail.com) wrote:
> >> Introduced kvm_mmu_notifier_test_and_clear_dirty(), kvm_mmu_notifier_dirty_update()
> >> and their mmu_notifier interfaces to support KSM dirty bit tracking, which brings
> >> significant performance gain in volatile pages scanning in KSM.
> >> Currently, kvm_mmu_notifier_dirty_update() returns 0 if and only if intel EPT is
> >> enabled to indicate that the dirty bits of underlying sptes are not updated by
> >> hardware.
> > Did you test with each of EPT, NPT and shadow?
> >
> >> Signed-off-by: Nai Xia<nai.xia@gmail.com>
> >> Acked-by: Izik Eidus<izik.eidus@ravellosystems.com>
> >> ---
> >>   arch/x86/include/asm/kvm_host.h |    1 +
> >>   arch/x86/kvm/mmu.c              |   36 +++++++++++++++++++++++++++++
> >>   arch/x86/kvm/mmu.h              |    3 +-
> >>   arch/x86/kvm/vmx.c              |    1 +
> >>   include/linux/kvm_host.h        |    2 +-
> >>   include/linux/mmu_notifier.h    |   48 +++++++++++++++++++++++++++++++++++++++
> >>   mm/mmu_notifier.c               |   33 ++++++++++++++++++++++++++
> >>   virt/kvm/kvm_main.c             |   27 ++++++++++++++++++++++
> >>   8 files changed, 149 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> >> index d2ac8e2..f0d7aa0 100644
> >> --- a/arch/x86/include/asm/kvm_host.h
> >> +++ b/arch/x86/include/asm/kvm_host.h
> >> @@ -848,6 +848,7 @@ extern bool kvm_rebooting;
> >>   int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
> >>   int kvm_age_hva(struct kvm *kvm, unsigned long hva);
> >>   int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
> >> +int kvm_test_and_clear_dirty_hva(struct kvm *kvm, unsigned long hva);
> >>   void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
> >>   int cpuid_maxphyaddr(struct kvm_vcpu *vcpu);
> >>   int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu);
> >> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> >> index aee3862..a5a0c51 100644
> >> --- a/arch/x86/kvm/mmu.c
> >> +++ b/arch/x86/kvm/mmu.c
> >> @@ -979,6 +979,37 @@ out:
> >>   	return young;
> >>   }
> >>
> >> +/*
> >> + * Caller is supposed to SetPageDirty(), it's not done inside this.
> >> + */
> >> +static
> >> +int kvm_test_and_clear_dirty_rmapp(struct kvm *kvm, unsigned long *rmapp,
> >> +				   unsigned long data)
> >> +{
> >> +	u64 *spte;
> >> +	int dirty = 0;
> >> +
> >> +	if (!shadow_dirty_mask) {
> >> +		WARN(1, "KVM: do NOT try to test dirty bit in EPT\n");
> >> +		goto out;
> >> +	}
> > This should never fire with the dirty_update() notifier test, right?
> > And that means that this whole optimization is for the shadow mmu case,
> > arguably the legacy case.
> >
> 
> Hi Chris,
> AMD npt does track the dirty bit in the nested page tables,
> so the shadow_dirty_mask should not be 0 in that case...
> 
Hi Izik, 
I think he meant that if the caller behaves correctly && (!shadow_dirty_mask), 
then kvm_test_and_clear_dirty_rmapp() will never be called at all, so 
this test inside kvm_test_and_clear_dirty_rmapp() is useless... As I said,
I added this test to guard against this interface being abused by others, like
a softer BUG_ON(), since the dirty bit is not so critical that we should bump into BUG().




^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
@ 2011-06-22  6:38         ` Nai Xia
  0 siblings, 0 replies; 96+ messages in thread
From: Nai Xia @ 2011-06-22  6:38 UTC (permalink / raw)
  To: Izik Eidus
  Cc: Chris Wright, Andrew Morton, Andrea Arcangeli, Hugh Dickins,
	Rik van Riel, linux-mm, Johannes Weiner, linux-kernel, kvm,
	mtosatti

On Wednesday 22 June 2011 14:15:51 Izik Eidus wrote:
> On 6/22/2011 3:21 AM, Chris Wright wrote:
> > * Nai Xia (nai.xia@gmail.com) wrote:
> >> Introduced kvm_mmu_notifier_test_and_clear_dirty(), kvm_mmu_notifier_dirty_update()
> >> and their mmu_notifier interfaces to support KSM dirty bit tracking, which brings
> >> significant performance gain in volatile pages scanning in KSM.
> >> Currently, kvm_mmu_notifier_dirty_update() returns 0 if and only if intel EPT is
> >> enabled to indicate that the dirty bits of underlying sptes are not updated by
> >> hardware.
> > Did you test with each of EPT, NPT and shadow?
> >
> >> Signed-off-by: Nai Xia<nai.xia@gmail.com>
> >> Acked-by: Izik Eidus<izik.eidus@ravellosystems.com>
> >> ---
> >>   arch/x86/include/asm/kvm_host.h |    1 +
> >>   arch/x86/kvm/mmu.c              |   36 +++++++++++++++++++++++++++++
> >>   arch/x86/kvm/mmu.h              |    3 +-
> >>   arch/x86/kvm/vmx.c              |    1 +
> >>   include/linux/kvm_host.h        |    2 +-
> >>   include/linux/mmu_notifier.h    |   48 +++++++++++++++++++++++++++++++++++++++
> >>   mm/mmu_notifier.c               |   33 ++++++++++++++++++++++++++
> >>   virt/kvm/kvm_main.c             |   27 ++++++++++++++++++++++
> >>   8 files changed, 149 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> >> index d2ac8e2..f0d7aa0 100644
> >> --- a/arch/x86/include/asm/kvm_host.h
> >> +++ b/arch/x86/include/asm/kvm_host.h
> >> @@ -848,6 +848,7 @@ extern bool kvm_rebooting;
> >>   int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
> >>   int kvm_age_hva(struct kvm *kvm, unsigned long hva);
> >>   int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
> >> +int kvm_test_and_clear_dirty_hva(struct kvm *kvm, unsigned long hva);
> >>   void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
> >>   int cpuid_maxphyaddr(struct kvm_vcpu *vcpu);
> >>   int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu);
> >> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> >> index aee3862..a5a0c51 100644
> >> --- a/arch/x86/kvm/mmu.c
> >> +++ b/arch/x86/kvm/mmu.c
> >> @@ -979,6 +979,37 @@ out:
> >>   	return young;
> >>   }
> >>
> >> +/*
> >> + * Caller is supposed to SetPageDirty(), it's not done inside this.
> >> + */
> >> +static
> >> +int kvm_test_and_clear_dirty_rmapp(struct kvm *kvm, unsigned long *rmapp,
> >> +				   unsigned long data)
> >> +{
> >> +	u64 *spte;
> >> +	int dirty = 0;
> >> +
> >> +	if (!shadow_dirty_mask) {
> >> +		WARN(1, "KVM: do NOT try to test dirty bit in EPT\n");
> >> +		goto out;
> >> +	}
> > This should never fire with the dirty_update() notifier test, right?
> > And that means that this whole optimization is for the shadow mmu case,
> > arguably the legacy case.
> >
> 
> Hi Chris,
> AMD npt does track the dirty bit in the nested page tables,
> so the shadow_dirty_mask should not be 0 in that case...
> 
Hi Izik, 
I think he meant that if the caller behaves correctly && (!shadow_dirty_mask), 
then kvm_test_and_clear_dirty_rmapp() will never be called at all, so 
this test inside kvm_test_and_clear_dirty_rmapp() is useless... As I said,
I added this test to guard against this interface being abused by others, like
a softer BUG_ON(), since the dirty bit is not so critical that we should bump into BUG().




^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
  2011-06-21 13:32   ` Nai Xia
@ 2011-06-22 10:43     ` Avi Kivity
  -1 siblings, 0 replies; 96+ messages in thread
From: Avi Kivity @ 2011-06-22 10:43 UTC (permalink / raw)
  To: nai.xia
  Cc: Andrew Morton, Izik Eidus, Andrea Arcangeli, Hugh Dickins,
	Chris Wright, Rik van Riel, linux-mm, Johannes Weiner,
	linux-kernel, kvm

On 06/21/2011 04:32 PM, Nai Xia wrote:
> Introduced kvm_mmu_notifier_test_and_clear_dirty(), kvm_mmu_notifier_dirty_update()
> and their mmu_notifier interfaces to support KSM dirty bit tracking, which brings
> significant performance gain in volatile pages scanning in KSM.
> Currently, kvm_mmu_notifier_dirty_update() returns 0 if and only if intel EPT is
> enabled to indicate that the dirty bits of underlying sptes are not updated by
> hardware.
>


Can you quantify the performance gains?

> +int kvm_test_and_clear_dirty_rmapp(struct kvm *kvm, unsigned long *rmapp,
> +				   unsigned long data)
> +{
> +	u64 *spte;
> +	int dirty = 0;
> +
> +	if (!shadow_dirty_mask) {
> +		WARN(1, "KVM: do NOT try to test dirty bit in EPT\n");
> +		goto out;
> +	}
> +
> +	spte = rmap_next(kvm, rmapp, NULL);
> +	while (spte) {
> +		int _dirty;
> +		u64 _spte = *spte;
> +		BUG_ON(!(_spte&  PT_PRESENT_MASK));
> +		_dirty = _spte&  PT_DIRTY_MASK;
> +		if (_dirty) {
> +			dirty = 1;
> +			clear_bit(PT_DIRTY_SHIFT, (unsigned long *)spte);
> +		}

Racy.  Also, needs a tlb flush eventually.

> +		spte = rmap_next(kvm, rmapp, spte);
> +	}
> +out:
> +	return dirty;
> +}
> +
>   #define RMAP_RECYCLE_THRESHOLD 1000
>
>
>   struct mmu_notifier_ops {
> +	int (*dirty_update)(struct mmu_notifier *mn,
> +			     struct mm_struct *mm);
> +

I prefer to have test_and_clear_dirty() always return 1 in this case (if 
the spte is writeable), and drop this callback.
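
Roughly (sketch only, untested):

	spte = rmap_next(kvm, rmapp, NULL);
	while (spte) {
		u64 _spte = *spte;

		if (!shadow_dirty_mask) {
			/* no HW dirty tracking (e.g. EPT): writable => report dirty */
			if (_spte & PT_WRITABLE_MASK)
				dirty = 1;
		} else if (_spte & PT_DIRTY_MASK) {
			dirty = 1;
			clear_bit(PT_DIRTY_SHIFT, (unsigned long *)spte);
		}
		spte = rmap_next(kvm, rmapp, spte);
	}

Then ksm simply sees "dirty" whenever the page may have been modified, and the
separate dirty_update() callback can go away.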
> +int __mmu_notifier_dirty_update(struct mm_struct *mm)
> +{
> +	struct mmu_notifier *mn;
> +	struct hlist_node *n;
> +	int dirty_update = 0;
> +
> +	rcu_read_lock();
> +	hlist_for_each_entry_rcu(mn, n,&mm->mmu_notifier_mm->list, hlist) {
> +		if (mn->ops->dirty_update)
> +			dirty_update |= mn->ops->dirty_update(mn, mm);
> +	}
> +	rcu_read_unlock();
> +

Should it not be &= instead?
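
i.e. something like this (sketch: starting from 1 so the result only says
the dirty bit is trustworthy when every registered notifier tracks it):

	int dirty_update = 1;

	hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) {
		if (mn->ops->dirty_update)
			dirty_update &= mn->ops->dirty_update(mn, mm);
	}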

> +	return dirty_update;
> +}
> +
>   /*
>    * This function can't run concurrently against mmu_notifier_register
>    * because mm->mm_users>  0 during mmu_notifier_register and exit_mmap

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
@ 2011-06-22 10:43     ` Avi Kivity
  0 siblings, 0 replies; 96+ messages in thread
From: Avi Kivity @ 2011-06-22 10:43 UTC (permalink / raw)
  To: nai.xia
  Cc: Andrew Morton, Izik Eidus, Andrea Arcangeli, Hugh Dickins,
	Chris Wright, Rik van Riel, linux-mm, Johannes Weiner,
	linux-kernel, kvm

On 06/21/2011 04:32 PM, Nai Xia wrote:
> Introduced kvm_mmu_notifier_test_and_clear_dirty(), kvm_mmu_notifier_dirty_update()
> and their mmu_notifier interfaces to support KSM dirty bit tracking, which brings
> significant performance gain in volatile pages scanning in KSM.
> Currently, kvm_mmu_notifier_dirty_update() returns 0 if and only if intel EPT is
> enabled to indicate that the dirty bits of underlying sptes are not updated by
> hardware.
>


Can you quantify the performance gains?

> +int kvm_test_and_clear_dirty_rmapp(struct kvm *kvm, unsigned long *rmapp,
> +				   unsigned long data)
> +{
> +	u64 *spte;
> +	int dirty = 0;
> +
> +	if (!shadow_dirty_mask) {
> +		WARN(1, "KVM: do NOT try to test dirty bit in EPT\n");
> +		goto out;
> +	}
> +
> +	spte = rmap_next(kvm, rmapp, NULL);
> +	while (spte) {
> +		int _dirty;
> +		u64 _spte = *spte;
> +		BUG_ON(!(_spte&  PT_PRESENT_MASK));
> +		_dirty = _spte&  PT_DIRTY_MASK;
> +		if (_dirty) {
> +			dirty = 1;
> +			clear_bit(PT_DIRTY_SHIFT, (unsigned long *)spte);
> +		}

Racy.  Also, needs a tlb flush eventually.

> +		spte = rmap_next(kvm, rmapp, spte);
> +	}
> +out:
> +	return dirty;
> +}
> +
>   #define RMAP_RECYCLE_THRESHOLD 1000
>
>
>   struct mmu_notifier_ops {
> +	int (*dirty_update)(struct mmu_notifier *mn,
> +			     struct mm_struct *mm);
> +

I prefer to have test_and_clear_dirty() always return 1 in this case (if 
the spte is writeable), and drop this callback.
> +int __mmu_notifier_dirty_update(struct mm_struct *mm)
> +{
> +	struct mmu_notifier *mn;
> +	struct hlist_node *n;
> +	int dirty_update = 0;
> +
> +	rcu_read_lock();
> +	hlist_for_each_entry_rcu(mn, n,&mm->mmu_notifier_mm->list, hlist) {
> +		if (mn->ops->dirty_update)
> +			dirty_update |= mn->ops->dirty_update(mn, mm);
> +	}
> +	rcu_read_unlock();
> +

Should it not be &= instead?

> +	return dirty_update;
> +}
> +
>   /*
>    * This function can't run concurrently against mmu_notifier_register
>    * because mm->mm_users>  0 during mmu_notifier_register and exit_mmap

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 2/2 V2] ksm: take dirty bit as reference to avoid volatile pages scanning
  2011-06-22  0:35         ` Chris Wright
@ 2011-06-22 10:55           ` Nai Xia
  -1 siblings, 0 replies; 96+ messages in thread
From: Nai Xia @ 2011-06-22 10:55 UTC (permalink / raw)
  To: Chris Wright
  Cc: Andrew Morton, Izik Eidus, Andrea Arcangeli, Hugh Dickins,
	Rik van Riel, linux-mm, Johannes Weiner, linux-kernel

On Wednesday 22 June 2011 08:35:36 Chris Wright wrote:
> * Nai Xia (nai.xia@gmail.com) wrote:
> > (Sorry for repeated mail, I forgot to Cc the list..)
> > 
> > On Wednesday 22 June 2011 06:38:00 you wrote:
> > > * Nai Xia (nai.xia@gmail.com) wrote:
> > > > Introduced ksm_page_changed() to reference the dirty bit of a pte. We clear 
> > > > the dirty bit for each pte scanned but don't flush the tlb. For a huge page, 
> > > > if one of the subpages has changed, we try to skip the whole huge page, 
> > > > assuming (this is true by now) that ksmd linearly scans the address space.
> > > 
> > > This doesn't build w/ kvm as a module.
> > 
> > I think it's because of a naming error in a related kvm patch, which I only sent
> > in the same email thread: http://marc.info/?l=linux-mm&m=130866318804277&w=2
> > The patch split is not clean... I'll redo it.
> > 
> 
> It needs an export as it is.
> ERROR: "kvm_dirty_update" [arch/x86/kvm/kvm-intel.ko] undefined!
> 
> Although perhaps could be done w/out that dirty_update altogether (as I
> mentioned in other email)?
> 
> > > 
> > > > A NEW_FLAG is also introduced as a status of rmap_item to make ksmd scan
> > > > more aggressively for new VMAs - only skip the pages considered to be volatile
> > > > by the dirty bits. This can be enabled/disabled through KSM's sysfs interface.
> > > 
> > > This seems like it should be separated out.  And while it might be useful
> > > to enable/disable for testing, I don't think it's worth supporting for
> > > the long term.  Would also be useful to see the value of this flag.
> > 
> > I think it may be useful for users who want to turn on/off this scan policy explicitly
> > according to their working sets? 
> 
> Can you split it out, and show the benefit of it directly?  I think it
> only benefits:
> 
> p = mmap()
> memset(p, $value, entire buffer);
> ...
> very slowly (w.r.t scan times) touch bits of buffer and trigger cow to
> break sharing.
> 
> Would you agree?

The direct benefit of it is that when merging a very big area, the system
is not caught in a non-trivial period during which users see free memory 
actually dropping because only rmap_items are being created, even though they are
100% sure that their working set is highly duplicated. I think that is puzzling 
to users and also risks OOM.

Thanks,
Nai

> 
> thanks,
> -chris
> 

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 2/2 V2] ksm: take dirty bit as reference to avoid volatile pages scanning
@ 2011-06-22 10:55           ` Nai Xia
  0 siblings, 0 replies; 96+ messages in thread
From: Nai Xia @ 2011-06-22 10:55 UTC (permalink / raw)
  To: Chris Wright
  Cc: Andrew Morton, Izik Eidus, Andrea Arcangeli, Hugh Dickins,
	Rik van Riel, linux-mm, Johannes Weiner, linux-kernel

On Wednesday 22 June 2011 08:35:36 Chris Wright wrote:
> * Nai Xia (nai.xia@gmail.com) wrote:
> > (Sorry for repeated mail, I forgot to Cc the list..)
> > 
> > On Wednesday 22 June 2011 06:38:00 you wrote:
> > > * Nai Xia (nai.xia@gmail.com) wrote:
> > > > Introduced ksm_page_changed() to reference the dirty bit of a pte. We clear 
> > > > the dirty bit for each pte scanned but don't flush the tlb. For a huge page, 
> > > > if one of the subpages has changed, we try to skip the whole huge page, 
> > > > assuming (this is true by now) that ksmd linearly scans the address space.
> > > 
> > > This doesn't build w/ kvm as a module.
> > 
> > I think it's because of a naming error in a related kvm patch, which I only sent
> > in the same email thread: http://marc.info/?l=linux-mm&m=130866318804277&w=2
> > The patch split is not clean... I'll redo it.
> > 
> 
> It needs an export as it is.
> ERROR: "kvm_dirty_update" [arch/x86/kvm/kvm-intel.ko] undefined!
> 
> Although perhaps could be done w/out that dirty_update altogether (as I
> mentioned in other email)?
> 
> > > 
> > > > A NEW_FLAG is also introduced as a status of rmap_item to make ksmd scan
> > > > more aggressively for new VMAs - only skip the pages considered to be volatile
> > > > by the dirty bits. This can be enabled/disabled through KSM's sysfs interface.
> > > 
> > > This seems like it should be separated out.  And while it might be useful
> > > to enable/disable for testing, I don't think it's worth supporting for
> > > the long term.  Would also be useful to see the value of this flag.
> > 
> > I think it may be useful for users who want to turn on/off this scan policy explicitly
> > according to their working sets? 
> 
> Can you split it out, and show the benefit of it directly?  I think it
> only benefits:
> 
> p = mmap()
> memset(p, $value, entire buffer);
> ...
> very slowly (w.r.t scan times) touch bits of buffer and trigger cow to
> break sharing.
> 
> Would you agree?

The direct benefit of it is that when merging a very big area, the system
is not caught in a non-trivial period during which users see free memory 
actually dropping because only rmap_items are being created, even though they are
100% sure that their working set is highly duplicated. I think that is puzzling 
to users and also risks OOM.

Thanks,
Nai

> 
> thanks,
> -chris
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
  2011-06-22 10:43     ` Avi Kivity
@ 2011-06-22 11:05       ` Izik Eidus
  -1 siblings, 0 replies; 96+ messages in thread
From: Izik Eidus @ 2011-06-22 11:05 UTC (permalink / raw)
  To: Avi Kivity
  Cc: nai.xia, Andrew Morton, Andrea Arcangeli, Hugh Dickins,
	Chris Wright, Rik van Riel, linux-mm, Johannes Weiner,
	linux-kernel, kvm

On 6/22/2011 1:43 PM, Avi Kivity wrote:
> On 06/21/2011 04:32 PM, Nai Xia wrote:
>> Introduced kvm_mmu_notifier_test_and_clear_dirty(), 
>> kvm_mmu_notifier_dirty_update()
>> and their mmu_notifier interfaces to support KSM dirty bit tracking, 
>> which brings
>> significant performance gain in volatile pages scanning in KSM.
>> Currently, kvm_mmu_notifier_dirty_update() returns 0 if and only if 
>> intel EPT is
>> enabled to indicate that the dirty bits of underlying sptes are not 
>> updated by
>> hardware.
>>
>
>
> Can you quantify the performance gains?
>
>> +int kvm_test_and_clear_dirty_rmapp(struct kvm *kvm, unsigned long 
>> *rmapp,
>> +                   unsigned long data)
>> +{
>> +    u64 *spte;
>> +    int dirty = 0;
>> +
>> +    if (!shadow_dirty_mask) {
>> +        WARN(1, "KVM: do NOT try to test dirty bit in EPT\n");
>> +        goto out;
>> +    }
>> +
>> +    spte = rmap_next(kvm, rmapp, NULL);
>> +    while (spte) {
>> +        int _dirty;
>> +        u64 _spte = *spte;
>> +        BUG_ON(!(_spte&  PT_PRESENT_MASK));
>> +        _dirty = _spte&  PT_DIRTY_MASK;
>> +        if (_dirty) {
>> +            dirty = 1;
>> +            clear_bit(PT_DIRTY_SHIFT, (unsigned long *)spte);
>> +        }
>
> Racy.  Also, needs a tlb flush eventually.

Hi, one of the issues is that the whole point of this patch is to avoid
doing the tlb flush at all.
But I see your point: other users will not expect such behavior, so maybe
we need a flush_tlb= parameter, or another mmu notifier call?


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
  2011-06-22 11:05       ` Izik Eidus
@ 2011-06-22 11:10         ` Avi Kivity
  -1 siblings, 0 replies; 96+ messages in thread
From: Avi Kivity @ 2011-06-22 11:10 UTC (permalink / raw)
  To: Izik Eidus
  Cc: nai.xia, Andrew Morton, Andrea Arcangeli, Hugh Dickins,
	Chris Wright, Rik van Riel, linux-mm, Johannes Weiner,
	linux-kernel, kvm

On 06/22/2011 02:05 PM, Izik Eidus wrote:
>>> +    spte = rmap_next(kvm, rmapp, NULL);
>>> +    while (spte) {
>>> +        int _dirty;
>>> +        u64 _spte = *spte;
>>> +        BUG_ON(!(_spte&  PT_PRESENT_MASK));
>>> +        _dirty = _spte&  PT_DIRTY_MASK;
>>> +        if (_dirty) {
>>> +            dirty = 1;
>>> +            clear_bit(PT_DIRTY_SHIFT, (unsigned long *)spte);
>>> +        }
>>
>> Racy.  Also, needs a tlb flush eventually.
> +
>
> Hi, one of the issues is that the whole point of this patch is not do 
> tlb flush eventually,
> But I see your point, because other users will not expect such 
> behavior, so maybe there is need into a parameter
> flush_tlb=?, or add another mmu notifier call?
>

If you don't flush the tlb, a subsequent write will not see that spte.d 
is clear and the write will happen.  So you'll see the page as clean 
even though it's dirty.  That's not acceptable.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
  2011-06-22 11:10         ` Avi Kivity
@ 2011-06-22 11:19           ` Izik Eidus
  -1 siblings, 0 replies; 96+ messages in thread
From: Izik Eidus @ 2011-06-22 11:19 UTC (permalink / raw)
  To: Avi Kivity
  Cc: nai.xia, Andrew Morton, Andrea Arcangeli, Hugh Dickins,
	Chris Wright, Rik van Riel, linux-mm, Johannes Weiner,
	linux-kernel, kvm

On 6/22/2011 2:10 PM, Avi Kivity wrote:
> On 06/22/2011 02:05 PM, Izik Eidus wrote:
>>>> +    spte = rmap_next(kvm, rmapp, NULL);
>>>> +    while (spte) {
>>>> +        int _dirty;
>>>> +        u64 _spte = *spte;
>>>> +        BUG_ON(!(_spte&  PT_PRESENT_MASK));
>>>> +        _dirty = _spte&  PT_DIRTY_MASK;
>>>> +        if (_dirty) {
>>>> +            dirty = 1;
>>>> +            clear_bit(PT_DIRTY_SHIFT, (unsigned long *)spte);
>>>> +        }
>>>
>>> Racy.  Also, needs a tlb flush eventually.
>> +
>>
>> Hi, one of the issues is that the whole point of this patch is not do 
>> tlb flush eventually,
>> But I see your point, because other users will not expect such 
>> behavior, so maybe there is need into a parameter
>> flush_tlb=?, or add another mmu notifier call?
>>
>
> If you don't flush the tlb, a subsequent write will not see that 
> spte.d is clear and the write will happen.  So you'll see the page as 
> clean even though it's dirty.  That's not acceptable.
>

Yes, but this is exactly what we want for this use case:
Right now ksm calculates a page checksum to see whether the page has changed;
the idea behind this patch is to use the dirty bit instead.
However, the guest might not really like the fact that we flush its tlb over
and over again, especially in a periodic scan like the one ksm does.

So what we are saying here is: it is better to have a little junk in the
unstable tree, which gets flushed eventually anyway, than to make the guest
slower.
This race does not affect the accuracy of ksm anyway, because of the full
memcmp that we will eventually perform.

Of course we trust that in most cases a tlb flush will already have been
performed, because on real systems it takes ksm a few minutes to get back to
a given random virtual address.

What do you think about having 2 calls: one that does the expected behavior
and flushes the tlb, and one that clearly says it does not flush the tlb and
explains its ksm use case?
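
A rough sketch of what such a pair of callbacks could look like; the names
test_and_clear_dirty / test_and_clear_dirty_noflush below are only
illustrative assumptions of this sketch, not names from the posted patch:

  struct mmu_notifier_ops {
          ...
          /*
           * Accurate variant: clears the spte dirty bit and flushes the
           * remote tlb, so a later write is guaranteed to set it again.
           */
          int (*test_and_clear_dirty)(struct mmu_notifier *mn,
                                      struct mm_struct *mm,
                                      unsigned long address);

          /*
           * Fast variant for ksm-style scanners: clears the dirty bit but
           * skips the tlb flush, so a write through a stale tlb entry may
           * be missed; acceptable only because ksm's later memcmp catches
           * any difference before merging.
           */
          int (*test_and_clear_dirty_noflush)(struct mmu_notifier *mn,
                                              struct mm_struct *mm,
                                              unsigned long address);
  };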

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
  2011-06-22 11:19           ` Izik Eidus
@ 2011-06-22 11:24             ` Avi Kivity
  -1 siblings, 0 replies; 96+ messages in thread
From: Avi Kivity @ 2011-06-22 11:24 UTC (permalink / raw)
  To: Izik Eidus
  Cc: nai.xia, Andrew Morton, Andrea Arcangeli, Hugh Dickins,
	Chris Wright, Rik van Riel, linux-mm, Johannes Weiner,
	linux-kernel, kvm

On 06/22/2011 02:19 PM, Izik Eidus wrote:
> On 6/22/2011 2:10 PM, Avi Kivity wrote:
>> On 06/22/2011 02:05 PM, Izik Eidus wrote:
>>>>> +    spte = rmap_next(kvm, rmapp, NULL);
>>>>> +    while (spte) {
>>>>> +        int _dirty;
>>>>> +        u64 _spte = *spte;
>>>>> +        BUG_ON(!(_spte&  PT_PRESENT_MASK));
>>>>> +        _dirty = _spte&  PT_DIRTY_MASK;
>>>>> +        if (_dirty) {
>>>>> +            dirty = 1;
>>>>> +            clear_bit(PT_DIRTY_SHIFT, (unsigned long *)spte);
>>>>> +        }
>>>>
>>>> Racy.  Also, needs a tlb flush eventually.
>>> +
>>>
>>> Hi, one of the issues is that the whole point of this patch is not 
>>> do tlb flush eventually,
>>> But I see your point, because other users will not expect such 
>>> behavior, so maybe there is need into a parameter
>>> flush_tlb=?, or add another mmu notifier call?
>>>
>>
>> If you don't flush the tlb, a subsequent write will not see that 
>> spte.d is clear and the write will happen.  So you'll see the page as 
>> clean even though it's dirty.  That's not acceptable.
>>
>
> Yes, but this is exactly what we want from this use case:
> Right now ksm calculate the page hash to see if it was changed, the 
> idea behind this patch is to use the dirty bit instead,
> however the guest might not really like the fact that we will flush 
> its tlb over and over again, specially in periodically scan like ksm 
> does.

I see.

>
> So what we say here is: it is better to have little junk in the 
> unstable tree that get flushed eventualy anyway, instead of make the 
> guest slower....
> this race is something that does not reflect accurate of ksm anyway 
> due to the full memcmp that we will eventualy perform...
>
> Ofcurse we trust that in most cases, beacuse it take ksm to get into a 
> random virtual address in real systems few minutes, there will be 
> already tlb flush performed.
>
> What you think about having 2 calls: one that does the expected 
> behivor and does flush the tlb, and one that clearly say it doesnt 
> flush the tlb
> and expline its use case for ksm?

Yes.  And if the unstable/fast callback is not provided, have the common 
code fall back to the stable/slow callback instead.

Or have a parameter that allows inaccurate results to be returned more 
quickly.
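
A minimal sketch of that common-code fallback, assuming the illustrative
callback names used above (it mirrors the shape of the
__mmu_notifier_dirty_update() loop quoted elsewhere in this thread):

  int __mmu_notifier_test_and_clear_dirty(struct mm_struct *mm,
                                          unsigned long address, int fast)
  {
          struct mmu_notifier *mn;
          struct hlist_node *n;
          int dirty = 0;

          rcu_read_lock();
          hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) {
                  if (fast && mn->ops->test_and_clear_dirty_noflush)
                          /* fast path: no tlb flush, may miss recent writes */
                          dirty |= mn->ops->test_and_clear_dirty_noflush(mn, mm, address);
                  else if (mn->ops->test_and_clear_dirty)
                          /* fall back to the accurate, flushing variant */
                          dirty |= mn->ops->test_and_clear_dirty(mn, mm, address);
          }
          rcu_read_unlock();

          return dirty;
  }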

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
  2011-06-22 10:43     ` Avi Kivity
  (?)
  (?)
@ 2011-06-22 11:24     ` Nai Xia
  -1 siblings, 0 replies; 96+ messages in thread
From: Nai Xia @ 2011-06-22 11:24 UTC (permalink / raw)
  To: Undisclosed.Recipients:
  Cc: Andrew Morton, Izik Eidus, Andrea Arcangeli, Hugh Dickins,
	Chris Wright, Rik van Riel, linux-mm, Johannes Weiner,
	linux-kernel, kvm

Hi Avi,

Thanks for viewing!

On Wednesday 22 June 2011 18:43:30 Avi Kivity wrote:
> On 06/21/2011 04:32 PM, Nai Xia wrote:
> > Introduced kvm_mmu_notifier_test_and_clear_dirty(), kvm_mmu_notifier_dirty_update()
> > and their mmu_notifier interfaces to support KSM dirty bit tracking, which brings
> > significant performance gain in volatile pages scanning in KSM.
> > Currently, kvm_mmu_notifier_dirty_update() returns 0 if and only if intel EPT is
> > enabled to indicate that the dirty bits of underlying sptes are not updated by
> > hardware.
> >
> 
> 
> Can you quantify the performance gains?

Compared with the checksum based approach, the speedup for a volatile host
working set is about 8 times on normal pages and 16 times on transparent huge
pages. I have not collected the figures inside a guest OS yet; I'll be back
with the guest numbers.

> 
> > +int kvm_test_and_clear_dirty_rmapp(struct kvm *kvm, unsigned long *rmapp,
> > +				   unsigned long data)
> > +{
> > +	u64 *spte;
> > +	int dirty = 0;
> > +
> > +	if (!shadow_dirty_mask) {
> > +		WARN(1, "KVM: do NOT try to test dirty bit in EPT\n");
> > +		goto out;
> > +	}
> > +
> > +	spte = rmap_next(kvm, rmapp, NULL);
> > +	while (spte) {
> > +		int _dirty;
> > +		u64 _spte = *spte;
> > +		BUG_ON(!(_spte&  PT_PRESENT_MASK));
> > +		_dirty = _spte&  PT_DIRTY_MASK;
> > +		if (_dirty) {
> > +			dirty = 1;
> > +			clear_bit(PT_DIRTY_SHIFT, (unsigned long *)spte);
> > +		}
> 
> Racy.  Also, needs a tlb flush eventually.
> 
> > +		spte = rmap_next(kvm, rmapp, spte);
> > +	}
> > +out:
> > +	return dirty;
> > +}
> > +
> >   #define RMAP_RECYCLE_THRESHOLD 1000
> >
> >
> >   struct mmu_notifier_ops {
> > +	int (*dirty_update)(struct mmu_notifier *mn,
> > +			     struct mm_struct *mm);
> > +
> 
> I prefer to have test_and_clear_dirty() always return 1 in this case (if 
> the spte is writeable), and drop this callback.

If test_and_clear_dirty() always returns 1, how can ksmd tell whether the
page is really dirty or the result is just caused by EPT, in which case
ksmd should fall back to the checksum based approach?
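
To make the intended fallback concrete, a sketch of the ksmd side; the helper
names roughly follow the patch description (ksm_page_changed(), calc_checksum())
but the exact signatures here are assumptions, not the posted code:

  static int ksm_page_changed(struct page *page, struct rmap_item *rmap_item)
  {
          unsigned int checksum;
          struct mm_struct *mm = rmap_item->mm;

          /* dirty bits are maintained (shadow paging, NPT): trust them */
          if (mmu_notifier_dirty_update(mm))
                  return mmu_notifier_test_and_clear_dirty(mm, rmap_item->address);

          /* EPT: the dirty bit is dead, fall back to the old checksum test */
          checksum = calc_checksum(page);
          if (rmap_item->oldchecksum != checksum) {
                  rmap_item->oldchecksum = checksum;
                  return 1;
          }
          return 0;
  }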

> > +int __mmu_notifier_dirty_update(struct mm_struct *mm)
> > +{
> > +	struct mmu_notifier *mn;
> > +	struct hlist_node *n;
> > +	int dirty_update = 0;
> > +
> > +	rcu_read_lock();
> > +	hlist_for_each_entry_rcu(mn, n,&mm->mmu_notifier_mm->list, hlist) {
> > +		if (mn->ops->dirty_update)
> > +			dirty_update |= mn->ops->dirty_update(mn, mm);
> > +	}
> > +	rcu_read_unlock();
> > +
> 
> Should it not be &= instead?

I think the logic is: if _any_ underlying MMU is going to update the bit,
then the bit is not dead and we can query it through test_and_clear_dirty().
ksmd should not care about which MMU dirties the page; as long as the page is
dirty, it can be skipped. Did I miss something?

Thanks,

Nai


> 
> > +	return dirty_update;
> > +}
> > +
> >   /*
> >    * This function can't run concurrently against mmu_notifier_register
> >    * because mm->mm_users>  0 during mmu_notifier_register and exit_mmap
> 
> 


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
  2011-06-22 11:24             ` Avi Kivity
@ 2011-06-22 11:28               ` Avi Kivity
  -1 siblings, 0 replies; 96+ messages in thread
From: Avi Kivity @ 2011-06-22 11:28 UTC (permalink / raw)
  To: Izik Eidus
  Cc: nai.xia, Andrew Morton, Andrea Arcangeli, Hugh Dickins,
	Chris Wright, Rik van Riel, linux-mm, Johannes Weiner,
	linux-kernel, kvm

On 06/22/2011 02:24 PM, Avi Kivity wrote:
> On 06/22/2011 02:19 PM, Izik Eidus wrote:
>> On 6/22/2011 2:10 PM, Avi Kivity wrote:
>>> On 06/22/2011 02:05 PM, Izik Eidus wrote:
>>>>>> +    spte = rmap_next(kvm, rmapp, NULL);
>>>>>> +    while (spte) {
>>>>>> +        int _dirty;
>>>>>> +        u64 _spte = *spte;
>>>>>> +        BUG_ON(!(_spte&  PT_PRESENT_MASK));
>>>>>> +        _dirty = _spte&  PT_DIRTY_MASK;
>>>>>> +        if (_dirty) {
>>>>>> +            dirty = 1;
>>>>>> +            clear_bit(PT_DIRTY_SHIFT, (unsigned long *)spte);
>>>>>> +        }
>>>>>
>>>>> Racy.  Also, needs a tlb flush eventually.
>>>> +
>>>>
>>>> Hi, one of the issues is that the whole point of this patch is not 
>>>> do tlb flush eventually,
>>>> But I see your point, because other users will not expect such 
>>>> behavior, so maybe there is need into a parameter
>>>> flush_tlb=?, or add another mmu notifier call?
>>>>
>>>
>>> If you don't flush the tlb, a subsequent write will not see that 
>>> spte.d is clear and the write will happen.  So you'll see the page 
>>> as clean even though it's dirty.  That's not acceptable.
>>>
>>
>> Yes, but this is exactly what we want from this use case:
>> Right now ksm calculate the page hash to see if it was changed, the 
>> idea behind this patch is to use the dirty bit instead,
>> however the guest might not really like the fact that we will flush 
>> its tlb over and over again, specially in periodically scan like ksm 
>> does.
>
> I see.

Actually, this is dangerous.  If we use the dirty bit for other things, 
we will get data corruption.

For example we might want to map clean host pages as writeable-clean in 
the spte on a read fault so that we don't get a page fault when they get 
eventually written.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
  2011-06-22 11:28               ` Avi Kivity
@ 2011-06-22 11:31                 ` Avi Kivity
  -1 siblings, 0 replies; 96+ messages in thread
From: Avi Kivity @ 2011-06-22 11:31 UTC (permalink / raw)
  To: Izik Eidus
  Cc: nai.xia, Andrew Morton, Andrea Arcangeli, Hugh Dickins,
	Chris Wright, Rik van Riel, linux-mm, Johannes Weiner,
	linux-kernel, kvm

On 06/22/2011 02:28 PM, Avi Kivity wrote:
>
> Actually, this is dangerous.  If we use the dirty bit for other 
> things, we will get data corruption.
>
> For example we might want to map clean host pages as writeable-clean 
> in the spte on a read fault so that we don't get a page fault when 
> they get eventually written.
>

Another example - we can use the dirty bit for dirty page logging.

So I think we can get away with a conditional tlb flush - only flush if 
the page was dirty.  That should be rare after the first pass, at least 
with small pages.
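
A sketch of that conditional flush on top of the posted
kvm_test_and_clear_dirty_rmapp() loop; kvm_flush_remote_tlbs() is the existing
KVM primitive for a remote flush, the rest follows the quoted code:

  spte = rmap_next(kvm, rmapp, NULL);
  while (spte) {
          u64 _spte = *spte;

          BUG_ON(!(_spte & PT_PRESENT_MASK));
          if (_spte & PT_DIRTY_MASK) {
                  dirty = 1;
                  clear_bit(PT_DIRTY_SHIFT, (unsigned long *)spte);
          }
          spte = rmap_next(kvm, rmapp, spte);
  }
  if (dirty)
          kvm_flush_remote_tlbs(kvm);     /* flush only when a dirty bit was cleared */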

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
  2011-06-22 11:28               ` Avi Kivity
@ 2011-06-22 11:33                 ` Nai Xia
  -1 siblings, 0 replies; 96+ messages in thread
From: Nai Xia @ 2011-06-22 11:33 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Izik Eidus, Andrew Morton, Andrea Arcangeli, Hugh Dickins,
	Chris Wright, Rik van Riel, linux-mm, Johannes Weiner,
	linux-kernel, kvm

On Wednesday 22 June 2011 19:28:08 Avi Kivity wrote:
> On 06/22/2011 02:24 PM, Avi Kivity wrote:
> > On 06/22/2011 02:19 PM, Izik Eidus wrote:
> >> On 6/22/2011 2:10 PM, Avi Kivity wrote:
> >>> On 06/22/2011 02:05 PM, Izik Eidus wrote:
> >>>>>> +    spte = rmap_next(kvm, rmapp, NULL);
> >>>>>> +    while (spte) {
> >>>>>> +        int _dirty;
> >>>>>> +        u64 _spte = *spte;
> >>>>>> +        BUG_ON(!(_spte&  PT_PRESENT_MASK));
> >>>>>> +        _dirty = _spte&  PT_DIRTY_MASK;
> >>>>>> +        if (_dirty) {
> >>>>>> +            dirty = 1;
> >>>>>> +            clear_bit(PT_DIRTY_SHIFT, (unsigned long *)spte);
> >>>>>> +        }
> >>>>>
> >>>>> Racy.  Also, needs a tlb flush eventually.
> >>>> +
> >>>>
> >>>> Hi, one of the issues is that the whole point of this patch is not 
> >>>> do tlb flush eventually,
> >>>> But I see your point, because other users will not expect such 
> >>>> behavior, so maybe there is need into a parameter
> >>>> flush_tlb=?, or add another mmu notifier call?
> >>>>
> >>>
> >>> If you don't flush the tlb, a subsequent write will not see that 
> >>> spte.d is clear and the write will happen.  So you'll see the page 
> >>> as clean even though it's dirty.  That's not acceptable.
> >>>
> >>
> >> Yes, but this is exactly what we want from this use case:
> >> Right now ksm calculate the page hash to see if it was changed, the 
> >> idea behind this patch is to use the dirty bit instead,
> >> however the guest might not really like the fact that we will flush 
> >> its tlb over and over again, specially in periodically scan like ksm 
> >> does.
> >
> > I see.
> 
> Actually, this is dangerous.  If we use the dirty bit for other things, 
> we will get data corruption.

Yeah, I actually clarified, in a reply to Chris about his similar concern,
that we are currently the _only_ user. :)
We can add the flushing once someone else needs to rely on this bit.

> 
> For example we might want to map clean host pages as writeable-clean in 
> the spte on a read fault so that we don't get a page fault when they get 
> eventually written.
> 
> 

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
  2011-06-22 11:33                 ` Nai Xia
@ 2011-06-22 11:39                   ` Izik Eidus
  -1 siblings, 0 replies; 96+ messages in thread
From: Izik Eidus @ 2011-06-22 11:39 UTC (permalink / raw)
  To: nai.xia
  Cc: Avi Kivity, Andrew Morton, Andrea Arcangeli, Hugh Dickins,
	Chris Wright, Rik van Riel, linux-mm, Johannes Weiner,
	linux-kernel, kvm

On 6/22/2011 2:33 PM, Nai Xia wrote:
> On Wednesday 22 June 2011 19:28:08 Avi Kivity wrote:
>> On 06/22/2011 02:24 PM, Avi Kivity wrote:
>>> On 06/22/2011 02:19 PM, Izik Eidus wrote:
>>>> On 6/22/2011 2:10 PM, Avi Kivity wrote:
>>>>> On 06/22/2011 02:05 PM, Izik Eidus wrote:
>>>>>>>> +    spte = rmap_next(kvm, rmapp, NULL);
>>>>>>>> +    while (spte) {
>>>>>>>> +        int _dirty;
>>>>>>>> +        u64 _spte = *spte;
>>>>>>>> +        BUG_ON(!(_spte&   PT_PRESENT_MASK));
>>>>>>>> +        _dirty = _spte&   PT_DIRTY_MASK;
>>>>>>>> +        if (_dirty) {
>>>>>>>> +            dirty = 1;
>>>>>>>> +            clear_bit(PT_DIRTY_SHIFT, (unsigned long *)spte);
>>>>>>>> +        }
>>>>>>> Racy.  Also, needs a tlb flush eventually.
>>>>>> +
>>>>>>
>>>>>> Hi, one of the issues is that the whole point of this patch is not
>>>>>> do tlb flush eventually,
>>>>>> But I see your point, because other users will not expect such
>>>>>> behavior, so maybe there is need into a parameter
>>>>>> flush_tlb=?, or add another mmu notifier call?
>>>>>>
>>>>> If you don't flush the tlb, a subsequent write will not see that
>>>>> spte.d is clear and the write will happen.  So you'll see the page
>>>>> as clean even though it's dirty.  That's not acceptable.
>>>>>
>>>> Yes, but this is exactly what we want from this use case:
>>>> Right now ksm calculate the page hash to see if it was changed, the
>>>> idea behind this patch is to use the dirty bit instead,
>>>> however the guest might not really like the fact that we will flush
>>>> its tlb over and over again, specially in periodically scan like ksm
>>>> does.
>>> I see.
>> Actually, this is dangerous.  If we use the dirty bit for other things,
>> we will get data corruption.
> Yeah,yeah, I actually clarified in a reply letter to Chris about his similar
> concern that we are currently the _only_ user. :)
> We can add the flushing when someone else should rely on this bit.
>

I suggest adding the flushing once someone else starts using the bit as well.

Btw, I don't think this whole optimization is worthwhile for kvm guests if a
tlb flush must be performed: on a machine with a lot of cpus it is much
better to let ksm burn one cpu than to slow down all the others.
So while this patch would make ksm look faster, the whole system would be
slower...

So if you don't want to add the flushing once someone else relies on the bit,
it would be better to use the dirty-bit trick only for userspace applications
and not for kvm guests.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
  2011-06-21 13:32   ` Nai Xia
@ 2011-06-22 15:03     ` Andrea Arcangeli
  -1 siblings, 0 replies; 96+ messages in thread
From: Andrea Arcangeli @ 2011-06-22 15:03 UTC (permalink / raw)
  To: Nai Xia
  Cc: Andrew Morton, Izik Eidus, Hugh Dickins, Chris Wright,
	Rik van Riel, linux-mm, Johannes Weiner, linux-kernel, kvm

On Tue, Jun 21, 2011 at 09:32:39PM +0800, Nai Xia wrote:
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index d48ec60..b407a69 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -4674,6 +4674,7 @@ static int __init vmx_init(void)
>  		kvm_mmu_set_mask_ptes(0ull, 0ull, 0ull, 0ull,
>  				VMX_EPT_EXECUTABLE_MASK);
>  		kvm_enable_tdp();
> +		kvm_dirty_update = 0;
>  	} else
>  		kvm_disable_tdp();
>  

Why not return !shadow_dirty_mask instead of adding a new var?
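
Something along these lines, presumably; this is only a sketch, and note the
polarity: per the patch description, dirty_update() is meant to be non-zero
exactly when the hardware maintains the spte dirty bit, i.e. when
shadow_dirty_mask is non-zero:

  static int kvm_mmu_notifier_dirty_update(struct mmu_notifier *mn,
                                           struct mm_struct *mm)
  {
          /* shadow_dirty_mask is 0 with EPT, non-zero for shadow paging/NPT */
          return !!shadow_dirty_mask;
  }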

>  struct mmu_notifier_ops {
> +	int (*dirty_update)(struct mmu_notifier *mn,
> +			     struct mm_struct *mm);
> +

Needs some documentation.

I don't think dirty_update is a self-explanatory name. I think
"has_test_and_clear_dirty" would be better.

If we don't flush the smp tlb don't we risk that we'll insert pages in
the unstable tree that are volatile just because the dirty bit didn't
get set again on the spte?

The first patch, I guess, is a sign of hugetlbfs going a little over the edge
in trying to mix with the core VM... Passing that &need_pte_unmap parameter
all over the place is not so nice; maybe it'd be possible to fix this within
hugetlbfs by using a different method to walk the hugetlb vmas. I'd prefer
that if possible.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
  2011-06-22 15:03     ` Andrea Arcangeli
@ 2011-06-22 15:19       ` Izik Eidus
  -1 siblings, 0 replies; 96+ messages in thread
From: Izik Eidus @ 2011-06-22 15:19 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nai Xia, Andrew Morton, Hugh Dickins, Chris Wright, Rik van Riel,
	linux-mm, Johannes Weiner, linux-kernel, kvm


> If we don't flush the smp tlb don't we risk that we'll insert pages in
> the unstable tree that are volatile just because the dirty bit didn't
> get set again on the spte?

Yes, this is the trade-off we take; the unstable tree is flushed anyway,
so this is nothing that won't be recovered very soon after it happens...

And most likely the tlb will already have been flushed before ksm gets there
anyway (especially for a heavily modified page, which we don't want in the
unstable tree).

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
  2011-06-22 11:19           ` Izik Eidus
@ 2011-06-22 15:39             ` Rik van Riel
  -1 siblings, 0 replies; 96+ messages in thread
From: Rik van Riel @ 2011-06-22 15:39 UTC (permalink / raw)
  To: Izik Eidus
  Cc: Avi Kivity, nai.xia, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Chris Wright, linux-mm, Johannes Weiner,
	linux-kernel, kvm

On 06/22/2011 07:19 AM, Izik Eidus wrote:

> So what we say here is: it is better to have little junk in the unstable
> tree that get flushed eventualy anyway, instead of make the guest
> slower....
> this race is something that does not reflect accurate of ksm anyway due
> to the full memcmp that we will eventualy perform...

With 2MB pages, I am not convinced they will get "flushed eventually",
because there is a good chance at least one of the 4kB pages inside
a 2MB page is in active use at all times.

I worry that the proposed changes may end up effectively preventing
KSM from scanning inside 2MB pages, when even one 4kB page inside
is in active use.  This could mean increased swapping on systems
that run low on memory, which can be a much larger performance penalty
than ksmd CPU use.

We need to scan inside 2MB pages when memory runs low, regardless
of the accessed or dirty bits.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
  2011-06-22  6:15       ` Izik Eidus
@ 2011-06-22 15:46         ` Chris Wright
  -1 siblings, 0 replies; 96+ messages in thread
From: Chris Wright @ 2011-06-22 15:46 UTC (permalink / raw)
  To: Izik Eidus
  Cc: Chris Wright, Nai Xia, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Rik van Riel, linux-mm, Johannes Weiner,
	linux-kernel, kvm, mtosatti

* Izik Eidus (izik.eidus@ravellosystems.com) wrote:
> On 6/22/2011 3:21 AM, Chris Wright wrote:
> >* Nai Xia (nai.xia@gmail.com) wrote:
> >>+	if (!shadow_dirty_mask) {
> >>+		WARN(1, "KVM: do NOT try to test dirty bit in EPT\n");
> >>+		goto out;
> >>+	}
> >This should never fire with the dirty_update() notifier test, right?
> >And that means that this whole optimization is for the shadow mmu case,
> >arguably the legacy case.
> 
> Hi Chris,
> AMD npt does track the dirty bit in the nested page tables,
> so the shadow_dirty_mask should not be 0 in that case...

Yeah, momentary lapse... ;)

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
  2011-06-22 15:39             ` Rik van Riel
@ 2011-06-22 16:55               ` Andrea Arcangeli
  -1 siblings, 0 replies; 96+ messages in thread
From: Andrea Arcangeli @ 2011-06-22 16:55 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Izik Eidus, Avi Kivity, nai.xia, Andrew Morton, Hugh Dickins,
	Chris Wright, linux-mm, Johannes Weiner, linux-kernel, kvm

On Wed, Jun 22, 2011 at 11:39:40AM -0400, Rik van Riel wrote:
> On 06/22/2011 07:19 AM, Izik Eidus wrote:
> 
> > So what we say here is: it is better to have little junk in the unstable
> > tree that get flushed eventualy anyway, instead of make the guest
> > slower....
> > this race is something that does not reflect accurate of ksm anyway due
> > to the full memcmp that we will eventualy perform...
> 
> With 2MB pages, I am not convinced they will get "flushed eventually",
> because there is a good chance at least one of the 4kB pages inside
> a 2MB page is in active use at all times.
> 
> I worry that the proposed changes may end up effectively preventing
> KSM from scanning inside 2MB pages, when even one 4kB page inside
> is in active use.  This could mean increased swapping on systems
> that run low on memory, which can be a much larger performance penalty
> than ksmd CPU use.
> 
> We need to scan inside 2MB pages when memory runs low, regardless
> of the accessed or dirty bits.

I guess we could fall back to the checksum when a THP is encountered
(repeating the test_and_clear_dirty also wouldn't give the expected
result if it's repeated on the same hugepmd for the next 4k virtual
address candidate for unstable tree insertion, so it'd need special
handling during the virtual walk anyway).

So it's getting a little hairy, skip on THP, skip on EPT, then I
wonder what is the common case that would be left using it...

Or we could evaluate statistically how many fewer pages get inserted into
the unstable tree using the 2M dirty bit, but clearly it'd be less reliable;
the algorithm is really meant to track the volatility of what is later
merged, not of a bigger chunk with unrelated data in it.

On a side note, khugepaged should also be changed to preserve the dirty bit
if at least one of the ptes is dirty (currently the hugepmd is always created
dirty; it can never happen for a hugepmd to be clean today, so the bit wasn't
preserved in khugepaged so far).
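
A rough sketch of that khugepaged change, against the collapse path that
copies the old ptes; variable and helper names here are approximate, not an
exact diff against mm/huge_memory.c:

  /* in collapse_huge_page(), while walking the ptes being collapsed */
  int any_dirty = 0;

  for (_pte = pte; _pte < pte + HPAGE_PMD_NR; _pte++)
          if (pte_dirty(*_pte))
                  any_dirty = 1;

  _pmd = mk_pmd(new_page, vma->vm_page_prot);
  _pmd = maybe_pmd_mkwrite(_pmd, vma);
  if (any_dirty)
          _pmd = pmd_mkdirty(_pmd);       /* preserve dirty only if some pte was dirty */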

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
  2011-06-22 15:39             ` Rik van Riel
@ 2011-06-22 23:13               ` Nai Xia
  -1 siblings, 0 replies; 96+ messages in thread
From: Nai Xia @ 2011-06-22 23:13 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Izik Eidus, Avi Kivity, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Chris Wright, linux-mm, Johannes Weiner,
	linux-kernel, kvm

On Wed, Jun 22, 2011 at 11:39 PM, Rik van Riel <riel@redhat.com> wrote:
> On 06/22/2011 07:19 AM, Izik Eidus wrote:
>
>> So what we say here is: it is better to have little junk in the unstable
>> tree that get flushed eventualy anyway, instead of make the guest
>> slower....
>> this race is something that does not reflect accurate of ksm anyway due
>> to the full memcmp that we will eventualy perform...
>
> With 2MB pages, I am not convinced they will get "flushed eventually",
> because there is a good chance at least one of the 4kB pages inside
> a 2MB page is in active use at all times.
>
> I worry that the proposed changes may end up effectively preventing
> KSM from scanning inside 2MB pages, when even one 4kB page inside
> is in active use.  This could mean increased swapping on systems
> that run low on memory, which can be a much larger performance penalty
> than ksmd CPU use.
>
> We need to scan inside 2MB pages when memory runs low, regardless
> of the accessed or dirty bits.

I agree on this point. Dirty bit, young bit, is by no means accurate. Even
on 4kB pages, there is always a chance that the ptes are dirty but the contents
are actually the same. Yeah, the whole optimization contains trade-offs and
trade-offs always have the possibility to annoy someone. Just like
page-bit-relying LRU approximations, none of them is perfect either. But I think
it can benefit some people. So maybe we could just provide a generic balanced
solution but provide fine-tuning interfaces to make sure that when it really gets
in the way of someone, he has a way to work around it.
Do you agree on my argument? :-)

>
> --
> All rights reversed
>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
  2011-06-22 15:03     ` Andrea Arcangeli
@ 2011-06-22 23:19       ` Nai Xia
  -1 siblings, 0 replies; 96+ messages in thread
From: Nai Xia @ 2011-06-22 23:19 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, Izik Eidus, Hugh Dickins, Chris Wright,
	Rik van Riel, linux-mm, Johannes Weiner, linux-kernel, kvm

On Wed, Jun 22, 2011 at 11:03 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> On Tue, Jun 21, 2011 at 09:32:39PM +0800, Nai Xia wrote:
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index d48ec60..b407a69 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -4674,6 +4674,7 @@ static int __init vmx_init(void)
>>               kvm_mmu_set_mask_ptes(0ull, 0ull, 0ull, 0ull,
>>                               VMX_EPT_EXECUTABLE_MASK);
>>               kvm_enable_tdp();
>> +             kvm_dirty_update = 0;
>>       } else
>>               kvm_disable_tdp();
>>
>
> Why not return !shadow_dirty_mask instead of adding a new var?
>
>>  struct mmu_notifier_ops {
>> +     int (*dirty_update)(struct mmu_notifier *mn,
>> +                          struct mm_struct *mm);
>> +
>
> Needs some docu.
>
> I think dirty_update isn't self explanatory name. I think
> "has_test_and_clear_dirty" would be better.
>
> If we don't flush the smp tlb don't we risk that we'll insert pages in
> the unstable tree that are volatile just because the dirty bit didn't
> get set again on the spte?
>
> The first patch I guess it's a sign of hugetlbfs going a little over
> the edge in trying to mix with the core VM... Passing that parameter
> &need_pte_unmap all over the place not so nice, maybe it'd be possible
> to fix within hugetlbfs to use a different method to walk the hugetlb
> vmas. I'd prefer that if possible.

OK, I'll have a try over other workarounds.
I am not feeling good about need_pte_unmap myself. :-)

Thanks for viewing!

-Nai

>
> Thanks,
> Andrea
>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
  2011-06-22 23:13               ` Nai Xia
@ 2011-06-22 23:25                 ` Andrea Arcangeli
  -1 siblings, 0 replies; 96+ messages in thread
From: Andrea Arcangeli @ 2011-06-22 23:25 UTC (permalink / raw)
  To: Nai Xia
  Cc: Rik van Riel, Izik Eidus, Avi Kivity, Andrew Morton,
	Hugh Dickins, Chris Wright, linux-mm, Johannes Weiner,
	linux-kernel, kvm

On Thu, Jun 23, 2011 at 07:13:54AM +0800, Nai Xia wrote:
> I agree on this point. Dirty bit, young bit, is by no means accurate. Even
> on 4kB pages, there is always a chance that the ptes are dirty but the contents
> are actually the same. Yeah, the whole optimization contains trade-offs and

Just a side note: the fact that the dirty bit would be set even when the
data is the same is actually a pro, not a con. If the content is the
same but the page was written to, it'd trigger a copy-on-write shortly
after merging the page, rendering the whole exercise wasteful. The
cksum plays a double role: it "stabilizes" the unstable tree, so
there's less chance of bad lookups, but it also keeps us from merging
stuff that is written to frequently, triggering copy-on-writes, and the
dirty bit would also catch overwrites with the same data, something
the cksum can't do.
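
For reference, the stabilizing check is just the one-scan-old checksum
comparison in cmp_and_merge_page(), roughly (trimmed from mm/ksm.c of this era):

        checksum = calc_checksum(page);
        if (rmap_item->oldchecksum != checksum) {
                rmap_item->oldchecksum = checksum;
                return;         /* changed since the last scan: stay out of the trees */
        }

A dirty-bit test would sit in front of (or instead of) that comparison, which
is why its catching same-data overwrites is a feature rather than a defect.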

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
  2011-06-22 23:13               ` Nai Xia
@ 2011-06-22 23:28                 ` Rik van Riel
  -1 siblings, 0 replies; 96+ messages in thread
From: Rik van Riel @ 2011-06-22 23:28 UTC (permalink / raw)
  To: Nai Xia
  Cc: Izik Eidus, Avi Kivity, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Chris Wright, linux-mm, Johannes Weiner,
	linux-kernel, kvm

On 06/22/2011 07:13 PM, Nai Xia wrote:
> On Wed, Jun 22, 2011 at 11:39 PM, Rik van Riel<riel@redhat.com>  wrote:
>> On 06/22/2011 07:19 AM, Izik Eidus wrote:
>>
>>> So what we say here is: it is better to have little junk in the unstable
>>> tree that get flushed eventualy anyway, instead of make the guest
>>> slower....
>>> this race is something that does not reflect accurate of ksm anyway due
>>> to the full memcmp that we will eventualy perform...
>>
>> With 2MB pages, I am not convinced they will get "flushed eventually",
>> because there is a good chance at least one of the 4kB pages inside
>> a 2MB page is in active use at all times.
>>
>> I worry that the proposed changes may end up effectively preventing
>> KSM from scanning inside 2MB pages, when even one 4kB page inside
>> is in active use.  This could mean increased swapping on systems
>> that run low on memory, which can be a much larger performance penalty
>> than ksmd CPU use.
>>
>> We need to scan inside 2MB pages when memory runs low, regardless
>> of the accessed or dirty bits.
>
> I agree on this point. Dirty bit, young bit, is by no means accurate. Even
> on 4kB pages, there is always a chance that the ptes are dirty but the contents
> are actually the same. Yeah, the whole optimization contains trade-offs and
> trade-offs always have the possibility to annoy someone. Just like
> page-bit-relying LRU approximations, none of them is perfect either. But I think
> it can benefit some people. So maybe we could just provide a generic balanced
> solution but provide fine-tuning interfaces to make sure that when it really gets
> in the way of someone, he has a way to work around it.
> Do you agree on my argument? :-)

That's not an argument.

That is a "if I wave my hands vigorously enough, maybe people
will let my patch in without thinking about what I wrote"
style argument.

I believe your optimization makes sense for 4kB pages, but
is going to be counter-productive for 2MB pages.

Your approach of "make ksmd skip over more pages, so it uses
less CPU" is likely to reduce the effectiveness of ksm by not
sharing some pages.

For 4kB pages that is fine, because you'll get around to them
eventually.

However, the internal use of a 2MB page is likely to be quite
different.  Chances are most 2MB pages will have actively used,
barely used and free pages inside.

You absolutely want ksm to get at the barely used and free
sub-pages.  Having just one actively used 4kB sub-page prevent
ksm from merging any of the other 511 sub-pages is a problem.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
  2011-06-22 16:55               ` Andrea Arcangeli
@ 2011-06-22 23:37                 ` Nai Xia
  -1 siblings, 0 replies; 96+ messages in thread
From: Nai Xia @ 2011-06-22 23:37 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, Izik Eidus, Avi Kivity, Andrew Morton,
	Hugh Dickins, Chris Wright, linux-mm, Johannes Weiner,
	linux-kernel, kvm

On Thu, Jun 23, 2011 at 12:55 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> On Wed, Jun 22, 2011 at 11:39:40AM -0400, Rik van Riel wrote:
>> On 06/22/2011 07:19 AM, Izik Eidus wrote:
>>
>> > So what we say here is: it is better to have little junk in the unstable
>> > tree that get flushed eventualy anyway, instead of make the guest
>> > slower....
>> > this race is something that does not reflect accurate of ksm anyway due
>> > to the full memcmp that we will eventualy perform...
>>
>> With 2MB pages, I am not convinced they will get "flushed eventually",
>> because there is a good chance at least one of the 4kB pages inside
>> a 2MB page is in active use at all times.
>>
>> I worry that the proposed changes may end up effectively preventing
>> KSM from scanning inside 2MB pages, when even one 4kB page inside
>> is in active use.  This could mean increased swapping on systems
>> that run low on memory, which can be a much larger performance penalty
>> than ksmd CPU use.
>>
>> We need to scan inside 2MB pages when memory runs low, regardless
>> of the accessed or dirty bits.
>
> I guess we could fall back to the cksum when a THP is encountered
> (repeating the test_and_clear_dirty also wouldn't give the expected
> result if it's repeated on the same hugepmd for the next 4k virtual
> address candidate for unstable tree insertion, so it'd need special
> handling during the virtual walk anyway).
>
> So it's getting a little hairy, skip on THP, skip on EPT, then I
> wonder what is the common case that would be left using it...
>
> Or we could evaluate with statistics how many fewer pages are inserted
> into the unstable tree using the 2m dirty bit, but clearly it'd be less
> reliable, the algorithm really is meant to track the volatility of
> what is later merged, not of a bigger chunk with unrelated data in it.

On 2MB pages, I'd like to remind you and Rik that ksmd currently splits
huge pages before their sub pages gets really merged to stable tree.
So when there are many 2MB pages each having a 4kB subpage
changed for all time, this is already a concern for ksmd to judge
if it's worthwhile to split 2MB page and get its sub-pages merged.
I think the policy for ksmd in a system should be "If you cannot do sth good,
at least do nothing evil". So I really don't think we can satisfy _all_ people.
Get a general method and give users one or two knobs to tune it when they
are the corner cases. How do  you think of my proposal ?

>
> On a side note, khugepaged should also be changed to preserve the
> dirty bit if at least one dirty bit of the ptes is dirty (currently
> the hugepmd is always created dirty, it can never happen for an
> hugepmd to be clean today so it wasn't preserved in khugepaged so far).
>

Thanks for pointing that out. This is what I had overlooked!

thanks,
Nai

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
  2011-06-22 15:03     ` Andrea Arcangeli
@ 2011-06-22 23:42       ` Nai Xia
  -1 siblings, 0 replies; 96+ messages in thread
From: Nai Xia @ 2011-06-22 23:42 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, Izik Eidus, Hugh Dickins, Chris Wright,
	Rik van Riel, linux-mm, Johannes Weiner, linux-kernel, kvm

On Wed, Jun 22, 2011 at 11:03 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> On Tue, Jun 21, 2011 at 09:32:39PM +0800, Nai Xia wrote:
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index d48ec60..b407a69 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -4674,6 +4674,7 @@ static int __init vmx_init(void)
>>               kvm_mmu_set_mask_ptes(0ull, 0ull, 0ull, 0ull,
>>                               VMX_EPT_EXECUTABLE_MASK);
>>               kvm_enable_tdp();
>> +             kvm_dirty_update = 0;
>>       } else
>>               kvm_disable_tdp();
>>
>
> Why not return !shadow_dirty_mask instead of adding a new var?
>
>>  struct mmu_notifier_ops {
>> +     int (*dirty_update)(struct mmu_notifier *mn,
>> +                          struct mm_struct *mm);
>> +
>
> Needs some docu.

OK. I'll add it.

>
> I think dirty_update isn't self explanatory name. I think
> "has_test_and_clear_dirty" would be better.

Agreed.  So it will be the name in the next version. :)
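
On the KVM side the hook can then be as simple as this (just a sketch with
the new name; shadow_dirty_mask is the existing variable in arch/x86/kvm/mmu.c
and would need a tiny accessor to be reachable from the notifier code):

static int kvm_mmu_notifier_has_test_and_clear_dirty(struct mmu_notifier *mn,
                                                     struct mm_struct *mm)
{
        /*
         * 0 with EPT (no dirty bit in the spte format there), non-zero
         * with the shadow MMU and with AMD NPT.
         */
        return !!shadow_dirty_mask;
}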

Thanks,
Nai

>
> If we don't flush the smp tlb don't we risk that we'll insert pages in
> the unstable tree that are volatile just because the dirty bit didn't
> get set again on the spte?
>
> The first patch I guess it's a sign of hugetlbfs going a little over
> the edge in trying to mix with the core VM... Passing that parameter
> &need_pte_unmap all over the place not so nice, maybe it'd be possible
> to fix within hugetlbfs to use a different method to walk the hugetlb
> vmas. I'd prefer that if possible.
>
> Thanks,
> Andrea
>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
  2011-06-22 23:19       ` Nai Xia
@ 2011-06-22 23:44         ` Andrea Arcangeli
  -1 siblings, 0 replies; 96+ messages in thread
From: Andrea Arcangeli @ 2011-06-22 23:44 UTC (permalink / raw)
  To: Nai Xia
  Cc: Andrew Morton, Izik Eidus, Hugh Dickins, Chris Wright,
	Rik van Riel, linux-mm, Johannes Weiner, linux-kernel, kvm

On Thu, Jun 23, 2011 at 07:19:06AM +0800, Nai Xia wrote:
> OK, I'll have a try over other workarounds.
> I am not feeling good about need_pte_unmap myself. :-)

The usual way is to check VM_HUGETLB in the caller and to call another
function that doesn't kmap. Casting pmd_t to pte_t isn't really nice
(but hey we're also doing that exceptionally in smaps_pte_range for
THP, but it's safe there because we're casting the value of the pmd, not
the pointer to the pmd, so the kmap is done by the pte version of the
caller and not done by the pmd version of the caller).
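
The pattern is roughly this (simplified from smaps_pte_range(), with the
locking and pmd-splitting details left out; examine_pte_value() stands in for
the real per-pte helper):

        if (pmd_trans_huge(*pmd)) {
                /* value cast: nothing was kmapped, so nothing to kunmap */
                examine_pte_value(*(pte_t *)pmd, addr, HPAGE_PMD_SIZE);
                return 0;
        }

        /* 4k case: pte_offset_map_lock() is what kmaps the pte page */
        pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
        for (; addr != end; pte++, addr += PAGE_SIZE)
                examine_pte_value(*pte, addr, PAGE_SIZE);
        pte_unmap_unlock(pte - 1, ptl);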

Is it done for migrate? Surely it's not for swapout ;).

> Thanks for viewing!

You're welcome!

JFYI I'll be offline on vacation for a week, starting tomorrow, so if
I don't answer in the next few days that's the reason but I'll follow
the progress in a week.

Thanks!
Andrea

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
  2011-06-22 23:37                 ` Nai Xia
@ 2011-06-22 23:59                   ` Andrea Arcangeli
  -1 siblings, 0 replies; 96+ messages in thread
From: Andrea Arcangeli @ 2011-06-22 23:59 UTC (permalink / raw)
  To: Nai Xia
  Cc: Rik van Riel, Izik Eidus, Avi Kivity, Andrew Morton,
	Hugh Dickins, Chris Wright, linux-mm, Johannes Weiner,
	linux-kernel, kvm

On Thu, Jun 23, 2011 at 07:37:47AM +0800, Nai Xia wrote:
> On 2MB pages, I'd like to remind you and Rik that ksmd currently splits
> huge pages before their sub pages gets really merged to stable tree.
> So when there are many 2MB pages each having a 4kB subpage
> changed for all time, this is already a concern for ksmd to judge
> if it's worthwhile to split 2MB page and get its sub-pages merged.

Hmm, not sure I follow. KSM memory density with THP on and off should
be identical. The cksum is computed on subpages so the fact that the 4k
subpage is actually mapped by a hugepmd is invisible to KSM up to the
point we get an unstable_tree_search_insert/stable_tree_search lookup
succeeding.

> I think the policy for ksmd in a system should be "If you cannot do sth good,
> at least do nothing evil". So I really don't think we can satisfy _all_ people.
> Get a general method and give users one or two knobs to tune it when they
> are the corner cases. How do  you think of my proposal ?

I'm neutral, but if we get two methods for deciding the unstable tree
candidates, the default probably should prioritize on maximum merging
even if it takes more CPU (if one cares about performance of the core
dedicated to ksmd, KSM is likely going to be off or scanning at low
rate in the first place).

> > On a side note, khugepaged should also be changed to preserve the
> > dirty bit if at least one dirty bit of the ptes is dirty (currently
> > the hugepmd is always created dirty, it can never happen for an
> > hugepmd to be clean today so it wasn't preserved in khugepaged so far).
> >
> 
> Thanks for pointing that out. This is what I had overlooked!

No prob. And its default scan rate is very slow compared to ksmd so
it was unlikely to generate too many false positive dirty bits even if
you were splitting hugepages through swap.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
  2011-06-22 23:37                 ` Nai Xia
@ 2011-06-23  0:00                   ` Rik van Riel
  -1 siblings, 0 replies; 96+ messages in thread
From: Rik van Riel @ 2011-06-23  0:00 UTC (permalink / raw)
  To: Nai Xia
  Cc: Andrea Arcangeli, Izik Eidus, Avi Kivity, Andrew Morton,
	Hugh Dickins, Chris Wright, linux-mm, Johannes Weiner,
	linux-kernel, kvm

On 06/22/2011 07:37 PM, Nai Xia wrote:

> On 2MB pages, I'd like to remind you and Rik that ksmd currently splits
> huge pages before their sub pages gets really merged to stable tree.

Your proposal appears to add a condition that causes ksmd to skip
doing that, which can cause the system to start using swap instead
of sharing memory.

> So when there are many 2MB pages each having a 4kB subpage
> changed for all time, this is already a concern for ksmd to judge
> if it's worthwhile to split 2MB page and get its sub-pages merged.
> I think the policy for ksmd in a system should be "If you cannot do sth good,
> at least do nothing evil". So I really don't think we can satisfy _all_ people.
> Get a general method and give users one or two knobs to tune it when they
> are the corner cases. How do  you think of my proposal ?

I think your proposal makes sense for 4kB pages, but the ksmd
policy for 2MB pages probably needs to be much more aggressive.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
  2011-06-22 23:44         ` Andrea Arcangeli
@ 2011-06-23  0:14           ` Nai Xia
  -1 siblings, 0 replies; 96+ messages in thread
From: Nai Xia @ 2011-06-23  0:14 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, Izik Eidus, Hugh Dickins, Chris Wright,
	Rik van Riel, linux-mm, Johannes Weiner, linux-kernel, kvm

On Thu, Jun 23, 2011 at 7:44 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> On Thu, Jun 23, 2011 at 07:19:06AM +0800, Nai Xia wrote:
>> OK, I'll have a try over other workarounds.
>> I am not feeling good about need_pte_unmap myself. :-)
>
> The usual way is to check VM_HUGETLB in the caller and to call another
> function that doesn't kmap. Casting pmd_t to pte_t isn't really nice
> (but hey we're also doing that exceptionally in smaps_pte_range for
> THP, but it safe there because we're casting the value of the pmd, not
> the pointer to the pmd, so the kmap is done by the pte version of the
> caller and not done by the pmd version of the caller).
>
> Is it done for migrate? Surely it's not for swapout ;).

Thanks for the hint. :-)

You know, another thing I am worried about is that I think I
did make page_check_address() return a pmd version for skipping the
tail subpages ...
I did detect a 'scheduling while atomic' warning when I kunmap() the returned value. :-(

>
>> Thanks for viewing!
>
> You're welcome!
>
> JFYI I'll be offline on vacation for a week, starting tomorrow, so if
> I don't answer in the next few days that's the reason but I'll follow
> the progress in a week.

Have a nice vacation, man! Enjoy the sunlight; we all spend enough
time indoors with code. ;-)


Thanks,
Nai

>
> Thanks!
> Andrea
>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
  2011-06-22 23:59                   ` Andrea Arcangeli
@ 2011-06-23  0:31                     ` Nai Xia
  -1 siblings, 0 replies; 96+ messages in thread
From: Nai Xia @ 2011-06-23  0:31 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, Izik Eidus, Avi Kivity, Andrew Morton,
	Hugh Dickins, Chris Wright, linux-mm, Johannes Weiner,
	linux-kernel, kvm

On Thu, Jun 23, 2011 at 7:59 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> On Thu, Jun 23, 2011 at 07:37:47AM +0800, Nai Xia wrote:
>> On 2MB pages, I'd like to remind you and Rik that ksmd currently splits
>> huge pages before their sub pages gets really merged to stable tree.
>> So when there are many 2MB pages each having a 4kB subpage
>> changed for all time, this is already a concern for ksmd to judge
>> if it's worthwhile to split 2MB page and get its sub-pages merged.
>
> Hmm, not sure I follow. KSM memory density with THP on and off should
> be identical. The cksum is computed on subpages so the fact that the 4k
> subpage is actually mapped by a hugepmd is invisible to KSM up to the
> point we get an unstable_tree_search_insert/stable_tree_search lookup
> succeeding.

I agree with your points.

But, I mean splitting the huge page into normal pages when some subpages
need to be merged may increase the CPU's TLB lookaside overhead and
_might_ hurt the workload ksmd is scanning. If only a small portion of false
negative 2MB pages really get merged eventually, maybe it's not worthwhile,
right?

But, well, just as Rik said below, yes, ksmd should be more aggressive to
avoid the much larger cost of swapping.

>
>> I think the policy for ksmd in a system should be "If you cannot do sth good,
>> at least do nothing evil". So I really don't think we can satisfy _all_ people.
>> Get a general method and give users one or two knobs to tune it when they
>> are the corner cases. How do  you think of my proposal ?
>
> I'm neutral, but if we get two methods for deciding the unstable tree
> candidates, the default probably should prioritize on maximum merging
> even if it takes more CPU (if one cares about performance of the core
> dedicated to ksmd, KSM is likely going to be off or scanning at low
> rate in the first place).

I agree with you here.


thanks,

Nai
>
>> > On a side note, khugepaged should also be changed to preserve the
>> > dirty bit if at least one dirty bit of the ptes is dirty (currently
>> > the hugepmd is always created dirty, it can never happen for an
>> > hugepmd to be clean today so it wasn't preserved in khugepaged so far).
>> >
>>
>> Thanks for pointing that out. This is what I had overlooked!
>
> No prob. And its default scan rate is very slow compared to ksmd so
> it was unlikely to generate too many false positive dirty bits even if
> you were splitting hugepages through swap.
>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
  2011-06-23  0:00                   ` Rik van Riel
@ 2011-06-23  0:42                     ` Nai Xia
  -1 siblings, 0 replies; 96+ messages in thread
From: Nai Xia @ 2011-06-23  0:42 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrea Arcangeli, Izik Eidus, Avi Kivity, Andrew Morton,
	Hugh Dickins, Chris Wright, linux-mm, Johannes Weiner,
	linux-kernel, kvm

On Thu, Jun 23, 2011 at 8:00 AM, Rik van Riel <riel@redhat.com> wrote:
> On 06/22/2011 07:37 PM, Nai Xia wrote:
>
>> On 2MB pages, I'd like to remind you and Rik that ksmd currently splits
>> huge pages before their sub pages gets really merged to stable tree.
>
> Your proposal appears to add a condition that causes ksmd to skip
> doing that, which can cause the system to start using swap instead
> of sharing memory.

Hmm, yes, no swapping. So we should make the checksum the default
for huge pages, right?

>
>> So when there are many 2MB pages each having a 4kB subpage
>> changed for all time, this is already a concern for ksmd to judge
>> if it's worthwhile to split 2MB page and get its sub-pages merged.
>> I think the policy for ksmd in a system should be "If you cannot do sth
>> good,
>> at least do nothing evil". So I really don't think we can satisfy _all_
>> people.
>> Get a general method and give users one or two knobs to tune it when they
>> are the corner cases. How do  you think of my proposal ?
>
> I think your proposal makes sense for 4kB pages, but the ksmd
> policy for 2MB pages probably needs to be much more aggressive.

I now agree with you on the whole point. Let's fall back to the checksum.
Thanks for making my mind clear! :)

And shall we provide an interface for users who _really_ want to judge the dirty
bit from the pmd level? I think we should agree on this point before I
misunderstand you and spam you with my next submission :P
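
What I have in mind is just a knob in the style of the existing
/sys/kernel/mm/ksm/ attributes, something like this (name and default purely
hypothetical; KSM_ATTR is the helper ksm.c already uses for its files):

static unsigned int ksm_use_pmd_dirty_bit;      /* 0: always checksum for THP */

static ssize_t use_pmd_dirty_bit_show(struct kobject *kobj,
                                      struct kobj_attribute *attr, char *buf)
{
        return sprintf(buf, "%u\n", ksm_use_pmd_dirty_bit);
}

static ssize_t use_pmd_dirty_bit_store(struct kobject *kobj,
                                       struct kobj_attribute *attr,
                                       const char *buf, size_t count)
{
        unsigned long value;

        if (strict_strtoul(buf, 10, &value) || value > 1)
                return -EINVAL;
        ksm_use_pmd_dirty_bit = value;
        return count;
}
KSM_ATTR(use_pmd_dirty_bit);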


And thanks for your time reviewing!

-Nai


>
> --
> All rights reversed
>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
  2011-06-23  0:31                     ` Nai Xia
@ 2011-06-23  0:44                       ` Andrea Arcangeli
  -1 siblings, 0 replies; 96+ messages in thread
From: Andrea Arcangeli @ 2011-06-23  0:44 UTC (permalink / raw)
  To: Nai Xia
  Cc: Rik van Riel, Izik Eidus, Avi Kivity, Andrew Morton,
	Hugh Dickins, Chris Wright, linux-mm, Johannes Weiner,
	linux-kernel, kvm

On Thu, Jun 23, 2011 at 08:31:56AM +0800, Nai Xia wrote:
> On Thu, Jun 23, 2011 at 7:59 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> > On Thu, Jun 23, 2011 at 07:37:47AM +0800, Nai Xia wrote:
> >> On 2MB pages, I'd like to remind you and Rik that ksmd currently splits
> >> huge pages before their sub pages gets really merged to stable tree.
> >> So when there are many 2MB pages each having a 4kB subpage
> >> changed for all time, this is already a concern for ksmd to judge
> >> if it's worthwhile to split 2MB page and get its sub-pages merged.
> >
> > Hmm not sure to follow. KSM memory density with THP on and off should
> > be identical. The cksum is computed on subpages so the fact the 4k
> > subpage is actually mapped by a hugepmd is invisible to KSM up to the
> > point we get a unstable_tree_search_insert/stable_tree_search lookup
> > succeeding.
> 
> I agree on your points.
> 
> But, I mean splitting the huge page into normal pages when some subpages
> need to be merged may increase the CPU's TLB miss overhead and
> _might_ hurt the workload ksmd is scanning. If only a small portion of the false
> negative 2MB pages really get merged eventually, maybe it's not worthwhile,
> right?

Yes, there's no threshold to say "only split if we could merge more
than N subpages"; a single subpage match in two different hugepages is enough
to split both and save just 4k, but then memory accesses will be slower
for both 2m ranges that have been split. The point, though, is that it
won't be slower than if THP was off in the first place. So in the end
all we gain is 4k saved, but we still run faster than with THP off in the
other hugepages that haven't been split yet.

> But, well, just like Rik said below, yes, ksmd should be more aggressive to
> avoid the much more costly swapping.

Correct, the above logic also follows the idea of always maximizing
memory merging in KSM, which is why we have no threshold waiting for N
subpages to be mergeable before we split the hugepage.

I'm unsure whether admins in real life would actually start using such
thresholds even if we implemented them. The current way of enabling
KSM on a per-VM (not per-host) basis is pretty simple: the performance
critical VM has KSM off, the non-performance critical VM has KSM on and
prioritizes memory merging.
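
If such a merge threshold were ever wanted, it would amount to a small
gate in the pre-merge pass -- a sketch only, where ksm_thp_merge_threshold
and count_mergeable_subpages() are invented names that do not exist in
ksm.c, while split_huge_page() is the real helper:

  if (PageTransCompound(page)) {
          /* Hypothetical knob: only break up the 2MB mapping once at
           * least N of its 512 subpages have found a match. */
          if (count_mergeable_subpages(page) < ksm_thp_merge_threshold)
                  return;                 /* keep the hugepage intact for now */
          split_huge_page(page);          /* split so subpages can be merged */
  }

With the current policy the threshold is effectively 1, which maximizes
merging at the cost of splitting for a single 4k match.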

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
  2011-06-22 23:28                 ` Rik van Riel
@ 2011-06-23  0:52                   ` Nai Xia
  -1 siblings, 0 replies; 96+ messages in thread
From: Nai Xia @ 2011-06-23  0:52 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Izik Eidus, Avi Kivity, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Chris Wright, linux-mm, Johannes Weiner,
	linux-kernel, kvm

On Thu, Jun 23, 2011 at 7:28 AM, Rik van Riel <riel@redhat.com> wrote:
> On 06/22/2011 07:13 PM, Nai Xia wrote:
>>
>> On Wed, Jun 22, 2011 at 11:39 PM, Rik van Riel<riel@redhat.com>  wrote:
>>>
>>> On 06/22/2011 07:19 AM, Izik Eidus wrote:
>>>
>>>> So what we say here is: it is better to have a little junk in the unstable
>>>> tree that gets flushed eventually anyway, instead of making the guest
>>>> slower....
>>>> this race is something that does not reflect the accuracy of ksm anyway, due
>>>> to the full memcmp that we will eventually perform...
>>>
>>> With 2MB pages, I am not convinced they will get "flushed eventually",
>>> because there is a good chance at least one of the 4kB pages inside
>>> a 2MB page is in active use at all times.
>>>
>>> I worry that the proposed changes may end up effectively preventing
>>> KSM from scanning inside 2MB pages, when even one 4kB page inside
>>> is in active use.  This could mean increased swapping on systems
>>> that run low on memory, which can be a much larger performance penalty
>>> than ksmd CPU use.
>>>
>>> We need to scan inside 2MB pages when memory runs low, regardless
>>> of the accessed or dirty bits.
>>
>> I agree on this point. The dirty bit, like the young bit, is by no means
>> accurate. Even on 4kB pages, there is always a chance that the pte is dirty
>> but the contents are actually the same. Yeah, the whole optimization involves
>> trade-offs, and trade-offs always have the potential to annoy someone. Just
>> like the page-bit-based LRU approximations, none of them is perfect either.
>> But I think it can benefit some people. So maybe we could just provide a
>> generic, balanced solution but also fine tuning interfaces, to make sure that
>> when it really gets in the way of someone, he has a way to work around it.
>> Do you agree with my argument? :-)
>
> That's not an argument.
>
> That is a "if I wave my hands vigorously enough, maybe people
> will let my patch in without thinking about what I wrote"
> style argument.

Oh, NO, this is not what I meant.
Really sorry if I made myself look so evil...
What I actually meant was: "Skip or not, we agree on a choice that will not
harm most people, and provide another interface to let someone
who _really_ wants it go the other way."

I am by no means pushing the idea of "skipping" huge pages.
I was just not sure about it and wanted to get a precise opinion from
you. And now I have it.


>
> I believe your optimization makes sense for 4kB pages, but
> is going to be counter-productive for 2MB pages.
>
> Your approach of "make ksmd skip over more pages, so it uses
> less CPU" is likely to reduce the effectiveness of ksm by not
> sharing some pages.
>
> For 4kB pages that is fine, because you'll get around to them
> eventually.
>
> However, the internal use of a 2MB page is likely to be quite
> different.  Chances are most 2MB pages will have actively used,
> barely used and free pages inside.
>
> You absolutely want ksm to get at the barely used and free
> sub-pages.  Having just one actively used 4kB sub-page prevent
> ksm from merging any of the other 511 sub-pages is a problem.

No, no, I was just not sure about it. I meant we cannot satisfy
all people, but I was not sure which choice is best for most of them.

Sorry, again, if I didn't make that clear.


Nai

>
> --
> All rights reversed
>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
  2011-06-22 23:25                 ` Andrea Arcangeli
@ 2011-06-23  1:30                   ` Nai Xia
  -1 siblings, 0 replies; 96+ messages in thread
From: Nai Xia @ 2011-06-23  1:30 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, Izik Eidus, Avi Kivity, Andrew Morton,
	Hugh Dickins, Chris Wright, linux-mm, Johannes Weiner,
	linux-kernel, kvm

On Thu, Jun 23, 2011 at 7:25 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> On Thu, Jun 23, 2011 at 07:13:54AM +0800, Nai Xia wrote:
>> I agree on this point. The dirty bit, like the young bit, is by no means accurate. Even
>> on 4kB pages, there is always a chance that the pte is dirty but the contents
>> are actually the same. Yeah, the whole optimization involves trade-offs and
>
> Just a side note: the fact that the dirty bit would be set even when the
> data is the same is actually a pro, not a con. If the content is the
> same but the page was written to, it'd trigger a copy on write shortly
> after merging the page, rendering the whole exercise wasteful. The
> cksum plays a double role: it both "stabilizes" the unstable tree, so
> there's less chance of bad lookups, and it keeps us from merging
> stuff that is written to frequently, triggering copy on writes; and the
> dirty bit would also catch overwrites with the same data, something
> the cksum can't do.

Good point. I actually have my own version of ksm (off topic, but
if you want to take a glance: http://code.google.com/p/uksm/ :-) )
that kept statistics on the ratio of pages in a VMA that really got COWed
due to KSM merging, on a per scan round basis.

It's complicated to deduce such precise information only
from the dirty bit and the cksum.


Thanks,
Nai
>
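
The point about overwrites with identical data can be made concrete with a
tiny userspace demonstration (illustrative only; the comments about what
ksmd would record assume the dirty-bit scheme under discussion):

  /* Rewriting a page with identical bytes leaves any content checksum
   * unchanged, yet the store still sets the pte dirty bit -- exactly the
   * volatile case a checksum cannot see.
   */
  #include <string.h>
  #include <sys/mman.h>

  int main(void)
  {
          char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          memset(p, 0xaa, 4096);  /* page written: checksum = C, pte dirty */
          /* ... a scan pass here would record checksum C and, in the
           * proposed scheme, test-and-clear the dirty bit ... */
          memset(p, 0xaa, 4096);  /* same data: checksum still C, but the
                                     write sets the dirty bit again, so a
                                     dirty-bit scanner treats it as volatile
                                     and skips a merge that would only be
                                     COW-broken again */
          return 0;
  }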

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
  2011-06-23  0:44                       ` Andrea Arcangeli
@ 2011-06-23  1:36                         ` Nai Xia
  -1 siblings, 0 replies; 96+ messages in thread
From: Nai Xia @ 2011-06-23  1:36 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, Izik Eidus, Avi Kivity, Andrew Morton,
	Hugh Dickins, Chris Wright, linux-mm, Johannes Weiner,
	linux-kernel, kvm

On Thu, Jun 23, 2011 at 8:44 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> On Thu, Jun 23, 2011 at 08:31:56AM +0800, Nai Xia wrote:
>> On Thu, Jun 23, 2011 at 7:59 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>> > On Thu, Jun 23, 2011 at 07:37:47AM +0800, Nai Xia wrote:
>> >> On 2MB pages, I'd like to remind you and Rik that ksmd currently splits
>> >> huge pages before their sub pages gets really merged to stable tree.
>> >> So when there are many 2MB pages each having a 4kB subpage
>> >> changed for all time, this is already a concern for ksmd to judge
>> >> if it's worthwhile to split 2MB page and get its sub-pages merged.
>> >
>> > Hmm not sure to follow. KSM memory density with THP on and off should
>> > be identical. The cksum is computed on subpages so the fact the 4k
>> > subpage is actually mapped by a hugepmd is invisible to KSM up to the
>> > point we get a unstable_tree_search_insert/stable_tree_search lookup
>> > succeeding.
>>
>> I agree on your points.
>>
>> But, I mean splitting the huge page into normal pages when some subpages
>> need to be merged may increase the CPU's TLB miss overhead and
>> _might_ hurt the workload ksmd is scanning. If only a small portion of the false
>> negative 2MB pages really get merged eventually, maybe it's not worthwhile,
>> right?
>
> Yes, there's no threshold to say "only split if we could merge more
> than N subpages"; a single subpage match in two different hugepages is enough
> to split both and save just 4k, but then memory accesses will be slower
> for both 2m ranges that have been split. The point, though, is that it
> won't be slower than if THP was off in the first place. So in the end
> all we gain is 4k saved, but we still run faster than with THP off in the
> other hugepages that haven't been split yet.

Yes, so ksmd still leaves us better off than having THP off.
Thanks for making this clearer to me :)

>
>> But, well, just like Rik said below, yes, ksmd should be more aggressive to
>> avoid the much more costly swapping.
>
> Correct, the above logic also follows the idea of always maximizing
> memory merging in KSM, which is why we have no threshold waiting for N
> subpages to be mergeable before we split the hugepage.
>
> I'm unsure whether admins in real life would actually start using such
> thresholds even if we implemented them. The current way of enabling
> KSM on a per-VM (not per-host) basis is pretty simple: the performance
> critical VM has KSM off, the non-performance critical VM has KSM on and
> prioritizes memory merging.
>
Hmm, yes, you are right.

Thanks,
Nai
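
The per-VM switch mentioned above is just the existing madvise() interface
seen from the hypervisor side. A small example (guest_ram and its size are
placeholders for the VM's memory region; MADV_MERGEABLE/MADV_UNMERGEABLE
are the real flags from madvise(2), though older userspace headers may not
define them yet):

  #include <stddef.h>
  #include <sys/mman.h>

  #ifndef MADV_MERGEABLE          /* older headers may lack these */
  #define MADV_MERGEABLE   12
  #define MADV_UNMERGEABLE 13
  #endif

  /* Opt a guest's RAM in or out of KSM on a per-VM basis. */
  static void set_ksm_policy(void *guest_ram, size_t size,
                             int performance_critical)
  {
          madvise(guest_ram, size,
                  performance_critical ? MADV_UNMERGEABLE : MADV_MERGEABLE);
  }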

^ permalink raw reply	[flat|nested] 96+ messages in thread

end of thread, other threads:[~2011-06-23  1:36 UTC | newest]

Thread overview: 96+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-06-21 12:55 [PATCH 0/2 V2] ksm: take dirty bit as reference to avoid volatile pages scanning Nai Xia
2011-06-21 12:55 ` Nai Xia
2011-06-21 13:26 ` [PATCH 1/2 " Nai Xia
2011-06-21 13:26   ` Nai Xia
2011-06-21 21:42   ` Chris Wright
2011-06-21 21:42     ` Chris Wright
2011-06-22  0:02     ` Nai Xia
2011-06-22  0:02       ` Nai Xia
2011-06-22  0:42       ` Chris Wright
2011-06-22  0:42         ` Chris Wright
2011-06-21 13:32 ` [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking Nai Xia
2011-06-21 13:32   ` Nai Xia
2011-06-22  0:21   ` Chris Wright
2011-06-22  0:21     ` Chris Wright
2011-06-22  4:43     ` Nai Xia
2011-06-22  4:43       ` Nai Xia
2011-06-22  6:15     ` Izik Eidus
2011-06-22  6:15       ` Izik Eidus
2011-06-22  6:38       ` Nai Xia
2011-06-22  6:38         ` Nai Xia
2011-06-22 15:46       ` Chris Wright
2011-06-22 15:46         ` Chris Wright
2011-06-22 10:43   ` Avi Kivity
2011-06-22 10:43     ` Avi Kivity
2011-06-22 11:05     ` Izik Eidus
2011-06-22 11:05       ` Izik Eidus
2011-06-22 11:10       ` Avi Kivity
2011-06-22 11:10         ` Avi Kivity
2011-06-22 11:19         ` Izik Eidus
2011-06-22 11:19           ` Izik Eidus
2011-06-22 11:24           ` Avi Kivity
2011-06-22 11:24             ` Avi Kivity
2011-06-22 11:28             ` Avi Kivity
2011-06-22 11:28               ` Avi Kivity
2011-06-22 11:31               ` Avi Kivity
2011-06-22 11:31                 ` Avi Kivity
2011-06-22 11:33               ` Nai Xia
2011-06-22 11:33                 ` Nai Xia
2011-06-22 11:39                 ` Izik Eidus
2011-06-22 11:39                   ` Izik Eidus
2011-06-22 15:39           ` Rik van Riel
2011-06-22 15:39             ` Rik van Riel
2011-06-22 16:55             ` Andrea Arcangeli
2011-06-22 16:55               ` Andrea Arcangeli
2011-06-22 23:37               ` Nai Xia
2011-06-22 23:37                 ` Nai Xia
2011-06-22 23:59                 ` Andrea Arcangeli
2011-06-22 23:59                   ` Andrea Arcangeli
2011-06-23  0:31                   ` Nai Xia
2011-06-23  0:31                     ` Nai Xia
2011-06-23  0:44                     ` Andrea Arcangeli
2011-06-23  0:44                       ` Andrea Arcangeli
2011-06-23  1:36                       ` Nai Xia
2011-06-23  1:36                         ` Nai Xia
2011-06-23  0:00                 ` Rik van Riel
2011-06-23  0:00                   ` Rik van Riel
2011-06-23  0:42                   ` Nai Xia
2011-06-23  0:42                     ` Nai Xia
2011-06-22 23:13             ` Nai Xia
2011-06-22 23:13               ` Nai Xia
2011-06-22 23:25               ` Andrea Arcangeli
2011-06-22 23:25                 ` Andrea Arcangeli
2011-06-23  1:30                 ` Nai Xia
2011-06-23  1:30                   ` Nai Xia
2011-06-22 23:28               ` Rik van Riel
2011-06-22 23:28                 ` Rik van Riel
2011-06-23  0:52                 ` Nai Xia
2011-06-23  0:52                   ` Nai Xia
2011-06-22 11:24     ` Nai Xia
2011-06-22 15:03   ` Andrea Arcangeli
2011-06-22 15:03     ` Andrea Arcangeli
2011-06-22 15:19     ` Izik Eidus
2011-06-22 15:19       ` Izik Eidus
2011-06-22 23:19     ` Nai Xia
2011-06-22 23:19       ` Nai Xia
2011-06-22 23:44       ` Andrea Arcangeli
2011-06-22 23:44         ` Andrea Arcangeli
2011-06-23  0:14         ` Nai Xia
2011-06-23  0:14           ` Nai Xia
2011-06-22 23:42     ` Nai Xia
2011-06-22 23:42       ` Nai Xia
2011-06-21 13:36 ` [PATCH 2/2 V2] ksm: take dirty bit as reference to avoid volatile pages scanning Nai Xia
2011-06-21 13:36   ` Nai Xia
2011-06-21 22:38   ` Chris Wright
2011-06-21 22:38     ` Chris Wright
2011-06-22  0:04     ` Nai Xia
2011-06-22  0:04       ` Nai Xia
2011-06-22  0:35       ` Chris Wright
2011-06-22  0:35         ` Chris Wright
2011-06-22  4:47         ` Nai Xia
2011-06-22  4:47           ` Nai Xia
2011-06-22 10:55         ` Nai Xia
2011-06-22 10:55           ` Nai Xia
2011-06-22  0:46 ` [PATCH 0/2 " Chris Wright
2011-06-22  0:46   ` Chris Wright
2011-06-22  4:15   ` Nai Xia
