linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/5] MADV_FREE refactoring and fix KSM page
@ 2015-10-19  6:31 Minchan Kim
  2015-10-19  6:31 ` [PATCH 1/5] mm: MADV_FREE trivial clean up Minchan Kim
                   ` (5 more replies)
  0 siblings, 6 replies; 26+ messages in thread
From: Minchan Kim @ 2015-10-19  6:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Hugh Dickins, Rik van Riel, Mel Gorman,
	Michal Hocko, Johannes Weiner, Kirill A. Shutemov,
	Vlastimil Babka, Minchan Kim

Hello, it's been a long time since I sent the previous patchset.
https://lkml.org/lkml/2015/6/3/37

This patchset is almost new compared to the previous approach.
I think it is simpler, clearer and easier to review.

One thing I should note is that I had tested this patchset
and couldn't find any critical problem, so I rebased it onto
recent mmotm (ie, mmotm-2015-10-15-15-20) to send a formal
patchset. Unfortunately, I then started to see sudden discarding
of pages we shouldn't discard. IOW, an application's valid
anonymous pages suddenly disappeared.

When I looked through the THP changes, I found we could lose
the dirty bit of a pte between freeze_page and unfreeze_page,
when we mark it as a migration entry and restore it.
So I added the simple code below without much consideration,
and I cannot see the problem any more.
I hope it's a good hint for finding the right fix for this problem.

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d5ea516ffb54..e881c04f5950 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3138,6 +3138,9 @@ static void unfreeze_page_vma(struct vm_area_struct *vma, struct page *page,
 		if (is_write_migration_entry(swp_entry))
 			entry = maybe_mkwrite(entry, vma);
 
+		if (PageDirty(page))
+			SetPageDirty(page);
+
 		flush_dcache_page(page);
 		set_pte_at(vma->vm_mm, address, pte + i, entry);
 

Although it fixes the above problem, I encountered another bug
below within several hours.

	BUG: Bad rss-counter state mm:ffff88007fc28000 idx:1 val:439
	BUG: Bad rss-counter state mm:ffff88007fc28000 idx:2 val:73

Or

	BUG: Bad rss-counter state mm:ffff88007fc28000 idx:1 val:512

It seems we are zapping a THP page without decreasing MM_ANONPAGES
and MM_SWAPENTS. Of course, it could be a bug of MADV_FREE that
the recent THP changes reveal. What I can say is that I couldn't
see any problem until mmotm-2015-10-06-16-30, so I guess there is
some conflict with Kirill's THP-refcount redesign, or it reveals
a hidden bug of MADV_FREE.

I will hunt it down, but I hope Kirill might catch it earlier than me.

This patchset does three major things.

1. Make MADV_FREE work with PG_dirty pages.

So far, MADV_FREE doesn't work with a page which is not in the
swap cache but has PG_dirty (e.g., a swapped-in page). Details
are in [3/5].

2. Make the MADV_FREE discard path simple.

The current logic for discarding hinted pages is a real mess,
so [4/5] makes it simple and clean.

3. Fix handling of KSM pages.

A process can have a KSM page which has no dirty bit in the page
table entry and no PG_dirty in page->flags, so the VM could
discard it wrongly. [5/5] fixes it.

Minchan Kim (5):
  [1/5] mm: MADV_FREE trivial clean up
  [2/5] mm: skip huge zero page in MADV_FREE
  [3/5] mm: clear PG_dirty to mark page freeable
  [4/5] mm: simplify reclaim path for MADV_FREE
  [5/5] mm: mark stable page dirty in KSM

 include/linux/rmap.h |  6 +----
 mm/huge_memory.c     |  9 ++++----
 mm/ksm.c             | 12 ++++++++++
 mm/madvise.c         | 29 +++++++++++-------------
 mm/rmap.c            | 46 +++++++------------------------------
 mm/swap_state.c      |  5 ++--
 mm/vmscan.c          | 64 ++++++++++++++++------------------------------------
 7 files changed, 60 insertions(+), 111 deletions(-)

-- 
1.9.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 1/5] mm: MADV_FREE trivial clean up
  2015-10-19  6:31 [PATCH 0/5] MADV_FREE refactoring and fix KSM page Minchan Kim
@ 2015-10-19  6:31 ` Minchan Kim
  2015-10-19  6:31 ` [PATCH 2/5] mm: skip huge zero page in MADV_FREE Minchan Kim
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 26+ messages in thread
From: Minchan Kim @ 2015-10-19  6:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Hugh Dickins, Rik van Riel, Mel Gorman,
	Michal Hocko, Johannes Weiner, Kirill A. Shutemov,
	Vlastimil Babka, Minchan Kim

1. The page table walker already passes the vma it is processing,
so we don't need to pass the vma ourselves.

2. If the page table entry is dirty in try_to_unmap_one, the
dirtiness should have propagated to PG_dirty of the page. So it's
enough to check only PageDirty, without an extra pte dirty bit check.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/madvise.c | 17 +++--------------
 mm/rmap.c    |  6 ++----
 2 files changed, 5 insertions(+), 18 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 7835bc1eaccb..fdfb14a78c60 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -24,11 +24,6 @@
 
 #include <asm/tlb.h>
 
-struct madvise_free_private {
-	struct vm_area_struct *vma;
-	struct mmu_gather *tlb;
-};
-
 /*
  * Any behaviour which results in changes to the vma->vm_flags needs to
  * take mmap_sem for writing. Others, which simply traverse vmas, need
@@ -269,10 +264,9 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 				unsigned long end, struct mm_walk *walk)
 
 {
-	struct madvise_free_private *fp = walk->private;
-	struct mmu_gather *tlb = fp->tlb;
+	struct mmu_gather *tlb = walk->private;
 	struct mm_struct *mm = tlb->mm;
-	struct vm_area_struct *vma = fp->vma;
+	struct vm_area_struct *vma = walk->vma;
 	spinlock_t *ptl;
 	pte_t *pte, ptent;
 	struct page *page;
@@ -365,15 +359,10 @@ static void madvise_free_page_range(struct mmu_gather *tlb,
 			     struct vm_area_struct *vma,
 			     unsigned long addr, unsigned long end)
 {
-	struct madvise_free_private fp = {
-		.vma = vma,
-		.tlb = tlb,
-	};
-
 	struct mm_walk free_walk = {
 		.pmd_entry = madvise_free_pte_range,
 		.mm = vma->vm_mm,
-		.private = &fp,
+		.private = tlb,
 	};
 
 	BUG_ON(addr >= end);
diff --git a/mm/rmap.c b/mm/rmap.c
index 6f0f9331a20f..94ee372e238b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1380,7 +1380,6 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	int ret = SWAP_AGAIN;
 	enum ttu_flags flags = (enum ttu_flags)arg;
-	int dirty = 0;
 
 	pte = page_check_address(page, mm, address, &ptl, 0);
 	if (!pte)
@@ -1423,8 +1422,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	}
 
 	/* Move the dirty bit to the physical page now the pte is gone. */
-	dirty = pte_dirty(pteval);
-	if (dirty)
+	if (pte_dirty(pteval))
 		set_page_dirty(page);
 
 	/* Update high watermark before we lower rss */
@@ -1457,7 +1455,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 
 		if (flags & TTU_FREE) {
 			VM_BUG_ON_PAGE(PageSwapCache(page), page);
-			if (!dirty && !PageDirty(page)) {
+			if (!PageDirty(page)) {
 				/* It's a freeable page by MADV_FREE */
 				dec_mm_counter(mm, MM_ANONPAGES);
 				goto discard;
-- 
1.9.1



* [PATCH 2/5] mm: skip huge zero page in MADV_FREE
  2015-10-19  6:31 [PATCH 0/5] MADV_FREE refactoring and fix KSM page Minchan Kim
  2015-10-19  6:31 ` [PATCH 1/5] mm: MADV_FREE trivial clean up Minchan Kim
@ 2015-10-19  6:31 ` Minchan Kim
  2015-10-19  6:31 ` [PATCH 3/5] mm: clear PG_dirty to mark page freeable Minchan Kim
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 26+ messages in thread
From: Minchan Kim @ 2015-10-19  6:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Hugh Dickins, Rik van Riel, Mel Gorman,
	Michal Hocko, Johannes Weiner, Kirill A. Shutemov,
	Vlastimil Babka, Minchan Kim

It is pointless to mark the huge zero page as freeable.
Let's skip it.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/huge_memory.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f1de4ce583a6..269ed99493f0 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1542,6 +1542,9 @@ int madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		struct page *page;
 		pmd_t orig_pmd;
 
+		if (is_huge_zero_pmd(*pmd))
+			goto out;
+
 		orig_pmd = pmdp_huge_get_and_clear(mm, addr, pmd);
 
 		/* No hugepage in swapcache */
@@ -1553,6 +1556,7 @@ int madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 
 		set_pmd_at(mm, addr, pmd, orig_pmd);
 		tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
+out:
 		spin_unlock(ptl);
 		ret = 0;
 	}
-- 
1.9.1



* [PATCH 3/5] mm: clear PG_dirty to mark page freeable
  2015-10-19  6:31 [PATCH 0/5] MADV_FREE refactoring and fix KSM page Minchan Kim
  2015-10-19  6:31 ` [PATCH 1/5] mm: MADV_FREE trivial clean up Minchan Kim
  2015-10-19  6:31 ` [PATCH 2/5] mm: skip huge zero page in MADV_FREE Minchan Kim
@ 2015-10-19  6:31 ` Minchan Kim
  2015-10-27  1:28   ` Hugh Dickins
  2015-10-19  6:31 ` [PATCH 4/5] mm: simplify reclaim path for MADV_FREE Minchan Kim
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 26+ messages in thread
From: Minchan Kim @ 2015-10-19  6:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Hugh Dickins, Rik van Riel, Mel Gorman,
	Michal Hocko, Johannes Weiner, Kirill A. Shutemov,
	Vlastimil Babka, Minchan Kim

Basically, MADV_FREE relies on the dirty bit in the page table
entry to decide whether the VM is allowed to discard the page.
IOW, if the page table entry has the dirty bit set, the VM
shouldn't discard the page.

However, if, for example, a swap-in by a read fault happens, the
page table entry doesn't have the dirty bit, so MADV_FREE could
discard the page wrongly.

To avoid that problem, MADV_FREE did extra checks with PageDirty
and PageSwapCache. That worked because a swapped-in page lives in
the swap cache, and once it is evicted from the swap cache, the
page has the PG_dirty flag. So the two page-flag checks together
effectively prevent wrong discarding by MADV_FREE.

However, a problem with the above logic is that a swapped-in page
still has PG_dirty after it is removed from the swap cache, so the
VM cannot consider the page freeable any more, even if madvise_free
is called on it later.

Look at the example below for details.

    ptr = malloc();
    memset(ptr);
    ..
    ..
    .. heavy memory pressure, so all of the pages are swapped out
    ..
    ..
    var = *ptr; -> a page is swapped in and may be removed from
                   the swap cache. Then the page table entry has
                   no dirty bit, but the page descriptor has PG_dirty
    ..
    ..
    madvise_free(ptr); -> It doesn't clear PG_dirty of the page.
    ..
    ..
    ..
    .. heavy memory pressure again.
    .. This time, the VM cannot discard the page because the page
    .. has *PG_dirty*

To solve the problem, this patch clears PG_dirty only if the page
is owned exclusively by the current process when madvise is called,
because PG_dirty represents the dirtiness of the ptes in several
processes, so we can clear it only if we own the page exclusively.

Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/madvise.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index fdfb14a78c60..5db546431285 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -312,11 +312,19 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 		if (!page)
 			continue;
 
-		if (PageSwapCache(page)) {
+		if (PageSwapCache(page) || PageDirty(page)) {
 			if (!trylock_page(page))
 				continue;
+			/*
+			 * If page is shared with others, we couldn't clear
+			 * PG_dirty of the page.
+			 */
+			if (page_count(page) != 1 + !!PageSwapCache(page)) {
+				unlock_page(page);
+				continue;
+			}
 
-			if (!try_to_free_swap(page)) {
+			if (PageSwapCache(page) && !try_to_free_swap(page)) {
 				unlock_page(page);
 				continue;
 			}
-- 
1.9.1



* [PATCH 4/5] mm: simplify reclaim path for MADV_FREE
  2015-10-19  6:31 [PATCH 0/5] MADV_FREE refactoring and fix KSM page Minchan Kim
                   ` (2 preceding siblings ...)
  2015-10-19  6:31 ` [PATCH 3/5] mm: clear PG_dirty to mark page freeable Minchan Kim
@ 2015-10-19  6:31 ` Minchan Kim
  2015-10-27  2:09   ` Hugh Dickins
  2015-10-19  6:31 ` [PATCH 5/5] mm: mark stable page dirty in KSM Minchan Kim
  2015-10-19 10:01 ` [PATCH 0/5] MADV_FREE refactoring and fix KSM page Minchan Kim
  5 siblings, 1 reply; 26+ messages in thread
From: Minchan Kim @ 2015-10-19  6:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Hugh Dickins, Rik van Riel, Mel Gorman,
	Michal Hocko, Johannes Weiner, Kirill A. Shutemov,
	Vlastimil Babka, Minchan Kim

I made the reclaim path a mess in order to check and free
MADV_FREEed pages. This patch simplifies it by tweaking add_to_swap.

So far, we mark a page PG_dirty when we add it to the swap
cache (ie, add_to_swap) so that it is paged out to the swap
device, but this patch moves the PG_dirty marking into
try_to_unmap_one, at the point where we decide to change a pte
from an anon mapping to a swap entry, so if any process's pte has
a swap entry for the page, the page must be swapped out. IOW,
there should be no functional behavior change. It makes the
reclaim path really simple for MADV_FREE because we just need to
check PG_dirty of the page to decide whether to discard it or not.

The other thing this patch does is pass TTU_BATCH_FLUSH to
try_to_unmap when we handle a freeable page, because I don't
see any reason to prevent it.

Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/rmap.h |  6 +----
 mm/huge_memory.c     |  5 ----
 mm/rmap.c            | 42 ++++++----------------------------
 mm/swap_state.c      |  5 ++--
 mm/vmscan.c          | 64 ++++++++++++++++------------------------------------
 5 files changed, 30 insertions(+), 92 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 6b6233fafb53..978f65066fd5 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -193,8 +193,7 @@ static inline void page_dup_rmap(struct page *page, bool compound)
  * Called from mm/vmscan.c to handle paging out
  */
 int page_referenced(struct page *, int is_locked,
-			struct mem_cgroup *memcg, unsigned long *vm_flags,
-			int *is_pte_dirty);
+			struct mem_cgroup *memcg, unsigned long *vm_flags);
 
 #define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
 
@@ -272,11 +271,8 @@ int rmap_walk(struct page *page, struct rmap_walk_control *rwc);
 static inline int page_referenced(struct page *page, int is_locked,
 				  struct mem_cgroup *memcg,
 				  unsigned long *vm_flags,
-				  int *is_pte_dirty)
 {
 	*vm_flags = 0;
-	if (is_pte_dirty)
-		*is_pte_dirty = 0;
 	return 0;
 }
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 269ed99493f0..adccfb48ce57 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1753,11 +1753,6 @@ pmd_t *page_check_address_pmd(struct page *page,
 	return NULL;
 }
 
-int pmd_freeable(pmd_t pmd)
-{
-	return !pmd_dirty(pmd);
-}
-
 #define VM_NO_THP (VM_SPECIAL | VM_HUGETLB | VM_SHARED | VM_MAYSHARE)
 
 int hugepage_madvise(struct vm_area_struct *vma,
diff --git a/mm/rmap.c b/mm/rmap.c
index 94ee372e238b..fd64f79c87c4 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -797,7 +797,6 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
 }
 
 struct page_referenced_arg {
-	int dirtied;
 	int mapcount;
 	int referenced;
 	unsigned long vm_flags;
@@ -812,7 +811,6 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 	struct mm_struct *mm = vma->vm_mm;
 	spinlock_t *ptl;
 	int referenced = 0;
-	int dirty = 0;
 	struct page_referenced_arg *pra = arg;
 
 	if (unlikely(PageTransHuge(page))) {
@@ -835,14 +833,6 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 		if (pmdp_clear_flush_young_notify(vma, address, pmd))
 			referenced++;
 
-		/*
-		 * Use pmd_freeable instead of raw pmd_dirty because in some
-		 * of architecture, pmd_dirty is not defined unless
-		 * CONFIG_TRANSPARENT_HUGEPAGE is enabled
-		 */
-		if (!pmd_freeable(*pmd))
-			dirty++;
-
 		spin_unlock(ptl);
 	} else {
 		pte_t *pte;
@@ -873,9 +863,6 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 				referenced++;
 		}
 
-		if (pte_dirty(*pte))
-			dirty++;
-
 		pte_unmap_unlock(pte, ptl);
 	}
 
@@ -889,9 +876,6 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 		pra->vm_flags |= vma->vm_flags;
 	}
 
-	if (dirty)
-		pra->dirtied++;
-
 	pra->mapcount--;
 	if (!pra->mapcount)
 		return SWAP_SUCCESS; /* To break the loop */
@@ -916,7 +900,6 @@ static bool invalid_page_referenced_vma(struct vm_area_struct *vma, void *arg)
  * @is_locked: caller holds lock on the page
  * @memcg: target memory cgroup
  * @vm_flags: collect encountered vma->vm_flags who actually referenced the page
- * @is_pte_dirty: ptes which have marked dirty bit - used for lazyfree page
  *
  * Quick test_and_clear_referenced for all mappings to a page,
  * returns the number of ptes which referenced the page.
@@ -924,8 +907,7 @@ static bool invalid_page_referenced_vma(struct vm_area_struct *vma, void *arg)
 int page_referenced(struct page *page,
 		    int is_locked,
 		    struct mem_cgroup *memcg,
-		    unsigned long *vm_flags,
-		    int *is_pte_dirty)
+		    unsigned long *vm_flags)
 {
 	int ret;
 	int we_locked = 0;
@@ -940,8 +922,6 @@ int page_referenced(struct page *page,
 	};
 
 	*vm_flags = 0;
-	if (is_pte_dirty)
-		*is_pte_dirty = 0;
 
 	if (!page_mapped(page))
 		return 0;
@@ -970,9 +950,6 @@ int page_referenced(struct page *page,
 	if (we_locked)
 		unlock_page(page);
 
-	if (is_pte_dirty)
-		*is_pte_dirty = pra.dirtied;
-
 	return pra.referenced;
 }
 
@@ -1453,17 +1430,10 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		swp_entry_t entry = { .val = page_private(page) };
 		pte_t swp_pte;
 
-		if (flags & TTU_FREE) {
-			VM_BUG_ON_PAGE(PageSwapCache(page), page);
-			if (!PageDirty(page)) {
-				/* It's a freeable page by MADV_FREE */
-				dec_mm_counter(mm, MM_ANONPAGES);
-				goto discard;
-			} else {
-				set_pte_at(mm, address, pte, pteval);
-				ret = SWAP_FAIL;
-				goto out_unmap;
-			}
+		if (!PageDirty(page) && (flags & TTU_FREE)) {
+			/* It's a freeable page by MADV_FREE */
+			dec_mm_counter(mm, MM_ANONPAGES);
+			goto discard;
 		}
 
 		if (PageSwapCache(page)) {
@@ -1476,6 +1446,8 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 				ret = SWAP_FAIL;
 				goto out_unmap;
 			}
+			if (!PageDirty(page))
+				SetPageDirty(page);
 			if (list_empty(&mm->mmlist)) {
 				spin_lock(&mmlist_lock);
 				if (list_empty(&mm->mmlist))
diff --git a/mm/swap_state.c b/mm/swap_state.c
index d783872d746c..676ff2991380 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -185,13 +185,12 @@ int add_to_swap(struct page *page, struct list_head *list)
 	 * deadlock in the swap out path.
 	 */
 	/*
-	 * Add it to the swap cache and mark it dirty
+	 * Add it to the swap cache.
 	 */
 	err = add_to_swap_cache(page, entry,
 			__GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN);
 
-	if (!err) {	/* Success */
-		SetPageDirty(page);
+	if (!err) {
 		return 1;
 	} else {	/* -ENOMEM radix-tree allocation failure */
 		/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 27d580b5e853..9b52ecf91194 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -791,17 +791,15 @@ enum page_references {
 };
 
 static enum page_references page_check_references(struct page *page,
-						  struct scan_control *sc,
-						  bool *freeable)
+						  struct scan_control *sc)
 {
 	int referenced_ptes, referenced_page;
 	unsigned long vm_flags;
-	int pte_dirty;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 
 	referenced_ptes = page_referenced(page, 1, sc->target_mem_cgroup,
-					  &vm_flags, &pte_dirty);
+					  &vm_flags);
 	referenced_page = TestClearPageReferenced(page);
 
 	/*
@@ -842,10 +840,6 @@ static enum page_references page_check_references(struct page *page,
 		return PAGEREF_KEEP;
 	}
 
-	if (PageAnon(page) && !pte_dirty && !PageSwapCache(page) &&
-			!PageDirty(page))
-		*freeable = true;
-
 	/* Reclaim if clean, defer dirty pages to writeback */
 	if (referenced_page && !PageSwapBacked(page))
 		return PAGEREF_RECLAIM_CLEAN;
@@ -1037,8 +1031,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		}
 
 		if (!force_reclaim)
-			references = page_check_references(page, sc,
-							&freeable);
+			references = page_check_references(page, sc);
 
 		switch (references) {
 		case PAGEREF_ACTIVATE:
@@ -1055,31 +1048,24 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 * Try to allocate it some swap space here.
 		 */
 		if (PageAnon(page) && !PageSwapCache(page)) {
-			if (!freeable) {
-				if (!(sc->gfp_mask & __GFP_IO))
-					goto keep_locked;
-				if (!add_to_swap(page, page_list))
-					goto activate_locked;
-				may_enter_fs = 1;
-				/* Adding to swap updated mapping */
-				mapping = page_mapping(page);
-			} else {
-				if (likely(!PageTransHuge(page)))
-					goto unmap;
-				/* try_to_unmap isn't aware of THP page */
-				if (unlikely(split_huge_page_to_list(page,
-								page_list)))
-					goto keep_locked;
-			}
+			if (!(sc->gfp_mask & __GFP_IO))
+				goto keep_locked;
+			if (!add_to_swap(page, page_list))
+				goto activate_locked;
+			freeable = true;
+			may_enter_fs = 1;
+			/* Adding to swap updated mapping */
+			mapping = page_mapping(page);
 		}
-unmap:
+
 		/*
 		 * The page is mapped into the page tables of one or more
 		 * processes. Try to unmap it here.
 		 */
-		if (page_mapped(page) && (mapping || freeable)) {
+		if (page_mapped(page) && mapping) {
 			switch (try_to_unmap(page, freeable ?
-					TTU_FREE : ttu_flags|TTU_BATCH_FLUSH)) {
+					ttu_flags | TTU_BATCH_FLUSH | TTU_FREE :
+					ttu_flags | TTU_BATCH_FLUSH)) {
 			case SWAP_FAIL:
 				goto activate_locked;
 			case SWAP_AGAIN:
@@ -1087,20 +1073,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			case SWAP_MLOCK:
 				goto cull_mlocked;
 			case SWAP_SUCCESS:
-				/* try to free the page below */
-				if (!freeable)
-					break;
-				/*
-				 * Freeable anon page doesn't have mapping
-				 * due to skipping of swapcache so we free
-				 * page in here rather than __remove_mapping.
-				 */
-				VM_BUG_ON_PAGE(PageSwapCache(page), page);
-				if (!page_freeze_refs(page, 1))
-					goto keep_locked;
-				__ClearPageLocked(page);
-				count_vm_event(PGLAZYFREED);
-				goto free_it;
+				; /* try to free the page below */
 			}
 		}
 
@@ -1217,6 +1190,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 */
 		__ClearPageLocked(page);
 free_it:
+		if (freeable && !PageDirty(page))
+			count_vm_event(PGLAZYFREED);
+
 		nr_reclaimed++;
 
 		/*
@@ -1847,7 +1823,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 		}
 
 		if (page_referenced(page, 0, sc->target_mem_cgroup,
-				    &vm_flags, NULL)) {
+				    &vm_flags)) {
 			nr_rotated += hpage_nr_pages(page);
 			/*
 			 * Identify referenced, file-backed active pages and
-- 
1.9.1



* [PATCH 5/5] mm: mark stable page dirty in KSM
  2015-10-19  6:31 [PATCH 0/5] MADV_FREE refactoring and fix KSM page Minchan Kim
                   ` (3 preceding siblings ...)
  2015-10-19  6:31 ` [PATCH 4/5] mm: simplify reclaim path for MADV_FREE Minchan Kim
@ 2015-10-19  6:31 ` Minchan Kim
  2015-10-27  2:23   ` Hugh Dickins
  2015-10-19 10:01 ` [PATCH 0/5] MADV_FREE refactoring and fix KSM page Minchan Kim
  5 siblings, 1 reply; 26+ messages in thread
From: Minchan Kim @ 2015-10-19  6:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Hugh Dickins, Rik van Riel, Mel Gorman,
	Michal Hocko, Johannes Weiner, Kirill A. Shutemov,
	Vlastimil Babka, Minchan Kim

A stable page can be shared by several processes, and one process
can end up owning the page alone after CoW or zapping happens in
every other process. Then the page table entry of the page in the
last process can have no dirty bit, and page->flags no PG_dirty.
In this case, MADV_FREE could discard the page wrongly.
To prevent that, we mark the stable page dirty.

Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/ksm.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/mm/ksm.c b/mm/ksm.c
index 8f0faf809bf5..659e2b5119c0 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1050,6 +1050,18 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
 			 */
 			set_page_stable_node(page, NULL);
 			mark_page_accessed(page);
+			/*
+			 * Stable page could be shared by several processes
+			 * and last process could own the page among them after
+			 * CoW or zapping for every process except last process
+			 * happens. Then, page table entry of the page
+			 * in last process can have no dirty bit.
+			 * In this case, MADV_FREE could discard the page
+			 * wrongly.
+			 * For preventing it, we mark stable page dirty.
+			 */
+			if (!PageDirty(page))
+				SetPageDirty(page);
 			err = 0;
 		} else if (pages_identical(page, kpage))
 			err = replace_page(vma, page, kpage, orig_pte);
-- 
1.9.1



* Re: [PATCH 0/5] MADV_FREE refactoring and fix KSM page
  2015-10-19  6:31 [PATCH 0/5] MADV_FREE refactoring and fix KSM page Minchan Kim
                   ` (4 preceding siblings ...)
  2015-10-19  6:31 ` [PATCH 5/5] mm: mark stable page dirty in KSM Minchan Kim
@ 2015-10-19 10:01 ` Minchan Kim
  2015-10-20  1:38   ` Minchan Kim
  2015-10-20  7:21   ` Minchan Kim
  5 siblings, 2 replies; 26+ messages in thread
From: Minchan Kim @ 2015-10-19 10:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Hugh Dickins, Rik van Riel, Mel Gorman,
	Michal Hocko, Johannes Weiner, Kirill A. Shutemov,
	Vlastimil Babka

On Mon, Oct 19, 2015 at 03:31:42PM +0900, Minchan Kim wrote:
> Hello, it's been a long time since I sent the previous patchset.
> https://lkml.org/lkml/2015/6/3/37
> 
> This patchset is almost new compared to the previous approach.
> I think it is simpler, clearer and easier to review.
> 
> One thing I should note is that I had tested this patchset
> and couldn't find any critical problem, so I rebased it onto
> recent mmotm (ie, mmotm-2015-10-15-15-20) to send a formal
> patchset. Unfortunately, I then started to see sudden discarding
> of pages we shouldn't discard. IOW, an application's valid
> anonymous pages suddenly disappeared.
> 
> When I looked through the THP changes, I found we could lose
> the dirty bit of a pte between freeze_page and unfreeze_page,
> when we mark it as a migration entry and restore it.
> So I added the simple code below without much consideration,
> and I cannot see the problem any more.
> I hope it's a good hint for finding the right fix for this problem.
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d5ea516ffb54..e881c04f5950 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3138,6 +3138,9 @@ static void unfreeze_page_vma(struct vm_area_struct *vma, struct page *page,
>  		if (is_write_migration_entry(swp_entry))
>  			entry = maybe_mkwrite(entry, vma);
>  
> +		if (PageDirty(page))
> +			SetPageDirty(page);

The PageDirty condition was a typo. I didn't mean to add the
condition; it should be just:

                SetPageDirty(page);


* Re: [PATCH 0/5] MADV_FREE refactoring and fix KSM page
  2015-10-19 10:01 ` [PATCH 0/5] MADV_FREE refactoring and fix KSM page Minchan Kim
@ 2015-10-20  1:38   ` Minchan Kim
  2015-10-20  7:21   ` Minchan Kim
  1 sibling, 0 replies; 26+ messages in thread
From: Minchan Kim @ 2015-10-20  1:38 UTC (permalink / raw)
  To: Andrew Morton, Kirill A. Shutemov
  Cc: linux-mm, linux-kernel, Hugh Dickins, Rik van Riel, Mel Gorman,
	Michal Hocko, Johannes Weiner, Kirill A. Shutemov,
	Vlastimil Babka

[-- Attachment #1: Type: text/plain, Size: 10303 bytes --]

On Mon, Oct 19, 2015 at 07:01:50PM +0900, Minchan Kim wrote:
> On Mon, Oct 19, 2015 at 03:31:42PM +0900, Minchan Kim wrote:
> > Hello, it's been a long time since I sent the previous patchset.
> > https://lkml.org/lkml/2015/6/3/37
> > 
> > This patchset is almost new compared to the previous approach.
> > I think it is simpler, clearer and easier to review.
> > 
> > One thing I should note is that I had tested this patchset
> > and couldn't find any critical problem, so I rebased it onto
> > recent mmotm (ie, mmotm-2015-10-15-15-20) to send a formal
> > patchset. Unfortunately, I then started to see sudden discarding
> > of pages we shouldn't discard. IOW, an application's valid
> > anonymous pages suddenly disappeared.
> > 
> > When I looked through the THP changes, I found we could lose
> > the dirty bit of a pte between freeze_page and unfreeze_page,
> > when we mark it as a migration entry and restore it.
> > So I added the simple code below without much consideration,
> > and I cannot see the problem any more.
> > I hope it's a good hint for finding the right fix for this problem.
> > 
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index d5ea516ffb54..e881c04f5950 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -3138,6 +3138,9 @@ static void unfreeze_page_vma(struct vm_area_struct *vma, struct page *page,
> >  		if (is_write_migration_entry(swp_entry))
> >  			entry = maybe_mkwrite(entry, vma);
> >  
> > +		if (PageDirty(page))
> > +			SetPageDirty(page);
> 
> The PageDirty condition was a typo. I didn't mean to add the
> condition; it should be just:
> 
>                 SetPageDirty(page);

As a first step to find this bug, I removed all MADV_FREE-related
code from mmotm-2015-10-15-15-20. IOW, git checkout 54bad5da4834
(arm64: add pmd_[dirty|mkclean] for THP), so the tree doesn't
have any core MADV_FREE code.

I tested the following workload in my KVM machine.

0. make memcg
1. limit memcg
2. fork several processes
3. each process allocates THP page and fill
4. increase limit of the memcg to swapoff successfully
5. swapoff
6. kill all of processes
7. goto 1

Within a few hours, I encountered the following bug.
The detailed boot log and dmesg output are attached.


Initializing cgroup subsys cpu
Command line: hung_task_panic=1 earlyprintk=ttyS0,115200 debug apic=debug sysrq_always_enabled rcupdate.rcu_cpu_stall_timeout=100 panic=-1 softlockup_panic=1 nmi_watchdog=panic oops=panic console=ttyS0,115200 console=tty0 earlyprintk=ttyS0 ignore_loglevel ftrace_dump_on_oops vga=normal root=/dev/vda1 rw
KERNEL supported cpus:
  Intel GenuineIntel
x86/fpu: Legacy x87 FPU detected.
x86/fpu: Using 'lazy' FPU context switches.
e820: BIOS-provided physical RAM map:
BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
BIOS-e820: [mem 0x0000000000100000-0x00000000bfffbfff] usable
BIOS-e820: [mem 0x00000000bfffc000-0x00000000bfffffff] reserved
BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved

<snip>

Adding 4191228k swap on /dev/vda5.  Priority:-1 extents:1 across:4191228k FS
Adding 4191228k swap on /dev/vda5.  Priority:-1 extents:1 across:4191228k FS
Adding 4191228k swap on /dev/vda5.  Priority:-1 extents:1 across:4191228k FS
Adding 4191228k swap on /dev/vda5.  Priority:-1 extents:1 across:4191228k FS
Adding 4191228k swap on /dev/vda5.  Priority:-1 extents:1 across:4191228k FS
Adding 4191228k swap on /dev/vda5.  Priority:-1 extents:1 across:4191228k FS
Adding 4191228k swap on /dev/vda5.  Priority:-1 extents:1 across:4191228k FS
Adding 4191228k swap on /dev/vda5.  Priority:-1 extents:1 across:4191228k FS
Adding 4191228k swap on /dev/vda5.  Priority:-1 extents:1 across:4191228k FS
Adding 4191228k swap on /dev/vda5.  Priority:-1 extents:1 across:4191228k FS
Adding 4191228k swap on /dev/vda5.  Priority:-1 extents:1 across:4191228k FS
Adding 4191228k swap on /dev/vda5.  Priority:-1 extents:1 across:4191228k FS
Adding 4191228k swap on /dev/vda5.  Priority:-1 extents:1 across:4191228k FS
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffff810782a9>] down_read_trylock+0x9/0x30
PGD 0 
Oops: 0000 [#1] SMP 
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 1 PID: 26445 Comm: sh Not tainted 4.3.0-rc5-mm1-diet-meta+ #1545
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff8800b9af3480 ti: ffff88007fea0000 task.ti: ffff88007fea0000
RIP: 0010:[<ffffffff810782a9>]  [<ffffffff810782a9>] down_read_trylock+0x9/0x30
RSP: 0018:ffff88007fea3648  EFLAGS: 00010202
RAX: 0000000000000001 RBX: ffffea0002324900 RCX: ffff88007fea37e8
RDX: 0000000000000000 RSI: ffff88007fea36e8 RDI: 0000000000000008
RBP: ffff88007fea3648 R08: ffffffff818446a0 R09: ffff8800b9af4c80
R10: 0000000000000216 R11: 0000000000000001 R12: ffff88007f58d6e1
R13: ffff88007f58d6e0 R14: 0000000000000008 R15: 0000000000000001
FS:  00007f0993e78740(0000) GS:ffff8800bfa20000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 000000007edee000 CR4: 00000000000006a0
Stack:
 ffff88007fea3678 ffffffff81124ff0 ffffea0002324900 ffff88007fea36e8
 ffff88009ffe8400 0000000000000000 ffff88007fea36c0 ffffffff81125733
 ffff8800bfa34540 ffffffff8105dc9d ffffea0002324900 ffff88007fea37e8
Call Trace:
 [<ffffffff81124ff0>] page_lock_anon_vma_read+0x60/0x180
 [<ffffffff81125733>] rmap_walk+0x1b3/0x3f0
 [<ffffffff8105dc9d>] ? finish_task_switch+0x5d/0x1f0
 [<ffffffff81125b13>] page_referenced+0x1a3/0x220
 [<ffffffff81123e30>] ? __page_check_address+0x1a0/0x1a0
 [<ffffffff81124f90>] ? page_get_anon_vma+0xd0/0xd0
 [<ffffffff81123820>] ? anon_vma_ctor+0x40/0x40
 [<ffffffff8110087b>] shrink_page_list+0x5ab/0xde0
 [<ffffffff8110174c>] shrink_inactive_list+0x18c/0x4b0
 [<ffffffff811023bd>] shrink_lruvec+0x59d/0x740
 [<ffffffff811025f0>] shrink_zone+0x90/0x250
 [<ffffffff811028dd>] do_try_to_free_pages+0x12d/0x3b0
 [<ffffffff81102d3d>] try_to_free_mem_cgroup_pages+0x9d/0x120
 [<ffffffff811496c3>] try_charge+0x163/0x700
 [<ffffffff81149cb4>] mem_cgroup_do_precharge+0x54/0x70
 [<ffffffff81149e45>] mem_cgroup_can_attach+0x175/0x1b0
 [<ffffffff811b2c57>] ? kernfs_iattrs.isra.6+0x37/0xd0
 [<ffffffff81148e70>] ? get_mctgt_type+0x320/0x320
 [<ffffffff810a9d29>] cgroup_migrate+0x149/0x440
 [<ffffffff810aa60c>] cgroup_attach_task+0x7c/0xe0
 [<ffffffff810aa904>] __cgroup_procs_write.isra.33+0x1d4/0x2b0
 [<ffffffff810aaa10>] cgroup_tasks_write+0x10/0x20
 [<ffffffff810a6238>] cgroup_file_write+0x38/0xf0
 [<ffffffff811b54ad>] kernfs_fop_write+0x11d/0x170
 [<ffffffff81153918>] __vfs_write+0x28/0xe0
 [<ffffffff8116e614>] ? __fd_install+0x24/0xc0
 [<ffffffff810784a1>] ? percpu_down_read+0x21/0x50
 [<ffffffff81153e91>] vfs_write+0xa1/0x170
 [<ffffffff81154716>] SyS_write+0x46/0xa0
 [<ffffffff81420a17>] entry_SYSCALL_64_fastpath+0x12/0x6a
Code: 5e 82 3a 00 48 83 c4 08 5b 5d c3 48 89 45 f0 e8 9b 6a 3a 00 48 8b 45 f0 eb df 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 <48> 8b 07 48 89 c2 48 83 c2 01 7e 07 f0 48 0f b1 17 75 f0 48 f7 
RIP  [<ffffffff810782a9>] down_read_trylock+0x9/0x30
 RSP <ffff88007fea3648>
CR2: 0000000000000008
---[ end trace e81a82c8122b447d ]---
Kernel panic - not syncing: Fatal exception

BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffff810782a9>] down_read_trylock+0x9/0x30
PGD 0 
Oops: 0000 [#2] SMP 
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 10 PID: 59 Comm: khugepaged Tainted: G      D         4.3.0-rc5-mm1-diet-meta+ #1545
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff8800b9851a40 ti: ffff8800b985c000 task.ti: ffff8800b985c000
RIP: 0010:[<ffffffff810782a9>]  [<ffffffff810782a9>] down_read_trylock+0x9/0x30
RSP: 0018:ffff8800b985f778  EFLAGS: 00010202
RAX: 0000000000000001 RBX: ffffea0002321800 RCX: ffff8800b985f918
RDX: 0000000000000000 RSI: ffff8800b985f818 RDI: 0000000000000008
RBP: ffff8800b985f778 R08: ffffffff818446a0 R09: ffff8800b9853240
R10: 000000000000ba03 R11: 0000000000000001 R12: ffff88007f58d6e1
R13: ffff88007f58d6e0 R14: 0000000000000008 R15: 0000000000000001
FS:  0000000000000000(0000) GS:ffff8800bfb40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 0000000001808000 CR4: 00000000000006a0
Stack:
 ffff8800b985f7a8 ffffffff81124ff0 ffffea0002321800 ffff8800b985f818
 ffff88009ffe8400 0000000000000000 ffff8800b985f7f0 ffffffff81125733
 ffff8800bfb54540 ffffffff8105dc9d ffffea0002321800 ffff8800b985f918
Call Trace:
 [<ffffffff81124ff0>] page_lock_anon_vma_read+0x60/0x180
 [<ffffffff81125733>] rmap_walk+0x1b3/0x3f0
 [<ffffffff8105dc9d>] ? finish_task_switch+0x5d/0x1f0
 [<ffffffff81125b13>] page_referenced+0x1a3/0x220
 [<ffffffff81123e30>] ? __page_check_address+0x1a0/0x1a0
 [<ffffffff81124f90>] ? page_get_anon_vma+0xd0/0xd0
 [<ffffffff81123820>] ? anon_vma_ctor+0x40/0x40
 [<ffffffff8110087b>] shrink_page_list+0x5ab/0xde0
 [<ffffffff8110174c>] shrink_inactive_list+0x18c/0x4b0
 [<ffffffff811023bd>] shrink_lruvec+0x59d/0x740
 [<ffffffff811025f0>] shrink_zone+0x90/0x250
 [<ffffffff811028dd>] do_try_to_free_pages+0x12d/0x3b0
 [<ffffffff81102d3d>] try_to_free_mem_cgroup_pages+0x9d/0x120
 [<ffffffff811496c3>] try_charge+0x163/0x700
 [<ffffffff8141d1f3>] ? schedule+0x33/0x80
 [<ffffffff8114d45f>] mem_cgroup_try_charge+0x9f/0x1d0
 [<ffffffff811434bc>] khugepaged+0x7cc/0x1ac0
 [<ffffffff81066e01>] ? hrtick_update+0x1/0x70
 [<ffffffff81072430>] ? prepare_to_wait_event+0xf0/0xf0
 [<ffffffff81142cf0>] ? total_mapcount+0x70/0x70
 [<ffffffff81056cd9>] kthread+0xc9/0xe0
 [<ffffffff81056c10>] ? kthread_park+0x60/0x60
 [<ffffffff81420d6f>] ret_from_fork+0x3f/0x70
 [<ffffffff81056c10>] ? kthread_park+0x60/0x60
Code: 5e 82 3a 00 48 83 c4 08 5b 5d c3 48 89 45 f0 e8 9b 6a 3a 00 48 8b 45 f0 eb df 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 <48> 8b 07 48 89 c2 48 83 c2 01 7e 07 f0 48 0f b1 17 75 f0 48 f7 
RIP  [<ffffffff810782a9>] down_read_trylock+0x9/0x30
 RSP <ffff8800b985f778>
CR2: 0000000000000008
---[ end trace e81a82c8122b447e ]---
Shutting down cpus with NMI
Dumping ftrace buffer:
   (ftrace buffer empty)
Kernel Offset: disabled


[-- Attachment #2: test_bug.log --]
[-- Type: text/plain, Size: 46938 bytes --]

QEMU 2.0.0 monitor - type 'help' for more information
early console in setup code
Initializing cgroup subsys cpu
Linux version 4.3.0-rc5-mm1-diet-meta+ (barrios@bbox) (gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04) ) #1545 SMP Tue Oct 20 08:55:45 KST 2015
Command line: hung_task_panic=1 earlyprintk=ttyS0,115200 debug apic=debug sysrq_always_enabled rcupdate.rcu_cpu_stall_timeout=100 panic=-1 softlockup_panic=1 nmi_watchdog=panic oops=panic console=ttyS0,115200 console=tty0 earlyprintk=ttyS0 ignore_loglevel ftrace_dump_on_oops vga=normal root=/dev/vda1 rw
KERNEL supported cpus:
  Intel GenuineIntel
x86/fpu: Legacy x87 FPU detected.
x86/fpu: Using 'lazy' FPU context switches.
e820: BIOS-provided physical RAM map:
BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
BIOS-e820: [mem 0x0000000000100000-0x00000000bfffbfff] usable
BIOS-e820: [mem 0x00000000bfffc000-0x00000000bfffffff] reserved
BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
bootconsole [earlyser0] enabled
debug: ignoring loglevel setting.
NX (Execute Disable) protection: active
SMBIOS 2.4 present.
DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
e820: remove [mem 0x000a0000-0x000fffff] usable
e820: last_pfn = 0xbfffc max_arch_pfn = 0x400000000
MTRR default type: write-back
MTRR fixed ranges enabled:
  00000-9FFFF write-back
  A0000-BFFFF uncachable
  C0000-FFFFF write-protect
MTRR variable ranges enabled:
  0 base 00C0000000 mask FFC0000000 uncachable
  1 disabled
  2 disabled
  3 disabled
  4 disabled
  5 disabled
  6 disabled
  7 disabled
x86/PAT: PAT not supported by CPU.
Scan for SMP in [mem 0x00000000-0x000003ff]
Scan for SMP in [mem 0x0009fc00-0x0009ffff]
Scan for SMP in [mem 0x000f0000-0x000fffff]
found SMP MP-table at [mem 0x000f0a70-0x000f0a7f] mapped at [ffff8800000f0a70]
  mpc: f0a80-f0c44
Scanning 1 areas for low memory corruption
Base memory trampoline at [ffff880000099000] 99000 size 24576
init_memory_mapping: [mem 0x00000000-0x000fffff]
 [mem 0x00000000-0x000fffff] page 4k
BRK [0x0220e000, 0x0220efff] PGTABLE
BRK [0x0220f000, 0x0220ffff] PGTABLE
BRK [0x02210000, 0x02210fff] PGTABLE
init_memory_mapping: [mem 0xbfc00000-0xbfdfffff]
 [mem 0xbfc00000-0xbfdfffff] page 2M
BRK [0x02211000, 0x02211fff] PGTABLE
init_memory_mapping: [mem 0xa0000000-0xbfbfffff]
 [mem 0xa0000000-0xbfbfffff] page 2M
init_memory_mapping: [mem 0x80000000-0x9fffffff]
 [mem 0x80000000-0x9fffffff] page 2M
init_memory_mapping: [mem 0x00100000-0x7fffffff]
 [mem 0x00100000-0x001fffff] page 4k
 [mem 0x00200000-0x7fffffff] page 2M
init_memory_mapping: [mem 0xbfe00000-0xbfffbfff]
 [mem 0xbfe00000-0xbfffbfff] page 4k
BRK [0x02212000, 0x02212fff] PGTABLE
RAMDISK: [mem 0x7851a000-0x7fffffff]
 [ffffea0000000000-ffffea0002ffffff] PMD -> [ffff8800bc400000-ffff8800bf3fffff] on node 0
Zone ranges:
  DMA      [mem 0x0000000000001000-0x0000000000ffffff]
  DMA32    [mem 0x0000000001000000-0x00000000bfffbfff]
  Normal   empty
Movable zone start for each node
Early memory node ranges
  node   0: [mem 0x0000000000001000-0x000000000009efff]
  node   0: [mem 0x0000000000100000-0x00000000bfffbfff]
Initmem setup node 0 [mem 0x0000000000001000-0x00000000bfffbfff]
On node 0 totalpages: 786330
  DMA zone: 64 pages used for memmap
  DMA zone: 21 pages reserved
  DMA zone: 3998 pages, LIFO batch:0
  DMA32 zone: 12224 pages used for memmap
  DMA32 zone: 782332 pages, LIFO batch:31
Intel MultiProcessor Specification v1.4
  mpc: f0a80-f0c44
MPTABLE: OEM ID: BOCHSCPU
MPTABLE: Product ID: 0.1         
MPTABLE: APIC at: 0xFEE00000
mapped APIC to ffffffffff5fd000 (        fee00000)
Processor #0 (Bootup-CPU)
Processor #1
Processor #2
Processor #3
Processor #4
Processor #5
Processor #6
Processor #7
Processor #8
Processor #9
Processor #10
Processor #11
Bus #0 is PCI   
Bus #1 is ISA   
IOAPIC[0]: apic_id 0, version 17, address 0xfec00000, GSI 0-23
Int: type 0, pol 1, trig 0, bus 00, IRQ 04, APIC ID 0, APIC INT 09
Int: type 0, pol 1, trig 0, bus 00, IRQ 0c, APIC ID 0, APIC INT 0b
Int: type 0, pol 1, trig 0, bus 00, IRQ 10, APIC ID 0, APIC INT 0b
Int: type 0, pol 1, trig 0, bus 00, IRQ 14, APIC ID 0, APIC INT 0a
Int: type 0, pol 1, trig 0, bus 00, IRQ 18, APIC ID 0, APIC INT 0a
Int: type 0, pol 0, trig 0, bus 01, IRQ 00, APIC ID 0, APIC INT 02
Int: type 0, pol 0, trig 0, bus 01, IRQ 01, APIC ID 0, APIC INT 01
Int: type 0, pol 0, trig 0, bus 01, IRQ 03, APIC ID 0, APIC INT 03
Int: type 0, pol 0, trig 0, bus 01, IRQ 04, APIC ID 0, APIC INT 04
Int: type 0, pol 0, trig 0, bus 01, IRQ 06, APIC ID 0, APIC INT 06
Int: type 0, pol 0, trig 0, bus 01, IRQ 07, APIC ID 0, APIC INT 07
Int: type 0, pol 0, trig 0, bus 01, IRQ 08, APIC ID 0, APIC INT 08
Int: type 0, pol 0, trig 0, bus 01, IRQ 0c, APIC ID 0, APIC INT 0c
Int: type 0, pol 0, trig 0, bus 01, IRQ 0d, APIC ID 0, APIC INT 0d
Int: type 0, pol 0, trig 0, bus 01, IRQ 0e, APIC ID 0, APIC INT 0e
Int: type 0, pol 0, trig 0, bus 01, IRQ 0f, APIC ID 0, APIC INT 0f
Lint: type 3, pol 0, trig 0, bus 01, IRQ 00, APIC ID 0, APIC LINT 00
Lint: type 1, pol 0, trig 0, bus 01, IRQ 00, APIC ID ff, APIC LINT 01
Processors: 12
smpboot: Allowing 12 CPUs, 0 hotplug CPUs
mapped IOAPIC to ffffffffff5fc000 (fec00000)
e820: [mem 0xc0000000-0xfeffbfff] available for PCI devices
clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645519600211568 ns
setup_percpu: NR_CPUS:16 nr_cpumask_bits:16 nr_cpu_ids:12 nr_node_ids:1
PERCPU: Embedded 31 pages/cpu @ffff8800bfa00000 s87640 r8192 d31144 u131072
pcpu-alloc: s87640 r8192 d31144 u131072 alloc=1*2097152
pcpu-alloc: [0] 00 01 02 03 04 05 06 07 08 09 10 11 -- -- -- -- 
Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 774021
Kernel command line: hung_task_panic=1 earlyprintk=ttyS0,115200 debug apic=debug sysrq_always_enabled rcupdate.rcu_cpu_stall_timeout=100 panic=-1 softlockup_panic=1 nmi_watchdog=panic oops=panic console=ttyS0,115200 console=tty0 earlyprintk=ttyS0 ignore_loglevel ftrace_dump_on_oops vga=normal root=/dev/vda1 rw
sysrq: sysrq always enabled.
log_buf_len individual max cpu contribution: 2097152 bytes
log_buf_len total cpu_extra contributions: 23068672 bytes
log_buf_len min size: 8388608 bytes
log_buf_len: 33554432 bytes
early log buf free: 8380096(99%)
PID hash table entries: 4096 (order: 3, 32768 bytes)
Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes)
Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes)
Memory: 2911172K/3145320K available (4237K kernel code, 721K rwdata, 1988K rodata, 936K init, 8608K bss, 234148K reserved, 0K cma-reserved)
SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=12, Nodes=1
Hierarchical RCU implementation.
	Build-time adjustment of leaf fanout to 64.
	RCU restricting CPUs from NR_CPUS=16 to nr_cpu_ids=12.
RCU: Adjusting geometry for rcu_fanout_leaf=64, nr_cpu_ids=12
NR_IRQS:4352 nr_irqs:136 16
Console: colour VGA+ 80x25
console [tty0] enabled
bootconsole [earlyser0] disabled
<snip>
console [ttyS0] enabled
tsc: Fast TSC calibration using PIT
tsc: Detected 3199.926 MHz processor
Calibrating delay loop (skipped), value calculated using timer frequency.. 6399.85 BogoMIPS (lpj=12799704)
pid_max: default: 32768 minimum: 301
Mount-cache hash table entries: 8192 (order: 4, 65536 bytes)
Mountpoint-cache hash table entries: 8192 (order: 4, 65536 bytes)
Initializing cgroup subsys memory
Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0
Freeing SMP alternatives memory: 20K (ffffffff819a0000 - ffffffff819a5000)
ftrace: allocating 16664 entries in 66 pages
Switched APIC routing to physical flat.
enabled ExtINT on CPU#0
ENABLING IO-APIC IRQs
init IO_APIC IRQs
 apic 0 pin 0 not connected
IOAPIC[0]: Set routing entry (0-1 -> 0x31 -> IRQ 1 Mode:0 Active:0 Dest:0)
IOAPIC[0]: Set routing entry (0-2 -> 0x30 -> IRQ 0 Mode:0 Active:0 Dest:0)
IOAPIC[0]: Set routing entry (0-3 -> 0x33 -> IRQ 3 Mode:0 Active:0 Dest:0)
IOAPIC[0]: Set routing entry (0-4 -> 0x34 -> IRQ 4 Mode:0 Active:0 Dest:0)
 apic 0 pin 5 not connected
IOAPIC[0]: Set routing entry (0-6 -> 0x36 -> IRQ 6 Mode:0 Active:0 Dest:0)
IOAPIC[0]: Set routing entry (0-7 -> 0x37 -> IRQ 7 Mode:0 Active:0 Dest:0)
IOAPIC[0]: Set routing entry (0-8 -> 0x38 -> IRQ 8 Mode:0 Active:0 Dest:0)
IOAPIC[0]: Set routing entry (0-9 -> 0x39 -> IRQ 9 Mode:1 Active:0 Dest:0)
IOAPIC[0]: Set routing entry (0-10 -> 0x3a -> IRQ 10 Mode:1 Active:0 Dest:0)
IOAPIC[0]: Set routing entry (0-11 -> 0x3b -> IRQ 11 Mode:1 Active:0 Dest:0)
IOAPIC[0]: Set routing entry (0-12 -> 0x3c -> IRQ 12 Mode:0 Active:0 Dest:0)
IOAPIC[0]: Set routing entry (0-13 -> 0x3d -> IRQ 13 Mode:0 Active:0 Dest:0)
IOAPIC[0]: Set routing entry (0-14 -> 0x3e -> IRQ 14 Mode:0 Active:0 Dest:0)
IOAPIC[0]: Set routing entry (0-15 -> 0x3f -> IRQ 15 Mode:0 Active:0 Dest:0)
 apic 0 pin 16 not connected
 apic 0 pin 17 not connected
 apic 0 pin 18 not connected
 apic 0 pin 19 not connected
 apic 0 pin 20 not connected
 apic 0 pin 21 not connected
 apic 0 pin 22 not connected
 apic 0 pin 23 not connected
..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
Using local APIC timer interrupts.
calibrating APIC timer ...
... lapic delta = 6251755
..... delta 6251755
..... mult: 268510832
..... calibration result: 4001123
..... CPU clock speed is 3200.3592 MHz.
..... host bus clock speed is 1000.1123 MHz.
... verify APIC timer
... jiffies delta = 25
... jiffies result ok
smpboot: CPU0: Intel QEMU Virtual CPU version 2.0.0 (family: 0x6, model: 0x6, stepping: 0x3)
Performance Events: Broken PMU hardware detected, using software events only.
Failed to access perfctr msr (MSR c2 is 0)
x86: Booting SMP configuration:
.... node  #0, CPUs:        #1
masked ExtINT on CPU#1
  #2
masked ExtINT on CPU#2
  #3
masked ExtINT on CPU#3
  #4
masked ExtINT on CPU#4
  #5
masked ExtINT on CPU#5
  #6
masked ExtINT on CPU#6
  #7
masked ExtINT on CPU#7
  #8
masked ExtINT on CPU#8
  #9
masked ExtINT on CPU#9
 #10
masked ExtINT on CPU#10
 #11
masked ExtINT on CPU#11
x86: Booted up 1 node, 12 CPUs
smpboot: Total of 12 processors activated (76818.13 BogoMIPS)
devtmpfs: initialized
clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
NET: Registered protocol family 16
PCI: Using configuration type 1 for base access
vgaarb: loaded
SCSI subsystem initialized
libata version 3.00 loaded.
PCI: Probing PCI hardware
PCI: root bus 00: using default resources
PCI: Probing PCI hardware (bus 00)
PCI host bridge to bus 0000:00
pci_bus 0000:00: root bus resource [io  0x0000-0xffff]
pci_bus 0000:00: root bus resource [mem 0x00000000-0xffffffffff]
pci_bus 0000:00: No busn resource found for root bus, will use [bus 00-ff]
pci 0000:00:00.0: [8086:1237] type 00 class 0x060000
pci 0000:00:01.0: [8086:7000] type 00 class 0x060100
pci 0000:00:01.1: [8086:7010] type 00 class 0x010180
pci 0000:00:01.1: reg 0x20: [io  0xc0c0-0xc0cf]
pci 0000:00:01.1: legacy IDE quirk: reg 0x10: [io  0x01f0-0x01f7]
pci 0000:00:01.1: legacy IDE quirk: reg 0x14: [io  0x03f6]
pci 0000:00:01.1: legacy IDE quirk: reg 0x18: [io  0x0170-0x0177]
pci 0000:00:01.1: legacy IDE quirk: reg 0x1c: [io  0x0376]
pci 0000:00:01.3: [8086:7113] type 00 class 0x068000
pci 0000:00:02.0: [1013:00b8] type 00 class 0x030000
pci 0000:00:02.0: reg 0x10: [mem 0xfc000000-0xfdffffff pref]
pci 0000:00:02.0: reg 0x14: [mem 0xfebd0000-0xfebd0fff]
pci 0000:00:02.0: reg 0x30: [mem 0xfebc0000-0xfebcffff pref]
vgaarb: setting as boot device: PCI:0000:00:02.0
vgaarb: device added: PCI:0000:00:02.0,decodes=io+mem,owns=io+mem,locks=none
pci 0000:00:03.0: [1af4:1000] type 00 class 0x020000
pci 0000:00:03.0: reg 0x10: [io  0xc080-0xc09f]
pci 0000:00:03.0: reg 0x14: [mem 0xfebd1000-0xfebd1fff]
pci 0000:00:03.0: reg 0x30: [mem 0xfeb80000-0xfebbffff pref]
pci 0000:00:04.0: [1af4:1002] type 00 class 0x00ff00
pci 0000:00:04.0: reg 0x10: [io  0xc0a0-0xc0bf]
pci 0000:00:05.0: [1af4:1001] type 00 class 0x010000
pci 0000:00:05.0: reg 0x10: [io  0xc000-0xc03f]
pci 0000:00:05.0: reg 0x14: [mem 0xfebd2000-0xfebd2fff]
pci 0000:00:06.0: [1af4:1001] type 00 class 0x010000
pci 0000:00:06.0: reg 0x10: [io  0xc040-0xc07f]
pci 0000:00:06.0: reg 0x14: [mem 0xfebd3000-0xfebd3fff]
pci 0000:00:07.0: [8086:25ab] type 00 class 0x088000
pci 0000:00:07.0: reg 0x10: [mem 0xfebd4000-0xfebd400f]
pci_bus 0000:00: busn_res: [bus 00-ff] end is updated to 00
pci 0000:00:01.0: PIIX/ICH IRQ router [8086:7000]
PCI: pci_cache_line_size set to 64 bytes
e820: reserve RAM buffer [mem 0x0009fc00-0x0009ffff]
e820: reserve RAM buffer [mem 0xbfffc000-0xbfffffff]
clocksource: Switched to clocksource refined-jiffies
pci_bus 0000:00: resource 4 [io  0x0000-0xffff]
pci_bus 0000:00: resource 5 [mem 0x00000000-0xffffffffff]
NET: Registered protocol family 2
TCP established hash table entries: 32768 (order: 6, 262144 bytes)
TCP bind hash table entries: 32768 (order: 7, 524288 bytes)
TCP: Hash tables configured (established 32768 bind 32768)
UDP hash table entries: 2048 (order: 4, 65536 bytes)
UDP-Lite hash table entries: 2048 (order: 4, 65536 bytes)
NET: Registered protocol family 1
Trying to unpack rootfs image as initramfs...
Freeing initrd memory: 125848K (ffff88007851a000 - ffff880080000000)
platform rtc_cmos: registered platform RTC device (no PNP device found)
Scanning for low memory corruption every 60 seconds
futex hash table entries: 4096 (order: 6, 262144 bytes)
HugeTLB registered 2 MB page size, pre-allocated 0 pages
fuse init (API version 7.23)
9p: Installing v9fs 9p2000 file system support
cryptomgr_test (74) used greatest stack depth: 15352 bytes left
cryptomgr_test (82) used greatest stack depth: 15136 bytes left
Block layer SCSI generic (bsg) driver version 0.4 loaded (major 251)
io scheduler noop registered
io scheduler deadline registered
io scheduler cfq registered (default)
querying PCI -> IRQ mapping bus:0, slot:3, pin:0.
virtio-pci 0000:00:03.0: PCI->APIC IRQ transform: INT A -> IRQ 11
virtio-pci 0000:00:03.0: virtio_pci: leaving for legacy driver
querying PCI -> IRQ mapping bus:0, slot:4, pin:0.
virtio-pci 0000:00:04.0: PCI->APIC IRQ transform: INT A -> IRQ 11
virtio-pci 0000:00:04.0: virtio_pci: leaving for legacy driver
querying PCI -> IRQ mapping bus:0, slot:5, pin:0.
virtio-pci 0000:00:05.0: PCI->APIC IRQ transform: INT A -> IRQ 10
virtio-pci 0000:00:05.0: virtio_pci: leaving for legacy driver
querying PCI -> IRQ mapping bus:0, slot:6, pin:0.
virtio-pci 0000:00:06.0: PCI->APIC IRQ transform: INT A -> IRQ 10
virtio-pci 0000:00:06.0: virtio_pci: leaving for legacy driver
Serial: 8250/16550 driver, 32 ports, IRQ sharing enabled
serial8250: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
Linux agpgart interface v0.103
brd: module loaded
loop: module loaded
 vda: vda1 vda2 < vda5 >
zram: Added device: zram0
libphy: Fixed MDIO Bus: probed
tun: Universal TUN/TAP device driver, 1.6
tun: (C) 1999-2004 Max Krasnyansky <maxk@qualcomm.com>
serio: i8042 KBD port at 0x60,0x64 irq 1
serio: i8042 AUX port at 0x60,0x64 irq 12
mousedev: PS/2 mouse device common for all mice
rtc_cmos rtc_cmos: rtc core: registered rtc_cmos as rtc0
rtc_cmos rtc_cmos: alarms up to one day, 114 bytes nvram
device-mapper: ioctl: 4.33.0-ioctl (2015-8-18) initialised: dm-devel@redhat.com
device-mapper: cache cleaner: version 1.0.0 loaded
NET: Registered protocol family 17
9pnet: Installing 9P2000 support
... APIC ID:      00000000 (0)
... APIC VERSION: 01050014
0000000000000000000000000000000000000000000000000000000000000000
000000000e000000000000000000000000000000000000000000000000000000
0000000000020000000000000000000000000000000000000000000000008000

number of MP IRQ sources: 16.
number of IO-APIC #0 registers: 24.
testing the IO APIC.......................
IO APIC #0......
.... register #00: 00000000
.......    : physical APIC id: 00
.......    : Delivery Type: 0
.......    : LTS          : 0
.... register #01: 00170011
.......     : max redirection entries: 17
.......     : PRQ implemented: 0
.......     : IO APIC version: 11
.... register #02: 00000000
.......     : arbitration: 00
.... IRQ redirection table:
IOAPIC 0:
 pin00, disabled, edge , high, V(00), IRR(0), S(0), physical, D(00), M(0)
 pin01, enabled , edge , high, V(31), IRR(0), S(0), physical, D(00), M(0)
 pin02, enabled , edge , high, V(30), IRR(0), S(0), physical, D(00), M(0)
 pin03, enabled , edge , high, V(33), IRR(0), S(0), physical, D(00), M(0)
 pin04, disabled, edge , high, V(34), IRR(0), S(0), physical, D(00), M(0)
 pin05, disabled, edge , high, V(00), IRR(0), S(0), physical, D(00), M(0)
 pin06, enabled , edge , high, V(36), IRR(0), S(0), physical, D(00), M(0)
 pin07, enabled , edge , high, V(37), IRR(0), S(0), physical, D(00), M(0)
 pin08, enabled , edge , high, V(38), IRR(0), S(0), physical, D(00), M(0)
 pin09, disabled, level, high, V(39), IRR(0), S(0), physical, D(00), M(0)
 pin0a, enabled , level, high, V(3A), IRR(0), S(0), physical, D(00), M(0)
 pin0b, enabled , level, high, V(3B), IRR(0), S(0), physical, D(00), M(0)
 pin0c, enabled , edge , high, V(3C), IRR(0), S(0), physical, D(00), M(0)
 pin0d, enabled , edge , high, V(3D), IRR(0), S(0), physical, D(00), M(0)
 pin0e, enabled , edge , high, V(3E), IRR(0), S(0), physical, D(00), M(0)
 pin0f, enabled , edge , high, V(3F), IRR(0), S(0), physical, D(00), M(0)
 pin10, disabled, edge , high, V(00), IRR(0), S(0), physical, D(00), M(0)
 pin11, disabled, edge , high, V(00), IRR(0), S(0), physical, D(00), M(0)
 pin12, disabled, edge , high, V(00), IRR(0), S(0), physical, D(00), M(0)
 pin13, disabled, edge , high, V(00), IRR(0), S(0), physical, D(00), M(0)
 pin14, disabled, edge , high, V(00), IRR(0), S(0), physical, D(00), M(0)
 pin15, disabled, edge , high, V(00), IRR(0), S(0), physical, D(00), M(0)
 pin16, disabled, edge , high, V(00), IRR(0), S(0), physical, D(00), M(0)
 pin17, disabled, edge , high, V(00), IRR(0), S(0), physical, D(00), M(0)
IRQ to pin mappings:
IRQ0 -> 0:2
IRQ1 -> 0:1
IRQ3 -> 0:3
IRQ4 -> 0:4
IRQ6 -> 0:6
IRQ7 -> 0:7
IRQ8 -> 0:8
IRQ9 -> 0:9
IRQ10 -> 0:10
IRQ11 -> 0:11
IRQ12 -> 0:12
IRQ13 -> 0:13
IRQ14 -> 0:14
IRQ15 -> 0:15
.................................... done.
rtc_cmos rtc_cmos: setting system clock to 2015-10-20 08:57:55 UTC (1445331475)
input: AT Translated Set 2 keyboard as /devices/platform/i8042/serio0/input/input0
Freeing unused kernel memory: 936K (ffffffff818b6000 - ffffffff819a0000)
Write protecting the kernel read-only data: 8192k
Freeing unused kernel memory: 1900K (ffff880001425000 - ffff880001600000)
Freeing unused kernel memory: 60K (ffff8800017f1000 - ffff880001800000)
busybox (117) used greatest stack depth: 14480 bytes left
exe (124) used greatest stack depth: 14024 bytes left
udevd[140]: starting version 175
blkid (151) used greatest stack depth: 13920 bytes left
modprobe (242) used greatest stack depth: 13784 bytes left
clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x2e200418439, max_idle_ns: 440795220848 ns
clocksource: Switched to clocksource tsc
EXT4-fs (vda1): recovery complete
EXT4-fs (vda1): mounted filesystem with ordered data mode. Opts: (null)
exe (262) used greatest stack depth: 13032 bytes left
random: init urandom read with 9 bits of entropy available
init: plymouth-upstart-bridge main process (279) terminated with status 1
init: plymouth-upstart-bridge main process ended, respawning
init: plymouth-upstart-bridge main process (289) terminated with status 1
init: plymouth-upstart-bridge main process ended, respawning
init: plymouth-upstart-bridge main process (293) terminated with status 1
init: plymouth-upstart-bridge main process ended, respawning
init: ureadahead main process (282) terminated with status 5
Adding 4191228k swap on /dev/vda5.  Priority:-1 extents:1 across:4191228k FS
systemd-udevd[423]: starting version 204
EXT4-fs (vdb): mounted filesystem with ordered data mode. Opts: errors=remount-ro
 * Stopping Send an event to indicate plymouth is up [ OK ]
 * Starting Mount filesystems on boot [ OK ]
 * Starting Signal sysvinit that the rootfs is mounted [ OK ]
 * Starting Populate /dev filesystem [ OK ]
 * Starting Populate and link to /run filesystem [ OK ]
 * Stopping Populate /dev filesystem [ OK ]
 * Stopping Populate and link to /run filesystem [ OK ]
 * Starting Clean /tmp directory [ OK ]
 * Stopping Track if upstart is running in a container [ OK ]
 * Stopping Clean /tmp directory [ OK ]
 * Starting Initialize or finalize resolvconf [ OK ]
 * Starting set console keymap [ OK ]
 * Starting Signal sysvinit that virtual filesystems are mounted [ OK ]
 * Starting Signal sysvinit that virtual filesystems are mounted [ OK ]
 * Starting Bridge udev events into upstart [ OK ]
 * Starting Signal sysvinit that remote filesystems are mounted [ OK ]
 * Stopping set console keymap [ OK ]
 * Starting device node and kernel event manager [ OK ]
 * Starting load modules from /etc/modules [ OK ]
 * Starting cold plug devices [ OK ]
 * Starting log initial device creation [ OK ]
 * Stopping Read required files in advance (for other mountpoints) [ OK ]
 * Stopping load modules from /etc/modules [ OK ]
 * Starting Signal sysvinit that local filesystems are mounted [ OK ]
 * Starting flush early job output to logs [ OK ]
 * Stopping Mount filesystems on boot [ OK ]
 * Stopping flush early job output to logs [ OK ]
 * Starting D-Bus system message bus [ OK ]
 * Starting SystemD login management service [ OK ]
 * Starting system logging daemon [ OK ]
 * Stopping cold plug devices [ OK ]
 * Starting Uncomplicated firewall [ OK ]
 * Starting configure network device security [ OK ]
 * Stopping log initial device creation [ OK ]
 * Starting configure network device security [ OK ]
 * Starting save udev log and update rules [ OK ]
 * Starting set console font [ OK ]
 * Stopping save udev log and update rules [ OK ]
 * Starting Mount network filesystems [ OK ]
 * Starting Failsafe Boot Delay [ OK ]
 * Starting configure network device security [ OK ]
 * Stopping Mount network filesystems [ OK ]
 * Starting configure network device [ OK ]
 * Starting configure network device [ OK ]
 * Starting Bridge file events into upstart [ OK ]
 * Starting Bridge socket events into upstart [ OK ]
 * Stopping set console font [ OK ]
 * Starting userspace bootsplash [ OK ]
 * Starting Send an event to indicate plymouth is up [ OK ]
 * Stopping userspace bootsplash [ OK ]
 * Stopping Send an event to indicate plymouth is up [ OK ]
 * Starting Mount network filesystems [ OK ]
init: failsafe main process (591) killed by TERM signal
 * Stopping Failsafe Boot Delay [ OK ]
 * Starting System V initialisation compatibility [ OK ]
 * Stopping Mount network filesystems [ OK ]
 * Starting configure virtual network devices [ OK ]
 * Stopping System V initialisation compatibility [ OK ]
 * Starting System V runlevel compatibility [ OK ]
 * Starting deferred execution scheduler [ OK ]
 * Starting regular background program processing daemon [ OK ]
 * Starting ACPI daemon [ OK ]
 * Starting save kernel messages [ OK ]
 * Starting CPU interrupts balancing daemon [ OK ]
 * Stopping save kernel messages [ OK ]
 * Starting OpenSSH server [ OK ]
 * Starting automatic crash report generation [ OK ]
 * Restoring resolver state... [ OK ]
eth0      Link encap:Ethernet  HWaddr 52:54:79:12:34:57
          inet addr:192.168.0.21  Bcast:192.168.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:34 errors:0 dropped:24 overruns:0 frame:0
          TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:5780 (5.7 KB)  TX bytes:800 (800.0 B)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
 * Stopping System V runlevel compatibility [ OK ]
init: plymouth-upstart-bridge main process ended, respawning
sh (1429) used greatest stack depth: 11752 bytes left
sh (1454) used greatest stack depth: 11528 bytes left
random: nonblocking pool is initialized
Adding 4191228k swap on /dev/vda5.  Priority:-1 extents:1 across:4191228k FS
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffff810782a9>] down_read_trylock+0x9/0x30
PGD 0 
Oops: 0000 [#1] SMP 
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 1 PID: 26445 Comm: sh Not tainted 4.3.0-rc5-mm1-diet-meta+ #1545
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff8800b9af3480 ti: ffff88007fea0000 task.ti: ffff88007fea0000
RIP: 0010:[<ffffffff810782a9>]  [<ffffffff810782a9>] down_read_trylock+0x9/0x30
RSP: 0018:ffff88007fea3648  EFLAGS: 00010202
RAX: 0000000000000001 RBX: ffffea0002324900 RCX: ffff88007fea37e8
RDX: 0000000000000000 RSI: ffff88007fea36e8 RDI: 0000000000000008
RBP: ffff88007fea3648 R08: ffffffff818446a0 R09: ffff8800b9af4c80
R10: 0000000000000216 R11: 0000000000000001 R12: ffff88007f58d6e1
R13: ffff88007f58d6e0 R14: 0000000000000008 R15: 0000000000000001
FS:  00007f0993e78740(0000) GS:ffff8800bfa20000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 000000007edee000 CR4: 00000000000006a0
Stack:
 ffff88007fea3678 ffffffff81124ff0 ffffea0002324900 ffff88007fea36e8
 ffff88009ffe8400 0000000000000000 ffff88007fea36c0 ffffffff81125733
 ffff8800bfa34540 ffffffff8105dc9d ffffea0002324900 ffff88007fea37e8
Call Trace:
 [<ffffffff81124ff0>] page_lock_anon_vma_read+0x60/0x180
 [<ffffffff81125733>] rmap_walk+0x1b3/0x3f0
 [<ffffffff8105dc9d>] ? finish_task_switch+0x5d/0x1f0
 [<ffffffff81125b13>] page_referenced+0x1a3/0x220
 [<ffffffff81123e30>] ? __page_check_address+0x1a0/0x1a0
 [<ffffffff81124f90>] ? page_get_anon_vma+0xd0/0xd0
 [<ffffffff81123820>] ? anon_vma_ctor+0x40/0x40
 [<ffffffff8110087b>] shrink_page_list+0x5ab/0xde0
 [<ffffffff8110174c>] shrink_inactive_list+0x18c/0x4b0
 [<ffffffff811023bd>] shrink_lruvec+0x59d/0x740
 [<ffffffff811025f0>] shrink_zone+0x90/0x250
 [<ffffffff811028dd>] do_try_to_free_pages+0x12d/0x3b0
 [<ffffffff81102d3d>] try_to_free_mem_cgroup_pages+0x9d/0x120
 [<ffffffff811496c3>] try_charge+0x163/0x700
 [<ffffffff81149cb4>] mem_cgroup_do_precharge+0x54/0x70
 [<ffffffff81149e45>] mem_cgroup_can_attach+0x175/0x1b0
 [<ffffffff811b2c57>] ? kernfs_iattrs.isra.6+0x37/0xd0
 [<ffffffff81148e70>] ? get_mctgt_type+0x320/0x320
 [<ffffffff810a9d29>] cgroup_migrate+0x149/0x440
 [<ffffffff810aa60c>] cgroup_attach_task+0x7c/0xe0
 [<ffffffff810aa904>] __cgroup_procs_write.isra.33+0x1d4/0x2b0
 [<ffffffff810aaa10>] cgroup_tasks_write+0x10/0x20
 [<ffffffff810a6238>] cgroup_file_write+0x38/0xf0
 [<ffffffff811b54ad>] kernfs_fop_write+0x11d/0x170
 [<ffffffff81153918>] __vfs_write+0x28/0xe0
 [<ffffffff8116e614>] ? __fd_install+0x24/0xc0
 [<ffffffff810784a1>] ? percpu_down_read+0x21/0x50
 [<ffffffff81153e91>] vfs_write+0xa1/0x170
 [<ffffffff81154716>] SyS_write+0x46/0xa0
 [<ffffffff81420a17>] entry_SYSCALL_64_fastpath+0x12/0x6a
Code: 5e 82 3a 00 48 83 c4 08 5b 5d c3 48 89 45 f0 e8 9b 6a 3a 00 48 8b 45 f0 eb df 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 <48> 8b 07 48 89 c2 48 83 c2 01 7e 07 f0 48 0f b1 17 75 f0 48 f7 
RIP  [<ffffffff810782a9>] down_read_trylock+0x9/0x30
 RSP <ffff88007fea3648>
CR2: 0000000000000008
---[ end trace e81a82c8122b447d ]---
Kernel panic - not syncing: Fatal exception
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffff810782a9>] down_read_trylock+0x9/0x30
PGD 0 
Oops: 0000 [#2] SMP 
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 10 PID: 59 Comm: khugepaged Tainted: G      D         4.3.0-rc5-mm1-diet-meta+ #1545
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff8800b9851a40 ti: ffff8800b985c000 task.ti: ffff8800b985c000
RIP: 0010:[<ffffffff810782a9>]  [<ffffffff810782a9>] down_read_trylock+0x9/0x30
RSP: 0018:ffff8800b985f778  EFLAGS: 00010202
RAX: 0000000000000001 RBX: ffffea0002321800 RCX: ffff8800b985f918
RDX: 0000000000000000 RSI: ffff8800b985f818 RDI: 0000000000000008
RBP: ffff8800b985f778 R08: ffffffff818446a0 R09: ffff8800b9853240
R10: 000000000000ba03 R11: 0000000000000001 R12: ffff88007f58d6e1
R13: ffff88007f58d6e0 R14: 0000000000000008 R15: 0000000000000001
FS:  0000000000000000(0000) GS:ffff8800bfb40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 0000000001808000 CR4: 00000000000006a0
Stack:
 ffff8800b985f7a8 ffffffff81124ff0 ffffea0002321800 ffff8800b985f818
 ffff88009ffe8400 0000000000000000 ffff8800b985f7f0 ffffffff81125733
 ffff8800bfb54540 ffffffff8105dc9d ffffea0002321800 ffff8800b985f918
Call Trace:
 [<ffffffff81124ff0>] page_lock_anon_vma_read+0x60/0x180
 [<ffffffff81125733>] rmap_walk+0x1b3/0x3f0
 [<ffffffff8105dc9d>] ? finish_task_switch+0x5d/0x1f0
 [<ffffffff81125b13>] page_referenced+0x1a3/0x220
 [<ffffffff81123e30>] ? __page_check_address+0x1a0/0x1a0
 [<ffffffff81124f90>] ? page_get_anon_vma+0xd0/0xd0
 [<ffffffff81123820>] ? anon_vma_ctor+0x40/0x40
 [<ffffffff8110087b>] shrink_page_list+0x5ab/0xde0
 [<ffffffff8110174c>] shrink_inactive_list+0x18c/0x4b0
 [<ffffffff811023bd>] shrink_lruvec+0x59d/0x740
 [<ffffffff811025f0>] shrink_zone+0x90/0x250
 [<ffffffff811028dd>] do_try_to_free_pages+0x12d/0x3b0
 [<ffffffff81102d3d>] try_to_free_mem_cgroup_pages+0x9d/0x120
 [<ffffffff811496c3>] try_charge+0x163/0x700
 [<ffffffff8141d1f3>] ? schedule+0x33/0x80
 [<ffffffff8114d45f>] mem_cgroup_try_charge+0x9f/0x1d0
 [<ffffffff811434bc>] khugepaged+0x7cc/0x1ac0
 [<ffffffff81066e01>] ? hrtick_update+0x1/0x70
 [<ffffffff81072430>] ? prepare_to_wait_event+0xf0/0xf0
 [<ffffffff81142cf0>] ? total_mapcount+0x70/0x70
 [<ffffffff81056cd9>] kthread+0xc9/0xe0
 [<ffffffff81056c10>] ? kthread_park+0x60/0x60
 [<ffffffff81420d6f>] ret_from_fork+0x3f/0x70
 [<ffffffff81056c10>] ? kthread_park+0x60/0x60
Code: 5e 82 3a 00 48 83 c4 08 5b 5d c3 48 89 45 f0 e8 9b 6a 3a 00 48 8b 45 f0 eb df 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 <48> 8b 07 48 89 c2 48 83 c2 01 7e 07 f0 48 0f b1 17 75 f0 48 f7 
RIP  [<ffffffff810782a9>] down_read_trylock+0x9/0x30
 RSP <ffff8800b985f778>
CR2: 0000000000000008
---[ end trace e81a82c8122b447e ]---
Shutting down cpus with NMI
Dumping ftrace buffer:
   (ftrace buffer empty)
Kernel Offset: disabled

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/5] MADV_FREE refactoring and fix KSM page
  2015-10-19 10:01 ` [PATCH 0/5] MADV_FREE refactoring and fix KSM page Minchan Kim
  2015-10-20  1:38   ` Minchan Kim
@ 2015-10-20  7:21   ` Minchan Kim
  2015-10-20  7:27     ` Minchan Kim
  2015-10-20 21:36     ` Andrew Morton
  1 sibling, 2 replies; 26+ messages in thread
From: Minchan Kim @ 2015-10-20  7:21 UTC (permalink / raw)
  To: Andrew Morton, Kirill A. Shutemov
  Cc: linux-mm, linux-kernel, Hugh Dickins, Rik van Riel, Mel Gorman,
	Michal Hocko, Johannes Weiner, Vlastimil Babka

On Mon, Oct 19, 2015 at 07:01:50PM +0900, Minchan Kim wrote:
> On Mon, Oct 19, 2015 at 03:31:42PM +0900, Minchan Kim wrote:
> > Hello, it's too late since I sent previos patch.
> > https://lkml.org/lkml/2015/6/3/37
> > 
> > This patch is alomost new compared to previos approach.
> > I think this is more simple, clear and easy to review.
> > 
> > One thing I should notice is that I have tested this patch
> > and couldn't find any critical problem so I rebased patchset
> > onto recent mmotm(ie, mmotm-2015-10-15-15-20) to send formal
> > patchset. Unfortunately, I start to see sudden discarding of
> > the page we shouldn't do. IOW, application's valid anonymous page
> > was disappeared suddenly.
> > 
> > When I look through THP changes, I think we could lose
> > dirty bit of pte between freeze_page and unfreeze_page
> > when we mark it as migration entry and restore it.
> > So, I added below simple code without enough considering
> > and cannot see the problem any more.
> > I hope it's good hint to find right fix this problem.
> > 
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index d5ea516ffb54..e881c04f5950 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -3138,6 +3138,9 @@ static void unfreeze_page_vma(struct vm_area_struct *vma, struct page *page,
> >  		if (is_write_migration_entry(swp_entry))
> >  			entry = maybe_mkwrite(entry, vma);
> >  
> > +		if (PageDirty(page))
> > +			SetPageDirty(page);
> 
> The condition of PageDirty was typo. I didn't add the condition.
> Just added.
> 
>                 SetPageDirty(page);

I reviewed the THP refcount redesign patch, and it seems the patch below
fixes the MADV_FREE problem. It has been working well for hours.

>From 104a0940b4c0f97e61de9fee0fd602926ff28312 Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@kernel.org>
Date: Tue, 20 Oct 2015 16:00:52 +0900
Subject: [PATCH] mm: mark head page dirty in split_huge_page

In a THP split under the old THP refcount scheme, we mapped all of the
pages (i.e., head + tails) with pte_mkdirty and set PG_dirty on every
tail page.

But with the THP refcount redesign, we can lose the dirty bit in the
page table as well as PG_dirty on the head page if we free the THP page
using a migration_entry.

madvise_free then ends up suddenly discarding the head page.
This patch fixes it by marking the head page PG_dirty when the VM
splits the THP page.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/huge_memory.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index adccfb48ce57..7fbbd42554a1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3258,6 +3258,7 @@ static void __split_huge_page(struct page *page, struct list_head *list)
 	atomic_sub(tail_mapcount, &head->_count);
 
 	ClearPageCompound(head);
+	SetPageDirty(head);
 	spin_unlock_irq(&zone->lru_lock);
 
 	unfreeze_page(page_anon_vma(head), head);
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/5] MADV_FREE refactoring and fix KSM page
  2015-10-20  7:21   ` Minchan Kim
@ 2015-10-20  7:27     ` Minchan Kim
  2015-10-20 21:36     ` Andrew Morton
  1 sibling, 0 replies; 26+ messages in thread
From: Minchan Kim @ 2015-10-20  7:27 UTC (permalink / raw)
  To: Andrew Morton, Kirill A. Shutemov
  Cc: linux-mm, linux-kernel, Hugh Dickins, Rik van Riel, Mel Gorman,
	Michal Hocko, Johannes Weiner, Vlastimil Babka

On Tue, Oct 20, 2015 at 04:21:09PM +0900, Minchan Kim wrote:
> On Mon, Oct 19, 2015 at 07:01:50PM +0900, Minchan Kim wrote:
> > On Mon, Oct 19, 2015 at 03:31:42PM +0900, Minchan Kim wrote:
> > > Hello, it's too late since I sent previos patch.
> > > https://lkml.org/lkml/2015/6/3/37
> > > 
> > > This patch is alomost new compared to previos approach.
> > > I think this is more simple, clear and easy to review.
> > > 
> > > One thing I should notice is that I have tested this patch
> > > and couldn't find any critical problem so I rebased patchset
> > > onto recent mmotm(ie, mmotm-2015-10-15-15-20) to send formal
> > > patchset. Unfortunately, I start to see sudden discarding of
> > > the page we shouldn't do. IOW, application's valid anonymous page
> > > was disappeared suddenly.
> > > 
> > > When I look through THP changes, I think we could lose
> > > dirty bit of pte between freeze_page and unfreeze_page
> > > when we mark it as migration entry and restore it.
> > > So, I added below simple code without enough considering
> > > and cannot see the problem any more.
> > > I hope it's good hint to find right fix this problem.
> > > 
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index d5ea516ffb54..e881c04f5950 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -3138,6 +3138,9 @@ static void unfreeze_page_vma(struct vm_area_struct *vma, struct page *page,
> > >  		if (is_write_migration_entry(swp_entry))
> > >  			entry = maybe_mkwrite(entry, vma);
> > >  
> > > +		if (PageDirty(page))
> > > +			SetPageDirty(page);
> > 
> > The condition of PageDirty was typo. I didn't add the condition.
> > Just added.
> > 
> >                 SetPageDirty(page);
> 
> I reviewed THP refcount redesign patch and It seems below patch fixes
> MADV_FREE problem. It works well for hours.
> 
> From 104a0940b4c0f97e61de9fee0fd602926ff28312 Mon Sep 17 00:00:00 2001
> From: Minchan Kim <minchan@kernel.org>
> Date: Tue, 20 Oct 2015 16:00:52 +0900
> Subject: [PATCH] mm: mark head page dirty in split_huge_page
> 
> In thp split in old THP refcount, we mappped all of pages
> (ie, head + tails) to pte_mkdirty and mark PG_flags to every
> tail pages.
> 
> But with THP refcount redesign, we can lose dirty bit in page table
> and PG_dirty for head page if we want to free the THP page using
 
Typo in the quoted line above: "free" should be "freeze".


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/5] MADV_FREE refactoring and fix KSM page
  2015-10-20  7:21   ` Minchan Kim
  2015-10-20  7:27     ` Minchan Kim
@ 2015-10-20 21:36     ` Andrew Morton
  2015-10-20 22:43       ` Kirill A. Shutemov
  1 sibling, 1 reply; 26+ messages in thread
From: Andrew Morton @ 2015-10-20 21:36 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Kirill A. Shutemov, linux-mm, linux-kernel, Hugh Dickins,
	Rik van Riel, Mel Gorman, Michal Hocko, Johannes Weiner,
	Vlastimil Babka

On Tue, 20 Oct 2015 16:21:09 +0900 Minchan Kim <minchan@kernel.org> wrote:

> 
> I reviewed THP refcount redesign patch and It seems below patch fixes
> MADV_FREE problem. It works well for hours.
> 
> >From 104a0940b4c0f97e61de9fee0fd602926ff28312 Mon Sep 17 00:00:00 2001
> From: Minchan Kim <minchan@kernel.org>
> Date: Tue, 20 Oct 2015 16:00:52 +0900
> Subject: [PATCH] mm: mark head page dirty in split_huge_page
> 
> In thp split in old THP refcount, we mappped all of pages
> (ie, head + tails) to pte_mkdirty and mark PG_flags to every
> tail pages.
> 
> But with THP refcount redesign, we can lose dirty bit in page table
> and PG_dirty for head page if we want to free the THP page using
> migration_entry.
> 
> It ends up discarding head page by madvise_free suddenly.
> This patch fixes it by mark the head page PG_dirty when VM splits
> the THP page.
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  mm/huge_memory.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index adccfb48ce57..7fbbd42554a1 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3258,6 +3258,7 @@ static void __split_huge_page(struct page *page, struct list_head *list)
>  	atomic_sub(tail_mapcount, &head->_count);
>  
>  	ClearPageCompound(head);
> +	SetPageDirty(head);
>  	spin_unlock_irq(&zone->lru_lock);
>  
>  	unfreeze_page(page_anon_vma(head), head);

This appears to be a bugfix against Kirill's "thp: reintroduce
split_huge_page()"?

Yes, __split_huge_page() is marking the tail pages dirty but forgot
about the head page.

You say "we can lose dirty bit in page table", but I don't see how the
above patch fixes that.


Why does __split_huge_page() unconditionally mark the pages dirty, btw?
Is it because the THP page was known to be dirty?  If so, the head
page already had PG_dirty, so this patch doesn't do anything.

freeze_page(), unfreeze_page() and their callees desperately need some
description of what they're doing.  Kirill, could you cook something up
please?



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/5] MADV_FREE refactoring and fix KSM page
  2015-10-20 21:36     ` Andrew Morton
@ 2015-10-20 22:43       ` Kirill A. Shutemov
  2015-10-21  5:11         ` Minchan Kim
  0 siblings, 1 reply; 26+ messages in thread
From: Kirill A. Shutemov @ 2015-10-20 22:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Minchan Kim, linux-mm, linux-kernel, Hugh Dickins, Rik van Riel,
	Mel Gorman, Michal Hocko, Johannes Weiner, Vlastimil Babka

On Tue, Oct 20, 2015 at 02:36:51PM -0700, Andrew Morton wrote:
> On Tue, 20 Oct 2015 16:21:09 +0900 Minchan Kim <minchan@kernel.org> wrote:
> 
> > 
> > I reviewed THP refcount redesign patch and It seems below patch fixes
> > MADV_FREE problem. It works well for hours.
> > 
> > >From 104a0940b4c0f97e61de9fee0fd602926ff28312 Mon Sep 17 00:00:00 2001
> > From: Minchan Kim <minchan@kernel.org>
> > Date: Tue, 20 Oct 2015 16:00:52 +0900
> > Subject: [PATCH] mm: mark head page dirty in split_huge_page
> > 
> > In thp split in old THP refcount, we mappped all of pages
> > (ie, head + tails) to pte_mkdirty and mark PG_flags to every
> > tail pages.
> > 
> > But with THP refcount redesign, we can lose dirty bit in page table
> > and PG_dirty for head page if we want to free the THP page using
> > migration_entry.
> > 
> > It ends up discarding head page by madvise_free suddenly.
> > This patch fixes it by mark the head page PG_dirty when VM splits
> > the THP page.
> > 
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  mm/huge_memory.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index adccfb48ce57..7fbbd42554a1 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -3258,6 +3258,7 @@ static void __split_huge_page(struct page *page, struct list_head *list)
> >  	atomic_sub(tail_mapcount, &head->_count);
> >  
> >  	ClearPageCompound(head);
> > +	SetPageDirty(head);
> >  	spin_unlock_irq(&zone->lru_lock);
> >  
> >  	unfreeze_page(page_anon_vma(head), head);
 
Sorry, I missed the email at first.

> This appears to be a bugfix against Kirill's "thp: reintroduce
> split_huge_page()"?
> 
> Yes, __split_huge_page() is marking the tail pages dirty but forgot
> about the head page
> 
> You say "we can lose dirty bit in page table" but I don't see how the
> above patch fixes that?

I think the problem is in unfreeze_page_vma(), where I missed dirtying
the pte.

> Why does __split_huge_page() unconditionally mark the pages dirty, btw?
> Is it because the THP page was known to be dirty?

THP doesn't have backing storage and cannot be swapped out without
splitting, so it is always dirty (the huge zero page is an exception,
I guess).

> If so, the head page already had PG_dirty, so this patch doesn't do
> anything.

PG_dirty appears on the struct page as a result of transferring the
dirty bit from the page tables. There's no guarantee that this has
happened.

> freeze_page(), unfreeze_page() and their callees desperately need some
> description of what they're doing.  Kirill, could you cook somethnig up
> please?

Minchan, could you test patch below instead?

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 86924cc34bac..ea1f3805afa3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3115,7 +3115,7 @@ static void unfreeze_page_vma(struct vm_area_struct *vma, struct page *page,
 
                entry = pte_mkold(mk_pte(page, vma->vm_page_prot));
                if (is_write_migration_entry(swp_entry))
-                       entry = maybe_mkwrite(entry, vma);
+                       entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 
                flush_dcache_page(page);
                set_pte_at(vma->vm_mm, address, pte + i, entry);
-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/5] MADV_FREE refactoring and fix KSM page
  2015-10-20 22:43       ` Kirill A. Shutemov
@ 2015-10-21  5:11         ` Minchan Kim
  2015-10-21  7:50           ` Kirill A. Shutemov
  0 siblings, 1 reply; 26+ messages in thread
From: Minchan Kim @ 2015-10-21  5:11 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, linux-mm, linux-kernel, Hugh Dickins,
	Rik van Riel, Mel Gorman, Michal Hocko, Johannes Weiner,
	Vlastimil Babka

On Wed, Oct 21, 2015 at 01:43:53AM +0300, Kirill A. Shutemov wrote:
> On Tue, Oct 20, 2015 at 02:36:51PM -0700, Andrew Morton wrote:
> > On Tue, 20 Oct 2015 16:21:09 +0900 Minchan Kim <minchan@kernel.org> wrote:
> > 
> > > 
> > > I reviewed the THP refcount redesign patch and it seems the patch
> > > below fixes the MADV_FREE problem. It has worked well for hours.
> > > 
> > > >From 104a0940b4c0f97e61de9fee0fd602926ff28312 Mon Sep 17 00:00:00 2001
> > > From: Minchan Kim <minchan@kernel.org>
> > > Date: Tue, 20 Oct 2015 16:00:52 +0900
> > > Subject: [PATCH] mm: mark head page dirty in split_huge_page
> > > 
> > > In the THP split with the old THP refcounting, we mapped all the
> > > pages (i.e., head + tails) with pte_mkdirty and set PG_dirty on
> > > every tail page.
> > > 
> > > But with the THP refcount redesign, we can lose the dirty bit in the
> > > page table and PG_dirty for the head page if we want to free the THP
> > > page using a migration entry.
> > > 
> > > It ends up with madvise_free suddenly discarding the head page.
> > > This patch fixes it by marking the head page PG_dirty when the VM
> > > splits the THP page.
> > > 
> > > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > > ---
> > >  mm/huge_memory.c | 1 +
> > >  1 file changed, 1 insertion(+)
> > > 
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index adccfb48ce57..7fbbd42554a1 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -3258,6 +3258,7 @@ static void __split_huge_page(struct page *page, struct list_head *list)
> > >  	atomic_sub(tail_mapcount, &head->_count);
> > >  
> > >  	ClearPageCompound(head);
> > > +	SetPageDirty(head);
> > >  	spin_unlock_irq(&zone->lru_lock);
> > >  
> > >  	unfreeze_page(page_anon_vma(head), head);
>  
> Sorry, I've missed the email at first.
> 
> > This appears to be a bugfix against Kirill's "thp: reintroduce
> > split_huge_page()"?
> > 
> > Yes, __split_huge_page() is marking the tail pages dirty but forgot
> > about the head page
> > 
> > You say "we can lose dirty bit in page table" but I don't see how the
> > above patch fixes that?
> 
> I think the problem is in unfreeze_page_vma(), where I missed dirtying
> the pte.
> 
> > Why does __split_huge_page() unconditionally mark the pages dirty, btw?
> > Is it because the THP page was known to be dirty?
> 
> THP doesn't have backing storage and cannot be swapped out without
> splitting, so it is always dirty (the huge zero page is an exception, I
> guess).

That's true for now, but for MADV_FREE I think we need more (e.g.,
is_dirty_migration_entry(), make_migration_entry(struct page *page,
int write, int dirty)) so we can keep the dirty bit of the pte rather
than making pages dirty unconditionally.

For example, we could call madvise_free on a THP page, so madvise_free
clears the dirty bit of the pmd instantly, without splitting the THP
(i.e., lazy split; maybe you suggested it, thanks!). Then, when the VM
tries to reclaim the THP page and splits it, every page will be marked
PG_dirty or pte_mkdirty even if there has been no write since then, so
madvise_free can never discard it although we could.

Anyway, it shouldn't be a show-stopper. It can be enhanced later and I
will look into it.


> 
> > If so, the head page already had PG_dirty, so this patch doesn't do
> > anything.
> 
> PG_dirty appears on struct page as a result of transferring the dirty
> bit from the page tables. There's no guarantee that this has happened.
> 
> > freeze_page(), unfreeze_page() and their callees desperately need some
> > description of what they're doing.  Kirill, could you cook something up
> > please?
> 
> Minchan, could you test patch below instead?

I think it will definitely work and is a more correct fix than mine
because it also covers split_huge_page_to_list()'s error path, i.e.,

                unfreeze_page(anon_vma, head);
                ret = -EBUSY;
        }


I will queue it to test machine.

..
Zzzz
..

After 2 hours, I haven't seen any problem so far, but I have a question below.

> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 86924cc34bac..ea1f3805afa3 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3115,7 +3115,7 @@ static void unfreeze_page_vma(struct vm_area_struct *vma, struct page *page,
>  
>                 entry = pte_mkold(mk_pte(page, vma->vm_page_prot));
>                 if (is_write_migration_entry(swp_entry))
> -                       entry = maybe_mkwrite(entry, vma);
> +                       entry = maybe_mkwrite(pte_mkdirty(entry), vma);

Why should we do pte_mkdirty only if is_write_migration_entry() is true?
Doesn't it lose the dirty bit again if someone changes the protection
from RW to R?

>  
>                 flush_dcache_page(page);
>                 set_pte_at(vma->vm_mm, address, pte + i, entry);
> -- 
>  Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/5] MADV_FREE refactoring and fix KSM page
  2015-10-21  5:11         ` Minchan Kim
@ 2015-10-21  7:50           ` Kirill A. Shutemov
  0 siblings, 0 replies; 26+ messages in thread
From: Kirill A. Shutemov @ 2015-10-21  7:50 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, linux-kernel, Hugh Dickins,
	Rik van Riel, Mel Gorman, Michal Hocko, Johannes Weiner,
	Vlastimil Babka

On Wed, Oct 21, 2015 at 02:11:39PM +0900, Minchan Kim wrote:
> On Wed, Oct 21, 2015 at 01:43:53AM +0300, Kirill A. Shutemov wrote:
> > On Tue, Oct 20, 2015 at 02:36:51PM -0700, Andrew Morton wrote:
> > > On Tue, 20 Oct 2015 16:21:09 +0900 Minchan Kim <minchan@kernel.org> wrote:
> > > 
> > > > 
> > > > I reviewed the THP refcount redesign patch and it seems the patch
> > > > below fixes the MADV_FREE problem. It has worked well for hours.
> > > > 
> > > > >From 104a0940b4c0f97e61de9fee0fd602926ff28312 Mon Sep 17 00:00:00 2001
> > > > From: Minchan Kim <minchan@kernel.org>
> > > > Date: Tue, 20 Oct 2015 16:00:52 +0900
> > > > Subject: [PATCH] mm: mark head page dirty in split_huge_page
> > > > 
> > > > In the THP split with the old THP refcounting, we mapped all the
> > > > pages (i.e., head + tails) with pte_mkdirty and set PG_dirty on
> > > > every tail page.
> > > > 
> > > > But with the THP refcount redesign, we can lose the dirty bit in the
> > > > page table and PG_dirty for the head page if we want to free the THP
> > > > page using a migration entry.
> > > > 
> > > > It ends up with madvise_free suddenly discarding the head page.
> > > > This patch fixes it by marking the head page PG_dirty when the VM
> > > > splits the THP page.
> > > > 
> > > > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > > > ---
> > > >  mm/huge_memory.c | 1 +
> > > >  1 file changed, 1 insertion(+)
> > > > 
> > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > > index adccfb48ce57..7fbbd42554a1 100644
> > > > --- a/mm/huge_memory.c
> > > > +++ b/mm/huge_memory.c
> > > > @@ -3258,6 +3258,7 @@ static void __split_huge_page(struct page *page, struct list_head *list)
> > > >  	atomic_sub(tail_mapcount, &head->_count);
> > > >  
> > > >  	ClearPageCompound(head);
> > > > +	SetPageDirty(head);
> > > >  	spin_unlock_irq(&zone->lru_lock);
> > > >  
> > > >  	unfreeze_page(page_anon_vma(head), head);
> >  
> > Sorry, I've missed the email at first.
> > 
> > > This appears to be a bugfix against Kirill's "thp: reintroduce
> > > split_huge_page()"?
> > > 
> > > Yes, __split_huge_page() is marking the tail pages dirty but forgot
> > > about the head page
> > > 
> > > You say "we can lose dirty bit in page table" but I don't see how the
> > > above patch fixes that?
> > 
> > I think the problem is in unfreeze_page_vma(), where I missed dirtying
> > the pte.
> > 
> > > Why does __split_huge_page() unconditionally mark the pages dirty, btw?
> > > Is it because the THP page was known to be dirty?
> > 
> > THP doesn't have backing storage and cannot be swapped out without
> > splitting, so it is always dirty (the huge zero page is an exception, I
> > guess).
> 
> That's true for now, but for MADV_FREE I think we need more (e.g.,
> is_dirty_migration_entry(), make_migration_entry(struct page *page,
> int write, int dirty)) so we can keep the dirty bit of the pte rather
> than making pages dirty unconditionally.

That means you need to find one more bit in swap entries. I'm not sure
it's possible on all architectures.

> 
> For example, we could call madvise_free on a THP page, so madvise_free
> clears the dirty bit of the pmd instantly, without splitting the THP
> (i.e., lazy split; maybe you suggested it, thanks!). Then, when the VM
> tries to reclaim the THP page and splits it, every page will be marked
> PG_dirty or pte_mkdirty even if there has been no write since then, so
> madvise_free can never discard it although we could.
> 
> Anyway, it shouldn't be a show-stopper. It can be enhanced later and I
> will look into it.
> 
> 
> > 
> > > If so, the head page already had PG_dirty, so this patch doesn't do
> > > anything.
> > 
> > PG_dirty appears on struct page as a result of transferring the dirty
> > bit from the page tables. There's no guarantee that this has happened.
> > 
> > > freeze_page(), unfreeze_page() and their callees desperately need some
> > > description of what they're doing.  Kirill, could you cook something up
> > > please?
> > 
> > Minchan, could you test patch below instead?
> 
> I think it will definitely work and is a more correct fix than mine
> because it also covers split_huge_page_to_list()'s error path, i.e.,
> 
>                 unfreeze_page(anon_vma, head);
>                 ret = -EBUSY;
>         }
> 
> 
> I will queue it to test machine.
> 
> ..
> Zzzz
> ..
> 
> After 2 hours, I haven't seen any problem so far, but I have a question
> below.
> 
> > 
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 86924cc34bac..ea1f3805afa3 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -3115,7 +3115,7 @@ static void unfreeze_page_vma(struct vm_area_struct *vma, struct page *page,
> >  
> >                 entry = pte_mkold(mk_pte(page, vma->vm_page_prot));
> >                 if (is_write_migration_entry(swp_entry))
> > -                       entry = maybe_mkwrite(entry, vma);
> > +                       entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> 
> Why should we do pte_mkdirty only if is_write_migration_entry() is true?
> Doesn't it lose the dirty bit again if someone changes the protection
> from RW to R?

2 a.m. is not an ideal time for patches. You are right. It needs to be
done unconditionally.

Andrew, could you fold the patch below into "thp: reintroduce
split_huge_page()" instead of the patch from Minchan?

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 86924cc34bac..f297baf8e793 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3114,6 +3114,7 @@ static void unfreeze_page_vma(struct vm_area_struct *vma, struct page *page,
 			continue;
 
 		entry = pte_mkold(mk_pte(page, vma->vm_page_prot));
+		entry = pte_mkdirty(entry);
 		if (is_write_migration_entry(swp_entry))
 			entry = maybe_mkwrite(entry, vma);
 
-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/5] mm: clear PG_dirty to mark page freeable
  2015-10-19  6:31 ` [PATCH 3/5] mm: clear PG_dirty to mark page freeable Minchan Kim
@ 2015-10-27  1:28   ` Hugh Dickins
  2015-10-27  6:50     ` Minchan Kim
  0 siblings, 1 reply; 26+ messages in thread
From: Hugh Dickins @ 2015-10-27  1:28 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, linux-kernel, Hugh Dickins,
	Rik van Riel, Mel Gorman, Michal Hocko, Johannes Weiner,
	Kirill A. Shutemov, Vlastimil Babka

On Mon, 19 Oct 2015, Minchan Kim wrote:

> Basically, MADV_FREE relies on the dirty bit in the page table entry
> to decide whether the VM is allowed to discard the page or not.
> IOW, if the page table entry has the dirty bit set, the VM shouldn't
> discard the page.
> 
> However, as an example, if a swap-in by read fault happens, the page
> table entry doesn't have the dirty bit set, so MADV_FREE could wrongly
> discard the page.
> 
> To avoid the problem, MADV_FREE did additional checks with PageDirty
> and PageSwapCache. That worked because a swapped-in page lives in the
> swap cache, and once it is evicted from the swap cache, the page has
> the PG_dirty flag. So the two page-flag checks effectively prevent
> wrong discarding by MADV_FREE.
> 
> However, a problem with the above logic is that a swapped-in page
> still has PG_dirty after it is removed from the swap cache, so the VM
> cannot consider the page freeable any more, even if madvise_free is
> called again in the future.
> 
> See the example below for details.
> 
>     ptr = malloc();
>     memset(ptr);
>     ..
>     ..
>     .. heavy memory pressure so all of pages are swapped out
>     ..
>     ..
>     var = *ptr; -> a page swapped-in and could be removed from
>                    swapcache. Then, page table doesn't mark
>                    dirty bit and page descriptor includes PG_dirty
>     ..
>     ..
>     madvise_free(ptr); -> It doesn't clear PG_dirty of the page.
>     ..
>     ..
>     ..
>     .. heavy memory pressure again.
>     .. In this time, VM cannot discard the page because the page
>     .. has *PG_dirty*
> 
> To solve the problem, this patch clears PG_dirty only if the page is
> owned exclusively by the current process when madvise is called,
> because PG_dirty represents pte dirtiness across several processes,
> so we can clear it only if we own the page exclusively.
> 
> Cc: Hugh Dickins <hughd@google.com>
> Signed-off-by: Minchan Kim <minchan@kernel.org>

Acked-by: Hugh Dickins <hughd@google.com>

(and patches 1/5 and 2/5 too if you like)

> ---
>  mm/madvise.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index fdfb14a78c60..5db546431285 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -312,11 +312,19 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  		if (!page)
>  			continue;
>  
> -		if (PageSwapCache(page)) {
> +		if (PageSwapCache(page) || PageDirty(page)) {
>  			if (!trylock_page(page))
>  				continue;
> +			/*
> +			 * If page is shared with others, we couldn't clear
> +			 * PG_dirty of the page.
> +			 */
> +			if (page_count(page) != 1 + !!PageSwapCache(page)) {
> +				unlock_page(page);
> +				continue;
> +			}
>  
> -			if (!try_to_free_swap(page)) {
> +			if (PageSwapCache(page) && !try_to_free_swap(page)) {
>  				unlock_page(page);
>  				continue;
>  			}
> -- 
> 1.9.1
> 
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/5] mm: simplify reclaim path for MADV_FREE
  2015-10-19  6:31 ` [PATCH 4/5] mm: simplify reclaim path for MADV_FREE Minchan Kim
@ 2015-10-27  2:09   ` Hugh Dickins
  2015-10-27  3:44     ` yalin wang
  2015-10-27  6:54     ` Minchan Kim
  0 siblings, 2 replies; 26+ messages in thread
From: Hugh Dickins @ 2015-10-27  2:09 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, linux-kernel, Hugh Dickins,
	Rik van Riel, Mel Gorman, Michal Hocko, Johannes Weiner,
	Kirill A. Shutemov, Vlastimil Babka

On Mon, 19 Oct 2015, Minchan Kim wrote:

> I made the reclaim path messy in order to check and free MADV_FREEed
> pages. This patch simplifies it by tweaking add_to_swap.
> 
> So far, we have marked a page PG_dirty when adding it to the swap
> cache (i.e., add_to_swap) to page it out to the swap device, but this
> patch moves the PG_dirty marking into try_to_unmap_one, at the point
> where we decide to change a pte from anon to a swap entry; so if any
> process's pte has a swap entry for the page, the page must be swapped
> out. IOW, there should be no functional behavior change. It makes the
> reclaim path really simple for MADV_FREE because we just need to check
> the page's PG_dirty to decide whether to discard it or not.
> 
> The other thing this patch does is pass TTU_BATCH_FLUSH to
> try_to_unmap when we handle a freeable page, because I don't see any
> reason to prevent it.
> 
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Signed-off-by: Minchan Kim <minchan@kernel.org>

Acked-by: Hugh Dickins <hughd@google.com>

This is sooooooo much nicer than the code it replaces!  Really good.
Kudos also to Hannes for suggesting this approach originally, I think.

I hope this implementation satisfies a good proportion of the people
who have been wanting MADV_FREE: I'm not among them, and have long
lost touch with those discussions, so won't judge how usable it is.

I assume you'll refactor the series again before it goes to Linus,
so the previous messier implementations vanish?  I notice Andrew
has this "mm: simplify reclaim path for MADV_FREE" in mmotm as
mm-dont-split-thp-page-when-syscall-is-called-fix-6.patch:
I guess it all got much too messy to divide up in a hurry.

I've noticed no problems in testing (unlike the first time you moved
to working with pte_dirty); though of course I've not been using
MADV_FREE itself at all.

One aspect has worried me for a while, but I think I've reached the
conclusion that it doesn't matter at all.  The swap that's allocated
in add_to_swap() would normally get freed again (after try_to_unmap
found it was a MADV_FREE !pte_dirty !PageDirty case) at the bottom
of shrink_page_list(), in __remove_mapping(), yes?

The bit that worried me is that on rare occasions, something unknown
might take a speculative reference to the page, and __remove_mapping()
fail to freeze refs for that reason.  Much too rare to worry over not
freeing that page immediately, but it leaves us with a PageUptodate
PageSwapCache !PageDirty page, yet its contents are not the contents
of that location on swap.

But since this can only happen when you have *not* inserted the
corresponding swapent anywhere, I cannot think of anything that would
have a legitimate interest in its contents matching that location on swap.
So I don't think it's worth looking for somewhere to add a SetPageDirty
(or a delete_from_swap_cache) just to regularize that case.

> ---
>  include/linux/rmap.h |  6 +----
>  mm/huge_memory.c     |  5 ----
>  mm/rmap.c            | 42 ++++++----------------------------
>  mm/swap_state.c      |  5 ++--
>  mm/vmscan.c          | 64 ++++++++++++++++------------------------------------
>  5 files changed, 30 insertions(+), 92 deletions(-)
> 
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 6b6233fafb53..978f65066fd5 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -193,8 +193,7 @@ static inline void page_dup_rmap(struct page *page, bool compound)
>   * Called from mm/vmscan.c to handle paging out
>   */
>  int page_referenced(struct page *, int is_locked,
> -			struct mem_cgroup *memcg, unsigned long *vm_flags,
> -			int *is_pte_dirty);
> +			struct mem_cgroup *memcg, unsigned long *vm_flags);
>  
>  #define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
>  
> @@ -272,11 +271,8 @@ int rmap_walk(struct page *page, struct rmap_walk_control *rwc);
>  static inline int page_referenced(struct page *page, int is_locked,
>  				  struct mem_cgroup *memcg,
>  				  unsigned long *vm_flags,
> -				  int *is_pte_dirty)
>  {
>  	*vm_flags = 0;
> -	if (is_pte_dirty)
> -		*is_pte_dirty = 0;
>  	return 0;
>  }
>  
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 269ed99493f0..adccfb48ce57 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1753,11 +1753,6 @@ pmd_t *page_check_address_pmd(struct page *page,
>  	return NULL;
>  }
>  
> -int pmd_freeable(pmd_t pmd)
> -{
> -	return !pmd_dirty(pmd);
> -}
> -
>  #define VM_NO_THP (VM_SPECIAL | VM_HUGETLB | VM_SHARED | VM_MAYSHARE)
>  
>  int hugepage_madvise(struct vm_area_struct *vma,
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 94ee372e238b..fd64f79c87c4 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -797,7 +797,6 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
>  }
>  
>  struct page_referenced_arg {
> -	int dirtied;
>  	int mapcount;
>  	int referenced;
>  	unsigned long vm_flags;
> @@ -812,7 +811,6 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
>  	struct mm_struct *mm = vma->vm_mm;
>  	spinlock_t *ptl;
>  	int referenced = 0;
> -	int dirty = 0;
>  	struct page_referenced_arg *pra = arg;
>  
>  	if (unlikely(PageTransHuge(page))) {
> @@ -835,14 +833,6 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
>  		if (pmdp_clear_flush_young_notify(vma, address, pmd))
>  			referenced++;
>  
> -		/*
> -		 * Use pmd_freeable instead of raw pmd_dirty because in some
> -		 * of architecture, pmd_dirty is not defined unless
> -		 * CONFIG_TRANSPARENT_HUGEPAGE is enabled
> -		 */
> -		if (!pmd_freeable(*pmd))
> -			dirty++;
> -
>  		spin_unlock(ptl);
>  	} else {
>  		pte_t *pte;
> @@ -873,9 +863,6 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
>  				referenced++;
>  		}
>  
> -		if (pte_dirty(*pte))
> -			dirty++;
> -
>  		pte_unmap_unlock(pte, ptl);
>  	}
>  
> @@ -889,9 +876,6 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
>  		pra->vm_flags |= vma->vm_flags;
>  	}
>  
> -	if (dirty)
> -		pra->dirtied++;
> -
>  	pra->mapcount--;
>  	if (!pra->mapcount)
>  		return SWAP_SUCCESS; /* To break the loop */
> @@ -916,7 +900,6 @@ static bool invalid_page_referenced_vma(struct vm_area_struct *vma, void *arg)
>   * @is_locked: caller holds lock on the page
>   * @memcg: target memory cgroup
>   * @vm_flags: collect encountered vma->vm_flags who actually referenced the page
> - * @is_pte_dirty: ptes which have marked dirty bit - used for lazyfree page
>   *
>   * Quick test_and_clear_referenced for all mappings to a page,
>   * returns the number of ptes which referenced the page.
> @@ -924,8 +907,7 @@ static bool invalid_page_referenced_vma(struct vm_area_struct *vma, void *arg)
>  int page_referenced(struct page *page,
>  		    int is_locked,
>  		    struct mem_cgroup *memcg,
> -		    unsigned long *vm_flags,
> -		    int *is_pte_dirty)
> +		    unsigned long *vm_flags)
>  {
>  	int ret;
>  	int we_locked = 0;
> @@ -940,8 +922,6 @@ int page_referenced(struct page *page,
>  	};
>  
>  	*vm_flags = 0;
> -	if (is_pte_dirty)
> -		*is_pte_dirty = 0;
>  
>  	if (!page_mapped(page))
>  		return 0;
> @@ -970,9 +950,6 @@ int page_referenced(struct page *page,
>  	if (we_locked)
>  		unlock_page(page);
>  
> -	if (is_pte_dirty)
> -		*is_pte_dirty = pra.dirtied;
> -
>  	return pra.referenced;
>  }
>  
> @@ -1453,17 +1430,10 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  		swp_entry_t entry = { .val = page_private(page) };
>  		pte_t swp_pte;
>  
> -		if (flags & TTU_FREE) {
> -			VM_BUG_ON_PAGE(PageSwapCache(page), page);
> -			if (!PageDirty(page)) {
> -				/* It's a freeable page by MADV_FREE */
> -				dec_mm_counter(mm, MM_ANONPAGES);
> -				goto discard;
> -			} else {
> -				set_pte_at(mm, address, pte, pteval);
> -				ret = SWAP_FAIL;
> -				goto out_unmap;
> -			}
> +		if (!PageDirty(page) && (flags & TTU_FREE)) {
> +			/* It's a freeable page by MADV_FREE */
> +			dec_mm_counter(mm, MM_ANONPAGES);
> +			goto discard;
>  		}
>  
>  		if (PageSwapCache(page)) {
> @@ -1476,6 +1446,8 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  				ret = SWAP_FAIL;
>  				goto out_unmap;
>  			}
> +			if (!PageDirty(page))
> +				SetPageDirty(page);
>  			if (list_empty(&mm->mmlist)) {
>  				spin_lock(&mmlist_lock);
>  				if (list_empty(&mm->mmlist))
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index d783872d746c..676ff2991380 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -185,13 +185,12 @@ int add_to_swap(struct page *page, struct list_head *list)
>  	 * deadlock in the swap out path.
>  	 */
>  	/*
> -	 * Add it to the swap cache and mark it dirty
> +	 * Add it to the swap cache.
>  	 */
>  	err = add_to_swap_cache(page, entry,
>  			__GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN);
>  
> -	if (!err) {	/* Success */
> -		SetPageDirty(page);
> +	if (!err) {
>  		return 1;
>  	} else {	/* -ENOMEM radix-tree allocation failure */
>  		/*
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 27d580b5e853..9b52ecf91194 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -791,17 +791,15 @@ enum page_references {
>  };
>  
>  static enum page_references page_check_references(struct page *page,
> -						  struct scan_control *sc,
> -						  bool *freeable)
> +						  struct scan_control *sc)
>  {
>  	int referenced_ptes, referenced_page;
>  	unsigned long vm_flags;
> -	int pte_dirty;
>  
>  	VM_BUG_ON_PAGE(!PageLocked(page), page);
>  
>  	referenced_ptes = page_referenced(page, 1, sc->target_mem_cgroup,
> -					  &vm_flags, &pte_dirty);
> +					  &vm_flags);
>  	referenced_page = TestClearPageReferenced(page);
>  
>  	/*
> @@ -842,10 +840,6 @@ static enum page_references page_check_references(struct page *page,
>  		return PAGEREF_KEEP;
>  	}
>  
> -	if (PageAnon(page) && !pte_dirty && !PageSwapCache(page) &&
> -			!PageDirty(page))
> -		*freeable = true;
> -
>  	/* Reclaim if clean, defer dirty pages to writeback */
>  	if (referenced_page && !PageSwapBacked(page))
>  		return PAGEREF_RECLAIM_CLEAN;
> @@ -1037,8 +1031,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  		}
>  
>  		if (!force_reclaim)
> -			references = page_check_references(page, sc,
> -							&freeable);
> +			references = page_check_references(page, sc);
>  
>  		switch (references) {
>  		case PAGEREF_ACTIVATE:
> @@ -1055,31 +1048,24 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  		 * Try to allocate it some swap space here.
>  		 */
>  		if (PageAnon(page) && !PageSwapCache(page)) {
> -			if (!freeable) {
> -				if (!(sc->gfp_mask & __GFP_IO))
> -					goto keep_locked;
> -				if (!add_to_swap(page, page_list))
> -					goto activate_locked;
> -				may_enter_fs = 1;
> -				/* Adding to swap updated mapping */
> -				mapping = page_mapping(page);
> -			} else {
> -				if (likely(!PageTransHuge(page)))
> -					goto unmap;
> -				/* try_to_unmap isn't aware of THP page */
> -				if (unlikely(split_huge_page_to_list(page,
> -								page_list)))
> -					goto keep_locked;
> -			}
> +			if (!(sc->gfp_mask & __GFP_IO))
> +				goto keep_locked;
> +			if (!add_to_swap(page, page_list))
> +				goto activate_locked;
> +			freeable = true;
> +			may_enter_fs = 1;
> +			/* Adding to swap updated mapping */
> +			mapping = page_mapping(page);
>  		}
> -unmap:
> +
>  		/*
>  		 * The page is mapped into the page tables of one or more
>  		 * processes. Try to unmap it here.
>  		 */
> -		if (page_mapped(page) && (mapping || freeable)) {
> +		if (page_mapped(page) && mapping) {
>  			switch (try_to_unmap(page, freeable ?
> -					TTU_FREE : ttu_flags|TTU_BATCH_FLUSH)) {
> +					ttu_flags | TTU_BATCH_FLUSH | TTU_FREE :
> +					ttu_flags | TTU_BATCH_FLUSH)) {
>  			case SWAP_FAIL:
>  				goto activate_locked;
>  			case SWAP_AGAIN:
> @@ -1087,20 +1073,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  			case SWAP_MLOCK:
>  				goto cull_mlocked;
>  			case SWAP_SUCCESS:
> -				/* try to free the page below */
> -				if (!freeable)
> -					break;
> -				/*
> -				 * Freeable anon page doesn't have mapping
> -				 * due to skipping of swapcache so we free
> -				 * page in here rather than __remove_mapping.
> -				 */
> -				VM_BUG_ON_PAGE(PageSwapCache(page), page);
> -				if (!page_freeze_refs(page, 1))
> -					goto keep_locked;
> -				__ClearPageLocked(page);
> -				count_vm_event(PGLAZYFREED);
> -				goto free_it;
> +				; /* try to free the page below */
>  			}
>  		}
>  
> @@ -1217,6 +1190,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  		 */
>  		__ClearPageLocked(page);
>  free_it:
> +		if (freeable && !PageDirty(page))
> +			count_vm_event(PGLAZYFREED);
> +
>  		nr_reclaimed++;
>  
>  		/*
> @@ -1847,7 +1823,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>  		}
>  
>  		if (page_referenced(page, 0, sc->target_mem_cgroup,
> -				    &vm_flags, NULL)) {
> +				    &vm_flags)) {
>  			nr_rotated += hpage_nr_pages(page);
>  			/*
>  			 * Identify referenced, file-backed active pages and
> -- 
> 1.9.1

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/5] mm: mark stable page dirty in KSM
  2015-10-19  6:31 ` [PATCH 5/5] mm: mark stable page dirty in KSM Minchan Kim
@ 2015-10-27  2:23   ` Hugh Dickins
  2015-10-27  6:58     ` Minchan Kim
  0 siblings, 1 reply; 26+ messages in thread
From: Hugh Dickins @ 2015-10-27  2:23 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, linux-kernel, Hugh Dickins,
	Rik van Riel, Mel Gorman, Michal Hocko, Johannes Weiner,
	Kirill A. Shutemov, Vlastimil Babka

On Mon, 19 Oct 2015, Minchan Kim wrote:

> A stable page can be shared by several processes, and one of them can
> end up owning the page alone after CoW or zapping has happened in
> every other process. Then the page table entry of the page in that
> last process can have neither the dirty bit nor the PG_dirty flag in
> page->flags. In this case, MADV_FREE could wrongly discard the page.
> To prevent this, we mark the stable page dirty.

I agree with the change, but found that comment (repeated in the source)
rather hard to follow.  And it doesn't really do justice to the changes
you have made.

This is not now a MADV_FREE thing, it's more general than that, even
if MADV_FREE is the only thing that takes advantage of it.  I like
very much that you've made page reclaim sane, freeing non-dirty
anonymous pages instead of swapping them out, without having to
think of whether it's for MADV_FREE or not.

Would you mind if we replace your patch by a re-commented version?

[PATCH] mm: mark stable page dirty in KSM

The MADV_FREE patchset changes page reclaim to simply free a clean
anonymous page with no dirty ptes, instead of swapping it out; but
KSM uses clean write-protected ptes to reference the stable ksm page.
So be sure to mark that page dirty, so it's never mistakenly discarded.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
---

 mm/ksm.c |    6 ++++++
 1 file changed, 6 insertions(+)

diff -puN mm/ksm.c~mm-mark-stable-page-dirty-in-ksm mm/ksm.c
--- a/mm/ksm.c~mm-mark-stable-page-dirty-in-ksm
+++ a/mm/ksm.c
@@ -1050,6 +1050,12 @@ static int try_to_merge_one_page(struct
 			 */
 			set_page_stable_node(page, NULL);
 			mark_page_accessed(page);
+			/*
+			 * Page reclaim just frees a clean page with no dirty
+			 * ptes: make sure that the ksm page would be swapped.
+			 */
+			if (!PageDirty(page))
+				SetPageDirty(page);
 			err = 0;
 		} else if (pages_identical(page, kpage))
 			err = replace_page(vma, page, kpage, orig_pte);

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/5] mm: simplify reclaim path for MADV_FREE
  2015-10-27  2:09   ` Hugh Dickins
@ 2015-10-27  3:44     ` yalin wang
  2015-10-27  7:09       ` Minchan Kim
  2015-10-27  6:54     ` Minchan Kim
  1 sibling, 1 reply; 26+ messages in thread
From: yalin wang @ 2015-10-27  3:44 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Minchan Kim, Andrew Morton, open list:MEMORY MANAGEMENT, lkml,
	Rik van Riel, Mel Gorman, Michal Hocko, Johannes Weiner,
	Kirill A. Shutemov, Vlastimil Babka


> On Oct 27, 2015, at 10:09, Hugh Dickins <hughd@google.com> wrote:
> 
> On Mon, 19 Oct 2015, Minchan Kim wrote:
> 
>> I made the reclaim path messy in order to check and free MADV_FREEed
>> pages.  This patch simplifies it by tweaking add_to_swap.
>> 
>> So far, we have marked a page PG_dirty when adding it to the swap
>> cache (ie, add_to_swap) so that it is paged out to the swap device,
>> but this patch moves the PG_dirty marking into try_to_unmap_one, at
>> the point we decide to change a pte from an anon entry to a swap
>> entry, so if any process's pte holds a swap entry for the page, the
>> page must be swapped out.  IOW, there should be no functional
>> behavior change.  It makes the reclaim path really simple for
>> MADV_FREE because we just need to check PG_dirty of the page to
>> decide whether to discard it or not.
>> 
>> The other thing this patch does is pass TTU_BATCH_FLUSH to
>> try_to_unmap when handling a freeable page, because I don't see any
>> reason not to.
>> 
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Signed-off-by: Minchan Kim <minchan@kernel.org>
> 
> Acked-by: Hugh Dickins <hughd@google.com>
> 
> This is sooooooo much nicer than the code it replaces!  Really good.
> Kudos also to Hannes for suggesting this approach originally, I think.
> 
> I hope this implementation satisfies a good proportion of the people
> who have been wanting MADV_FREE: I'm not among them, and have long
> lost touch with those discussions, so won't judge how usable it is.
> 
> I assume you'll refactor the series again before it goes to Linus,
> so the previous messier implementations vanish?  I notice Andrew
> has this "mm: simplify reclaim path for MADV_FREE" in mmotm as
> mm-dont-split-thp-page-when-syscall-is-called-fix-6.patch:
> I guess it all got much too messy to divide up in a hurry.
> 
> I've noticed no problems in testing (unlike the first time you moved
> to working with pte_dirty); though of course I've not been using
> MADV_FREE itself at all.
> 
> One aspect has worried me for a while, but I think I've reached the
> conclusion that it doesn't matter at all.  The swap that's allocated
> in add_to_swap() would normally get freed again (after try_to_unmap
> found it was a MADV_FREE !pte_dirty !PageDirty case) at the bottom
> of shrink_page_list(), in __remove_mapping(), yes?
> 
> The bit that worried me is that on rare occasions, something unknown
> might take a speculative reference to the page, and __remove_mapping()
> fail to freeze refs for that reason.  Much too rare to worry over not
> freeing that page immediately, but it leaves us with a PageUptodate
> PageSwapCache !PageDirty page, yet its contents are not the contents
> of that location on swap.
> 
> But since this can only happen when you have *not* inserted the
> corresponding swapent anywhere, I cannot think of anything that would
> have a legitimate interest in its contents matching that location on swap.
> So I don't think it's worth looking for somewhere to add a SetPageDirty
> (or a delete_from_swap_cache) just to regularize that case.
> 
>> ---
>> include/linux/rmap.h |  6 +----
>> mm/huge_memory.c     |  5 ----
>> mm/rmap.c            | 42 ++++++----------------------------
>> mm/swap_state.c      |  5 ++--
>> mm/vmscan.c          | 64 ++++++++++++++++------------------------------------
>> 5 files changed, 30 insertions(+), 92 deletions(-)
>> 
>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>> index 6b6233fafb53..978f65066fd5 100644
>> --- a/include/linux/rmap.h
>> +++ b/include/linux/rmap.h
>> @@ -193,8 +193,7 @@ static inline void page_dup_rmap(struct page *page, bool compound)
>>  * Called from mm/vmscan.c to handle paging out
>>  */
>> int page_referenced(struct page *, int is_locked,
>> -			struct mem_cgroup *memcg, unsigned long *vm_flags,
>> -			int *is_pte_dirty);
>> +			struct mem_cgroup *memcg, unsigned long *vm_flags);
>> 
>> #define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
>> 
>> @@ -272,11 +271,8 @@ int rmap_walk(struct page *page, struct rmap_walk_control *rwc);
>> static inline int page_referenced(struct page *page, int is_locked,
>> 				  struct mem_cgroup *memcg,
>> 				  unsigned long *vm_flags,
>> -				  int *is_pte_dirty)
>> {
>> 	*vm_flags = 0;
>> -	if (is_pte_dirty)
>> -		*is_pte_dirty = 0;
>> 	return 0;
>> }
>> 
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 269ed99493f0..adccfb48ce57 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1753,11 +1753,6 @@ pmd_t *page_check_address_pmd(struct page *page,
>> 	return NULL;
>> }
>> 
>> -int pmd_freeable(pmd_t pmd)
>> -{
>> -	return !pmd_dirty(pmd);
>> -}
>> -
>> #define VM_NO_THP (VM_SPECIAL | VM_HUGETLB | VM_SHARED | VM_MAYSHARE)
>> 
>> int hugepage_madvise(struct vm_area_struct *vma,
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 94ee372e238b..fd64f79c87c4 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -797,7 +797,6 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
>> }
>> 
>> struct page_referenced_arg {
>> -	int dirtied;
>> 	int mapcount;
>> 	int referenced;
>> 	unsigned long vm_flags;
>> @@ -812,7 +811,6 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
>> 	struct mm_struct *mm = vma->vm_mm;
>> 	spinlock_t *ptl;
>> 	int referenced = 0;
>> -	int dirty = 0;
>> 	struct page_referenced_arg *pra = arg;
>> 
>> 	if (unlikely(PageTransHuge(page))) {
>> @@ -835,14 +833,6 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
>> 		if (pmdp_clear_flush_young_notify(vma, address, pmd))
>> 			referenced++;
>> 
>> -		/*
>> -		 * Use pmd_freeable instead of raw pmd_dirty because in some
>> -		 * of architecture, pmd_dirty is not defined unless
>> -		 * CONFIG_TRANSPARENT_HUGEPAGE is enabled
>> -		 */
>> -		if (!pmd_freeable(*pmd))
>> -			dirty++;
>> -
>> 		spin_unlock(ptl);
>> 	} else {
>> 		pte_t *pte;
>> @@ -873,9 +863,6 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
>> 				referenced++;
>> 		}
>> 
>> -		if (pte_dirty(*pte))
>> -			dirty++;
>> -
>> 		pte_unmap_unlock(pte, ptl);
>> 	}
>> 
>> @@ -889,9 +876,6 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
>> 		pra->vm_flags |= vma->vm_flags;
>> 	}
>> 
>> -	if (dirty)
>> -		pra->dirtied++;
>> -
>> 	pra->mapcount--;
>> 	if (!pra->mapcount)
>> 		return SWAP_SUCCESS; /* To break the loop */
>> @@ -916,7 +900,6 @@ static bool invalid_page_referenced_vma(struct vm_area_struct *vma, void *arg)
>>  * @is_locked: caller holds lock on the page
>>  * @memcg: target memory cgroup
>>  * @vm_flags: collect encountered vma->vm_flags who actually referenced the page
>> - * @is_pte_dirty: ptes which have marked dirty bit - used for lazyfree page
>>  *
>>  * Quick test_and_clear_referenced for all mappings to a page,
>>  * returns the number of ptes which referenced the page.
>> @@ -924,8 +907,7 @@ static bool invalid_page_referenced_vma(struct vm_area_struct *vma, void *arg)
>> int page_referenced(struct page *page,
>> 		    int is_locked,
>> 		    struct mem_cgroup *memcg,
>> -		    unsigned long *vm_flags,
>> -		    int *is_pte_dirty)
>> +		    unsigned long *vm_flags)
>> {
>> 	int ret;
>> 	int we_locked = 0;
>> @@ -940,8 +922,6 @@ int page_referenced(struct page *page,
>> 	};
>> 
>> 	*vm_flags = 0;
>> -	if (is_pte_dirty)
>> -		*is_pte_dirty = 0;
>> 
>> 	if (!page_mapped(page))
>> 		return 0;
>> @@ -970,9 +950,6 @@ int page_referenced(struct page *page,
>> 	if (we_locked)
>> 		unlock_page(page);
>> 
>> -	if (is_pte_dirty)
>> -		*is_pte_dirty = pra.dirtied;
>> -
>> 	return pra.referenced;
>> }
>> 
>> @@ -1453,17 +1430,10 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>> 		swp_entry_t entry = { .val = page_private(page) };
>> 		pte_t swp_pte;
>> 
>> -		if (flags & TTU_FREE) {
>> -			VM_BUG_ON_PAGE(PageSwapCache(page), page);
>> -			if (!PageDirty(page)) {
>> -				/* It's a freeable page by MADV_FREE */
>> -				dec_mm_counter(mm, MM_ANONPAGES);
>> -				goto discard;
>> -			} else {
>> -				set_pte_at(mm, address, pte, pteval);
>> -				ret = SWAP_FAIL;
>> -				goto out_unmap;
>> -			}
>> +		if (!PageDirty(page) && (flags & TTU_FREE)) {
>> +			/* It's a freeable page by MADV_FREE */
>> +			dec_mm_counter(mm, MM_ANONPAGES);
>> +			goto discard;
>> 		}
>> 
>> 		if (PageSwapCache(page)) {
>> @@ -1476,6 +1446,8 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>> 				ret = SWAP_FAIL;
>> 				goto out_unmap;
>> 			}
>> +			if (!PageDirty(page))
>> +				SetPageDirty(page);
>> 			if (list_empty(&mm->mmlist)) {
>> 				spin_lock(&mmlist_lock);
>> 				if (list_empty(&mm->mmlist))
>> diff --git a/mm/swap_state.c b/mm/swap_state.c
>> index d783872d746c..676ff2991380 100644
>> --- a/mm/swap_state.c
>> +++ b/mm/swap_state.c
>> @@ -185,13 +185,12 @@ int add_to_swap(struct page *page, struct list_head *list)
>> 	 * deadlock in the swap out path.
>> 	 */
>> 	/*
>> -	 * Add it to the swap cache and mark it dirty
>> +	 * Add it to the swap cache.
>> 	 */
>> 	err = add_to_swap_cache(page, entry,
>> 			__GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN);
>> 
>> -	if (!err) {	/* Success */
>> -		SetPageDirty(page);
>> +	if (!err) {
>> 		return 1;
>> 	} else {	/* -ENOMEM radix-tree allocation failure */
>> 		/*
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 27d580b5e853..9b52ecf91194 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -791,17 +791,15 @@ enum page_references {
>> };
>> 
>> static enum page_references page_check_references(struct page *page,
>> -						  struct scan_control *sc,
>> -						  bool *freeable)
>> +						  struct scan_control *sc)
>> {
>> 	int referenced_ptes, referenced_page;
>> 	unsigned long vm_flags;
>> -	int pte_dirty;
>> 
>> 	VM_BUG_ON_PAGE(!PageLocked(page), page);
>> 
>> 	referenced_ptes = page_referenced(page, 1, sc->target_mem_cgroup,
>> -					  &vm_flags, &pte_dirty);
>> +					  &vm_flags);
>> 	referenced_page = TestClearPageReferenced(page);
>> 
>> 	/*
>> @@ -842,10 +840,6 @@ static enum page_references page_check_references(struct page *page,
>> 		return PAGEREF_KEEP;
>> 	}
>> 
>> -	if (PageAnon(page) && !pte_dirty && !PageSwapCache(page) &&
>> -			!PageDirty(page))
>> -		*freeable = true;
>> -
>> 	/* Reclaim if clean, defer dirty pages to writeback */
>> 	if (referenced_page && !PageSwapBacked(page))
>> 		return PAGEREF_RECLAIM_CLEAN;
>> @@ -1037,8 +1031,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>> 		}
>> 
>> 		if (!force_reclaim)
>> -			references = page_check_references(page, sc,
>> -							&freeable);
>> +			references = page_check_references(page, sc);
>> 
>> 		switch (references) {
>> 		case PAGEREF_ACTIVATE:
>> @@ -1055,31 +1048,24 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>> 		 * Try to allocate it some swap space here.
>> 		 */
>> 		if (PageAnon(page) && !PageSwapCache(page)) {
>> -			if (!freeable) {
>> -				if (!(sc->gfp_mask & __GFP_IO))
>> -					goto keep_locked;
>> -				if (!add_to_swap(page, page_list))
>> -					goto activate_locked;
>> -				may_enter_fs = 1;
>> -				/* Adding to swap updated mapping */
>> -				mapping = page_mapping(page);
>> -			} else {
>> -				if (likely(!PageTransHuge(page)))
>> -					goto unmap;
>> -				/* try_to_unmap isn't aware of THP page */
>> -				if (unlikely(split_huge_page_to_list(page,
>> -								page_list)))
>> -					goto keep_locked;
>> -			}
>> +			if (!(sc->gfp_mask & __GFP_IO))
>> +				goto keep_locked;
>> +			if (!add_to_swap(page, page_list))
>> +				goto activate_locked;
>> +			freeable = true;
>> +			may_enter_fs = 1;
>> +			/* Adding to swap updated mapping */
>> +			mapping = page_mapping(page);
>> 		}
>> -unmap:
>> +
>> 		/*
>> 		 * The page is mapped into the page tables of one or more
>> 		 * processes. Try to unmap it here.
>> 		 */
>> -		if (page_mapped(page) && (mapping || freeable)) {
>> +		if (page_mapped(page) && mapping) {
>> 			switch (try_to_unmap(page, freeable ?
>> -					TTU_FREE : ttu_flags|TTU_BATCH_FLUSH)) {
>> +					ttu_flags | TTU_BATCH_FLUSH | TTU_FREE :
>> +					ttu_flags | TTU_BATCH_FLUSH)) {
>> 			case SWAP_FAIL:
>> 				goto activate_locked;
>> 			case SWAP_AGAIN:
>> @@ -1087,20 +1073,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>> 			case SWAP_MLOCK:
>> 				goto cull_mlocked;
>> 			case SWAP_SUCCESS:
>> -				/* try to free the page below */
>> -				if (!freeable)
>> -					break;
>> -				/*
>> -				 * Freeable anon page doesn't have mapping
>> -				 * due to skipping of swapcache so we free
>> -				 * page in here rather than __remove_mapping.
>> -				 */
>> -				VM_BUG_ON_PAGE(PageSwapCache(page), page);
>> -				if (!page_freeze_refs(page, 1))
>> -					goto keep_locked;
>> -				__ClearPageLocked(page);
>> -				count_vm_event(PGLAZYFREED);
>> -				goto free_it;
>> +				; /* try to free the page below */
>> 			}
>> 		}
>> 
>> @@ -1217,6 +1190,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>> 		 */
>> 		__ClearPageLocked(page);
>> free_it:
>> +		if (freeable && !PageDirty(page))
>> +			count_vm_event(PGLAZYFREED);
>> +
>> 		nr_reclaimed++;
>> 
>> 		/*
>> @@ -1847,7 +1823,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>> 		}
>> 
>> 		if (page_referenced(page, 0, sc->target_mem_cgroup,
>> -				    &vm_flags, NULL)) {
>> +				    &vm_flags)) {
>> 			nr_rotated += hpage_nr_pages(page);
>> 			/*
>> 			 * Identify referenced, file-backed active pages and
>> -- 
>> 1.9.1
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
It is wrong here to only check PageDirty() to decide whether the page is freeable or not.
An anon page can be shared by multiple processes (_mapcount > 1),
so you must check every pte's dirty bit during the page_referenced() function;
see this mail thread:
http://ns1.ske-art.com/lists/kernel/msg1934021.html
Thanks

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/5] mm: clear PG_dirty to mark page freeable
  2015-10-27  1:28   ` Hugh Dickins
@ 2015-10-27  6:50     ` Minchan Kim
  0 siblings, 0 replies; 26+ messages in thread
From: Minchan Kim @ 2015-10-27  6:50 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, linux-mm, linux-kernel, Rik van Riel, Mel Gorman,
	Michal Hocko, Johannes Weiner, Kirill A. Shutemov,
	Vlastimil Babka

On Mon, Oct 26, 2015 at 06:28:13PM -0700, Hugh Dickins wrote:
> On Mon, 19 Oct 2015, Minchan Kim wrote:
> 
> > Basically, MADV_FREE relies on the dirty bit in the page table
> > entry to decide whether the VM is allowed to discard the page.
> > IOW, if the page table entry has the dirty bit set, the VM
> > shouldn't discard the page.
> > 
> > However, as an example, if a swap-in happens via a read fault, the
> > page table entry doesn't have the dirty bit set, so MADV_FREE could
> > wrongly discard the page.
> > 
> > To avoid the problem, MADV_FREE did extra checks on PageDirty and
> > PageSwapCache.  That worked because a swapped-in page lives in the
> > swap cache, and once it is evicted from the swap cache, the page
> > has the PG_dirty flag.  So checking both page flags effectively
> > prevents wrong discarding by MADV_FREE.
> > 
> > However, a problem with the above logic is that a swapped-in page
> > still has PG_dirty after it is removed from the swap cache, so the
> > VM can no longer consider the page freeable, even if madvise_free
> > is called later.
> > 
> > Look at the example below for detail.
> > 
> >     ptr = malloc();
> >     memset(ptr);
> >     ..
> >     ..
> >     .. heavy memory pressure so all of pages are swapped out
> >     ..
> >     ..
> >     var = *ptr; -> a page swapped-in and could be removed from
> >                    swapcache. Then, page table doesn't mark
> >                    dirty bit and page descriptor includes PG_dirty
> >     ..
> >     ..
> >     madvise_free(ptr); -> It doesn't clear PG_dirty of the page.
> >     ..
> >     ..
> >     ..
> >     .. heavy memory pressure again.
> >     .. In this time, VM cannot discard the page because the page
> >     .. has *PG_dirty*
> > 
> > To solve the problem, this patch clears PG_dirty only if the page
> > is owned exclusively by the current process when madvise is called,
> > because PG_dirty reflects the ptes' dirtiness across several
> > processes, so we may clear it only when we own the page exclusively.
> > 
> > Cc: Hugh Dickins <hughd@google.com>
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> 
> Acked-by: Hugh Dickins <hughd@google.com>
> 
> (and patches 1/5 and 2/5 too if you like)
> 

Thanks for the review, Hugh!

I will rebase the whole series from the beginning as you suggested
and will add your Acked-by, because I feel you have reviewed
every line of the MADV_FREE code and found no problem.

If something goes wrong (ie, I abuse your Acked-by), please
shout at me.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/5] mm: simplify reclaim path for MADV_FREE
  2015-10-27  2:09   ` Hugh Dickins
  2015-10-27  3:44     ` yalin wang
@ 2015-10-27  6:54     ` Minchan Kim
  1 sibling, 0 replies; 26+ messages in thread
From: Minchan Kim @ 2015-10-27  6:54 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, linux-mm, linux-kernel, Rik van Riel, Mel Gorman,
	Michal Hocko, Johannes Weiner, Kirill A. Shutemov,
	Vlastimil Babka

On Mon, Oct 26, 2015 at 07:09:15PM -0700, Hugh Dickins wrote:
> On Mon, 19 Oct 2015, Minchan Kim wrote:
> 
> > I made the reclaim path messy in order to check and free MADV_FREEed
> > pages.  This patch simplifies it by tweaking add_to_swap.
> > 
> > So far, we have marked a page PG_dirty when adding it to the swap
> > cache (ie, add_to_swap) so that it is paged out to the swap device,
> > but this patch moves the PG_dirty marking into try_to_unmap_one, at
> > the point we decide to change a pte from an anon entry to a swap
> > entry, so if any process's pte holds a swap entry for the page, the
> > page must be swapped out.  IOW, there should be no functional
> > behavior change.  It makes the reclaim path really simple for
> > MADV_FREE because we just need to check PG_dirty of the page to
> > decide whether to discard it or not.
> > 
> > The other thing this patch does is pass TTU_BATCH_FLUSH to
> > try_to_unmap when handling a freeable page, because I don't see any
> > reason not to.
> > 
> > Cc: Hugh Dickins <hughd@google.com>
> > Cc: Mel Gorman <mgorman@suse.de>
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> 
> Acked-by: Hugh Dickins <hughd@google.com>
> 
> This is sooooooo much nicer than the code it replaces!  Really good.

Thanks!

> Kudos also to Hannes for suggesting this approach originally, I think.

I should buy Hannes beer or soju if he likes.

> 
> I hope this implementation satisfies a good proportion of the people
> who have been wanting MADV_FREE: I'm not among them, and have long
> lost touch with those discussions, so won't judge how usable it is.
> 
> I assume you'll refactor the series again before it goes to Linus,
> so the previous messier implementations vanish?  I notice Andrew

Actually, I didn't think about that, but once you mentioned it,
I realized that would be better. Thanks for the suggestion.

> has this "mm: simplify reclaim path for MADV_FREE" in mmotm as
> mm-dont-split-thp-page-when-syscall-is-called-fix-6.patch:
> I guess it all got much too messy to divide up in a hurry.

Yep, I will rebase the whole series from the beginning based on recent
mmotm, so the mess will vanish from git-blame.

When I rebase it in mmotm, I will do it before the new THP refcount
design lands, if Andrew and Kirill don't mind, because that design
makes my test fail, as I reported. I don't know whether it's a
long-standing unknown bug or something the new THP refcounting
introduces. Anyway, I want to test smoothly.

> 
> I've noticed no problems in testing (unlike the first time you moved
> to working with pte_dirty); though of course I've not been using

Thanks for testing!

> MADV_FREE itself at all.
> 
> One aspect has worried me for a while, but I think I've reached the
> conclusion that it doesn't matter at all.  The swap that's allocated
> in add_to_swap() would normally get freed again (after try_to_unmap
> found it was a MADV_FREE !pte_dirty !PageDirty case) at the bottom
> of shrink_page_list(), in __remove_mapping(), yes?

Right.

> 
> The bit that worried me is that on rare occasions, something unknown
> might take a speculative reference to the page, and __remove_mapping()
> fail to freeze refs for that reason.  Much too rare to worry over not
> freeing that page immediately, but it leaves us with a PageUptodate
> PageSwapCache !PageDirty page, yet its contents are not the contents
> of that location on swap.
> 
> But since this can only happen when you have *not* inserted the
> corresponding swapent anywhere, I cannot think of anything that would
> have a legitimate interest in its contents matching that location on swap.
> So I don't think it's worth looking for somewhere to add a SetPageDirty
> (or a delete_from_swap_cache) just to regularize that case.


Exactly.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/5] mm: mark stable page dirty in KSM
  2015-10-27  2:23   ` Hugh Dickins
@ 2015-10-27  6:58     ` Minchan Kim
  0 siblings, 0 replies; 26+ messages in thread
From: Minchan Kim @ 2015-10-27  6:58 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, linux-mm, linux-kernel, Rik van Riel, Mel Gorman,
	Michal Hocko, Johannes Weiner, Kirill A. Shutemov,
	Vlastimil Babka

On Mon, Oct 26, 2015 at 07:23:12PM -0700, Hugh Dickins wrote:
> On Mon, 19 Oct 2015, Minchan Kim wrote:
> 
> > Stable page could be shared by several processes and last process
> > could own the page among them after CoW or zapping for every process
> > except last process happens. Then, page table entry of the page
> > in last process can have no dirty bit and PG_dirty flag in page->flags.
> > In this case, MADV_FREE could discard the page wrongly.
> > For preventing it, we mark stable page dirty.
> 
> I agree with the change, but found that comment (repeated in the source)
> rather hard to follow.  And it doesn't really do justice to the changes
> you have made.
> 
> This is not now a MADV_FREE thing, it's more general than that, even
> if MADV_FREE is the only thing that takes advantage of it.  I like
> very much that you've made page reclaim sane, freeing non-dirty
> anonymous pages instead of swapping them out, without having to
> think of whether it's for MADV_FREE or not.
> 
> Would you mind if we replace your patch by a re-commented version?
> 
> [PATCH] mm: mark stable page dirty in KSM
> 
> The MADV_FREE patchset changes page reclaim to simply free a clean
> anonymous page with no dirty ptes, instead of swapping it out; but
> KSM uses clean write-protected ptes to reference the stable ksm page.
> So be sure to mark that page dirty, so it's never mistakenly discarded.
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> Signed-off-by: Hugh Dickins <hughd@google.com>

Looks better than mine.
I will include this in my patchset when I respin.

Thanks!

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/5] mm: simplify reclaim path for MADV_FREE
  2015-10-27  3:44     ` yalin wang
@ 2015-10-27  7:09       ` Minchan Kim
  2015-10-27  7:39         ` yalin wang
  0 siblings, 1 reply; 26+ messages in thread
From: Minchan Kim @ 2015-10-27  7:09 UTC (permalink / raw)
  To: yalin wang
  Cc: Hugh Dickins, Andrew Morton, open list:MEMORY MANAGEMENT, lkml,
	Rik van Riel, Mel Gorman, Michal Hocko, Johannes Weiner,
	Kirill A. Shutemov, Vlastimil Babka

Hello Yalin,

Sorry for missing you in the Cc list.
IIRC, mails sent to your previous mail address (Yalin.Wang@sonymobile.com)
were returned.

On Tue, Oct 27, 2015 at 11:44:09AM +0800, yalin wang wrote:
> 
> > On Oct 27, 2015, at 10:09, Hugh Dickins <hughd@google.com> wrote:
> > 
> > On Mon, 19 Oct 2015, Minchan Kim wrote:
> > 
> >> I made the reclaim path messy in order to check and free MADV_FREEed
> >> pages.  This patch simplifies it by tweaking add_to_swap.
> >> 
> >> So far, we have marked a page PG_dirty when adding it to the swap
> >> cache (ie, add_to_swap) so that it is paged out to the swap device,
> >> but this patch moves the PG_dirty marking into try_to_unmap_one, at
> >> the point we decide to change a pte from an anon entry to a swap
> >> entry, so if any process's pte holds a swap entry for the page, the
> >> page must be swapped out.  IOW, there should be no functional
> >> behavior change.  It makes the reclaim path really simple for
> >> MADV_FREE because we just need to check PG_dirty of the page to
> >> decide whether to discard it or not.
> >> 
> >> The other thing this patch does is pass TTU_BATCH_FLUSH to
> >> try_to_unmap when handling a freeable page, because I don't see any
> >> reason not to.
> >> 
> >> Cc: Hugh Dickins <hughd@google.com>
> >> Cc: Mel Gorman <mgorman@suse.de>
> >> Signed-off-by: Minchan Kim <minchan@kernel.org>
> > 
> > Acked-by: Hugh Dickins <hughd@google.com>
> > 
> > This is sooooooo much nicer than the code it replaces!  Really good.
> > Kudos also to Hannes for suggesting this approach originally, I think.
> > 
> > I hope this implementation satisfies a good proportion of the people
> > who have been wanting MADV_FREE: I'm not among them, and have long
> > lost touch with those discussions, so won't judge how usable it is.
> > 
> > I assume you'll refactor the series again before it goes to Linus,
> > so the previous messier implementations vanish?  I notice Andrew
> > has this "mm: simplify reclaim path for MADV_FREE" in mmotm as
> > mm-dont-split-thp-page-when-syscall-is-called-fix-6.patch:
> > I guess it all got much too messy to divide up in a hurry.
> > 
> > I've noticed no problems in testing (unlike the first time you moved
> > to working with pte_dirty); though of course I've not been using
> > MADV_FREE itself at all.
> > 
> > One aspect has worried me for a while, but I think I've reached the
> > conclusion that it doesn't matter at all.  The swap that's allocated
> > in add_to_swap() would normally get freed again (after try_to_unmap
> > found it was a MADV_FREE !pte_dirty !PageDirty case) at the bottom
> > of shrink_page_list(), in __remove_mapping(), yes?
> > 
> > The bit that worried me is that on rare occasions, something unknown
> > might take a speculative reference to the page, and __remove_mapping()
> > fail to freeze refs for that reason.  Much too rare to worry over not
> > freeing that page immediately, but it leaves us with a PageUptodate
> > PageSwapCache !PageDirty page, yet its contents are not the contents
> > of that location on swap.
> > 
> > But since this can only happen when you have *not* inserted the
> > corresponding swapent anywhere, I cannot think of anything that would
> > have a legitimate interest in its contents matching that location on swap.
> > So I don't think it's worth looking for somewhere to add a SetPageDirty
> > (or a delete_from_swap_cache) just to regularize that case.
> > 
> >> ---
> >> include/linux/rmap.h |  6 +----
> >> mm/huge_memory.c     |  5 ----
> >> mm/rmap.c            | 42 ++++++----------------------------
> >> mm/swap_state.c      |  5 ++--
> >> mm/vmscan.c          | 64 ++++++++++++++++------------------------------------
> >> 5 files changed, 30 insertions(+), 92 deletions(-)
> >> 

<snip>

You added your comment at the bottom, so I'm not sure which PageDirty check you meant.

> It is wrong here to only check PageDirty() to decide whether the page is freeable or not.
> An anon page can be shared by multiple processes (_mapcount > 1),
> so you must check every pte's dirty bit during the page_referenced() function;
> see this mail thread:
> http://ns1.ske-art.com/lists/kernel/msg1934021.html

If one pte among the processes sharing the page was dirty, the
dirtiness should be propagated from the pte to PG_dirty by
try_to_unmap_one. IOW, if the page doesn't have the PG_dirty flag,
it means every process did MADV_FREE.

Am I missing something in your question?
If so, could you show the exact scenario I am missing?

Thanks for the interest.


> Thanks
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/5] mm: simplify reclaim path for MADV_FREE
  2015-10-27  7:09       ` Minchan Kim
@ 2015-10-27  7:39         ` yalin wang
  2015-10-27  8:10           ` Minchan Kim
  0 siblings, 1 reply; 26+ messages in thread
From: yalin wang @ 2015-10-27  7:39 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Hugh Dickins, Andrew Morton, open list:MEMORY MANAGEMENT, lkml,
	Rik van Riel, Mel Gorman, Michal Hocko, Johannes Weiner,
	Kirill A. Shutemov, Vlastimil Babka


> On Oct 27, 2015, at 15:09, Minchan Kim <minchan@kernel.org> wrote:
> 
> Hello Yalin,
> 
> Sorry for missing you in the Cc list.
> IIRC, mails sent to your previous mail address (Yalin.Wang@sonymobile.com)
> were returned.
> 
> You added your comment at the bottom, so I'm not sure which PageDirty check you meant.
> 
>> It is wrong here to only check PageDirty() to decide whether the page is freeable or not.
>> An anon page can be shared by multiple processes (_mapcount > 1),
>> so you must check every pte's dirty bit during the page_referenced() function;
>> see this mail thread:
>> http://ns1.ske-art.com/lists/kernel/msg1934021.html
> 
> If one pte among the processes sharing the page was dirty, the
> dirtiness should be propagated from the pte to PG_dirty by
> try_to_unmap_one. IOW, if the page doesn't have the PG_dirty flag,
> it means every process did MADV_FREE.
> 
> Am I missing something in your question?
> If so, could you show the exact scenario I am missing?
> 
> Thanks for the interest.
Oh, yeah, that is right, I missed that: pte_dirty will propagate to
PG_dirty, so that is correct.
Generally speaking, this patch moves SetPageDirty() from add_to_swap()
to try_to_unmap(); I think this patch can be changed a little:

@@ -1476,6 +1446,8 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
				ret = SWAP_FAIL;
				goto out_unmap;
			}
+			if (!PageDirty(page))
+				SetPageDirty(page);
			if (list_empty(&mm->mmlist)) {
				spin_lock(&mmlist_lock);
				if (list_empty(&mm->mmlist))

I think these 2 lines can be removed:
since the pte dirtiness has already been propagated to PG_dirty, we
don't need this here; otherwise you will always dirty an anon page
even when it is clean, and then we will page this clean page out to
the swap partition once more, which is not needed.
Am I understanding correctly?

By the way, please change my mail address to yalin.wang2010@gmail.com in the CC list.
Thanks a lot. :)


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/5] mm: simplify reclaim path for MADV_FREE
  2015-10-27  7:39         ` yalin wang
@ 2015-10-27  8:10           ` Minchan Kim
  2015-10-27  8:52             ` yalin wang
  0 siblings, 1 reply; 26+ messages in thread
From: Minchan Kim @ 2015-10-27  8:10 UTC (permalink / raw)
  To: yalin wang
  Cc: Hugh Dickins, Andrew Morton, open list:MEMORY MANAGEMENT, lkml,
	Rik van Riel, Mel Gorman, Michal Hocko, Johannes Weiner,
	Kirill A. Shutemov, Vlastimil Babka

On Tue, Oct 27, 2015 at 03:39:16PM +0800, yalin wang wrote:
> 
> > On Oct 27, 2015, at 15:09, Minchan Kim <minchan@kernel.org> wrote:
> > 
> > Hello Yalin,
> > 
> > Sorry for missing you in the Cc list.
> > IIRC, mails sent to your previous mail address (Yalin.Wang@sonymobile.com)
> > were returned.
> > 
> > You added your comment at the bottom, so I'm not sure which PageDirty check you meant.
> > 
> >> It is wrong here to only check PageDirty() to decide whether the page is freeable or not.
> >> An anon page can be shared by multiple processes (_mapcount > 1),
> >> so you must check every pte's dirty bit during the page_referenced() function;
> >> see this mail thread:
> >> http://ns1.ske-art.com/lists/kernel/msg1934021.html
> > 
> > If one pte among the processes sharing the page was dirty, the
> > dirtiness should be propagated from the pte to PG_dirty by
> > try_to_unmap_one. IOW, if the page doesn't have the PG_dirty flag,
> > it means every process did MADV_FREE.
> > 
> > Am I missing something in your question?
> > If so, could you show the exact scenario I am missing?
> > 
> > Thanks for the interest.
> oh, yeah, that is right, I missed that: pte_dirty will be propagated to PG_dirty,
> so that is correct.
> Generally speaking, this patch moves set_page_dirty() from add_to_swap() to
> try_to_unmap(); I think the patch can be changed a little:
> 
> @@ -1476,6 +1446,8 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> 				ret = SWAP_FAIL;
> 				goto out_unmap;
> 			}
> +			if (!PageDirty(page))
> +				SetPageDirty(page);
> 			if (list_empty(&mm->mmlist)) {
> 				spin_lock(&mmlist_lock);
> 				if (list_empty(&mm->mmlist))
> 
> i think these 2 lines can be removed:
> since the pte dirty bit has already been propagated via set_page_dirty(), we don't need this here;
> otherwise you would always dirty an anon page even when it is clean,
> and then page that clean page out to the swap partition once more, which is not needed.
> Am I understanding correctly?

Your understanding is correct.
I will fix it in the next spin.

> 
> By the way, please change my mail address to yalin.wang2010@gmail.com in CC list .
> Thanks a lot. :) 

Thanks for the review!

> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/5] mm: simplify reclaim path for MADV_FREE
  2015-10-27  8:10           ` Minchan Kim
@ 2015-10-27  8:52             ` yalin wang
  2015-10-28  4:03               ` yalin wang
  0 siblings, 1 reply; 26+ messages in thread
From: yalin wang @ 2015-10-27  8:52 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Hugh Dickins, Andrew Morton, open list:MEMORY MANAGEMENT, lkml,
	Rik van Riel, Mel Gorman, Michal Hocko, Johannes Weiner,
	Kirill A. Shutemov, Vlastimil Babka


> On Oct 27, 2015, at 16:10, Minchan Kim <minchan@kernel.org> wrote:
> 
> On Tue, Oct 27, 2015 at 03:39:16PM +0800, yalin wang wrote:
>> 
>>> On Oct 27, 2015, at 15:09, Minchan Kim <minchan@kernel.org> wrote:
>>> 
>>> Hello Yalin,
>>> 
>>> Sorry for missing you in the Cc list.
>>> IIRC, mails sent to your previous address (Yalin.Wang@sonymobile.com)
>>> bounced.
>>>
>>> You added your comment below the quoted lines, so I'm not sure which
>>> PageDirty check you meant.
>>> 
>>>> it is wrong here if you only check PageDirty() to decide whether the page is freeable or not.
>>>> An anon page can be shared by multiple processes (_mapcount > 1),
>>>> so you must check every pte's dirty bit during page_referenced();
>>>> see this mail thread:
>>>> http://ns1.ske-art.com/lists/kernel/msg1934021.html
>>> 
>>> If the pte of any process sharing the page was dirty, that dirtiness is
>>> propagated from the pte to PG_dirty by try_to_unmap_one.
>>> IOW, if the page doesn't have the PG_dirty flag, it means every process
>>> did MADV_FREE.
>>>
>>> Am I missing something in your question?
>>> If so, could you show the exact scenario I am missing?
>>> 
>>> Thanks for the interest.
>> oh, yeah, that is right, I missed that: pte_dirty will be propagated to PG_dirty,
>> so that is correct.
>> Generally speaking, this patch moves set_page_dirty() from add_to_swap() to
>> try_to_unmap(); I think the patch can be changed a little:
>> 
>> @@ -1476,6 +1446,8 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>> 				ret = SWAP_FAIL;
>> 				goto out_unmap;
>> 			}
>> +			if (!PageDirty(page))
>> +				SetPageDirty(page);
>> 			if (list_empty(&mm->mmlist)) {
>> 				spin_lock(&mmlist_lock);
>> 				if (list_empty(&mm->mmlist))
>> 
>> i think these 2 lines can be removed:
>> since the pte dirty bit has already been propagated via set_page_dirty(), we don't need this here;
>> otherwise you would always dirty an anon page even when it is clean,
>> and then page that clean page out to the swap partition once more, which is not needed.
>> Am I understanding correctly?
> 
> Your understanding is correct.
> I will fix it in the next spin.
> 
>> 
>> By the way, please change my mail address to yalin.wang2010@gmail.com in CC list .
>> Thanks a lot. :) 
> 
> Thanks for the review!

I had a look at the old mailing list and recalled a special case where
multiple processes share an anon page:

For example, process A has an anon page mapped like this:
	!pte_dirty() && PageDirty() == 1   (possible after a read fault on a swap entry followed by a successful try_to_free_swap().)
Process A does fork(); call the new process B.
Then A calls madvise(MADV_FREE) on the page.
At this point, the state is:

A: !pte_dirty() && PageDirty() == 0 && PageSwapCache() == 0

B: !pte_dirty() && PageDirty() == 0 && PageSwapCache() == 0

This means the page looks freeable and can be discarded during page reclaim.
That is not fair to process B: B never called madvise(MADV_FREE),
so its page should not be discarded. Strange behaviour will result if it is.

This was discussed in
http://www.serverphorums.com/read.php?12,1220840
but I don't know why that patch was not merged.

Thanks 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/5] mm: simplify reclaim path for MADV_FREE
  2015-10-27  8:52             ` yalin wang
@ 2015-10-28  4:03               ` yalin wang
  0 siblings, 0 replies; 26+ messages in thread
From: yalin wang @ 2015-10-28  4:03 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Hugh Dickins, Andrew Morton, open list:MEMORY MANAGEMENT, lkml,
	Rik van Riel, Mel Gorman, Michal Hocko, Johannes Weiner,
	Kirill A. Shutemov, Vlastimil Babka


> On Oct 27, 2015, at 16:52, yalin wang <yalin.wang2010@gmail.com> wrote:
> 
> 
>> On Oct 27, 2015, at 16:10, Minchan Kim <minchan@kernel.org> wrote:
>> 
>> On Tue, Oct 27, 2015 at 03:39:16PM +0800, yalin wang wrote:
>>> 
>>>> On Oct 27, 2015, at 15:09, Minchan Kim <minchan@kernel.org> wrote:
>>>> 
>>>> Hello Yalin,
>>>> 
>>>> Sorry for missing you in the Cc list.
>>>> IIRC, mails sent to your previous address (Yalin.Wang@sonymobile.com)
>>>> bounced.
>>>>
>>>> You added your comment below the quoted lines, so I'm not sure which
>>>> PageDirty check you meant.
>>>> 
>>>>> it is wrong here if you only check PageDirty() to decide whether the page is freeable or not.
>>>>> An anon page can be shared by multiple processes (_mapcount > 1),
>>>>> so you must check every pte's dirty bit during page_referenced();
>>>>> see this mail thread:
>>>>> http://ns1.ske-art.com/lists/kernel/msg1934021.html
>>>> 
>>>> If the pte of any process sharing the page was dirty, that dirtiness is
>>>> propagated from the pte to PG_dirty by try_to_unmap_one.
>>>> IOW, if the page doesn't have the PG_dirty flag, it means every process
>>>> did MADV_FREE.
>>>>
>>>> Am I missing something in your question?
>>>> If so, could you show the exact scenario I am missing?
>>>> 
>>>> Thanks for the interest.
>>> oh, yeah, that is right, I missed that: pte_dirty will be propagated to PG_dirty,
>>> so that is correct.
>>> Generally speaking, this patch moves set_page_dirty() from add_to_swap() to
>>> try_to_unmap(); I think the patch can be changed a little:
>>> 
>>> @@ -1476,6 +1446,8 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>> 				ret = SWAP_FAIL;
>>> 				goto out_unmap;
>>> 			}
>>> +			if (!PageDirty(page))
>>> +				SetPageDirty(page);
>>> 			if (list_empty(&mm->mmlist)) {
>>> 				spin_lock(&mmlist_lock);
>>> 				if (list_empty(&mm->mmlist))
>>> 
>>> i think these 2 lines can be removed:
>>> since the pte dirty bit has already been propagated via set_page_dirty(), we don't need this here;
>>> otherwise you would always dirty an anon page even when it is clean,
>>> and then page that clean page out to the swap partition once more, which is not needed.
>>> Am I understanding correctly?
>> 
>> Your understanding is correct.
>> I will fix it in the next spin.
>> 
>>> 
>>> By the way, please change my mail address to yalin.wang2010@gmail.com in CC list .
>>> Thanks a lot. :) 
>> 
>> Thanks for the review!
> 
> I had a look at the old mailing list and recalled a special case where
> multiple processes share an anon page:
> 
> For example, process A has an anon page mapped like this:
> 	!pte_dirty() && PageDirty() == 1   (possible after a read fault on a swap entry followed by a successful try_to_free_swap().)
> Process A does fork(); call the new process B.
> Then A calls madvise(MADV_FREE) on the page.
> At this point, the state is:
> 
> A: !pte_dirty() && PageDirty() == 0 && PageSwapCache() == 0
> 
> B: !pte_dirty() && PageDirty() == 0 && PageSwapCache() == 0
> 
> This means the page looks freeable and can be discarded during page reclaim.
> That is not fair to process B: B never called madvise(MADV_FREE),
> so its page should not be discarded. Strange behaviour will result if it is.
> 
> This was discussed in
> http://www.serverphorums.com/read.php?12,1220840
> but I don't know why that patch was not merged.
> 
> Thanks 
oh, I see that commit 0b502297d1cc26e09b98955b4efa728be1c48921
has been merged, so this problem should already be fixed by that approach.
Ignore this mail. :)

Thanks a lot.

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2015-10-28  4:04 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-19  6:31 [PATCH 0/5] MADV_FREE refactoring and fix KSM page Minchan Kim
2015-10-19  6:31 ` [PATCH 1/5] mm: MADV_FREE trivial clean up Minchan Kim
2015-10-19  6:31 ` [PATCH 2/5] mm: skip huge zero page in MADV_FREE Minchan Kim
2015-10-19  6:31 ` [PATCH 3/5] mm: clear PG_dirty to mark page freeable Minchan Kim
2015-10-27  1:28   ` Hugh Dickins
2015-10-27  6:50     ` Minchan Kim
2015-10-19  6:31 ` [PATCH 4/5] mm: simplify reclaim path for MADV_FREE Minchan Kim
2015-10-27  2:09   ` Hugh Dickins
2015-10-27  3:44     ` yalin wang
2015-10-27  7:09       ` Minchan Kim
2015-10-27  7:39         ` yalin wang
2015-10-27  8:10           ` Minchan Kim
2015-10-27  8:52             ` yalin wang
2015-10-28  4:03               ` yalin wang
2015-10-27  6:54     ` Minchan Kim
2015-10-19  6:31 ` [PATCH 5/5] mm: mark stable page dirty in KSM Minchan Kim
2015-10-27  2:23   ` Hugh Dickins
2015-10-27  6:58     ` Minchan Kim
2015-10-19 10:01 ` [PATCH 0/5] MADV_FREE refactoring and fix KSM page Minchan Kim
2015-10-20  1:38   ` Minchan Kim
2015-10-20  7:21   ` Minchan Kim
2015-10-20  7:27     ` Minchan Kim
2015-10-20 21:36     ` Andrew Morton
2015-10-20 22:43       ` Kirill A. Shutemov
2015-10-21  5:11         ` Minchan Kim
2015-10-21  7:50           ` Kirill A. Shutemov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).