* [PATCH 0/5] MADV_FREE refactoring and fix KSM page
  0 siblings, 6 replies; 26+ messages in thread
From: Minchan Kim @ 2015-10-19  6:31 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-kernel, Hugh Dickins, Rik van Riel, Mel Gorman,
    Michal Hocko, Johannes Weiner, Kirill A. Shutemov, Vlastimil Babka,
    Minchan Kim

Hello, it has been a while since I sent the previous patchset:

https://lkml.org/lkml/2015/6/3/37

This patchset is almost new compared to the previous approach. I think
it is simpler, clearer and easier to review.

One thing I should mention is that I had tested this patchset and
couldn't find any critical problem, so I rebased it onto a recent
mmotm (ie, mmotm-2015-10-15-15-20) to send a formal patchset.
Unfortunately, I then started to see sudden discarding of pages we
shouldn't discard. IOW, an application's valid anonymous pages
disappeared suddenly.

Looking through the THP changes, I think we could lose the dirty bit
of a pte between freeze_page and unfreeze_page, when we mark it as a
migration entry and restore it. So I added the simple code below
without much consideration, and I cannot see the problem any more.
I hope it's a good hint toward the right fix for this problem.

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d5ea516ffb54..e881c04f5950 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3138,6 +3138,9 @@ static void unfreeze_page_vma(struct vm_area_struct *vma, struct page *page,
 		if (is_write_migration_entry(swp_entry))
 			entry = maybe_mkwrite(entry, vma);
 
+		if (PageDirty(page))
+			SetPageDirty(page);
+
 		flush_dcache_page(page);
 		set_pte_at(vma->vm_mm, address, pte + i, entry);

Although it fixes the above problem, I encountered another bug, below,
within several hours.
BUG: Bad rss-counter state mm:ffff88007fc28000 idx:1 val:439
BUG: Bad rss-counter state mm:ffff88007fc28000 idx:2 val:73

Or

BUG: Bad rss-counter state mm:ffff88007fc28000 idx:1 val:512

It seems we are zapping a THP page without decreasing MM_ANONPAGES and
MM_SWAPENTS. Of course, it could be a bug of MADV_FREE that the recent
THP changes reveal. What I can say is that I couldn't see any problem
until mmotm-2015-10-06-16-30, so I guess there is some conflict with
Kirill's THP-refcount redesign, or it reveals a hidden bug of
MADV_FREE. I will hunt it down, but I hope Kirill might catch it
earlier than me.

The major things in this patchset are three:

1. Make MADV_FREE work with PG_dirty pages.

   So far, MADV_FREE doesn't work with a page which is not in the swap
   cache but has PG_dirty (ex, a swapped-in page). Details are in [3/5].

2. Make the MADV_FREE discard path simple.

   The current logic for discarding hinted pages is a real mess, so
   [4/5] makes it simple and clean.

3. Fix MADV_FREE with KSM pages.

   A process can have a KSM page which has no dirty bit in the page
   table entry and no PG_dirty in page->flags, so the VM could discard
   it wrongly. [5/5] fixes it.

Minchan Kim (5):
  [1/5] mm: MADV_FREE trivial clean up
  [2/5] mm: skip huge zero page in MADV_FREE
  [3/5] mm: clear PG_dirty to mark page freeable
  [4/5] mm: simplify reclaim path for MADV_FREE
  [5/5] mm: mark stable page dirty in KSM

 include/linux/rmap.h |  6 +----
 mm/huge_memory.c     |  9 ++++----
 mm/ksm.c             | 12 ++++++++++
 mm/madvise.c         | 29 +++++++++++-------------
 mm/rmap.c            | 46 +++++++------------------------------
 mm/swap_state.c      |  5 ++--
 mm/vmscan.c          | 64 ++++++++++++++++------------------------------------
 7 files changed, 60 insertions(+), 111 deletions(-)

-- 
1.9.1

^ permalink raw reply related	[flat|nested] 26+ messages in thread
* [PATCH 1/5] mm: MADV_FREE trivial clean up
  5 siblings, 0 replies; 26+ messages in thread
From: Minchan Kim @ 2015-10-19  6:31 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-kernel, Hugh Dickins, Rik van Riel, Mel Gorman,
    Michal Hocko, Johannes Weiner, Kirill A. Shutemov, Vlastimil Babka,
    Minchan Kim

1. The page table walker already passes the vma it is processing, so
we don't need to pass the vma separately via private data.

2. If a page table entry is dirty in try_to_unmap_one, the dirtiness
should be propagated to PG_dirty of the page, so it's enough to check
only PageDirty, without additional pte dirty bit checking.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/madvise.c | 17 +++--------------
 mm/rmap.c    |  6 ++----
 2 files changed, 5 insertions(+), 18 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 7835bc1eaccb..fdfb14a78c60 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -24,11 +24,6 @@
 
 #include <asm/tlb.h>
 
-struct madvise_free_private {
-	struct vm_area_struct *vma;
-	struct mmu_gather *tlb;
-};
-
 /*
  * Any behaviour which results in changes to the vma->vm_flags needs to
  * take mmap_sem for writing. Others, which simply traverse vmas, need
@@ -269,10 +264,9 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	unsigned long end, struct mm_walk *walk)
 
 {
-	struct madvise_free_private *fp = walk->private;
-	struct mmu_gather *tlb = fp->tlb;
+	struct mmu_gather *tlb = walk->private;
 	struct mm_struct *mm = tlb->mm;
-	struct vm_area_struct *vma = fp->vma;
+	struct vm_area_struct *vma = walk->vma;
 	spinlock_t *ptl;
 	pte_t *pte, ptent;
 	struct page *page;
@@ -365,15 +359,10 @@ static void madvise_free_page_range(struct mmu_gather *tlb,
 			     struct vm_area_struct *vma,
 			     unsigned long addr, unsigned long end)
 {
-	struct madvise_free_private fp = {
-		.vma = vma,
-		.tlb = tlb,
-	};
-
 	struct mm_walk free_walk = {
 		.pmd_entry = madvise_free_pte_range,
 		.mm = vma->vm_mm,
-		.private = &fp,
+		.private = tlb,
 	};
 
 	BUG_ON(addr >= end);
diff --git a/mm/rmap.c b/mm/rmap.c
index 6f0f9331a20f..94ee372e238b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1380,7 +1380,6 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	int ret = SWAP_AGAIN;
 	enum ttu_flags flags = (enum ttu_flags)arg;
-	int dirty = 0;
 
 	pte = page_check_address(page, mm, address, &ptl, 0);
 	if (!pte)
@@ -1423,8 +1422,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	}
 
 	/* Move the dirty bit to the physical page now the pte is gone. */
-	dirty = pte_dirty(pteval);
-	if (dirty)
+	if (pte_dirty(pteval))
 		set_page_dirty(page);
 
 	/* Update high watermark before we lower rss */
@@ -1457,7 +1455,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 
 		if (flags & TTU_FREE) {
 			VM_BUG_ON_PAGE(PageSwapCache(page), page);
-			if (!dirty && !PageDirty(page)) {
+			if (!PageDirty(page)) {
 				/* It's a freeable page by MADV_FREE */
 				dec_mm_counter(mm, MM_ANONPAGES);
 				goto discard;
-- 
1.9.1
* [PATCH 2/5] mm: skip huge zero page in MADV_FREE
  4 subsequent siblings, 0 replies; 26+ messages in thread
From: Minchan Kim @ 2015-10-19  6:31 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-kernel, Hugh Dickins, Rik van Riel, Mel Gorman,
    Michal Hocko, Johannes Weiner, Kirill A. Shutemov, Vlastimil Babka,
    Minchan Kim

It is pointless to mark the huge zero page as freeable. Let's skip it.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/huge_memory.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f1de4ce583a6..269ed99493f0 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1542,6 +1542,9 @@ int madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	struct page *page;
 	pmd_t orig_pmd;
 
+	if (is_huge_zero_pmd(*pmd))
+		goto out;
+
 	orig_pmd = pmdp_huge_get_and_clear(mm, addr, pmd);
 
 	/* No hugepage in swapcache */
@@ -1553,6 +1556,7 @@ int madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	set_pmd_at(mm, addr, pmd, orig_pmd);
 	tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
 
+out:
 	spin_unlock(ptl);
 	ret = 0;
 }
-- 
1.9.1
* [PATCH 3/5] mm: clear PG_dirty to mark page freeable
  2015-10-27  1:28 ` Hugh Dickins
  5 siblings, 1 reply; 26+ messages in thread
From: Minchan Kim @ 2015-10-19  6:31 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-kernel, Hugh Dickins, Rik van Riel, Mel Gorman,
    Michal Hocko, Johannes Weiner, Kirill A. Shutemov, Vlastimil Babka,
    Minchan Kim

Basically, MADV_FREE relies on the dirty bit in the page table entry
to decide whether the VM allows discarding the page or not. IOW, if
the page table entry has the dirty bit set, the VM shouldn't discard
the page.

However, as an example, if swap-in by a read fault happens, the page
table entry doesn't have the dirty bit, so MADV_FREE could discard the
page wrongly.

To avoid the problem, MADV_FREE did more checks with PageDirty and
PageSwapCache. It worked out because a swapped-in page lives in the
swap cache, and once it is evicted from the swap cache, the page has
the PG_dirty flag. So both page flag checks effectively prevent wrong
discarding by MADV_FREE.

However, a problem with the above logic is that a swapped-in page
still has PG_dirty after it is removed from the swap cache, so the VM
cannot consider the page freeable any more, even if madvise_free is
called in the future.

Look at the example below for detail.

    ptr = malloc();
    memset(ptr);
    ..
    .. heavy memory pressure so all of the pages are swapped out
    ..
    var = *ptr; -> a page is swapped in and could be removed from the
                   swap cache. Then, the page table doesn't have the
                   dirty bit and the page descriptor has PG_dirty.
    ..
    madvise_free(ptr); -> It doesn't clear PG_dirty of the page.
    ..
    .. heavy memory pressure again.
    .. This time, the VM cannot discard the page because the page
    .. has *PG_dirty*.

To solve the problem, this patch clears PG_dirty only if the page is
owned exclusively by the current process when madvise is called,
because PG_dirty represents the ptes' dirtiness across several
processes, so we can clear it only if we own it exclusively.

Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/madvise.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index fdfb14a78c60..5db546431285 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -312,11 +312,19 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 		if (!page)
 			continue;
 
-		if (PageSwapCache(page)) {
+		if (PageSwapCache(page) || PageDirty(page)) {
 			if (!trylock_page(page))
 				continue;
+			/*
+			 * If page is shared with others, we couldn't clear
+			 * PG_dirty of the page.
+			 */
+			if (page_count(page) != 1 + !!PageSwapCache(page)) {
+				unlock_page(page);
+				continue;
+			}
 
-			if (!try_to_free_swap(page)) {
+			if (PageSwapCache(page) && !try_to_free_swap(page)) {
 				unlock_page(page);
 				continue;
 			}
-- 
1.9.1
* Re: [PATCH 3/5] mm: clear PG_dirty to mark page freeable 2015-10-19 6:31 ` [PATCH 3/5] mm: clear PG_dirty to mark page freeable Minchan Kim @ 2015-10-27 1:28 ` Hugh Dickins 2015-10-27 6:50 ` Minchan Kim 0 siblings, 1 reply; 26+ messages in thread From: Hugh Dickins @ 2015-10-27 1:28 UTC (permalink / raw) To: Minchan Kim Cc: Andrew Morton, linux-mm, linux-kernel, Hugh Dickins, Rik van Riel, Mel Gorman, Michal Hocko, Johannes Weiner, Kirill A. Shutemov, Vlastimil Babka On Mon, 19 Oct 2015, Minchan Kim wrote: > Basically, MADV_FREE relies on dirty bit in page table entry > to decide whether VM allows to discard the page or not. > IOW, if page table entry includes marked dirty bit, VM shouldn't > discard the page. > > However, as a example, if swap-in by read fault happens, > page table entry doesn't have dirty bit so MADV_FREE could discard > the page wrongly. > > For avoiding the problem, MADV_FREE did more checks with PageDirty > and PageSwapCache. It worked out because swapped-in page lives on > swap cache and since it is evicted from the swap cache, the page has > PG_dirty flag. So both page flags check effectively prevent > wrong discarding by MADV_FREE. > > However, a problem in above logic is that swapped-in page has > PG_dirty still after they are removed from swap cache so VM cannot > consider the page as freeable any more even if madvise_free is > called in future. > > Look at below example for detail. > > ptr = malloc(); > memset(ptr); > .. > .. > .. heavy memory pressure so all of pages are swapped out > .. > .. > var = *ptr; -> a page swapped-in and could be removed from > swapcache. Then, page table doesn't mark > dirty bit and page descriptor includes PG_dirty > .. > .. > madvise_free(ptr); -> It doesn't clear PG_dirty of the page. > .. > .. > .. > .. heavy memory pressure again. > .. In this time, VM cannot discard the page because the page > .. 
has *PG_dirty* > > To solve the problem, this patch clears PG_dirty if only the page > is owned exclusively by current process when madvise is called > because PG_dirty represents ptes's dirtiness in several processes > so we could clear it only if we own it exclusively. > > Cc: Hugh Dickins <hughd@google.com> > Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Hugh Dickins <hughd@google.com> (and patches 1/5 and 2/5 too if you like) > --- > mm/madvise.c | 12 ++++++++++-- > 1 file changed, 10 insertions(+), 2 deletions(-) > > diff --git a/mm/madvise.c b/mm/madvise.c > index fdfb14a78c60..5db546431285 100644 > --- a/mm/madvise.c > +++ b/mm/madvise.c > @@ -312,11 +312,19 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, > if (!page) > continue; > > - if (PageSwapCache(page)) { > + if (PageSwapCache(page) || PageDirty(page)) { > if (!trylock_page(page)) > continue; > + /* > + * If page is shared with others, we couldn't clear > + * PG_dirty of the page. > + */ > + if (page_count(page) != 1 + !!PageSwapCache(page)) { > + unlock_page(page); > + continue; > + } > > - if (!try_to_free_swap(page)) { > + if (PageSwapCache(page) && !try_to_free_swap(page)) { > unlock_page(page); > continue; > } > -- > 1.9.1 > > ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 3/5] mm: clear PG_dirty to mark page freeable 2015-10-27 1:28 ` Hugh Dickins @ 2015-10-27 6:50 ` Minchan Kim 0 siblings, 0 replies; 26+ messages in thread From: Minchan Kim @ 2015-10-27 6:50 UTC (permalink / raw) To: Hugh Dickins Cc: Andrew Morton, linux-mm, linux-kernel, Rik van Riel, Mel Gorman, Michal Hocko, Johannes Weiner, Kirill A. Shutemov, Vlastimil Babka On Mon, Oct 26, 2015 at 06:28:13PM -0700, Hugh Dickins wrote: > On Mon, 19 Oct 2015, Minchan Kim wrote: > > > Basically, MADV_FREE relies on dirty bit in page table entry > > to decide whether VM allows to discard the page or not. > > IOW, if page table entry includes marked dirty bit, VM shouldn't > > discard the page. > > > > However, as a example, if swap-in by read fault happens, > > page table entry doesn't have dirty bit so MADV_FREE could discard > > the page wrongly. > > > > For avoiding the problem, MADV_FREE did more checks with PageDirty > > and PageSwapCache. It worked out because swapped-in page lives on > > swap cache and since it is evicted from the swap cache, the page has > > PG_dirty flag. So both page flags check effectively prevent > > wrong discarding by MADV_FREE. > > > > However, a problem in above logic is that swapped-in page has > > PG_dirty still after they are removed from swap cache so VM cannot > > consider the page as freeable any more even if madvise_free is > > called in future. > > > > Look at below example for detail. > > > > ptr = malloc(); > > memset(ptr); > > .. > > .. > > .. heavy memory pressure so all of pages are swapped out > > .. > > .. > > var = *ptr; -> a page swapped-in and could be removed from > > swapcache. Then, page table doesn't mark > > dirty bit and page descriptor includes PG_dirty > > .. > > .. > > madvise_free(ptr); -> It doesn't clear PG_dirty of the page. > > .. > > .. > > .. > > .. heavy memory pressure again. > > .. In this time, VM cannot discard the page because the page > > .. 
has *PG_dirty*
> >
> > To solve the problem, this patch clears PG_dirty only if the page
> > is owned exclusively by the current process when madvise is called,
> > because PG_dirty represents the ptes' dirtiness across several
> > processes, so we can clear it only if we own it exclusively.
> >
> > Cc: Hugh Dickins <hughd@google.com>
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
>
> Acked-by: Hugh Dickins <hughd@google.com>
>
> (and patches 1/5 and 2/5 too if you like)

Thanks for the review, Hugh! I will rebase the whole series from the
beginning as you suggested and will add your Acked-by, since I feel
you reviewed every line of the MADV_FREE code and found no problem.
If something goes wrong (ie, I abuse your Acked-by), please shout at
me.
* [PATCH 4/5] mm: simplify reclaim path for MADV_FREE
  2015-10-27  2:09 ` Hugh Dickins
  5 siblings, 1 reply; 26+ messages in thread
From: Minchan Kim @ 2015-10-19  6:31 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-kernel, Hugh Dickins, Rik van Riel, Mel Gorman,
    Michal Hocko, Johannes Weiner, Kirill A. Shutemov, Vlastimil Babka,
    Minchan Kim

I made the reclaim path a mess in order to check and free MADV_FREEed
pages. This patch simplifies it by tweaking add_to_swap.

So far, we mark a page PG_dirty when we add it to the swap cache (ie,
add_to_swap) in order to page it out to the swap device, but this
patch moves the PG_dirty marking into try_to_unmap_one, at the point
where we decide to change a pte from anon to a swap entry; so if any
process's pte has a swap entry for the page, the page must be swapped
out. IOW, there should be no functional behavior change. It makes the
reclaim path really simple for MADV_FREE because we just need to check
PG_dirty of the page to decide whether to discard it or not.

The other thing this patch does is pass TTU_BATCH_FLUSH to
try_to_unmap when we handle a freeable page, because I don't see any
reason to prevent it.
Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Minchan Kim <minchan@kernel.org> --- include/linux/rmap.h | 6 +---- mm/huge_memory.c | 5 ---- mm/rmap.c | 42 ++++++---------------------------- mm/swap_state.c | 5 ++-- mm/vmscan.c | 64 ++++++++++++++++------------------------------------ 5 files changed, 30 insertions(+), 92 deletions(-) diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 6b6233fafb53..978f65066fd5 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -193,8 +193,7 @@ static inline void page_dup_rmap(struct page *page, bool compound) * Called from mm/vmscan.c to handle paging out */ int page_referenced(struct page *, int is_locked, - struct mem_cgroup *memcg, unsigned long *vm_flags, - int *is_pte_dirty); + struct mem_cgroup *memcg, unsigned long *vm_flags); #define TTU_ACTION(x) ((x) & TTU_ACTION_MASK) @@ -272,11 +271,8 @@ int rmap_walk(struct page *page, struct rmap_walk_control *rwc); static inline int page_referenced(struct page *page, int is_locked, struct mem_cgroup *memcg, unsigned long *vm_flags, - int *is_pte_dirty) { *vm_flags = 0; - if (is_pte_dirty) - *is_pte_dirty = 0; return 0; } diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 269ed99493f0..adccfb48ce57 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1753,11 +1753,6 @@ pmd_t *page_check_address_pmd(struct page *page, return NULL; } -int pmd_freeable(pmd_t pmd) -{ - return !pmd_dirty(pmd); -} - #define VM_NO_THP (VM_SPECIAL | VM_HUGETLB | VM_SHARED | VM_MAYSHARE) int hugepage_madvise(struct vm_area_struct *vma, diff --git a/mm/rmap.c b/mm/rmap.c index 94ee372e238b..fd64f79c87c4 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -797,7 +797,6 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma) } struct page_referenced_arg { - int dirtied; int mapcount; int referenced; unsigned long vm_flags; @@ -812,7 +811,6 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma, struct mm_struct 
*mm = vma->vm_mm; spinlock_t *ptl; int referenced = 0; - int dirty = 0; struct page_referenced_arg *pra = arg; if (unlikely(PageTransHuge(page))) { @@ -835,14 +833,6 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma, if (pmdp_clear_flush_young_notify(vma, address, pmd)) referenced++; - /* - * Use pmd_freeable instead of raw pmd_dirty because in some - * of architecture, pmd_dirty is not defined unless - * CONFIG_TRANSPARENT_HUGEPAGE is enabled - */ - if (!pmd_freeable(*pmd)) - dirty++; - spin_unlock(ptl); } else { pte_t *pte; @@ -873,9 +863,6 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma, referenced++; } - if (pte_dirty(*pte)) - dirty++; - pte_unmap_unlock(pte, ptl); } @@ -889,9 +876,6 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma, pra->vm_flags |= vma->vm_flags; } - if (dirty) - pra->dirtied++; - pra->mapcount--; if (!pra->mapcount) return SWAP_SUCCESS; /* To break the loop */ @@ -916,7 +900,6 @@ static bool invalid_page_referenced_vma(struct vm_area_struct *vma, void *arg) * @is_locked: caller holds lock on the page * @memcg: target memory cgroup * @vm_flags: collect encountered vma->vm_flags who actually referenced the page - * @is_pte_dirty: ptes which have marked dirty bit - used for lazyfree page * * Quick test_and_clear_referenced for all mappings to a page, * returns the number of ptes which referenced the page. 
@@ -924,8 +907,7 @@ static bool invalid_page_referenced_vma(struct vm_area_struct *vma, void *arg) int page_referenced(struct page *page, int is_locked, struct mem_cgroup *memcg, - unsigned long *vm_flags, - int *is_pte_dirty) + unsigned long *vm_flags) { int ret; int we_locked = 0; @@ -940,8 +922,6 @@ int page_referenced(struct page *page, }; *vm_flags = 0; - if (is_pte_dirty) - *is_pte_dirty = 0; if (!page_mapped(page)) return 0; @@ -970,9 +950,6 @@ int page_referenced(struct page *page, if (we_locked) unlock_page(page); - if (is_pte_dirty) - *is_pte_dirty = pra.dirtied; - return pra.referenced; } @@ -1453,17 +1430,10 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, swp_entry_t entry = { .val = page_private(page) }; pte_t swp_pte; - if (flags & TTU_FREE) { - VM_BUG_ON_PAGE(PageSwapCache(page), page); - if (!PageDirty(page)) { - /* It's a freeable page by MADV_FREE */ - dec_mm_counter(mm, MM_ANONPAGES); - goto discard; - } else { - set_pte_at(mm, address, pte, pteval); - ret = SWAP_FAIL; - goto out_unmap; - } + if (!PageDirty(page) && (flags & TTU_FREE)) { + /* It's a freeable page by MADV_FREE */ + dec_mm_counter(mm, MM_ANONPAGES); + goto discard; } if (PageSwapCache(page)) { @@ -1476,6 +1446,8 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, ret = SWAP_FAIL; goto out_unmap; } + if (!PageDirty(page)) + SetPageDirty(page); if (list_empty(&mm->mmlist)) { spin_lock(&mmlist_lock); if (list_empty(&mm->mmlist)) diff --git a/mm/swap_state.c b/mm/swap_state.c index d783872d746c..676ff2991380 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -185,13 +185,12 @@ int add_to_swap(struct page *page, struct list_head *list) * deadlock in the swap out path. */ /* - * Add it to the swap cache and mark it dirty + * Add it to the swap cache. 
*/ err = add_to_swap_cache(page, entry, __GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN); - if (!err) { /* Success */ - SetPageDirty(page); + if (!err) { return 1; } else { /* -ENOMEM radix-tree allocation failure */ /* diff --git a/mm/vmscan.c b/mm/vmscan.c index 27d580b5e853..9b52ecf91194 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -791,17 +791,15 @@ enum page_references { }; static enum page_references page_check_references(struct page *page, - struct scan_control *sc, - bool *freeable) + struct scan_control *sc) { int referenced_ptes, referenced_page; unsigned long vm_flags; - int pte_dirty; VM_BUG_ON_PAGE(!PageLocked(page), page); referenced_ptes = page_referenced(page, 1, sc->target_mem_cgroup, - &vm_flags, &pte_dirty); + &vm_flags); referenced_page = TestClearPageReferenced(page); /* @@ -842,10 +840,6 @@ static enum page_references page_check_references(struct page *page, return PAGEREF_KEEP; } - if (PageAnon(page) && !pte_dirty && !PageSwapCache(page) && - !PageDirty(page)) - *freeable = true; - /* Reclaim if clean, defer dirty pages to writeback */ if (referenced_page && !PageSwapBacked(page)) return PAGEREF_RECLAIM_CLEAN; @@ -1037,8 +1031,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, } if (!force_reclaim) - references = page_check_references(page, sc, - &freeable); + references = page_check_references(page, sc); switch (references) { case PAGEREF_ACTIVATE: @@ -1055,31 +1048,24 @@ static unsigned long shrink_page_list(struct list_head *page_list, * Try to allocate it some swap space here. 
*/ if (PageAnon(page) && !PageSwapCache(page)) { - if (!freeable) { - if (!(sc->gfp_mask & __GFP_IO)) - goto keep_locked; - if (!add_to_swap(page, page_list)) - goto activate_locked; - may_enter_fs = 1; - /* Adding to swap updated mapping */ - mapping = page_mapping(page); - } else { - if (likely(!PageTransHuge(page))) - goto unmap; - /* try_to_unmap isn't aware of THP page */ - if (unlikely(split_huge_page_to_list(page, - page_list))) - goto keep_locked; - } + if (!(sc->gfp_mask & __GFP_IO)) + goto keep_locked; + if (!add_to_swap(page, page_list)) + goto activate_locked; + freeable = true; + may_enter_fs = 1; + /* Adding to swap updated mapping */ + mapping = page_mapping(page); } -unmap: + /* * The page is mapped into the page tables of one or more * processes. Try to unmap it here. */ - if (page_mapped(page) && (mapping || freeable)) { + if (page_mapped(page) && mapping) { switch (try_to_unmap(page, freeable ? - TTU_FREE : ttu_flags|TTU_BATCH_FLUSH)) { + ttu_flags | TTU_BATCH_FLUSH | TTU_FREE : + ttu_flags | TTU_BATCH_FLUSH)) { case SWAP_FAIL: goto activate_locked; case SWAP_AGAIN: @@ -1087,20 +1073,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, case SWAP_MLOCK: goto cull_mlocked; case SWAP_SUCCESS: - /* try to free the page below */ - if (!freeable) - break; - /* - * Freeable anon page doesn't have mapping - * due to skipping of swapcache so we free - * page in here rather than __remove_mapping. 
- */ - VM_BUG_ON_PAGE(PageSwapCache(page), page); - if (!page_freeze_refs(page, 1)) - goto keep_locked; - __ClearPageLocked(page); - count_vm_event(PGLAZYFREED); - goto free_it; + ; /* try to free the page below */ } } @@ -1217,6 +1190,9 @@ static unsigned long shrink_page_list(struct list_head *page_list, */ __ClearPageLocked(page); free_it: + if (freeable && !PageDirty(page)) + count_vm_event(PGLAZYFREED); + nr_reclaimed++; /* @@ -1847,7 +1823,7 @@ static void shrink_active_list(unsigned long nr_to_scan, } if (page_referenced(page, 0, sc->target_mem_cgroup, - &vm_flags, NULL)) { + &vm_flags)) { nr_rotated += hpage_nr_pages(page); /* * Identify referenced, file-backed active pages and -- 1.9.1 ^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH 4/5] mm: simplify reclaim path for MADV_FREE 2015-10-19 6:31 ` [PATCH 4/5] mm: simplify reclaim path for MADV_FREE Minchan Kim @ 2015-10-27 2:09 ` Hugh Dickins 2015-10-27 3:44 ` yalin wang 2015-10-27 6:54 ` Minchan Kim 0 siblings, 2 replies; 26+ messages in thread From: Hugh Dickins @ 2015-10-27 2:09 UTC (permalink / raw) To: Minchan Kim Cc: Andrew Morton, linux-mm, linux-kernel, Hugh Dickins, Rik van Riel, Mel Gorman, Michal Hocko, Johannes Weiner, Kirill A. Shutemov, Vlastimil Babka On Mon, 19 Oct 2015, Minchan Kim wrote: > I made reclaim path mess to check and free MADV_FREEed page. > This patch simplify it with tweaking add_to_swap. > > So far, we mark page as PG_dirty when we add the page into > swap cache(ie, add_to_swap) to page out to swap device but > this patch moves PG_dirty marking under try_to_unmap_one > when we decide to change pte from anon to swapent so if > any process's pte has swapent for the page, the page must > be swapped out. IOW, there should be no funcional behavior > change. It makes relcaim path really simple for MADV_FREE > because we just need to check PG_dirty of page to decide > discarding the page or not. > > Other thing this patch does is to pass TTU_BATCH_FLUSH to > try_to_unmap when we handle freeable page because I don't > see any reason to prevent it. > > Cc: Hugh Dickins <hughd@google.com> > Cc: Mel Gorman <mgorman@suse.de> > Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Hugh Dickins <hughd@google.com> This is sooooooo much nicer than the code it replaces! Really good. Kudos also to Hannes for suggesting this approach originally, I think. I hope this implementation satisfies a good proportion of the people who have been wanting MADV_FREE: I'm not among them, and have long lost touch with those discussions, so won't judge how usable it is. I assume you'll refactor the series again before it goes to Linus, so the previous messier implementations vanish? 
I notice Andrew has this "mm: simplify reclaim path for MADV_FREE" in mmotm as mm-dont-split-thp-page-when-syscall-is-called-fix-6.patch: I guess it all got much too messy to divide up in a hurry. I've noticed no problems in testing (unlike the first time you moved to working with pte_dirty); though of course I've not been using MADV_FREE itself at all. One aspect has worried me for a while, but I think I've reached the conclusion that it doesn't matter at all. The swap that's allocated in add_to_swap() would normally get freed again (after try_to_unmap found it was a MADV_FREE !pte_dirty !PageDirty case) at the bottom of shrink_page_list(), in __remove_mapping(), yes? The bit that worried me is that on rare occasions, something unknown might take a speculative reference to the page, and __remove_mapping() fail to freeze refs for that reason. Much too rare to worry over not freeing that page immediately, but it leaves us with a PageUptodate PageSwapCache !PageDirty page, yet its contents are not the contents of that location on swap. But since this can only happen when you have *not* inserted the corresponding swapent anywhere, I cannot think of anything that would have a legitimate interest in its contents matching that location on swap. So I don't think it's worth looking for somewhere to add a SetPageDirty (or a delete_from_swap_cache) just to regularize that case. 
> --- > include/linux/rmap.h | 6 +---- > mm/huge_memory.c | 5 ---- > mm/rmap.c | 42 ++++++---------------------------- > mm/swap_state.c | 5 ++-- > mm/vmscan.c | 64 ++++++++++++++++------------------------------------ > 5 files changed, 30 insertions(+), 92 deletions(-) > > diff --git a/include/linux/rmap.h b/include/linux/rmap.h > index 6b6233fafb53..978f65066fd5 100644 > --- a/include/linux/rmap.h > +++ b/include/linux/rmap.h > @@ -193,8 +193,7 @@ static inline void page_dup_rmap(struct page *page, bool compound) > * Called from mm/vmscan.c to handle paging out > */ > int page_referenced(struct page *, int is_locked, > - struct mem_cgroup *memcg, unsigned long *vm_flags, > - int *is_pte_dirty); > + struct mem_cgroup *memcg, unsigned long *vm_flags); > > #define TTU_ACTION(x) ((x) & TTU_ACTION_MASK) > > @@ -272,11 +271,8 @@ int rmap_walk(struct page *page, struct rmap_walk_control *rwc); > static inline int page_referenced(struct page *page, int is_locked, > struct mem_cgroup *memcg, > unsigned long *vm_flags, > - int *is_pte_dirty) > { > *vm_flags = 0; > - if (is_pte_dirty) > - *is_pte_dirty = 0; > return 0; > } > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 269ed99493f0..adccfb48ce57 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -1753,11 +1753,6 @@ pmd_t *page_check_address_pmd(struct page *page, > return NULL; > } > > -int pmd_freeable(pmd_t pmd) > -{ > - return !pmd_dirty(pmd); > -} > - > #define VM_NO_THP (VM_SPECIAL | VM_HUGETLB | VM_SHARED | VM_MAYSHARE) > > int hugepage_madvise(struct vm_area_struct *vma, > diff --git a/mm/rmap.c b/mm/rmap.c > index 94ee372e238b..fd64f79c87c4 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -797,7 +797,6 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma) > } > > struct page_referenced_arg { > - int dirtied; > int mapcount; > int referenced; > unsigned long vm_flags; > @@ -812,7 +811,6 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma, > struct 
mm_struct *mm = vma->vm_mm; > spinlock_t *ptl; > int referenced = 0; > - int dirty = 0; > struct page_referenced_arg *pra = arg; > > if (unlikely(PageTransHuge(page))) { > @@ -835,14 +833,6 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma, > if (pmdp_clear_flush_young_notify(vma, address, pmd)) > referenced++; > > - /* > - * Use pmd_freeable instead of raw pmd_dirty because in some > - * of architecture, pmd_dirty is not defined unless > - * CONFIG_TRANSPARENT_HUGEPAGE is enabled > - */ > - if (!pmd_freeable(*pmd)) > - dirty++; > - > spin_unlock(ptl); > } else { > pte_t *pte; > @@ -873,9 +863,6 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma, > referenced++; > } > > - if (pte_dirty(*pte)) > - dirty++; > - > pte_unmap_unlock(pte, ptl); > } > > @@ -889,9 +876,6 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma, > pra->vm_flags |= vma->vm_flags; > } > > - if (dirty) > - pra->dirtied++; > - > pra->mapcount--; > if (!pra->mapcount) > return SWAP_SUCCESS; /* To break the loop */ > @@ -916,7 +900,6 @@ static bool invalid_page_referenced_vma(struct vm_area_struct *vma, void *arg) > * @is_locked: caller holds lock on the page > * @memcg: target memory cgroup > * @vm_flags: collect encountered vma->vm_flags who actually referenced the page > - * @is_pte_dirty: ptes which have marked dirty bit - used for lazyfree page > * > * Quick test_and_clear_referenced for all mappings to a page, > * returns the number of ptes which referenced the page. 
> @@ -924,8 +907,7 @@ static bool invalid_page_referenced_vma(struct vm_area_struct *vma, void *arg) > int page_referenced(struct page *page, > int is_locked, > struct mem_cgroup *memcg, > - unsigned long *vm_flags, > - int *is_pte_dirty) > + unsigned long *vm_flags) > { > int ret; > int we_locked = 0; > @@ -940,8 +922,6 @@ int page_referenced(struct page *page, > }; > > *vm_flags = 0; > - if (is_pte_dirty) > - *is_pte_dirty = 0; > > if (!page_mapped(page)) > return 0; > @@ -970,9 +950,6 @@ int page_referenced(struct page *page, > if (we_locked) > unlock_page(page); > > - if (is_pte_dirty) > - *is_pte_dirty = pra.dirtied; > - > return pra.referenced; > } > > @@ -1453,17 +1430,10 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > swp_entry_t entry = { .val = page_private(page) }; > pte_t swp_pte; > > - if (flags & TTU_FREE) { > - VM_BUG_ON_PAGE(PageSwapCache(page), page); > - if (!PageDirty(page)) { > - /* It's a freeable page by MADV_FREE */ > - dec_mm_counter(mm, MM_ANONPAGES); > - goto discard; > - } else { > - set_pte_at(mm, address, pte, pteval); > - ret = SWAP_FAIL; > - goto out_unmap; > - } > + if (!PageDirty(page) && (flags & TTU_FREE)) { > + /* It's a freeable page by MADV_FREE */ > + dec_mm_counter(mm, MM_ANONPAGES); > + goto discard; > } > > if (PageSwapCache(page)) { > @@ -1476,6 +1446,8 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > ret = SWAP_FAIL; > goto out_unmap; > } > + if (!PageDirty(page)) > + SetPageDirty(page); > if (list_empty(&mm->mmlist)) { > spin_lock(&mmlist_lock); > if (list_empty(&mm->mmlist)) > diff --git a/mm/swap_state.c b/mm/swap_state.c > index d783872d746c..676ff2991380 100644 > --- a/mm/swap_state.c > +++ b/mm/swap_state.c > @@ -185,13 +185,12 @@ int add_to_swap(struct page *page, struct list_head *list) > * deadlock in the swap out path. > */ > /* > - * Add it to the swap cache and mark it dirty > + * Add it to the swap cache. 
> */ > err = add_to_swap_cache(page, entry, > __GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN); > > - if (!err) { /* Success */ > - SetPageDirty(page); > + if (!err) { > return 1; > } else { /* -ENOMEM radix-tree allocation failure */ > /* > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 27d580b5e853..9b52ecf91194 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -791,17 +791,15 @@ enum page_references { > }; > > static enum page_references page_check_references(struct page *page, > - struct scan_control *sc, > - bool *freeable) > + struct scan_control *sc) > { > int referenced_ptes, referenced_page; > unsigned long vm_flags; > - int pte_dirty; > > VM_BUG_ON_PAGE(!PageLocked(page), page); > > referenced_ptes = page_referenced(page, 1, sc->target_mem_cgroup, > - &vm_flags, &pte_dirty); > + &vm_flags); > referenced_page = TestClearPageReferenced(page); > > /* > @@ -842,10 +840,6 @@ static enum page_references page_check_references(struct page *page, > return PAGEREF_KEEP; > } > > - if (PageAnon(page) && !pte_dirty && !PageSwapCache(page) && > - !PageDirty(page)) > - *freeable = true; > - > /* Reclaim if clean, defer dirty pages to writeback */ > if (referenced_page && !PageSwapBacked(page)) > return PAGEREF_RECLAIM_CLEAN; > @@ -1037,8 +1031,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, > } > > if (!force_reclaim) > - references = page_check_references(page, sc, > - &freeable); > + references = page_check_references(page, sc); > > switch (references) { > case PAGEREF_ACTIVATE: > @@ -1055,31 +1048,24 @@ static unsigned long shrink_page_list(struct list_head *page_list, > * Try to allocate it some swap space here. 
> */ > if (PageAnon(page) && !PageSwapCache(page)) { > - if (!freeable) { > - if (!(sc->gfp_mask & __GFP_IO)) > - goto keep_locked; > - if (!add_to_swap(page, page_list)) > - goto activate_locked; > - may_enter_fs = 1; > - /* Adding to swap updated mapping */ > - mapping = page_mapping(page); > - } else { > - if (likely(!PageTransHuge(page))) > - goto unmap; > - /* try_to_unmap isn't aware of THP page */ > - if (unlikely(split_huge_page_to_list(page, > - page_list))) > - goto keep_locked; > - } > + if (!(sc->gfp_mask & __GFP_IO)) > + goto keep_locked; > + if (!add_to_swap(page, page_list)) > + goto activate_locked; > + freeable = true; > + may_enter_fs = 1; > + /* Adding to swap updated mapping */ > + mapping = page_mapping(page); > } > -unmap: > + > /* > * The page is mapped into the page tables of one or more > * processes. Try to unmap it here. > */ > - if (page_mapped(page) && (mapping || freeable)) { > + if (page_mapped(page) && mapping) { > switch (try_to_unmap(page, freeable ? > - TTU_FREE : ttu_flags|TTU_BATCH_FLUSH)) { > + ttu_flags | TTU_BATCH_FLUSH | TTU_FREE : > + ttu_flags | TTU_BATCH_FLUSH)) { > case SWAP_FAIL: > goto activate_locked; > case SWAP_AGAIN: > @@ -1087,20 +1073,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, > case SWAP_MLOCK: > goto cull_mlocked; > case SWAP_SUCCESS: > - /* try to free the page below */ > - if (!freeable) > - break; > - /* > - * Freeable anon page doesn't have mapping > - * due to skipping of swapcache so we free > - * page in here rather than __remove_mapping. 
> - */ > - VM_BUG_ON_PAGE(PageSwapCache(page), page); > - if (!page_freeze_refs(page, 1)) > - goto keep_locked; > - __ClearPageLocked(page); > - count_vm_event(PGLAZYFREED); > - goto free_it; > + ; /* try to free the page below */ > } > } > > @@ -1217,6 +1190,9 @@ static unsigned long shrink_page_list(struct list_head *page_list, > */ > __ClearPageLocked(page); > free_it: > + if (freeable && !PageDirty(page)) > + count_vm_event(PGLAZYFREED); > + > nr_reclaimed++; > > /* > @@ -1847,7 +1823,7 @@ static void shrink_active_list(unsigned long nr_to_scan, > } > > if (page_referenced(page, 0, sc->target_mem_cgroup, > - &vm_flags, NULL)) { > + &vm_flags)) { > nr_rotated += hpage_nr_pages(page); > /* > * Identify referenced, file-backed active pages and > -- > 1.9.1 ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 4/5] mm: simplify reclaim path for MADV_FREE 2015-10-27 2:09 ` Hugh Dickins @ 2015-10-27 3:44 ` yalin wang 2015-10-27 7:09 ` Minchan Kim 2015-10-27 6:54 ` Minchan Kim 1 sibling, 1 reply; 26+ messages in thread From: yalin wang @ 2015-10-27 3:44 UTC (permalink / raw) To: Hugh Dickins Cc: Minchan Kim, Andrew Morton, open list:MEMORY MANAGEMENT, lkml, Rik van Riel, Mel Gorman, Michal Hocko, Johannes Weiner, Kirill A. Shutemov, Vlastimil Babka

> On Oct 27, 2015, at 10:09, Hugh Dickins <hughd@google.com> wrote:
>
> On Mon, 19 Oct 2015, Minchan Kim wrote:
>
>> I made the reclaim path a mess to check and free MADV_FREEd pages.
>> This patch simplifies it by tweaking add_to_swap.
>>
>> So far, we mark a page PG_dirty when we add it into the swap cache
>> (ie, add_to_swap) to page it out to the swap device, but this patch
>> moves the PG_dirty marking into try_to_unmap_one, when we decide to
>> change a pte from anon to a swap entry; so if any process's pte has
>> a swap entry for the page, the page must be swapped out. IOW, there
>> should be no functional behavior change. It makes the reclaim path
>> really simple for MADV_FREE because we just need to check PG_dirty
>> of the page to decide whether to discard it or not.
>>
>> The other thing this patch does is pass TTU_BATCH_FLUSH to
>> try_to_unmap when we handle a freeable page, because I don't see
>> any reason to prevent it.
>>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Signed-off-by: Minchan Kim <minchan@kernel.org>
>
> Acked-by: Hugh Dickins <hughd@google.com>
>
> This is sooooooo much nicer than the code it replaces! Really good.
> Kudos also to Hannes for suggesting this approach originally, I think.
>
> I hope this implementation satisfies a good proportion of the people
> who have been wanting MADV_FREE: I'm not among them, and have long
> lost touch with those discussions, so won't judge how usable it is.
>
> I assume you'll refactor the series again before it goes to Linus,
> so the previous messier implementations vanish? I notice Andrew
> has this "mm: simplify reclaim path for MADV_FREE" in mmotm as
> mm-dont-split-thp-page-when-syscall-is-called-fix-6.patch:
> I guess it all got much too messy to divide up in a hurry.
>
> I've noticed no problems in testing (unlike the first time you moved
> to working with pte_dirty); though of course I've not been using
> MADV_FREE itself at all.
>
> One aspect has worried me for a while, but I think I've reached the
> conclusion that it doesn't matter at all. The swap that's allocated
> in add_to_swap() would normally get freed again (after try_to_unmap
> found it was a MADV_FREE !pte_dirty !PageDirty case) at the bottom
> of shrink_page_list(), in __remove_mapping(), yes?
>
> The bit that worried me is that on rare occasions, something unknown
> might take a speculative reference to the page, and __remove_mapping()
> fail to freeze refs for that reason. Much too rare to worry over not
> freeing that page immediately, but it leaves us with a PageUptodate
> PageSwapCache !PageDirty page, yet its contents are not the contents
> of that location on swap.
>
> But since this can only happen when you have *not* inserted the
> corresponding swapent anywhere, I cannot think of anything that would
> have a legitimate interest in its contents matching that location on swap.
> So I don't think it's worth looking for somewhere to add a SetPageDirty
> (or a delete_from_swap_cache) just to regularize that case.
>> <snip>
>> @@ -1453,17 +1430,10 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>> 		swp_entry_t entry = { .val = page_private(page) };
>> 		pte_t swp_pte;
>>
>> -		if (flags & TTU_FREE) {
>> -			VM_BUG_ON_PAGE(PageSwapCache(page), page);
>> -			if (!PageDirty(page)) {
>> -				/* It's a freeable page by MADV_FREE */
>> -				dec_mm_counter(mm, MM_ANONPAGES);
>> -				goto discard;
>> -			} else {
>> -				set_pte_at(mm, address, pte, pteval);
>> -				ret = SWAP_FAIL;
>> -				goto out_unmap;
>> -			}
>> +		if (!PageDirty(page) && (flags & TTU_FREE)) {
>> +			/* It's a freeable page by MADV_FREE */
>> +			dec_mm_counter(mm, MM_ANONPAGES);
>> +			goto discard;
>> 		}
>> <snip>

It is wrong here to check only PageDirty() to decide whether the page is freeable or not. An anon page can be shared by multiple processes (_mapcount > 1), so you must check every pte's dirty bit during the page_referenced() walk; see this mail thread: http://ns1.ske-art.com/lists/kernel/msg1934021.html

Thanks

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 4/5] mm: simplify reclaim path for MADV_FREE 2015-10-27 3:44 ` yalin wang @ 2015-10-27 7:09 ` Minchan Kim 2015-10-27 7:39 ` yalin wang 0 siblings, 1 reply; 26+ messages in thread From: Minchan Kim @ 2015-10-27 7:09 UTC (permalink / raw) To: yalin wang Cc: Hugh Dickins, Andrew Morton, open list:MEMORY MANAGEMENT, lkml, Rik van Riel, Mel Gorman, Michal Hocko, Johannes Weiner, Kirill A. Shutemov, Vlastimil Babka

Hello Yalin,

Sorry for missing you in the Cc list. IIRC, mails sent to your previous address (Yalin.Wang@sonymobile.com) were returned.

On Tue, Oct 27, 2015 at 11:44:09AM +0800, yalin wang wrote:
<snip>

You added your comment below the whole quote, so I'm not sure which PageDirty check you meant.

> It is wrong here to check only PageDirty() to decide whether the page is
> freeable or not. An anon page can be shared by multiple processes
> (_mapcount > 1), so you must check every pte's dirty bit during the
> page_referenced() walk; see this mail thread:
> http://ns1.ske-art.com/lists/kernel/msg1934021.html

If a pte among the processes sharing the page was dirty, that dirtiness should have been propagated from the pte to PG_dirty by try_to_unmap_one. IOW, if the page doesn't have PG_dirty set, it means every process did MADV_FREE.

Am I missing something in your question? If so, could you show the exact scenario I am missing?

Thanks for the interest.

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 4/5] mm: simplify reclaim path for MADV_FREE 2015-10-27 7:09 ` Minchan Kim @ 2015-10-27 7:39 ` yalin wang 2015-10-27 8:10 ` Minchan Kim 0 siblings, 1 reply; 26+ messages in thread From: yalin wang @ 2015-10-27 7:39 UTC (permalink / raw) To: Minchan Kim Cc: Hugh Dickins, Andrew Morton, open list:MEMORY MANAGEMENT, lkml, Rik van Riel, Mel Gorman, Michal Hocko, Johannes Weiner, Kirill A. Shutemov, Vlastimil Babka

> On Oct 27, 2015, at 15:09, Minchan Kim <minchan@kernel.org> wrote:
<snip>
> If a pte among the processes sharing the page was dirty, that dirtiness
> should have been propagated from the pte to PG_dirty by try_to_unmap_one.
> IOW, if the page doesn't have PG_dirty set, it means every process did
> MADV_FREE.

Oh yeah, that is right; I missed that pte_dirty is propagated to PG_dirty, so that is correct.

Generally speaking, this patch moves the set_page_dirty() from add_to_swap() to try_to_unmap(). I think the patch can be changed a little:

@@ -1476,6 +1446,8 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			ret = SWAP_FAIL;
 			goto out_unmap;
 		}
+		if (!PageDirty(page))
+			SetPageDirty(page);
 		if (list_empty(&mm->mmlist)) {
 			spin_lock(&mmlist_lock);
 			if (list_empty(&mm->mmlist))

I think these two lines can be removed: since pte dirtiness has already been propagated to the page by set_page_dirty(), this hunk is not needed here. Otherwise you always dirty an anon page even when it is clean, and then page out that clean page to the swap partition once more, which is not needed. Am I understanding correctly?

By the way, please change my mail address to yalin.wang2010@gmail.com in the CC list. Thanks a lot. :)

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 4/5] mm: simplify reclaim path for MADV_FREE 2015-10-27 7:39 ` yalin wang @ 2015-10-27 8:10 ` Minchan Kim 2015-10-27 8:52 ` yalin wang 0 siblings, 1 reply; 26+ messages in thread From: Minchan Kim @ 2015-10-27 8:10 UTC (permalink / raw) To: yalin wang Cc: Hugh Dickins, Andrew Morton, open list:MEMORY MANAGEMENT, lkml, Rik van Riel, Mel Gorman, Michal Hocko, Johannes Weiner, Kirill A. Shutemov, Vlastimil Babka

On Tue, Oct 27, 2015 at 03:39:16PM +0800, yalin wang wrote:
<snip>
> I think these two lines can be removed: since pte dirtiness has already
> been propagated to the page by set_page_dirty(), this hunk is not needed
> here. Otherwise you always dirty an anon page even when it is clean, and
> then page out that clean page to the swap partition once more, which is
> not needed. Am I understanding correctly?

Your understanding is correct. I will fix it in the next spin.

> By the way, please change my mail address to yalin.wang2010@gmail.com
> in the CC list. Thanks a lot. :)

Thanks for the review!

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 4/5] mm: simplify reclaim path for MADV_FREE 2015-10-27 8:10 ` Minchan Kim @ 2015-10-27 8:52 ` yalin wang 2015-10-28 4:03 ` yalin wang 0 siblings, 1 reply; 26+ messages in thread From: yalin wang @ 2015-10-27 8:52 UTC (permalink / raw) To: Minchan Kim Cc: Hugh Dickins, Andrew Morton, open list:MEMORY MANAGEMENT, lkml, Rik van Riel, Mel Gorman, Michal Hocko, Johannes Weiner, Kirill A. Shutemov, Vlastimil Babka > On Oct 27, 2015, at 16:10, Minchan Kim <minchan@kernel.org> wrote: > > On Tue, Oct 27, 2015 at 03:39:16PM +0800, yalin wang wrote: >> >>> On Oct 27, 2015, at 15:09, Minchan Kim <minchan@kernel.org> wrote: >>> >>> Hello Yalin, >>> >>> Sorry for missing you in Cc list. >>> IIRC, mails to send your previous mail address(Yalin.Wang@sonymobile.com) >>> were returned. >>> >>> You added comment bottom line so I'm not sure what PageDirty you meant. >>> >>>> it is wrong here if you only check PageDirty() to decide if the page is freezable or not . >>>> The Anon page are shared by multiple process, _mapcount > 1 , >>>> so you must check all pt_dirty bit during page_referenced() function, >>>> see this mail thread: >>>> http://ns1.ske-art.com/lists/kernel/msg1934021.html >>> >>> If one of pte among process sharing the page was dirty, the dirtiness should >>> be propagated from pte to PG_dirty by try_to_unmap_one. >>> IOW, if the page doesn't have PG_dirty flag, it means all of process did >>> MADV_FREE. >>> >>> Am I missing something from you question? >>> If so, could you show exact scenario I am missing? >>> >>> Thanks for the interest. >> oh, yeah , that is right , i miss that , pte_dirty will propagate to PG_dirty , >> so that is correct . 
>> Generally speaking, this patch moves set_page_dirty() from add_to_swap() to
>> try_to_unmap(); i think we can change this patch a little:
>>
>> @@ -1476,6 +1446,8 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>> 			ret = SWAP_FAIL;
>> 			goto out_unmap;
>> 		}
>> +		if (!PageDirty(page))
>> +			SetPageDirty(page);
>> 		if (list_empty(&mm->mmlist)) {
>> 			spin_lock(&mmlist_lock);
>> 			if (list_empty(&mm->mmlist))
>>
>> i think these 2 lines can be removed,
>> since pte_dirty has already been propagated via set_page_dirty(), so we don't need this line here;
>> otherwise you will always dirty an AnonPage even if it is clean,
>> and then we will page out this clean page to the swap partition once more, which is not needed.
>> am i understanding correctly?
>
> Your understanding is correct.
> I will fix it in next spin.
>
>>
>> By the way, please change my mail address to yalin.wang2010@gmail.com in CC list.
>> Thanks a lot. :)
>
> Thanks for the review!

i have had a look at the old mail list, and i recall the scenario where multiple
processes share an AnonPage.
special case:

for example, Process A has an AnonPage mapped like this:
	!pte_dirty() && PageDirty()==1  (this is possible after a read fault
	happened on a swap entry and try_to_free_swap() succeeded.)
Process A does a fork(); the new process is called B.
Then A calls madvise(MADV_FREE) on the page.
At this time, the page state looks like this:

A	!pte_dirty() && PageDirty() == 0 && PageSwapCache() == 0

B	!pte_dirty() && PageDirty() == 0 && PageSwapCache() == 0

This means this page is freeable and can be freed during page reclaim.
This is not fair for Process B: since B didn't call madvise(MADV_FREE),
its page should not be discarded. It will cause some strange behaviour
if it happens.

This was discussed in
http://www.serverphorums.com/read.php?12,1220840
but i don't know why that patch was not merged.

Thanks

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH 4/5] mm: simplify reclaim path for MADV_FREE 2015-10-27 8:52 ` yalin wang @ 2015-10-28 4:03 ` yalin wang 0 siblings, 0 replies; 26+ messages in thread From: yalin wang @ 2015-10-28 4:03 UTC (permalink / raw) To: Minchan Kim Cc: Hugh Dickins, Andrew Morton, open list:MEMORY MANAGEMENT, lkml, Rik van Riel, Mel Gorman, Michal Hocko, Johannes Weiner, Kirill A. Shutemov, Vlastimil Babka > On Oct 27, 2015, at 16:52, yalin wang <yalin.wang2010@gmail.com> wrote: > > >> On Oct 27, 2015, at 16:10, Minchan Kim <minchan@kernel.org> wrote: >> >> On Tue, Oct 27, 2015 at 03:39:16PM +0800, yalin wang wrote: >>> >>>> On Oct 27, 2015, at 15:09, Minchan Kim <minchan@kernel.org> wrote: >>>> >>>> Hello Yalin, >>>> >>>> Sorry for missing you in Cc list. >>>> IIRC, mails to send your previous mail address(Yalin.Wang@sonymobile.com) >>>> were returned. >>>> >>>> You added comment bottom line so I'm not sure what PageDirty you meant. >>>> >>>>> it is wrong here if you only check PageDirty() to decide if the page is freezable or not . >>>>> The Anon page are shared by multiple process, _mapcount > 1 , >>>>> so you must check all pt_dirty bit during page_referenced() function, >>>>> see this mail thread: >>>>> http://ns1.ske-art.com/lists/kernel/msg1934021.html >>>> >>>> If one of pte among process sharing the page was dirty, the dirtiness should >>>> be propagated from pte to PG_dirty by try_to_unmap_one. >>>> IOW, if the page doesn't have PG_dirty flag, it means all of process did >>>> MADV_FREE. >>>> >>>> Am I missing something from you question? >>>> If so, could you show exact scenario I am missing? >>>> >>>> Thanks for the interest. >>> oh, yeah , that is right , i miss that , pte_dirty will propagate to PG_dirty , >>> so that is correct . 
>>> Generic to say this patch move set_page_dirty() from add_to_swap() to >>> try_to_unmap(), i think can change a little about this patch: >>> >>> @@ -1476,6 +1446,8 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, >>> ret = SWAP_FAIL; >>> goto out_unmap; >>> } >>> + if (!PageDirty(page)) >>> + SetPageDirty(page); >>> if (list_empty(&mm->mmlist)) { >>> spin_lock(&mmlist_lock); >>> if (list_empty(&mm->mmlist)) >>> >>> i think this 2 lines can be removed , >>> since pte_dirty have propagated to set_page_dirty() , we don’t need this line here , >>> otherwise you will always dirty a AnonPage, even it is clean, >>> then we will page out this clean page to swap partition one more , this is not needed. >>> am i understanding correctly ? >> >> Your understanding is correct. >> I will fix it in next spin. >> >>> >>> By the way, please change my mail address to yalin.wang2010@gmail.com in CC list . >>> Thanks a lot. :) >> >> Thanks for the review! > > i have a look at the old mail list , i recall the scenario that multiple processes share a AnonPage > special case : > > for example Process A have a AnonPage map like this: > ! pte_dirty() && PageDirty()==1 (this is possible after read fault happened on swap entry, and try_to_free_swap() succeed.) > Process A do a fork() , New process is called B . > Then A syscall(MADV_FREE) on the page . > At this time, page table like this: > > A ! pte_dirty() && PageDirty() == 0 && PageSwapCache() == 0 > > B ! pte_dirty() && PageDirty() == 0 && PageSwapCache() == 0 > > This means this page is freeable , and can be freed during page reclaim. > This is not fair for Process B . Since B don’t call syscall(MADV_FREE) , > its page should not be discard . Will cause some strange behaviour if happened . > > This is discussed by > http://www.serverphorums.com/read.php?12,1220840 > but i don’t know why the patch is not merged . 
>
> Thanks

oh, i see that commit 0b502297d1cc26e09b98955b4efa728be1c48921 has been merged,
so this problem should already be fixed by that method.
ignore this mail. :)
Thanks a lot.

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH 4/5] mm: simplify reclaim path for MADV_FREE 2015-10-27 2:09 ` Hugh Dickins 2015-10-27 3:44 ` yalin wang @ 2015-10-27 6:54 ` Minchan Kim 1 sibling, 0 replies; 26+ messages in thread From: Minchan Kim @ 2015-10-27 6:54 UTC (permalink / raw) To: Hugh Dickins Cc: Andrew Morton, linux-mm, linux-kernel, Rik van Riel, Mel Gorman, Michal Hocko, Johannes Weiner, Kirill A. Shutemov, Vlastimil Babka On Mon, Oct 26, 2015 at 07:09:15PM -0700, Hugh Dickins wrote: > On Mon, 19 Oct 2015, Minchan Kim wrote: > > > I made reclaim path mess to check and free MADV_FREEed page. > > This patch simplify it with tweaking add_to_swap. > > > > So far, we mark page as PG_dirty when we add the page into > > swap cache(ie, add_to_swap) to page out to swap device but > > this patch moves PG_dirty marking under try_to_unmap_one > > when we decide to change pte from anon to swapent so if > > any process's pte has swapent for the page, the page must > > be swapped out. IOW, there should be no funcional behavior > > change. It makes relcaim path really simple for MADV_FREE > > because we just need to check PG_dirty of page to decide > > discarding the page or not. > > > > Other thing this patch does is to pass TTU_BATCH_FLUSH to > > try_to_unmap when we handle freeable page because I don't > > see any reason to prevent it. > > > > Cc: Hugh Dickins <hughd@google.com> > > Cc: Mel Gorman <mgorman@suse.de> > > Signed-off-by: Minchan Kim <minchan@kernel.org> > > Acked-by: Hugh Dickins <hughd@google.com> > > This is sooooooo much nicer than the code it replaces! Really good. Thanks! > Kudos also to Hannes for suggesting this approach originally, I think. I should buy beer or soju if Hannes likes. > > I hope this implementation satisfies a good proportion of the people > who have been wanting MADV_FREE: I'm not among them, and have long > lost touch with those discussions, so won't judge how usable it is. 
>
> I assume you'll refactor the series again before it goes to Linus,
> so the previous messier implementations vanish?  I notice Andrew

Actually, I didn't think about that, but once you mentioned it,
I realized that would be better. Thanks for the suggestion.

> has this "mm: simplify reclaim path for MADV_FREE" in mmotm as
> mm-dont-split-thp-page-when-syscall-is-called-fix-6.patch:
> I guess it all got much too messy to divide up in a hurry.

Yeb, I will rebase the whole series from the beginning on a recent mmotm,
so the mess will vanish from git-blame.
When I rebase it in mmotm, I will do it before the new THP-refcount design
lands, if Andrew and Kirill don't mind, because that design makes my test
fail as I reported. I don't know whether it's a long-standing unknown bug
or something the new THP-refcount work introduces.
Anyway, I want the testing to go smoothly.

>
> I've noticed no problems in testing (unlike the first time you moved
> to working with pte_dirty); though of course I've not been using

Thanks for testing!
> So I don't think it's worth looking for somewhere to add a SetPageDirty
> (or a delete_from_swap_cache) just to regularize that case.

Exactly.

^ permalink raw reply	[flat|nested] 26+ messages in thread
* [PATCH 5/5] mm: mark stable page dirty in KSM
  2015-10-19  6:31 [PATCH 0/5] MADV_FREE refactoring and fix KSM page Minchan Kim
  ` (3 preceding siblings ...)
  2015-10-19  6:31 ` [PATCH 4/5] mm: simplify reclaim path for MADV_FREE Minchan Kim
@ 2015-10-19  6:31 ` Minchan Kim
  2015-10-27  2:23   ` Hugh Dickins
  2015-10-19 10:01 ` [PATCH 0/5] MADV_FREE refactoring and fix KSM page Minchan Kim
  5 siblings, 1 reply; 26+ messages in thread
From: Minchan Kim @ 2015-10-19  6:31 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-kernel, Hugh Dickins, Rik van Riel, Mel Gorman,
	Michal Hocko, Johannes Weiner, Kirill A. Shutemov, Vlastimil Babka,
	Minchan Kim

A stable page can be shared by several processes, and after CoW or
zapping happens in every process except the last one, the last process
can end up owning the page with neither the dirty bit set in its page
table entry nor PG_dirty set in page->flags.
In this case, MADV_FREE could discard the page wrongly.
To prevent it, we mark the stable page dirty.

Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/ksm.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/mm/ksm.c b/mm/ksm.c
index 8f0faf809bf5..659e2b5119c0 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1050,6 +1050,18 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
 		 */
 		set_page_stable_node(page, NULL);
 		mark_page_accessed(page);
+		/*
+		 * A stable page can be shared by several processes,
+		 * and after CoW or zapping happens in every process
+		 * except the last one, the page table entry of the
+		 * page in the last process can have no dirty bit.
+		 * In this case, MADV_FREE could discard the page
+		 * wrongly.
+		 * To prevent it, we mark the stable page dirty.
+		 */
+		if (!PageDirty(page))
+			SetPageDirty(page);
 		err = 0;
 	} else if (pages_identical(page, kpage))
 		err = replace_page(vma, page, kpage, orig_pte);
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 26+ messages in thread
* Re: [PATCH 5/5] mm: mark stable page dirty in KSM
  2015-10-19  6:31 ` [PATCH 5/5] mm: mark stable page dirty in KSM Minchan Kim
@ 2015-10-27  2:23   ` Hugh Dickins
  2015-10-27  6:58     ` Minchan Kim
  0 siblings, 1 reply; 26+ messages in thread
From: Hugh Dickins @ 2015-10-27  2:23 UTC (permalink / raw)
To: Minchan Kim
Cc: Andrew Morton, linux-mm, linux-kernel, Hugh Dickins, Rik van Riel,
	Mel Gorman, Michal Hocko, Johannes Weiner, Kirill A. Shutemov,
	Vlastimil Babka

On Mon, 19 Oct 2015, Minchan Kim wrote:

> Stable page could be shared by several processes and last process
> could own the page among them after CoW or zapping for every process
> except last process happens. Then, page table entry of the page
> in last process can have no dirty bit and PG_dirty flag in page->flags.
> In this case, MADV_FREE could discard the page wrongly.
> For preventing it, we mark stable page dirty.

I agree with the change, but found that comment (repeated in the source)
rather hard to follow.  And it doesn't really do justice to the changes
you have made.

This is not now a MADV_FREE thing, it's more general than that, even
if MADV_FREE is the only thing that takes advantage of it.  I like
very much that you've made page reclaim sane, freeing non-dirty
anonymous pages instead of swapping them out, without having to
think of whether it's for MADV_FREE or not.

Would you mind if we replace your patch by a re-commented version?

[PATCH] mm: mark stable page dirty in KSM

The MADV_FREE patchset changes page reclaim to simply free a clean
anonymous page with no dirty ptes, instead of swapping it out; but
KSM uses clean write-protected ptes to reference the stable ksm page.
So be sure to mark that page dirty, so it's never mistakenly discarded.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/ksm.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff -puN mm/ksm.c~mm-mark-stable-page-dirty-in-ksm mm/ksm.c
--- a/mm/ksm.c~mm-mark-stable-page-dirty-in-ksm
+++ a/mm/ksm.c
@@ -1050,6 +1050,12 @@ static int try_to_merge_one_page(struct
 		 */
 		set_page_stable_node(page, NULL);
 		mark_page_accessed(page);
+		/*
+		 * Page reclaim just frees a clean page with no dirty
+		 * ptes: make sure that the ksm page would be swapped.
+		 */
+		if (!PageDirty(page))
+			SetPageDirty(page);
 		err = 0;
 	} else if (pages_identical(page, kpage))
 		err = replace_page(vma, page, kpage, orig_pte);

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH 5/5] mm: mark stable page dirty in KSM 2015-10-27 2:23 ` Hugh Dickins @ 2015-10-27 6:58 ` Minchan Kim 0 siblings, 0 replies; 26+ messages in thread From: Minchan Kim @ 2015-10-27 6:58 UTC (permalink / raw) To: Hugh Dickins Cc: Andrew Morton, linux-mm, linux-kernel, Rik van Riel, Mel Gorman, Michal Hocko, Johannes Weiner, Kirill A. Shutemov, Vlastimil Babka On Mon, Oct 26, 2015 at 07:23:12PM -0700, Hugh Dickins wrote: > On Mon, 19 Oct 2015, Minchan Kim wrote: > > > Stable page could be shared by several processes and last process > > could own the page among them after CoW or zapping for every process > > except last process happens. Then, page table entry of the page > > in last process can have no dirty bit and PG_dirty flag in page->flags. > > In this case, MADV_FREE could discard the page wrongly. > > For preventing it, we mark stable page dirty. > > I agree with the change, but found that comment (repeated in the source) > rather hard to follow. And it doesn't really do justice to the changes > you have made. > > This is not now a MADV_FREE thing, it's more general than that, even > if MADV_FREE is the only thing that takes advantage of it. I like > very much that you've made page reclaim sane, freeing non-dirty > anonymous pages instead of swapping them out, without having to > think of whether it's for MADV_FREE or not. > > Would you mind if we replace your patch by a re-commented version? > > [PATCH] mm: mark stable page dirty in KSM > > The MADV_FREE patchset changes page reclaim to simply free a clean > anonymous page with no dirty ptes, instead of swapping it out; but > KSM uses clean write-protected ptes to reference the stable ksm page. > So be sure to mark that page dirty, so it's never mistakenly discarded. > > Signed-off-by: Minchan Kim <minchan@kernel.org> > Signed-off-by: Hugh Dickins <hughd@google.com> Looks better than mine. I will include this in my patchset when I respin. Thanks! 
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/5] MADV_FREE refactoring and fix KSM page
  2015-10-19  6:31 [PATCH 0/5] MADV_FREE refactoring and fix KSM page Minchan Kim
  ` (4 preceding siblings ...)
  2015-10-19  6:31 ` [PATCH 5/5] mm: mark stable page dirty in KSM Minchan Kim
@ 2015-10-19 10:01 ` Minchan Kim
  2015-10-20  1:38   ` Minchan Kim
  2015-10-20  7:21   ` Minchan Kim
  5 siblings, 2 replies; 26+ messages in thread
From: Minchan Kim @ 2015-10-19 10:01 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-kernel, Hugh Dickins, Rik van Riel, Mel Gorman,
	Michal Hocko, Johannes Weiner, Kirill A. Shutemov, Vlastimil Babka

On Mon, Oct 19, 2015 at 03:31:42PM +0900, Minchan Kim wrote:
> Hello, it's too late since I sent previous patch.
> https://lkml.org/lkml/2015/6/3/37
>
> This patch is almost new compared to previous approach.
> I think this is more simple, clear and easy to review.
>
> One thing I should notice is that I have tested this patch
> and couldn't find any critical problem so I rebased patchset
> onto recent mmotm(ie, mmotm-2015-10-15-15-20) to send formal
> patchset. Unfortunately, I start to see sudden discarding of
> the page we shouldn't do. IOW, application's valid anonymous
> pages disappeared suddenly.
>
> When I look through THP changes, I think we could lose the
> dirty bit of the pte between freeze_page and unfreeze_page
> when we mark it as a migration entry and restore it.
> So, I added below simple code without enough consideration
> and cannot see the problem any more.
> I hope it's a good hint to find the right fix for this problem.
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d5ea516ffb54..e881c04f5950 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3138,6 +3138,9 @@ static void unfreeze_page_vma(struct vm_area_struct *vma, struct page *page,
> 		if (is_write_migration_entry(swp_entry))
> 			entry = maybe_mkwrite(entry, vma);
>
> +		if (PageDirty(page))
> +			SetPageDirty(page);

The PageDirty() condition was a typo; I didn't mean to add the condition.
What I actually added was just:

		SetPageDirty(page);

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH 0/5] MADV_FREE refactoring and fix KSM page 2015-10-19 10:01 ` [PATCH 0/5] MADV_FREE refactoring and fix KSM page Minchan Kim @ 2015-10-20 1:38 ` Minchan Kim 2015-10-20 7:21 ` Minchan Kim 1 sibling, 0 replies; 26+ messages in thread From: Minchan Kim @ 2015-10-20 1:38 UTC (permalink / raw) To: Andrew Morton, Kirill A. Shutemov Cc: linux-mm, linux-kernel, Hugh Dickins, Rik van Riel, Mel Gorman, Michal Hocko, Johannes Weiner, Kirill A. Shutemov, Vlastimil Babka [-- Attachment #1: Type: text/plain, Size: 10303 bytes --] On Mon, Oct 19, 2015 at 07:01:50PM +0900, Minchan Kim wrote: > On Mon, Oct 19, 2015 at 03:31:42PM +0900, Minchan Kim wrote: > > Hello, it's too late since I sent previos patch. > > https://lkml.org/lkml/2015/6/3/37 > > > > This patch is alomost new compared to previos approach. > > I think this is more simple, clear and easy to review. > > > > One thing I should notice is that I have tested this patch > > and couldn't find any critical problem so I rebased patchset > > onto recent mmotm(ie, mmotm-2015-10-15-15-20) to send formal > > patchset. Unfortunately, I start to see sudden discarding of > > the page we shouldn't do. IOW, application's valid anonymous page > > was disappeared suddenly. > > > > When I look through THP changes, I think we could lose > > dirty bit of pte between freeze_page and unfreeze_page > > when we mark it as migration entry and restore it. > > So, I added below simple code without enough considering > > and cannot see the problem any more. > > I hope it's good hint to find right fix this problem. 
> > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > index d5ea516ffb54..e881c04f5950 100644 > > --- a/mm/huge_memory.c > > +++ b/mm/huge_memory.c > > @@ -3138,6 +3138,9 @@ static void unfreeze_page_vma(struct vm_area_struct *vma, struct page *page, > > if (is_write_migration_entry(swp_entry)) > > entry = maybe_mkwrite(entry, vma); > > > > + if (PageDirty(page)) > > + SetPageDirty(page); > > The condition of PageDirty was typo. I didn't add the condition. > Just added. > > SetPageDirty(page); For the first step to find this bug, I removed all MADV_FREE related code in mmotm-2015-10-15-15-20. IOW, git checkout 54bad5da4834 (arm64: add pmd_[dirty|mkclean] for THP) so the tree doesn't have any core code of MADV_FREE. I tested following workloads in my KVM machine. 0. make memcg 1. limit memcg 2. fork several processes 3. each process allocates THP page and fill 4. increase limit of the memcg to swapoff successfully 5. swapoff 6. kill all of processes 7. goto 1 Within a few hours, I encounter following bug. Attached detailed boot log and dmesg result. Initializing cgroup subsys cpu Command line: hung_task_panic=1 earlyprintk=ttyS0,115200 debug apic=debug sysrq_always_enabled rcupdate.rcu_cpu_stall_timeout=100 panic=-1 softlockup_panic=1 nmi_watchdog=panic oops=panic console=ttyS0,115200 console=tty0 earlyprintk=ttyS0 ignore_loglevel ftrace_dump_on_oops vga=normal root=/dev/vda1 rw KERNEL supported cpus: Intel GenuineIntel x86/fpu: Legacy x87 FPU detected. x86/fpu: Using 'lazy' FPU context switches. 
e820: BIOS-provided physical RAM map: BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved BIOS-e820: [mem 0x0000000000100000-0x00000000bfffbfff] usable BIOS-e820: [mem 0x00000000bfffc000-0x00000000bfffffff] reserved BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved <snip> Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS Adding 4191228k swap on /dev/vda5. 
Priority:-1 extents:1 across:4191228k FS BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 IP: [<ffffffff810782a9>] down_read_trylock+0x9/0x30 PGD 0 Oops: 0000 [#1] SMP Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: CPU: 1 PID: 26445 Comm: sh Not tainted 4.3.0-rc5-mm1-diet-meta+ #1545 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 task: ffff8800b9af3480 ti: ffff88007fea0000 task.ti: ffff88007fea0000 RIP: 0010:[<ffffffff810782a9>] [<ffffffff810782a9>] down_read_trylock+0x9/0x30 RSP: 0018:ffff88007fea3648 EFLAGS: 00010202 RAX: 0000000000000001 RBX: ffffea0002324900 RCX: ffff88007fea37e8 RDX: 0000000000000000 RSI: ffff88007fea36e8 RDI: 0000000000000008 RBP: ffff88007fea3648 R08: ffffffff818446a0 R09: ffff8800b9af4c80 R10: 0000000000000216 R11: 0000000000000001 R12: ffff88007f58d6e1 R13: ffff88007f58d6e0 R14: 0000000000000008 R15: 0000000000000001 FS: 00007f0993e78740(0000) GS:ffff8800bfa20000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000008 CR3: 000000007edee000 CR4: 00000000000006a0 Stack: ffff88007fea3678 ffffffff81124ff0 ffffea0002324900 ffff88007fea36e8 ffff88009ffe8400 0000000000000000 ffff88007fea36c0 ffffffff81125733 ffff8800bfa34540 ffffffff8105dc9d ffffea0002324900 ffff88007fea37e8 Call Trace: [<ffffffff81124ff0>] page_lock_anon_vma_read+0x60/0x180 [<ffffffff81125733>] rmap_walk+0x1b3/0x3f0 [<ffffffff8105dc9d>] ? finish_task_switch+0x5d/0x1f0 [<ffffffff81125b13>] page_referenced+0x1a3/0x220 [<ffffffff81123e30>] ? __page_check_address+0x1a0/0x1a0 [<ffffffff81124f90>] ? page_get_anon_vma+0xd0/0xd0 [<ffffffff81123820>] ? 
anon_vma_ctor+0x40/0x40 [<ffffffff8110087b>] shrink_page_list+0x5ab/0xde0 [<ffffffff8110174c>] shrink_inactive_list+0x18c/0x4b0 [<ffffffff811023bd>] shrink_lruvec+0x59d/0x740 [<ffffffff811025f0>] shrink_zone+0x90/0x250 [<ffffffff811028dd>] do_try_to_free_pages+0x12d/0x3b0 [<ffffffff81102d3d>] try_to_free_mem_cgroup_pages+0x9d/0x120 [<ffffffff811496c3>] try_charge+0x163/0x700 [<ffffffff81149cb4>] mem_cgroup_do_precharge+0x54/0x70 [<ffffffff81149e45>] mem_cgroup_can_attach+0x175/0x1b0 [<ffffffff811b2c57>] ? kernfs_iattrs.isra.6+0x37/0xd0 [<ffffffff81148e70>] ? get_mctgt_type+0x320/0x320 [<ffffffff810a9d29>] cgroup_migrate+0x149/0x440 [<ffffffff810aa60c>] cgroup_attach_task+0x7c/0xe0 [<ffffffff810aa904>] __cgroup_procs_write.isra.33+0x1d4/0x2b0 [<ffffffff810aaa10>] cgroup_tasks_write+0x10/0x20 [<ffffffff810a6238>] cgroup_file_write+0x38/0xf0 [<ffffffff811b54ad>] kernfs_fop_write+0x11d/0x170 [<ffffffff81153918>] __vfs_write+0x28/0xe0 [<ffffffff8116e614>] ? __fd_install+0x24/0xc0 [<ffffffff810784a1>] ? 
percpu_down_read+0x21/0x50 [<ffffffff81153e91>] vfs_write+0xa1/0x170 [<ffffffff81154716>] SyS_write+0x46/0xa0 [<ffffffff81420a17>] entry_SYSCALL_64_fastpath+0x12/0x6a Code: 5e 82 3a 00 48 83 c4 08 5b 5d c3 48 89 45 f0 e8 9b 6a 3a 00 48 8b 45 f0 eb df 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 <48> 8b 07 48 89 c2 48 83 c2 01 7e 07 f0 48 0f b1 17 75 f0 48 f7 RIP [<ffffffff810782a9>] down_read_trylock+0x9/0x30 RSP <ffff88007fea3648> CR2: 0000000000000008 BUG: unable to handle kernel ---[ end trace e81a82c8122b447d ]--- Kernel panic - not syncing: Fatal exception NULL pointer dereference at 0000000000000008 IP: [<ffffffff810782a9>] down_read_trylock+0x9/0x30 PGD 0 Oops: 0000 [#2] SMP Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: CPU: 10 PID: 59 Comm: khugepaged Tainted: G D 4.3.0-rc5-mm1-diet-meta+ #1545 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 task: ffff8800b9851a40 ti: ffff8800b985c000 task.ti: ffff8800b985c000 RIP: 0010:[<ffffffff810782a9>] [<ffffffff810782a9>] down_read_trylock+0x9/0x30 RSP: 0018:ffff8800b985f778 EFLAGS: 00010202 RAX: 0000000000000001 RBX: ffffea0002321800 RCX: ffff8800b985f918 RDX: 0000000000000000 RSI: ffff8800b985f818 RDI: 0000000000000008 RBP: ffff8800b985f778 R08: ffffffff818446a0 R09: ffff8800b9853240 R10: 000000000000ba03 R11: 0000000000000001 R12: ffff88007f58d6e1 R13: ffff88007f58d6e0 R14: 0000000000000008 R15: 0000000000000001 FS: 0000000000000000(0000) GS:ffff8800bfb40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000008 CR3: 0000000001808000 CR4: 00000000000006a0 Stack: ffff8800b985f7a8 ffffffff81124ff0 ffffea0002321800 ffff8800b985f818 ffff88009ffe8400 0000000000000000 ffff8800b985f7f0 ffffffff81125733 ffff8800bfb54540 ffffffff8105dc9d ffffea0002321800 ffff8800b985f918 Call Trace: [<ffffffff81124ff0>] page_lock_anon_vma_read+0x60/0x180 [<ffffffff81125733>] rmap_walk+0x1b3/0x3f0 [<ffffffff8105dc9d>] ? 
finish_task_switch+0x5d/0x1f0 [<ffffffff81125b13>] page_referenced+0x1a3/0x220 [<ffffffff81123e30>] ? __page_check_address+0x1a0/0x1a0 [<ffffffff81124f90>] ? page_get_anon_vma+0xd0/0xd0 [<ffffffff81123820>] ? anon_vma_ctor+0x40/0x40 [<ffffffff8110087b>] shrink_page_list+0x5ab/0xde0 [<ffffffff8110174c>] shrink_inactive_list+0x18c/0x4b0 [<ffffffff811023bd>] shrink_lruvec+0x59d/0x740 [<ffffffff811025f0>] shrink_zone+0x90/0x250 [<ffffffff811028dd>] do_try_to_free_pages+0x12d/0x3b0 [<ffffffff81102d3d>] try_to_free_mem_cgroup_pages+0x9d/0x120 [<ffffffff811496c3>] try_charge+0x163/0x700 [<ffffffff8141d1f3>] ? schedule+0x33/0x80 [<ffffffff8114d45f>] mem_cgroup_try_charge+0x9f/0x1d0 [<ffffffff811434bc>] khugepaged+0x7cc/0x1ac0 [<ffffffff81066e01>] ? hrtick_update+0x1/0x70 [<ffffffff81072430>] ? prepare_to_wait_event+0xf0/0xf0 [<ffffffff81142cf0>] ? total_mapcount+0x70/0x70 [<ffffffff81056cd9>] kthread+0xc9/0xe0 [<ffffffff81056c10>] ? kthread_park+0x60/0x60 [<ffffffff81420d6f>] ret_from_fork+0x3f/0x70 [<ffffffff81056c10>] ? 
kthread_park+0x60/0x60 Code: 5e 82 3a 00 48 83 c4 08 5b 5d c3 48 89 45 f0 e8 9b 6a 3a 00 48 8b 45 f0 eb df 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 <48> 8b 07 48 89 c2 48 83 c2 01 7e 07 f0 48 0f b1 17 75 f0 48 f7 RIP [<ffffffff810782a9>] down_read_trylock+0x9/0x30 RSP <ffff8800b985f778> CR2: 0000000000000008 ---[ end trace e81a82c8122b447e ]--- Shutting down cpus with NMI Dumping ftrace buffer: (ftrace buffer empty) Kernel Offset: disabled [-- Attachment #2: test_bug.log --] [-- Type: text/plain, Size: 46938 bytes --] QEMU 2.0.0 monitor - type 'help' for more information (qemu) s^[[Kearly console in setup code Initializing cgroup subsys cpu Linux version 4.3.0-rc5-mm1-diet-meta+ (barrios@bbox) (gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04) ) #1545 SMP Tue Oct 20 08:55:45 KST 2015 Command line: hung_task_panic=1 earlyprintk=ttyS0,115200 debug apic=debug sysrq_always_enabled rcupdate.rcu_cpu_stall_timeout=100 panic=-1 softlockup_panic=1 nmi_watchdog=panic oops=panic console=ttyS0,115200 console=tty0 earlyprintk=ttyS0 ignore_loglevel ftrace_dump_on_oops vga=normal root=/dev/vda1 rw KERNEL supported cpus: Intel GenuineIntel x86/fpu: Legacy x87 FPU detected. x86/fpu: Using 'lazy' FPU context switches. e820: BIOS-provided physical RAM map: BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved BIOS-e820: [mem 0x0000000000100000-0x00000000bfffbfff] usable BIOS-e820: [mem 0x00000000bfffc000-0x00000000bfffffff] reserved BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved bootconsole [earlyser0] enabled debug: ignoring loglevel setting. NX (Execute Disable) protection: active SMBIOS 2.4 present. 
DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 e820: update [mem 0x00000000-0x00000fff] usable ==> reserved e820: remove [mem 0x000a0000-0x000fffff] usable e820: last_pfn = 0xbfffc max_arch_pfn = 0x400000000 MTRR default type: write-back MTRR fixed ranges enabled: 00000-9FFFF write-back A0000-BFFFF uncachable C0000-FFFFF write-protect MTRR variable ranges enabled: 0 base 00C0000000 mask FFC0000000 uncachable 1 disabled 2 disabled 3 disabled 4 disabled 5 disabled 6 disabled 7 disabled x86/PAT: PAT not supported by CPU. Scan for SMP in [mem 0x00000000-0x000003ff] Scan for SMP in [mem 0x0009fc00-0x0009ffff] Scan for SMP in [mem 0x000f0000-0x000fffff] found SMP MP-table at [mem 0x000f0a70-0x000f0a7f] mapped at [ffff8800000f0a70] mpc: f0a80-f0c44 Scanning 1 areas for low memory corruption Base memory trampoline at [ffff880000099000] 99000 size 24576 init_memory_mapping: [mem 0x00000000-0x000fffff] [mem 0x00000000-0x000fffff] page 4k BRK [0x0220e000, 0x0220efff] PGTABLE BRK [0x0220f000, 0x0220ffff] PGTABLE BRK [0x02210000, 0x02210fff] PGTABLE init_memory_mapping: [mem 0xbfc00000-0xbfdfffff] [mem 0xbfc00000-0xbfdfffff] page 2M BRK [0x02211000, 0x02211fff] PGTABLE init_memory_mapping: [mem 0xa0000000-0xbfbfffff] [mem 0xa0000000-0xbfbfffff] page 2M init_memory_mapping: [mem 0x80000000-0x9fffffff] [mem 0x80000000-0x9fffffff] page 2M init_memory_mapping: [mem 0x00100000-0x7fffffff] [mem 0x00100000-0x001fffff] page 4k [mem 0x00200000-0x7fffffff] page 2M init_memory_mapping: [mem 0xbfe00000-0xbfffbfff] [mem 0xbfe00000-0xbfffbfff] page 4k BRK [0x02212000, 0x02212fff] PGTABLE RAMDISK: [mem 0x7851a000-0x7fffffff] [ffffea0000000000-ffffea0002ffffff] PMD -> [ffff8800bc400000-ffff8800bf3fffff] on node 0 Zone ranges: DMA [mem 0x0000000000001000-0x0000000000ffffff] DMA32 [mem 0x0000000001000000-0x00000000bfffbfff] Normal empty Movable zone start for each node Early memory node ranges node 0: [mem 0x0000000000001000-0x000000000009efff] node 0: [mem 
0x0000000000100000-0x00000000bfffbfff] Initmem setup node 0 [mem 0x0000000000001000-0x00000000bfffbfff] On node 0 totalpages: 786330 DMA zone: 64 pages used for memmap DMA zone: 21 pages reserved DMA zone: 3998 pages, LIFO batch:0 DMA32 zone: 12224 pages used for memmap DMA32 zone: 782332 pages, LIFO batch:31 Intel MultiProcessor Specification v1.4 mpc: f0a80-f0c44 MPTABLE: OEM ID: BOCHSCPU MPTABLE: Product ID: 0.1 MPTABLE: APIC at: 0xFEE00000 mapped APIC to ffffffffff5fd000 ( fee00000) Processor #0 (Bootup-CPU) Processor #1 Processor #2 Processor #3 Processor #4 Processor #5 Processor #6 Processor #7 Processor #8 Processor #9 Processor #10 Processor #11 Bus #0 is PCI Bus #1 is ISA IOAPIC[0]: apic_id 0, version 17, address 0xfec00000, GSI 0-23 Int: type 0, pol 1, trig 0, bus 00, IRQ 04, APIC ID 0, APIC INT 09 Int: type 0, pol 1, trig 0, bus 00, IRQ 0c, APIC ID 0, APIC INT 0b Int: type 0, pol 1, trig 0, bus 00, IRQ 10, APIC ID 0, APIC INT 0b Int: type 0, pol 1, trig 0, bus 00, IRQ 14, APIC ID 0, APIC INT 0a Int: type 0, pol 1, trig 0, bus 00, IRQ 18, APIC ID 0, APIC INT 0a Int: type 0, pol 0, trig 0, bus 01, IRQ 00, APIC ID 0, APIC INT 02 Int: type 0, pol 0, trig 0, bus 01, IRQ 01, APIC ID 0, APIC INT 01 Int: type 0, pol 0, trig 0, bus 01, IRQ 03, APIC ID 0, APIC INT 03 Int: type 0, pol 0, trig 0, bus 01, IRQ 04, APIC ID 0, APIC INT 04 Int: type 0, pol 0, trig 0, bus 01, IRQ 06, APIC ID 0, APIC INT 06 Int: type 0, pol 0, trig 0, bus 01, IRQ 07, APIC ID 0, APIC INT 07 Int: type 0, pol 0, trig 0, bus 01, IRQ 08, APIC ID 0, APIC INT 08 Int: type 0, pol 0, trig 0, bus 01, IRQ 0c, APIC ID 0, APIC INT 0c Int: type 0, pol 0, trig 0, bus 01, IRQ 0d, APIC ID 0, APIC INT 0d Int: type 0, pol 0, trig 0, bus 01, IRQ 0e, APIC ID 0, APIC INT 0e Int: type 0, pol 0, trig 0, bus 01, IRQ 0f, APIC ID 0, APIC INT 0f Lint: type 3, pol 0, trig 0, bus 01, IRQ 00, APIC ID 0, APIC LINT 00 Lint: type 1, pol 0, trig 0, bus 01, IRQ 00, APIC ID ff, APIC LINT 01 Processors: 12 smpboot: Allowing 
12 CPUs, 0 hotplug CPUs mapped IOAPIC to ffffffffff5fc000 (fec00000) e820: [mem 0xc0000000-0xfeffbfff] available for PCI devices clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645519600211568 ns setup_percpu: NR_CPUS:16 nr_cpumask_bits:16 nr_cpu_ids:12 nr_node_ids:1 PERCPU: Embedded 31 pages/cpu @ffff8800bfa00000 s87640 r8192 d31144 u131072 pcpu-alloc: s87640 r8192 d31144 u131072 alloc=1*2097152 pcpu-alloc: [0] 00 01 02 03 04 05 06 07 08 09 10 11 -- -- -- -- Built 1 zonelists in Zone order, mobility grouping on. Total pages: 774021 Kernel command line: hung_task_panic=1 earlyprintk=ttyS0,115200 debug apic=debug sysrq_always_enabled rcupdate.rcu_cpu_stall_timeout=100 panic=-1 softlockup_panic=1 nmi_watchdog=panic oops=panic console=ttyS0,115200 console=tty0 earlyprintk=ttyS0 ignore_loglevel ftrace_dump_on_oops vga=normal root=/dev/vda1 rw sysrq: sysrq always enabled. log_buf_len individual max cpu contribution: 2097152 bytes log_buf_len total cpu_extra contributions: 23068672 bytes log_buf_len min size: 8388608 bytes log_buf_len: 33554432 bytes early log buf free: 8380096(99%) PID hash table entries: 4096 (order: 3, 32768 bytes) Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes) Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes) Memory: 2911172K/3145320K available (4237K kernel code, 721K rwdata, 1988K rodata, 936K init, 8608K bss, 234148K reserved, 0K cma-reserved) SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=12, Nodes=1 Hierarchical RCU implementation. Build-time adjustment of leaf fanout to 64. RCU restricting CPUs from NR_CPUS=16 to nr_cpu_ids=12. 
RCU: Adjusting geometry for rcu_fanout_leaf=64, nr_cpu_ids=12 NR_IRQS:4352 nr_irqs:136 16 Console: colour VGA+ 80x25 console [tty0] enabled bootconsole [earlyser0] disabled console [ttyS0] enabled tsc: Fast TSC calibration using PIT tsc: Detected 3199.926 MHz processor Calibrating delay loop (skipped), value calculated using timer frequency.. 6399.85 BogoMIPS (lpj=12799704) pid_max: default: 32768 minimum: 301 Mount-cache hash table entries: 8192 (order: 4, 65536 bytes) Mountpoint-cache hash table entries: 8192 (order: 4, 65536 bytes) Initializing cgroup subsys memory Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0 Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0 Freeing SMP alternatives memory: 20K (ffffffff819a0000 - ffffffff819a5000) ftrace: allocating 16664 entries in 66 pages Switched APIC routing to physical flat.
enabled ExtINT on CPU#0 ENABLING IO-APIC IRQs init IO_APIC IRQs apic 0 pin 0 not connected IOAPIC[0]: Set routing entry (0-1 -> 0x31 -> IRQ 1 Mode:0 Active:0 Dest:0) IOAPIC[0]: Set routing entry (0-2 -> 0x30 -> IRQ 0 Mode:0 Active:0 Dest:0) IOAPIC[0]: Set routing entry (0-3 -> 0x33 -> IRQ 3 Mode:0 Active:0 Dest:0) IOAPIC[0]: Set routing entry (0-4 -> 0x34 -> IRQ 4 Mode:0 Active:0 Dest:0) apic 0 pin 5 not connected IOAPIC[0]: Set routing entry (0-6 -> 0x36 -> IRQ 6 Mode:0 Active:0 Dest:0) IOAPIC[0]: Set routing entry (0-7 -> 0x37 -> IRQ 7 Mode:0 Active:0 Dest:0) IOAPIC[0]: Set routing entry (0-8 -> 0x38 -> IRQ 8 Mode:0 Active:0 Dest:0) IOAPIC[0]: Set routing entry (0-9 -> 0x39 -> IRQ 9 Mode:1 Active:0 Dest:0) IOAPIC[0]: Set routing entry (0-10 -> 0x3a -> IRQ 10 Mode:1 Active:0 Dest:0) IOAPIC[0]: Set routing entry (0-11 -> 0x3b -> IRQ 11 Mode:1 Active:0 Dest:0) IOAPIC[0]: Set routing entry (0-12 -> 0x3c -> IRQ 12 Mode:0 Active:0 Dest:0) IOAPIC[0]: Set routing entry (0-13 -> 0x3d -> IRQ 13 Mode:0 Active:0 Dest:0) IOAPIC[0]: Set routing entry (0-14 -> 0x3e -> IRQ 14 Mode:0 Active:0 Dest:0) IOAPIC[0]: Set routing entry (0-15 -> 0x3f -> IRQ 15 Mode:0 Active:0 Dest:0) apic 0 pin 16 not connected apic 0 pin 17 not connected apic 0 pin 18 not connected apic 0 pin 19 not connected apic 0 pin 20 not connected apic 0 pin 21 not connected apic 0 pin 22 not connected apic 0 pin 23 not connected ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1 Using local APIC timer interrupts. calibrating APIC timer ... ... lapic delta = 6251755 ..... delta 6251755 ..... mult: 268510832 ..... calibration result: 4001123 ..... CPU clock speed is 3200.3592 MHz. ..... host bus clock speed is 1000.1123 MHz. ... verify APIC timer ... jiffies delta = 25 ... jiffies result ok smpboot: CPU0: Intel QEMU Virtual CPU version 2.0.0 (family: 0x6, model: 0x6, stepping: 0x3) Performance Events: Broken PMU hardware detected, using software events only. 
Failed to access perfctr msr (MSR c2 is 0) x86: Booting SMP configuration: .... node #0, CPUs: #1 masked ExtINT on CPU#1 #2 masked ExtINT on CPU#2 #3 masked ExtINT on CPU#3 #4 masked ExtINT on CPU#4 #5 masked ExtINT on CPU#5 #6 masked ExtINT on CPU#6 #7 masked ExtINT on CPU#7 #8 masked ExtINT on CPU#8 #9 masked ExtINT on CPU#9 #10 masked ExtINT on CPU#10 #11 masked ExtINT on CPU#11 x86: Booted up 1 node, 12 CPUs smpboot: Total of 12 processors activated (76818.13 BogoMIPS) devtmpfs: initialized clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns NET: Registered protocol family 16 PCI: Using configuration type 1 for base access vgaarb: loaded SCSI subsystem initialized libata version 3.00 loaded. PCI: Probing PCI hardware PCI: root bus 00: using default resources PCI: Probing PCI hardware (bus 00) PCI host bridge to bus 0000:00 pci_bus 0000:00: root bus resource [io 0x0000-0xffff] pci_bus 0000:00: root bus resource [mem 0x00000000-0xffffffffff] pci_bus 0000:00: No busn resource found for root bus, will use [bus 00-ff] pci 0000:00:00.0: [8086:1237] type 00 class 0x060000 pci 0000:00:01.0: [8086:7000] type 00 class 0x060100 pci 0000:00:01.1: [8086:7010] type 00 class 0x010180 pci 0000:00:01.1: reg 0x20: [io 0xc0c0-0xc0cf] pci 0000:00:01.1: legacy IDE quirk: reg 0x10: [io 0x01f0-0x01f7] pci 0000:00:01.1: legacy IDE quirk: reg 0x14: [io 0x03f6] pci 0000:00:01.1: legacy IDE quirk: reg 0x18: [io 0x0170-0x0177] pci 0000:00:01.1: legacy IDE quirk: reg 0x1c: [io 0x0376] pci 0000:00:01.3: [8086:7113] type 00 class 0x068000 pci 0000:00:02.0: [1013:00b8] type 00 class 0x030000 pci 0000:00:02.0: reg 0x10: [mem 0xfc000000-0xfdffffff pref] pci 0000:00:02.0: reg 0x14: [mem 0xfebd0000-0xfebd0fff] pci 0000:00:02.0: reg 0x30: [mem 0xfebc0000-0xfebcffff pref] vgaarb: setting as boot device: PCI:0000:00:02.0 vgaarb: device added: PCI:0000:00:02.0,decodes=io+mem,owns=io+mem,locks=none pci 0000:00:03.0: [1af4:1000] type 00 class 0x020000 pci 
0000:00:03.0: reg 0x10: [io 0xc080-0xc09f] pci 0000:00:03.0: reg 0x14: [mem 0xfebd1000-0xfebd1fff] pci 0000:00:03.0: reg 0x30: [mem 0xfeb80000-0xfebbffff pref] pci 0000:00:04.0: [1af4:1002] type 00 class 0x00ff00 pci 0000:00:04.0: reg 0x10: [io 0xc0a0-0xc0bf] pci 0000:00:05.0: [1af4:1001] type 00 class 0x010000 pci 0000:00:05.0: reg 0x10: [io 0xc000-0xc03f] pci 0000:00:05.0: reg 0x14: [mem 0xfebd2000-0xfebd2fff] pci 0000:00:06.0: [1af4:1001] type 00 class 0x010000 pci 0000:00:06.0: reg 0x10: [io 0xc040-0xc07f] pci 0000:00:06.0: reg 0x14: [mem 0xfebd3000-0xfebd3fff] pci 0000:00:07.0: [8086:25ab] type 00 class 0x088000 pci 0000:00:07.0: reg 0x10: [mem 0xfebd4000-0xfebd400f] pci_bus 0000:00: busn_res: [bus 00-ff] end is updated to 00 pci 0000:00:01.0: PIIX/ICH IRQ router [8086:7000] PCI: pci_cache_line_size set to 64 bytes e820: reserve RAM buffer [mem 0x0009fc00-0x0009ffff] e820: reserve RAM buffer [mem 0xbfffc000-0xbfffffff] clocksource: Switched to clocksource refined-jiffies pci_bus 0000:00: resource 4 [io 0x0000-0xffff] pci_bus 0000:00: resource 5 [mem 0x00000000-0xffffffffff] NET: Registered protocol family 2 TCP established hash table entries: 32768 (order: 6, 262144 bytes) TCP bind hash table entries: 32768 (order: 7, 524288 bytes) TCP: Hash tables configured (established 32768 bind 32768) UDP hash table entries: 2048 (order: 4, 65536 bytes) UDP-Lite hash table entries: 2048 (order: 4, 65536 bytes) NET: Registered protocol family 1 Trying to unpack rootfs image as initramfs... 
Freeing initrd memory: 125848K (ffff88007851a000 - ffff880080000000) platform rtc_cmos: registered platform RTC device (no PNP device found) Scanning for low memory corruption every 60 seconds futex hash table entries: 4096 (order: 6, 262144 bytes) HugeTLB registered 2 MB page size, pre-allocated 0 pages fuse init (API version 7.23) 9p: Installing v9fs 9p2000 file system support cryptomgr_test (74) used greatest stack depth: 15352 bytes left cryptomgr_test (82) used greatest stack depth: 15136 bytes left Block layer SCSI generic (bsg) driver version 0.4 loaded (major 251) io scheduler noop registered io scheduler deadline registered io scheduler cfq registered (default) querying PCI -> IRQ mapping bus:0, slot:3, pin:0. virtio-pci 0000:00:03.0: PCI->APIC IRQ transform: INT A -> IRQ 11 virtio-pci 0000:00:03.0: virtio_pci: leaving for legacy driver querying PCI -> IRQ mapping bus:0, slot:4, pin:0. virtio-pci 0000:00:04.0: PCI->APIC IRQ transform: INT A -> IRQ 11 virtio-pci 0000:00:04.0: virtio_pci: leaving for legacy driver querying PCI -> IRQ mapping bus:0, slot:5, pin:0. virtio-pci 0000:00:05.0: PCI->APIC IRQ transform: INT A -> IRQ 10 virtio-pci 0000:00:05.0: virtio_pci: leaving for legacy driver querying PCI -> IRQ mapping bus:0, slot:6, pin:0. 
virtio-pci 0000:00:06.0: PCI->APIC IRQ transform: INT A -> IRQ 10 virtio-pci 0000:00:06.0: virtio_pci: leaving for legacy driver Serial: 8250/16550 driver, 32 ports, IRQ sharing enabled serial8250: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A Linux agpgart interface v0.103 brd: module loaded loop: module loaded vda: vda1 vda2 < vda5 > zram: Added device: zram0 libphy: Fixed MDIO Bus: probed tun: Universal TUN/TAP device driver, 1.6 tun: (C) 1999-2004 Max Krasnyansky <maxk@qualcomm.com> serio: i8042 KBD port at 0x60,0x64 irq 1 serio: i8042 AUX port at 0x60,0x64 irq 12 mousedev: PS/2 mouse device common for all mice rtc_cmos rtc_cmos: rtc core: registered rtc_cmos as rtc0 rtc_cmos rtc_cmos: alarms up to one day, 114 bytes nvram device-mapper: ioctl: 4.33.0-ioctl (2015-8-18) initialised: dm-devel@redhat.com device-mapper: cache cleaner: version 1.0.0 loaded NET: Registered protocol family 17 9pnet: Installing 9P2000 support ... APIC ID: 00000000 (0) ... APIC VERSION: 01050014 0000000000000000000000000000000000000000000000000000000000000000 000000000e000000000000000000000000000000000000000000000000000000 0000000000020000000000000000000000000000000000000000000000008000 number of MP IRQ sources: 16. number of IO-APIC #0 registers: 24. testing the IO APIC....................... IO APIC #0...... .... register #00: 00000000 ....... : physical APIC id: 00 ....... : Delivery Type: 0 ....... : LTS : 0 .... register #01: 00170011 ....... : max redirection entries: 17 ....... : PRQ implemented: 0 ....... : IO APIC version: 11 .... register #02: 00000000 ....... : arbitration: 00 .... 
IRQ redirection table: IOAPIC 0: pin00, disabled, edge , high, V(00), IRR(0), S(0), physical, D(00), M(0) pin01, enabled , edge , high, V(31), IRR(0), S(0), physical, D(00), M(0) pin02, enabled , edge , high, V(30), IRR(0), S(0), physical, D(00), M(0) pin03, enabled , edge , high, V(33), IRR(0), S(0), physical, D(00), M(0) pin04, disabled, edge , high, V(34), IRR(0), S(0), physical, D(00), M(0) pin05, disabled, edge , high, V(00), IRR(0), S(0), physical, D(00), M(0) pin06, enabled , edge , high, V(36), IRR(0), S(0), physical, D(00), M(0) pin07, enabled , edge , high, V(37), IRR(0), S(0), physical, D(00), M(0) pin08, enabled , edge , high, V(38), IRR(0), S(0), physical, D(00), M(0) pin09, disabled, level, high, V(39), IRR(0), S(0), physical, D(00), M(0) pin0a, enabled , level, high, V(3A), IRR(0), S(0), physical, D(00), M(0) pin0b, enabled , level, high, V(3B), IRR(0), S(0), physical, D(00), M(0) pin0c, enabled , edge , high, V(3C), IRR(0), S(0), physical, D(00), M(0) pin0d, enabled , edge , high, V(3D), IRR(0), S(0), physical, D(00), M(0) pin0e, enabled , edge , high, V(3E), IRR(0), S(0), physical, D(00), M(0) pin0f, enabled , edge , high, V(3F), IRR(0), S(0), physical, D(00), M(0) pin10, disabled, edge , high, V(00), IRR(0), S(0), physical, D(00), M(0) pin11, disabled, edge , high, V(00), IRR(0), S(0), physical, D(00), M(0) pin12, disabled, edge , high, V(00), IRR(0), S(0), physical, D(00), M(0) pin13, disabled, edge , high, V(00), IRR(0), S(0), physical, D(00), M(0) pin14, disabled, edge , high, V(00), IRR(0), S(0), physical, D(00), M(0) pin15, disabled, edge , high, V(00), IRR(0), S(0), physical, D(00), M(0) pin16, disabled, edge , high, V(00), IRR(0), S(0), physical, D(00), M(0) pin17, disabled, edge , high, V(00), IRR(0), S(0), physical, D(00), M(0) IRQ to pin mappings: IRQ0 -> 0:2 IRQ1 -> 0:1 IRQ3 -> 0:3 IRQ4 -> 0:4 IRQ6 -> 0:6 IRQ7 -> 0:7 IRQ8 -> 0:8 IRQ9 -> 0:9 IRQ10 -> 0:10 IRQ11 -> 0:11 IRQ12 -> 0:12 IRQ13 -> 0:13 IRQ14 -> 0:14 IRQ15 -> 0:15 
.................................... done. rtc_cmos rtc_cmos: setting system clock to 2015-10-20 08:57:55 UTC (1445331475) input: AT Translated Set 2 keyboard as /devices/platform/i8042/serio0/input/input0 Freeing unused kernel memory: 936K (ffffffff818b6000 - ffffffff819a0000) Write protecting the kernel read-only data: 8192k Freeing unused kernel memory: 1900K (ffff880001425000 - ffff880001600000) Freeing unused kernel memory: 60K (ffff8800017f1000 - ffff880001800000) busybox (117) used greatest stack depth: 14480 bytes left exe (124) used greatest stack depth: 14024 bytes left udevd[140]: starting version 175 blkid (151) used greatest stack depth: 13920 bytes left modprobe (242) used greatest stack depth: 13784 bytes left clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x2e200418439, max_idle_ns: 440795220848 ns clocksource: Switched to clocksource tsc EXT4-fs (vda1): recovery complete EXT4-fs (vda1): mounted filesystem with ordered data mode. Opts: (null) exe (262) used greatest stack depth: 13032 bytes left random: init urandom read with 9 bits of entropy available init: plymouth-upstart-bridge main process (279) terminated with status 1 init: plymouth-upstart-bridge main process ended, respawning init: plymouth-upstart-bridge main process (289) terminated with status 1 init: plymouth-upstart-bridge main process ended, respawning init: plymouth-upstart-bridge main process (293) terminated with status 1 init: plymouth-upstart-bridge main process ended, respawning init: ureadahead main process (282) terminated with status 5 Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS systemd-udevd[423]: starting version 204 EXT4-fs (vdb): mounted filesystem with ordered data mode. 
Opts: errors=remount-ro * Stopping Send an event to indicate plymouth is up [ OK ] * Starting Mount filesystems on boot [ OK ] * Starting Signal sysvinit that the rootfs is mounted [ OK ] * Starting Populate /dev filesystem [ OK ] * Starting Populate and link to /run filesystem [ OK ] * Stopping Populate /dev filesystem [ OK ] * Stopping Populate and link to /run filesystem [ OK ] * Starting Clean /tmp directory [ OK ] * Stopping Track if upstart is running in a container [ OK ] * Stopping Clean /tmp directory [ OK ] * Starting Initialize or finalize resolvconf [ OK ] * Starting set console keymap [ OK ] * Starting Signal sysvinit that virtual filesystems are mounted [ OK ] * Starting Signal sysvinit that virtual filesystems are mounted [ OK ] * Starting Bridge udev events into upstart [ OK ] * Starting Signal sysvinit that remote filesystems are mounted [ OK ] * Stopping set console keymap [ OK ] * Starting device node and kernel event manager [ OK ] * Starting load modules from /etc/modules [ OK ] * Starting cold plug devices [ OK ] * Starting log initial device creation [ OK ] * Stopping Read required files in advance (for other mountpoints) [ OK ] * Stopping load modules from /etc/modules [ OK ] * Starting Signal sysvinit that local filesystems are mounted [ OK ] * Starting flush early job output to logs [ OK ] * Stopping Mount filesystems on boot [ OK ] * Stopping flush early job output to logs [ OK ] * Starting D-Bus system message bus [ OK ] * Starting SystemD login management service [ OK ] * Starting system logging daemon [ OK ] * Stopping cold plug devices [ OK ] * Starting Uncomplicated firewall [ OK ] * Starting configure network device security [ OK ] * Stopping log initial device creation [ OK ] * Starting configure network device security [ OK ] * Starting save udev log and update rules [ OK ] * Starting set console font [ OK ] * Stopping save udev log and update rules [ OK ] * Starting Mount network filesystems [ OK ] * Starting Failsafe Boot Delay [ OK ] * Starting configure network device security [ OK ] * Stopping Mount network filesystems [ OK ] * Starting configure network device [ OK ] * Starting configure network device [ OK ] * Starting Bridge file events into upstart [ OK ] * Starting Bridge socket events into upstart [ OK ] * Stopping set console font [ OK ] * Starting userspace bootsplash [ OK ] * Starting Send an event to indicate plymouth is up [ OK ] * Stopping userspace bootsplash [ OK ] * Stopping Send an event to indicate plymouth is up [ OK ] * Starting Mount network filesystems [ OK ] init: failsafe main process (591) killed by TERM signal * Stopping Failsafe Boot Delay [ OK ] * Starting System V initialisation compatibility [ OK ] * Stopping Mount network filesystems [ OK ] * Starting configure virtual network devices [ OK ] * Stopping System V initialisation compatibility [ OK ] * Starting System V runlevel compatibility [ OK ] * Starting deferred execution scheduler [ OK ] * Starting regular background program processing daemon [ OK ] * Starting ACPI daemon [ OK ] * Starting save kernel messages [ OK ] * Starting CPU interrupts balancing daemon [ OK ] * Stopping save kernel messages [ OK ] * Starting OpenSSH server [ OK ] * Starting automatic crash report generation [ OK ] * Restoring resolver state...
[ OK ] eth0 Link encap:Ethernet HWaddr 52:54:79:12:34:57 inet addr:192.168.0.21 Bcast:192.168.0.255 Mask:255.255.255.0 UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:34 errors:0 dropped:24 overruns:0 frame:0 TX packets:4 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:5780 (5.7 KB) TX bytes:800 (800.0 B) lo Link encap:Local Loopback inet addr:127.0.0.1 Mask:255.0.0.0 UP LOOPBACK RUNNING MTU:65536 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) * Stopping System V runlevel compatibility [ OK ] init: plymouth-upstart-bridge main process ended, respawning sh (1429) used greatest stack depth: 11752 bytes left sh (1454) used greatest stack depth: 11528 bytes left random: nonblocking pool is initialized Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS sh (2785) used greatest stack depth: 11480 bytes left Adding 4191228k swap on /dev/vda5.
Priority:-1 extents:1 across:4191228k FS
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffff810782a9>] down_read_trylock+0x9/0x30
PGD 0
Oops: 0000 [#1] SMP
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 1 PID: 26445 Comm: sh Not tainted 4.3.0-rc5-mm1-diet-meta+ #1545
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff8800b9af3480 ti: ffff88007fea0000 task.ti: ffff88007fea0000
RIP: 0010:[<ffffffff810782a9>]  [<ffffffff810782a9>] down_read_trylock+0x9/0x30
RSP: 0018:ffff88007fea3648  EFLAGS: 00010202
RAX: 0000000000000001 RBX: ffffea0002324900 RCX: ffff88007fea37e8
RDX: 0000000000000000 RSI: ffff88007fea36e8 RDI: 0000000000000008
RBP: ffff88007fea3648 R08: ffffffff818446a0 R09: ffff8800b9af4c80
R10: 0000000000000216 R11: 0000000000000001 R12: ffff88007f58d6e1
R13: ffff88007f58d6e0 R14: 0000000000000008 R15: 0000000000000001
FS:  00007f0993e78740(0000) GS:ffff8800bfa20000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 000000007edee000 CR4: 00000000000006a0
Stack:
 ffff88007fea3678 ffffffff81124ff0 ffffea0002324900 ffff88007fea36e8
 ffff88009ffe8400 0000000000000000 ffff88007fea36c0 ffffffff81125733
 ffff8800bfa34540 ffffffff8105dc9d ffffea0002324900 ffff88007fea37e8
Call Trace:
 [<ffffffff81124ff0>] page_lock_anon_vma_read+0x60/0x180
 [<ffffffff81125733>] rmap_walk+0x1b3/0x3f0
 [<ffffffff8105dc9d>] ? finish_task_switch+0x5d/0x1f0
 [<ffffffff81125b13>] page_referenced+0x1a3/0x220
 [<ffffffff81123e30>] ? __page_check_address+0x1a0/0x1a0
 [<ffffffff81124f90>] ? page_get_anon_vma+0xd0/0xd0
 [<ffffffff81123820>] ? anon_vma_ctor+0x40/0x40
 [<ffffffff8110087b>] shrink_page_list+0x5ab/0xde0
 [<ffffffff8110174c>] shrink_inactive_list+0x18c/0x4b0
 [<ffffffff811023bd>] shrink_lruvec+0x59d/0x740
 [<ffffffff811025f0>] shrink_zone+0x90/0x250
 [<ffffffff811028dd>] do_try_to_free_pages+0x12d/0x3b0
 [<ffffffff81102d3d>] try_to_free_mem_cgroup_pages+0x9d/0x120
 [<ffffffff811496c3>] try_charge+0x163/0x700
 [<ffffffff81149cb4>] mem_cgroup_do_precharge+0x54/0x70
 [<ffffffff81149e45>] mem_cgroup_can_attach+0x175/0x1b0
 [<ffffffff811b2c57>] ? kernfs_iattrs.isra.6+0x37/0xd0
 [<ffffffff81148e70>] ? get_mctgt_type+0x320/0x320
 [<ffffffff810a9d29>] cgroup_migrate+0x149/0x440
 [<ffffffff810aa60c>] cgroup_attach_task+0x7c/0xe0
 [<ffffffff810aa904>] __cgroup_procs_write.isra.33+0x1d4/0x2b0
 [<ffffffff810aaa10>] cgroup_tasks_write+0x10/0x20
 [<ffffffff810a6238>] cgroup_file_write+0x38/0xf0
 [<ffffffff811b54ad>] kernfs_fop_write+0x11d/0x170
 [<ffffffff81153918>] __vfs_write+0x28/0xe0
 [<ffffffff8116e614>] ? __fd_install+0x24/0xc0
 [<ffffffff810784a1>] ? percpu_down_read+0x21/0x50
 [<ffffffff81153e91>] vfs_write+0xa1/0x170
 [<ffffffff81154716>] SyS_write+0x46/0xa0
 [<ffffffff81420a17>] entry_SYSCALL_64_fastpath+0x12/0x6a
Code: 5e 82 3a 00 48 83 c4 08 5b 5d c3 48 89 45 f0 e8 9b 6a 3a 00 48 8b 45 f0 eb df 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 <48> 8b 07 48 89 c2 48 83 c2 01 7e 07 f0 48 0f b1 17 75 f0 48 f7
RIP  [<ffffffff810782a9>] down_read_trylock+0x9/0x30
 RSP <ffff88007fea3648>
CR2: 0000000000000008
---[ end trace e81a82c8122b447d ]---
Kernel panic - not syncing: Fatal exception
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffff810782a9>] down_read_trylock+0x9/0x30
PGD 0
Oops: 0000 [#2] SMP
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 10 PID: 59 Comm: khugepaged Tainted: G      D 4.3.0-rc5-mm1-diet-meta+ #1545
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff8800b9851a40 ti: ffff8800b985c000 task.ti: ffff8800b985c000
RIP: 0010:[<ffffffff810782a9>]  [<ffffffff810782a9>] down_read_trylock+0x9/0x30
RSP: 0018:ffff8800b985f778  EFLAGS: 00010202
RAX: 0000000000000001 RBX: ffffea0002321800 RCX: ffff8800b985f918
RDX: 0000000000000000 RSI: ffff8800b985f818 RDI: 0000000000000008
RBP: ffff8800b985f778 R08: ffffffff818446a0 R09: ffff8800b9853240
R10: 000000000000ba03 R11: 0000000000000001 R12: ffff88007f58d6e1
R13: ffff88007f58d6e0 R14: 0000000000000008 R15: 0000000000000001
FS:  0000000000000000(0000) GS:ffff8800bfb40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 0000000001808000 CR4: 00000000000006a0
Stack:
 ffff8800b985f7a8 ffffffff81124ff0 ffffea0002321800 ffff8800b985f818
 ffff88009ffe8400 0000000000000000 ffff8800b985f7f0 ffffffff81125733
 ffff8800bfb54540 ffffffff8105dc9d ffffea0002321800 ffff8800b985f918
Call Trace:
 [<ffffffff81124ff0>] page_lock_anon_vma_read+0x60/0x180
 [<ffffffff81125733>] rmap_walk+0x1b3/0x3f0
 [<ffffffff8105dc9d>] ? finish_task_switch+0x5d/0x1f0
 [<ffffffff81125b13>] page_referenced+0x1a3/0x220
 [<ffffffff81123e30>] ? __page_check_address+0x1a0/0x1a0
 [<ffffffff81124f90>] ? page_get_anon_vma+0xd0/0xd0
 [<ffffffff81123820>] ? anon_vma_ctor+0x40/0x40
 [<ffffffff8110087b>] shrink_page_list+0x5ab/0xde0
 [<ffffffff8110174c>] shrink_inactive_list+0x18c/0x4b0
 [<ffffffff811023bd>] shrink_lruvec+0x59d/0x740
 [<ffffffff811025f0>] shrink_zone+0x90/0x250
 [<ffffffff811028dd>] do_try_to_free_pages+0x12d/0x3b0
 [<ffffffff81102d3d>] try_to_free_mem_cgroup_pages+0x9d/0x120
 [<ffffffff811496c3>] try_charge+0x163/0x700
 [<ffffffff8141d1f3>] ? schedule+0x33/0x80
 [<ffffffff8114d45f>] mem_cgroup_try_charge+0x9f/0x1d0
 [<ffffffff811434bc>] khugepaged+0x7cc/0x1ac0
 [<ffffffff81066e01>] ? hrtick_update+0x1/0x70
 [<ffffffff81072430>] ? prepare_to_wait_event+0xf0/0xf0
 [<ffffffff81142cf0>] ? total_mapcount+0x70/0x70
 [<ffffffff81056cd9>] kthread+0xc9/0xe0
 [<ffffffff81056c10>] ? kthread_park+0x60/0x60
 [<ffffffff81420d6f>] ret_from_fork+0x3f/0x70
 [<ffffffff81056c10>] ? kthread_park+0x60/0x60
Code: 5e 82 3a 00 48 83 c4 08 5b 5d c3 48 89 45 f0 e8 9b 6a 3a 00 48 8b 45 f0 eb df 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 <48> 8b 07 48 89 c2 48 83 c2 01 7e 07 f0 48 0f b1 17 75 f0 48 f7
RIP  [<ffffffff810782a9>] down_read_trylock+0x9/0x30
 RSP <ffff8800b985f778>
CR2: 0000000000000008
---[ end trace e81a82c8122b447e ]---
Shutting down cpus with NMI
Dumping ftrace buffer:
   (ftrace buffer empty)
Kernel Offset: disabled

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/5] MADV_FREE refactoring and fix KSM page 2015-10-19 10:01 ` [PATCH 0/5] MADV_FREE refactoring and fix KSM page Minchan Kim 2015-10-20 1:38 ` Minchan Kim @ 2015-10-20 7:21 ` Minchan Kim 2015-10-20 7:27 ` Minchan Kim 2015-10-20 21:36 ` Andrew Morton 1 sibling, 2 replies; 26+ messages in thread From: Minchan Kim @ 2015-10-20 7:21 UTC (permalink / raw) To: Andrew Morton, Kirill A. Shutemov Cc: linux-mm, linux-kernel, Hugh Dickins, Rik van Riel, Mel Gorman, Michal Hocko, Johannes Weiner, Vlastimil Babka On Mon, Oct 19, 2015 at 07:01:50PM +0900, Minchan Kim wrote: > On Mon, Oct 19, 2015 at 03:31:42PM +0900, Minchan Kim wrote: > > Hello, it's too late since I sent previos patch. > > https://lkml.org/lkml/2015/6/3/37 > > > > This patch is alomost new compared to previos approach. > > I think this is more simple, clear and easy to review. > > > > One thing I should notice is that I have tested this patch > > and couldn't find any critical problem so I rebased patchset > > onto recent mmotm(ie, mmotm-2015-10-15-15-20) to send formal > > patchset. Unfortunately, I start to see sudden discarding of > > the page we shouldn't do. IOW, application's valid anonymous page > > was disappeared suddenly. > > > > When I look through THP changes, I think we could lose > > dirty bit of pte between freeze_page and unfreeze_page > > when we mark it as migration entry and restore it. > > So, I added below simple code without enough considering > > and cannot see the problem any more. > > I hope it's good hint to find right fix this problem. > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > index d5ea516ffb54..e881c04f5950 100644 > > --- a/mm/huge_memory.c > > +++ b/mm/huge_memory.c > > @@ -3138,6 +3138,9 @@ static void unfreeze_page_vma(struct vm_area_struct *vma, struct page *page, > > if (is_write_migration_entry(swp_entry)) > > entry = maybe_mkwrite(entry, vma); > > > > + if (PageDirty(page)) > > + SetPageDirty(page); > > The condition of PageDirty was typo. 
> I didn't add the condition.
> Just added.
>
> 	SetPageDirty(page);

I reviewed the THP refcount redesign patch, and it seems the patch below fixes the MADV_FREE problem. It has been working well for hours.

>From 104a0940b4c0f97e61de9fee0fd602926ff28312 Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@kernel.org>
Date: Tue, 20 Oct 2015 16:00:52 +0900
Subject: [PATCH] mm: mark head page dirty in split_huge_page

In the THP split path under the old THP refcount scheme, we mapped
all pages (ie, head + tails) with pte_mkdirty and set PG_dirty on
every tail page.

But with the THP refcount redesign, we can lose the dirty bit in the
page table and PG_dirty on the head page when we freeze the THP page
using a migration entry.

This ends up with madvise_free suddenly discarding the head page.
Fix it by marking the head page PG_dirty when the VM splits the
THP page.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/huge_memory.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index adccfb48ce57..7fbbd42554a1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3258,6 +3258,7 @@ static void __split_huge_page(struct page *page, struct list_head *list)
 	atomic_sub(tail_mapcount, &head->_count);
 
 	ClearPageCompound(head);
+	SetPageDirty(head);
 	spin_unlock_irq(&zone->lru_lock);
 
 	unfreeze_page(page_anon_vma(head), head);
-- 
1.9.1

^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH 0/5] MADV_FREE refactoring and fix KSM page 2015-10-20 7:21 ` Minchan Kim @ 2015-10-20 7:27 ` Minchan Kim 2015-10-20 21:36 ` Andrew Morton 1 sibling, 0 replies; 26+ messages in thread From: Minchan Kim @ 2015-10-20 7:27 UTC (permalink / raw) To: Andrew Morton, Kirill A. Shutemov Cc: linux-mm, linux-kernel, Hugh Dickins, Rik van Riel, Mel Gorman, Michal Hocko, Johannes Weiner, Vlastimil Babka On Tue, Oct 20, 2015 at 04:21:09PM +0900, Minchan Kim wrote: > On Mon, Oct 19, 2015 at 07:01:50PM +0900, Minchan Kim wrote: > > On Mon, Oct 19, 2015 at 03:31:42PM +0900, Minchan Kim wrote: > > > Hello, it's too late since I sent previos patch. > > > https://lkml.org/lkml/2015/6/3/37 > > > > > > This patch is alomost new compared to previos approach. > > > I think this is more simple, clear and easy to review. > > > > > > One thing I should notice is that I have tested this patch > > > and couldn't find any critical problem so I rebased patchset > > > onto recent mmotm(ie, mmotm-2015-10-15-15-20) to send formal > > > patchset. Unfortunately, I start to see sudden discarding of > > > the page we shouldn't do. IOW, application's valid anonymous page > > > was disappeared suddenly. > > > > > > When I look through THP changes, I think we could lose > > > dirty bit of pte between freeze_page and unfreeze_page > > > when we mark it as migration entry and restore it. > > > So, I added below simple code without enough considering > > > and cannot see the problem any more. > > > I hope it's good hint to find right fix this problem. 
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index d5ea516ffb54..e881c04f5950 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -3138,6 +3138,9 @@ static void unfreeze_page_vma(struct vm_area_struct *vma, struct page *page,
> > > 	if (is_write_migration_entry(swp_entry))
> > > 		entry = maybe_mkwrite(entry, vma);
> > >
> > > +	if (PageDirty(page))
> > > +		SetPageDirty(page);
> >
> > The condition of PageDirty was typo. I didn't add the condition.
> > Just added.
> >
> > 	SetPageDirty(page);
>
> I reviewed THP refcount redesign patch and It seems below patch fixes
> MADV_FREE problem. It works well for hours.
>
> From 104a0940b4c0f97e61de9fee0fd602926ff28312 Mon Sep 17 00:00:00 2001
> From: Minchan Kim <minchan@kernel.org>
> Date: Tue, 20 Oct 2015 16:00:52 +0900
> Subject: [PATCH] mm: mark head page dirty in split_huge_page
>
> In thp split in old THP refcount, we mappped all of pages
> (ie, head + tails) to pte_mkdirty and mark PG_flags to every
> tail pages.
>
> But with THP refcount redesign, we can lose dirty bit in page table
> and PG_dirty for head page if we want to free the THP page using

Typo: that should be "freeze".

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/5] MADV_FREE refactoring and fix KSM page 2015-10-20 7:21 ` Minchan Kim 2015-10-20 7:27 ` Minchan Kim @ 2015-10-20 21:36 ` Andrew Morton 2015-10-20 22:43 ` Kirill A. Shutemov 1 sibling, 1 reply; 26+ messages in thread From: Andrew Morton @ 2015-10-20 21:36 UTC (permalink / raw) To: Minchan Kim Cc: Kirill A. Shutemov, linux-mm, linux-kernel, Hugh Dickins, Rik van Riel, Mel Gorman, Michal Hocko, Johannes Weiner, Vlastimil Babka On Tue, 20 Oct 2015 16:21:09 +0900 Minchan Kim <minchan@kernel.org> wrote: > > I reviewed THP refcount redesign patch and It seems below patch fixes > MADV_FREE problem. It works well for hours. > > >From 104a0940b4c0f97e61de9fee0fd602926ff28312 Mon Sep 17 00:00:00 2001 > From: Minchan Kim <minchan@kernel.org> > Date: Tue, 20 Oct 2015 16:00:52 +0900 > Subject: [PATCH] mm: mark head page dirty in split_huge_page > > In thp split in old THP refcount, we mappped all of pages > (ie, head + tails) to pte_mkdirty and mark PG_flags to every > tail pages. > > But with THP refcount redesign, we can lose dirty bit in page table > and PG_dirty for head page if we want to free the THP page using > migration_entry. > > It ends up discarding head page by madvise_free suddenly. > This patch fixes it by mark the head page PG_dirty when VM splits > the THP page. > > Signed-off-by: Minchan Kim <minchan@kernel.org> > --- > mm/huge_memory.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index adccfb48ce57..7fbbd42554a1 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -3258,6 +3258,7 @@ static void __split_huge_page(struct page *page, struct list_head *list) > atomic_sub(tail_mapcount, &head->_count); > > ClearPageCompound(head); > + SetPageDirty(head); > spin_unlock_irq(&zone->lru_lock); > > unfreeze_page(page_anon_vma(head), head); This appears to be a bugfix against Kirill's "thp: reintroduce split_huge_page()"? 
Yes, __split_huge_page() is marking the tail pages dirty but forgot about the head page.

You say "we can lose dirty bit in page table" but I don't see how the above patch fixes that?

Why does __split_huge_page() unconditionally mark the pages dirty, btw? Is it because the THP page was known to be dirty? If so, the head page already had PG_dirty, so this patch doesn't do anything.

freeze_page(), unfreeze_page() and their callees desperately need some description of what they're doing. Kirill, could you cook something up please?

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/5] MADV_FREE refactoring and fix KSM page 2015-10-20 21:36 ` Andrew Morton @ 2015-10-20 22:43 ` Kirill A. Shutemov 2015-10-21 5:11 ` Minchan Kim 0 siblings, 1 reply; 26+ messages in thread From: Kirill A. Shutemov @ 2015-10-20 22:43 UTC (permalink / raw) To: Andrew Morton Cc: Minchan Kim, linux-mm, linux-kernel, Hugh Dickins, Rik van Riel, Mel Gorman, Michal Hocko, Johannes Weiner, Vlastimil Babka On Tue, Oct 20, 2015 at 02:36:51PM -0700, Andrew Morton wrote: > On Tue, 20 Oct 2015 16:21:09 +0900 Minchan Kim <minchan@kernel.org> wrote: > > > > > I reviewed THP refcount redesign patch and It seems below patch fixes > > MADV_FREE problem. It works well for hours. > > > > >From 104a0940b4c0f97e61de9fee0fd602926ff28312 Mon Sep 17 00:00:00 2001 > > From: Minchan Kim <minchan@kernel.org> > > Date: Tue, 20 Oct 2015 16:00:52 +0900 > > Subject: [PATCH] mm: mark head page dirty in split_huge_page > > > > In thp split in old THP refcount, we mappped all of pages > > (ie, head + tails) to pte_mkdirty and mark PG_flags to every > > tail pages. > > > > But with THP refcount redesign, we can lose dirty bit in page table > > and PG_dirty for head page if we want to free the THP page using > > migration_entry. > > > > It ends up discarding head page by madvise_free suddenly. > > This patch fixes it by mark the head page PG_dirty when VM splits > > the THP page. > > > > Signed-off-by: Minchan Kim <minchan@kernel.org> > > --- > > mm/huge_memory.c | 1 + > > 1 file changed, 1 insertion(+) > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > index adccfb48ce57..7fbbd42554a1 100644 > > --- a/mm/huge_memory.c > > +++ b/mm/huge_memory.c > > @@ -3258,6 +3258,7 @@ static void __split_huge_page(struct page *page, struct list_head *list) > > atomic_sub(tail_mapcount, &head->_count); > > > > ClearPageCompound(head); > > + SetPageDirty(head); > > spin_unlock_irq(&zone->lru_lock); > > > > unfreeze_page(page_anon_vma(head), head); Sorry, I've missed the email at first. 
> This appears to be a bugfix against Kirill's "thp: reintroduce > split_huge_page()"? > > Yes, __split_huge_page() is marking the tail pages dirty but forgot > about the head page > > You say "we can lose dirty bit in page table" but I don't see how the > above patch fixes that? I think the problem is in unfreeze_page_vma(), where I missed dirtying pte. > Why does __split_huge_page() unconditionally mark the pages dirty, btw? > Is it because the THP page was known to be dirty? THP doesn't have backing storage and cannot be swapped out without splitting, therefore always dirty. (huge zero page is exception, I guess). > If so, the head page already had PG_dirty, so this patch doesn't do > anything. PG_dirty appears on struct page as result of transferring from dirty bit in page tables. There's no guarantee that it's happened. > freeze_page(), unfreeze_page() and their callees desperately need some > description of what they're doing. Kirill, could you cook somethnig up > please? Minchan, could you test patch below instead? diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 86924cc34bac..ea1f3805afa3 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -3115,7 +3115,7 @@ static void unfreeze_page_vma(struct vm_area_struct *vma, struct page *page, entry = pte_mkold(mk_pte(page, vma->vm_page_prot)); if (is_write_migration_entry(swp_entry)) - entry = maybe_mkwrite(entry, vma); + entry = maybe_mkwrite(pte_mkdirty(entry), vma); flush_dcache_page(page); set_pte_at(vma->vm_mm, address, pte + i, entry); -- Kirill A. Shutemov ^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH 0/5] MADV_FREE refactoring and fix KSM page 2015-10-20 22:43 ` Kirill A. Shutemov @ 2015-10-21 5:11 ` Minchan Kim 2015-10-21 7:50 ` Kirill A. Shutemov 0 siblings, 1 reply; 26+ messages in thread From: Minchan Kim @ 2015-10-21 5:11 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Andrew Morton, linux-mm, linux-kernel, Hugh Dickins, Rik van Riel, Mel Gorman, Michal Hocko, Johannes Weiner, Vlastimil Babka On Wed, Oct 21, 2015 at 01:43:53AM +0300, Kirill A. Shutemov wrote: > On Tue, Oct 20, 2015 at 02:36:51PM -0700, Andrew Morton wrote: > > On Tue, 20 Oct 2015 16:21:09 +0900 Minchan Kim <minchan@kernel.org> wrote: > > > > > > > > I reviewed THP refcount redesign patch and It seems below patch fixes > > > MADV_FREE problem. It works well for hours. > > > > > > >From 104a0940b4c0f97e61de9fee0fd602926ff28312 Mon Sep 17 00:00:00 2001 > > > From: Minchan Kim <minchan@kernel.org> > > > Date: Tue, 20 Oct 2015 16:00:52 +0900 > > > Subject: [PATCH] mm: mark head page dirty in split_huge_page > > > > > > In thp split in old THP refcount, we mappped all of pages > > > (ie, head + tails) to pte_mkdirty and mark PG_flags to every > > > tail pages. > > > > > > But with THP refcount redesign, we can lose dirty bit in page table > > > and PG_dirty for head page if we want to free the THP page using > > > migration_entry. > > > > > > It ends up discarding head page by madvise_free suddenly. > > > This patch fixes it by mark the head page PG_dirty when VM splits > > > the THP page. 
> > > > > > Signed-off-by: Minchan Kim <minchan@kernel.org> > > > --- > > > mm/huge_memory.c | 1 + > > > 1 file changed, 1 insertion(+) > > > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > > index adccfb48ce57..7fbbd42554a1 100644 > > > --- a/mm/huge_memory.c > > > +++ b/mm/huge_memory.c > > > @@ -3258,6 +3258,7 @@ static void __split_huge_page(struct page *page, struct list_head *list) > > > atomic_sub(tail_mapcount, &head->_count); > > > > > > ClearPageCompound(head); > > > + SetPageDirty(head); > > > spin_unlock_irq(&zone->lru_lock); > > > > > > unfreeze_page(page_anon_vma(head), head); > > Sorry, I've missed the email at first. > > > This appears to be a bugfix against Kirill's "thp: reintroduce > > split_huge_page()"? > > > > Yes, __split_huge_page() is marking the tail pages dirty but forgot > > about the head page > > > > You say "we can lose dirty bit in page table" but I don't see how the > > above patch fixes that? > > I think the problem is in unfreeze_page_vma(), where I missed dirtying > pte. > > > Why does __split_huge_page() unconditionally mark the pages dirty, btw? > > Is it because the THP page was known to be dirty? > > THP doesn't have backing storage and cannot be swapped out without > splitting, therefore always dirty. (huge zero page is exception, I guess). It's right until now but I think we need more(e.g. is_dirty_migration_entry, make_migration_entry(struct page *page, int write, int dirty) in terms of MADV_FREE to keep dirty bit of pte rather than making pages dirty unconditionally. For example, we could call madvise_free to THP page so madvise_free clears dirty bit of pmd without split THP pages(ie, lazy split, maybe you suggest it, thanks!) instantly. Then, when VM tries to reclaim the THP page and splits it, every page will be marked PG_dirty or pte_mkdirty even if there is no write ever since then so madvise_free can never discard it although we could. Anyway it shouldn't be party-pooper. 
It could be enhanced and I will check it.

> > If so, the head page already had PG_dirty, so this patch doesn't do
> > anything.
>
> PG_dirty appears on struct page as result of transferring from dirty bit
> in page tables. There's no guarantee that it's happened.
>
> > freeze_page(), unfreeze_page() and their callees desperately need some
> > description of what they're doing. Kirill, could you cook something up
> > please?
>
> Minchan, could you test patch below instead?

I think it will definitely work and is a more correct fix than mine because it also covers split_huge_page_to_list()'s error path (ie,

	unfreeze_page(anon_vma, head);
	ret = -EBUSY;
}

I will queue it to the test machine.

..
Zzzz
..

After 2 hours, I don't see any problem so far, but I have a question below.

>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 86924cc34bac..ea1f3805afa3 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3115,7 +3115,7 @@ static void unfreeze_page_vma(struct vm_area_struct *vma, struct page *page,
>
> 	entry = pte_mkold(mk_pte(page, vma->vm_page_prot));
> 	if (is_write_migration_entry(swp_entry))
> -		entry = maybe_mkwrite(entry, vma);
> +		entry = maybe_mkwrite(pte_mkdirty(entry), vma);

Why should we do pte_mkdirty only if is_write_migration_entry() is true? Doesn't it lose the dirty bit again if someone changes protection from RW to R?

>
> 	flush_dcache_page(page);
> 	set_pte_at(vma->vm_mm, address, pte + i, entry);
> --
> Kirill A. Shutemov

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/5] MADV_FREE refactoring and fix KSM page 2015-10-21 5:11 ` Minchan Kim @ 2015-10-21 7:50 ` Kirill A. Shutemov 0 siblings, 0 replies; 26+ messages in thread From: Kirill A. Shutemov @ 2015-10-21 7:50 UTC (permalink / raw) To: Minchan Kim Cc: Andrew Morton, linux-mm, linux-kernel, Hugh Dickins, Rik van Riel, Mel Gorman, Michal Hocko, Johannes Weiner, Vlastimil Babka On Wed, Oct 21, 2015 at 02:11:39PM +0900, Minchan Kim wrote: > On Wed, Oct 21, 2015 at 01:43:53AM +0300, Kirill A. Shutemov wrote: > > On Tue, Oct 20, 2015 at 02:36:51PM -0700, Andrew Morton wrote: > > > On Tue, 20 Oct 2015 16:21:09 +0900 Minchan Kim <minchan@kernel.org> wrote: > > > > > > > > > > > I reviewed THP refcount redesign patch and It seems below patch fixes > > > > MADV_FREE problem. It works well for hours. > > > > > > > > >From 104a0940b4c0f97e61de9fee0fd602926ff28312 Mon Sep 17 00:00:00 2001 > > > > From: Minchan Kim <minchan@kernel.org> > > > > Date: Tue, 20 Oct 2015 16:00:52 +0900 > > > > Subject: [PATCH] mm: mark head page dirty in split_huge_page > > > > > > > > In thp split in old THP refcount, we mappped all of pages > > > > (ie, head + tails) to pte_mkdirty and mark PG_flags to every > > > > tail pages. > > > > > > > > But with THP refcount redesign, we can lose dirty bit in page table > > > > and PG_dirty for head page if we want to free the THP page using > > > > migration_entry. > > > > > > > > It ends up discarding head page by madvise_free suddenly. > > > > This patch fixes it by mark the head page PG_dirty when VM splits > > > > the THP page. 
> > > > > > > > Signed-off-by: Minchan Kim <minchan@kernel.org> > > > > --- > > > > mm/huge_memory.c | 1 + > > > > 1 file changed, 1 insertion(+) > > > > > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > > > index adccfb48ce57..7fbbd42554a1 100644 > > > > --- a/mm/huge_memory.c > > > > +++ b/mm/huge_memory.c > > > > @@ -3258,6 +3258,7 @@ static void __split_huge_page(struct page *page, struct list_head *list) > > > > atomic_sub(tail_mapcount, &head->_count); > > > > > > > > ClearPageCompound(head); > > > > + SetPageDirty(head); > > > > spin_unlock_irq(&zone->lru_lock); > > > > > > > > unfreeze_page(page_anon_vma(head), head); > > > > Sorry, I've missed the email at first. > > > > > This appears to be a bugfix against Kirill's "thp: reintroduce > > > split_huge_page()"? > > > > > > Yes, __split_huge_page() is marking the tail pages dirty but forgot > > > about the head page > > > > > > You say "we can lose dirty bit in page table" but I don't see how the > > > above patch fixes that? > > > > I think the problem is in unfreeze_page_vma(), where I missed dirtying > > pte. > > > > > Why does __split_huge_page() unconditionally mark the pages dirty, btw? > > > Is it because the THP page was known to be dirty? > > > > THP doesn't have backing storage and cannot be swapped out without > > splitting, therefore always dirty. (huge zero page is exception, I guess). > > It's right until now but I think we need more(e.g. is_dirty_migration_entry, > make_migration_entry(struct page *page, int write, int dirty) in terms of > MADV_FREE to keep dirty bit of pte rather than making pages dirty > unconditionally. That means you need to find one more bit in swap entries. I'm not sure it's possible on all architectures. > > For example, we could call madvise_free to THP page so madvise_free clears > dirty bit of pmd without split THP pages(ie, lazy split, maybe you suggest > it, thanks!) instantly. 
> Then, when VM tries to reclaim the THP page and
> splits it, every page will be marked PG_dirty or pte_mkdirty even if
> there is no write ever since then so madvise_free can never discard it
> although we could.
>
> Anyway it shouldn't be party-pooper. It could be enhanced and I will check
> it.
>
> > > If so, the head page already had PG_dirty, so this patch doesn't do
> > > anything.
> >
> > PG_dirty appears on struct page as result of transferring from dirty bit
> > in page tables. There's no guarantee that it's happened.
> >
> > > freeze_page(), unfreeze_page() and their callees desperately need some
> > > description of what they're doing. Kirill, could you cook something up
> > > please?
> >
> > Minchan, could you test patch below instead?
>
> I think it will definitely work and is a more correct fix than mine because
> it also covers split_huge_page_to_list()'s error path (ie,
>
> 	unfreeze_page(anon_vma, head);
> 	ret = -EBUSY;
> }
>
> I will queue it to the test machine.
>
> ..
> Zzzz
> ..
>
> After 2 hours, I don't see any problem so far, but I have a question below.
>
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 86924cc34bac..ea1f3805afa3 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -3115,7 +3115,7 @@ static void unfreeze_page_vma(struct vm_area_struct *vma, struct page *page,
> >
> > 	entry = pte_mkold(mk_pte(page, vma->vm_page_prot));
> > 	if (is_write_migration_entry(swp_entry))
> > -		entry = maybe_mkwrite(entry, vma);
> > +		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>
> Why should we do pte_mkdirty only if is_write_migration_entry() is true?
> Doesn't it lose the dirty bit again if someone changes protection
> from RW to R?

2 a.m. is not an ideal time for patches. You are right. It needs to be unconditional.

Andrew, could you fold the patch below into "thp: reintroduce split_huge_page()" instead of the patch from Minchan?
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 86924cc34bac..f297baf8e793 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -3114,6 +3114,7 @@ static void unfreeze_page_vma(struct vm_area_struct *vma, struct page *page, continue; entry = pte_mkold(mk_pte(page, vma->vm_page_prot)); + entry = pte_mkdirty(entry); if (is_write_migration_entry(swp_entry)) entry = maybe_mkwrite(entry, vma); -- Kirill A. Shutemov ^ permalink raw reply related [flat|nested] 26+ messages in thread
end of thread, other threads:[~2015-10-28 4:04 UTC | newest] Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2015-10-19 6:31 [PATCH 0/5] MADV_FREE refactoring and fix KSM page Minchan Kim 2015-10-19 6:31 ` [PATCH 1/5] mm: MADV_FREE trivial clean up Minchan Kim 2015-10-19 6:31 ` [PATCH 2/5] mm: skip huge zero page in MADV_FREE Minchan Kim 2015-10-19 6:31 ` [PATCH 3/5] mm: clear PG_dirty to mark page freeable Minchan Kim 2015-10-27 1:28 ` Hugh Dickins 2015-10-27 6:50 ` Minchan Kim 2015-10-19 6:31 ` [PATCH 4/5] mm: simplify reclaim path for MADV_FREE Minchan Kim 2015-10-27 2:09 ` Hugh Dickins 2015-10-27 3:44 ` yalin wang 2015-10-27 7:09 ` Minchan Kim 2015-10-27 7:39 ` yalin wang 2015-10-27 8:10 ` Minchan Kim 2015-10-27 8:52 ` yalin wang 2015-10-28 4:03 ` yalin wang 2015-10-27 6:54 ` Minchan Kim 2015-10-19 6:31 ` [PATCH 5/5] mm: mark stable page dirty in KSM Minchan Kim 2015-10-27 2:23 ` Hugh Dickins 2015-10-27 6:58 ` Minchan Kim 2015-10-19 10:01 ` [PATCH 0/5] MADV_FREE refactoring and fix KSM page Minchan Kim 2015-10-20 1:38 ` Minchan Kim 2015-10-20 7:21 ` Minchan Kim 2015-10-20 7:27 ` Minchan Kim 2015-10-20 21:36 ` Andrew Morton 2015-10-20 22:43 ` Kirill A. Shutemov 2015-10-21 5:11 ` Minchan Kim 2015-10-21 7:50 ` Kirill A. Shutemov