From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id E67866B01F1 for ; Thu, 1 Apr 2010 20:45:20 -0400 (EDT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-Id: Date: Fri, 02 Apr 2010 02:41:27 +0200 From: Andrea Arcangeli Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, Andrew Morton Cc: Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: Hello, With a heavy forking and split_huge_page stress testcase, I found a slight problem probably made visible by the anon_vma_chain: during the anon_vma walk of __split_huge_page_splitting, page_check_address_pmd ran into a pmd that had the splitting bit set. The splitting bit was set by a previously forked process calling split_huge_page on its private page belonging to the child anon_vma. The parent still has visibility into the vma of the child, so the rmap walk of the parent covers the child too, but the split of the child page can happen in parallel now. This triggered a VM_BUG_ON false positive, and moving the page check above the splitting-bit check was enough to fix it. (It would not have been noticeable with CONFIG_DEBUG_VM=n.) All runs are flawless again with the debug turned on. @@ -1109,9 +1109,11 @@ new file mode 100644 + pmd = pmd_offset(pud, address); + if (pmd_none(*pmd)) + goto out; ++ if (pmd_page(*pmd) != page) ++ goto out; + VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG && + pmd_trans_splitting(*pmd)); -+ if (pmd_trans_huge(*pmd) && pmd_page(*pmd) == page) { ++ if (pmd_trans_huge(*pmd)) { + VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG && + !pmd_trans_splitting(*pmd)); + ret = pmd; Then there was one more issue while testing ksm and khugepaged co-existing and merging and collapsing pages on the same vma simultaneously (which works fine now in #17). One check for PageTransCompound was missing in ksm and another had to be converted from PageTransHuge to PageTransCompound. This also has the fixed version of the remove-PG_buddy patch, which moves the memory_hotplug bootmem typing code to use page->lru.next with a proper enum, to free up mapcount -2 for PG_buddy semantics. Not included by email, but available in the directory, is the latest version of the ksm-swapcache fix (awaiting a comment from Hugh before delivering it separately). http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.34-rc2-mm1/transparent_hugepage-17/ http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.34-rc2-mm1/transparent_hugepage-17.gz Thanks, Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
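To make the ordering fix above concrete, here is a toy userspace model (not kernel code; the types and names are simplified stand-ins) of why checking page identity before asserting on the splitting bit avoids the false positive when the parent's rmap walk crosses a child pmd that is being split in parallel:

    #include <assert.h>
    #include <stdio.h>

    /* Toy stand-ins for the kernel types; this only models the check order. */
    struct toy_page { int id; };
    struct toy_pmd  { struct toy_page *page; int splitting; };

    /* Fixed order: bail out on a foreign page before asserting on the
     * splitting bit, so a child pmd mid-split never trips the assertion. */
    static struct toy_pmd *check_pmd(struct toy_pmd *pmd, struct toy_page *page,
                                     int expect_not_splitting)
    {
        if (pmd->page != page)
            return NULL;              /* foreign pmd: not ours, skip it */
        if (expect_not_splitting)
            assert(!pmd->splitting);  /* only our own pmd must not be splitting */
        return pmd->splitting ? NULL : pmd;
    }

    int main(void)
    {
        struct toy_page parent_page = { 1 }, child_page = { 2 };
        /* the child already started split_huge_page on its private copy */
        struct toy_pmd child_pmd = { &child_page, 1 };

        /* the parent's rmap walk reaches the child's pmd: with the page
         * check first this is skipped instead of firing the VM_BUG_ON */
        if (!check_pmd(&child_pmd, &parent_page, 1))
            puts("foreign splitting pmd skipped, no false positive");
        return 0;
    }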
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id E5CED6B01F0 for ; Thu, 1 Apr 2010 20:45:20 -0400 (EDT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: [PATCH 20 of 41] add pmd_huge_pte to mm_struct Message-Id: In-Reply-To: References: Date: Fri, 02 Apr 2010 02:41:47 +0200 From: Andrea Arcangeli Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, Andrew Morton Cc: Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: From: Andrea Arcangeli This increases the size of the mm struct a bit, but it is needed to preallocate one pte page for each hugepage so that split_huge_page will not require a fail path. A guarantee of success is a fundamental property of split_huge_page: it avoids decreasing swapping reliability and avoids adding -ENOMEM fail paths that would otherwise force the hugepage-unaware VM code to learn to roll back in the middle of its pte mangling operations (if anything, we need it to learn to handle pmd_trans_huge natively rather than to become capable of rollback). When split_huge_page runs, a preallocated pte page is needed for the split to succeed, to map the newly split regular pages with regular ptes. This way all existing VM code remains backwards compatible by just adding a split_huge_page* one-liner. The memory waste of those preallocated ptes is negligible and so it is worth it. Signed-off-by: Andrea Arcangeli Acked-by: Rik van Riel --- diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -310,6 +310,9 @@ struct mm_struct { #ifdef CONFIG_MMU_NOTIFIER struct mmu_notifier_mm *mmu_notifier_mm; #endif +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + pgtable_t pmd_huge_pte; /* protected by page_table_lock */ +#endif }; /* Future-safe accessor for struct mm_struct's cpu_vm_mask. */ diff --git a/kernel/fork.c b/kernel/fork.c --- a/kernel/fork.c +++ b/kernel/fork.c @@ -522,6 +522,9 @@ void __mmdrop(struct mm_struct *mm) mm_free_pgd(mm); destroy_context(mm); mmu_notifier_mm_destroy(mm); +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + VM_BUG_ON(mm->pmd_huge_pte); +#endif free_mm(mm); } EXPORT_SYMBOL_GPL(__mmdrop); @@ -662,6 +665,10 @@ struct mm_struct *dup_mm(struct task_str mm->token_priority = 0; mm->last_interval = 0; +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + mm->pmd_huge_pte = NULL; +#endif + if (!mm_init(mm, tsk)) goto fail_nomem; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
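A toy userspace model of the preallocation contract described above (hypothetical names; the real code keeps the spare page table in mm->pmd_huge_pte under page_table_lock, and the real spare is consumed to populate the pmd rather than freed): allocate the spare pte page at hugepage-fault time, consume it at split time so the split cannot fail, and assert it is gone at teardown, mirroring the VM_BUG_ON in __mmdrop:

    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct toy_mm { void *pmd_huge_pte; /* spare pte page owned by the mm */ };

    /* Fault path: a hugepage mapping succeeds only if the spare pte page
     * for a future split can be preallocated up front. */
    static int map_hugepage(struct toy_mm *mm)
    {
        mm->pmd_huge_pte = malloc(4096);
        return mm->pmd_huge_pte ? 0 : -1;  /* -ENOMEM happens here, not at split */
    }

    /* Split path: guaranteed to succeed, it just consumes the spare. */
    static void split_hugepage(struct toy_mm *mm)
    {
        assert(mm->pmd_huge_pte);          /* the preallocation guarantees this */
        printf("split maps tail pages with pte page %p\n", mm->pmd_huge_pte);
        free(mm->pmd_huge_pte);
        mm->pmd_huge_pte = NULL;
    }

    static void mmdrop(struct toy_mm *mm)
    {
        assert(!mm->pmd_huge_pte);         /* mirrors VM_BUG_ON(mm->pmd_huge_pte) */
    }

    int main(void)
    {
        struct toy_mm mm = { NULL };
        if (map_hugepage(&mm) == 0)
            split_hugepage(&mm);
        mmdrop(&mm);
        return 0;
    }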
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with SMTP id 307E16B01F3 for ; Thu, 1 Apr 2010 20:45:21 -0400 (EDT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: [PATCH 35 of 41] skip transhuge pages in ksm for now Message-Id: <14f320d06189a8bba363.1270168922@v2.random> In-Reply-To: References: Date: Fri, 02 Apr 2010 02:42:02 +0200 From: Andrea Arcangeli Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, Andrew Morton Cc: Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: From: Andrea Arcangeli Skip transhuge pages in ksm for now. Signed-off-by: Andrea Arcangeli Reviewed-by: Rik van Riel --- diff --git a/mm/ksm.c b/mm/ksm.c --- a/mm/ksm.c +++ b/mm/ksm.c @@ -449,7 +449,7 @@ static struct page *get_mergeable_page(s page = follow_page(vma, addr, FOLL_GET); if (!page) goto out; - if (PageAnon(page)) { + if (PageAnon(page) && !PageTransCompound(page)) { flush_anon_page(vma, page, addr); flush_dcache_page(page); } else { @@ -1294,7 +1294,19 @@ next_mm: if (ksm_test_exit(mm)) break; *page = follow_page(vma, ksm_scan.address, FOLL_GET); - if (*page && PageAnon(*page)) { + if (!*page) { + ksm_scan.address += PAGE_SIZE; + cond_resched(); + continue; + } + if (PageTransCompound(*page)) { + put_page(*page); + ksm_scan.address &= HPAGE_PMD_MASK; + ksm_scan.address += HPAGE_PMD_SIZE; + cond_resched(); + continue; + } + if (PageAnon(*page)) { flush_anon_page(vma, *page, ksm_scan.address); flush_dcache_page(*page); rmap_item = get_next_rmap_item(slot, @@ -1308,8 +1320,7 @@ next_mm: up_read(&mm->mmap_sem); return rmap_item; } - if (*page) - put_page(*page); + put_page(*page); ksm_scan.address += PAGE_SIZE; cond_resched(); } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id 10BC26B01F2 for ; Thu, 1 Apr 2010 20:45:20 -0400 (EDT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: [PATCH 09 of 41] no paravirt version of pmd ops Message-Id: In-Reply-To: References: Date: Fri, 02 Apr 2010 02:41:36 +0200 From: Andrea Arcangeli Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, Andrew Morton Cc: Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: From: Andrea Arcangeli No paravirt version of set_pmd_at/pmd_update/pmd_update_defer. 
Signed-off-by: Andrea Arcangeli Acked-by: Rik van Riel Acked-by: Mel Gorman --- diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -33,6 +33,7 @@ extern struct list_head pgd_list; #else /* !CONFIG_PARAVIRT */ #define set_pte(ptep, pte) native_set_pte(ptep, pte) #define set_pte_at(mm, addr, ptep, pte) native_set_pte_at(mm, addr, ptep, pte) +#define set_pmd_at(mm, addr, pmdp, pmd) native_set_pmd_at(mm, addr, pmdp, pmd) #define set_pte_atomic(ptep, pte) \ native_set_pte_atomic(ptep, pte) @@ -57,6 +58,8 @@ extern struct list_head pgd_list; #define pte_update(mm, addr, ptep) do { } while (0) #define pte_update_defer(mm, addr, ptep) do { } while (0) +#define pmd_update(mm, addr, ptep) do { } while (0) +#define pmd_update_defer(mm, addr, ptep) do { } while (0) #define pgd_val(x) native_pgd_val(x) #define __pgd(x) native_make_pgd(x) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id 309AA6B01F4 for ; Thu, 1 Apr 2010 20:45:21 -0400 (EDT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: [PATCH 16 of 41] bail out gup_fast on splitting pmd Message-Id: <5da7deb4906c1d554029.1270168903@v2.random> In-Reply-To: References: Date: Fri, 02 Apr 2010 02:41:43 +0200 From: Andrea Arcangeli Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, Andrew Morton Cc: Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: From: Andrea Arcangeli Force gup_fast to take the slow path and block if the pmd is splitting, not only if it's none. Signed-off-by: Andrea Arcangeli Acked-by: Rik van Riel Acked-by: Mel Gorman --- diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c --- a/arch/x86/mm/gup.c +++ b/arch/x86/mm/gup.c @@ -160,7 +160,18 @@ static int gup_pmd_range(pud_t pud, unsi pmd_t pmd = *pmdp; next = pmd_addr_end(addr, end); - if (pmd_none(pmd)) + /* + * The pmd_trans_splitting() check below explains why + * pmdp_splitting_flush has to flush the tlb, to stop + * this gup-fast code from running while we set the + * splitting bit in the pmd. Returning zero will take + * the slow path that will call wait_split_huge_page() + * if the pmd is still in splitting state. gup-fast + * can't because it has irq disabled and + * wait_split_huge_page() would never return as the + * tlb flush IPI wouldn't run. + */ + if (pmd_none(pmd) || pmd_trans_splitting(pmd)) return 0; if (unlikely(pmd_large(pmd))) { if (!gup_huge_pmd(pmd, addr, next, write, pages, nr)) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
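The gup_pmd_range() hunk above reads the pmd once into a local (pmd_t pmd = *pmdp) and then tests the copy. A sketch of the same snapshot-then-test idiom in plain C (a userspace model with a made-up word layout, not the kernel's pmd encoding): because gup-fast runs with irqs disabled it may never wait, so "none or splitting" means return 0 and let the slow path call wait_split_huge_page():

    #include <stdatomic.h>
    #include <stdio.h>

    #define PMD_NONE      0x0ULL
    #define PMD_SPLITTING 0x2ULL   /* made-up flag bit for the model */

    static _Atomic unsigned long long pmdp;  /* shared, updated by another CPU */

    static int gup_fast_pmd(void)
    {
        /* Snapshot once; all tests run against the same value even if the
         * shared entry changes under us.  With irqs disabled we may not
         * sleep, so "splitting" means bail to the slow path, never wait. */
        unsigned long long pmd = atomic_load(&pmdp);

        if (pmd == PMD_NONE || (pmd & PMD_SPLITTING))
            return 0;               /* slow path will wait_split_huge_page() */
        return 1;                   /* safe to walk */
    }

    int main(void)
    {
        atomic_store(&pmdp, PMD_SPLITTING);
        printf("splitting pmd -> fast path usable: %d\n", gup_fast_pmd());
        atomic_store(&pmdp, 0x1000ULL);
        printf("stable pmd    -> fast path usable: %d\n", gup_fast_pmd());
        return 0;
    }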
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id 4FF466B01F5 for ; Thu, 1 Apr 2010 20:45:21 -0400 (EDT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: [PATCH 28 of 41] verify pmd_trans_huge isn't leaking Message-Id: <03a6148230050bd901c3.1270168915@v2.random> In-Reply-To: References: Date: Fri, 02 Apr 2010 02:41:55 +0200 From: Andrea Arcangeli Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, Andrew Morton Cc: Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: From: Andrea Arcangeli pmd_trans_huge must not leak into certain vmas, like mmio special pfn mappings or file-backed mappings. Signed-off-by: Andrea Arcangeli Acked-by: Rik van Riel --- diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -1421,6 +1421,7 @@ int __get_user_pages(struct task_struct pmd = pmd_offset(pud, pg); if (pmd_none(*pmd)) return i ? : -EFAULT; + VM_BUG_ON(pmd_trans_huge(*pmd)); pte = pte_offset_map(pmd, pg); if (pte_none(*pte)) { pte_unmap(pte); @@ -1622,8 +1623,10 @@ pte_t *get_locked_pte(struct mm_struct * pud_t * pud = pud_alloc(mm, pgd, addr); if (pud) { pmd_t * pmd = pmd_alloc(mm, pud, addr); - if (pmd) + if (pmd) { + VM_BUG_ON(pmd_trans_huge(*pmd)); return pte_alloc_map_lock(mm, pmd, addr, ptl); + } } return NULL; } @@ -1842,6 +1845,7 @@ static inline int remap_pmd_range(struct pmd = pmd_alloc(mm, pud, addr); if (!pmd) return -ENOMEM; + VM_BUG_ON(pmd_trans_huge(*pmd)); do { next = pmd_addr_end(addr, end); if (remap_pte_range(mm, pmd, addr, next, @@ -3317,6 +3321,7 @@ static int follow_pte(struct mm_struct * goto out; pmd = pmd_offset(pud, address); + VM_BUG_ON(pmd_trans_huge(*pmd)); if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd))) goto out; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id 504C56B01F6 for ; Thu, 1 Apr 2010 20:45:21 -0400 (EDT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: [PATCH 13 of 41] special pmd_trans_* functions Message-Id: <244f89f5c6dd248777a6.1270168900@v2.random> In-Reply-To: References: Date: Fri, 02 Apr 2010 02:41:40 +0200 From: Andrea Arcangeli Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, Andrew Morton Cc: Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S.
Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: From: Andrea Arcangeli These return 0 at compile time when the config option is disabled, allowing gcc to eliminate the transparent hugepage function calls at compile time without additional #ifdefs (only the prototypes of those functions have to be visible to gcc; they won't be required at link time, and huge_memory.o need not be built at all). _PAGE_BIT_UNUSED1 is never used on pmds, only on ptes. Signed-off-by: Andrea Arcangeli Acked-by: Rik van Riel --- diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h --- a/arch/x86/include/asm/pgtable_64.h +++ b/arch/x86/include/asm/pgtable_64.h @@ -168,6 +168,19 @@ extern void cleanup_highmap(void); #define kc_offset_to_vaddr(o) ((o) | ~__VIRTUAL_MASK) #define __HAVE_ARCH_PTE_SAME + +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +static inline int pmd_trans_splitting(pmd_t pmd) +{ + return pmd_val(pmd) & _PAGE_SPLITTING; +} + +static inline int pmd_trans_huge(pmd_t pmd) +{ + return pmd_val(pmd) & _PAGE_PSE; +} +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ + #endif /* !__ASSEMBLY__ */ #endif /* _ASM_X86_PGTABLE_64_H */ diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h --- a/arch/x86/include/asm/pgtable_types.h +++ b/arch/x86/include/asm/pgtable_types.h @@ -22,6 +22,7 @@ #define _PAGE_BIT_PAT_LARGE 12 /* On 2MB or 1GB pages */ #define _PAGE_BIT_SPECIAL _PAGE_BIT_UNUSED1 #define _PAGE_BIT_CPA_TEST _PAGE_BIT_UNUSED1 +#define _PAGE_BIT_SPLITTING _PAGE_BIT_UNUSED1 /* only valid on a PSE pmd */ #define _PAGE_BIT_NX 63 /* No execute: only valid after cpuid check */ /* If _PAGE_BIT_PRESENT is clear, we use these: */ @@ -45,6 +46,7 @@ #define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE) #define _PAGE_SPECIAL (_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL) #define _PAGE_CPA_TEST (_AT(pteval_t, 1) << _PAGE_BIT_CPA_TEST) +#define _PAGE_SPLITTING (_AT(pteval_t, 1) << _PAGE_BIT_SPLITTING) #define __HAVE_ARCH_PTE_SPECIAL #ifdef CONFIG_KMEMCHECK diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h --- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -344,6 +344,11 @@ extern void untrack_pfn_vma(struct vm_ar unsigned long size); #endif +#ifndef CONFIG_TRANSPARENT_HUGEPAGE +#define pmd_trans_huge(pmd) 0 +#define pmd_trans_splitting(pmd) 0 +#endif + #endif /* !__ASSEMBLY__ */ #endif /* _ASM_GENERIC_PGTABLE_H */ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with SMTP id 513276B01F7 for ; Thu, 1 Apr 2010 20:45:21 -0400 (EDT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: [PATCH 39 of 41] add pmd_modify Message-Id: <9dd19a699656ec5bb8ba.1270168926@v2.random> In-Reply-To: References: Date: Fri, 02 Apr 2010 02:42:06 +0200 From: Andrea Arcangeli Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, Andrew Morton Cc: Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S.
Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: From: Johannes Weiner Add pmd_modify() for use with mprotect() on huge pmds. Signed-off-by: Johannes Weiner Signed-off-by: Andrea Arcangeli Reviewed-by: Rik van Riel --- diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -323,6 +323,16 @@ static inline pte_t pte_modify(pte_t pte return __pte(val); } +static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot) +{ + pmdval_t val = pmd_val(pmd); + + val &= _HPAGE_CHG_MASK; + val |= massage_pgprot(newprot) & ~_HPAGE_CHG_MASK; + + return __pmd(val); +} + /* mprotect needs to preserve PAT bits when updating vm_page_prot */ #define pgprot_modify pgprot_modify static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot) diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h --- a/arch/x86/include/asm/pgtable_types.h +++ b/arch/x86/include/asm/pgtable_types.h @@ -72,6 +72,7 @@ /* Set of bits not changed in pte_modify */ #define _PAGE_CHG_MASK (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT | \ _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY) +#define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE) #define _PAGE_CACHE_MASK (_PAGE_PCD | _PAGE_PWT) #define _PAGE_CACHE_WB (0) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with SMTP id 6B1076B01FA for ; Thu, 1 Apr 2010 20:45:21 -0400 (EDT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: [PATCH 05 of 41] fix bad_page to show the real reason the page is bad Message-Id: In-Reply-To: References: Date: Fri, 02 Apr 2010 02:41:32 +0200 From: Andrea Arcangeli Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, Andrew Morton Cc: Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: From: Andrea Arcangeli page_count shows the count of the head page, but the actual check is done on the tail page, so show what is really being checked. Signed-off-by: Andrea Arcangeli Acked-by: Rik van Riel Acked-by: Mel Gorman --- diff --git a/mm/page_alloc.c b/mm/page_alloc.c --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5291,7 +5291,7 @@ void dump_page(struct page *page) { printk(KERN_ALERT "page:%p count:%d mapcount:%d mapping:%p index:%#lx\n", - page, page_count(page), page_mapcount(page), + page, atomic_read(&page->_count), page_mapcount(page), page->mapping, page->index); dump_page_flags(page->flags); } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
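To see what the pmd_modify()/_HPAGE_CHG_MASK change above is doing, here is a toy model with simplified, made-up bit positions (the real layout lives in pgtable_types.h): everything inside the change mask — the pfn, the dirty bit, and for huge pmds the PSE bit — survives, while the protection bits come from the new prot:

    #include <stdio.h>

    /* Simplified bit layout for illustration only */
    #define PFN_MASK   0xfffff000UL
    #define BIT_RW     (1UL << 1)
    #define BIT_USER   (1UL << 2)
    #define BIT_DIRTY  (1UL << 6)
    #define BIT_PSE    (1UL << 7)   /* huge page */

    #define CHG_MASK       (PFN_MASK | BIT_DIRTY)   /* _PAGE_CHG_MASK analogue */
    #define HPAGE_CHG_MASK (CHG_MASK | BIT_PSE)     /* keep PSE across mprotect */

    static unsigned long toy_pmd_modify(unsigned long pmd, unsigned long newprot)
    {
        return (pmd & HPAGE_CHG_MASK) | (newprot & ~HPAGE_CHG_MASK);
    }

    int main(void)
    {
        unsigned long pmd = 0x00200000UL | BIT_PSE | BIT_DIRTY | BIT_RW | BIT_USER;
        unsigned long ro  = BIT_USER;               /* drop write permission */
        unsigned long out = toy_pmd_modify(pmd, ro);

        /* pfn, PSE and dirty survive; RW is gone */
        printf("before %#lx after %#lx (PSE kept: %d, RW kept: %d)\n",
               pmd, out, !!(out & BIT_PSE), !!(out & BIT_RW));
        return 0;
    }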
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with SMTP id 797076B01F1 for ; Thu, 1 Apr 2010 20:45:21 -0400 (EDT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: [PATCH 08 of 41] add pmd paravirt ops Message-Id: In-Reply-To: References: Date: Fri, 02 Apr 2010 02:41:35 +0200 From: Andrea Arcangeli Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, Andrew Morton Cc: Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: From: Andrea Arcangeli Add paravirt ops pmd_update/pmd_update_defer/set_pmd_at. Not all might be necessary (vmware needs pmd_update, Xen needs set_pmd_at, nobody needs pmd_update_defer), but this is to keep full symmetry with the pte paravirt ops, which looks cleaner and simpler from a common code POV. Signed-off-by: Andrea Arcangeli Acked-by: Rik van Riel Acked-by: Mel Gorman --- diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -440,6 +440,11 @@ static inline void pte_update(struct mm_ { PVOP_VCALL3(pv_mmu_ops.pte_update, mm, addr, ptep); } +static inline void pmd_update(struct mm_struct *mm, unsigned long addr, + pmd_t *pmdp) +{ + PVOP_VCALL3(pv_mmu_ops.pmd_update, mm, addr, pmdp); +} static inline void pte_update_defer(struct mm_struct *mm, unsigned long addr, pte_t *ptep) @@ -447,6 +452,12 @@ static inline void pte_update_defer(stru PVOP_VCALL3(pv_mmu_ops.pte_update_defer, mm, addr, ptep); } +static inline void pmd_update_defer(struct mm_struct *mm, unsigned long addr, + pmd_t *pmdp) +{ + PVOP_VCALL3(pv_mmu_ops.pmd_update_defer, mm, addr, pmdp); +} + static inline pte_t __pte(pteval_t val) { pteval_t ret; @@ -548,6 +559,18 @@ static inline void set_pte_at(struct mm_ PVOP_VCALL4(pv_mmu_ops.set_pte_at, mm, addr, ptep, pte.pte); } +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr, + pmd_t *pmdp, pmd_t pmd) +{ + if (sizeof(pmdval_t) > sizeof(long)) + /* 5 arg words */ + pv_mmu_ops.set_pmd_at(mm, addr, pmdp, pmd); + else + PVOP_VCALL4(pv_mmu_ops.set_pmd_at, mm, addr, pmdp, pmd.pmd); +} +#endif + static inline void set_pmd(pmd_t *pmdp, pmd_t pmd) { pmdval_t val = native_pmd_val(pmd); diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -266,10 +266,16 @@ struct pv_mmu_ops { void (*set_pte_at)(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pteval); void (*set_pmd)(pmd_t *pmdp, pmd_t pmdval); + void (*set_pmd_at)(struct mm_struct *mm, unsigned long addr, + pmd_t *pmdp, pmd_t pmdval); void (*pte_update)(struct mm_struct *mm, unsigned long addr, pte_t *ptep); void (*pte_update_defer)(struct mm_struct *mm, unsigned long addr, pte_t *ptep); + void (*pmd_update)(struct mm_struct *mm, unsigned long addr, + pmd_t *pmdp); + void (*pmd_update_defer)(struct mm_struct *mm, + unsigned long addr, pmd_t *pmdp); pte_t (*ptep_modify_prot_start)(struct mm_struct *mm,
unsigned long addr, pte_t *ptep); diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -422,8 +422,11 @@ struct pv_mmu_ops pv_mmu_ops = { .set_pte = native_set_pte, .set_pte_at = native_set_pte_at, .set_pmd = native_set_pmd, + .set_pmd_at = native_set_pmd_at, .pte_update = paravirt_nop, .pte_update_defer = paravirt_nop, + .pmd_update = paravirt_nop, + .pmd_update_defer = paravirt_nop, .ptep_modify_prot_start = __ptep_modify_prot_start, .ptep_modify_prot_commit = __ptep_modify_prot_commit, -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id 7B78D6B01FC for ; Thu, 1 Apr 2010 20:45:21 -0400 (EDT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: [PATCH 10 of 41] export maybe_mkwrite Message-Id: In-Reply-To: References: Date: Fri, 02 Apr 2010 02:41:37 +0200 From: Andrea Arcangeli Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, Andrew Morton Cc: Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: From: Andrea Arcangeli huge_memory.c needs it too, when it falls back to copying hugepages into regular fragmented pages if hugepage allocation fails during COW. Signed-off-by: Andrea Arcangeli Acked-by: Rik van Riel Acked-by: Mel Gorman --- diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -390,6 +390,19 @@ static inline void set_compound_order(st } /* + * Do pte_mkwrite, but only if the vma says VM_WRITE. We do this when + * servicing faults for write access. In the normal case, do always want + * pte_mkwrite. But get_user_pages can cause write faults for mappings + * that do not have writing enabled, when used by access_process_vm. + */ +static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma) +{ + if (likely(vma->vm_flags & VM_WRITE)) + pte = pte_mkwrite(pte); + return pte; +} + +/* * Multiple processes may "see" the same page. E.g. for untouched * mappings of /dev/null, all processes see the same page full of * zeroes, and text pages of executables and shared libraries have diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -2031,19 +2031,6 @@ static inline int pte_unmap_same(struct return same; } -/* - * Do pte_mkwrite, but only if the vma says VM_WRITE. We do this when - * servicing faults for write access. In the normal case, do always want - * pte_mkwrite. But get_user_pages can cause write faults for mappings - * that do not have writing enabled, when used by access_process_vm.
- */ -static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma) -{ - if (likely(vma->vm_flags & VM_WRITE)) - pte = pte_mkwrite(pte); - return pte; -} - static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma) { /* -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with SMTP id 9B8A96B0200 for ; Thu, 1 Apr 2010 20:45:21 -0400 (EDT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: [PATCH 12 of 41] config_transparent_hugepage Message-Id: In-Reply-To: References: Date: Fri, 02 Apr 2010 02:41:39 +0200 From: Andrea Arcangeli Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, Andrew Morton Cc: Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: From: Andrea Arcangeli Add config option. Signed-off-by: Andrea Arcangeli Acked-by: Rik van Riel Acked-by: Mel Gorman --- diff --git a/mm/Kconfig b/mm/Kconfig --- a/mm/Kconfig +++ b/mm/Kconfig @@ -287,3 +287,17 @@ config NOMMU_INITIAL_TRIM_EXCESS of 1 says that all excess pages should be trimmed. See Documentation/nommu-mmap.txt for more information. + +config TRANSPARENT_HUGEPAGE + bool "Transparent Hugepage support" if EMBEDDED + depends on X86_64 + default y + help + Transparent Hugepages allows the kernel to use huge pages and + huge tlb transparently to the applications whenever possible. + This feature can improve computing performance to certain + applications by speeding up page faults during memory + allocation, by reducing the number of tlb misses and by speeding + up the pagetable walking. + + If memory constrained on embedded, you may want to say N. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id 754456B01FB for ; Thu, 1 Apr 2010 20:45:21 -0400 (EDT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: [PATCH 40 of 41] mprotect: pass vma down to page table walkers Message-Id: <7182d4c4a688e1a1363b.1270168927@v2.random> In-Reply-To: References: Date: Fri, 02 Apr 2010 02:42:07 +0200 From: Andrea Arcangeli Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, Andrew Morton Cc: Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. 
Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: From: Johannes Weiner Waiting for huge pmds to finish splitting requires the vma's anon_vma, so pass along the vma instead of the mm; we can always get the latter when we need it. Signed-off-by: Johannes Weiner Signed-off-by: Andrea Arcangeli Reviewed-by: Rik van Riel --- diff --git a/mm/mprotect.c b/mm/mprotect.c --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -36,10 +36,11 @@ static inline pgprot_t pgprot_modify(pgp } #endif -static void change_pte_range(struct mm_struct *mm, pmd_t *pmd, +static void change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, unsigned long end, pgprot_t newprot, int dirty_accountable) { + struct mm_struct *mm = vma->vm_mm; pte_t *pte, oldpte; spinlock_t *ptl; @@ -79,7 +80,7 @@ static void change_pte_range(struct mm_s pte_unmap_unlock(pte - 1, ptl); } -static inline void change_pmd_range(struct mm_struct *mm, pud_t *pud, +static inline void change_pmd_range(struct vm_area_struct *vma, pud_t *pud, unsigned long addr, unsigned long end, pgprot_t newprot, int dirty_accountable) { @@ -89,14 +90,14 @@ static inline void change_pmd_range(stru pmd = pmd_offset(pud, addr); do { next = pmd_addr_end(addr, end); - split_huge_page_pmd(mm, pmd); + split_huge_page_pmd(vma->vm_mm, pmd); if (pmd_none_or_clear_bad(pmd)) continue; - change_pte_range(mm, pmd, addr, next, newprot, dirty_accountable); + change_pte_range(vma, pmd, addr, next, newprot, dirty_accountable); } while (pmd++, addr = next, addr != end); } -static inline void change_pud_range(struct mm_struct *mm, pgd_t *pgd, +static inline void change_pud_range(struct vm_area_struct *vma, pgd_t *pgd, unsigned long addr, unsigned long end, pgprot_t newprot, int dirty_accountable) { @@ -108,7 +109,7 @@ static inline void change_pud_range(stru next = pud_addr_end(addr, end); if (pud_none_or_clear_bad(pud)) continue; - change_pmd_range(mm, pud, addr, next, newprot, dirty_accountable); + change_pmd_range(vma, pud, addr, next, newprot, dirty_accountable); } while (pud++, addr = next, addr != end); } @@ -128,7 +129,7 @@ static void change_protection(struct vm_ next = pgd_addr_end(addr, end); if (pgd_none_or_clear_bad(pgd)) continue; - change_pud_range(mm, pgd, addr, next, newprot, dirty_accountable); + change_pud_range(vma, pgd, addr, next, newprot, dirty_accountable); } while (pgd++, addr = next, addr != end); flush_tlb_range(vma, start, end); } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id 9497D6B01FF for ; Thu, 1 Apr 2010 20:45:21 -0400 (EDT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: [PATCH 32 of 41] memcg huge memory Message-Id: <771e2453e0c5c76b180f.1270168919@v2.random> In-Reply-To: References: Date: Fri, 02 Apr 2010 02:41:59 +0200 From: Andrea Arcangeli Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, Andrew Morton Cc: Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S.
Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: From: Andrea Arcangeli Add memcg charge/uncharge to hugepage faults in huge_memory.c. Signed-off-by: Andrea Arcangeli Acked-by: Rik van Riel --- diff --git a/mm/huge_memory.c b/mm/huge_memory.c --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -225,6 +225,7 @@ static int __do_huge_pmd_anonymous_page( VM_BUG_ON(!PageCompound(page)); pgtable = pte_alloc_one(mm, haddr); if (unlikely(!pgtable)) { + mem_cgroup_uncharge_page(page); put_page(page); return VM_FAULT_OOM; } @@ -235,6 +236,7 @@ static int __do_huge_pmd_anonymous_page( spin_lock(&mm->page_table_lock); if (unlikely(!pmd_none(*pmd))) { spin_unlock(&mm->page_table_lock); + mem_cgroup_uncharge_page(page); put_page(page); pte_free(mm, pgtable); } else { @@ -278,6 +280,10 @@ int do_huge_pmd_anonymous_page(struct mm page = alloc_hugepage(transparent_hugepage_defrag(vma)); if (unlikely(!page)) goto out; + if (unlikely(mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))) { + put_page(page); + goto out; + } return __do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page); } @@ -377,9 +383,15 @@ static int do_huge_pmd_wp_page_fallback( for (i = 0; i < HPAGE_PMD_NR; i++) { pages[i] = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address); - if (unlikely(!pages[i])) { - while (--i >= 0) + if (unlikely(!pages[i] || + mem_cgroup_newpage_charge(pages[i], mm, + GFP_KERNEL))) { + if (pages[i]) put_page(pages[i]); + while (--i >= 0) { + mem_cgroup_uncharge_page(pages[i]); + put_page(pages[i]); + } kfree(pages); ret |= VM_FAULT_OOM; goto out; @@ -438,8 +450,10 @@ out: out_free_pages: spin_unlock(&mm->page_table_lock); - for (i = 0; i < HPAGE_PMD_NR; i++) + for (i = 0; i < HPAGE_PMD_NR; i++) { + mem_cgroup_uncharge_page(pages[i]); put_page(pages[i]); + } kfree(pages); goto out; } @@ -482,13 +496,19 @@ int do_huge_pmd_wp_page(struct mm_struct goto out; } + if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) { + put_page(new_page); + ret |= VM_FAULT_OOM; + goto out; + } copy_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR); __SetPageUptodate(new_page); spin_lock(&mm->page_table_lock); - if (unlikely(!pmd_same(*pmd, orig_pmd))) + if (unlikely(!pmd_same(*pmd, orig_pmd))) { + mem_cgroup_uncharge_page(new_page); put_page(new_page); - else { + } else { pmd_t entry; entry = mk_pmd(new_page, vma->vm_page_prot); entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with SMTP id 8D2276B01FE for ; Thu, 1 Apr 2010 20:45:21 -0400 (EDT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: [PATCH 07 of 41] add native_set_pmd_at Message-Id: In-Reply-To: References: Date: Fri, 02 Apr 2010 02:41:34 +0200 From: Andrea Arcangeli Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, Andrew Morton Cc: Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. 
Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: From: Andrea Arcangeli Used by both the paravirt and non-paravirt set_pmd_at. Signed-off-by: Andrea Arcangeli Acked-by: Rik van Riel Acked-by: Mel Gorman --- diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -528,6 +528,12 @@ static inline void native_set_pte_at(str native_set_pte(ptep, pte); } +static inline void native_set_pmd_at(struct mm_struct *mm, unsigned long addr, + pmd_t *pmdp , pmd_t pmd) +{ + native_set_pmd(pmdp, pmd); +} + #ifndef CONFIG_PARAVIRT /* * Rules for using pte_update - it must be called after any PTE update which -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id 8CE1E6B01FD for ; Thu, 1 Apr 2010 20:45:21 -0400 (EDT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: [PATCH 22 of 41] split_huge_page paging Message-Id: <98a07dc480b00b236e17.1270168909@v2.random> In-Reply-To: References: Date: Fri, 02 Apr 2010 02:41:49 +0200 From: Andrea Arcangeli Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, Andrew Morton Cc: Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: From: Andrea Arcangeli Add paging logic that splits the page before it is unmapped and added to swap, to ensure backwards compatibility with the legacy swap code. Eventually swap should natively page out the hugepages to increase performance and decrease seeking and fragmentation of swap space. swapoff can just skip over huge pmds as they cannot be part of swap yet. In add_to_swap, be careful to split the page only if we got a valid swap entry, so we don't split hugepages when swap is full. In theory we could split pages before isolating them during the lru scan, but for khugepaged to be safe, I'm relying on either mmap_sem write mode, or PG_lock taken, so split_huge_page has to run either with mmap_sem read/write mode or PG_lock taken. Calling it from isolate_lru_page would make locking more complicated; in addition, split_huge_page would deadlock if called by __isolate_lru_page, because it has to take the lru lock to add the tail pages.
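A toy model of the ordering the add_to_swap hunk below implements (stand-in names, pretend failure): reserve the swap entry first, split only once the reservation is valid, and give the reservation back when the split fails, mirroring how swapcache_free() undoes get_swap_page() there:

    #include <stdio.h>

    static int get_swap_entry(void)       { return 42; }   /* 0 == swap full */
    static void free_swap_entry(int e)    { printf("freed entry %d\n", e); }
    static int split_huge_page_stub(void) { return -1; }   /* pretend failure */

    static int add_to_swap(int is_huge)
    {
        int entry = get_swap_entry();
        if (!entry)
            return 0;                   /* swap full: never split for nothing */
        if (is_huge && split_huge_page_stub()) {
            free_swap_entry(entry);     /* roll back the reservation */
            return 0;
        }
        printf("page added to swap at entry %d\n", entry);
        return 1;
    }

    int main(void)
    {
        add_to_swap(1);   /* huge page: the split fails, entry rolled back */
        add_to_swap(0);   /* regular page: goes straight in */
        return 0;
    }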
Signed-off-by: Andrea Arcangeli Acked-by: Mel Gorman Acked-by: Rik van Riel --- diff --git a/mm/memory-failure.c b/mm/memory-failure.c --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -378,6 +378,8 @@ static void collect_procs_anon(struct pa struct task_struct *tsk; struct anon_vma *av; + if (unlikely(split_huge_page(page))) + return; read_lock(&tasklist_lock); av = page_lock_anon_vma(page); if (av == NULL) /* Not actually mapped anymore */ diff --git a/mm/rmap.c b/mm/rmap.c --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1284,6 +1284,7 @@ int try_to_unmap(struct page *page, enum int ret; BUG_ON(!PageLocked(page)); + BUG_ON(PageTransHuge(page)); if (unlikely(PageKsm(page))) ret = try_to_unmap_ksm(page, flags); diff --git a/mm/swap_state.c b/mm/swap_state.c --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -156,6 +156,12 @@ int add_to_swap(struct page *page) if (!entry.val) return 0; + if (unlikely(PageTransHuge(page))) + if (unlikely(split_huge_page(page))) { + swapcache_free(entry, NULL); + return 0; + } + /* * Radix-tree node allocations from PF_MEMALLOC contexts could * completely exhaust the page allocator. __GFP_NOMEMALLOC diff --git a/mm/swapfile.c b/mm/swapfile.c --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -937,6 +937,8 @@ static inline int unuse_pmd_range(struct pmd = pmd_offset(pud, addr); do { next = pmd_addr_end(addr, end); + if (unlikely(pmd_trans_huge(*pmd))) + continue; if (pmd_none_or_clear_bad(pmd)) continue; ret = unuse_pte_range(vma, pmd, addr, next, entry, page); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id C9E786B0206 for ; Thu, 1 Apr 2010 20:45:21 -0400 (EDT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: [PATCH 24 of 41] kvm mmu transparent hugepage support Message-Id: In-Reply-To: References: Date: Fri, 02 Apr 2010 02:41:51 +0200 From: Andrea Arcangeli Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, Andrew Morton Cc: Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: From: Marcelo Tosatti This should work for both hugetlbfs and transparent hugepages. Signed-off-by: Andrea Arcangeli Signed-off-by: Marcelo Tosatti Acked-by: Rik van Riel --- diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -470,6 +470,15 @@ static int host_mapping_level(struct kvm page_size = kvm_host_page_size(kvm, gfn); + /* check for transparent hugepages */ + if (page_size == PAGE_SIZE) { + struct page *page = gfn_to_page(kvm, gfn); + + if (!is_error_page(page) && PageTransCompound(page)) + page_size = KVM_HPAGE_SIZE(2); + kvm_release_page_clean(page); + } + for (i = PT_PAGE_TABLE_LEVEL; i < (PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES); ++i) { if (page_size >= KVM_HPAGE_SIZE(i)) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id D85586B0208 for ; Thu, 1 Apr 2010 20:45:21 -0400 (EDT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: [PATCH 19 of 41] clear page compound Message-Id: In-Reply-To: References: Date: Fri, 02 Apr 2010 02:41:46 +0200 From: Andrea Arcangeli Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, Andrew Morton Cc: Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: From: Andrea Arcangeli split_huge_page must transform a compound page to a regular page and needs ClearPageCompound. Signed-off-by: Andrea Arcangeli Acked-by: Rik van Riel Reviewed-by: Christoph Lameter --- diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -349,7 +349,7 @@ static inline void set_page_writeback(st * tests can be used in performance sensitive paths. PageCompound is * generally not used in hot code paths. */ -__PAGEFLAG(Head, head) +__PAGEFLAG(Head, head) CLEARPAGEFLAG(Head, head) __PAGEFLAG(Tail, tail) static inline int PageCompound(struct page *page) @@ -357,6 +357,13 @@ static inline int PageCompound(struct pa return page->flags & ((1L << PG_head) | (1L << PG_tail)); } +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +static inline void ClearPageCompound(struct page *page) +{ + BUG_ON(!PageHead(page)); + ClearPageHead(page); +} +#endif #else /* * Reduce page flag use as much as possible by overlapping @@ -394,6 +401,14 @@ static inline void __ClearPageTail(struc page->flags &= ~PG_head_tail_mask; } +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +static inline void ClearPageCompound(struct page *page) +{ + BUG_ON((page->flags & PG_head_tail_mask) != (1 << PG_compound)); + clear_bit(PG_compound, &page->flags); +} +#endif + #endif /* !PAGEFLAGS_EXTENDED */ #ifdef CONFIG_MMU -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id D800F6B01F0 for ; Thu, 1 Apr 2010 20:45:21 -0400 (EDT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: [PATCH 02 of 41] compound_lock Message-Id: <3b4cec7fa55a646af239.1270168889@v2.random> In-Reply-To: References: Date: Fri, 02 Apr 2010 02:41:29 +0200 From: Andrea Arcangeli Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, Andrew Morton Cc: Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. 
Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: From: Andrea Arcangeli Add a new compound_lock() needed to serialize put_page against __split_huge_page_refcount(). Signed-off-by: Andrea Arcangeli Acked-by: Rik van Riel --- diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -13,6 +13,7 @@ #include #include #include +#include struct mempolicy; struct anon_vma; @@ -297,6 +298,20 @@ static inline int is_vmalloc_or_module_a } #endif +static inline void compound_lock(struct page *page) +{ +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + bit_spin_lock(PG_compound_lock, &page->flags); +#endif +} + +static inline void compound_unlock(struct page *page) +{ +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + bit_spin_unlock(PG_compound_lock, &page->flags); +#endif +} + static inline struct page *compound_head(struct page *page) { if (unlikely(PageTail(page))) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -108,6 +108,9 @@ enum pageflags { #ifdef CONFIG_MEMORY_FAILURE PG_hwpoison, /* hardware poisoned page. Don't touch */ #endif +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + PG_compound_lock, +#endif __NR_PAGEFLAGS, /* Filesystems */ @@ -399,6 +402,12 @@ static inline void __ClearPageTail(struc #define __PG_MLOCKED 0 #endif +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +#define __PG_COMPOUND_LOCK (1 << PG_compound_lock) +#else +#define __PG_COMPOUND_LOCK 0 +#endif + /* * Flags checked when a page is freed. Pages being freed should not have * these flags set. It they are, there is a problem. @@ -408,7 +417,8 @@ static inline void __ClearPageTail(struc 1 << PG_private | 1 << PG_private_2 | \ 1 << PG_buddy | 1 << PG_writeback | 1 << PG_reserved | \ 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \ - 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON) + 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \ + __PG_COMPOUND_LOCK) /* * Flags checked when a page is prepped for return by the page allocator. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with SMTP id C944B6B0204 for ; Thu, 1 Apr 2010 20:45:21 -0400 (EDT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: [PATCH 01 of 41] define MADV_HUGEPAGE Message-Id: <42065b93826f0fe977f4.1270168888@v2.random> In-Reply-To: References: Date: Fri, 02 Apr 2010 02:41:28 +0200 From: Andrea Arcangeli Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, Andrew Morton Cc: Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: From: Andrea Arcangeli Define MADV_HUGEPAGE. 
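The compound_lock() added above is a bit spinlock: it lives in a single bit of page->flags, which is why one new page flag (PG_compound_lock) is all it costs. A userspace sketch of the idea using C11 atomics (a simplified model of bit_spin_lock, not the kernel implementation, which also disables preemption):

    #include <stdatomic.h>
    #include <stdio.h>

    #define PG_compound_lock 5           /* arbitrary bit for the model */

    static _Atomic unsigned long flags;  /* stands in for page->flags */

    static void compound_lock(void)
    {
        /* Spin until we are the one who flipped the bit 0 -> 1. */
        while (atomic_fetch_or(&flags, 1UL << PG_compound_lock) &
               (1UL << PG_compound_lock))
            ;                            /* cpu_relax() in the kernel */
    }

    static void compound_unlock(void)
    {
        atomic_fetch_and(&flags, ~(1UL << PG_compound_lock));
    }

    int main(void)
    {
        compound_lock();
        /* ...a put_page vs __split_huge_page_refcount critical section... */
        printf("flags while held:   %#lx\n", (unsigned long)atomic_load(&flags));
        compound_unlock();
        printf("flags after release: %#lx\n", (unsigned long)atomic_load(&flags));
        return 0;
    }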
Signed-off-by: Andrea Arcangeli Acked-by: Rik van Riel Acked-by: Arnd Bergmann --- diff --git a/arch/alpha/include/asm/mman.h b/arch/alpha/include/asm/mman.h --- a/arch/alpha/include/asm/mman.h +++ b/arch/alpha/include/asm/mman.h @@ -53,6 +53,8 @@ #define MADV_MERGEABLE 12 /* KSM may merge identical pages */ #define MADV_UNMERGEABLE 13 /* KSM may not merge identical pages */ +#define MADV_HUGEPAGE 14 /* Worth backing with hugepages */ + /* compatibility flags */ #define MAP_FILE 0 diff --git a/arch/mips/include/asm/mman.h b/arch/mips/include/asm/mman.h --- a/arch/mips/include/asm/mman.h +++ b/arch/mips/include/asm/mman.h @@ -77,6 +77,8 @@ #define MADV_UNMERGEABLE 13 /* KSM may not merge identical pages */ #define MADV_HWPOISON 100 /* poison a page for testing */ +#define MADV_HUGEPAGE 14 /* Worth backing with hugepages */ + /* compatibility flags */ #define MAP_FILE 0 diff --git a/arch/parisc/include/asm/mman.h b/arch/parisc/include/asm/mman.h --- a/arch/parisc/include/asm/mman.h +++ b/arch/parisc/include/asm/mman.h @@ -59,6 +59,8 @@ #define MADV_MERGEABLE 65 /* KSM may merge identical pages */ #define MADV_UNMERGEABLE 66 /* KSM may not merge identical pages */ +#define MADV_HUGEPAGE 67 /* Worth backing with hugepages */ + /* compatibility flags */ #define MAP_FILE 0 #define MAP_VARIABLE 0 diff --git a/arch/xtensa/include/asm/mman.h b/arch/xtensa/include/asm/mman.h --- a/arch/xtensa/include/asm/mman.h +++ b/arch/xtensa/include/asm/mman.h @@ -83,6 +83,8 @@ #define MADV_MERGEABLE 12 /* KSM may merge identical pages */ #define MADV_UNMERGEABLE 13 /* KSM may not merge identical pages */ +#define MADV_HUGEPAGE 14 /* Worth backing with hugepages */ + /* compatibility flags */ #define MAP_FILE 0 diff --git a/include/asm-generic/mman-common.h b/include/asm-generic/mman-common.h --- a/include/asm-generic/mman-common.h +++ b/include/asm-generic/mman-common.h @@ -45,7 +45,7 @@ #define MADV_MERGEABLE 12 /* KSM may merge identical pages */ #define MADV_UNMERGEABLE 13 /* KSM may not merge identical pages */ -#define MADV_HUGEPAGE 15 /* Worth backing with hugepages */ +#define MADV_HUGEPAGE 14 /* Worth backing with hugepages */ /* compatibility flags */ #define MAP_FILE 0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with SMTP id A2BE86B0201 for ; Thu, 1 Apr 2010 20:45:21 -0400 (EDT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: [PATCH 17 of 41] pte alloc trans splitting Message-Id: In-Reply-To: References: Date: Fri, 02 Apr 2010 02:41:44 +0200 From: Andrea Arcangeli Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, Andrew Morton Cc: Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: From: Andrea Arcangeli pte alloc routines must wait for split_huge_page if the pmd is not present and not null (i.e. pmd_trans_splitting). 
The additional branches are optimized away at compile time by pmd_trans_splitting if the config option is off. However we must pass the vma down in order to know the anon_vma lock to wait for. Signed-off-by: Andrea Arcangeli Acked-by: Rik van Riel Acked-by: Mel Gorman --- diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1067,7 +1067,8 @@ static inline int __pmd_alloc(struct mm_ int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address); #endif -int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address); +int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma, + pmd_t *pmd, unsigned long address); int __pte_alloc_kernel(pmd_t *pmd, unsigned long address); /* @@ -1136,12 +1137,14 @@ static inline void pgtable_page_dtor(str pte_unmap(pte); \ } while (0) -#define pte_alloc_map(mm, pmd, address) \ - ((unlikely(!pmd_present(*(pmd))) && __pte_alloc(mm, pmd, address))? \ - NULL: pte_offset_map(pmd, address)) +#define pte_alloc_map(mm, vma, pmd, address) \ + ((unlikely(!pmd_present(*(pmd))) && __pte_alloc(mm, vma, \ + pmd, address))? \ + NULL: pte_offset_map(pmd, address)) #define pte_alloc_map_lock(mm, pmd, address, ptlp) \ - ((unlikely(!pmd_present(*(pmd))) && __pte_alloc(mm, pmd, address))? \ + ((unlikely(!pmd_present(*(pmd))) && __pte_alloc(mm, NULL, \ + pmd, address))? \ NULL: pte_offset_map_lock(mm, pmd, address, ptlp)) #define pte_alloc_kernel(pmd, address) \ diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -396,9 +396,11 @@ void free_pgtables(struct mmu_gather *tl } } -int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address) +int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma, + pmd_t *pmd, unsigned long address) { pgtable_t new = pte_alloc_one(mm, address); + int wait_split_huge_page; if (!new) return -ENOMEM; @@ -418,14 +420,18 @@ int __pte_alloc(struct mm_struct *mm, pm smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */ spin_lock(&mm->page_table_lock); - if (!pmd_present(*pmd)) { /* Has another populated it ? */ + wait_split_huge_page = 0; + if (likely(pmd_none(*pmd))) { /* Has another populated it ? */ mm->nr_ptes++; pmd_populate(mm, pmd, new); new = NULL; - } + } else if (unlikely(pmd_trans_splitting(*pmd))) + wait_split_huge_page = 1; spin_unlock(&mm->page_table_lock); if (new) pte_free(mm, new); + if (wait_split_huge_page) + wait_split_huge_page(vma->anon_vma, pmd); return 0; } @@ -438,10 +444,11 @@ int __pte_alloc_kernel(pmd_t *pmd, unsig smp_wmb(); /* See comment in __pte_alloc */ spin_lock(&init_mm.page_table_lock); - if (!pmd_present(*pmd)) { /* Has another populated it ? */ + if (likely(pmd_none(*pmd))) { /* Has another populated it ? 
*/ pmd_populate_kernel(&init_mm, pmd, new); new = NULL; - } + } else + VM_BUG_ON(pmd_trans_splitting(*pmd)); spin_unlock(&init_mm.page_table_lock); if (new) pte_free_kernel(&init_mm, new); @@ -3119,7 +3126,7 @@ int handle_mm_fault(struct mm_struct *mm pmd = pmd_alloc(mm, pud, address); if (!pmd) return VM_FAULT_OOM; - pte = pte_alloc_map(mm, pmd, address); + pte = pte_alloc_map(mm, vma, pmd, address); if (!pte) return VM_FAULT_OOM; diff --git a/mm/mremap.c b/mm/mremap.c --- a/mm/mremap.c +++ b/mm/mremap.c @@ -48,7 +48,8 @@ static pmd_t *get_old_pmd(struct mm_stru return pmd; } -static pmd_t *alloc_new_pmd(struct mm_struct *mm, unsigned long addr) +static pmd_t *alloc_new_pmd(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long addr) { pgd_t *pgd; pud_t *pud; @@ -63,7 +64,7 @@ static pmd_t *alloc_new_pmd(struct mm_st if (!pmd) return NULL; - if (!pmd_present(*pmd) && __pte_alloc(mm, pmd, addr)) + if (!pmd_present(*pmd) && __pte_alloc(mm, vma, pmd, addr)) return NULL; return pmd; @@ -148,7 +149,7 @@ unsigned long move_page_tables(struct vm old_pmd = get_old_pmd(vma->vm_mm, old_addr); if (!old_pmd) continue; - new_pmd = alloc_new_pmd(vma->vm_mm, new_addr); + new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr); if (!new_pmd) break; next = (new_addr + PMD_SIZE) & PMD_MASK; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id BDDB86B01F8 for ; Thu, 1 Apr 2010 20:45:21 -0400 (EDT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: [PATCH 29 of 41] madvise(MADV_HUGEPAGE) Message-Id: In-Reply-To: References: Date: Fri, 02 Apr 2010 02:41:56 +0200 From: Andrea Arcangeli Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, Andrew Morton Cc: Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: From: Andrea Arcangeli Add madvise MADV_HUGEPAGE to mark regions that are important to be hugepage backed. Return -EINVAL if the vma is not of an anonymous type, or the feature isn't built into the kernel. Never silently return success. 
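As a usage sketch (illustration only, not part of the patch; the fallback #define mirrors the asm-generic value this series settles on), userland would do something like:

#include <stdio.h>
#include <sys/mman.h>

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14	/* asm-generic value from this series */
#endif

int main(void)
{
	size_t len = 16UL * 1024 * 1024;
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* anonymous vma: hint that it's worth backing with hugepages */
	if (madvise(p, len, MADV_HUGEPAGE))
		perror("madvise");	/* EINVAL if THP isn't built in */
	return 0;
}

A file-backed or special mapping trips the over-protective flag check in hugepage_madvise below and gets -EINVAL as well.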
Signed-off-by: Andrea Arcangeli Acked-by: Rik van Riel --- diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -99,6 +99,7 @@ extern void __split_huge_page_pmd(struct #endif extern unsigned long vma_address(struct page *page, struct vm_area_struct *vma); +extern int hugepage_madvise(unsigned long *vm_flags); static inline int PageTransHuge(struct page *page) { VM_BUG_ON(PageTail(page)); return PageHead(page); } @@ -121,6 +122,11 @@ static inline int split_huge_page(struct #define wait_split_huge_page(__anon_vma, __pmd) \ do { } while (0) #define PageTransHuge(page) 0 +static inline int hugepage_madvise(unsigned long *vm_flags) +{ + BUG_ON(0); + return 0; +} #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ #endif /* _LINUX_HUGE_MM_H */ diff --git a/mm/huge_memory.c b/mm/huge_memory.c --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -842,6 +842,22 @@ out: return ret; } +int hugepage_madvise(unsigned long *vm_flags) +{ + /* + * Be somewhat over-protective like KSM for now! + */ + if (*vm_flags & (VM_HUGEPAGE | VM_SHARED | VM_MAYSHARE | + VM_PFNMAP | VM_IO | VM_DONTEXPAND | + VM_RESERVED | VM_HUGETLB | VM_INSERTPAGE | + VM_MIXEDMAP | VM_SAO)) + return -EINVAL; + + *vm_flags |= VM_HUGEPAGE; + + return 0; +} + void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd) { struct page *page; diff --git a/mm/madvise.c b/mm/madvise.c --- a/mm/madvise.c +++ b/mm/madvise.c @@ -71,6 +71,11 @@ static long madvise_behavior(struct vm_a if (error) goto out; break; + case MADV_HUGEPAGE: + error = hugepage_madvise(&new_flags); + if (error) + goto out; + break; } if (new_flags == vma->vm_flags) { @@ -283,6 +288,9 @@ madvise_behavior_valid(int behavior) case MADV_MERGEABLE: case MADV_UNMERGEABLE: #endif +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + case MADV_HUGEPAGE: +#endif return 1; default: -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with SMTP id C986E6B0205 for ; Thu, 1 Apr 2010 20:45:21 -0400 (EDT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: [PATCH 04 of 41] update futex compound knowledge Message-Id: In-Reply-To: References: Date: Fri, 02 Apr 2010 02:41:31 +0200 From: Andrea Arcangeli Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, Andrew Morton Cc: Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: From: Andrea Arcangeli Futex code is smarter than most other gup_fast O_DIRECT code and knows about the compound internals. However, now doing a put_page(head_page) will not release the pin on the tail page taken by gup-fast, leading to all sorts of refcounting bugchecks. Getting a stable head_page is a little tricky. page_head = page is there because if this is not a tail page it's also the page_head. compound_head is called only when this is a tail page; otherwise it's guaranteed to be unnecessary.
And if it's a tail page, compound_head has to run atomically inside the irq-disabled section of __get_user_pages_fast before returning. Otherwise ->first_page won't be a stable pointer. Disabling irqs before __get_user_pages_fast and re-enabling them after running compound_head is needed because if __get_user_pages_fast returns 1, it means the huge pmd is established and cannot go away from under us. pmdp_splitting_flush_notify in __split_huge_page_splitting will have to wait for local_irq_enable before the IPI delivery can return. This means __split_huge_page_refcount can't be running from under us, and in turn when we run compound_head(page) we're not reading a dangling pointer from tailpage->first_page. Then after we get to a stable head page, we are always safe to call compound_lock, and after taking the compound lock on the head page we can finally re-check whether the page returned by gup-fast is still a tail page, in which case we're set and we didn't need to split the hugepage in order to take a futex on it. Signed-off-by: Andrea Arcangeli Acked-by: Mel Gorman Acked-by: Rik van Riel --- diff --git a/kernel/futex.c b/kernel/futex.c --- a/kernel/futex.c +++ b/kernel/futex.c @@ -218,7 +218,7 @@ get_futex_key(u32 __user *uaddr, int fsh { unsigned long address = (unsigned long)uaddr; struct mm_struct *mm = current->mm; - struct page *page; + struct page *page, *page_head; int err; /* @@ -250,10 +250,53 @@ again: if (err < 0) return err; - page = compound_head(page); - lock_page(page); - if (!page->mapping) { - unlock_page(page); +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + page_head = page; + if (unlikely(PageTail(page))) { + put_page(page); + /* serialize against __split_huge_page_splitting() */ + local_irq_disable(); + if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1)) { + page_head = compound_head(page); + /* + * page_head is valid pointer but we must pin + * it before taking the PG_lock and/or + * PG_compound_lock. The moment we re-enable + * irqs __split_huge_page_splitting() can + * return and the head page can be freed from + * under us. We can't take the PG_lock and/or + * PG_compound_lock on a page that could be + * freed from under us. + */ + if (page != page_head) + get_page(page_head); + local_irq_enable(); + } else { + local_irq_enable(); + goto again; + } + } +#else + page_head = compound_head(page); + if (page != page_head) + get_page(page_head); +#endif + + lock_page(page_head); + if (unlikely(page_head != page)) { + compound_lock(page_head); + if (unlikely(!PageTail(page))) { + compound_unlock(page_head); + unlock_page(page_head); + put_page(page_head); + put_page(page); + goto again; + } + } + if (!page_head->mapping) { + unlock_page(page_head); + if (page_head != page) + put_page(page_head); put_page(page); goto again; } @@ -265,19 +308,25 @@ again: * it's a read-only handle, it's expected that futexes attach to * the object not the particular process.
*/ - if (PageAnon(page)) { + if (PageAnon(page_head)) { key->both.offset |= FUT_OFF_MMSHARED; /* ref taken on mm */ key->private.mm = mm; key->private.address = address; } else { key->both.offset |= FUT_OFF_INODE; /* inode-based key */ - key->shared.inode = page->mapping->host; - key->shared.pgoff = page->index; + key->shared.inode = page_head->mapping->host; + key->shared.pgoff = page_head->index; } get_futex_key_refs(key); - unlock_page(page); + unlock_page(page_head); + if (page != page_head) { + VM_BUG_ON(!PageTail(page)); + /* releasing compound_lock after page_lock won't matter */ + compound_unlock(page_head); + put_page(page_head); + } put_page(page); return 0; } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with SMTP id 3EA916B0207 for ; Thu, 1 Apr 2010 20:45:22 -0400 (EDT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: [PATCH 03 of 41] alter compound get_page/put_page Message-Id: <5ef6f02de603cf9e4d4a.1270168890@v2.random> In-Reply-To: References: Date: Fri, 02 Apr 2010 02:41:30 +0200 From: Andrea Arcangeli Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, Andrew Morton Cc: Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: From: Andrea Arcangeli Alter compound get_page/put_page to keep references on subpages too, in order to allow __split_huge_page_refcount to split an hugepage even while subpages have been pinned by one of the get_user_pages() variants. Signed-off-by: Andrea Arcangeli Acked-by: Rik van Riel --- diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c --- a/arch/powerpc/mm/gup.c +++ b/arch/powerpc/mm/gup.c @@ -16,6 +16,16 @@ #ifdef __HAVE_ARCH_PTE_SPECIAL +static inline void pin_huge_page_tail(struct page *page) +{ + /* + * __split_huge_page_refcount() cannot run + * from under us. + */ + VM_BUG_ON(atomic_read(&page->_count) < 0); + atomic_inc(&page->_count); +} + /* * The performance critical leaf functions are made noinline otherwise gcc * inlines everything into a single function which results in too much @@ -47,6 +57,8 @@ static noinline int gup_pte_range(pmd_t put_page(page); return 0; } + if (PageTail(page)) + pin_huge_page_tail(page); pages[*nr] = page; (*nr)++; diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c --- a/arch/x86/mm/gup.c +++ b/arch/x86/mm/gup.c @@ -105,6 +105,16 @@ static inline void get_head_page_multipl atomic_add(nr, &page->_count); } +static inline void pin_huge_page_tail(struct page *page) +{ + /* + * __split_huge_page_refcount() cannot run + * from under us. 
+ */ + VM_BUG_ON(atomic_read(&page->_count) < 0); + atomic_inc(&page->_count); +} + static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr, unsigned long end, int write, struct page **pages, int *nr) { @@ -128,6 +138,8 @@ static noinline int gup_huge_pmd(pmd_t p do { VM_BUG_ON(compound_head(page) != head); pages[*nr] = page; + if (PageTail(page)) + pin_huge_page_tail(page); (*nr)++; page++; refs++; diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -326,9 +326,17 @@ static inline int page_count(struct page static inline void get_page(struct page *page) { - page = compound_head(page); - VM_BUG_ON(atomic_read(&page->_count) == 0); + VM_BUG_ON(atomic_read(&page->_count) < !PageTail(page)); atomic_inc(&page->_count); + if (unlikely(PageTail(page))) { + /* + * This is safe only because + * __split_huge_page_refcount can't run under + * get_page(). + */ + VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0); + atomic_inc(&page->first_page->_count); + } } static inline struct page *virt_to_head_page(const void *x) diff --git a/mm/swap.c b/mm/swap.c --- a/mm/swap.c +++ b/mm/swap.c @@ -55,17 +55,82 @@ static void __page_cache_release(struct del_page_from_lru(zone, page); spin_unlock_irqrestore(&zone->lru_lock, flags); } +} + +static void __put_single_page(struct page *page) +{ + __page_cache_release(page); free_hot_cold_page(page, 0); } +static void __put_compound_page(struct page *page) +{ + compound_page_dtor *dtor; + + __page_cache_release(page); + dtor = get_compound_page_dtor(page); + (*dtor)(page); +} + static void put_compound_page(struct page *page) { - page = compound_head(page); - if (put_page_testzero(page)) { - compound_page_dtor *dtor; - - dtor = get_compound_page_dtor(page); - (*dtor)(page); + if (unlikely(PageTail(page))) { + /* __split_huge_page_refcount can run under us */ + struct page *page_head = page->first_page; + smp_rmb(); + if (likely(PageTail(page) && get_page_unless_zero(page_head))) { + if (unlikely(!PageHead(page_head))) { + /* PageHead is cleared after PageTail */ + smp_rmb(); + VM_BUG_ON(PageTail(page)); + goto out_put_head; + } + /* + * Only run compound_lock on a valid PageHead, + * after having it pinned with + * get_page_unless_zero() above. + */ + smp_mb(); + /* page_head wasn't a dangling pointer */ + compound_lock(page_head); + if (unlikely(!PageTail(page))) { + /* __split_huge_page_refcount run before us */ + compound_unlock(page_head); + VM_BUG_ON(PageHead(page_head)); + out_put_head: + if (put_page_testzero(page_head)) + __put_single_page(page_head); + out_put_single: + if (put_page_testzero(page)) + __put_single_page(page); + return; + } + VM_BUG_ON(page_head != page->first_page); + /* + * We can release the refcount taken by + * get_page_unless_zero now that + * split_huge_page_refcount is blocked on the + * compound_lock. 
+ */ + if (put_page_testzero(page_head)) + VM_BUG_ON(1); + /* __split_huge_page_refcount will wait now */ + VM_BUG_ON(atomic_read(&page->_count) <= 0); + atomic_dec(&page->_count); + VM_BUG_ON(atomic_read(&page_head->_count) <= 0); + compound_unlock(page_head); + if (put_page_testzero(page_head)) + __put_compound_page(page_head); + } else { + /* page_head is a dangling pointer */ + VM_BUG_ON(PageTail(page)); + goto out_put_single; + } + } else if (put_page_testzero(page)) { + if (PageHead(page)) + __put_compound_page(page); + else + __put_single_page(page); } } @@ -74,7 +139,7 @@ void put_page(struct page *page) if (unlikely(PageCompound(page))) put_compound_page(page); else if (put_page_testzero(page)) - __page_cache_release(page); + __put_single_page(page); } EXPORT_SYMBOL(put_page); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with SMTP id 88D3E6B0202 for ; Thu, 1 Apr 2010 20:45:22 -0400 (EDT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: [PATCH 25 of 41] _GFP_NO_KSWAPD Message-Id: <709ace46592bee11e523.1270168912@v2.random> In-Reply-To: References: Date: Fri, 02 Apr 2010 02:41:52 +0200 From: Andrea Arcangeli Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, Andrew Morton Cc: Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: From: Andrea Arcangeli Transparent hugepage allocations must be allowed not to invoke kswapd or any other kind of indirect reclaim (especially when the defrag sysfs control is disabled). It's unacceptable to swap out anonymous pages (potentially anonymous transparent hugepages) in order to create new transparent hugepages. This is true for the MADV_HUGEPAGE areas too: it makes no sense to swap out one kvm virtual machine, making it suffer an unbearable slowdown, just so that another one whose guest physical memory is marked MADV_HUGEPAGE can run 30% faster on memory intensive workloads. If a transparent hugepage allocation fails, the slowdown is minor and there is total fallback, so kswapd should never be asked to swap out memory to allow the high order allocation to succeed. Signed-off-by: Andrea Arcangeli Acked-by: Rik van Riel --- diff --git a/include/linux/gfp.h b/include/linux/gfp.h --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -60,13 +60,15 @@ struct vm_area_struct; #define __GFP_NOTRACK ((__force gfp_t)0) #endif +#define __GFP_NO_KSWAPD ((__force gfp_t)0x400000u) + /* * This may seem redundant, but it's a way of annotating false positives vs. * allocations that simply cannot be supported (e.g. page tables).
*/ #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK) -#define __GFP_BITS_SHIFT 22 /* Room for 22 __GFP_FOO bits */ +#define __GFP_BITS_SHIFT 23 /* Room for 23 __GFP_FOO bits */ #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1)) /* This equals 0, but use constants in case they ever change */ diff --git a/mm/page_alloc.c b/mm/page_alloc.c --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1867,7 +1867,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, u goto nopage; restart: - wake_all_kswapd(order, zonelist, high_zoneidx); + if (!(gfp_mask & __GFP_NO_KSWAPD)) + wake_all_kswapd(order, zonelist, high_zoneidx); /* * OK, we're below the kswapd watermark and have kicked background -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with SMTP id C75016B020A for ; Thu, 1 Apr 2010 20:45:22 -0400 (EDT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: [PATCH 26 of 41] don't alloc harder for gfp nomemalloc even if nowait Message-Id: In-Reply-To: References: Date: Fri, 02 Apr 2010 02:41:53 +0200 From: Andrea Arcangeli Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, Andrew Morton Cc: Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: From: Andrea Arcangeli Not worth throwing away the precious reserved free memory pool for allocations that can fail gracefully (either through mempool or because they're transhuge allocations later falling back to 4k allocations). Signed-off-by: Andrea Arcangeli Acked-by: Rik van Riel --- diff --git a/mm/page_alloc.c b/mm/page_alloc.c --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1811,7 +1811,11 @@ gfp_to_alloc_flags(gfp_t gfp_mask) */ alloc_flags |= (gfp_mask & __GFP_HIGH); - if (!wait) { + /* + * Not worth trying to allocate harder for __GFP_NOMEMALLOC + * even if it can't schedule. + */ + if (!wait && !(gfp_mask & __GFP_NOMEMALLOC)) { alloc_flags |= ALLOC_HARDER; /* * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with SMTP id C032F6B020D for ; Thu, 1 Apr 2010 20:45:24 -0400 (EDT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: [PATCH 11 of 41] comment reminder in destroy_compound_page Message-Id: In-Reply-To: References: Date: Fri, 02 Apr 2010 02:41:38 +0200 From: Andrea Arcangeli Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, Andrew Morton Cc: Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: From: Andrea Arcangeli Warn destroy_compound_page that __split_huge_page_refcount is heavily dependent on its internal behavior. Signed-off-by: Andrea Arcangeli Acked-by: Rik van Riel Acked-by: Mel Gorman --- diff --git a/mm/page_alloc.c b/mm/page_alloc.c --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -334,6 +334,7 @@ void prep_compound_page(struct page *pag } } +/* update __split_huge_page_refcount if you change this function */ static int destroy_compound_page(struct page *page, unsigned long order) { int i; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id 3BA9B6B020C for ; Thu, 1 Apr 2010 20:45:25 -0400 (EDT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: [PATCH 30 of 41] pmd_trans_huge migrate bugcheck Message-Id: <18e07f3194de2f6b371f.1270168917@v2.random> In-Reply-To: References: Date: Fri, 02 Apr 2010 02:41:57 +0200 From: Andrea Arcangeli Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, Andrew Morton Cc: Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: From: Andrea Arcangeli No pmd_trans_huge should ever materialize in migration ptes areas, because we split the hugepage before migration ptes are instantiated. 
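To make the invariant concrete, here is an illustrative sketch (hypothetical helper, not part of the patch; it mirrors the do_move_page_to_node_array hunk below) of the rule every migration entry point has to follow:

/* sketch: split before any migration pte can be instantiated */
static int prepare_for_migration(struct page *page)
{
	if (PageTransCompound(page)) {
		/* nonzero return means the page could not be split */
		if (split_huge_page(page))
			return -EBUSY;	/* skip migrating this page */
	}
	/*
	 * From here on no pmd_trans_huge can map this page, so the
	 * VM_BUG_ON(pmd_trans_huge(*pmd)) in remove_migration_pte
	 * can never trigger.
	 */
	return 0;
}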
Signed-off-by: Andrea Arcangeli Acked-by: Rik van Riel --- diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -105,6 +105,10 @@ static inline int PageTransHuge(struct p VM_BUG_ON(PageTail(page)); return PageHead(page); } +static inline int PageTransCompound(struct page *page) +{ + return PageCompound(page); +} #else /* CONFIG_TRANSPARENT_HUGEPAGE */ #define HPAGE_PMD_SHIFT ({ BUG(); 0; }) #define HPAGE_PMD_MASK ({ BUG(); 0; }) @@ -122,6 +126,7 @@ static inline int split_huge_page(struct #define wait_split_huge_page(__anon_vma, __pmd) \ do { } while (0) #define PageTransHuge(page) 0 +#define PageTransCompound(page) 0 static inline int hugepage_madvise(unsigned long *vm_flags) { BUG_ON(0); diff --git a/mm/migrate.c b/mm/migrate.c --- a/mm/migrate.c +++ b/mm/migrate.c @@ -94,6 +94,7 @@ static int remove_migration_pte(struct p goto out; pmd = pmd_offset(pud, addr); + VM_BUG_ON(pmd_trans_huge(*pmd)); if (!pmd_present(*pmd)) goto out; @@ -810,6 +811,10 @@ static int do_move_page_to_node_array(st if (PageReserved(page) || PageKsm(page)) goto put_and_set; + if (unlikely(PageTransCompound(page))) + if (unlikely(split_huge_page(page))) + goto put_and_set; + pp->page = page; err = page_to_nid(page); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id 1D41C6B01F5 for ; Thu, 1 Apr 2010 20:45:26 -0400 (EDT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: [PATCH 14 of 41] add pmd mangling generic functions Message-Id: <871d636cce05923b60d0.1270168901@v2.random> In-Reply-To: References: Date: Fri, 02 Apr 2010 02:41:41 +0200 From: Andrea Arcangeli Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, Andrew Morton Cc: Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: From: Andrea Arcangeli Some are needed to build but not actually used on archs not supporting transparent hugepages. Others like pmdp_clear_flush are used by x86 too. 
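As an illustrative sketch of that symmetry (hypothetical helper, not part of the patch), aging a huge pmd looks exactly like aging a pte, using the pmd variants defined below:

/* sketch: one call covers HPAGE_PMD_SIZE instead of PAGE_SIZE */
static int huge_pmd_young(struct vm_area_struct *vma, unsigned long addr,
			  pmd_t *pmdp)
{
	/*
	 * Clears the accessed bit and flushes the tlb range, just
	 * like ptep_clear_flush_young does for a single pte.
	 */
	return pmdp_clear_flush_young(vma, addr & HPAGE_PMD_MASK, pmdp);
}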
Signed-off-by: Andrea Arcangeli Acked-by: Rik van Riel --- diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h --- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -25,6 +25,26 @@ }) #endif +#ifndef __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +#define pmdp_set_access_flags(__vma, __address, __pmdp, __entry, __dirty) \ + ({ \ + int __changed = !pmd_same(*(__pmdp), __entry); \ + VM_BUG_ON((__address) & ~HPAGE_PMD_MASK); \ + if (__changed) { \ + set_pmd_at((__vma)->vm_mm, __address, __pmdp, \ + __entry); \ + flush_tlb_range(__vma, __address, \ + (__address) + HPAGE_PMD_SIZE); \ + } \ + __changed; \ + }) +#else /* CONFIG_TRANSPARENT_HUGEPAGE */ +#define pmdp_set_access_flags(__vma, __address, __pmdp, __entry, __dirty) \ + ({ BUG(); 0; }) +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ +#endif + #ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG #define ptep_test_and_clear_young(__vma, __address, __ptep) \ ({ \ @@ -39,6 +59,25 @@ }) #endif +#ifndef __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +#define pmdp_test_and_clear_young(__vma, __address, __pmdp) \ +({ \ + pmd_t __pmd = *(__pmdp); \ + int r = 1; \ + if (!pmd_young(__pmd)) \ + r = 0; \ + else \ + set_pmd_at((__vma)->vm_mm, (__address), \ + (__pmdp), pmd_mkold(__pmd)); \ + r; \ +}) +#else /* CONFIG_TRANSPARENT_HUGEPAGE */ +#define pmdp_test_and_clear_young(__vma, __address, __pmdp) \ + ({ BUG(); 0; }) +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ +#endif + #ifndef __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH #define ptep_clear_flush_young(__vma, __address, __ptep) \ ({ \ @@ -50,6 +89,24 @@ }) #endif +#ifndef __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +#define pmdp_clear_flush_young(__vma, __address, __pmdp) \ +({ \ + int __young; \ + VM_BUG_ON((__address) & ~HPAGE_PMD_MASK); \ + __young = pmdp_test_and_clear_young(__vma, __address, __pmdp); \ + if (__young) \ + flush_tlb_range(__vma, __address, \ + (__address) + HPAGE_PMD_SIZE); \ + __young; \ +}) +#else /* CONFIG_TRANSPARENT_HUGEPAGE */ +#define pmdp_clear_flush_young(__vma, __address, __pmdp) \ + ({ BUG(); 0; }) +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ +#endif + #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR #define ptep_get_and_clear(__mm, __address, __ptep) \ ({ \ @@ -59,6 +116,20 @@ }) #endif +#ifndef __HAVE_ARCH_PMDP_GET_AND_CLEAR +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +#define pmdp_get_and_clear(__mm, __address, __pmdp) \ +({ \ + pmd_t __pmd = *(__pmdp); \ + pmd_clear((__mm), (__address), (__pmdp)); \ + __pmd; \ +}) +#else /* CONFIG_TRANSPARENT_HUGEPAGE */ +#define pmdp_get_and_clear(__mm, __address, __pmdp) \ + ({ BUG(); 0; }) +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ +#endif + #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL #define ptep_get_and_clear_full(__mm, __address, __ptep, __full) \ ({ \ @@ -90,6 +161,22 @@ do { \ }) #endif +#ifndef __HAVE_ARCH_PMDP_CLEAR_FLUSH +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +#define pmdp_clear_flush(__vma, __address, __pmdp) \ +({ \ + pmd_t __pmd; \ + VM_BUG_ON((__address) & ~HPAGE_PMD_MASK); \ + __pmd = pmdp_get_and_clear((__vma)->vm_mm, __address, __pmdp); \ + flush_tlb_range(__vma, __address, (__address) + HPAGE_PMD_SIZE);\ + __pmd; \ +}) +#else /* CONFIG_TRANSPARENT_HUGEPAGE */ +#define pmdp_clear_flush(__vma, __address, __pmdp) \ + ({ BUG(); 0; }) +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ +#endif + #ifndef __HAVE_ARCH_PTEP_SET_WRPROTECT struct mm_struct; static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long address, pte_t *ptep) @@ -99,10 
+186,45 @@ static inline void ptep_set_wrprotect(st } #endif +#ifndef __HAVE_ARCH_PMDP_SET_WRPROTECT +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +static inline void pmdp_set_wrprotect(struct mm_struct *mm, unsigned long address, pmd_t *pmdp) +{ + pmd_t old_pmd = *pmdp; + set_pmd_at(mm, address, pmdp, pmd_wrprotect(old_pmd)); +} +#else /* CONFIG_TRANSPARENT_HUGEPAGE */ +#define pmdp_set_wrprotect(mm, address, pmdp) BUG() +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ +#endif + +#ifndef __HAVE_ARCH_PMDP_SPLITTING_FLUSH +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +#define pmdp_splitting_flush(__vma, __address, __pmdp) \ +({ \ + pmd_t __pmd = pmd_mksplitting(*(__pmdp)); \ + VM_BUG_ON((__address) & ~HPAGE_PMD_MASK); \ + set_pmd_at((__vma)->vm_mm, __address, __pmdp, __pmd); \ + /* tlb flush only to serialize against gup-fast */ \ + flush_tlb_range(__vma, __address, (__address) + HPAGE_PMD_SIZE);\ +}) +#else /* CONFIG_TRANSPARENT_HUGEPAGE */ +#define pmdp_splitting_flush(__vma, __address, __pmdp) BUG() +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ +#endif + #ifndef __HAVE_ARCH_PTE_SAME #define pte_same(A,B) (pte_val(A) == pte_val(B)) #endif +#ifndef __HAVE_ARCH_PMD_SAME +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +#define pmd_same(A,B) (pmd_val(A) == pmd_val(B)) +#else /* CONFIG_TRANSPARENT_HUGEPAGE */ +#define pmd_same(A,B) ({ BUG(); 0; }) +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ +#endif + #ifndef __HAVE_ARCH_PAGE_TEST_DIRTY #define page_test_dirty(page) (0) #endif @@ -347,6 +469,9 @@ extern void untrack_pfn_vma(struct vm_ar #ifndef CONFIG_TRANSPARENT_HUGEPAGE #define pmd_trans_huge(pmd) 0 #define pmd_trans_splitting(pmd) 0 +#ifndef __HAVE_ARCH_PMD_WRITE +#define pmd_write(pmd) ({ BUG(); 0; }) +#endif /* __HAVE_ARCH_PMD_WRITE */ #endif #endif /* !__ASSEMBLY__ */ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with SMTP id 226086B0213 for ; Thu, 1 Apr 2010 20:45:27 -0400 (EDT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: [PATCH 15 of 41] add pmd mangling functions to x86 Message-Id: In-Reply-To: References: Date: Fri, 02 Apr 2010 02:41:42 +0200 From: Andrea Arcangeli Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, Andrew Morton Cc: Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: From: Andrea Arcangeli Add the needed pmd mangling functions, symmetric with their pte counterparts. pmdp_splitting_flush is the only exception, present only on the pmd side: it's needed to serialize the VM against split_huge_page. It simply atomically sets the splitting bit, in the same way pmdp_clear_flush_young atomically clears the accessed bit (and both need to flush the tlb to make it effective, which is mandatory to happen synchronously for pmdp_splitting_flush).
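For illustration (a sketch, not part of the patch), this is the other half of the handshake that the "tlb flush only to serialize against gup-fast" comment in pmdp_splitting_flush above refers to; with irqs disabled the IPI-based flush_tlb_range cannot complete:

/* sketch: gup-fast side of the serialization */
static int gup_fast_pmd_stable(pmd_t *pmdp)
{
	pmd_t pmd;
	int stable;

	local_irq_disable();	/* holds off the splitter's tlb flush IPI */
	pmd = *pmdp;
	stable = pmd_trans_huge(pmd) && !pmd_trans_splitting(pmd);
	/* if stable, tail page refs can be taken safely in this window */
	local_irq_enable();	/* now pmdp_splitting_flush can return */
	return stable;
}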
Signed-off-by: Andrea Arcangeli Acked-by: Rik van Riel --- diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -300,15 +300,15 @@ pmd_t *populate_extra_pmd(unsigned long pte_t *populate_extra_pte(unsigned long vaddr); #endif /* __ASSEMBLY__ */ +#ifndef __ASSEMBLY__ +#include + #ifdef CONFIG_X86_32 # include "pgtable_32.h" #else # include "pgtable_64.h" #endif -#ifndef __ASSEMBLY__ -#include - static inline int pte_none(pte_t pte) { return !pte.pte; @@ -351,7 +351,7 @@ static inline unsigned long pmd_page_vad * Currently stuck as a macro due to indirect forward reference to * linux/mmzone.h's __section_mem_map_addr() definition: */ -#define pmd_page(pmd) pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT) +#define pmd_page(pmd) pfn_to_page((pmd_val(pmd) & PTE_PFN_MASK) >> PAGE_SHIFT) /* * the pmd page can be thought of an array like this: pmd_t[PTRS_PER_PMD] diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h --- a/arch/x86/include/asm/pgtable_64.h +++ b/arch/x86/include/asm/pgtable_64.h @@ -72,6 +72,19 @@ static inline pte_t native_ptep_get_and_ #endif } +static inline pmd_t native_pmdp_get_and_clear(pmd_t *xp) +{ +#ifdef CONFIG_SMP + return native_make_pmd(xchg(&xp->pmd, 0)); +#else + /* native_local_pmdp_get_and_clear, + but duplicated because of cyclic dependency */ + pmd_t ret = *xp; + native_pmd_clear(NULL, 0, xp); + return ret; +#endif +} + static inline void native_set_pmd(pmd_t *pmdp, pmd_t pmd) { *pmdp = pmd; @@ -181,6 +194,98 @@ static inline int pmd_trans_huge(pmd_t p } #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ +#define mk_pmd(page, pgprot) pfn_pmd(page_to_pfn(page), (pgprot)) + +#define __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS +extern int pmdp_set_access_flags(struct vm_area_struct *vma, + unsigned long address, pmd_t *pmdp, + pmd_t entry, int dirty); + +#define __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG +extern int pmdp_test_and_clear_young(struct vm_area_struct *vma, + unsigned long addr, pmd_t *pmdp); + +#define __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH +extern int pmdp_clear_flush_young(struct vm_area_struct *vma, + unsigned long address, pmd_t *pmdp); + + +#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH +extern void pmdp_splitting_flush(struct vm_area_struct *vma, + unsigned long addr, pmd_t *pmdp); + +#define __HAVE_ARCH_PMD_WRITE +static inline int pmd_write(pmd_t pmd) +{ + return pmd_flags(pmd) & _PAGE_RW; +} + +#define __HAVE_ARCH_PMDP_GET_AND_CLEAR +static inline pmd_t pmdp_get_and_clear(struct mm_struct *mm, unsigned long addr, + pmd_t *pmdp) +{ + pmd_t pmd = native_pmdp_get_and_clear(pmdp); + pmd_update(mm, addr, pmdp); + return pmd; +} + +#define __HAVE_ARCH_PMDP_SET_WRPROTECT +static inline void pmdp_set_wrprotect(struct mm_struct *mm, + unsigned long addr, pmd_t *pmdp) +{ + clear_bit(_PAGE_BIT_RW, (unsigned long *)&pmdp->pmd); + pmd_update(mm, addr, pmdp); +} + +static inline int pmd_young(pmd_t pmd) +{ + return pmd_flags(pmd) & _PAGE_ACCESSED; +} + +static inline pmd_t pmd_set_flags(pmd_t pmd, pmdval_t set) +{ + pmdval_t v = native_pmd_val(pmd); + + return native_make_pmd(v | set); +} + +static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear) +{ + pmdval_t v = native_pmd_val(pmd); + + return native_make_pmd(v & ~clear); +} + +static inline pmd_t pmd_mkold(pmd_t pmd) +{ + return pmd_clear_flags(pmd, _PAGE_ACCESSED); +} + +static inline pmd_t pmd_wrprotect(pmd_t pmd) +{ + return pmd_clear_flags(pmd, _PAGE_RW); +} + +static inline pmd_t pmd_mkdirty(pmd_t pmd) +{ + 
return pmd_set_flags(pmd, _PAGE_DIRTY); +} + +static inline pmd_t pmd_mkhuge(pmd_t pmd) +{ + return pmd_set_flags(pmd, _PAGE_PSE); +} + +static inline pmd_t pmd_mkyoung(pmd_t pmd) +{ + return pmd_set_flags(pmd, _PAGE_ACCESSED); +} + +static inline pmd_t pmd_mkwrite(pmd_t pmd) +{ + return pmd_set_flags(pmd, _PAGE_RW); +} + #endif /* !__ASSEMBLY__ */ #endif /* _ASM_X86_PGTABLE_64_H */ diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -309,6 +309,25 @@ int ptep_set_access_flags(struct vm_area return changed; } +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +int pmdp_set_access_flags(struct vm_area_struct *vma, + unsigned long address, pmd_t *pmdp, + pmd_t entry, int dirty) +{ + int changed = !pmd_same(*pmdp, entry); + + VM_BUG_ON(address & ~HPAGE_PMD_MASK); + + if (changed && dirty) { + *pmdp = entry; + pmd_update_defer(vma->vm_mm, address, pmdp); + flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE); + } + + return changed; +} +#endif + int ptep_test_and_clear_young(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep) { @@ -324,6 +343,23 @@ int ptep_test_and_clear_young(struct vm_ return ret; } +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +int pmdp_test_and_clear_young(struct vm_area_struct *vma, + unsigned long addr, pmd_t *pmdp) +{ + int ret = 0; + + if (pmd_young(*pmdp)) + ret = test_and_clear_bit(_PAGE_BIT_ACCESSED, + (unsigned long *) &pmdp->pmd); + + if (ret) + pmd_update(vma->vm_mm, addr, pmdp); + + return ret; +} +#endif + int ptep_clear_flush_young(struct vm_area_struct *vma, unsigned long address, pte_t *ptep) { @@ -336,6 +372,36 @@ int ptep_clear_flush_young(struct vm_are return young; } +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +int pmdp_clear_flush_young(struct vm_area_struct *vma, + unsigned long address, pmd_t *pmdp) +{ + int young; + + VM_BUG_ON(address & ~HPAGE_PMD_MASK); + + young = pmdp_test_and_clear_young(vma, address, pmdp); + if (young) + flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE); + + return young; +} + +void pmdp_splitting_flush(struct vm_area_struct *vma, + unsigned long address, pmd_t *pmdp) +{ + int set; + VM_BUG_ON(address & ~HPAGE_PMD_MASK); + set = !test_and_set_bit(_PAGE_BIT_SPLITTING, + (unsigned long *)&pmdp->pmd); + if (set) { + pmd_update(vma->vm_mm, address, pmdp); + /* need tlb flush only to serialize against gup-fast */ + flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE); + } +} +#endif + /** * reserve_top_address - reserves a hole in the top of kernel address space * @reserve - size of hole to reserve -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with SMTP id 3A952620084 for ; Thu, 1 Apr 2010 20:45:58 -0400 (EDT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: [PATCH 18 of 41] add pmd mmu_notifier helpers Message-Id: In-Reply-To: References: Date: Fri, 02 Apr 2010 02:41:45 +0200 From: Andrea Arcangeli Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, Andrew Morton Cc: Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: From: Andrea Arcangeli Add mmu notifier helpers to handle pmd huge operations. Signed-off-by: Andrea Arcangeli Acked-by: Rik van Riel --- diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -243,6 +243,32 @@ static inline void mmu_notifier_mm_destr __pte; \ }) +#define pmdp_clear_flush_notify(__vma, __address, __pmdp) \ +({ \ + pmd_t __pmd; \ + struct vm_area_struct *___vma = __vma; \ + unsigned long ___address = __address; \ + VM_BUG_ON(__address & ~HPAGE_PMD_MASK); \ + mmu_notifier_invalidate_range_start(___vma->vm_mm, ___address, \ + (__address)+HPAGE_PMD_SIZE);\ + __pmd = pmdp_clear_flush(___vma, ___address, __pmdp); \ + mmu_notifier_invalidate_range_end(___vma->vm_mm, ___address, \ + (__address)+HPAGE_PMD_SIZE); \ + __pmd; \ +}) + +#define pmdp_splitting_flush_notify(__vma, __address, __pmdp) \ +({ \ + struct vm_area_struct *___vma = __vma; \ + unsigned long ___address = __address; \ + VM_BUG_ON(__address & ~HPAGE_PMD_MASK); \ + mmu_notifier_invalidate_range_start(___vma->vm_mm, ___address, \ + (__address)+HPAGE_PMD_SIZE);\ + pmdp_splitting_flush(___vma, ___address, __pmdp); \ + mmu_notifier_invalidate_range_end(___vma->vm_mm, ___address, \ + (__address)+HPAGE_PMD_SIZE); \ +}) + #define ptep_clear_flush_young_notify(__vma, __address, __ptep) \ ({ \ int __young; \ @@ -254,6 +280,17 @@ static inline void mmu_notifier_mm_destr __young; \ }) +#define pmdp_clear_flush_young_notify(__vma, __address, __pmdp) \ +({ \ + int __young; \ + struct vm_area_struct *___vma = __vma; \ + unsigned long ___address = __address; \ + __young = pmdp_clear_flush_young(___vma, ___address, __pmdp); \ + __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \ + ___address); \ + __young; \ +}) + #define set_pte_at_notify(__mm, __address, __ptep, __pte) \ ({ \ struct mm_struct *___mm = __mm; \ @@ -305,7 +342,10 @@ static inline void mmu_notifier_mm_destr } #define ptep_clear_flush_young_notify ptep_clear_flush_young +#define pmdp_clear_flush_young_notify pmdp_clear_flush_young #define ptep_clear_flush_notify ptep_clear_flush +#define pmdp_clear_flush_notify pmdp_clear_flush +#define pmdp_splitting_flush_notify pmdp_splitting_flush #define set_pte_at_notify set_pte_at #endif /* CONFIG_MMU_NOTIFIER */ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with SMTP id F0233620087 for ; Thu, 1 Apr 2010 20:45:58 -0400 (EDT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: [PATCH 41 of 41] mprotect: transparent huge page support Message-Id: In-Reply-To: References: Date: Fri, 02 Apr 2010 02:42:08 +0200 From: Andrea Arcangeli Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, Andrew Morton Cc: Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: From: Johannes Weiner Natively handle huge pmds when changing page tables on behalf of mprotect(). I left out update_mmu_cache() because we do not need it on x86 anyway but more importantly the interface works on ptes, not pmds. Signed-off-by: Johannes Weiner Signed-off-by: Andrea Arcangeli Reviewed-by: Rik van Riel --- diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -22,6 +22,8 @@ extern int zap_huge_pmd(struct mmu_gathe extern int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, unsigned long end, unsigned char *vec); +extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, + unsigned long addr, pgprot_t newprot); enum transparent_hugepage_flag { TRANSPARENT_HUGEPAGE_FLAG, diff --git a/mm/huge_memory.c b/mm/huge_memory.c --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -961,6 +961,33 @@ int mincore_huge_pmd(struct vm_area_stru return ret; } +int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, + unsigned long addr, pgprot_t newprot) +{ + struct mm_struct *mm = vma->vm_mm; + int ret = 0; + + spin_lock(&mm->page_table_lock); + if (likely(pmd_trans_huge(*pmd))) { + if (unlikely(pmd_trans_splitting(*pmd))) { + spin_unlock(&mm->page_table_lock); + wait_split_huge_page(vma->anon_vma, pmd); + } else { + pmd_t entry; + + entry = pmdp_get_and_clear(mm, addr, pmd); + entry = pmd_modify(entry, newprot); + set_pmd_at(mm, addr, pmd, entry); + spin_unlock(&vma->vm_mm->page_table_lock); + flush_tlb_range(vma, addr, addr + HPAGE_PMD_SIZE); + ret = 1; + } + } else + spin_unlock(&vma->vm_mm->page_table_lock); + + return ret; +} + pmd_t *page_check_address_pmd(struct page *page, struct mm_struct *mm, unsigned long address, diff --git a/mm/mprotect.c b/mm/mprotect.c --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -90,7 +90,13 @@ static inline void change_pmd_range(stru pmd = pmd_offset(pud, addr); do { next = pmd_addr_end(addr, end); - split_huge_page_pmd(vma->vm_mm, pmd); + if (pmd_trans_huge(*pmd)) { + if (next - addr != HPAGE_PMD_SIZE) + split_huge_page_pmd(vma->vm_mm, pmd); + else if (change_huge_pmd(vma, pmd, addr, newprot)) + continue; + /* fall through */ + } if (pmd_none_or_clear_bad(pmd)) continue; change_pte_range(vma, pmd, addr, next, newprot, dirty_accountable); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id BAE6E620084 for ; Thu, 1 Apr 2010 20:46:21 -0400 (EDT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: [PATCH 38 of 41] mincore transparent hugepage support Message-Id: In-Reply-To: References: Date: Fri, 02 Apr 2010 02:42:05 +0200 From: Andrea Arcangeli Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, Andrew Morton Cc: Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: From: Johannes Weiner Handle transparent huge page pmd entries natively instead of splitting them into subpages. Signed-off-by: Johannes Weiner Signed-off-by: Andrea Arcangeli Reviewed-by: Rik van Riel --- diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -19,6 +19,9 @@ extern struct page *follow_trans_huge_pm extern int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, pmd_t *pmd); +extern int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, + unsigned long addr, unsigned long end, + unsigned char *vec); enum transparent_hugepage_flag { TRANSPARENT_HUGEPAGE_FLAG, diff --git a/mm/huge_memory.c b/mm/huge_memory.c --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -936,6 +936,31 @@ int zap_huge_pmd(struct mmu_gather *tlb, return ret; } +int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, + unsigned long addr, unsigned long end, + unsigned char *vec) +{ + int ret = 0; + + spin_lock(&vma->vm_mm->page_table_lock); + if (likely(pmd_trans_huge(*pmd))) { + ret = !pmd_trans_splitting(*pmd); + spin_unlock(&vma->vm_mm->page_table_lock); + if (unlikely(!ret)) + wait_split_huge_page(vma->anon_vma, pmd); + else { + /* + * All logical pages in the range are present + * if backed by a huge page. + */ + memset(vec, 1, (end - addr) >> PAGE_SHIFT); + } + } else + spin_unlock(&vma->vm_mm->page_table_lock); + + return ret; +} + pmd_t *page_check_address_pmd(struct page *page, struct mm_struct *mm, unsigned long address, diff --git a/mm/mincore.c b/mm/mincore.c --- a/mm/mincore.c +++ b/mm/mincore.c @@ -154,7 +154,13 @@ static void mincore_pmd_range(struct vm_ pmd = pmd_offset(pud, addr); do { next = pmd_addr_end(addr, end); - split_huge_page_pmd(vma->vm_mm, pmd); + if (pmd_trans_huge(*pmd)) { + if (mincore_huge_pmd(vma, pmd, addr, next, vec)) { + vec += (next - addr) >> PAGE_SHIFT; + continue; + } + /* fall through */ + } if (pmd_none_or_clear_bad(pmd)) mincore_unmapped_range(vma, addr, next, vec); else -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with SMTP id 0E031620084 for ; Thu, 1 Apr 2010 20:46:24 -0400 (EDT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: [PATCH 23 of 41] clear_copy_huge_page Message-Id: <92d5d5cb33f4d092bcf7.1270168910@v2.random> In-Reply-To: References: Date: Fri, 02 Apr 2010 02:41:50 +0200 From: Andrea Arcangeli Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, Andrew Morton Cc: Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: From: Andrea Arcangeli Move the copy/clear_huge_page functions to common code to share between hugetlb.c and huge_memory.c. Signed-off-by: Andrea Arcangeli Acked-by: Rik van Riel Acked-by: Mel Gorman --- diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1507,5 +1507,14 @@ extern int soft_offline_page(struct page extern void dump_page(struct page *page); +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS) +extern void clear_huge_page(struct page *page, + unsigned long addr, + unsigned int pages_per_huge_page); +extern void copy_huge_page(struct page *dst, struct page *src, + unsigned long addr, struct vm_area_struct *vma, + unsigned int pages_per_huge_page); +#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */ + #endif /* __KERNEL__ */ #endif /* _LINUX_MM_H */ diff --git a/mm/hugetlb.c b/mm/hugetlb.c --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -385,70 +385,6 @@ static int vma_has_reserves(struct vm_ar return 0; } -static void clear_gigantic_page(struct page *page, - unsigned long addr, unsigned long sz) -{ - int i; - struct page *p = page; - - might_sleep(); - for (i = 0; i < sz/PAGE_SIZE; i++, p = mem_map_next(p, page, i)) { - cond_resched(); - clear_user_highpage(p, addr + i * PAGE_SIZE); - } -} -static void clear_huge_page(struct page *page, - unsigned long addr, unsigned long sz) -{ - int i; - - if (unlikely(sz/PAGE_SIZE > MAX_ORDER_NR_PAGES)) { - clear_gigantic_page(page, addr, sz); - return; - } - - might_sleep(); - for (i = 0; i < sz/PAGE_SIZE; i++) { - cond_resched(); - clear_user_highpage(page + i, addr + i * PAGE_SIZE); - } -} - -static void copy_gigantic_page(struct page *dst, struct page *src, - unsigned long addr, struct vm_area_struct *vma) -{ - int i; - struct hstate *h = hstate_vma(vma); - struct page *dst_base = dst; - struct page *src_base = src; - might_sleep(); - for (i = 0; i < pages_per_huge_page(h); ) { - cond_resched(); - copy_user_highpage(dst, src, addr + i*PAGE_SIZE, vma); - - i++; - dst = mem_map_next(dst, dst_base, i); - src = mem_map_next(src, src_base, i); - } -} -static void copy_huge_page(struct page *dst, struct page *src, - unsigned long addr, struct vm_area_struct *vma) -{ - int i; - struct hstate *h = hstate_vma(vma); - - if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) { - copy_gigantic_page(dst, src, addr, vma); - return; - } - - might_sleep(); - for (i = 0; i < pages_per_huge_page(h); i++) { - cond_resched(); - copy_user_highpage(dst + i, src + i, addr + i*PAGE_SIZE, vma); - 
} -} - static void enqueue_huge_page(struct hstate *h, struct page *page) { int nid = page_to_nid(page); @@ -2333,7 +2269,8 @@ retry_avoidcopy: return -PTR_ERR(new_page); } - copy_huge_page(new_page, old_page, address, vma); + copy_huge_page(new_page, old_page, address, vma, + pages_per_huge_page(h)); __SetPageUptodate(new_page); /* @@ -2429,7 +2366,7 @@ retry: ret = -PTR_ERR(page); goto out; } - clear_huge_page(page, address, huge_page_size(h)); + clear_huge_page(page, address, pages_per_huge_page(h)); __SetPageUptodate(page); if (vma->vm_flags & VM_MAYSHARE) { diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -3495,3 +3495,73 @@ void might_fault(void) } EXPORT_SYMBOL(might_fault); #endif + +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS) +static void clear_gigantic_page(struct page *page, + unsigned long addr, + unsigned int pages_per_huge_page) +{ + int i; + struct page *p = page; + + might_sleep(); + for (i = 0; i < pages_per_huge_page; + i++, p = mem_map_next(p, page, i)) { + cond_resched(); + clear_user_highpage(p, addr + i * PAGE_SIZE); + } +} +void clear_huge_page(struct page *page, + unsigned long addr, unsigned int pages_per_huge_page) +{ + int i; + + if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) { + clear_gigantic_page(page, addr, pages_per_huge_page); + return; + } + + might_sleep(); + for (i = 0; i < pages_per_huge_page; i++) { + cond_resched(); + clear_user_highpage(page + i, addr + i * PAGE_SIZE); + } +} + +static void copy_gigantic_page(struct page *dst, struct page *src, + unsigned long addr, + struct vm_area_struct *vma, + unsigned int pages_per_huge_page) +{ + int i; + struct page *dst_base = dst; + struct page *src_base = src; + might_sleep(); + for (i = 0; i < pages_per_huge_page; ) { + cond_resched(); + copy_user_highpage(dst, src, addr + i*PAGE_SIZE, vma); + + i++; + dst = mem_map_next(dst, dst_base, i); + src = mem_map_next(src, src_base, i); + } +} +void copy_huge_page(struct page *dst, struct page *src, + unsigned long addr, struct vm_area_struct *vma, + unsigned int pages_per_huge_page) +{ + int i; + + if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) { + copy_gigantic_page(dst, src, addr, vma, pages_per_huge_page); + return; + } + + might_sleep(); + for (i = 0; i < pages_per_huge_page; i++) { + cond_resched(); + copy_user_highpage(dst + i, src + i, addr + i*PAGE_SIZE, + vma); + } +} +#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with SMTP id 48DCC620089 for ; Thu, 1 Apr 2010 20:46:25 -0400 (EDT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: [PATCH 36 of 41] remove PG_buddy Message-Id: <95193bd9a60fcc1600b6.1270168923@v2.random> In-Reply-To: References: Date: Fri, 02 Apr 2010 02:42:03 +0200 From: Andrea Arcangeli Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, Andrew Morton Cc: Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: From: Andrea Arcangeli PG_buddy can be converted to _mapcount == -2. So the PG_compound_lock can be added to page->flags without overflowing (because of the sparse section bits increasing) with CONFIG_X86_PAE=y and CONFIG_X86_PAT=y. This also has to move the memory hotplug code from _mapcount to lru.next to avoid any risk of clashes. We can't use lru.next for PG_buddy removal, but memory hotplug can use lru.next even more easily than the mapcount instead. Signed-off-by: Andrea Arcangeli --- diff --git a/fs/proc/page.c b/fs/proc/page.c --- a/fs/proc/page.c +++ b/fs/proc/page.c @@ -116,15 +116,17 @@ u64 stable_page_flags(struct page *page) if (PageHuge(page)) u |= 1 << KPF_HUGE; + /* + * Caveats on high order pages: page->_count will only be set + * -1 on the head page; SLUB/SLQB do the same for PG_slab; + * SLOB won't set PG_slab at all on compound pages. + */ + if (PageBuddy(page)) + u |= 1 << KPF_BUDDY; + u |= kpf_copy_bit(k, KPF_LOCKED, PG_locked); - /* - * Caveats on high order pages: - * PG_buddy will only be set on the head page; SLUB/SLQB do the same - * for PG_slab; SLOB won't set PG_slab at all on compound pages. - */ u |= kpf_copy_bit(k, KPF_SLAB, PG_slab); - u |= kpf_copy_bit(k, KPF_BUDDY, PG_buddy); u |= kpf_copy_bit(k, KPF_ERROR, PG_error); u |= kpf_copy_bit(k, KPF_DIRTY, PG_dirty); diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -13,12 +13,16 @@ struct mem_section; #ifdef CONFIG_MEMORY_HOTPLUG /* - * Types for free bootmem. - * The normal smallest mapcount is -1. Here is smaller value than it. + * Types for free bootmem stored in page->lru.next. These have to be in + * some random range in unsigned long space for debugging purposes. */ -#define SECTION_INFO (-1 - 1) -#define MIX_SECTION_INFO (-1 - 2) -#define NODE_INFO (-1 - 3) +enum { + MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE = 12, + SECTION_INFO = MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE, + MIX_SECTION_INFO, + NODE_INFO, + MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE = NODE_INFO, +}; /* * pgdat resizing functions diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -358,6 +358,27 @@ static inline void init_page_count(struc atomic_set(&page->_count, 1); } +/* + * PageBuddy() indicate that the page is free and in the buddy system + * (see mm/page_alloc.c). 
+ */ +static inline int PageBuddy(struct page *page) +{ + return atomic_read(&page->_mapcount) == -2; +} + +static inline void __SetPageBuddy(struct page *page) +{ + VM_BUG_ON(atomic_read(&page->_mapcount) != -1); + atomic_set(&page->_mapcount, -2); +} + +static inline void __ClearPageBuddy(struct page *page) +{ + VM_BUG_ON(!PageBuddy(page)); + atomic_set(&page->_mapcount, -1); +} + void put_page(struct page *page); void put_pages_list(struct list_head *pages); diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -48,9 +48,6 @@ * struct page (these bits with information) are always mapped into kernel * address space... * - * PG_buddy is set to indicate that the page is free and in the buddy system - * (see mm/page_alloc.c). - * * PG_hwpoison indicates that a page got corrupted in hardware and contains * data with incorrect ECC bits that triggered a machine check. Accessing is * not safe since it may cause another machine check. Don't touch! @@ -96,7 +93,6 @@ enum pageflags { PG_swapcache, /* Swap page: swp_entry_t in private */ PG_mappedtodisk, /* Has blocks allocated on-disk */ PG_reclaim, /* To be reclaimed asap */ - PG_buddy, /* Page is free, on buddy lists */ PG_swapbacked, /* Page is backed by RAM/swap */ PG_unevictable, /* Page is "unevictable" */ #ifdef CONFIG_MMU @@ -235,7 +231,6 @@ PAGEFLAG(OwnerPriv1, owner_priv_1) TESTC * risky: they bypass page accounting. */ TESTPAGEFLAG(Writeback, writeback) TESTSCFLAG(Writeback, writeback) -__PAGEFLAG(Buddy, buddy) PAGEFLAG(MappedToDisk, mappedtodisk) /* PG_readahead is only used for file reads; PG_reclaim is only for writes */ @@ -430,7 +425,7 @@ static inline void ClearPageCompound(str #define PAGE_FLAGS_CHECK_AT_FREE \ (1 << PG_lru | 1 << PG_locked | \ 1 << PG_private | 1 << PG_private_2 | \ - 1 << PG_buddy | 1 << PG_writeback | 1 << PG_reserved | \ + 1 << PG_writeback | 1 << PG_reserved | \ 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \ 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \ __PG_COMPOUND_LOCK) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -65,9 +65,10 @@ static void release_memory_resource(stru #ifdef CONFIG_MEMORY_HOTPLUG_SPARSE #ifndef CONFIG_SPARSEMEM_VMEMMAP -static void get_page_bootmem(unsigned long info, struct page *page, int type) +static void get_page_bootmem(unsigned long info, struct page *page, + unsigned long type) { - atomic_set(&page->_mapcount, type); + page->lru.next = (struct list_head *) type; SetPagePrivate(page); set_page_private(page, info); atomic_inc(&page->_count); @@ -77,15 +78,16 @@ static void get_page_bootmem(unsigned lo * so use __ref to tell modpost not to generate a warning */ void __ref put_page_bootmem(struct page *page) { - int type; + unsigned long type; - type = atomic_read(&page->_mapcount); - BUG_ON(type >= -1); + type = (unsigned long) page->lru.next; + BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE || + type > MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE); if (atomic_dec_return(&page->_count) == 1) { ClearPagePrivate(page); set_page_private(page, 0); - reset_page_mapcount(page); + INIT_LIST_HEAD(&page->lru); __free_pages_bootmem(page, 0); } diff --git a/mm/page_alloc.c b/mm/page_alloc.c --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -426,8 +426,8 @@ __find_combined_index(unsigned long page * (c) a page and its buddy have the same order && * (d) a page and its buddy are in the same zone. 
* - * For recording whether a page is in the buddy system, we use PG_buddy. - * Setting, clearing, and testing PG_buddy is serialized by zone->lock. + * For recording whether a page is in the buddy system, we set ->_mapcount -2. + * Setting, clearing, and testing _mapcount -2 is serialized by zone->lock. * * For recording page's order, we use page_private(page). */ @@ -460,7 +460,7 @@ static inline int page_is_buddy(struct p * as necessary, plus some accounting needed to play nicely with other * parts of the VM system. * At each level, we keep a list of pages, which are heads of continuous - * free pages of length of (1 << order) and marked with PG_buddy. Page's + * free pages of length of (1 << order) and marked with _mapcount -2. Page's * order is recorded in page_private(page) field. * So when we are allocating or freeing one, we can derive the state of the * other. That is, if we allocate a small block, and both were @@ -5251,7 +5251,6 @@ static struct trace_print_flags pageflag {1UL << PG_swapcache, "swapcache" }, {1UL << PG_mappedtodisk, "mappedtodisk" }, {1UL << PG_reclaim, "reclaim" }, - {1UL << PG_buddy, "buddy" }, {1UL << PG_swapbacked, "swapbacked" }, {1UL << PG_unevictable, "unevictable" }, #ifdef CONFIG_MMU diff --git a/mm/sparse.c b/mm/sparse.c --- a/mm/sparse.c +++ b/mm/sparse.c @@ -670,10 +670,10 @@ static void __kfree_section_memmap(struc static void free_map_bootmem(struct page *page, unsigned long nr_pages) { unsigned long maps_section_nr, removing_section_nr, i; - int magic; + unsigned long magic; for (i = 0; i < nr_pages; i++, page++) { - magic = atomic_read(&page->_mapcount); + magic = (unsigned long) page->lru.next; BUG_ON(magic == NODE_INFO); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with SMTP id 91CEB620084 for ; Thu, 1 Apr 2010 20:46:26 -0400 (EDT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: [PATCH 33 of 41] transparent hugepage vmstat Message-Id: In-Reply-To: References: Date: Fri, 02 Apr 2010 02:42:00 +0200 From: Andrea Arcangeli Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, Andrew Morton Cc: Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: From: Andrea Arcangeli Add hugepage stat information to /proc/vmstat and /proc/meminfo. 
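With made-up numbers, the new output looks like AnonHugePages: 8192 kB in /proc/meminfo and nr_anon_transparent_hugepages 4 in /proc/vmstat; 8192 kB == 4 hugepages * 512 (HPAGE_PMD_NR) pages * 4 kB, assuming 2M hugepages on x86-64.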
Signed-off-by: Andrea Arcangeli Acked-by: Rik van Riel --- diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c --- a/fs/proc/meminfo.c +++ b/fs/proc/meminfo.c @@ -101,6 +101,9 @@ static int meminfo_proc_show(struct seq_ #ifdef CONFIG_MEMORY_FAILURE "HardwareCorrupted: %5lu kB\n" #endif +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + "AnonHugePages: %8lu kB\n" +#endif , K(i.totalram), K(i.freeram), @@ -151,6 +154,10 @@ static int meminfo_proc_show(struct seq_ #ifdef CONFIG_MEMORY_FAILURE ,atomic_long_read(&mce_bad_pages) << (PAGE_SHIFT - 10) #endif +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + ,K(global_page_state(NR_ANON_TRANSPARENT_HUGEPAGES) * + HPAGE_PMD_NR) +#endif ); hugetlb_report_meminfo(m); diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -112,6 +112,7 @@ enum zone_stat_item { NUMA_LOCAL, /* allocation from local node */ NUMA_OTHER, /* allocation from other node */ #endif + NR_ANON_TRANSPARENT_HUGEPAGES, NR_VM_ZONE_STAT_ITEMS }; /* diff --git a/mm/huge_memory.c b/mm/huge_memory.c --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -726,6 +726,9 @@ static void __split_huge_page_refcount(s put_page(page_tail); } + __dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES); + __mod_zone_page_state(zone, NR_ANON_PAGES, HPAGE_PMD_NR); + ClearPageCompound(page); compound_unlock(page); spin_unlock_irq(&zone->lru_lock); diff --git a/mm/rmap.c b/mm/rmap.c --- a/mm/rmap.c +++ b/mm/rmap.c @@ -795,8 +795,13 @@ void page_add_anon_rmap(struct page *pag struct vm_area_struct *vma, unsigned long address) { int first = atomic_inc_and_test(&page->_mapcount); - if (first) - __inc_zone_page_state(page, NR_ANON_PAGES); + if (first) { + if (!PageTransHuge(page)) + __inc_zone_page_state(page, NR_ANON_PAGES); + else + __inc_zone_page_state(page, + NR_ANON_TRANSPARENT_HUGEPAGES); + } if (unlikely(PageKsm(page))) return; @@ -824,7 +829,10 @@ void page_add_new_anon_rmap(struct page VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end); SetPageSwapBacked(page); atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */ - __inc_zone_page_state(page, NR_ANON_PAGES); + if (!PageTransHuge(page)) + __inc_zone_page_state(page, NR_ANON_PAGES); + else + __inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES); __page_set_anon_rmap(page, vma, address); if (page_evictable(page, vma)) lru_cache_add_lru(page, LRU_ACTIVE_ANON); @@ -871,7 +879,11 @@ void page_remove_rmap(struct page *page) } if (PageAnon(page)) { mem_cgroup_uncharge_page(page); - __dec_zone_page_state(page, NR_ANON_PAGES); + if (!PageTransHuge(page)) + __dec_zone_page_state(page, NR_ANON_PAGES); + else + __dec_zone_page_state(page, + NR_ANON_TRANSPARENT_HUGEPAGES); } else { __dec_zone_page_state(page, NR_FILE_MAPPED); mem_cgroup_update_file_mapped(page, -1); diff --git a/mm/vmstat.c b/mm/vmstat.c --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -657,6 +657,9 @@ static const char * const vmstat_text[] "numa_local", "numa_other", #endif +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + "nr_anon_transparent_hugepages", +#endif #ifdef CONFIG_VM_EVENT_COUNTERS "pgpgin", -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id C0021620089 for ; Thu, 1 Apr 2010 20:46:26 -0400 (EDT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: [PATCH 06 of 41] clear compound mapping Message-Id: <2b0bc568aa8688c2f92e.1270168893@v2.random> In-Reply-To: References: Date: Fri, 02 Apr 2010 02:41:33 +0200 From: Andrea Arcangeli Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, Andrew Morton Cc: Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: From: Andrea Arcangeli Clear the compound mapping for anonymous compound pages, as already happens for regular anonymous pages. Signed-off-by: Andrea Arcangeli Acked-by: Rik van Riel Acked-by: Mel Gorman --- diff --git a/mm/page_alloc.c b/mm/page_alloc.c --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -629,6 +629,8 @@ static void __free_pages_ok(struct page trace_mm_page_free_direct(page, order); kmemcheck_free_shadow(page, order); + if (PageAnon(page)) + page->mapping = NULL; for (i = 0 ; i < (1 << order) ; ++i) bad += free_pages_check(page + i); if (bad) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id F2C5662008D for ; Thu, 1 Apr 2010 20:46:26 -0400 (EDT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: [PATCH 37 of 41] add x86 32bit support Message-Id: <500f128ac519bfc4fde9.1270168924@v2.random> In-Reply-To: References: Date: Fri, 02 Apr 2010 02:42:04 +0200 From: Andrea Arcangeli Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, Andrew Morton Cc: Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: From: Johannes Weiner Add support for transparent hugepages to x86 32bit. VM_HUGEPAGE shares the same VM_ bitflag with VM_MAPPED_COPY; the two users can never coexist because mm/nommu.c will never support transparent hugepages.
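The subtle part of this patch is the SMP version of native_pmdp_get_and_clear() on PAE, where the 64bit pmd cannot be read and cleared in one atomic operation. A commented restatement of the helper added below; the rationale spelled out in the comments is an interpretation of why the partial clear is safe, not text from the patch:

static inline pmd_t native_pmdp_get_and_clear(pmd_t *pmdp)
{
	union split_pmd res, *orig = (union split_pmd *)pmdp;

	/*
	 * The present bit lives in the low word, and xchg acts as a
	 * full barrier: once the low word reads zero no hardware
	 * walker can start using the entry, so the high word can
	 * then be read and cleared without atomicity.
	 */
	res.pmd_low = xchg(&orig->pmd_low, 0);
	res.pmd_high = orig->pmd_high;
	orig->pmd_high = 0;

	return res.pmd;
}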
Signed-off-by: Johannes Weiner Signed-off-by: Andrea Arcangeli Reviewed-by: Rik van Riel --- diff --git a/arch/x86/include/asm/pgtable-2level.h b/arch/x86/include/asm/pgtable-2level.h --- a/arch/x86/include/asm/pgtable-2level.h +++ b/arch/x86/include/asm/pgtable-2level.h @@ -46,6 +46,15 @@ static inline pte_t native_ptep_get_and_ #define native_ptep_get_and_clear(xp) native_local_ptep_get_and_clear(xp) #endif +#ifdef CONFIG_SMP +static inline pmd_t native_pmdp_get_and_clear(pmd_t *xp) +{ + return __pmd(xchg((pmdval_t *)xp, 0)); +} +#else +#define native_pmdp_get_and_clear(xp) native_local_pmdp_get_and_clear(xp) +#endif + /* * Bits _PAGE_BIT_PRESENT, _PAGE_BIT_FILE and _PAGE_BIT_PROTNONE are taken, * split up the 29 bits of offset into this range: diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h --- a/arch/x86/include/asm/pgtable-3level.h +++ b/arch/x86/include/asm/pgtable-3level.h @@ -104,6 +104,29 @@ static inline pte_t native_ptep_get_and_ #define native_ptep_get_and_clear(xp) native_local_ptep_get_and_clear(xp) #endif +#ifdef CONFIG_SMP +union split_pmd { + struct { + u32 pmd_low; + u32 pmd_high; + }; + pmd_t pmd; +}; +static inline pmd_t native_pmdp_get_and_clear(pmd_t *pmdp) +{ + union split_pmd res, *orig = (union split_pmd *)pmdp; + + /* xchg acts as a barrier before setting of the high bits */ + res.pmd_low = xchg(&orig->pmd_low, 0); + res.pmd_high = orig->pmd_high; + orig->pmd_high = 0; + + return res.pmd; +} +#else +#define native_pmdp_get_and_clear(xp) native_local_pmdp_get_and_clear(xp) +#endif + /* * Bits 0, 6 and 7 are taken in the low part of the pte, * put the 32 bits of offset into the high part. diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -95,6 +95,11 @@ static inline int pte_young(pte_t pte) return pte_flags(pte) & _PAGE_ACCESSED; } +static inline int pmd_young(pmd_t pmd) +{ + return pmd_flags(pmd) & _PAGE_ACCESSED; +} + static inline int pte_write(pte_t pte) { return pte_flags(pte) & _PAGE_RW; @@ -143,6 +148,18 @@ static inline int pmd_large(pmd_t pte) (_PAGE_PSE | _PAGE_PRESENT); } +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +static inline int pmd_trans_splitting(pmd_t pmd) +{ + return pmd_val(pmd) & _PAGE_SPLITTING; +} + +static inline int pmd_trans_huge(pmd_t pmd) +{ + return pmd_val(pmd) & _PAGE_PSE; +} +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ + static inline pte_t pte_set_flags(pte_t pte, pteval_t set) { pteval_t v = native_pte_val(pte); @@ -217,6 +234,55 @@ static inline pte_t pte_mkspecial(pte_t return pte_set_flags(pte, _PAGE_SPECIAL); } +static inline pmd_t pmd_set_flags(pmd_t pmd, pmdval_t set) +{ + pmdval_t v = native_pmd_val(pmd); + + return __pmd(v | set); +} + +static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear) +{ + pmdval_t v = native_pmd_val(pmd); + + return __pmd(v & ~clear); +} + +static inline pmd_t pmd_mkold(pmd_t pmd) +{ + return pmd_clear_flags(pmd, _PAGE_ACCESSED); +} + +static inline pmd_t pmd_wrprotect(pmd_t pmd) +{ + return pmd_clear_flags(pmd, _PAGE_RW); +} + +static inline pmd_t pmd_mkdirty(pmd_t pmd) +{ + return pmd_set_flags(pmd, _PAGE_DIRTY); +} + +static inline pmd_t pmd_mkhuge(pmd_t pmd) +{ + return pmd_set_flags(pmd, _PAGE_PSE); +} + +static inline pmd_t pmd_mkyoung(pmd_t pmd) +{ + return pmd_set_flags(pmd, _PAGE_ACCESSED); +} + +static inline pmd_t pmd_mkwrite(pmd_t pmd) +{ + return pmd_set_flags(pmd, _PAGE_RW); +} + +static inline pmd_t pmd_mknotpresent(pmd_t pmd) +{ + return 
pmd_clear_flags(pmd, _PAGE_PRESENT); +} + /* * Mask out unsupported bits in a present pgprot. Non-present pgprots * can use those bits for other purposes, so leave them be. @@ -525,6 +591,14 @@ static inline pte_t native_local_ptep_ge return res; } +static inline pmd_t native_local_pmdp_get_and_clear(pmd_t *pmdp) +{ + pmd_t res = *pmdp; + + native_pmd_clear(pmdp); + return res; +} + static inline void native_set_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep , pte_t pte) { @@ -612,6 +686,49 @@ static inline void ptep_set_wrprotect(st pte_update(mm, addr, ptep); } +#define mk_pmd(page, pgprot) pfn_pmd(page_to_pfn(page), (pgprot)) + +#define __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS +extern int pmdp_set_access_flags(struct vm_area_struct *vma, + unsigned long address, pmd_t *pmdp, + pmd_t entry, int dirty); + +#define __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG +extern int pmdp_test_and_clear_young(struct vm_area_struct *vma, + unsigned long addr, pmd_t *pmdp); + +#define __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH +extern int pmdp_clear_flush_young(struct vm_area_struct *vma, + unsigned long address, pmd_t *pmdp); + + +#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH +extern void pmdp_splitting_flush(struct vm_area_struct *vma, + unsigned long addr, pmd_t *pmdp); + +#define __HAVE_ARCH_PMD_WRITE +static inline int pmd_write(pmd_t pmd) +{ + return pmd_flags(pmd) & _PAGE_RW; +} + +#define __HAVE_ARCH_PMDP_GET_AND_CLEAR +static inline pmd_t pmdp_get_and_clear(struct mm_struct *mm, unsigned long addr, + pmd_t *pmdp) +{ + pmd_t pmd = native_pmdp_get_and_clear(pmdp); + pmd_update(mm, addr, pmdp); + return pmd; +} + +#define __HAVE_ARCH_PMDP_SET_WRPROTECT +static inline void pmdp_set_wrprotect(struct mm_struct *mm, + unsigned long addr, pmd_t *pmdp) +{ + clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp); + pmd_update(mm, addr, pmdp); +} + /* * clone_pgd_range(pgd_t *dst, pgd_t *src, int count); * diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h --- a/arch/x86/include/asm/pgtable_64.h +++ b/arch/x86/include/asm/pgtable_64.h @@ -182,115 +182,6 @@ extern void cleanup_highmap(void); #define __HAVE_ARCH_PTE_SAME -#ifdef CONFIG_TRANSPARENT_HUGEPAGE -static inline int pmd_trans_splitting(pmd_t pmd) -{ - return pmd_val(pmd) & _PAGE_SPLITTING; -} - -static inline int pmd_trans_huge(pmd_t pmd) -{ - return pmd_val(pmd) & _PAGE_PSE; -} -#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ - -#define mk_pmd(page, pgprot) pfn_pmd(page_to_pfn(page), (pgprot)) - -#define __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS -extern int pmdp_set_access_flags(struct vm_area_struct *vma, - unsigned long address, pmd_t *pmdp, - pmd_t entry, int dirty); - -#define __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG -extern int pmdp_test_and_clear_young(struct vm_area_struct *vma, - unsigned long addr, pmd_t *pmdp); - -#define __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH -extern int pmdp_clear_flush_young(struct vm_area_struct *vma, - unsigned long address, pmd_t *pmdp); - - -#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH -extern void pmdp_splitting_flush(struct vm_area_struct *vma, - unsigned long addr, pmd_t *pmdp); - -#define __HAVE_ARCH_PMD_WRITE -static inline int pmd_write(pmd_t pmd) -{ - return pmd_flags(pmd) & _PAGE_RW; -} - -#define __HAVE_ARCH_PMDP_GET_AND_CLEAR -static inline pmd_t pmdp_get_and_clear(struct mm_struct *mm, unsigned long addr, - pmd_t *pmdp) -{ - pmd_t pmd = native_pmdp_get_and_clear(pmdp); - pmd_update(mm, addr, pmdp); - return pmd; -} - -#define __HAVE_ARCH_PMDP_SET_WRPROTECT -static inline void pmdp_set_wrprotect(struct mm_struct *mm, - 
unsigned long addr, pmd_t *pmdp) -{ - clear_bit(_PAGE_BIT_RW, (unsigned long *)&pmdp->pmd); - pmd_update(mm, addr, pmdp); -} - -static inline int pmd_young(pmd_t pmd) -{ - return pmd_flags(pmd) & _PAGE_ACCESSED; -} - -static inline pmd_t pmd_set_flags(pmd_t pmd, pmdval_t set) -{ - pmdval_t v = native_pmd_val(pmd); - - return native_make_pmd(v | set); -} - -static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear) -{ - pmdval_t v = native_pmd_val(pmd); - - return native_make_pmd(v & ~clear); -} - -static inline pmd_t pmd_mkold(pmd_t pmd) -{ - return pmd_clear_flags(pmd, _PAGE_ACCESSED); -} - -static inline pmd_t pmd_wrprotect(pmd_t pmd) -{ - return pmd_clear_flags(pmd, _PAGE_RW); -} - -static inline pmd_t pmd_mkdirty(pmd_t pmd) -{ - return pmd_set_flags(pmd, _PAGE_DIRTY); -} - -static inline pmd_t pmd_mkhuge(pmd_t pmd) -{ - return pmd_set_flags(pmd, _PAGE_PSE); -} - -static inline pmd_t pmd_mkyoung(pmd_t pmd) -{ - return pmd_set_flags(pmd, _PAGE_ACCESSED); -} - -static inline pmd_t pmd_mkwrite(pmd_t pmd) -{ - return pmd_set_flags(pmd, _PAGE_RW); -} - -static inline pmd_t pmd_mknotpresent(pmd_t pmd) -{ - return pmd_clear_flags(pmd, _PAGE_PRESENT); -} - #endif /* !__ASSEMBLY__ */ #endif /* _ASM_X86_PGTABLE_64_H */ diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -351,7 +351,7 @@ int pmdp_test_and_clear_young(struct vm_ if (pmd_young(*pmdp)) ret = test_and_clear_bit(_PAGE_BIT_ACCESSED, - (unsigned long *) &pmdp->pmd); + (unsigned long *)pmdp); if (ret) pmd_update(vma->vm_mm, addr, pmdp); @@ -393,7 +393,7 @@ void pmdp_splitting_flush(struct vm_area int set; VM_BUG_ON(address & ~HPAGE_PMD_MASK); set = !test_and_set_bit(_PAGE_BIT_SPLITTING, - (unsigned long *)&pmdp->pmd); + (unsigned long *)pmdp); if (set) { pmd_update(vma->vm_mm, address, pmdp); /* need tlb flush only to serialize against gup-fast */ diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -98,7 +98,11 @@ extern unsigned int kobjsize(const void #define VM_NORESERVE 0x00200000 /* should the VM suppress accounting */ #define VM_HUGETLB 0x00400000 /* Huge TLB Page VM */ #define VM_NONLINEAR 0x00800000 /* Is non-linear (remap_file_pages) */ +#ifndef CONFIG_TRANSPARENT_HUGEPAGE #define VM_MAPPED_COPY 0x01000000 /* T if mapped copy of data (nommu mmap) */ +#else +#define VM_HUGEPAGE 0x01000000 /* MADV_HUGEPAGE marked this vma */ +#endif #define VM_INSERTPAGE 0x02000000 /* The vma has had "vm_insert_page()" done on it */ #define VM_ALWAYSDUMP 0x04000000 /* Always include in core dumps */ @@ -107,9 +111,6 @@ extern unsigned int kobjsize(const void #define VM_SAO 0x20000000 /* Strong Access Ordering (powerpc) */ #define VM_PFN_AT_MMAP 0x40000000 /* PFNMAP vma that is fully mapped at mmap time */ #define VM_MERGEABLE 0x80000000 /* KSM may merge identical pages */ -#if BITS_PER_LONG > 32 -#define VM_HUGEPAGE 0x100000000UL /* MADV_HUGEPAGE marked this vma */ -#endif #ifndef VM_STACK_DEFAULT_FLAGS /* arch can override this */ #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS diff --git a/mm/Kconfig b/mm/Kconfig --- a/mm/Kconfig +++ b/mm/Kconfig @@ -290,7 +290,7 @@ config NOMMU_INITIAL_TRIM_EXCESS config TRANSPARENT_HUGEPAGE bool "Transparent Hugepage support" if EMBEDDED - depends on X86_64 + depends on X86 default y help Transparent Hugepages allows the kernel to use huge pages and -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with SMTP id 350EA62008E for ; Thu, 1 Apr 2010 20:46:27 -0400 (EDT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: [PATCH 21 of 41] split_huge_page_mm/vma Message-Id: In-Reply-To: References: Date: Fri, 02 Apr 2010 02:41:48 +0200 From: Andrea Arcangeli Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, Andrew Morton Cc: Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: From: Andrea Arcangeli split_huge_page_pmd compat code. Each one of these call sites would need to be expanded into hundreds of lines of complex code without a fully reliable split_huge_page_pmd design. Signed-off-by: Andrea Arcangeli Acked-by: Rik van Riel Acked-by: Mel Gorman Signed-off-by: Johannes Weiner --- diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c --- a/arch/x86/kernel/vm86_32.c +++ b/arch/x86/kernel/vm86_32.c @@ -179,6 +179,7 @@ static void mark_screen_rdonly(struct mm if (pud_none_or_clear_bad(pud)) goto out; pmd = pmd_offset(pud, 0xA0000); + split_huge_page_pmd(mm, pmd); if (pmd_none_or_clear_bad(pmd)) goto out; pte = pte_offset_map_lock(mm, pmd, 0xA0000, &ptl); diff --git a/mm/mempolicy.c b/mm/mempolicy.c --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -443,6 +443,7 @@ static inline int check_pmd_range(struct pmd = pmd_offset(pud, addr); do { next = pmd_addr_end(addr, end); + split_huge_page_pmd(vma->vm_mm, pmd); if (pmd_none_or_clear_bad(pmd)) continue; if (check_pte_range(vma, pmd, addr, next, nodes, diff --git a/mm/mincore.c b/mm/mincore.c --- a/mm/mincore.c +++ b/mm/mincore.c @@ -154,6 +154,7 @@ static void mincore_pmd_range(struct vm_ pmd = pmd_offset(pud, addr); do { next = pmd_addr_end(addr, end); + split_huge_page_pmd(vma->vm_mm, pmd); if (pmd_none_or_clear_bad(pmd)) mincore_unmapped_range(vma, addr, next, vec); else diff --git a/mm/mprotect.c b/mm/mprotect.c --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -89,6 +89,7 @@ static inline void change_pmd_range(stru pmd = pmd_offset(pud, addr); do { next = pmd_addr_end(addr, end); + split_huge_page_pmd(mm, pmd); if (pmd_none_or_clear_bad(pmd)) continue; change_pte_range(mm, pmd, addr, next, newprot, dirty_accountable); diff --git a/mm/mremap.c b/mm/mremap.c --- a/mm/mremap.c +++ b/mm/mremap.c @@ -42,6 +42,7 @@ static pmd_t *get_old_pmd(struct mm_stru return NULL; pmd = pmd_offset(pud, addr); + split_huge_page_pmd(mm, pmd); if (pmd_none_or_clear_bad(pmd)) return NULL; diff --git a/mm/pagewalk.c b/mm/pagewalk.c --- a/mm/pagewalk.c +++ b/mm/pagewalk.c @@ -34,6 +34,7 @@ static int walk_pmd_range(pud_t *pud, un pmd = pmd_offset(pud, addr); do { next = pmd_addr_end(addr, end); + split_huge_page_pmd(walk->mm, pmd); if (pmd_none_or_clear_bad(pmd)) { if (walk->pte_hole) err = walk->pte_hole(addr, next, walk); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
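For context on the one liners added above, split_huge_page_pmd() itself was introduced earlier in this series; it is shaped roughly like this (a sketch from memory of the earlier patch, not quoted verbatim):

#define split_huge_page_pmd(__mm, __pmd)			\
	do {							\
		pmd_t *____pmd = (__pmd);			\
		if (unlikely(pmd_trans_huge(*____pmd)))		\
			__split_huge_page_pmd(__mm, ____pmd);	\
	} while (0)

so in the common case of a regular pmd the only cost added to these paths is one pmd_trans_huge() check.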
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with SMTP id 4118E62008F for ; Thu, 1 Apr 2010 20:46:27 -0400 (EDT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: [PATCH 31 of 41] memcg compound Message-Id: <1bc2a33b3bbea7bf2082.1270168918@v2.random> In-Reply-To: References: Date: Fri, 02 Apr 2010 02:41:58 +0200 From: Andrea Arcangeli Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, Andrew Morton Cc: Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: From: Andrea Arcangeli Teach memcg to charge/uncharge compound pages. Signed-off-by: Andrea Arcangeli Acked-by: Rik van Riel --- diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -4,6 +4,10 @@ NOTE: The Memory Resource Controller has to as the memory controller in this document. Do not confuse memory controller used here with the memory controller that is used in hardware. +NOTE: When in this documentation we refer to PAGE_SIZE, we actually +mean the real page size of the page being accounted which is bigger than +PAGE_SIZE for compound pages. + Salient features a. Enable control of Anonymous, Page Cache (mapped and unmapped) and diff --git a/mm/memcontrol.c b/mm/memcontrol.c --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1577,12 +1577,14 @@ static int __cpuinit memcg_stock_cpu_cal * oom-killer can be invoked. 
*/ static int __mem_cgroup_try_charge(struct mm_struct *mm, - gfp_t gfp_mask, struct mem_cgroup **memcg, bool oom) + gfp_t gfp_mask, + struct mem_cgroup **memcg, bool oom, + int page_size) { struct mem_cgroup *mem, *mem_over_limit; int nr_retries = MEM_CGROUP_RECLAIM_RETRIES; struct res_counter *fail_res; - int csize = CHARGE_SIZE; + int csize = max(CHARGE_SIZE, (unsigned long) page_size); /* * Unlike gloval-vm's OOM-kill, we're not in memory shortage @@ -1617,8 +1619,9 @@ static int __mem_cgroup_try_charge(struc int ret = 0; unsigned long flags = 0; - if (consume_stock(mem)) - goto done; + if (page_size == PAGE_SIZE) + if (consume_stock(mem)) + goto done; ret = res_counter_charge(&mem->res, csize, &fail_res); if (likely(!ret)) { @@ -1638,8 +1641,8 @@ static int __mem_cgroup_try_charge(struc res); /* reduce request size and retry */ - if (csize > PAGE_SIZE) { - csize = PAGE_SIZE; + if (csize > page_size) { + csize = page_size; continue; } if (!(gfp_mask & __GFP_WAIT)) @@ -1715,8 +1718,10 @@ static int __mem_cgroup_try_charge(struc goto bypass; } } - if (csize > PAGE_SIZE) - refill_stock(mem, csize - PAGE_SIZE); + if (csize > page_size) + refill_stock(mem, csize - page_size); + if (page_size != PAGE_SIZE) + __css_get(&mem->css, page_size); done: return 0; nomem: @@ -1746,9 +1751,10 @@ static void __mem_cgroup_cancel_charge(s /* we don't need css_put for root */ } -static void mem_cgroup_cancel_charge(struct mem_cgroup *mem) +static void mem_cgroup_cancel_charge(struct mem_cgroup *mem, + int page_size) { - __mem_cgroup_cancel_charge(mem, 1); + __mem_cgroup_cancel_charge(mem, page_size >> PAGE_SHIFT); } /* @@ -1804,8 +1810,9 @@ struct mem_cgroup *try_get_mem_cgroup_fr */ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem, - struct page_cgroup *pc, - enum charge_type ctype) + struct page_cgroup *pc, + enum charge_type ctype, + int page_size) { /* try_charge() can return NULL to *memcg, taking care of it. */ if (!mem) @@ -1814,7 +1821,7 @@ static void __mem_cgroup_commit_charge(s lock_page_cgroup(pc); if (unlikely(PageCgroupUsed(pc))) { unlock_page_cgroup(pc); - mem_cgroup_cancel_charge(mem); + mem_cgroup_cancel_charge(mem, page_size); return; } @@ -1891,7 +1898,7 @@ static void __mem_cgroup_move_account(st mem_cgroup_charge_statistics(from, pc, false); if (uncharge) /* This is not "cancel", but cancel_charge does all we need. 
*/ - mem_cgroup_cancel_charge(from); + mem_cgroup_cancel_charge(from, PAGE_SIZE); /* caller should have done css_get */ pc->mem_cgroup = to; @@ -1952,13 +1959,14 @@ static int mem_cgroup_move_parent(struct goto put; parent = mem_cgroup_from_cont(pcg); - ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false); + ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, + PAGE_SIZE); if (ret || !parent) goto put_back; ret = mem_cgroup_move_account(pc, child, parent, true); if (ret) - mem_cgroup_cancel_charge(parent); + mem_cgroup_cancel_charge(parent, PAGE_SIZE); put_back: putback_lru_page(page); put: @@ -1980,6 +1988,10 @@ static int mem_cgroup_charge_common(stru struct mem_cgroup *mem; struct page_cgroup *pc; int ret; + int page_size = PAGE_SIZE; + + if (PageTransHuge(page)) + page_size <<= compound_order(page); pc = lookup_page_cgroup(page); /* can happen at boot */ @@ -1988,11 +2000,11 @@ static int mem_cgroup_charge_common(stru prefetchw(pc); mem = memcg; - ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true); + ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true, page_size); if (ret || !mem) return ret; - __mem_cgroup_commit_charge(mem, pc, ctype); + __mem_cgroup_commit_charge(mem, pc, ctype, page_size); return 0; } @@ -2001,8 +2013,6 @@ int mem_cgroup_newpage_charge(struct pag { if (mem_cgroup_disabled()) return 0; - if (PageCompound(page)) - return 0; /* * If already mapped, we don't have to account. * If page cache, page->mapping has address_space. @@ -2015,7 +2025,7 @@ int mem_cgroup_newpage_charge(struct pag if (unlikely(!mm)) mm = &init_mm; return mem_cgroup_charge_common(page, mm, gfp_mask, - MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL); + MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL); } static void @@ -2108,14 +2118,14 @@ int mem_cgroup_try_charge_swapin(struct if (!mem) goto charge_cur_mm; *ptr = mem; - ret = __mem_cgroup_try_charge(NULL, mask, ptr, true); + ret = __mem_cgroup_try_charge(NULL, mask, ptr, true, PAGE_SIZE); /* drop extra refcnt from tryget */ css_put(&mem->css); return ret; charge_cur_mm: if (unlikely(!mm)) mm = &init_mm; - return __mem_cgroup_try_charge(mm, mask, ptr, true); + return __mem_cgroup_try_charge(mm, mask, ptr, true, PAGE_SIZE); } static void @@ -2131,7 +2141,7 @@ __mem_cgroup_commit_charge_swapin(struct cgroup_exclude_rmdir(&ptr->css); pc = lookup_page_cgroup(page); mem_cgroup_lru_del_before_commit_swapcache(page); - __mem_cgroup_commit_charge(ptr, pc, ctype); + __mem_cgroup_commit_charge(ptr, pc, ctype, PAGE_SIZE); mem_cgroup_lru_add_after_commit_swapcache(page); /* * Now swap is on-memory. 
This means this page may be @@ -2180,11 +2190,12 @@ void mem_cgroup_cancel_charge_swapin(str return; if (!mem) return; - mem_cgroup_cancel_charge(mem); + mem_cgroup_cancel_charge(mem, PAGE_SIZE); } static void -__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype) +__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype, + int page_size) { struct memcg_batch_info *batch = NULL; bool uncharge_memsw = true; @@ -2219,14 +2230,14 @@ __do_uncharge(struct mem_cgroup *mem, co if (batch->memcg != mem) goto direct_uncharge; /* remember freed charge and uncharge it later */ - batch->bytes += PAGE_SIZE; + batch->bytes += page_size; if (uncharge_memsw) - batch->memsw_bytes += PAGE_SIZE; + batch->memsw_bytes += page_size; return; direct_uncharge: - res_counter_uncharge(&mem->res, PAGE_SIZE); + res_counter_uncharge(&mem->res, page_size); if (uncharge_memsw) - res_counter_uncharge(&mem->memsw, PAGE_SIZE); + res_counter_uncharge(&mem->memsw, page_size); if (unlikely(batch->memcg != mem)) memcg_oom_recover(mem); return; @@ -2241,6 +2252,10 @@ __mem_cgroup_uncharge_common(struct page struct page_cgroup *pc; struct mem_cgroup *mem = NULL; struct mem_cgroup_per_zone *mz; + int page_size = PAGE_SIZE; + + if (PageTransHuge(page)) + page_size <<= compound_order(page); if (mem_cgroup_disabled()) return NULL; @@ -2280,7 +2295,7 @@ __mem_cgroup_uncharge_common(struct page } if (!mem_cgroup_is_root(mem)) - __do_uncharge(mem, ctype); + __do_uncharge(mem, ctype, page_size); if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) mem_cgroup_swap_statistics(mem, true); mem_cgroup_charge_statistics(mem, pc, false); @@ -2506,7 +2521,8 @@ int mem_cgroup_prepare_migration(struct unlock_page_cgroup(pc); if (mem) { - ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false); + ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false, + PAGE_SIZE); css_put(&mem->css); } *ptr = mem; @@ -2549,7 +2565,7 @@ void mem_cgroup_end_migration(struct mem * __mem_cgroup_commit_charge() check PCG_USED bit of page_cgroup. * So, double-counting is effectively avoided. */ - __mem_cgroup_commit_charge(mem, pc, ctype); + __mem_cgroup_commit_charge(mem, pc, ctype, PAGE_SIZE); /* * Both of oldpage and newpage are still under lock_page(). @@ -4144,7 +4160,8 @@ one_by_one: batch_count = PRECHARGE_COUNT_AT_ONCE; cond_resched(); } - ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false); + ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false, + PAGE_SIZE); if (ret || !mem) /* mem_cgroup_clear_mc() will do uncharge later */ return -ENOMEM; @@ -4259,6 +4276,7 @@ static int mem_cgroup_count_precharge_pt pte_t *pte; spinlock_t *ptl; + VM_BUG_ON(pmd_trans_huge(*pmd)); pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); for (; addr != end; pte++, addr += PAGE_SIZE) if (is_target_pte_for_mc(vma, addr, *pte, NULL)) @@ -4407,6 +4425,7 @@ static int mem_cgroup_move_charge_pte_ra spinlock_t *ptl; retry: + VM_BUG_ON(pmd_trans_huge(*pmd)); pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); for (; addr != end; addr += PAGE_SIZE) { pte_t ptent = *(pte++); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
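To make the memcg sizing above concrete: with 4k base pages and 2M transparent hugepages, compound_order(page) is 9, so the page_size threaded through try_charge/commit/uncharge works out as in this sketch of the pattern the patch repeats:

	int page_size = PAGE_SIZE;			/* 4096 */

	if (PageTransHuge(page))
		page_size <<= compound_order(page);	/* 4096 << 9 == 2M */

	/*
	 * mem_cgroup_cancel_charge(mem, page_size) then drops
	 * page_size >> PAGE_SHIFT == 512 page references, and
	 * __do_uncharge() batches page_size bytes at a time instead
	 * of a hardcoded PAGE_SIZE.
	 */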
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with SMTP id ABB8A62008C for ; Thu, 1 Apr 2010 20:46:27 -0400 (EDT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: [PATCH 34 of 41] khugepaged Message-Id: <3dc8cccc15d11a8b5faf.1270168921@v2.random> In-Reply-To: References: Date: Fri, 02 Apr 2010 02:42:01 +0200 From: Andrea Arcangeli Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, Andrew Morton Cc: Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: From: Andrea Arcangeli Add khugepaged to relocate fragmented pages into hugepages if new hugepages become available. (this is independent of the defrag logic that will have to make new hugepages available) The fundamental reason why khugepaged is unavoidable is that some memory can be fragmented and not everything can be relocated. So when a virtual machine quits and releases gigabytes of hugepages, we want to use those freely available hugepages to create huge-pmds in the other virtual machines that may be running on fragmented memory, to maximize the CPU efficiency at all times. The scan is slow: it takes nearly zero cpu time, except when it copies data (in which case we definitely want to pay for that cpu time), so it seems a good tradeoff. In addition to hugepages being released by other processes releasing memory, we have the strong suspicion that the performance impact of potentially defragmenting hugepages during or before each page fault could lead to more performance inconsistency than allocating small pages at first and having them collapsed into large pages later... if they prove themselves to be long lived mappings (the khugepaged scan is slow, so short lived mappings have a low probability of running into khugepaged compared to long lived mappings). 
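When khugepaged is restricted to madvise mode, a long lived mapping opts in from userland as in this minimal sketch (assuming the MADV_HUGEPAGE advice added earlier in the series):

#include <stddef.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 64UL << 20;	/* long lived 64M anonymous arena */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	/* sets VM_HUGEPAGE on the vma so khugepaged will scan it */
	madvise(p, len, MADV_HUGEPAGE);
	/* ... touch the memory; khugepaged can collapse it later ... */
	return 0;
}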
Signed-off-by: Andrea Arcangeli Acked-by: Rik van Riel --- diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -23,8 +23,11 @@ extern int zap_huge_pmd(struct mmu_gathe enum transparent_hugepage_flag { TRANSPARENT_HUGEPAGE_FLAG, TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, + TRANSPARENT_HUGEPAGE_KHUGEPAGED_FLAG, + TRANSPARENT_HUGEPAGE_KHUGEPAGED_REQ_MADV_FLAG, TRANSPARENT_HUGEPAGE_DEFRAG_FLAG, TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, + TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG, #ifdef CONFIG_DEBUG_VM TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG, #endif diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h new file mode 100644 --- /dev/null +++ b/include/linux/khugepaged.h @@ -0,0 +1,66 @@ +#ifndef _LINUX_KHUGEPAGED_H +#define _LINUX_KHUGEPAGED_H + +#include /* MMF_VM_HUGEPAGE */ + +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +extern int __khugepaged_enter(struct mm_struct *mm); +extern void __khugepaged_exit(struct mm_struct *mm); +extern int khugepaged_enter_vma_merge(struct vm_area_struct *vma); + +#define khugepaged_enabled() \ + (transparent_hugepage_flags & \ + ((1<flags)) + return __khugepaged_enter(mm); + return 0; +} + +static inline void khugepaged_exit(struct mm_struct *mm) +{ + if (test_bit(MMF_VM_HUGEPAGE, &mm->flags)) + __khugepaged_exit(mm); +} + +static inline int khugepaged_enter(struct vm_area_struct *vma) +{ + if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags)) + if (khugepaged_always() || + (khugepaged_req_madv() && + vma->vm_flags & VM_HUGEPAGE)) + if (__khugepaged_enter(vma->vm_mm)) + return -ENOMEM; + return 0; +} +#else /* CONFIG_TRANSPARENT_HUGEPAGE */ +static inline int khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm) +{ + return 0; +} +static inline void khugepaged_exit(struct mm_struct *mm) +{ +} +static inline int khugepaged_enter(struct vm_area_struct *vma) +{ + return 0; +} +static inline int khugepaged_enter_vma_merge(struct vm_area_struct *vma) +{ + return 0; +} +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ + +#endif /* _LINUX_KHUGEPAGED_H */ diff --git a/include/linux/sched.h b/include/linux/sched.h --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -435,6 +435,7 @@ extern int get_dumpable(struct mm_struct #endif /* leave room for more dump flags */ #define MMF_VM_MERGEABLE 16 /* KSM may merge identical pages */ +#define MMF_VM_HUGEPAGE 17 /* set when VM_HUGEPAGE is set on vma */ #define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK) diff --git a/kernel/fork.c b/kernel/fork.c --- a/kernel/fork.c +++ b/kernel/fork.c @@ -65,6 +65,7 @@ #include #include #include +#include #include #include @@ -327,6 +328,9 @@ static int dup_mmap(struct mm_struct *mm retval = ksm_fork(mm, oldmm); if (retval) goto out; + retval = khugepaged_fork(mm, oldmm); + if (retval) + goto out; for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) { struct file *file; @@ -539,6 +543,7 @@ void mmput(struct mm_struct *mm) if (atomic_dec_and_test(&mm->mm_users)) { exit_aio(mm); ksm_exit(mm); + khugepaged_exit(mm); /* must run before exit_mmap */ exit_mmap(mm); set_mm_exe_file(mm, NULL); if (!list_empty(&mm->mmlist)) { diff --git a/mm/huge_memory.c b/mm/huge_memory.c --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -12,14 +12,124 @@ #include #include #include +#include +#include +#include #include #include #include "internal.h" +/* + * By default transparent hugepage support is enabled for all mappings + * and khugepaged scans all mappings. 
Defrag is only invoked by + * khugepaged hugepage allocations and by page faults inside + * MADV_HUGEPAGE regions to avoid the risk of slowing down short lived + * allocations. + */ unsigned long transparent_hugepage_flags __read_mostly = - (1< UINT_MAX) + return -EINVAL; + + khugepaged_scan_sleep_millisecs = msecs; + wakeup_khugepaged(); + + return count; +} +static struct kobj_attribute scan_sleep_millisecs_attr = + __ATTR(scan_sleep_millisecs, 0644, scan_sleep_millisecs_show, + scan_sleep_millisecs_store); + +static ssize_t alloc_sleep_millisecs_show(struct kobject *kobj, + struct kobj_attribute *attr, + char *buf) +{ + return sprintf(buf, "%u\n", khugepaged_alloc_sleep_millisecs); +} + +static ssize_t alloc_sleep_millisecs_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + unsigned long msecs; + int err; + + err = strict_strtoul(buf, 10, &msecs); + if (err || msecs > UINT_MAX) + return -EINVAL; + + khugepaged_alloc_sleep_millisecs = msecs; + wakeup_khugepaged(); + + return count; +} +static struct kobj_attribute alloc_sleep_millisecs_attr = + __ATTR(alloc_sleep_millisecs, 0644, alloc_sleep_millisecs_show, + alloc_sleep_millisecs_store); + +static ssize_t pages_to_scan_show(struct kobject *kobj, + struct kobj_attribute *attr, + char *buf) +{ + return sprintf(buf, "%u\n", khugepaged_pages_to_scan); +} +static ssize_t pages_to_scan_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + int err; + unsigned long pages; + + err = strict_strtoul(buf, 10, &pages); + if (err || !pages || pages > UINT_MAX) + return -EINVAL; + + khugepaged_pages_to_scan = pages; + + return count; +} +static struct kobj_attribute pages_to_scan_attr = + __ATTR(pages_to_scan, 0644, pages_to_scan_show, + pages_to_scan_store); + +static ssize_t pages_collapsed_show(struct kobject *kobj, + struct kobj_attribute *attr, + char *buf) +{ + return sprintf(buf, "%u\n", khugepaged_pages_collapsed); +} +static struct kobj_attribute pages_collapsed_attr = + __ATTR_RO(pages_collapsed); + +static ssize_t full_scans_show(struct kobject *kobj, + struct kobj_attribute *attr, + char *buf) +{ + return sprintf(buf, "%u\n", khugepaged_full_scans); +} +static struct kobj_attribute full_scans_attr = + __ATTR_RO(full_scans); + +static ssize_t khugepaged_enabled_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + return double_flag_show(kobj, attr, buf, + TRANSPARENT_HUGEPAGE_KHUGEPAGED_FLAG, + TRANSPARENT_HUGEPAGE_KHUGEPAGED_REQ_MADV_FLAG); +} +static ssize_t khugepaged_enabled_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + ssize_t ret; + + ret = double_flag_store(kobj, attr, buf, count, + TRANSPARENT_HUGEPAGE_KHUGEPAGED_FLAG, + TRANSPARENT_HUGEPAGE_KHUGEPAGED_REQ_MADV_FLAG); + if (ret > 0) { + int err = start_khugepaged(); + if (err) + ret = err; + } + return ret; +} +static struct kobj_attribute khugepaged_enabled_attr = + __ATTR(enabled, 0644, khugepaged_enabled_show, + khugepaged_enabled_store); + +static ssize_t khugepaged_defrag_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + return single_flag_show(kobj, attr, buf, + TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG); +} +static ssize_t khugepaged_defrag_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + return single_flag_store(kobj, attr, buf, count, + TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG); +} +static struct kobj_attribute khugepaged_defrag_attr = + 
__ATTR(defrag, 0644, khugepaged_defrag_show, + khugepaged_defrag_store); + +/* + * max_ptes_none controls if khugepaged should collapse hugepages over + * any unmapped ptes in turn potentially increasing the memory + * footprint of the vmas. When max_ptes_none is 0 khugepaged will not + * reduce the available free memory in the system as it + * runs. Increasing max_ptes_none will instead potentially reduce the + * free memory in the system during the khugepaged scan. + */ +static ssize_t khugepaged_max_ptes_none_show(struct kobject *kobj, + struct kobj_attribute *attr, + char *buf) +{ + return sprintf(buf, "%u\n", khugepaged_max_ptes_none); +} +static ssize_t khugepaged_max_ptes_none_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + int err; + unsigned long max_ptes_none; + + err = strict_strtoul(buf, 10, &max_ptes_none); + if (err || max_ptes_none > HPAGE_PMD_NR-1) + return -EINVAL; + + khugepaged_max_ptes_none = max_ptes_none; + + return count; +} +static struct kobj_attribute khugepaged_max_ptes_none_attr = + __ATTR(max_ptes_none, 0644, khugepaged_max_ptes_none_show, + khugepaged_max_ptes_none_store); + +static struct attribute *khugepaged_attr[] = { + &khugepaged_enabled_attr.attr, + &khugepaged_defrag_attr.attr, + &khugepaged_max_ptes_none_attr.attr, + &pages_to_scan_attr.attr, + &pages_collapsed_attr.attr, + &full_scans_attr.attr, + &scan_sleep_millisecs_attr.attr, + &alloc_sleep_millisecs_attr.attr, + NULL, +}; + +static struct attribute_group khugepaged_attr_group = { + .attrs = khugepaged_attr, + .name = "khugepaged", }; #endif /* CONFIG_SYSFS */ static int __init hugepage_init(void) { + int err; #ifdef CONFIG_SYSFS - int err; + static struct kobject *hugepage_kobj; - err = sysfs_create_group(mm_kobj, &hugepage_attr_group); + err = -ENOMEM; + hugepage_kobj = kobject_create_and_add("transparent_hugepage", mm_kobj); + if (unlikely(!hugepage_kobj)) { + printk(KERN_ERR "hugepage: failed kobject create\n"); + goto out; + } + + err = sysfs_create_group(hugepage_kobj, &hugepage_attr_group); + if (err) { + printk(KERN_ERR "hugepage: failed register hugeage group\n"); + goto out; + } + + err = sysfs_create_group(hugepage_kobj, &khugepaged_attr_group); + if (err) { + printk(KERN_ERR "hugepage: failed register hugeage group\n"); + goto out; + } +#endif + + err = khugepaged_slab_init(); if (err) - printk(KERN_ERR "hugepage: register sysfs failed\n"); -#endif - return 0; + goto out; + + err = mm_slots_hash_init(); + if (err) { + khugepaged_slab_free(); + goto out; + } + + start_khugepaged(); + +out: + return err; } module_init(hugepage_init) @@ -183,6 +513,15 @@ static int __init setup_transparent_huge transparent_hugepage_flags); transparent_hugepage_flags = 0; } + if (test_bit(TRANSPARENT_HUGEPAGE_KHUGEPAGED_FLAG, + &transparent_hugepage_flags) && + test_bit(TRANSPARENT_HUGEPAGE_KHUGEPAGED_REQ_MADV_FLAG, + &transparent_hugepage_flags)) { + printk(KERN_WARNING + "transparent_hugepage=%lu invalid parameter, disabling", + transparent_hugepage_flags); + transparent_hugepage_flags = 0; + } return 1; } __setup("transparent_hugepage=", setup_transparent_hugepage); @@ -277,6 +616,8 @@ int do_huge_pmd_anonymous_page(struct mm if (haddr >= vma->vm_start && haddr + HPAGE_PMD_SIZE <= vma->vm_end) { if (unlikely(anon_vma_prepare(vma))) return VM_FAULT_OOM; + if (unlikely(khugepaged_enter(vma))) + return VM_FAULT_OOM; page = alloc_hugepage(transparent_hugepage_defrag(vma)); if (unlikely(!page)) goto out; @@ -881,6 +1222,755 @@ int hugepage_madvise(unsigned 
long *vm_f return 0; } +static int __init khugepaged_slab_init(void) +{ + mm_slot_cache = kmem_cache_create("khugepaged_mm_slot", + sizeof(struct mm_slot), + __alignof__(struct mm_slot), 0, NULL); + if (!mm_slot_cache) + return -ENOMEM; + + return 0; +} + +static void __init khugepaged_slab_free(void) +{ + kmem_cache_destroy(mm_slot_cache); + mm_slot_cache = NULL; +} + +static inline struct mm_slot *alloc_mm_slot(void) +{ + if (!mm_slot_cache) /* initialization failed */ + return NULL; + return kmem_cache_zalloc(mm_slot_cache, GFP_KERNEL); +} + +static inline void free_mm_slot(struct mm_slot *mm_slot) +{ + kmem_cache_free(mm_slot_cache, mm_slot); +} + +static int __init mm_slots_hash_init(void) +{ + mm_slots_hash = kzalloc(MM_SLOTS_HASH_HEADS * sizeof(struct hlist_head), + GFP_KERNEL); + if (!mm_slots_hash) + return -ENOMEM; + return 0; +} + +#if 0 +static void __init mm_slots_hash_free(void) +{ + kfree(mm_slots_hash); + mm_slots_hash = NULL; +} +#endif + +static struct mm_slot *get_mm_slot(struct mm_struct *mm) +{ + struct mm_slot *mm_slot; + struct hlist_head *bucket; + struct hlist_node *node; + + bucket = &mm_slots_hash[((unsigned long)mm / sizeof(struct mm_struct)) + % MM_SLOTS_HASH_HEADS]; + hlist_for_each_entry(mm_slot, node, bucket, hash) { + if (mm == mm_slot->mm) + return mm_slot; + } + return NULL; +} + +static void insert_to_mm_slots_hash(struct mm_struct *mm, + struct mm_slot *mm_slot) +{ + struct hlist_head *bucket; + + bucket = &mm_slots_hash[((unsigned long)mm / sizeof(struct mm_struct)) + % MM_SLOTS_HASH_HEADS]; + mm_slot->mm = mm; + hlist_add_head(&mm_slot->hash, bucket); +} + +static inline int khugepaged_test_exit(struct mm_struct *mm) +{ + return atomic_read(&mm->mm_users) == 0; +} + +int __khugepaged_enter(struct mm_struct *mm) +{ + struct mm_slot *mm_slot; + int wakeup; + + mm_slot = alloc_mm_slot(); + if (!mm_slot) + return -ENOMEM; + + /* __khugepaged_exit() must not run from under us */ + VM_BUG_ON(khugepaged_test_exit(mm)); + if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags))) { + free_mm_slot(mm_slot); + return 0; + } + + spin_lock(&khugepaged_mm_lock); + insert_to_mm_slots_hash(mm, mm_slot); + /* + * Insert just behind the scanning cursor, to let the area settle + * down a little. + */ + wakeup = list_empty(&khugepaged_scan.mm_head); + list_add_tail(&mm_slot->mm_node, &khugepaged_scan.mm_head); + spin_unlock(&khugepaged_mm_lock); + + atomic_inc(&mm->mm_count); + if (wakeup) + wake_up_interruptible(&khugepaged_wait); + + return 0; +} + +int khugepaged_enter_vma_merge(struct vm_area_struct *vma) +{ + unsigned long hstart, hend; + if (!vma->anon_vma) + /* + * Not yet faulted in so we will register later in the + * page fault if needed. 
+ */ + return 0; + if (vma->vm_file || vma->vm_ops) + /* khugepaged not yet working on file or special mappings */ + return 0; + VM_BUG_ON(is_linear_pfn_mapping(vma) || is_pfn_mapping(vma)); + hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK; + hend = vma->vm_end & HPAGE_PMD_MASK; + if (hstart < hend) + return khugepaged_enter(vma); + return 0; +} + +void __khugepaged_exit(struct mm_struct *mm) +{ + struct mm_slot *mm_slot; + int free = 0; + + spin_lock(&khugepaged_mm_lock); + mm_slot = get_mm_slot(mm); + if (mm_slot && khugepaged_scan.mm_slot != mm_slot) { + hlist_del(&mm_slot->hash); + list_del(&mm_slot->mm_node); + free = 1; + } + + if (free) { + spin_unlock(&khugepaged_mm_lock); + clear_bit(MMF_VM_HUGEPAGE, &mm->flags); + free_mm_slot(mm_slot); + mmdrop(mm); + } else if (mm_slot) { + spin_unlock(&khugepaged_mm_lock); + /* + * This is required to serialize against + * khugepaged_test_exit() (which is guaranteed to run + * under mmap sem read mode). Stop here (after we + * return all pagetables will be destroyed) until + * khugepaged has finished working on the pagetables + * under the mmap_sem. + */ + down_write(&mm->mmap_sem); + up_write(&mm->mmap_sem); + } else + spin_unlock(&khugepaged_mm_lock); +} + +static void release_pte_page(struct page *page) +{ + /* 0 stands for page_is_file_cache(page) == false */ + dec_zone_page_state(page, NR_ISOLATED_ANON + 0); + unlock_page(page); + putback_lru_page(page); +} + +static void release_pte_pages(pte_t *pte, pte_t *_pte) +{ + while (--_pte >= pte) { + pte_t pteval = *_pte; + if (!pte_none(pteval)) + release_pte_page(pte_page(pteval)); + } +} + +static void release_all_pte_pages(pte_t *pte) +{ + release_pte_pages(pte, pte + HPAGE_PMD_NR); +} + +static int __collapse_huge_page_isolate(struct vm_area_struct *vma, + unsigned long address, + pte_t *pte) +{ + struct page *page; + pte_t *_pte; + int referenced = 0, isolated = 0, none = 0; + for (_pte = pte; _pte < pte+HPAGE_PMD_NR; + _pte++, address += PAGE_SIZE) { + pte_t pteval = *_pte; + if (pte_none(pteval)) { + if (++none <= khugepaged_max_ptes_none) + continue; + else { + release_pte_pages(pte, _pte); + goto out; + } + } + if (!pte_present(pteval) || !pte_write(pteval)) { + release_pte_pages(pte, _pte); + goto out; + } + page = vm_normal_page(vma, address, pteval); + if (unlikely(!page)) { + release_pte_pages(pte, _pte); + goto out; + } + VM_BUG_ON(PageCompound(page)); + BUG_ON(!PageAnon(page)); + VM_BUG_ON(!PageSwapBacked(page)); + + /* cannot use mapcount: can't collapse if there's a gup pin */ + if (page_count(page) != 1) { + release_pte_pages(pte, _pte); + goto out; + } + /* + * We can do it before isolate_lru_page because the + * page can't be freed from under us. NOTE: PG_lock + * is needed to serialize against split_huge_page + * when invoked from the VM. + */ + if (!trylock_page(page)) { + release_pte_pages(pte, _pte); + goto out; + } + /* + * Isolate the page to avoid collapsing an hugepage + * currently in use by the VM. 
+ */ + if (isolate_lru_page(page)) { + unlock_page(page); + release_pte_pages(pte, _pte); + goto out; + } + /* 0 stands for page_is_file_cache(page) == false */ + inc_zone_page_state(page, NR_ISOLATED_ANON + 0); + VM_BUG_ON(!PageLocked(page)); + VM_BUG_ON(PageLRU(page)); + + /* If there is no mapped pte young don't collapse the page */ + if (pte_young(pteval)) + referenced = 1; + } + if (unlikely(!referenced)) + release_all_pte_pages(pte); + else + isolated = 1; +out: + return isolated; +} + +static void __collapse_huge_page_copy(pte_t *pte, struct page *page, + struct vm_area_struct *vma, + unsigned long address, + spinlock_t *ptl) +{ + pte_t *_pte; + for (_pte = pte; _pte < pte+HPAGE_PMD_NR; _pte++) { + pte_t pteval = *_pte; + struct page *src_page; + + if (pte_none(pteval)) { + clear_user_highpage(page, address); + add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1); + } else { + src_page = pte_page(pteval); + copy_user_highpage(page, src_page, address, vma); + VM_BUG_ON(page_mapcount(src_page) != 1); + VM_BUG_ON(page_count(src_page) != 2); + release_pte_page(src_page); + /* + * ptl mostly unnecessary, but preempt has to + * be disabled to update the per-cpu stats + * inside page_remove_rmap(). + */ + spin_lock(ptl); + /* + * paravirt calls inside pte_clear here are + * superfluous. + */ + pte_clear(vma->vm_mm, address, _pte); + page_remove_rmap(src_page); + spin_unlock(ptl); + free_page_and_swap_cache(src_page); + } + + address += PAGE_SIZE; + page++; + } +} + +static void collapse_huge_page(struct mm_struct *mm, + unsigned long address, + struct page **hpage) +{ + struct vm_area_struct *vma; + pgd_t *pgd; + pud_t *pud; + pmd_t *pmd, _pmd; + pte_t *pte; + pgtable_t pgtable; + struct page *new_page; + spinlock_t *ptl; + int isolated; + unsigned long hstart, hend; + + VM_BUG_ON(address & ~HPAGE_PMD_MASK); + VM_BUG_ON(!*hpage); + + /* + * Prevent all access to pagetables with the exception of + * gup_fast later hanlded by the ptep_clear_flush and the VM + * handled by the anon_vma lock + PG_lock. + */ + down_write(&mm->mmap_sem); + if (unlikely(khugepaged_test_exit(mm))) + goto out; + + vma = find_vma(mm, address); + hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK; + hend = vma->vm_end & HPAGE_PMD_MASK; + if (address < hstart || address + HPAGE_PMD_SIZE > hend) + goto out; + + if (!(vma->vm_flags & VM_HUGEPAGE) && !khugepaged_always()) + goto out; + + /* VM_PFNMAP vmas may have vm_ops null but vm_file set */ + if (!vma->anon_vma || vma->vm_ops || vma->vm_file) + goto out; + VM_BUG_ON(is_linear_pfn_mapping(vma) || is_pfn_mapping(vma)); + + pgd = pgd_offset(mm, address); + if (!pgd_present(*pgd)) + goto out; + + pud = pud_offset(pgd, address); + if (!pud_present(*pud)) + goto out; + + pmd = pmd_offset(pud, address); + /* pmd can't go away or become huge under us */ + if (!pmd_present(*pmd) || pmd_trans_huge(*pmd)) + goto out; + + new_page = *hpage; + if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) + goto out; + + /* + * Stop anon_vma rmap pagetable access. vma->anon_vma->lock is + * enough for now (we don't need to check each anon_vma + * pointed by each page->mapping) because collapse_huge_page + * only works on not-shared anon pages (that are guaranteed to + * belong to vma->anon_vma). + */ + spin_lock(&vma->anon_vma->lock); + + pte = pte_offset_map(pmd, address); + ptl = pte_lockptr(mm, pmd); + + spin_lock(&mm->page_table_lock); /* probably unnecessary */ + /* + * After this gup_fast can't run anymore. 
This also removes + * any huge TLB entry from the CPU so we won't allow + * huge and small TLB entries for the same virtual address + * to avoid the risk of CPU bugs in that area. + */ + _pmd = pmdp_clear_flush_notify(vma, address, pmd); + spin_unlock(&mm->page_table_lock); + + spin_lock(ptl); + isolated = __collapse_huge_page_isolate(vma, address, pte); + spin_unlock(ptl); + pte_unmap(pte); + + if (unlikely(!isolated)) { + spin_lock(&mm->page_table_lock); + BUG_ON(!pmd_none(*pmd)); + set_pmd_at(mm, address, pmd, _pmd); + spin_unlock(&mm->page_table_lock); + spin_unlock(&vma->anon_vma->lock); + mem_cgroup_uncharge_page(new_page); + goto out; + } + + /* + * All pages are isolated and locked so anon_vma rmap + * can't run anymore. + */ + spin_unlock(&vma->anon_vma->lock); + + __collapse_huge_page_copy(pte, new_page, vma, address, ptl); + __SetPageUptodate(new_page); + pgtable = pmd_pgtable(_pmd); + VM_BUG_ON(page_count(pgtable) != 1); + VM_BUG_ON(page_mapcount(pgtable) != 0); + + _pmd = mk_pmd(new_page, vma->vm_page_prot); + _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma); + _pmd = pmd_mkhuge(_pmd); + + /* + * spin_lock() below is not the equivalent of smp_wmb(), so + * this is needed to avoid the copy_huge_page writes to become + * visible after the set_pmd_at() write. + */ + smp_wmb(); + + spin_lock(&mm->page_table_lock); + BUG_ON(!pmd_none(*pmd)); + page_add_new_anon_rmap(new_page, vma, address); + set_pmd_at(mm, address, pmd, _pmd); + update_mmu_cache(vma, address, entry); + prepare_pmd_huge_pte(pgtable, mm); + mm->nr_ptes--; + spin_unlock(&mm->page_table_lock); + + *hpage = NULL; + khugepaged_pages_collapsed++; +out: + up_write(&mm->mmap_sem); +} + +static int khugepaged_scan_pmd(struct mm_struct *mm, + struct vm_area_struct *vma, + unsigned long address, + struct page **hpage) +{ + pgd_t *pgd; + pud_t *pud; + pmd_t *pmd; + pte_t *pte, *_pte; + int ret = 0, referenced = 0, none = 0; + struct page *page; + unsigned long _address; + spinlock_t *ptl; + + VM_BUG_ON(address & ~HPAGE_PMD_MASK); + + pgd = pgd_offset(mm, address); + if (!pgd_present(*pgd)) + goto out; + + pud = pud_offset(pgd, address); + if (!pud_present(*pud)) + goto out; + + pmd = pmd_offset(pud, address); + if (!pmd_present(*pmd) || pmd_trans_huge(*pmd)) + goto out; + + pte = pte_offset_map_lock(mm, pmd, address, &ptl); + for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR; + _pte++, _address += PAGE_SIZE) { + pte_t pteval = *_pte; + if (pte_none(pteval)) { + if (++none <= khugepaged_max_ptes_none) + continue; + else + goto out_unmap; + } + if (!pte_present(pteval) || !pte_write(pteval)) + goto out_unmap; + page = vm_normal_page(vma, _address, pteval); + if (unlikely(!page)) + goto out_unmap; + VM_BUG_ON(PageCompound(page)); + if (!PageLRU(page) || PageLocked(page) || !PageAnon(page)) + goto out_unmap; + /* cannot use mapcount: can't collapse if there's a gup pin */ + if (page_count(page) != 1) + goto out_unmap; + if (pte_young(pteval)) + referenced = 1; + } + if (referenced) + ret = 1; +out_unmap: + pte_unmap_unlock(pte, ptl); + if (ret) { + up_read(&mm->mmap_sem); + collapse_huge_page(mm, address, hpage); + } +out: + return ret; +} + +static void collect_mm_slot(struct mm_slot *mm_slot) +{ + struct mm_struct *mm = mm_slot->mm; + + VM_BUG_ON(!spin_is_locked(&khugepaged_mm_lock)); + + if (khugepaged_test_exit(mm)) { + /* free mm_slot */ + hlist_del(&mm_slot->hash); + list_del(&mm_slot->mm_node); + + /* + * Not strictly needed because the mm exited already. 
+ * + * clear_bit(MMF_VM_HUGEPAGE, &mm->flags); + */ + + /* khugepaged_mm_lock actually not necessary for the below */ + free_mm_slot(mm_slot); + mmdrop(mm); + } +} + +static unsigned int khugepaged_scan_mm_slot(unsigned int pages, + struct page **hpage) +{ + struct mm_slot *mm_slot; + struct mm_struct *mm; + struct vm_area_struct *vma; + int progress = 0; + + VM_BUG_ON(!pages); + VM_BUG_ON(!spin_is_locked(&khugepaged_mm_lock)); + + if (khugepaged_scan.mm_slot) + mm_slot = khugepaged_scan.mm_slot; + else { + mm_slot = list_entry(khugepaged_scan.mm_head.next, + struct mm_slot, mm_node); + khugepaged_scan.address = 0; + khugepaged_scan.mm_slot = mm_slot; + } + spin_unlock(&khugepaged_mm_lock); + + mm = mm_slot->mm; + down_read(&mm->mmap_sem); + if (unlikely(khugepaged_test_exit(mm))) + vma = NULL; + else + vma = find_vma(mm, khugepaged_scan.address); + + progress++; + for (; vma; vma = vma->vm_next) { + unsigned long hstart, hend; + + cond_resched(); + if (unlikely(khugepaged_test_exit(mm))) { + progress++; + break; + } + + if (!(vma->vm_flags & VM_HUGEPAGE) && + !khugepaged_always()) { + progress++; + continue; + } + + /* VM_PFNMAP vmas may have vm_ops null but vm_file set */ + if (!vma->anon_vma || vma->vm_ops || vma->vm_file) { + khugepaged_scan.address = vma->vm_end; + progress++; + continue; + } + VM_BUG_ON(is_linear_pfn_mapping(vma) || is_pfn_mapping(vma)); + + hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK; + hend = vma->vm_end & HPAGE_PMD_MASK; + if (hstart >= hend) { + progress++; + continue; + } + if (khugepaged_scan.address < hstart) + khugepaged_scan.address = hstart; + if (khugepaged_scan.address > hend) { + khugepaged_scan.address = hend + HPAGE_PMD_SIZE; + progress++; + continue; + } + BUG_ON(khugepaged_scan.address & ~HPAGE_PMD_MASK); + + while (khugepaged_scan.address < hend) { + int ret; + cond_resched(); + if (unlikely(khugepaged_test_exit(mm))) + goto breakouterloop; + + VM_BUG_ON(khugepaged_scan.address < hstart || + khugepaged_scan.address + HPAGE_PMD_SIZE > + hend); + ret = khugepaged_scan_pmd(mm, vma, + khugepaged_scan.address, + hpage); + /* move to next address */ + khugepaged_scan.address += HPAGE_PMD_SIZE; + progress += HPAGE_PMD_NR; + if (ret) + /* we released mmap_sem so break loop */ + goto breakouterloop_mmap_sem; + if (progress >= pages) + goto breakouterloop; + } + } +breakouterloop: + up_read(&mm->mmap_sem); /* exit_mmap will destroy ptes after this */ +breakouterloop_mmap_sem: + + spin_lock(&khugepaged_mm_lock); + BUG_ON(khugepaged_scan.mm_slot != mm_slot); + /* + * Release the current mm_slot if this mm is about to die, or + * if we scanned all vmas of this mm. + */ + if (khugepaged_test_exit(mm) || !vma) { + /* + * Make sure that if mm_users is reaching zero while + * khugepaged runs here, khugepaged_exit will find + * mm_slot not pointing to the exiting mm. 
+ */ + if (mm_slot->mm_node.next != &khugepaged_scan.mm_head) { + khugepaged_scan.mm_slot = list_entry( + mm_slot->mm_node.next, + struct mm_slot, mm_node); + khugepaged_scan.address = 0; + } else { + khugepaged_scan.mm_slot = NULL; + khugepaged_full_scans++; + } + + collect_mm_slot(mm_slot); + } + + return progress; +} + +static int khugepaged_has_work(void) +{ + return !list_empty(&khugepaged_scan.mm_head) && + khugepaged_enabled(); +} + +static int khugepaged_wait_event(void) +{ + return !list_empty(&khugepaged_scan.mm_head) || + !khugepaged_enabled(); +} + +static void khugepaged_do_scan(struct page **hpage) +{ + unsigned int progress = 0, pass_through_head = 0; + unsigned int pages = khugepaged_pages_to_scan; + + barrier(); /* write khugepaged_pages_to_scan to local stack */ + + while (progress < pages) { + cond_resched(); + + if (!*hpage) { + *hpage = alloc_hugepage(khugepaged_defrag()); + if (unlikely(!*hpage)) + break; + } + + spin_lock(&khugepaged_mm_lock); + if (!khugepaged_scan.mm_slot) + pass_through_head++; + if (khugepaged_has_work() && + pass_through_head < 2) + progress += khugepaged_scan_mm_slot(pages - progress, + hpage); + else + progress = pages; + spin_unlock(&khugepaged_mm_lock); + } +} + +static struct page *khugepaged_alloc_hugepage(void) +{ + struct page *hpage; + + do { + hpage = alloc_hugepage(khugepaged_defrag()); + if (!hpage) + schedule_timeout_interruptible( + msecs_to_jiffies( + khugepaged_alloc_sleep_millisecs)); + } while (unlikely(!hpage) && + likely(khugepaged_enabled())); + return hpage; +} + +static void khugepaged_loop(void) +{ + struct page *hpage; + + while (likely(khugepaged_enabled())) { + hpage = khugepaged_alloc_hugepage(); + if (unlikely(!hpage)) + break; + + khugepaged_do_scan(&hpage); + if (hpage) + put_page(hpage); + if (khugepaged_has_work()) { + if (!khugepaged_scan_sleep_millisecs) + continue; + schedule_timeout_interruptible( + msecs_to_jiffies( + khugepaged_scan_sleep_millisecs)); + } else if (khugepaged_enabled()) + wait_event_interruptible(khugepaged_wait, + khugepaged_wait_event()); + } +} + +static int khugepaged(void *none) +{ + struct mm_slot *mm_slot; + + set_user_nice(current, 19); + + for (;;) { + BUG_ON(khugepaged_thread != current); + khugepaged_loop(); + BUG_ON(khugepaged_thread != current); + + mutex_lock(&khugepaged_mutex); + if (!khugepaged_enabled()) + break; + mutex_unlock(&khugepaged_mutex); + } + + spin_lock(&khugepaged_mm_lock); + mm_slot = khugepaged_scan.mm_slot; + khugepaged_scan.mm_slot = NULL; + if (mm_slot) + collect_mm_slot(mm_slot); + spin_unlock(&khugepaged_mm_lock); + + khugepaged_thread = NULL; + mutex_unlock(&khugepaged_mutex); + + return 0; +} + void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd) { struct page *page; diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -28,6 +28,7 @@ #include #include #include +#include #include #include @@ -800,6 +801,7 @@ struct vm_area_struct *vma_merge(struct end, prev->vm_pgoff, NULL); if (err) return NULL; + khugepaged_enter_vma_merge(prev); return prev; } @@ -818,6 +820,7 @@ struct vm_area_struct *vma_merge(struct next->vm_pgoff - pglen, NULL); if (err) return NULL; + khugepaged_enter_vma_merge(area); return area; } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with SMTP id 822D7620084 for ; Thu, 1 Apr 2010 20:46:27 -0400 (EDT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: [PATCH 27 of 41] transparent hugepage core Message-Id: <1a35f15f6556a7135990.1270168914@v2.random> In-Reply-To: References: Date: Fri, 02 Apr 2010 02:41:54 +0200 From: Andrea Arcangeli Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, Andrew Morton Cc: Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: From: Andrea Arcangeli

Lately I've been working to make KVM use hugepages transparently without the usual restrictions of hugetlbfs. Some of the restrictions I'd like to see removed:

1) hugepages have to be swappable, or the guest physical memory remains locked in RAM and can't be paged out to swap

2) if a hugepage allocation fails, regular pages should be allocated instead and mixed in the same vma without any failure and without userland noticing

3) if some task quits and more hugepages become available in the buddy, guest physical memory backed by regular pages should be relocated on hugepages automatically in regions under madvise(MADV_HUGEPAGE) (ideally event driven, by waking up the kernel daemon if the order=HPAGE_PMD_SHIFT-PAGE_SHIFT list becomes non-empty)

4) avoidance of reservation and maximization of use of hugepages whenever possible. Reservation (needed to avoid runtime fatal failures) may be ok for 1 machine with 1 database with 1 database cache with 1 database cache size known at boot time. It's definitely not feasible with a virtualization hypervisor usage like RHEV-H that runs an unknown number of virtual machines with an unknown size of each virtual machine, with an unknown amount of pagecache that could be potentially useful in the host for guests not using O_DIRECT (aka cache=off).

Hugepages in the virtualization hypervisor (and also in the guest!) are much more important than in a regular host not using virtualization, because with NPT/EPT they decrease the tlb-miss cacheline accesses from 24 to 19 in case only the hypervisor uses transparent hugepages, and they decrease the tlb-miss cacheline accesses from 19 to 15 in case both the linux hypervisor and the linux guest use this patch (though the guest will limit the additional speedup to anonymous regions only for now...). Even more important is that the tlb miss handler is much slower on an NPT/EPT guest than for a regular shadow-paging or no-virtualization scenario. So maximizing the amount of virtual memory cached by the TLB pays off significantly more with NPT/EPT than without (even if there were no significant speedup in the tlb-miss runtime itself).

The first (and more tedious) part of this work requires allowing the VM to handle anonymous hugepages mixed with regular pages transparently on regular anonymous vmas. This is what this patch tries to achieve in the least intrusive possible way.
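To make "without userland noticing" concrete: once the madvise(MADV_HUGEPAGE) part of this series is in place, an application opts a region in with a single hint and keeps working unmodified when the kernel falls back to 4k pages. The sketch below is illustrative only; the MADV_HUGEPAGE constant is added later in this series, and the value 14 is an assumption, not something taken from this patch:

============
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14	/* assumed: provided by the madvise patch of this series */
#endif

#define HPAGE_SIZE (2UL*1024*1024)	/* pmd-sized hugepage on x86-64 */

int main(void)
{
	size_t len = 64 * HPAGE_SIZE;
	char *p = mmap(NULL, len, PROT_READ|PROT_WRITE,
		       MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;
	/* a hint, not a command: with "never" or on older kernels nothing changes */
	if (madvise(p, len, MADV_HUGEPAGE))
		perror("madvise(MADV_HUGEPAGE)");
	memset(p, 0, len);	/* faults may now be satisfied with huge pmds */
	munmap(p, len);
	return 0;
}
============

The point of the design is that this program behaves identically (just with 512 times more page faults and TLB entries) when the hugepage allocation fails or the feature is disabled.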
We want hugepages and hugetlb to be used in such a way that all applications can benefit without changes (as usual we leverage the KVM virtualization design: by improving the Linux VM at large, KVM gets the performance boost too).

The most important design choice is: always fall back to 4k allocation if the hugepage allocation fails! This is the _very_ opposite of some large pagecache patches that failed with -EIO back then if a 64k (or similar) allocation failed...

Second important decision (to reduce the impact of the feature on the existing pagetable handling code) is that at any time we can split a hugepage into 512 regular pages, and it has to be done with an operation that can't fail. This way the reliability of the swapping isn't decreased (no need to allocate memory when we are short on memory to swap) and it's trivial to plug a split_huge_page* one-liner where needed without polluting the VM. Over time we can teach mprotect, mremap and friends to handle pmd_trans_huge natively without calling split_huge_page*. The fact it can't fail isn't just for swap: if split_huge_page returned -ENOMEM (instead of the current void) we'd need to roll back the mprotect from the middle of it (ideally including undoing the split_vma), which would be a big change and in the very wrong direction (it'd likely be simpler not to call split_huge_page at all and to teach mprotect and friends to handle hugepages instead of rolling them back from the middle). In short, the very value of split_huge_page is that it can't fail.

The collapsing and madvise(MADV_HUGEPAGE) part will remain separated and incremental, and it'll just be a "harmless" addition later if this initial part is agreed upon. It also should be noted that, locking-wise, replacing regular pages with hugepages is going to be very easy compared to what I'm doing below in split_huge_page, as it will only happen when page_count(page) matches page_mapcount(page), if we can take the PG_lock and mmap_sem in write mode. collapse_huge_page will be a "best effort" that (unlike split_huge_page) can fail at the first sign of trouble and we can try again later. collapse_huge_page will be similar to how KSM works, and madvise(MADV_HUGEPAGE) will work similarly to madvise(MADV_MERGEABLE).

The default I like is that transparent hugepages are used at page fault time. This can be changed with /sys/kernel/mm/transparent_hugepage/enabled. The control knob can be set to three values, "always", "madvise", "never", which mean respectively that hugepages are always used, or only used inside madvise(MADV_HUGEPAGE) regions, or never used. /sys/kernel/mm/transparent_hugepage/defrag instead controls whether the hugepage allocation should defrag memory aggressively "always", only inside "madvise" regions, or "never".
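For illustration, a trivial userland helper for the knob just described; it assumes only the /sys/kernel/mm/transparent_hugepage/enabled path and the three string values quoted above (the benchmark later in this mail also uses a numeric /proc/sys/vm/transparent_hugepage interface, so treat the exact knob format as release-dependent):

============
#include <stdio.h>

#define THP_ENABLED "/sys/kernel/mm/transparent_hugepage/enabled"

int main(int argc, char **argv)
{
	char buf[128];
	FILE *f = fopen(THP_ENABLED, "r");

	if (!f) {
		perror(THP_ENABLED);	/* kernel without transparent hugepage support */
		return 1;
	}
	if (fgets(buf, sizeof(buf), f))
		printf("current mode: %s", buf);	/* shows always/madvise/never */
	fclose(f);

	if (argc > 1) {	/* e.g. "thpctl madvise" to restrict to MADV_HUGEPAGE regions */
		f = fopen(THP_ENABLED, "w");
		if (!f) {
			perror(THP_ENABLED);	/* needs root */
			return 1;
		}
		fputs(argv[1], f);	/* write "always", "madvise" or "never" */
		fclose(f);
	}
	return 0;
}
============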
The pmd_trans_splitting/pmd_trans_huge locking is very solid. The put_page (from get_user_page users that can't use mmu notifier like O_DIRECT) that runs against a __split_huge_page_refcount instead was a pain to serialize in a way that would always result in a coherent page count for both tail and head. I think my locking solution with a compound_lock taken only after the first_page is valid and is still a PageHead should be safe, but it surely needs review from an SMP race point of view. In short there is no existing way to serialize the O_DIRECT final put_page against split_huge_page_refcount, so I had to invent a new one (O_DIRECT loses knowledge of the mapping status by the time gup_fast returns so...). And I didn't want to impact all gup/gup_fast users for now; maybe if we change the gup interface substantially we can avoid this locking. I admit I didn't think too much about it, because changing the gup unpinning interface would be invasive. If we ignored O_DIRECT we could stick to the existing compound refcounting code, by simply adding a get_user_pages_fast_flags(foll_flags) where KVM (and any other mmu notifier user) would call it without FOLL_GET (and if FOLL_GET isn't set we'd just BUG_ON if nobody registered itself in the current task mmu notifier list yet). But O_DIRECT is fundamental for decent performance of virtualized I/O on fast storage, so we can't avoid it to solve the race of put_page against split_huge_page_refcount and achieve a complete hugepage feature for KVM.

Swap and oom work fine (well, just like with regular pages ;). MMU notifier is handled transparently too, with the exception of the young bit on the pmd, which doesn't have a range check, but I think KVM will be fine because the whole point of hugepages is that EPT/NPT will also use a huge pmd when they notice gup returns pages with PageCompound set, so they won't care about a range and there's just the pmd young bit to check in that case.

NOTE: in some cases, if the L2 cache is small, this may slow things down and waste memory during COWs, because 4M of memory are accessed in a single fault instead of 8k (the payoff is that after the COW the program can run faster); a small standalone sketch of this arithmetic follows after this description. So we might want to switch copy_huge_page (and clear_huge_page too) to non-temporal stores. I also extensively researched ways to avoid this cache thrashing with a full prefault logic that would cow in 8k/16k/32k/64k chunks up to 1M (I can send those patches that fully implemented prefault), but I concluded they're not worth it: they add a huge amount of additional complexity, and they remove all tlb benefits until the full hugepage has been faulted in, just to save a little bit of memory and some cache during app startup; and they still don't substantially improve the cache thrashing during startup if the prefault happens in >4k chunks. One reason is that those 4k pte entries copied are still mapped on a perfectly cache-colored hugepage, so the thrashing is the worst one can generate in those copies (cows of 4k pages aren't so well colored, so they thrash less, but again this results in software running faster after the page fault). Those prefault patches allowed things like a pte where post-cow pages were local 4k regular anon pages and the not-yet-cowed pte entries were pointing into the middle of some hugepage mapped read-only. If it doesn't pay off substantially with today's hardware, it will pay off even less in the future with larger L2 caches, and the prefault logic would bloat the VM a lot.

On embedded systems, transparent_hugepage can be disabled during boot with sysfs or with the boot command line parameter transparent_hugepage=0 (or transparent_hugepage=2 to restrict hugepages to madvise regions), which will ensure not a single hugepage is allocated at boot time. It is simple enough to just disable transparent hugepages globally and let transparent hugepages be allocated selectively by applications in MADV_HUGEPAGE regions (both at page fault time, and if enabled, through collapse_huge_page too, via the kernel daemon).

This patch supports only hugepages mapped in the pmd; archs that have smaller hugepages will not fit in this patch alone.
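Picking up the COW note above, the arithmetic is easy to check standalone; this sketch only restates the numbers in the text, using the x86-64 constants defined elsewhere in this patch:

============
#include <stdio.h>

#define PAGE_SHIFT	12
#define HPAGE_PMD_SHIFT	21			/* a pmd maps 2M on x86-64 */
#define HPAGE_PMD_ORDER	(HPAGE_PMD_SHIFT - PAGE_SHIFT)
#define HPAGE_PMD_NR	(1UL << HPAGE_PMD_ORDER)

int main(void)
{
	unsigned long page = 1UL << PAGE_SHIFT;
	unsigned long hpage = 1UL << HPAGE_PMD_SHIFT;

	/* a regular COW touches source plus destination: the 8k in the text */
	printf("4k cow touches %lu KB\n", 2 * page >> 10);
	/* a hugepage COW copies 2M over 2M: the 4M in the text */
	printf("2M cow touches %lu KB\n", 2 * hpage >> 10);
	/* one huge fault stands in for this many 4k faults */
	printf("small pages per hugepage: %lu\n", HPAGE_PMD_NR);	/* 512 */
	return 0;
}
============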
Also some archs, like power, have certain tlb limits that prevent mixing different page sizes in the same regions, so they will not fit in this framework, which requires "graceful fallback" to basic PAGE_SIZE in case of physical memory fragmentation. hugetlbfs remains a perfect fit for those because its software limits happen to match the hardware limits. hugetlbfs also remains a perfect fit for hugepage sizes like 1GByte that cannot be hoped to be found unfragmented after a certain system uptime, and that would be very expensive to defragment with relocation, so requiring reservation. hugetlbfs is the "reservation way"; the point of transparent hugepages is not to have any reservation at all and to maximize the use of cache and hugepages at all times automatically.

Some performance results:

vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
memset page fault 1566023
memset tlb miss 453854
memset second tlb miss 453321
random access tlb miss 41635
random access second tlb miss 41658

vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
memset page fault 1566471
memset tlb miss 453375
memset second tlb miss 453320
random access tlb miss 41636
random access second tlb miss 41637

vmx andrea # ./largepages3
memset page fault 1566642
memset tlb miss 453417
memset second tlb miss 453313
random access tlb miss 41630
random access second tlb miss 41647

vmx andrea # ./largepages3
memset page fault 1566872
memset tlb miss 453418
memset second tlb miss 453315
random access tlb miss 41618
random access second tlb miss 41659

vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
vmx andrea # ./largepages3
memset page fault 2182476
memset tlb miss 460305
memset second tlb miss 460179
random access tlb miss 44483
random access second tlb miss 44186

vmx andrea # ./largepages3
memset page fault 2182791
memset tlb miss 460742
memset second tlb miss 459962
random access tlb miss 43981
random access second tlb miss 43988

============
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define SIZE (3UL*1024*1024*1024)

int main()
{
	char *p = malloc(SIZE), *p2;
	struct timeval before, after;

	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset page fault %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset tlb miss %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset second tlb miss %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	for (p2 = p; p2 < p+SIZE; p2 += 4096)
		*p2 = 0;
	gettimeofday(&after, NULL);
	printf("random access tlb miss %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	for (p2 = p; p2 < p+SIZE; p2 += 4096)
		*p2 = 0;
	gettimeofday(&after, NULL);
	printf("random access second tlb miss %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);
	return 0;
}
============

Signed-off-by: Andrea Arcangeli Acked-by: Rik van Riel Signed-off-by: Johannes Weiner --- * * * adapt to mm_counter in -mm From: Andrea Arcangeli

The interface changed slightly.
Signed-off-by: Andrea Arcangeli Acked-by: Rik van Riel --- diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h --- a/arch/x86/include/asm/pgtable_64.h +++ b/arch/x86/include/asm/pgtable_64.h @@ -286,6 +286,11 @@ static inline pmd_t pmd_mkwrite(pmd_t pm return pmd_set_flags(pmd, _PAGE_RW); } +static inline pmd_t pmd_mknotpresent(pmd_t pmd) +{ + return pmd_clear_flags(pmd, _PAGE_PRESENT); +} + #endif /* !__ASSEMBLY__ */ #endif /* _ASM_X86_PGTABLE_64_H */ diff --git a/include/linux/gfp.h b/include/linux/gfp.h --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -87,6 +87,9 @@ struct vm_area_struct; __GFP_HARDWALL | __GFP_HIGHMEM | \ __GFP_MOVABLE) #define GFP_IOFS (__GFP_IO | __GFP_FS) +#define GFP_TRANSHUGE (__GFP_HARDWALL | __GFP_HIGHMEM | \ + __GFP_MOVABLE | __GFP_COMP | __GFP_NOMEMALLOC | \ + __GFP_NORETRY | __GFP_NOWARN | __GFP_NO_KSWAPD) #ifdef CONFIG_NUMA #define GFP_THISNODE (__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h new file mode 100644 --- /dev/null +++ b/include/linux/huge_mm.h @@ -0,0 +1,126 @@ +#ifndef _LINUX_HUGE_MM_H +#define _LINUX_HUGE_MM_H + +extern int do_huge_pmd_anonymous_page(struct mm_struct *mm, + struct vm_area_struct *vma, + unsigned long address, pmd_t *pmd, + unsigned int flags); +extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, + pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr, + struct vm_area_struct *vma); +extern int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, pmd_t *pmd, + pmd_t orig_pmd); +extern pgtable_t get_pmd_huge_pte(struct mm_struct *mm); +extern struct page *follow_trans_huge_pmd(struct mm_struct *mm, + unsigned long addr, + pmd_t *pmd, + unsigned int flags); +extern int zap_huge_pmd(struct mmu_gather *tlb, + struct vm_area_struct *vma, + pmd_t *pmd); + +enum transparent_hugepage_flag { + TRANSPARENT_HUGEPAGE_FLAG, + TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, + TRANSPARENT_HUGEPAGE_DEFRAG_FLAG, + TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, +#ifdef CONFIG_DEBUG_VM + TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG, +#endif +}; + +enum page_check_address_pmd_flag { + PAGE_CHECK_ADDRESS_PMD_FLAG, + PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, + PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG, +}; +extern pmd_t *page_check_address_pmd(struct page *page, + struct mm_struct *mm, + unsigned long address, + enum page_check_address_pmd_flag flag); + +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +#define HPAGE_PMD_SHIFT HPAGE_SHIFT +#define HPAGE_PMD_MASK HPAGE_MASK +#define HPAGE_PMD_SIZE HPAGE_SIZE + +#define transparent_hugepage_enabled(__vma) \ + (transparent_hugepage_flags & (1<vm_flags & VM_HUGEPAGE)) +#define transparent_hugepage_defrag(__vma) \ + ((transparent_hugepage_flags & \ + (1<vm_flags & VM_HUGEPAGE)) +#ifdef CONFIG_DEBUG_VM +#define transparent_hugepage_debug_cow() \ + (transparent_hugepage_flags & \ + (1<lock); \ + /* \ + * spin_unlock_wait() is just a loop in C and so the \ + * CPU can reorder anything around it. 
\ + */ \ + smp_mb(); \ + BUG_ON(pmd_trans_splitting(*____pmd) || \ + pmd_trans_huge(*____pmd)); \ + } while (0) +#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT) +#define HPAGE_PMD_NR (1< MAX_ORDER +#error "hugepages can't be allocated by the buddy allocator" +#endif + +extern unsigned long vma_address(struct page *page, struct vm_area_struct *vma); +static inline int PageTransHuge(struct page *page) +{ + VM_BUG_ON(PageTail(page)); + return PageHead(page); +} +#else /* CONFIG_TRANSPARENT_HUGEPAGE */ +#define HPAGE_PMD_SHIFT ({ BUG(); 0; }) +#define HPAGE_PMD_MASK ({ BUG(); 0; }) +#define HPAGE_PMD_SIZE ({ BUG(); 0; }) + +#define transparent_hugepage_enabled(__vma) 0 + +#define transparent_hugepage_flags 0UL +static inline int split_huge_page(struct page *page) +{ + return 0; +} +#define split_huge_page_pmd(__mm, __pmd) \ + do { } while (0) +#define wait_split_huge_page(__anon_vma, __pmd) \ + do { } while (0) +#define PageTransHuge(page) 0 +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ + +#endif /* _LINUX_HUGE_MM_H */ diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -107,6 +107,9 @@ extern unsigned int kobjsize(const void #define VM_SAO 0x20000000 /* Strong Access Ordering (powerpc) */ #define VM_PFN_AT_MMAP 0x40000000 /* PFNMAP vma that is fully mapped at mmap time */ #define VM_MERGEABLE 0x80000000 /* KSM may merge identical pages */ +#if BITS_PER_LONG > 32 +#define VM_HUGEPAGE 0x100000000UL /* MADV_HUGEPAGE marked this vma */ +#endif #ifndef VM_STACK_DEFAULT_FLAGS /* arch can override this */ #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS @@ -235,6 +238,7 @@ struct inode; * files which need it (119 of them) */ #include +#include /* * Methods to modify the page usage count. diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h --- a/include/linux/mm_inline.h +++ b/include/linux/mm_inline.h @@ -20,11 +20,18 @@ static inline int page_is_file_cache(str } static inline void +__add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l, + struct list_head *head) +{ + list_add(&page->lru, head); + __inc_zone_state(zone, NR_LRU_BASE + l); + mem_cgroup_add_lru_list(page, l); +} + +static inline void add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l) { - list_add(&page->lru, &zone->lru[l].list); - __inc_zone_state(zone, NR_LRU_BASE + l); - mem_cgroup_add_lru_list(page, l); + __add_page_to_lru_list(zone, page, l, &zone->lru[l].list); } static inline void diff --git a/include/linux/swap.h b/include/linux/swap.h --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -205,6 +205,8 @@ extern unsigned int nr_free_pagecache_pa /* linux/mm/swap.c */ extern void __lru_cache_add(struct page *, enum lru_list lru); extern void lru_cache_add_lru(struct page *, enum lru_list lru); +extern void lru_add_page_tail(struct zone* zone, + struct page *page, struct page *page_tail); extern void activate_page(struct page *); extern void mark_page_accessed(struct page *); extern void lru_add_drain(void); diff --git a/mm/Makefile b/mm/Makefile --- a/mm/Makefile +++ b/mm/Makefile @@ -40,3 +40,4 @@ obj-$(CONFIG_MEMORY_FAILURE) += memory-f obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o +obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o diff --git a/mm/huge_memory.c b/mm/huge_memory.c new file mode 100644 --- /dev/null +++ b/mm/huge_memory.c @@ -0,0 +1,867 @@ +/* + * Copyright (C) 2009 Red Hat, Inc. 
+ * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "internal.h" + +unsigned long transparent_hugepage_flags __read_mostly = + (1<page_table_lock)); + + /* FIFO */ + if (!mm->pmd_huge_pte) + INIT_LIST_HEAD(&pgtable->lru); + else + list_add(&pgtable->lru, &mm->pmd_huge_pte->lru); + mm->pmd_huge_pte = pgtable; +} + +static inline pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma) +{ + if (likely(vma->vm_flags & VM_WRITE)) + pmd = pmd_mkwrite(pmd); + return pmd; +} + +static int __do_huge_pmd_anonymous_page(struct mm_struct *mm, + struct vm_area_struct *vma, + unsigned long haddr, pmd_t *pmd, + struct page *page) +{ + int ret = 0; + pgtable_t pgtable; + + VM_BUG_ON(!PageCompound(page)); + pgtable = pte_alloc_one(mm, haddr); + if (unlikely(!pgtable)) { + put_page(page); + return VM_FAULT_OOM; + } + + clear_huge_page(page, haddr, HPAGE_PMD_NR); + __SetPageUptodate(page); + + spin_lock(&mm->page_table_lock); + if (unlikely(!pmd_none(*pmd))) { + spin_unlock(&mm->page_table_lock); + put_page(page); + pte_free(mm, pgtable); + } else { + pmd_t entry; + entry = mk_pmd(page, vma->vm_page_prot); + entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); + entry = pmd_mkhuge(entry); + /* + * The spinlocking to take the lru_lock inside + * page_add_new_anon_rmap() acts as a full memory + * barrier to be sure clear_huge_page writes become + * visible after the set_pmd_at() write. + */ + page_add_new_anon_rmap(page, vma, haddr); + set_pmd_at(mm, haddr, pmd, entry); + prepare_pmd_huge_pte(pgtable, mm); + add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR); + spin_unlock(&mm->page_table_lock); + } + + return ret; +} + +static inline struct page *alloc_hugepage(int defrag) +{ + return alloc_pages(GFP_TRANSHUGE | (defrag ? 
__GFP_WAIT : 0), + HPAGE_PMD_ORDER); +} + +int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, pmd_t *pmd, + unsigned int flags) +{ + struct page *page; + unsigned long haddr = address & HPAGE_PMD_MASK; + pte_t *pte; + + if (haddr >= vma->vm_start && haddr + HPAGE_PMD_SIZE <= vma->vm_end) { + if (unlikely(anon_vma_prepare(vma))) + return VM_FAULT_OOM; + page = alloc_hugepage(transparent_hugepage_defrag(vma)); + if (unlikely(!page)) + goto out; + + return __do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page); + } +out: + pte = pte_alloc_map(mm, vma, pmd, address); + if (!pte) + return VM_FAULT_OOM; + return handle_pte_fault(mm, vma, address, pte, pmd, flags); +} + +int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, + pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr, + struct vm_area_struct *vma) +{ + struct page *src_page; + pmd_t pmd; + pgtable_t pgtable; + int ret; + + ret = -ENOMEM; + pgtable = pte_alloc_one(dst_mm, addr); + if (unlikely(!pgtable)) + goto out; + + spin_lock(&dst_mm->page_table_lock); + spin_lock_nested(&src_mm->page_table_lock, SINGLE_DEPTH_NESTING); + + ret = -EAGAIN; + pmd = *src_pmd; + if (unlikely(!pmd_trans_huge(pmd))) + goto out_unlock; + if (unlikely(pmd_trans_splitting(pmd))) { + /* split huge page running from under us */ + spin_unlock(&src_mm->page_table_lock); + spin_unlock(&dst_mm->page_table_lock); + + wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */ + goto out; + } + src_page = pmd_page(pmd); + VM_BUG_ON(!PageHead(src_page)); + get_page(src_page); + page_dup_rmap(src_page); + add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR); + + pmdp_set_wrprotect(src_mm, addr, src_pmd); + pmd = pmd_mkold(pmd_wrprotect(pmd)); + set_pmd_at(dst_mm, addr, dst_pmd, pmd); + prepare_pmd_huge_pte(pgtable, dst_mm); + + ret = 0; +out_unlock: + spin_unlock(&src_mm->page_table_lock); + spin_unlock(&dst_mm->page_table_lock); +out: + return ret; +} + +/* no "address" argument so destroys page coloring of some arch */ +pgtable_t get_pmd_huge_pte(struct mm_struct *mm) +{ + pgtable_t pgtable; + + VM_BUG_ON(spin_can_lock(&mm->page_table_lock)); + + /* FIFO */ + pgtable = mm->pmd_huge_pte; + if (list_empty(&pgtable->lru)) + mm->pmd_huge_pte = NULL; + else { + mm->pmd_huge_pte = list_entry(pgtable->lru.next, + struct page, lru); + list_del(&pgtable->lru); + } + return pgtable; +} + +static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm, + struct vm_area_struct *vma, + unsigned long address, + pmd_t *pmd, pmd_t orig_pmd, + struct page *page, + unsigned long haddr) +{ + pgtable_t pgtable; + pmd_t _pmd; + int ret = 0, i; + struct page **pages; + + pages = kmalloc(sizeof(struct page *) * HPAGE_PMD_NR, + GFP_KERNEL); + if (unlikely(!pages)) { + ret |= VM_FAULT_OOM; + goto out; + } + + for (i = 0; i < HPAGE_PMD_NR; i++) { + pages[i] = alloc_page_vma(GFP_HIGHUSER_MOVABLE, + vma, address); + if (unlikely(!pages[i])) { + while (--i >= 0) + put_page(pages[i]); + kfree(pages); + ret |= VM_FAULT_OOM; + goto out; + } + } + + spin_lock(&mm->page_table_lock); + if (unlikely(!pmd_same(*pmd, orig_pmd))) + goto out_free_pages; + else + get_page(page); + spin_unlock(&mm->page_table_lock); + + for (i = 0; i < HPAGE_PMD_NR; i++) { + copy_user_highpage(pages[i], page + i, + haddr + PAGE_SHIFT*i, vma); + __SetPageUptodate(pages[i]); + cond_resched(); + } + + spin_lock(&mm->page_table_lock); + if (unlikely(!pmd_same(*pmd, orig_pmd))) + goto out_free_pages; + else + put_page(page); + + pmdp_clear_flush_notify(vma, haddr, 
pmd); + /* leave pmd empty until pte is filled */ + + pgtable = get_pmd_huge_pte(mm); + pmd_populate(mm, &_pmd, pgtable); + + for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) { + pte_t *pte, entry; + entry = mk_pte(pages[i], vma->vm_page_prot); + entry = maybe_mkwrite(pte_mkdirty(entry), vma); + page_add_new_anon_rmap(pages[i], vma, haddr); + pte = pte_offset_map(&_pmd, haddr); + VM_BUG_ON(!pte_none(*pte)); + set_pte_at(mm, haddr, pte, entry); + pte_unmap(pte); + } + kfree(pages); + + mm->nr_ptes++; + smp_wmb(); /* make pte visible before pmd */ + pmd_populate(mm, pmd, pgtable); + page_remove_rmap(page); + spin_unlock(&mm->page_table_lock); + + ret |= VM_FAULT_WRITE; + put_page(page); + +out: + return ret; + +out_free_pages: + spin_unlock(&mm->page_table_lock); + for (i = 0; i < HPAGE_PMD_NR; i++) + put_page(pages[i]); + kfree(pages); + goto out; +} + +int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, pmd_t *pmd, pmd_t orig_pmd) +{ + int ret = 0; + struct page *page, *new_page; + unsigned long haddr; + + VM_BUG_ON(!vma->anon_vma); + spin_lock(&mm->page_table_lock); + if (unlikely(!pmd_same(*pmd, orig_pmd))) + goto out_unlock; + + page = pmd_page(orig_pmd); + VM_BUG_ON(!PageCompound(page) || !PageHead(page)); + haddr = address & HPAGE_PMD_MASK; + if (page_mapcount(page) == 1) { + pmd_t entry; + entry = pmd_mkyoung(orig_pmd); + entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); + if (pmdp_set_access_flags(vma, haddr, pmd, entry, 1)) + update_mmu_cache(vma, address, entry); + ret |= VM_FAULT_WRITE; + goto out_unlock; + } + spin_unlock(&mm->page_table_lock); + + if (transparent_hugepage_enabled(vma) && + !transparent_hugepage_debug_cow()) + new_page = alloc_hugepage(transparent_hugepage_defrag(vma)); + else + new_page = NULL; + + if (unlikely(!new_page)) { + ret = do_huge_pmd_wp_page_fallback(mm, vma, address, + pmd, orig_pmd, page, haddr); + goto out; + } + + copy_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR); + __SetPageUptodate(new_page); + + spin_lock(&mm->page_table_lock); + if (unlikely(!pmd_same(*pmd, orig_pmd))) + put_page(new_page); + else { + pmd_t entry; + entry = mk_pmd(new_page, vma->vm_page_prot); + entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); + entry = pmd_mkhuge(entry); + pmdp_clear_flush_notify(vma, haddr, pmd); + page_add_new_anon_rmap(new_page, vma, haddr); + set_pmd_at(mm, haddr, pmd, entry); + update_mmu_cache(vma, address, entry); + page_remove_rmap(page); + put_page(page); + ret |= VM_FAULT_WRITE; + } +out_unlock: + spin_unlock(&mm->page_table_lock); +out: + return ret; +} + +struct page *follow_trans_huge_pmd(struct mm_struct *mm, + unsigned long addr, + pmd_t *pmd, + unsigned int flags) +{ + struct page *page = NULL; + + VM_BUG_ON(spin_can_lock(&mm->page_table_lock)); + + if (flags & FOLL_WRITE && !pmd_write(*pmd)) + goto out; + + page = pmd_page(*pmd); + VM_BUG_ON(!PageHead(page)); + if (flags & FOLL_TOUCH) { + pmd_t _pmd; + /* + * We should set the dirty bit only for FOLL_WRITE but + * for now the dirty bit in the pmd is meaningless. + * And if the dirty bit will become meaningful and + * we'll only set it with FOLL_WRITE, an atomic + * set_bit will be required on the pmd to set the + * young bit, instead of the current set_pmd_at. 
+ */ + _pmd = pmd_mkyoung(pmd_mkdirty(*pmd)); + set_pmd_at(mm, addr & HPAGE_PMD_MASK, pmd, _pmd); + } + page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT; + VM_BUG_ON(!PageCompound(page)); + if (flags & FOLL_GET) + get_page(page); + +out: + return page; +} + +int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, + pmd_t *pmd) +{ + int ret = 0; + + spin_lock(&tlb->mm->page_table_lock); + if (likely(pmd_trans_huge(*pmd))) { + if (unlikely(pmd_trans_splitting(*pmd))) { + spin_unlock(&tlb->mm->page_table_lock); + wait_split_huge_page(vma->anon_vma, + pmd); + } else { + struct page *page; + pgtable_t pgtable; + pgtable = get_pmd_huge_pte(tlb->mm); + page = pmd_page(*pmd); + pmd_clear(pmd); + page_remove_rmap(page); + VM_BUG_ON(page_mapcount(page) < 0); + add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR); + spin_unlock(&tlb->mm->page_table_lock); + VM_BUG_ON(!PageHead(page)); + tlb_remove_page(tlb, page); + pte_free(tlb->mm, pgtable); + ret = 1; + } + } else + spin_unlock(&tlb->mm->page_table_lock); + + return ret; +} + +pmd_t *page_check_address_pmd(struct page *page, + struct mm_struct *mm, + unsigned long address, + enum page_check_address_pmd_flag flag) +{ + pgd_t *pgd; + pud_t *pud; + pmd_t *pmd, *ret = NULL; + + if (address & ~HPAGE_PMD_MASK) + goto out; + + pgd = pgd_offset(mm, address); + if (!pgd_present(*pgd)) + goto out; + + pud = pud_offset(pgd, address); + if (!pud_present(*pud)) + goto out; + + pmd = pmd_offset(pud, address); + if (pmd_none(*pmd)) + goto out; + if (pmd_page(*pmd) != page) + goto out; + VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG && + pmd_trans_splitting(*pmd)); + if (pmd_trans_huge(*pmd)) { + VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG && + !pmd_trans_splitting(*pmd)); + ret = pmd; + } +out: + return ret; +} + +static int __split_huge_page_splitting(struct page *page, + struct vm_area_struct *vma, + unsigned long address) +{ + struct mm_struct *mm = vma->vm_mm; + pmd_t *pmd; + int ret = 0; + + spin_lock(&mm->page_table_lock); + pmd = page_check_address_pmd(page, mm, address, + PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG); + if (pmd) { + /* + * We can't temporarily set the pmd to null in order + * to split it, the pmd must remain marked huge at all + * times or the VM won't take the pmd_trans_huge paths + * and it won't wait on the anon_vma->lock to + * serialize against split_huge_page*. 
+ */ + pmdp_splitting_flush_notify(vma, address, pmd); + ret = 1; + } + spin_unlock(&mm->page_table_lock); + + return ret; +} + +static void __split_huge_page_refcount(struct page *page) +{ + int i; + unsigned long head_index = page->index; + struct zone *zone = page_zone(page); + + /* prevent PageLRU to go away from under us, and freeze lru stats */ + spin_lock_irq(&zone->lru_lock); + compound_lock(page); + + for (i = 1; i < HPAGE_PMD_NR; i++) { + struct page *page_tail = page + i; + + /* tail_page->_count cannot change */ + atomic_sub(atomic_read(&page_tail->_count), &page->_count); + BUG_ON(page_count(page) <= 0); + atomic_add(page_mapcount(page) + 1, &page_tail->_count); + BUG_ON(atomic_read(&page_tail->_count) <= 0); + + /* after clearing PageTail the gup refcount can be released */ + smp_mb(); + + page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP; + page_tail->flags |= (page->flags & + ((1L << PG_referenced) | + (1L << PG_swapbacked) | + (1L << PG_mlocked) | + (1L << PG_uptodate))); + page_tail->flags |= (1L << PG_dirty); + + /* + * 1) clear PageTail before overwriting first_page + * 2) clear PageTail before clearing PageHead for VM_BUG_ON + */ + smp_wmb(); + + /* + * __split_huge_page_splitting() already set the + * splitting bit in all pmd that could map this + * hugepage, that will ensure no CPU can alter the + * mapcount on the head page. The mapcount is only + * accounted in the head page and it has to be + * transferred to all tail pages in the below code. So + * for this code to be safe, the split the mapcount + * can't change. But that doesn't mean userland can't + * keep changing and reading the page contents while + * we transfer the mapcount, so the pmd splitting + * status is achieved setting a reserved bit in the + * pmd, not by clearing the present bit. + */ + BUG_ON(page_mapcount(page_tail)); + page_tail->_mapcount = page->_mapcount; + + BUG_ON(page_tail->mapping); + page_tail->mapping = page->mapping; + + page_tail->index = ++head_index; + + BUG_ON(!PageAnon(page_tail)); + BUG_ON(!PageUptodate(page_tail)); + BUG_ON(!PageDirty(page_tail)); + BUG_ON(!PageSwapBacked(page_tail)); + + lru_add_page_tail(zone, page, page_tail); + + put_page(page_tail); + } + + ClearPageCompound(page); + compound_unlock(page); + spin_unlock_irq(&zone->lru_lock); + + BUG_ON(page_count(page) <= 0); +} + +static int __split_huge_page_map(struct page *page, + struct vm_area_struct *vma, + unsigned long address) +{ + struct mm_struct *mm = vma->vm_mm; + pmd_t *pmd, _pmd; + int ret = 0, i; + pgtable_t pgtable; + unsigned long haddr; + + spin_lock(&mm->page_table_lock); + pmd = page_check_address_pmd(page, mm, address, + PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG); + if (pmd) { + pgtable = get_pmd_huge_pte(mm); + pmd_populate(mm, &_pmd, pgtable); + + for (i = 0, haddr = address; i < HPAGE_PMD_NR; + i++, haddr += PAGE_SIZE) { + pte_t *pte, entry; + BUG_ON(PageCompound(page+i)); + entry = mk_pte(page + i, vma->vm_page_prot); + entry = maybe_mkwrite(pte_mkdirty(entry), vma); + if (!pmd_write(*pmd)) + entry = pte_wrprotect(entry); + else + BUG_ON(page_mapcount(page) != 1); + if (!pmd_young(*pmd)) + entry = pte_mkold(entry); + pte = pte_offset_map(&_pmd, haddr); + BUG_ON(!pte_none(*pte)); + set_pte_at(mm, haddr, pte, entry); + pte_unmap(pte); + } + + mm->nr_ptes++; + smp_wmb(); /* make pte visible before pmd */ + /* + * Up to this point the pmd is present and huge and + * userland has the whole access to the hugepage + * during the split (which happens in place). 
If we + * overwrite the pmd with the not-huge version + * pointing to the pte here (which of course we could + * if all CPUs were bug free), userland could trigger + * a small page size TLB miss on the small sized TLB + * while the hugepage TLB entry is still established + * in the huge TLB. Some CPU doesn't like that. See + * http://support.amd.com/us/Processor_TechDocs/41322.pdf, + * Erratum 383 on page 93. Intel should be safe but is + * also warns that it's only safe if the permission + * and cache attributes of the two entries loaded in + * the two TLB is identical (which should be the case + * here). But it is generally safer to never allow + * small and huge TLB entries for the same virtual + * address to be loaded simultaneously. So instead of + * doing "pmd_populate(); flush_tlb_range();" we first + * mark the current pmd notpresent (atomically because + * here the pmd_trans_huge and pmd_trans_splitting + * must remain set at all times on the pmd until the + * split is complete for this pmd), then we flush the + * SMP TLB and finally we write the non-huge version + * of the pmd entry with pmd_populate. + */ + set_pmd_at(mm, address, pmd, pmd_mknotpresent(*pmd)); + flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE); + pmd_populate(mm, pmd, pgtable); + ret = 1; + } + spin_unlock(&mm->page_table_lock); + + return ret; +} + +/* must be called with anon_vma->lock hold */ +static void __split_huge_page(struct page *page, + struct anon_vma *anon_vma) +{ + int mapcount, mapcount2; + struct anon_vma_chain *avc; + + BUG_ON(!PageHead(page)); + BUG_ON(PageTail(page)); + + mapcount = 0; + list_for_each_entry(avc, &anon_vma->head, same_anon_vma) { + struct vm_area_struct *vma = avc->vma; + unsigned long addr = vma_address(page, vma); + if (addr == -EFAULT) + continue; + mapcount += __split_huge_page_splitting(page, vma, addr); + } + BUG_ON(mapcount != page_mapcount(page)); + + __split_huge_page_refcount(page); + + mapcount2 = 0; + list_for_each_entry(avc, &anon_vma->head, same_anon_vma) { + struct vm_area_struct *vma = avc->vma; + unsigned long addr = vma_address(page, vma); + if (addr == -EFAULT) + continue; + mapcount2 += __split_huge_page_map(page, vma, addr); + } + BUG_ON(mapcount != mapcount2); +} + +int split_huge_page(struct page *page) +{ + struct anon_vma *anon_vma; + int ret = 1; + + BUG_ON(!PageAnon(page)); + anon_vma = page_lock_anon_vma(page); + if (!anon_vma) + goto out; + ret = 0; + if (!PageCompound(page)) + goto out_unlock; + + BUG_ON(!PageSwapBacked(page)); + __split_huge_page(page, anon_vma); + + BUG_ON(PageCompound(page)); +out_unlock: + page_unlock_anon_vma(anon_vma); +out: + return ret; +} + +void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd) +{ + struct page *page; + + spin_lock(&mm->page_table_lock); + if (unlikely(!pmd_trans_huge(*pmd))) { + spin_unlock(&mm->page_table_lock); + return; + } + page = pmd_page(*pmd); + VM_BUG_ON(!page_count(page)); + get_page(page); + spin_unlock(&mm->page_table_lock); + + /* + * The vma->anon_vma->lock is the wrong lock if the page is shared, + * the anon_vma->lock pointed by page->mapping is the right one. 
+ */ + split_huge_page(page); + + put_page(page); + BUG_ON(pmd_trans_huge(*pmd)); +} diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -728,9 +728,9 @@ out_set_pte: return 0; } -static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, - pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma, - unsigned long addr, unsigned long end) +int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, + pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma, + unsigned long addr, unsigned long end) { pte_t *orig_src_pte, *orig_dst_pte; pte_t *src_pte, *dst_pte; @@ -804,6 +804,16 @@ static inline int copy_pmd_range(struct src_pmd = pmd_offset(src_pud, addr); do { next = pmd_addr_end(addr, end); + if (pmd_trans_huge(*src_pmd)) { + int err; + err = copy_huge_pmd(dst_mm, src_mm, + dst_pmd, src_pmd, addr, vma); + if (err == -ENOMEM) + return -ENOMEM; + if (!err) + continue; + /* fall through */ + } if (pmd_none_or_clear_bad(src_pmd)) continue; if (copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd, @@ -1006,6 +1016,15 @@ static inline unsigned long zap_pmd_rang pmd = pmd_offset(pud, addr); do { next = pmd_addr_end(addr, end); + if (pmd_trans_huge(*pmd)) { + if (next-addr != HPAGE_PMD_SIZE) + split_huge_page_pmd(vma->vm_mm, pmd); + else if (zap_huge_pmd(tlb, vma, pmd)) { + (*zap_work)--; + continue; + } + /* fall through */ + } if (pmd_none_or_clear_bad(pmd)) { (*zap_work)--; continue; @@ -1273,11 +1292,27 @@ struct page *follow_page(struct vm_area_ pmd = pmd_offset(pud, address); if (pmd_none(*pmd)) goto no_page_table; - if (pmd_huge(*pmd)) { + if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) { BUG_ON(flags & FOLL_GET); page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE); goto out; } + if (pmd_trans_huge(*pmd)) { + spin_lock(&mm->page_table_lock); + if (likely(pmd_trans_huge(*pmd))) { + if (unlikely(pmd_trans_splitting(*pmd))) { + spin_unlock(&mm->page_table_lock); + wait_split_huge_page(vma->anon_vma, pmd); + } else { + page = follow_trans_huge_pmd(mm, address, + pmd, flags); + spin_unlock(&mm->page_table_lock); + goto out; + } + } else + spin_unlock(&mm->page_table_lock); + /* fall through */ + } if (unlikely(pmd_bad(*pmd))) goto no_page_table; @@ -3045,9 +3080,9 @@ static int do_nonlinear_fault(struct mm_ * but allow concurrent faults), and pte mapped but not yet locked. * We return with mmap_sem still held, but pte unmapped and unlocked. 
*/ -static inline int handle_pte_fault(struct mm_struct *mm, - struct vm_area_struct *vma, unsigned long address, - pte_t *pte, pmd_t *pmd, unsigned int flags) +int handle_pte_fault(struct mm_struct *mm, + struct vm_area_struct *vma, unsigned long address, + pte_t *pte, pmd_t *pmd, unsigned int flags) { pte_t entry; spinlock_t *ptl; @@ -3126,6 +3161,22 @@ int handle_mm_fault(struct mm_struct *mm pmd = pmd_alloc(mm, pud, address); if (!pmd) return VM_FAULT_OOM; + if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) { + if (!vma->vm_ops) + return do_huge_pmd_anonymous_page(mm, vma, address, + pmd, flags); + } else { + pmd_t orig_pmd = *pmd; + barrier(); + if (pmd_trans_huge(orig_pmd)) { + if (flags & FAULT_FLAG_WRITE && + !pmd_write(orig_pmd) && + !pmd_trans_splitting(orig_pmd)) + return do_huge_pmd_wp_page(mm, vma, address, + pmd, orig_pmd); + return 0; + } + } pte = pte_alloc_map(mm, vma, pmd, address); if (!pte) return VM_FAULT_OOM; diff --git a/mm/rmap.c b/mm/rmap.c --- a/mm/rmap.c +++ b/mm/rmap.c @@ -56,6 +56,7 @@ #include #include #include +#include #include @@ -318,7 +319,7 @@ void page_unlock_anon_vma(struct anon_vm * Returns virtual address or -EFAULT if page's index/offset is not * within the range mapped the @vma. */ -static inline unsigned long +inline unsigned long vma_address(struct page *page, struct vm_area_struct *vma) { pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT); @@ -432,35 +433,17 @@ int page_referenced_one(struct page *pag unsigned long *vm_flags) { struct mm_struct *mm = vma->vm_mm; - pte_t *pte; - spinlock_t *ptl; int referenced = 0; - pte = page_check_address(page, mm, address, &ptl, 0); - if (!pte) - goto out; - /* * Don't want to elevate referenced for mlocked page that gets this far, * in order that it progresses to try_to_unmap and is moved to the * unevictable list. */ if (vma->vm_flags & VM_LOCKED) { - *mapcount = 1; /* break early from loop */ + *mapcount = 0; /* break early from loop */ *vm_flags |= VM_LOCKED; - goto out_unmap; - } - - if (ptep_clear_flush_young_notify(vma, address, pte)) { - /* - * Don't treat a reference through a sequentially read - * mapping as such. If the page has been used in - * another mapping, we will catch it; if this other - * mapping is already gone, the unmap path will have - * set PG_referenced or activated the page. - */ - if (likely(!VM_SequentialReadHint(vma))) - referenced++; + goto out; } /* Pretend the page is referenced if the task has the @@ -469,9 +452,39 @@ int page_referenced_one(struct page *pag rwsem_is_locked(&mm->mmap_sem)) referenced++; -out_unmap: + if (unlikely(PageTransHuge(page))) { + pmd_t *pmd; + + spin_lock(&mm->page_table_lock); + pmd = page_check_address_pmd(page, mm, address, + PAGE_CHECK_ADDRESS_PMD_FLAG); + if (pmd && !pmd_trans_splitting(*pmd) && + pmdp_clear_flush_young_notify(vma, address, pmd)) + referenced++; + spin_unlock(&mm->page_table_lock); + } else { + pte_t *pte; + spinlock_t *ptl; + + pte = page_check_address(page, mm, address, &ptl, 0); + if (!pte) + goto out; + + if (ptep_clear_flush_young_notify(vma, address, pte)) { + /* + * Don't treat a reference through a sequentially read + * mapping as such. If the page has been used in + * another mapping, we will catch it; if this other + * mapping is already gone, the unmap path will have + * set PG_referenced or activated the page. 
+ */ + if (likely(!VM_SequentialReadHint(vma))) + referenced++; + } + pte_unmap_unlock(pte, ptl); + } + (*mapcount)--; - pte_unmap_unlock(pte, ptl); if (referenced) *vm_flags |= vma->vm_flags; diff --git a/mm/swap.c b/mm/swap.c --- a/mm/swap.c +++ b/mm/swap.c @@ -461,6 +461,43 @@ void __pagevec_release(struct pagevec *p EXPORT_SYMBOL(__pagevec_release); +/* used by __split_huge_page_refcount() */ +void lru_add_page_tail(struct zone* zone, + struct page *page, struct page *page_tail) +{ + int active; + enum lru_list lru; + const int file = 0; + struct list_head *head; + + VM_BUG_ON(!PageHead(page)); + VM_BUG_ON(PageCompound(page_tail)); + VM_BUG_ON(PageLRU(page_tail)); + VM_BUG_ON(!spin_is_locked(&zone->lru_lock)); + + SetPageLRU(page_tail); + + if (page_evictable(page_tail, NULL)) { + if (PageActive(page)) { + SetPageActive(page_tail); + active = 1; + lru = LRU_ACTIVE_ANON; + } else { + active = 0; + lru = LRU_INACTIVE_ANON; + } + update_page_reclaim_stat(zone, page_tail, file, active); + if (likely(PageLRU(page))) + head = page->lru.prev; + else + head = &zone->lru[lru].list; + __add_page_to_lru_list(zone, page_tail, lru, head); + } else { + SetPageUnevictable(page_tail); + add_page_to_lru_list(zone, page_tail, LRU_UNEVICTABLE); + } +} + /* * Add the passed pages to the LRU, then drop the caller's refcount * on them. Reinitialises the caller's pagevec. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with ESMTP id EE6A86B01E3 for ; Mon, 5 Apr 2010 15:10:27 -0400 (EDT) Date: Mon, 5 Apr 2010 12:09:06 -0700 From: Andrew Morton Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-Id: <20100405120906.0abe8e58.akpm@linux-foundation.org> In-Reply-To: References: Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Andrea Arcangeli Cc: linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: Problem. It appears that these patches have only been sent to linux-mm. Linus doesn't read linux-mm and has never seen them. I do think we should get things squared away with him regarding the overall intent and implementation approach before trying to go further. I forwarded "[PATCH 27 of 41] transparent hugepage core" and his summary was "So I don't hate the patch, but it sure as hell doesn't make me happy either. And if the only advantage is about TLB miss costs, I really don't see the point personally.". So if there's more benefit to the patches than this, that will need some expounding upon. So I'd suggest that you a) address some minor Linus comments which I'll forward separately, b) rework [patch 0/n] to provide a complete description of the benefits and the downsides (if that isn't there already) and c) resend everything, cc'ing Linus and linux-kernel and we'll get it thrashed out. Sorry. 
Normally I use my own judgement on MM patches, but in this case if I was asked "why did you send all this stuff", I don't believe I personally have strong enough arguments to justify the changes - you're in a better position than I to make that case. Plus this is a *large* patchset, and it plays in an area where Linus is known to have, err, opinions. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with ESMTP id 7C8256B01E3 for ; Mon, 5 Apr 2010 15:36:44 -0400 (EDT) Date: Mon, 5 Apr 2010 21:36:16 +0200 From: Ingo Molnar Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100405193616.GA5125@elte.hu> References: <20100405120906.0abe8e58.akpm@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100405120906.0abe8e58.akpm@linux-foundation.org> Sender: owner-linux-mm@kvack.org To: Andrew Morton , Linus Torvalds Cc: Andrea Arcangeli , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura , Pekka Enberg List-ID: * Andrew Morton wrote: > Problem. It appears that these patches have only been sent to linux-mm. > Linus doesn't read linux-mm and has never seen them. I do think we should > get things squared away with him regarding the overall intent and > implementation approach before trying to go further. > > I forwarded "[PATCH 27 of 41] transparent hugepage core" and his summary was > "So I don't hate the patch, but it sure as hell doesn't make me happy > either. And if the only advantage is about TLB miss costs, I really don't > see the point personally.". So if there's more benefit to the patches than > this, that will need some expounding upon. > > So I'd suggest that you a) address some minor Linus comments which I'll > forward separately, b) rework [patch 0/n] to provide a complete description > of the benefits and the downsides (if that isn't there already) and c) > resend everything, cc'ing Linus and linux-kernel and we'll get it thrashed > out. > > Sorry. Normally I use my own judgement on MM patches, but in this case if I > was asked "why did you send all this stuff", I don't believe I personally > have strong enough arguments to justify the changes - you're in a better > position than I to make that case. Plus this is a *large* patchset, and it > plays in an area where Linus is known to have, err, opinions. Not sure whether it got mentioned but one area where huge pages are rather useful is apps/middleware that do some sort of GC with tons of RAM. There the 512x reduction in remapping and TLB flush costs (not just TLB miss costs) obviously makes for a big difference not just in straight performance/latency but also in cache footprint. AFAIK most GC concepts today (that cover many gigabytes of memory) are limited by remap and TLB flush performance.
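(For scale: the 512x figure is just the page-size ratio, 2048kB / 4kB = 512, so remapping a region by pmds instead of ptes touches 1/512th as many entries. The same ratio bounds the TLB-pressure point: sweeping a 256MB young generation touches 65536 distinct 4kB translations per pass but only 128 2MB ones. Back-of-the-envelope arithmetic only, not a measurement.)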
So if we accept that shuffling lots of virtual memory is worth doing then the next natural step would be to make it transparent. Just my 2c, Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id 094BD6B01E3 for ; Mon, 5 Apr 2010 16:26:55 -0400 (EDT) Received: by fg-out-1718.google.com with SMTP id l26so630011fgb.8 for ; Mon, 05 Apr 2010 13:26:53 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <20100405193616.GA5125@elte.hu> References: <20100405120906.0abe8e58.akpm@linux-foundation.org> <20100405193616.GA5125@elte.hu> Date: Mon, 5 Apr 2010 23:26:52 +0300 Message-ID: Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 From: Pekka Enberg Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org To: Ingo Molnar Cc: Andrew Morton , Linus Torvalds , Andrea Arcangeli , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: Hi Ingo, On Mon, Apr 5, 2010 at 10:36 PM, Ingo Molnar wrote: >> Problem. It appears that these patches have only been sent to linux-mm. >> Linus doesn't read linux-mm and has never seen them. I do think we should >> get things squared away with him regarding the overall intent and >> implementation approach before trying to go further. >> >> I forwarded "[PATCH 27 of 41] transparent hugepage core" and his summary was >> "So I don't hate the patch, but it sure as hell doesn't make me happy >> either. And if the only advantage is about TLB miss costs, I really don't >> see the point personally.". So if there's more benefit to the patches than >> this, that will need some expounding upon. >> >> So I'd suggest that you a) address some minor Linus comments which I'll >> forward separately, b) rework [patch 0/n] to provide a complete description >> of the benefits and the downsides (if that isn't there already) and c) >> resend everything, cc'ing Linus and linux-kernel and we'll get it thrashed >> out. >> >> Sorry. Normally I use my own judgement on MM patches, but in this case if I >> was asked "why did you send all this stuff", I don't believe I personally >> have strong enough arguments to justify the changes - you're in a better >> position than I to make that case. Plus this is a *large* patchset, and it >> plays in an area where Linus is known to have, err, opinions. > > Not sure whether it got mentioned but one area where huge pages are rather > useful is apps/middleware that do some sort of GC with tons of RAM. Dunno what your measure of "tons of RAM" is but yeah, IIRC when you go above 2 GB or so, huge pages are usually a big win. > There the 512x reduction in remapping and TLB flush costs (not just TLB miss > costs) obviously makes for a big difference not just in straight > performance/latency but also in cache footprint.
> AFAIK most GC concepts today > (that cover many gigabytes of memory) are limited by remap and TLB flush > performance. Which remap are you referring to? AFAIK, most modern GCs split memory in young and old generation "zones" and _copy_ surviving objects from the former to the latter if their lifetime exceeds some threshold. The JVM keeps scanning the smaller young generation very aggressively which causes TLB pressure and scans the larger old generation less often. Pekka -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with ESMTP id AAE746B01E3 for ; Mon, 5 Apr 2010 16:38:04 -0400 (EDT) Date: Mon, 5 Apr 2010 13:32:21 -0700 (PDT) From: Linus Torvalds Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 In-Reply-To: Message-ID: References: <20100405120906.0abe8e58.akpm@linux-foundation.org> <20100405193616.GA5125@elte.hu> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Pekka Enberg Cc: Ingo Molnar , Andrew Morton , Andrea Arcangeli , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Mon, 5 Apr 2010, Pekka Enberg wrote: > > AFAIK, most modern GCs split memory in young and old generation > "zones" and _copy_ surviving objects from the former to the latter if > their lifetime exceeds some threshold. The JVM keeps scanning the > smaller young generation very aggressively which causes TLB pressure > and scans the larger old generation less often. .. my only input to this is: numbers talk, bullsh*t walks. I'm not interested in micro-benchmarks, either. I can show infinite TLB walk improvement in a microbenchmark. In order for me to be interested in any complex hugetlb crap, I want real numbers from real applications. Not "it takes this many cycles to walk a page table", or "it could matter under these circumstances". I also want those real numbers _not_ directly after a clean reboot, but after running other real loads on the machine that have actually used up all the memory and filled it with things like dentry data etc. The "right after boot" case is totally pointless, since a huge part of hugetlb entries is the ability to allocate those physically contiguous and well-aligned regions. Until then, it's just extra complexity for no actual gain. Oh, and while I'm at it, I want a pony too. Linus PS. I also think the current odd anonvma thing is _way_ more important. That was a feature that actually improved AIM throughput by 300%. Now, admittedly that's not a real load either, but at least it's not a total microbenchmark. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with SMTP id 2624D6B01E3 for ; Mon, 5 Apr 2010 16:46:30 -0400 (EDT) Received: by fxm2 with SMTP id 2so1568250fxm.10 for ; Mon, 05 Apr 2010 13:46:28 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: References: <20100405120906.0abe8e58.akpm@linux-foundation.org> <20100405193616.GA5125@elte.hu> Date: Mon, 5 Apr 2010 23:46:27 +0300 Message-ID: Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 From: Pekka Enberg Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org To: Linus Torvalds Cc: Ingo Molnar , Andrew Morton , Andrea Arcangeli , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: Hi Linus, On Mon, Apr 5, 2010 at 11:32 PM, Linus Torvalds wrote: >> AFAIK, most modern GCs split memory in young and old generation >> "zones" and _copy_ surviving objects from the former to the latter if >> their lifetime exceeds some threshold. The JVM keeps scanning the >> smaller young generation very aggressively which causes TLB pressure >> and scans the larger old generation less often. > > .. my only input to this is: numbers talk, bullsh*t walks. > > I'm not interested in micro-benchmarks, either. I can show infinite TLB > walk improvement in a microbenchmark. > > In order for me to be interested in any complex hugetlb crap, I want real > numbers from real applications. Not "it takes this many cycles to walk a > page table", or "it could matter under these circumstances". > > I also want those real numbers _not_ directly after a clean reboot, but > after running other real loads on the machine that have actually used up > all the memory and filled it with things like dentry data etc. The "right > after boot" case is totally pointless, since a huge part of hugetlb > entries is the ability to allocate those physically contiguous and > well-aligned regions. > > Until then, it's just extra complexity for no actual gain. > > Oh, and while I'm at it, I want a pony too. Unfortunately I wasn't able to find a pony on Google but here are some huge page numbers if you're interested: http://zzzoot.blogspot.com/2009/02/java-mysql-increased-performance-with.html I'm actually a bit surprised you find the issue controversial, Linus. I am not a real JVM hacker (although I could probably play one on TV) but the "hugepages are a big win" argument seems pretty logical for any GC heavy activity. Wouldn't be the first time I was wrong, though. Pekka -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id DD3FF6B01E3 for ; Mon, 5 Apr 2010 17:02:47 -0400 (EDT) Date: Mon, 5 Apr 2010 17:01:33 -0400 From: Chris Mason Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100405210133.GE21620@think> References: <20100405120906.0abe8e58.akpm@linux-foundation.org> <20100405193616.GA5125@elte.hu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: Linus Torvalds Cc: Pekka Enberg , Ingo Molnar , Andrew Morton , Andrea Arcangeli , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Mon, Apr 05, 2010 at 01:32:21PM -0700, Linus Torvalds wrote: > > > On Mon, 5 Apr 2010, Pekka Enberg wrote: > > > > AFAIK, most modern GCs split memory in young and old generation > > "zones" and _copy_ surviving objects from the former to the latter if > > their lifetime exceeds some threshold. The JVM keeps scanning the > > smaller young generation very aggressively which causes TLB pressure > > and scans the larger old generation less often. > > .. my only input to this is: numbers talk, bullsh*t walks. > > I'm not interested in micro-benchmarks, either. I can show infinite TLB > walk improvement in a microbenchmark. Ok, I'll bite. I should be able to get some database workloads with hugepages, transparent hugepages, and without any hugepages at all. -chris -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id 9125E6B01EE for ; Mon, 5 Apr 2010 17:04:05 -0400 (EDT) Date: Mon, 5 Apr 2010 13:58:57 -0700 (PDT) From: Linus Torvalds Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 In-Reply-To: Message-ID: References: <20100405120906.0abe8e58.akpm@linux-foundation.org> <20100405193616.GA5125@elte.hu> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Pekka Enberg Cc: Ingo Molnar , Andrew Morton , Andrea Arcangeli , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Mon, 5 Apr 2010, Pekka Enberg wrote: > > Unfortunately I wasn't able to find a pony on Google but here are some > huge page numbers if you're interested: You missed the point. Those numbers weren't done with the patches in question. They weren't done with the magic new code that can handle fragmentation and swapping. They are simply not relevant to any of the complex code under discussion. 
The thing you posted is already doable (and done) using the existing hacky (but at least unsurprising) preallocation crud. We know that works. That's never been the issue. What I'm asking for is this thing called "Does it actually work in REALITY". That's my point about "not just after a clean boot". Just to really hit the issue home, here's my current machine:

[root@i5 ~]# free
             total       used       free     shared    buffers     cached
Mem:       8073864    1808488    6265376          0      75480    1018412
-/+ buffers/cache:     714596    7359268
Swap:     10207228      12848   10194380

Look, I have absolutely _sh*tloads_ of memory, and I'm not using it. Really. I've got 8GB in that machine, it's just not been doing much more than a few "git pull"s and "make allyesconfig" runs to check the current kernel and so it's got over 6GB free. So I'm bound to have _tons_ of 2M pages, no? No. Lookie here:

[344492.280001] DMA: 1*4kB 1*8kB 1*16kB 2*32kB 2*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15836kB
[344492.280020] DMA32: 17516*4kB 19497*8kB 18318*16kB 15195*32kB 10332*64kB 5163*128kB 1371*256kB 123*512kB 2*1024kB 1*2048kB 0*4096kB = 2745528kB
[344492.280027] Normal: 57295*4kB 66959*8kB 39639*16kB 29486*32kB 10483*64kB 2366*128kB 398*256kB 100*512kB 27*1024kB 3*2048kB 0*4096kB = 3503268kB

just to help you parse that: this is a _lightly_ loaded machine. It's been up for about four days. And look at it. In case you can't read it, the relevant part is this part:

DMA: .. 1*2048kB 3*4096kB
DMA32: .. 1*2048kB 0*4096kB
Normal: .. 3*2048kB 0*4096kB

there is just a _small handful_ of 2MB pages. Seriously. On a machine with 8 GB of RAM, and three quarters of it free, and there is just a couple of contiguous 2MB regions. Note, that's _MB_, not GB. And don't tell me that these things are easy to fix. Don't tell me that the current VM is quite clean and can be harmlessly extended to deal with this all. Just don't. Not when we currently have a totally unexplained regression in the VM from the last scalability thing we did. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id D7BE46B01E3 for ; Mon, 5 Apr 2010 17:19:19 -0400 (EDT) Message-ID: <4BBA53A0.8050608@redhat.com> Date: Tue, 06 Apr 2010 00:18:24 +0300 From: Avi Kivity MIME-Version: 1.0 Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 References: <20100405120906.0abe8e58.akpm@linux-foundation.org> <20100405193616.GA5125@elte.hu> <20100405210133.GE21620@think> In-Reply-To: <20100405210133.GE21620@think> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Chris Mason Cc: Linus Torvalds , Pekka Enberg , Ingo Molnar , Andrew Morton , Andrea Arcangeli , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S.
Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On 04/06/2010 12:01 AM, Chris Mason wrote: > On Mon, Apr 05, 2010 at 01:32:21PM -0700, Linus Torvalds wrote: > >> >> On Mon, 5 Apr 2010, Pekka Enberg wrote: >> >>> AFAIK, most modern GCs split memory in young and old generation >>> "zones" and _copy_ surviving objects from the former to the latter if >>> their lifetime exceeds some threshold. The JVM keeps scanning the >>> smaller young generation very aggressively which causes TLB pressure >>> and scans the larger old generation less often. >>> >> .. my only input to this is: numbers talk, bullsh*t walks. >> >> I'm not interested in micro-benchmarks, either. I can show infinite TLB >> walk improvement in a microbenchmark. >> > Ok, I'll bite. I should be able to get some database workloads with > hugepages, transparent hugepages, and without any hugepages at all. > Please run them in conjunction with Mel Gorman's memory compaction, otherwise fragmentation may prevent huge pages from being instantiated. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with ESMTP id 646DB6B01E3 for ; Mon, 5 Apr 2010 17:39:01 -0400 (EDT) Date: Mon, 5 Apr 2010 14:33:29 -0700 (PDT) From: Linus Torvalds Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 In-Reply-To: <4BBA53A0.8050608@redhat.com> Message-ID: References: <20100405120906.0abe8e58.akpm@linux-foundation.org> <20100405193616.GA5125@elte.hu> <20100405210133.GE21620@think> <4BBA53A0.8050608@redhat.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Avi Kivity Cc: Chris Mason , Pekka Enberg , Ingo Molnar , Andrew Morton , Andrea Arcangeli , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Tue, 6 Apr 2010, Avi Kivity wrote: > > Please run them in conjunction with Mel Gorman's memory compaction, otherwise > fragmentation may prevent huge pages from being instantiated. .. and then please run them in conjunction with somebody doing "make -j16" on the kernel at the same time, or just generally doing real work for a few days before hand. The point is, there are benchmarks, and then there is real life. If we _know_ some feature only works for benchmarks, it should be discounted as such. It's like a compiler that is tuned for specint - at some point the numbers lose a lot of their meaning. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id 57AC76B01E3 for ; Mon, 5 Apr 2010 17:54:28 -0400 (EDT) Date: Mon, 5 Apr 2010 23:54:06 +0200 From: Ingo Molnar Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100405215406.GA32527@elte.hu> References: <20100405120906.0abe8e58.akpm@linux-foundation.org> <20100405193616.GA5125@elte.hu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: Linus Torvalds Cc: Pekka Enberg , Andrew Morton , Andrea Arcangeli , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: * Linus Torvalds wrote: > there is just a _small handful_ of 2MB pages. Seriously. On a machine with 8 > GB of RAM, and three quarters of it free, and there is just a couple of > contiguous 2MB regions. Note, that's _MB_, not GB. > > And don't tell me that these things are easy to fix. Don't tell me that the > current VM is quite clean and can be harmlessly extended to deal with this > all. Just don't. Not when we currently have a totally unexplained regression > in the VM from the last scalability thing we did. I think those are very real worries. The only point i wanted to make is that the numbers are real as well and go beyond what i saw characterised in the first email. (It might still not be enough to tip the scale in the direction of 'we really want to do this' though.) Thanks, Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id B40B96B01EE for ; Mon, 5 Apr 2010 18:36:50 -0400 (EDT) Date: Mon, 5 Apr 2010 18:33:59 -0400 From: Chris Mason Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100405223359.GH21620@think> References: <20100405120906.0abe8e58.akpm@linux-foundation.org> <20100405193616.GA5125@elte.hu> <20100405210133.GE21620@think> <4BBA53A0.8050608@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: Linus Torvalds Cc: Avi Kivity , Pekka Enberg , Ingo Molnar , Andrew Morton , Andrea Arcangeli , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Mon, Apr 05, 2010 at 02:33:29PM -0700, Linus Torvalds wrote: > > > On Tue, 6 Apr 2010, Avi Kivity wrote: > > > > Please run them in conjunction with Mel Gorman's memory compaction, otherwise > > fragmentation may prevent huge pages from being instantiated. > > .. 
and then please run them in conjunction with somebody doing "make -j16" > on the kernel at the same time, or just generally doing real work for a > few days before hand. > > The point is, there are benchmarks, and then there is real life. If we > _know_ some feature only works for benchmarks, it should be discounted as > such. It's like a compiler that is tuned for specint - at some point the > numbers lose a lot of their meaning. Sure, I'll do my best to be brutal. Avi, Andrea please fire off to me a git tree or patch bomb for benchmarking. Please include all the patches you think it needs to go fast, including any config hints etc... If you'd like numbers with and without a given set of patches, just let me know. -chris -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with SMTP id 824426B01EE for ; Mon, 5 Apr 2010 19:22:19 -0400 (EDT) Date: Tue, 6 Apr 2010 01:21:15 +0200 From: Andrea Arcangeli Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100405232115.GM5825@random.random> References: <20100405120906.0abe8e58.akpm@linux-foundation.org> <20100405193616.GA5125@elte.hu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: Linus Torvalds Cc: Pekka Enberg , Ingo Molnar , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: Hi Linus, On Mon, Apr 05, 2010 at 01:58:57PM -0700, Linus Torvalds wrote: > What I'm asking for is this thing called "Does it actually work in > REALITY". That's my point about "not just after a clean boot". > > Just to really hit the issue home, here's my current machine: > > [root@i5 ~]# free > total used free shared buffers cached > Mem: 8073864 1808488 6265376 0 75480 1018412 > -/+ buffers/cache: 714596 7359268 > Swap: 10207228 12848 10194380 > > Look, I have absolutely _sh*tloads_ of memory, and I'm not using it. > Really. I've got 8GB in that machine, it's just not been doing much more > than a few "git pull"s and "make allyesconfig" runs to check the current > kernel and so it's got over 6GB free. > > So I'm bound to have _tons_ of 2M pages, no? > > No. Lookie here: > > [344492.280001] DMA: 1*4kB 1*8kB 1*16kB 2*32kB 2*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15836kB > [344492.280020] DMA32: 17516*4kB 19497*8kB 18318*16kB 15195*32kB 10332*64kB 5163*128kB 1371*256kB 123*512kB 2*1024kB 1*2048kB 0*4096kB = 2745528kB > [344492.280027] Normal: 57295*4kB 66959*8kB 39639*16kB 29486*32kB 10483*64kB 2366*128kB 398*256kB 100*512kB 27*1024kB 3*2048kB 0*4096kB = 3503268kB > > just to help you parse that: this is a _lightly_ loaded machine. It's been > up for about four days. And look at it. > > In case you can't read it, the relevant part is this part: > > DMA: .. 1*2048kB 3*4096kB > DMA32: .. 1*2048kB 0*4096kB > Normal: .. 3*2048kB 0*4096kB > > there is just a _small handful_ of 2MB pages. Seriously. 
> On a machine with > 8 GB of RAM, and three quarters of it free, and there is just a couple of > contiguous 2MB regions. Note, that's _MB_, not GB. What I can provide is my current status so far on my workstation:

$ free
total used free shared buffers cached
Mem: 1923648 1410912 512736 0 332236 391000
-/+ buffers/cache: 687676 1235972
Swap: 4200960 14204 4186756
$ cat /proc/buddyinfo
Node 0, zone DMA 46 34 30 12 16 11 10 5 0 1 0
Node 0, zone DMA32 33 355 352 129 46 1307 751 225 9 1 0
$ uptime
00:06:54 up 10 days, 5:10, 3 users, load average: 0.00, 0.00, 0.00
$ grep Anon /proc/meminfo
AnonPages: 78036 kB
AnonHugePages: 100352 kB

And on my laptop:

$ free
total used free shared buffers cached
Mem: 3076948 1964136 1112812 0 91920 297212
-/+ buffers/cache: 1575004 1501944
Swap: 2939888 17668 2922220
$ cat /proc/buddyinfo
Node 0, zone DMA 26 9 8 3 3 2 2 1 1 3 1
Node 0, zone DMA32 840 2142 6455 5848 5156 2554 291 52 30 0 0
$ uptime
00:08:21 up 17 days, 20:17, 5 users, load average: 0.06, 0.01, 0.00
$ grep Anon /proc/meminfo
AnonPages: 856332 kB
AnonHugePages: 272384 kB

this is with:

$ cat /sys/kernel/mm/transparent_hugepage/defrag
always madvise [never]
$ cat /sys/kernel/mm/transparent_hugepage/khugepaged/defrag
[yes] no

Currently the "defrag" sysfs control only toggles __GFP_WAIT on/off in huge_memory.c (details in the patch with subject "transparent hugepage core" in the alloc_hugepage() function). Toggling __GFP_WAIT is a joke right now. The real deal to address your worry is first to run "hugeadm --set-recommended-min_free_kbytes" and to apply Mel's patches called "memory compaction" which is a separate patchset. I'm the consumer, Mel's the producer ;). With virtual machines the host kernel doesn't need to live forever (it has to be stable but we can easily reboot it without the guest noticing): we can migrate virtual machines to freshly booted new hosts, voiding the whole producer issue. Furthermore, VMs are usually started for the first time at host boot, and we want as much memory as possible backed by hugepages in the host. This is not to say that the producer isn't important or can't work, Mel posted numbers that show it works, and we definitely want it to work, but I'm just trying to make a point that a good consumer of plenty of hugepages available at boot is useful even assuming the producer won't ever work or won't ever get in (not the real life case we're dealing with!). Initially we're going to take advantage of only the consumer in production exactly because it's already useful, even if we want to take advantage of a smart runtime "producer" too later on as time goes on. Migrating guests to produce hugepages isn't the ideal way for sure and I'm very confident that Mel's work is already filling the gap very nicely. The VM itself (regardless of whether the consumer is hugetlbfs or transparent hugepage support) is evolving towards being able to generate an endless amount of hugepages (of 2M size; 1G is still unthinkable because of the huge cost) as shown by the already-mainline "hugeadm --set-recommended-min_free_kbytes". BTW, I think having this 10-liner algorithm in the userland hugeadm binary is wrong and it should be a separate sysctl like "echo 1 >/sys/kernel/vm/set-recommended-min_free_kbytes", but that's offtopic and an implementation detail... This is just to show they are already addressing that stuff for hugetlbfs. So I just created a better consumer for the stuff they make an effort to produce anyway (i.e. 2M pages).
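For reference, AnonHugePages is a plain kB counter, so the workstation above is holding 100352/2048 = 49 anonymous hugepages and the laptop 272384/2048 = 133. The /proc/buddyinfo columns can be read the same way: column k of each zone line is the count of free order-k blocks, and with 4kB base pages order 9 and above are the 2MB-capable ones. A throwaway userspace counter along these lines (purely illustrative, not part of the patchset) is enough to watch the producer side:

============
/* count free buddy blocks of order >= 9 (2MB with 4kB pages).
 * Illustrative sketch only, not part of the series. */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/buddyinfo", "r");
	char line[512], zone[16];
	unsigned long total = 0;

	if (!f) {
		perror("/proc/buddyinfo");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		unsigned long n;
		int pos, len, order = 0;
		char *p;

		/* lines look like: "Node 0, zone    DMA32   33  355 ..."
		 * the zone name is parsed only to anchor the counts */
		if (sscanf(line, "Node %*d, zone %15s%n", zone, &pos) != 1)
			continue;
		p = line + pos;
		while (sscanf(p, " %lu%n", &n, &len) == 1) {
			if (order >= 9)	/* order 9 == 512 * 4kB == 2MB */
				total += n;
			p += len;
			order++;
		}
	}
	fclose(f);
	printf("free buddy blocks of 2MB or more: %lu\n", total);
	return 0;
}
============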
The better consumer we have of it in the kernel, the more effort will be put into the producer. > And don't tell me that these things are easy to fix. Don't tell me that > the current VM is quite clean and can be harmlessly extended to deal with > this all. Just don't. Not when we currently have a totally unexplained > regression in the VM from the last scalability thing we did. Well the risk of regression with the consumer is little if disabled with sysfs so it'd be trivial to localize if it caused any problem. About memory compaction I think we should limit the invocation of those new VM algorithms to hugetlbfs and transparent hugepage support (and I already created the sysfs controls to enable/disable those so you can run transparent hugepage support with or without the defrag feature). So all of this can be turned off at runtime. You can run only the consumer, both consumer and producer, or none (and if none, risk of regression should be zero). There's no point to ever defrag if there is no consumer of 2M pages. khugepaged should be able to invoke memory compaction comfortably in the defrag job in the background if khugepaged/defrag is set to "yes". I think worrying about the producer too much generates a chicken-and-egg problem: without a heavy consumer in mainline, there's little point for people to work on the producer. Note that creating a good consumer wasn't an easy task, I did all I could to keep it self contained and I think I succeeded at that. My work as a result created interest in improving the producer on Mel's side. I am sure if the consumer goes in, producing the stuff will also happen without many problems. My preferred merging path is to merge the consumer first. But then I'm not entirely against the other order either. Merging both at the same time looks to me like unnecessary complexity going into the kernel at once, and it'd make things less bisectable. But it wouldn't be impossible either. About the performance benefits I posted some numbers in linux-mm, but I'll collect them here (and this is after boot with plenty of hugepages). As a side note in this first part please note also the boost in the page fault rate (but this is really only for curiosity, as this will only happen when hugepages are immediately available in the buddy). ------------ hugepages in the virtualization hypervisor (and also in the guest!) are much more important than in a regular host not using virtualization, because with NPT/EPT they decrease the tlb-miss cacheline accesses from 24 to 19 in case only the hypervisor uses transparent hugepages, and they decrease the tlb-miss cacheline accesses from 19 to 15 in case both the linux hypervisor and the linux guest use this patch (though the guest will limit the additional speedup to anonymous regions only for now...). Even more important is that the tlb miss handler is much slower on a NPT/EPT guest than for a regular shadow paging or no-virtualization scenario. So maximizing the amount of virtual memory cached by the TLB pays off significantly more with NPT/EPT than without (even if there would be no significant speedup in the tlb-miss runtime). [..]
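The 24/19/15 cacheline counts above can be checked against the usual cost model for a two-dimensional NPT/EPT walk (stated here as an assumption about where those figures come from): with n guest pagetable levels and m host levels, a worst-case miss touches n*m + n + m lines, since each guest level needs a host walk and the final guest-physical access needs one more. That gives 4*4 + 4 + 4 = 24 with 4kB pages everywhere, 4*3 + 4 + 3 = 19 when host 2MB pages remove one host level, and 3*3 + 3 + 3 = 15 when huge pages remove one level on each side.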
Some performance result:

vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
memset page fault 1566023
memset tlb miss 453854
memset second tlb miss 453321
random access tlb miss 41635
random access second tlb miss 41658
vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
memset page fault 1566471
memset tlb miss 453375
memset second tlb miss 453320
random access tlb miss 41636
random access second tlb miss 41637
vmx andrea # ./largepages3
memset page fault 1566642
memset tlb miss 453417
memset second tlb miss 453313
random access tlb miss 41630
random access second tlb miss 41647
vmx andrea # ./largepages3
memset page fault 1566872
memset tlb miss 453418
memset second tlb miss 453315
random access tlb miss 41618
random access second tlb miss 41659
vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
vmx andrea # ./largepages3
memset page fault 2182476
memset tlb miss 460305
memset second tlb miss 460179
random access tlb miss 44483
random access second tlb miss 44186
vmx andrea # ./largepages3
memset page fault 2182791
memset tlb miss 460742
memset second tlb miss 459962
random access tlb miss 43981
random access second tlb miss 43988

============
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define SIZE (3UL*1024*1024*1024)

int main()
{
	char *p = malloc(SIZE), *p2;
	struct timeval before, after;

	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset page fault %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset tlb miss %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset second tlb miss %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	for (p2 = p; p2 < p+SIZE; p2 += 4096)
		*p2 = 0;
	gettimeofday(&after, NULL);
	printf("random access tlb miss %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	for (p2 = p; p2 < p+SIZE; p2 += 4096)
		*p2 = 0;
	gettimeofday(&after, NULL);
	printf("random access second tlb miss %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);
	return 0;
}
============
-------------

This is a more interesting benchmark of kernel compile and some random cpu bound dd command (not a microbenchmark like above):

-----------
This is a kernel build in a 2.6.31 guest, on a 2.6.34-rc1 host. KVM run with "-drive cache=on,if=virtio,boot=on and -smp 4 -m 2g -vnc :0" (host has 4G of ram). CPU is Phenom (not II) with NPT (4 cores, 1 die). All reads are provided from host cache and cpu overhead of the I/O is reduced thanks to virtio. Workload is just a "make clean >/dev/null; time make -j20 >/dev/null". Results copied by hand because I logged through vnc.

real 4m12.498s 14m28.106s 1m26.721s
real 4m12.000s 14m27.850s 1m25.729s

After the benchmark:

grep Anon /proc/meminfo
AnonPages: 121300 kB
AnonHugePages: 1007616 kB
cat /debugfs/kvm/largepages
2296

1.6G free in guest and 1.5G free in host.
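A reading note on the numbers above: largepages3 is the self-contained program between the ============ markers, every figure it prints is elapsed microseconds for one pass over the 3GB buffer, and it needs a bit over 3GB of free RAM to run; presumably it was built with a plain "gcc largepages3.c -o largepages3" (the build command isn't stated). So the two page-fault results read as roughly 1.57s versus 2.18s to fault in 3GB, i.e. the fault path costs about 1.4x more wall time with transparent hugepages disabled in this microbenchmark.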
Then on host:

# echo never > /sys/kernel/mm/transparent_hugepage/enabled
# echo never > /sys/kernel/mm/transparent_hugepage/khugepaged/enabled

then I restart the VM and re-run the same workload:

real 4m25.040s
user 15m4.665s
sys 1m50.519s

real 4m29.653s
user 15m8.637s
sys 1m49.631s

(guest kernel was not so recent and it had no transparent hugepage support because gcc normally won't take advantage of hugepages according to /proc/meminfo, so I made the comparison with a distro guest kernel with my usual .config I use in kvm guests) So the guest compiles the kernel 6% faster with hugepages and the results are trivially reproducible and stable enough (especially with hugepage enabled; without it it varies from 4m24s to 4m30s as I tried a few times more without hugepages in NPT when userland wasn't patched yet...). Below another test that takes advantage of hugepages in the guest too, so running the same 2.6.34-rc1 with transparent hugepage support in both host and guest. (this really shows the power of the KVM design: we boost the hypervisor and we get a double boost for guest applications)

Workload: time dd if=/dev/zero of=/dev/null bs=128M count=100
Host hugepage no guest: 3.898
Host hugepage guest hugepage: 3.966 (-1.17%)
Host no hugepage no guest: 4.088 (-4.87%)
Host hugepage guest no hugepage: 4.312 (-10.1%)
Host no hugepage guest hugepage: 4.388 (-12.5%)
Host no hugepage guest no hugepage: 4.425 (-13.5%)

Workload: time dd if=/dev/zero of=/dev/null bs=4M count=1000
Host hugepage no guest: 1.207
Host hugepage guest hugepage: 1.245 (-3.14%)
Host no hugepage no guest: 1.261 (-4.47%)
Host hugepage guest no hugepage: 1.323 (-9.61%)
Host no hugepage guest hugepage: 1.371 (-13.5%)
Host no hugepage guest no hugepage: 1.398 (-15.8%)

I've no local EPT system to test so I may run them over vpn later on some large EPT system (and surely there are better benchmarks than a silly dd... but this is a start and shows even basic stuff gets the boost). The above is basically a "home-workstation/laptop" coverage. I (partly) intentionally ran these on a system that has a ~$100 CPU and ~$50 motherboard, to show the absolute worst case, to be sure that 100% of home end users (running KVM) will take a measurable advantage from this effort. On huge systems the percentage boost is expected to be much bigger than in the home-workstation test above, of course. -------------- Again gcc is a kind of worst case for it but it also shows a definite, significant and reproducible boost. Also note for a non-virtualization usage (so outside of MADV_HUGEPAGE), invoking memory compaction synchronously likely risks losing CPU speed. khugepaged takes care of long-lived allocations of random tasks and the only thing to use memory compaction synchronously could be the page faults of regions marked MADV_HUGEPAGE. But we may only decide to invoke memory compaction asynchronously and never as a result of direct reclaim in process context to avoid any latency to guest operations. All that matters after boot is that khugepaged can do its job, it's not urgent. When things are urgent migrating guests to a new cloud node is always possible. I'd like to clarify this whole work has been done without ever making assumptions about virtual machines; I tried to make this as universally useful as possible (and not just because we want the exact same VM algorithms to trim one level of guest pagetables too to get a cumulative boost, fully exploiting the KVM design ;).
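Since MADV_HUGEPAGE keeps coming up, here is a minimal sketch of how an application would opt a region in under this series. The define below mirrors what the patchset adds to asm-generic/mman.h; treat it as an assumption if your headers predate the series, and on a kernel without the patches the madvise() simply fails with EINVAL:

============
/* sketch: hint an anonymous mapping for transparent hugepages */
#include <sys/mman.h>
#include <stdio.h>

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14	/* value added by this patchset */
#endif

#define LEN (512UL * 1024 * 1024)

int main(void)
{
	unsigned long off;
	char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	if (madvise(p, LEN, MADV_HUGEPAGE))
		perror("madvise(MADV_HUGEPAGE)");	/* EINVAL on old kernels */

	/* touch the region; with THP the faults arrive as 2MB pmds */
	for (off = 0; off < LEN; off += 4096)
		p[off] = 1;
	return 0;
}
============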
I'm thrilled Chris is going to run a host-only test for databases and I'm sure willing to help with that. Compacting everything that is "movable" is surely solvable from a theoretical standpoint and that includes all anonymous memory (huge or not) and all cache. That alone accounts for a huge bulk of the total memory of a system, so being able to mix it all will result in the best behavior, which isn't possible to achieve with hugetlbfs (so if the memory isn't allocated as anonymous memory it can still be used as cache for I/O). So in the very worst case, if everything else fails on the producer front (again: not the case as far as I can tell!) what should be reserved at boot is an amount of memory to limit the unmovable parts there. And to leave the movable parts free to be allocated dynamically without limitations depending on the workloads. I'm quite sure Mel will be able to provide more details on his work that has been reviewed in detail already on linux-mm with lots of positive feedback, which is why I expect zero problems on that side too in real life (besides my theoretical standpoint in previous chapter ;). Thanks, Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id 78FEC6B01EE for ; Mon, 5 Apr 2010 20:30:52 -0400 (EDT) Date: Mon, 5 Apr 2010 17:26:15 -0700 (PDT) From: Linus Torvalds Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 In-Reply-To: <20100405232115.GM5825@random.random> Message-ID: References: <20100405120906.0abe8e58.akpm@linux-foundation.org> <20100405193616.GA5125@elte.hu> <20100405232115.GM5825@random.random> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrea Arcangeli Cc: Pekka Enberg , Ingo Molnar , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Tue, 6 Apr 2010, Andrea Arcangeli wrote: > > Some performance result: Quite frankly, these "performance results" seem to be basically dishonest. Judging by your numbers, the big win is apparently pre-populating the page tables, the "tlb miss" you quote seem to be almost in the noise. IOW, we have memset page fault 1566023 vs memset page fault 2182476 looking like a major performance advantage, but then the actual usage is much less noticeable. IOW, how much of the performance advantage would we get from a _much_ simpler patch to just much more aggressively pre-populate the page tables (especially for just anonymous pages, I assume) or even just fault pages in several at a time when you have lots of memory? In particular, when you quote 6% improvement for a kernel compile, your own numbers make seriously wonder how many percentage points you'd get from just faulting in 8 pages at a time when you have lots of memory free, and use a single 3-order allocation to get those eight pages? Would that already shrink the difference between those "memset page faults" by a factor of eight?
See what I'm saying? Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with ESMTP id CBFEE6B01EE for ; Mon, 5 Apr 2010 21:14:03 -0400 (EDT) Date: Mon, 5 Apr 2010 18:08:51 -0700 (PDT) From: Linus Torvalds Subject: [RFD] Re: [PATCH 00 of 41] Transparent Hugepage Support #17 In-Reply-To: Message-ID: References: <20100405120906.0abe8e58.akpm@linux-foundation.org> <20100405193616.GA5125@elte.hu> <20100405232115.GM5825@random.random> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrea Arcangeli Cc: Pekka Enberg , Ingo Molnar , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Mon, 5 Apr 2010, Linus Torvalds wrote: > > In particular, when you quote 6% improvement for a kernel compile, your > own numbers make [me] seriously wonder how many percentage points you'd get > from just faulting in 8 pages at a time when you have lots of memory free, > and use a single 3-order allocation to get those eight pages? THIS PATCH IS TOTALLY UNTESTED! It's very very unlikely to work, but it compiles for me at least in one particular configuration. So it must be perfect. Ship it. It basically tries to just fill in anonymous memory PTE entries roughly one cacheline at a time, avoiding extra page-faults and extra memory allocations. It's probably buggy as hell, I don't dare try to actually boot the crap I write. It literally started out as a pseudo-code patch that I then ended up expanding until it compiled and then fixed up some corner cases in. IOW, it's not really a serious patch, although when I look at it, it doesn't really look all that horrible. Now, I'm pretty sure that allocating the page with a single order-3 allocation, and then treating it as 8 individual order-0 pages is broken and probably makes various things unhappy. That "make_single_page()" monstrosity may or may not be sufficient. In other words, what I'm trying to say is: treat this patch as a request for discussion, rather than something that necessarily _works_. 
Linus --- include/linux/gfp.h | 3 ++ mm/memory.c | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++ mm/mempolicy.c | 9 ++++++ 3 files changed, 81 insertions(+), 0 deletions(-) diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 4c6d413..2b8f42b 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -84,6 +84,7 @@ struct vm_area_struct; #define GFP_HIGHUSER_MOVABLE (__GFP_WAIT | __GFP_IO | __GFP_FS | \ __GFP_HARDWALL | __GFP_HIGHMEM | \ __GFP_MOVABLE) +#define GFP_USER_ORDER (GFP_HIGHUSER_MOVABLE | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY) #define GFP_IOFS (__GFP_IO | __GFP_FS) #ifdef CONFIG_NUMA @@ -306,10 +307,12 @@ alloc_pages(gfp_t gfp_mask, unsigned int order) } extern struct page *alloc_page_vma(gfp_t gfp_mask, struct vm_area_struct *vma, unsigned long addr); +extern struct page *alloc_page_user_order(struct vm_area_struct *, unsigned long, int); #else #define alloc_pages(gfp_mask, order) \ alloc_pages_node(numa_node_id(), gfp_mask, order) #define alloc_page_vma(gfp_mask, vma, addr) alloc_pages(gfp_mask, 0) +#define alloc_page_user_order(vma, addr, order) alloc_pages(GFP_USER_ORDER, order) #endif #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0) diff --git a/mm/memory.c b/mm/memory.c index 1d2ea39..7ad97cb 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2741,6 +2741,71 @@ out_release: return ret; } +static inline void make_single_page(struct page *page) +{ + set_page_count(page, 1); + set_page_private(page, 0); +} + +/* + * See if we can optimistically fill eight pages at a time + */ +static int optimistic_fault(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, pte_t *page_table, pmd_t *pmd) +{ + int i; + spinlock_t *ptl; + struct page *bigpage; + + /* Don't even bother if it's not writable */ + if (!(vma->vm_flags & VM_WRITE)) + return 0; + + /* Are we ok wrt the vma boundaries? */ + if ((address & (PAGE_MASK << 3)) < vma->vm_start) + return 0; + if ((address | ~(PAGE_MASK << 3)) > vma->vm_end) + return 0; + + /* + * Round to a nice even 8-byte page boundary, and + * optimistically (with no locking), check whether + * it's all empty. Skip if we have it partly filled + * in. + * + * 8 page table entries tends to be about a cacheline. + */ + page_table -= (address >> PAGE_SHIFT) & 7; + for (i = 0; i < 8; i++) + if (!pte_none(page_table[i])) + return 0; + + /* Allocate the eight pages in one go, no warning or retrying */ + bigpage = alloc_page_user_order(vma, addr, 3); + if (!bigpage) + return 0; + + ptl = pte_lockptr(mm, pmd); + spin_lock(ptl); + + for (i = 0; i < 8; i++) { + struct page *page = bigpage + i; + + make_single_page(page); + if (pte_none(page_table[i])) { + pte_t pte = mk_pte(page, vma->vm_page_prot); + pte = pte_mkwrite(pte_mkdirty(pte)); + set_pte_at(mm, address, page_table+i, pte); + } else { + __free_page(page); + } + } + + /* The caller will unlock */ + return 1; +} + + /* * We enter with non-exclusive mmap_sem (to exclude vma changes, * but allow concurrent faults), and pte mapped but not yet locked. 
@@ -2754,6 +2819,9 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, spinlock_t *ptl; pte_t entry; + if (optimistic_fault(mm, vma, address, page_table, pmd)) + goto update; + if (!(flags & FAULT_FLAG_WRITE)) { entry = pte_mkspecial(pfn_pte(my_zero_pfn(address), vma->vm_page_prot)); @@ -2790,6 +2858,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, setpte: set_pte_at(mm, address, page_table, entry); +update: /* No need to invalidate - it was non-present before */ update_mmu_cache(vma, address, page_table); unlock: diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 08f40a2..55a92bd 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1707,6 +1707,15 @@ alloc_page_vma(gfp_t gfp, struct vm_area_struct *vma, unsigned long addr) return __alloc_pages_nodemask(gfp, 0, zl, policy_nodemask(gfp, pol)); } +struct page * +alloc_page_user_order(struct vm_area_struct *vma, unsigned long addr, int order) +{ + struct zonelist *zl = policy_zonelist(gfp, pol); + struct mempolicy *pol = get_vma_policy(current, vma, addr); + + return __alloc_pages_nodemask(GFP_USER_ORDER, order, zl, pol); +} + /** * alloc_pages_current - Allocate pages. * -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with SMTP id 97AD56B01EF for ; Mon, 5 Apr 2010 21:14:48 -0400 (EDT) Date: Tue, 6 Apr 2010 03:13:45 +0200 From: Andrea Arcangeli Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100406011345.GT5825@random.random> References: <20100405120906.0abe8e58.akpm@linux-foundation.org> <20100405193616.GA5125@elte.hu> <20100405232115.GM5825@random.random> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: Linus Torvalds Cc: Pekka Enberg , Ingo Molnar , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Mon, Apr 05, 2010 at 05:26:15PM -0700, Linus Torvalds wrote: > > > On Tue, 6 Apr 2010, Andrea Arcangeli wrote: > > > > Some performance result: > > Quite frankly, these "performance results" seem to be basically dishonest. > > Judging by your numbers, the big win is apparently pre-populating the page > tables, the "tlb miss" you quote seem to be almost in the noise. IOW, we > have > > memset page fault 1566023 > > vs > > memset page fault 2182476 > > looking like a major performance advantage, but then the actual usage is > much less noticeable. > > IOW, how much of the performance advantage would we get from a _much_ > simpler patch to just much more aggressively pre-populate the page tables > (especially for just anonymous pages, I assume) or even just fault pages > in several at a time when you have lots of memory? 
I had a prefaulting patch that also allocated a hugepage but only mapped it with 2 ptes, 4 ptes, 8 ptes, up to 256 ptes using a sysctl, until the memset faulted in the rest and that triggered another chunk of prefault on the remaining hugepage. In the end these weren't worth it so I went straight with a huge pmd immediately (even if initially I worried about the more intensive clear-page in cow), which is hugely simpler too and doesn't only provide a page fault advantage. > In particular, when you quote 6% improvement for a kernel compile, your The memset test you mention above was run on host. The kernel compile is run on guest with an unmodified guest kernel. The kernel compile isn't mangling pagetables differently. The kernel compile is run on two different host kernels: one running with transparent hugepages, one without; the guest kernel has no modifications at all. No page fault ever happens in the host, only gcc runs in the guest in an unmodified kernel that isn't using hugepages at all. > own numbers make seriously wonder how many percentage points you'd get > from just faulting in 8 pages at a time when you have lots of memory free, > and use a single 3-order allocation to get those eight pages? > > Would that already shrink the difference between those "memset page > faults" by a factor of eight? > > See what I'm saying? I see what you're saying but that has nothing to do with the 6% boost. In short I first measured the page fault improvement in host (~+50% faster, sure that has nothing to do with pmd_huge or the tlb miss, I said I mentioned it just for curiosity in fact), then measured the tlb miss improvement in host (a few percent faster as usual with hugetlbfs), then measured the boost in guest if host uses hugepages (with no guest kernel change at all, just the tlb miss going faster in guest and that boosts the guest kernel compile 6%) and then some other test with dd with all combinations of host/guest using hugepages or not, and also with dd run on bare metal with or without hugepages. As said gcc is a sort of worst case, so you can assume any guest math will run 6% faster or more in guest if the host runs with transparent hugepages enabled (and there's memory compaction etc). The page fault speedup is a "nice addon" that has nothing to do with the kernel compile improvement because it was repeated many times and the guest kernel memory was already faulted in before. I only wanted to point it out "for curiosity" as I wrote in the prev email. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id 406686B01F1 for ; Mon, 5 Apr 2010 21:27:35 -0400 (EDT)
Date: Tue, 6 Apr 2010 03:26:47 +0200
From: Andrea Arcangeli
Subject: Re: [RFD] Re: [PATCH 00 of 41] Transparent Hugepage Support #17
Message-ID: <20100406012647.GU5825@random.random>
References: <20100405120906.0abe8e58.akpm@linux-foundation.org> <20100405193616.GA5125@elte.hu> <20100405232115.GM5825@random.random>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: 
Sender: owner-linux-mm@kvack.org
To: Linus Torvalds
Cc: Pekka Enberg , Ingo Molnar , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura
List-ID: 

On Mon, Apr 05, 2010 at 06:08:51PM -0700, Linus Torvalds wrote:
>
>
> On Mon, 5 Apr 2010, Linus Torvalds wrote:
> >
> > In particular, when you quote 6% improvement for a kernel compile, your
> > own numbers make [me] seriously wonder how many percentage points you'd get
> > from just faulting in 8 pages at a time when you have lots of memory free,
> > and use a single 3-order allocation to get those eight pages?
>
> THIS PATCH IS TOTALLY UNTESTED!
>
> It's very very unlikely to work, but it compiles for me at least in one
> particular configuration. So it must be perfect. Ship it.
>
> It basically tries to just fill in anonymous memory PTE entries roughly
> one cacheline at a time, avoiding extra page-faults and extra memory
> allocations.
>
> It's probably buggy as hell, I don't dare try to actually boot the crap I
> write. It literally started out as a pseudo-code patch that I then ended
> up expanding until it compiled and then fixed up some corner cases in.
>
> IOW, it's not really a serious patch, although when I look at it, it
> doesn't really look all that horrible.
>
> Now, I'm pretty sure that allocating the page with a single order-3
> allocation, and then treating it as 8 individual order-0 pages is broken
> and probably makes various things unhappy. That "make_single_page()"
> monstrosity may or may not be sufficient.
>
> In other words, what I'm trying to say is: treat this patch as a request
> for discussion, rather than something that necessarily _works_.

This will provide 0% speedup to a kernel compile in a guest, where transparent hugepage support (or hugetlbfs too) provides a 6% speedup. I evaluated the prefault approach before I finalized my design; back then I only generated a huge pmd once the whole hugepage was mapped. It's all worthless complexity in my view. In fact, except at boot time, we're unlikely to want to take advantage of this, as it is not a free optimization and it magnifies the time it takes to clear-page and copy-page (which is why I tried to only prefault a hugepage, and after benchmarking I figured out it wasn't worth it and would be hugely more complicated too).
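To put rough numbers on the tlb-miss side of the argument (this sketch is added for illustration and is not code from any patch in this thread): with nested paging, a guest walk of g levels layered over a host walk of h levels costs (g+1)*(h+1)-1 memory accesses per miss, which is where the 24/19/15 cacheline figures in the changelog excerpt re-posted below come from. The helper name is hypothetical:

#include <stdio.h>

static int nested_walk_accesses(int guest_levels, int host_levels)
{
	/*
	 * Each of the g guest page table fetches is a guest-physical
	 * access that needs a full host walk of h levels plus the
	 * fetch itself, and the final guest-physical data address
	 * needs one more host walk:
	 *
	 *   g * (h + 1) + h  ==  (g + 1) * (h + 1) - 1
	 */
	return (guest_levels + 1) * (host_levels + 1) - 1;
}

int main(void)
{
	printf("4k guest on 4k host: %d\n", nested_walk_accesses(4, 4)); /* 24 */
	printf("4k guest on 2M host: %d\n", nested_walk_accesses(4, 3)); /* 19 */
	printf("2M guest on 2M host: %d\n", nested_walk_accesses(3, 3)); /* 15 */
	return 0;
}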
The only case where it is worth mapping more than one 4k page is when we can take advantage of the tlb miss speedup and of the 2M tlb; otherwise it's better to stick to 4k page faults, do a 4k clear-page or copy-page, and not risk taking more than 4k of memory. And let khugepaged do the rest.

I think I already mentioned it in the previous email, but seeing your patch I feel obliged to re-post:

---------------
hugepages in the virtualization hypervisor (and also in the guest!) are much more important than in a regular host not using virtualization, because with NPT/EPT they decrease the tlb-miss cacheline accesses from 24 to 19 in case only the hypervisor uses transparent hugepages, and they decrease the tlb-miss cacheline accesses from 19 to 15 in case both the linux hypervisor and the linux guest use this patch (though the guest will limit the additional speedup to anonymous regions only for now...). Even more important is that the tlb miss handler is much slower on a NPT/EPT guest than for a regular shadow paging or no-virtualization scenario. So maximizing the amount of virtual memory cached by the TLB pays off significantly more with NPT/EPT than without (even if there would be no significant speedup in the tlb-miss runtime).
----------------

This is in the changelog of the "transparent hugepage core" patch too and here as well:

http://linux-mm.org/TransparentHugepage?action=AttachFile&do=get&target=transparent-hugepage.pdf

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with ESMTP id D26396B01EF for ; Mon, 5 Apr 2010 21:40:39 -0400 (EDT)
Date: Mon, 5 Apr 2010 18:35:43 -0700 (PDT)
From: Linus Torvalds
Subject: Re: [RFD] Re: [PATCH 00 of 41] Transparent Hugepage Support #17
In-Reply-To: 
Message-ID: 
References: <20100405120906.0abe8e58.akpm@linux-foundation.org> <20100405193616.GA5125@elte.hu> <20100405232115.GM5825@random.random>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
To: Andrea Arcangeli
Cc: Pekka Enberg , Ingo Molnar , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura
List-ID: 

On Mon, 5 Apr 2010, Linus Torvalds wrote:
>
> THIS PATCH IS TOTALLY UNTESTED!

Ok, it was also crap. I tried to warn you.

We actually have that "split_page()" function that does the right thing, I don't know why I didn't realize that.

And the lock was uninitialized for the optimistic case, because I had made that "clever optimization" to let the caller do the unlocking in the common path, but when I did that I didn't actually make sure that the caller had the right lock. Whee. I'm a moron.

This is _still_ untested and probably horribly buggy, but at least it isn't *quite* as rough as the previous patch was.
		Linus

---
 include/linux/gfp.h |    3 ++
 mm/memory.c         |   65 +++++++++++++++++++++++++++++++++++++++++++++
 mm/mempolicy.c      |    9 +++++++
 3 files changed, 77 insertions(+), 0 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 4c6d413..2b8f42b 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -84,6 +84,7 @@ struct vm_area_struct;
 #define GFP_HIGHUSER_MOVABLE	(__GFP_WAIT | __GFP_IO | __GFP_FS | \
 				 __GFP_HARDWALL | __GFP_HIGHMEM | \
 				 __GFP_MOVABLE)
+#define GFP_USER_ORDER	(GFP_HIGHUSER_MOVABLE | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY)
 #define GFP_IOFS	(__GFP_IO | __GFP_FS)
 
 #ifdef CONFIG_NUMA
@@ -306,10 +307,12 @@ alloc_pages(gfp_t gfp_mask, unsigned int order)
 }
 extern struct page *alloc_page_vma(gfp_t gfp_mask,
 			struct vm_area_struct *vma, unsigned long addr);
+extern struct page *alloc_page_user_order(struct vm_area_struct *, unsigned long, int);
 #else
 #define alloc_pages(gfp_mask, order) \
 		alloc_pages_node(numa_node_id(), gfp_mask, order)
 #define alloc_page_vma(gfp_mask, vma, addr) alloc_pages(gfp_mask, 0)
+#define alloc_page_user_order(vma, addr, order) alloc_pages(GFP_USER_ORDER, order)
 #endif
 #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
 
diff --git a/mm/memory.c b/mm/memory.c
index 1d2ea39..4f1521e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2742,6 +2742,66 @@ out_release:
 }
 
 /*
+ * See if we can optimistically fill eight pages at a time
+ */
+static spinlock_t *optimistic_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+		unsigned long address, pte_t *page_table, pmd_t *pmd)
+{
+	int i;
+	spinlock_t *ptl;
+	struct page *bigpage;
+
+	/* Don't even bother if it's not writable */
+	if (!(vma->vm_flags & VM_WRITE))
+		return NULL;
+
+	/* Are we ok wrt the vma boundaries? */
+	if ((address & (PAGE_MASK << 3)) < vma->vm_start)
+		return NULL;
+	if ((address | ~(PAGE_MASK << 3)) > vma->vm_end)
+		return NULL;
+
+	/*
+	 * Round down to a nice even eight-page boundary, and
+	 * optimistically (with no locking), check whether
+	 * it's all empty. Skip if we have it partly filled
+	 * in.
+	 *
+	 * 8 page table entries tends to be about a cacheline.
+	 */
+	page_table -= (address >> PAGE_SHIFT) & 7;
+	for (i = 0; i < 8; i++)
+		if (!pte_none(page_table[i]))
+			return NULL;
+
+	/* Allocate the eight pages in one go, no warning or retrying */
+	bigpage = alloc_page_user_order(vma, address, 3);
+	if (!bigpage)
+		return NULL;
+
+	split_page(bigpage, 3);
+
+	ptl = pte_lockptr(mm, pmd);
+	spin_lock(ptl);
+
+	for (i = 0; i < 8; i++) {
+		struct page *page = bigpage + i;
+
+		if (pte_none(page_table[i])) {
+			pte_t pte = mk_pte(page, vma->vm_page_prot);
+			pte = pte_mkwrite(pte_mkdirty(pte));
+			set_pte_at(mm, address, page_table+i, pte);
+		} else {
+			__free_page(page);
+		}
+	}
+
+	/* The caller will unlock */
+	return ptl;
+}
+
+
+/*
  * We enter with non-exclusive mmap_sem (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
  * We return with mmap_sem still held, but pte unmapped and unlocked.
@@ -2754,6 +2814,10 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	pte_t entry;
 
+	ptl = optimistic_fault(mm, vma, address, page_table, pmd);
+	if (ptl)
+		goto update;
+
 	if (!(flags & FAULT_FLAG_WRITE)) {
 		entry = pte_mkspecial(pfn_pte(my_zero_pfn(address),
 						vma->vm_page_prot));
@@ -2790,6 +2854,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 setpte:
 	set_pte_at(mm, address, page_table, entry);
 
+update:
 	/* No need to invalidate - it was non-present before */
 	update_mmu_cache(vma, address, page_table);
 unlock:
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 08f40a2..55a92bd 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1707,6 +1707,15 @@ alloc_page_vma(gfp_t gfp, struct vm_area_struct *vma, unsigned long addr)
 	return __alloc_pages_nodemask(gfp, 0, zl, policy_nodemask(gfp, pol));
 }
 
+struct page *
+alloc_page_user_order(struct vm_area_struct *vma, unsigned long addr, int order)
+{
+	struct mempolicy *pol = get_vma_policy(current, vma, addr);
+	struct zonelist *zl = policy_zonelist(GFP_USER_ORDER, pol);
+
+	return __alloc_pages_nodemask(GFP_USER_ORDER, order, zl,
+				      policy_nodemask(GFP_USER_ORDER, pol));
+}
+
 /**
  * alloc_pages_current - Allocate pages.
  *
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with ESMTP id AEFBB6B01EF for ; Mon, 5 Apr 2010 21:43:35 -0400 (EDT)
Date: Mon, 5 Apr 2010 18:38:35 -0700 (PDT)
From: Linus Torvalds
Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17
In-Reply-To: <20100406011345.GT5825@random.random>
Message-ID: 
References: <20100405120906.0abe8e58.akpm@linux-foundation.org> <20100405193616.GA5125@elte.hu> <20100405232115.GM5825@random.random> <20100406011345.GT5825@random.random>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
To: Andrea Arcangeli
Cc: Pekka Enberg , Ingo Molnar , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura
List-ID: 

On Tue, 6 Apr 2010, Andrea Arcangeli wrote:
>
> In short I first measured the page fault improvement in host (~+50%
> faster, sure that has nothing to do with pmd_huge or the tlb miss, I
> said I mentioned it just for curiosity in fact), then measured the tlb
> miss improvement in host (a few percent faster as usual with
> hugetlbfs) then measured the boost in guest if host uses hugepages
> (with no guest kernel change at all, just the tlb miss going faster in
> guest and that boosts the guest kernel compile 6%) and then some other
> test with dd with all combinations of host/guest using hugepages or
> not, and also with dd run on bare metal with or without hugepages.

Yeah, sorry. I misread your email - I noticed that 6% improvement for something that looked like a workload I might actually _care_ about, and didn't track the context enough to notice that it was just for the "host is using hugepages" case.

So I thought it was a more interesting load than it was.
The virtualization "TLB miss is expensive" load I can't find it in myself to care about. "Get a better CPU" is my answer to that one, Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id 8E32E6B01EF for ; Mon, 5 Apr 2010 22:29:47 -0400 (EDT) Date: Mon, 5 Apr 2010 19:23:44 -0700 (PDT) From: Linus Torvalds Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 In-Reply-To: Message-ID: References: <20100405120906.0abe8e58.akpm@linux-foundation.org> <20100405193616.GA5125@elte.hu> <20100405232115.GM5825@random.random> <20100406011345.GT5825@random.random> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrea Arcangeli Cc: Pekka Enberg , Ingo Molnar , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Mon, 5 Apr 2010, Linus Torvalds wrote: > > So I thought it was a more interesting load than it was. The > virtualization "TLB miss is expensive" load I can't find it in myself to > care about. "Get a better CPU" is my answer to that one, [ Btw, I do realize that "better CPU" in this case may be "future CPU". I just think that this is where better TLB's and using ASID's etc is likely to be a much bigger deal than adding VM complexity. Kind of the same way I think HIGHMEM was ultimately a failure, and the 4G:4G split was an atrocity that should have been killed ] Anyway. Since the prefaulting wasn't the point, I'm killing the patch. But since I actually tested it, and then I made it work, here's something that I will hereby throw away, but maybe somebody else would like to play with. It still gets the memcg accounting wrong, but it actually does seem to boot for me. And it just might make page faults cheaper. We avoid the whole "drop the ptl and re-take it" for the optimistic case, for example. So maybe it is worth looking at, even though the 6% thing wasn't here. 
		Linus

---
 include/linux/gfp.h |    4 ++
 mm/memory.c         |   82 +++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/mempolicy.c      |    9 +++++
 3 files changed, 95 insertions(+), 0 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 4c6d413..1b94d09 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -84,6 +84,8 @@ struct vm_area_struct;
 #define GFP_HIGHUSER_MOVABLE	(__GFP_WAIT | __GFP_IO | __GFP_FS | \
 				 __GFP_HARDWALL | __GFP_HIGHMEM | \
 				 __GFP_MOVABLE)
+#define GFP_USER_ORDER	(GFP_NOWAIT | __GFP_HARDWALL | __GFP_NOWARN | __GFP_NORETRY | \
+			 __GFP_HIGHMEM | __GFP_MOVABLE | __GFP_ZERO)
 #define GFP_IOFS	(__GFP_IO | __GFP_FS)
 
 #ifdef CONFIG_NUMA
@@ -306,10 +308,12 @@ alloc_pages(gfp_t gfp_mask, unsigned int order)
 }
 extern struct page *alloc_page_vma(gfp_t gfp_mask,
 			struct vm_area_struct *vma, unsigned long addr);
+extern struct page *alloc_page_user_order(struct vm_area_struct *, unsigned long, int);
 #else
 #define alloc_pages(gfp_mask, order) \
 		alloc_pages_node(numa_node_id(), gfp_mask, order)
 #define alloc_page_vma(gfp_mask, vma, addr) alloc_pages(gfp_mask, 0)
+#define alloc_page_user_order(vma, addr, order) alloc_pages(GFP_USER_ORDER, order)
 #endif
 #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
 
diff --git a/mm/memory.c b/mm/memory.c
index 1d2ea39..b2d5025 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2742,6 +2742,83 @@ out_release:
 }
 
 /*
+ * See if we can optimistically fill eight pages at a time
+ */
+static spinlock_t *optimistic_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+		unsigned long address, pte_t *page_table, pmd_t *pmd)
+{
+	int i;
+	spinlock_t *ptl;
+	struct page *bigpage;
+
+	/* Don't even bother if it's not writable */
+	if (!(vma->vm_flags & VM_WRITE))
+		return NULL;
+
+	/*
+	 * The optimistic path doesn't want to drop the
+	 * page table map, so it can't allocate anon_vma's
+	 * etc.
+	 */
+	if (!vma->anon_vma)
+		return NULL;
+
+	/* Are we ok wrt the vma boundaries? */
+	if ((address & (PAGE_MASK << 3)) < vma->vm_start)
+		return NULL;
+	if ((address | ~(PAGE_MASK << 3)) > vma->vm_end)
+		return NULL;
+
+	/*
+	 * Round down to a nice even eight-page boundary, and
+	 * optimistically (with no locking), check whether
+	 * it's all empty. Skip if we have it partly filled
+	 * in.
+	 *
+	 * 8 page table entries tends to be about a cacheline.
+	 */
+	page_table -= (address >> PAGE_SHIFT) & 7;
+	for (i = 0; i < 8; i++)
+		if (!pte_none(page_table[i]))
+			return NULL;
+
+	/* Allocate the eight pages in one go, no warning or retrying */
+	bigpage = alloc_page_user_order(vma, address, 3);
+	if (!bigpage)
+		return NULL;
+
+	split_page(bigpage, 3);
+
+	ptl = pte_lockptr(mm, pmd);
+	spin_lock(ptl);
+
+	address &= PAGE_MASK << 3;
+	for (i = 0; i < 8; i++) {
+		struct page *page = bigpage + i;
+
+		if (pte_none(page_table[i])) {
+			pte_t pte;
+
+			__SetPageUptodate(page);
+
+			inc_mm_counter_fast(mm, MM_ANONPAGES);
+			page_add_new_anon_rmap(page, vma, address);
+
+			pte = mk_pte(page, vma->vm_page_prot);
+			pte = pte_mkwrite(pte_mkdirty(pte));
+			set_pte_at(mm, address, page_table+i, pte);
+		} else {
+			__free_page(page);
+		}
+		address += PAGE_SIZE;
+	}
+
+	/* The caller will unlock */
+	return ptl;
+}
+
+
+/*
  * We enter with non-exclusive mmap_sem (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
  * We return with mmap_sem still held, but pte unmapped and unlocked.
@@ -2754,6 +2831,10 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	pte_t entry;
 
+	ptl = optimistic_fault(mm, vma, address, page_table, pmd);
+	if (ptl)
+		goto update;
+
 	if (!(flags & FAULT_FLAG_WRITE)) {
 		entry = pte_mkspecial(pfn_pte(my_zero_pfn(address),
 						vma->vm_page_prot));
@@ -2790,6 +2871,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 setpte:
 	set_pte_at(mm, address, page_table, entry);
 
+update:
 	/* No need to invalidate - it was non-present before */
 	update_mmu_cache(vma, address, page_table);
 unlock:
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 08f40a2..55a92bd 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1707,6 +1707,15 @@ alloc_page_vma(gfp_t gfp, struct vm_area_struct *vma, unsigned long addr)
 	return __alloc_pages_nodemask(gfp, 0, zl, policy_nodemask(gfp, pol));
 }
 
+struct page *
+alloc_page_user_order(struct vm_area_struct *vma, unsigned long addr, int order)
+{
+	struct mempolicy *pol = get_vma_policy(current, vma, addr);
+	struct zonelist *zl = policy_zonelist(GFP_USER_ORDER, pol);
+
+	return __alloc_pages_nodemask(GFP_USER_ORDER, order, zl,
+				      policy_nodemask(GFP_USER_ORDER, pol));
+}
+
 /**
  * alloc_pages_current - Allocate pages.
  *
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id 713886B01EE for ; Tue, 6 Apr 2010 01:26:03 -0400 (EDT)
Date: Tue, 6 Apr 2010 15:25:48 +1000
From: Nick Piggin
Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17
Message-ID: <20100406052548.GC11191@laptop>
References: <20100405193616.GA5125@elte.hu> <20100405232115.GM5825@random.random> <20100406011345.GT5825@random.random>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: 
Sender: owner-linux-mm@kvack.org
To: Linus Torvalds
Cc: Andrea Arcangeli , Pekka Enberg , Ingo Molnar , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura
List-ID: 

On Mon, Apr 05, 2010 at 07:23:44PM -0700, Linus Torvalds wrote:
>
>
> On Mon, 5 Apr 2010, Linus Torvalds wrote:
> >
> > So I thought it was a more interesting load than it was. The
> > virtualization "TLB miss is expensive" load I can't find it in myself to
> > care about. "Get a better CPU" is my answer to that one,
>
> [ Btw, I do realize that "better CPU" in this case may be "future CPU". I
>   just think that this is where better TLB's and using ASID's etc is
>   likely to be a much bigger deal than adding VM complexity. Kind of the
>   same way I think HIGHMEM was ultimately a failure, and the 4G:4G split
>   was an atrocity that should have been killed ]

It's an interesting route to go down. With more and more virtualization, we start to think about HV platforms as more legitimate targets for large scale optimizations like this. On the other hand, hardware memory virtualization is still quite young on x86 CPUs and there are still hardware improvements down the line.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with ESMTP id 4DF076B01EE for ; Tue, 6 Apr 2010 04:30:51 -0400 (EDT)
Date: Tue, 6 Apr 2010 09:30:28 +0100
From: Mel Gorman
Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17
Message-ID: <20100406083028.GA17882@csn.ul.ie>
References: <20100405120906.0abe8e58.akpm@linux-foundation.org> <20100405193616.GA5125@elte.hu> <20100405210133.GE21620@think> <4BBA53A0.8050608@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-15
Content-Disposition: inline
In-Reply-To: <4BBA53A0.8050608@redhat.com>
Sender: owner-linux-mm@kvack.org
To: Avi Kivity
Cc: Chris Mason , Linus Torvalds , Pekka Enberg , Ingo Molnar , Andrew Morton , Andrea Arcangeli , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura
List-ID: 

On Tue, Apr 06, 2010 at 12:18:24AM +0300, Avi Kivity wrote:
> On 04/06/2010 12:01 AM, Chris Mason wrote:
>> On Mon, Apr 05, 2010 at 01:32:21PM -0700, Linus Torvalds wrote:
>>>
>>> On Mon, 5 Apr 2010, Pekka Enberg wrote:
>>>
>>>> AFAIK, most modern GCs split memory in young and old generation
>>>> "zones" and _copy_ surviving objects from the former to the latter if
>>>> their lifetime exceeds some threshold. The JVM keeps scanning the
>>>> smaller young generation very aggressively which causes TLB pressure
>>>> and scans the larger old generation less often.
>>>
>>> .. my only input to this is: numbers talk, bullsh*t walks.
>>>
>>> I'm not interested in micro-benchmarks, either. I can show infinite TLB
>>> walk improvement in a microbenchmark.
>>
>> Ok, I'll bite. I should be able to get some database workloads with
>> hugepages, transparent hugepages, and without any hugepages at all.
>
> Please run them in conjunction with Mel Gorman's memory compaction,
> otherwise fragmentation may prevent huge pages from being instantiated.

Strictly speaking, compaction is not necessary to allocate huge pages. What compaction gets you is

  o Lower latency and cost of huge page allocation
  o Huge page allocation that works on swapless systems

What is important is that you run hugeadm --set-recommended-min_free_kbytes from the libhugetlbfs 2.8 package early in boot so that anti-fragmentation is doing as good a job as possible. If one is very curious, use the mm_page_alloc_extfrag tracepoint to trace how often severe fragmentation-related events occur under default settings and with min_free_kbytes set properly.

Without the compaction patches, allocating huge pages will occasionally be *very* expensive, as a large number of pages will need to be reclaimed. The most likely symptom is thrashing while the database starts up. Allocation success rates will also be lower when under heavy load.

Running make -j16 at the same time is unlikely to make much of a difference from a hugepage allocation point of view. The performance figures will vary significantly of course as make competes with the database for CPU time and other resources.
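As a quick sanity check while such a benchmark runs - a minimal sketch added here for illustration, not code from this thread - one can sample the hugepage counters in /proc/meminfo (AnonHugePages is the counter the transparent hugepage patches export; the HugePages_* fields cover the static hugetlbfs pool):

#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f) {
		perror("/proc/meminfo");
		return 1;
	}
	/* Print only the hugepage-related counters */
	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, "AnonHugePages:", 14) ||
		    !strncmp(line, "HugePages_", 10))
			fputs(line, stdout);
	}
	fclose(f);
	return 0;
}

If AnonHugePages stays at zero while the database runs, the workload is not actually being backed by transparent hugepages and the comparison is meaningless.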
Finally, benchmarking with databases is not new as such - http://lwn.net/Articles/378641/ . This was on fairly simple hardware though, as I didn't have access to hardware more suitable for database workloads. If you are running with transparent huge pages though, be sure to double check that huge pages are actually being used transparently.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with ESMTP id 461D26B01EE for ; Tue, 6 Apr 2010 05:08:45 -0400 (EDT)
Date: Tue, 6 Apr 2010 11:08:13 +0200
From: Ingo Molnar
Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17
Message-ID: <20100406090813.GA14098@elte.hu>
References: <20100405193616.GA5125@elte.hu> <20100405232115.GM5825@random.random> <20100406011345.GT5825@random.random>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: 
Sender: owner-linux-mm@kvack.org
To: Linus Torvalds
Cc: Andrea Arcangeli , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura
List-ID: 

* Linus Torvalds wrote:

> On Mon, 5 Apr 2010, Linus Torvalds wrote:
> >
> > So I thought it was a more interesting load than it was. The
> > virtualization "TLB miss is expensive" load I can't find it in myself to
> > care about. "Get a better CPU" is my answer to that one,
>
> [ Btw, I do realize that "better CPU" in this case may be "future CPU". I
>   just think that this is where better TLB's and using ASID's etc is
>   likely to be a much bigger deal than adding VM complexity. Kind of the
>   same way I think HIGHMEM was ultimately a failure, and the 4G:4G split
>   was an atrocity that should have been killed ]

Both highmem and 4g:4g were failures (albeit highly practical failures, you have to admit) in the sense that their relevance faded over time (because they extended the practical limits of the constantly fading 32-bit world). Both highmem and 4g:4g became less and less of an issue as hardware improved.

OTOH, are you saying the same thing about huge pages? On what basis? Do you think it would be possible for hardware to 'discover' physically-contiguous 2M mappings and turn them into a huge TLB entry internally? [I'm not sure it's feasible even in future CPUs - and even if it is, the OS would still have to do the defrag and keep-them-2MB logic internally, so there's not much difference.]

The numbers seem rather clear:

  http://lwn.net/Articles/378641/

Yes, some of it is benchmarketing (most benchmarks are), but a significant portion of it isn't: HPC processing, DB workloads and Java workloads. Hugepages provide a 'final' performance boost in cases where there's no other software way left to speed up a given workload. The goal of Andrea's and Mel's patch-set, to make this 'final performance boost' more practical, seems like a valid technical goal.
We can still validly reject it all based on VM complexity (albeit the VM people wrote both the defrag part and the transparent usage part, so the patches are all real), but how can we legitimately reject the performance advantage?

I think the hugetlb situation is more similar to the block IO transition to larger sector sizes, or to the networking transition from host-side-everything to checksum-offload and then to TSO, than it is to highmem or 4g:4g. In fact the whole maintenance thought process seems somewhat similar to the TSO situation: the networking folks first rejected TSO based on complexity arguments, but it was embraced after some time.

	Ingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id 112B96B01EE for ; Tue, 6 Apr 2010 05:13:32 -0400 (EDT)
Date: Tue, 6 Apr 2010 11:13:13 +0200
From: Ingo Molnar
Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17
Message-ID: <20100406091313.GA10262@elte.hu>
References: <20100405232115.GM5825@random.random> <20100406011345.GT5825@random.random> <20100406090813.GA14098@elte.hu>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20100406090813.GA14098@elte.hu>
Sender: owner-linux-mm@kvack.org
To: Linus Torvalds
Cc: Andrea Arcangeli , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura
List-ID: 

* Ingo Molnar wrote:

> The numbers seem rather clear:
>
>   http://lwn.net/Articles/378641/
>
> Yes, some of it is benchmarketing (most benchmarks are), but a significant
> portion of it isn't: HPC processing, DB workloads and Java workloads.

( I forgot to mention virtualization - but I guess we can leave that out of the list as uninteresting-for-now. )

	Ingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id ED2C46B01EE for ; Tue, 6 Apr 2010 05:30:44 -0400 (EDT)
Date: Tue, 6 Apr 2010 10:30:21 +0100
From: Mel Gorman
Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17
Message-ID: <20100406093021.GC17882@csn.ul.ie>
References: <20100405120906.0abe8e58.akpm@linux-foundation.org> <20100405193616.GA5125@elte.hu> <20100405232115.GM5825@random.random>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-15
Content-Disposition: inline
In-Reply-To: <20100405232115.GM5825@random.random>
Sender: owner-linux-mm@kvack.org
To: Andrea Arcangeli
Cc: Linus Torvalds , Pekka Enberg , Ingo Molnar , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura
List-ID: 

On Tue, Apr 06, 2010 at 01:21:15AM +0200, Andrea Arcangeli wrote:
> Hi Linus,
>
> On Mon, Apr 05, 2010 at 01:58:57PM -0700, Linus Torvalds wrote:
> > What I'm asking for is this thing called "Does it actually work in
> > REALITY". That's my point about "not just after a clean boot".
> >
> > Just to really hit the issue home, here's my current machine:
> >
> > [root@i5 ~]# free
> >              total       used       free     shared    buffers     cached
> > Mem:       8073864    1808488    6265376          0      75480    1018412
> > -/+ buffers/cache:     714596    7359268
> > Swap:     10207228      12848   10194380
> >
> > Look, I have absolutely _sh*tloads_ of memory, and I'm not using it.
> > Really. I've got 8GB in that machine, it's just not been doing much more
> > than a few "git pull"s and "make allyesconfig" runs to check the current
> > kernel and so it's got over 6GB free.
> >
> > So I'm bound to have _tons_ of 2M pages, no?
> >
> > No. Lookie here:
> >
> > [344492.280001] DMA: 1*4kB 1*8kB 1*16kB 2*32kB 2*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15836kB
> > [344492.280020] DMA32: 17516*4kB 19497*8kB 18318*16kB 15195*32kB 10332*64kB 5163*128kB 1371*256kB 123*512kB 2*1024kB 1*2048kB 0*4096kB = 2745528kB
> > [344492.280027] Normal: 57295*4kB 66959*8kB 39639*16kB 29486*32kB 10483*64kB 2366*128kB 398*256kB 100*512kB 27*1024kB 3*2048kB 0*4096kB = 3503268kB
> >
> > just to help you parse that: this is a _lightly_ loaded machine. It's been
> > up for about four days. And look at it.
> >
> > In case you can't read it, the relevant part is this part:
> >
> >   DMA: .. 1*2048kB 3*4096kB
> >   DMA32: .. 1*2048kB 0*4096kB
> >   Normal: .. 3*2048kB 0*4096kB
> >
> > there is just a _small handful_ of 2MB pages. Seriously. On a machine with
> > 8 GB of RAM, and three quarters of it free, and there is just a couple of
> > contiguous 2MB regions. Note, that's _MB_, not GB.
>

The kernel you are using is presumably fairly recent, so it has anti-fragmentation applied. The point of anti-frag is not to keep fragmentation low at all times but to have the system in a state where fragmentation can be dealt with. Hence, buddyinfo is rarely useful for figuring out "how many huge pages can I allocate?"

In the past when I was measuring fragmentation at a given time, I used both buddyinfo and /proc/kpageflags to check the state of the system.
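For reference, a rough sketch of the buddyinfo side of such a check (my illustration, not from the thread; it assumes x86-64 with 4k base pages, where order 9 is 2M and an order-10 block counts as two 2M blocks):

#include <stdio.h>

int main(void)
{
	char line[512];
	FILE *f = fopen("/proc/buddyinfo", "r");

	if (!f) {
		perror("/proc/buddyinfo");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		char node[16], zone[16];
		long c[11] = { 0 };
		/* Each row lists free block counts for orders 0..10 */
		int n = sscanf(line,
			"Node %15[^,], zone %15s %ld %ld %ld %ld %ld %ld %ld %ld %ld %ld %ld",
			node, zone, &c[0], &c[1], &c[2], &c[3], &c[4],
			&c[5], &c[6], &c[7], &c[8], &c[9], &c[10]);
		if (n >= 12)
			printf("Node %s zone %-8s free 2M blocks: %ld\n",
			       node, zone, c[9] + 2 * c[10]);
	}
	fclose(f);
	return 0;
}

As noted above, this only shows the instantaneous state; it says nothing about how many huge pages could be produced by reclaim or compaction.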
There is a good chance you could allocate a decent percentage of memory as huge pages, but as you are unlikely to have run hugeadm --set-recommended-min_free_kbytes early in boot, it is also likely to thrash heavily and the success rates will not be very impressive. The min_free_kbytes setting is really important. In the past I've used the mm_page_alloc_extfrag tracepoint to measure its effect. With default settings, under heavy loads, the event would trigger hundreds of thousands of times. With set-recommended-min_free_kbytes, it would trigger tens or maybe hundreds of times under the same situations, and the bulk of those events were not severe.

> What I can provide is my current status so far on workstation:
>
> $ free
>              total       used       free     shared    buffers     cached
> Mem:       1923648    1410912     512736          0     332236     391000
> -/+ buffers/cache:     687676    1235972
> Swap:      4200960      14204    4186756
> $ cat /proc/buddyinfo
> Node 0, zone      DMA     46     34     30     12     16     11     10      5      0      1      0
> Node 0, zone    DMA32     33    355    352    129     46   1307    751    225      9      1      0
> $ uptime
>  00:06:54 up 10 days, 5:10, 3 users, load average: 0.00, 0.00, 0.00
> $ grep Anon /proc/meminfo
> AnonPages:         78036 kB
> AnonHugePages:    100352 kB
>
> And laptop:
>
> $ free
>              total       used       free     shared    buffers     cached
> Mem:       3076948    1964136    1112812          0      91920     297212
> -/+ buffers/cache:    1575004    1501944
> Swap:      2939888      17668    2922220
> $ cat /proc/buddyinfo
> Node 0, zone      DMA     26      9      8      3      3      2      2      1      1      3      1
> Node 0, zone    DMA32    840   2142   6455   5848   5156   2554    291     52     30      0      0
> $ uptime
>  00:08:21 up 17 days, 20:17, 5 users, load average: 0.06, 0.01, 0.00
> $ grep Anon /proc/meminfo
> AnonPages:        856332 kB
> AnonHugePages:    272384 kB
>
> this is with:
>
> $ cat /sys/kernel/mm/transparent_hugepage/defrag
> always madvise [never]
> $ cat /sys/kernel/mm/transparent_hugepage/khugepaged/defrag
> [yes] no
>
> Currently the "defrag" sysfs control only toggles __GFP_WAIT from
> on/off in huge_memory.c (details in the patch with subject
> "transparent hugepage core" in the alloc_hugepage()
> function). Toggling __GFP_WAIT is a joke right now.
>
> The real deal to address your worry is first to run "hugeadm
> --set-recommended-min_free_kbytes" and to apply Mel's patches called
> "memory compaction" which is a separate patchset.

The former is critical; the latter is not strictly necessary but it will reduce the cost of hugepage allocation significantly, increase the success rates slightly when under load, and work on swapless systems. It's worth applying both, but transparent hugepage support also stands on its own.

> I'm the consumer, Mel's the producer ;).
>
> With virtual machines the host kernel doesn't need to live forever (it
> has to be stable but we can easily reboot it without the guest noticing),
> we can migrate virtual machines to fresh booted new hosts, voiding the
> whole producer issue. Furthermore VMs are usually started for the first
> time at host boot time, and we want as much memory as possible backed by
> hugepages in the host.
>
> This is not to say that the producer isn't important or can't work,
> Mel posted numbers that show it works, and we definitely want it to
> work, but I'm just trying to make a point that a good consumer of
> plenty of hugepages available at boot is useful even assuming the
> producer won't ever work or won't ever get it (not the real life case
> we're dealing with!).

Most recent figures on huge page allocation under load are at http://lkml.org/lkml/2010/4/2/146. It includes data on the hugepage allocation latency on vanilla kernels and without compaction.
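As an aside, under the "madvise" policy shown in the sysfs output quoted above, an application would opt a region in roughly like this. This is a sketch for illustration only, not code from the patches; MADV_HUGEPAGE is the advice flag the patchset introduces, and the fallback define below is an assumption for headers that don't carry it yet:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdlib.h>

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14	/* assumed value of the new advice flag */
#endif

int main(void)
{
	size_t len = 64 << 20;	/* 64M anonymous region */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	/* Ask for transparent hugepage backing on this range */
	madvise(p, len, MADV_HUGEPAGE);
	/* ... use the memory; khugepaged can collapse it later ... */
	munmap(p, len);
	return 0;
}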
> Initially we're going to take advantage of only the consumer in
> production, exactly because it's already useful, even if we want to
> take advantage of a smart runtime "producer" too later on as time goes
> on. Migrating guests to produce hugepages isn't the ideal way for sure,
> and I'm very confident that Mel's work is already filling the gap very
> nicely.
>
> The VM itself (regardless if the consumer is hugetlbfs or transparent
> hugepage support) is evolving towards being able to generate endless
> amounts of hugepages (in 2M size, 1G still unthinkable because of the
> huge cost) as shown by the already mainline available "hugeadm
> --set-recommended-min_free_kbytes". BTW, I think having this 10 liner
> algorithm in the userland hugeadm binary is wrong and it should be a
> separate sysctl like
> "echo 1 > /sys/kernel/vm/set-recommended-min_free_kbytes", but that's
> offtopic and an implementation detail... This is just to show they are
> already addressing that stuff for hugetlbfs. So I just created a better
> consumer for the stuff they make an effort to produce anyway (i.e. 2M
> pages). The better consumer we have of it in the kernel, the more
> effort will be put into the producer.
>
> > And don't tell me that these things are easy to fix. Don't tell me that
> > the current VM is quite clean and can be harmlessly extended to deal with
> > this all. Just don't. Not when we currently have a totally unexplained
> > regression in the VM from the last scalability thing we did.
>
> Well the risk of regression with the consumer is little if it is
> disabled with sysfs, so it'd be trivial to localize if it caused any
> problem. About memory compaction, I think we should limit the
> invocation of those new VM algorithms to hugetlbfs and transparent
> hugepage support (and I already created the sysfs controls to
> enable/disable those, so you can run transparent hugepage support with
> or without the defrag feature).

This effectively happens with the compaction patches as of V7. It only triggers for orders > PAGE_ALLOC_COSTLY_ORDER, which in practice is mostly hugetlbfs with an occasional bit of madness from a very small number of devices.

> So all of this can be turned off at
> runtime. You can run only the consumer, both consumer and producer, or
> none (and if none, the risk of regression should be zero). There's no
> point to ever defrag if there is no consumer of 2M pages. khugepaged
> should be able to invoke memory compaction comfortably in the defrag
> job in the background if khugepaged/defrag is set to "yes".
>
> I think worrying about the producer too much generates a chicken-and-egg
> problem: without a heavy consumer in mainline, there's little point
> for people to work on the producer.

The other producer I have in mind for compaction in particular is huge page allocation at runtime on swapless systems. hugeadm has the feature of temporarily adding swap while it resizes the pool and, while it works, it's less than ideal because it still requires a local disk. KVM using it for virtual guests would be a heavier user.

> Note that creating a good producer
> wasn't an easy task, I did all I could to keep it self contained and I
> think I succeeded at that. My work as a result created interest in
> improving the producer on Mel's side. I am sure if the consumer goes
> in, producing the stuff will also happen without much problem.
>
> My preferred merging path is to merge the consumer first. But then
> I'm not entirely against the other order too.
> Merging both at the same time looks to me like merging unnecessary
> complexity in the kernel at once, and it'd make things less
> bisectable. But it wouldn't be impossible either.
>
> About the performance benefits, I posted some numbers in linux-mm, but
> I'll collect them here (and this is after boot with plenty of
> hugepages). As a side note, in this first part please note also the
> boost in the page fault rate (but this is really only for curiosity,
> as it will only happen when hugepages are immediately available in the
> buddy).
>
> ------------
> hugepages in the virtualization hypervisor (and also in the guest!) are
> much more important than in a regular host not using virtualization, because
> with NPT/EPT they decrease the tlb-miss cacheline accesses from 24 to 19 in
> case only the hypervisor uses transparent hugepages, and they decrease the
> tlb-miss cacheline accesses from 19 to 15 in case both the linux hypervisor and
> the linux guest use this patch (though the guest will limit the additional
> speedup to anonymous regions only for now...). Even more important is that the
> tlb miss handler is much slower on a NPT/EPT guest than for a regular shadow
> paging or no-virtualization scenario. So maximizing the amount of virtual
> memory cached by the TLB pays off significantly more with NPT/EPT than without
> (even if there would be no significant speedup in the tlb-miss runtime).
>
> [..]
> Some performance result:
>
> vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
> memset page fault 1566023
> memset tlb miss 453854
> memset second tlb miss 453321
> random access tlb miss 41635
> random access second tlb miss 41658
> vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
> memset page fault 1566471
> memset tlb miss 453375
> memset second tlb miss 453320
> random access tlb miss 41636
> random access second tlb miss 41637
> vmx andrea # ./largepages3
> memset page fault 1566642
> memset tlb miss 453417
> memset second tlb miss 453313
> random access tlb miss 41630
> random access second tlb miss 41647
> vmx andrea # ./largepages3
> memset page fault 1566872
> memset tlb miss 453418
> memset second tlb miss 453315
> random access tlb miss 41618
> random access second tlb miss 41659
> vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
> vmx andrea # ./largepages3
> memset page fault 2182476
> memset tlb miss 460305
> memset second tlb miss 460179
> random access tlb miss 44483
> random access second tlb miss 44186
> vmx andrea # ./largepages3
> memset page fault 2182791
> memset tlb miss 460742
> memset second tlb miss 459962
> random access tlb miss 43981
> random access second tlb miss 43988
>
> ============
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <sys/time.h>
>
> #define SIZE (3UL*1024*1024*1024)
>
> int main()
> {
> 	char *p = malloc(SIZE), *p2;
> 	struct timeval before, after;
>
> 	gettimeofday(&before, NULL);
> 	memset(p, 0, SIZE);
> 	gettimeofday(&after, NULL);
> 	printf("memset page fault %Lu\n",
> 	       (after.tv_sec-before.tv_sec)*1000000UL +
> 	       after.tv_usec-before.tv_usec);
>
> 	gettimeofday(&before, NULL);
> 	memset(p, 0, SIZE);
> 	gettimeofday(&after, NULL);
> 	printf("memset tlb miss %Lu\n",
> 	       (after.tv_sec-before.tv_sec)*1000000UL +
> 	       after.tv_usec-before.tv_usec);
>
> 	gettimeofday(&before, NULL);
> 	memset(p, 0, SIZE);
> 	gettimeofday(&after, NULL);
> 	printf("memset second tlb miss %Lu\n",
> 	       (after.tv_sec-before.tv_sec)*1000000UL +
> 	       after.tv_usec-before.tv_usec);
>
> 	gettimeofday(&before, NULL);
> 	for (p2 = p; p2 < p+SIZE; p2 += 4096)
> 		*p2 = 0;
> 	gettimeofday(&after, NULL);
> 	printf("random access tlb miss %Lu\n",
> 	       (after.tv_sec-before.tv_sec)*1000000UL +
> 	       after.tv_usec-before.tv_usec);
>
> 	gettimeofday(&before, NULL);
> 	for (p2 = p; p2 < p+SIZE; p2 += 4096)
> 		*p2 = 0;
> 	gettimeofday(&after, NULL);
> 	printf("random access second tlb miss %Lu\n",
> 	       (after.tv_sec-before.tv_sec)*1000000UL +
> 	       after.tv_usec-before.tv_usec);
>
> 	return 0;
> }
> ============
> -------------
>
> This is a more interesting benchmark of kernel compile and some random
> cpu bound dd command (not a microbenchmark like above):
>
> -----------
> This is a kernel build in a 2.6.31 guest, on a 2.6.34-rc1 host. KVM
> run with "-drive cache=on,if=virtio,boot=on and -smp 4 -m 2g -vnc :0"
> (host has 4G of ram). CPU is Phenom (not II) with NPT (4 cores, 1
> die). All reads are provided from host cache and cpu overhead of the
> I/O is reduced thanks to virtio. Workload is just a "make clean
> >/dev/null; time make -j20 >/dev/null". Results copied by hand because
> I logged through vnc.
>
> real 4m12.498s
> user 14m28.106s
> sys 1m26.721s
>
> real 4m12.000s
> user 14m27.850s
> sys 1m25.729s
>
> After the benchmark:
>
> grep Anon /proc/meminfo
> AnonPages: 121300 kB
> AnonHugePages: 1007616 kB
> cat /debugfs/kvm/largepages
> 2296
>
> 1.6G free in guest and 1.5G free in host.
>
> Then on host:
>
> # echo never > /sys/kernel/mm/transparent_hugepage/enabled
> # echo never > /sys/kernel/mm/transparent_hugepage/khugepaged/enabled
>
> then I restart the VM and re-run the same workload:
>
> real 4m25.040s
> user 15m4.665s
> sys 1m50.519s
>
> real 4m29.653s
> user 15m8.637s
> sys 1m49.631s
>
> (the guest kernel was not so recent and had no transparent hugepage
> support; gcc normally won't take advantage of hugepages according to
> /proc/meminfo, so I made the comparison with a distro guest kernel with
> my usual .config I use in kvm guests)
>
> So the guest compiles the kernel 6% faster with hugepages, and the
> results are trivially reproducible and stable enough (especially with
> hugepages enabled; without them it varies from 4m24s to 4m30s as I
> tried a few more times without hugepages in NPT when userland wasn't
> patched yet...).
>
> Below another test that takes advantage of hugepages in the guest too,
> so running the same 2.6.34-rc1 with transparent hugepage support in
> both host and guest. (this really shows the power of the KVM design,
> we boost the hypervisor and we get a double boost for guest
> applications)
>
> Workload: time dd if=/dev/zero of=/dev/null bs=128M count=100
>
> Host hugepage no guest:              3.898
> Host hugepage guest hugepage:        3.966 (-1.17%)
> Host no hugepage no guest:           4.088 (-4.87%)
> Host hugepage guest no hugepage:     4.312 (-10.1%)
> Host no hugepage guest hugepage:     4.388 (-12.5%)
> Host no hugepage guest no hugepage:  4.425 (-13.5%)
>
> Workload: time dd if=/dev/zero of=/dev/null bs=4M count=1000
>
> Host hugepage no guest:              1.207
> Host hugepage guest hugepage:        1.245 (-3.14%)
> Host no hugepage no guest:           1.261 (-4.47%)
> Host hugepage guest no hugepage:     1.323 (-9.61%)
> Host no hugepage guest hugepage:     1.371 (-13.5%)
> Host no hugepage guest no hugepage:  1.398 (-15.8%)
>
> I've no local EPT system to test so I may run them over vpn later on
> some large EPT system (and surely there are better benchs than a silly
> dd... but this is a start and shows even basic stuff gets the boost).
>
> The above is basically "home-workstation/laptop" coverage.
> I (partly) intentionally ran these on a system that has a ~$100 CPU and
> ~$50 motherboard, to show the absolute worst case, to be sure that
> 100% of home end users (running KVM) will take a measurable advantage
> from this effort.
>
> On huge systems the percentage boost is expected to be much bigger than
> in the home-workstation test above, of course.
> --------------

> Again, gcc is a kind of worst case for it, but it also shows a
> definite, significant and reproducible boost.
>
> Also note, for non-virtualization usage (so outside of
> MADV_HUGEPAGE), invoking memory compaction synchronously is likely to
> risk losing CPU speed. khugepaged takes care of long lived
> allocations of random tasks, and the only thing to use memory
> compaction synchronously could be the page faults of regions marked
> MADV_HUGEPAGE. But we may decide to only invoke memory compaction
> asynchronously and never as a result of direct reclaim in process
> context, to avoid any latency to guest operations. All that matters
> after boot is that khugepaged can do its job; it's not urgent. When
> things are urgent, migrating guests to a new cloud node is always
> possible.
>
> I'd like to clarify this whole work has been done without ever making
> assumptions about virtual machines; I tried to make this as
> universally useful as possible (and not just because we want the exact
> same VM algorithms to trim one level of guest pagetables too, to get a
> cumulative boost, so fully exploiting the KVM design ;). I'm thrilled
> Chris is going to run a host-only test for databases and I'm surely
> willing to help with that.
>
> Compacting everything that is "movable" is surely solvable from a
> theoretical standpoint, and that includes all anonymous memory (huge
> or not) and all cache.

Page migration as it is handles these cases. It can't handle slab, page table pages or some kernel allocations, but anti-fragmentation does a good job of grouping these allocations into the same 2M pages already - particularly when min_free_kbytes is configured correctly.

> That alone accounts for a huge bulk of the total
> memory of a system, so being able to mix it all will result in the
> best behavior, which isn't possible to achieve with hugetlbfs (so if
> the memory isn't allocated as anonymous memory it can still be used as
> cache for I/O).
> So in the very worst case, if everything else fails on
> the producer front (again: not the case as far as I can tell!) what
> should be reserved at boot is an amount of memory to limit the
> unmovable parts there.

This latter part is currently possible with the kernelcore=X boot parameter, so that the unmovable parts are limited to X amount of memory. It shouldn't be necessary to do this, but it is possible. If it is found that it is required, I'd hope to receive a bug report on it.

> And to leave the movable parts free to be
> allocated dynamically without limitations depending on the workloads.
>
> I'm quite sure Mel will be able to provide more details on his work,
> which has been reviewed in detail already on linux-mm with lots of
> positive feedback, which is why I expect zero problems on that side
> too in real life (besides my theoretical standpoint in the previous
> chapter ;).

The details of what I have to say on compaction are covered in the compaction leader http://lkml.org/lkml/2010/4/2/146, including allocation success rates under severe compile-based load and data on allocation latencies.
-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id 866D96B01EE for ; Tue, 6 Apr 2010 05:56:49 -0400 (EDT)
Message-ID: <4BBB052D.8040307@redhat.com>
Date: Tue, 06 Apr 2010 12:55:57 +0300
From: Avi Kivity
MIME-Version: 1.0
Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17
References: <20100405120906.0abe8e58.akpm@linux-foundation.org> <20100405193616.GA5125@elte.hu> <20100405232115.GM5825@random.random> <20100406011345.GT5825@random.random>
In-Reply-To: 
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
To: Linus Torvalds
Cc: Andrea Arcangeli , Pekka Enberg , Ingo Molnar , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura
List-ID: 

On 04/06/2010 05:23 AM, Linus Torvalds wrote:
>
> On Mon, 5 Apr 2010, Linus Torvalds wrote:
>> So I thought it was a more interesting load than it was. The
>> virtualization "TLB miss is expensive" load I can't find it in myself to
>> care about. "Get a better CPU" is my answer to that one,
>
> [ Btw, I do realize that "better CPU" in this case may be "future CPU". I
>   just think that this is where better TLB's and using ASID's etc is
>   likely to be a much bigger deal than adding VM complexity. Kind of the

For virtualization the tlb miss cost comes from two parts. First, there are the 24 memory accesses needed for a tlb fill (instead of the usual 4); these can indeed be improved by various intermediate tlbs (and current processors already have those caches). However, something that cannot be solved by the tlb is the accesses to the last level of the page table hierarchy - as soon as the page tables exceed the cache size, you take two cache misses for each tlb miss. Note that virtualization only magnifies the hit: it also shows up with non-virtualized loads, but there the page tables take only half the cache and you need only one memory access to the last level page table.

Here is a microbenchmark demonstrating the hit (non-virtualized); it simulates a pointer-chasing application with a varying working set. It is easy to see when the working set overflows the various caches, and later when the page tables overflow the caches. For virtualization the hit will be a factor of 3 instead of 2, and will come earlier since the page tables are bigger.

   size    4k (ns)   2M (ns)
   4k      4.9       4.9
   16k     4.9       4.9
   64k     7.6       7.6
   256k    15.1      8.1
   1M      28.5      23.9
   4M      31.8      25.3
   16M     94.8      79.0
   64M     260.9     224.2
   256M    269.8     248.8
   1G      278.1     246.3
   4G      330.9     252.6
   16G     436.3     243.8
   64G     486.0     253.3

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with SMTP id 3E3AD6B01EF for ; Tue, 6 Apr 2010 05:57:27 -0400 (EDT)
Message-ID: <4BBB056D.1020300@redhat.com>
Date: Tue, 06 Apr 2010 12:57:01 +0300
From: Avi Kivity
MIME-Version: 1.0
Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17
References: <20100405120906.0abe8e58.akpm@linux-foundation.org> <20100405193616.GA5125@elte.hu> <20100405232115.GM5825@random.random> <20100406011345.GT5825@random.random> <4BBB052D.8040307@redhat.com>
In-Reply-To: <4BBB052D.8040307@redhat.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
To: Linus Torvalds
Cc: Andrea Arcangeli , Pekka Enberg , Ingo Molnar , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura
List-ID: 

On 04/06/2010 12:55 PM, Avi Kivity wrote:
>
> Here is a microbenchmark demonstrating the hit (non-virtualized); it
> simulates a pointer-chasing application with a varying working set.
> It is easy to see when the working set overflows the various caches,
> and later when the page tables overflow the caches. For
> virtualization the hit will be a factor of 3 instead of 2, and will
> come earlier since the page tables are bigger.
>
>    size    4k (ns)   2M (ns)
>    4k      4.9       4.9
>    16k     4.9       4.9
>    64k     7.6       7.6
>    256k    15.1      8.1
>    1M      28.5      23.9
>    4M      31.8      25.3
>    16M     94.8      79.0
>    64M     260.9     224.2
>    256M    269.8     248.8
>    1G      278.1     246.3
>    4G      330.9     252.6
>    16G     436.3     243.8
>    64G     486.0     253.3

(latencies are for a single read access)

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id 5F0E96B01EE for ; Tue, 6 Apr 2010 06:32:50 -0400 (EDT)
Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17
Mime-Version: 1.0 (Apple Message framework v1077)
Content-Type: text/plain; charset=us-ascii
From: Theodore Tso
In-Reply-To: <20100406093021.GC17882@csn.ul.ie>
Date: Tue, 6 Apr 2010 06:32:28 -0400
Content-Transfer-Encoding: quoted-printable
Message-Id: 
References: <20100405120906.0abe8e58.akpm@linux-foundation.org> <20100405193616.GA5125@elte.hu> <20100405232115.GM5825@random.random> <20100406093021.GC17882@csn.ul.ie>
Sender: owner-linux-mm@kvack.org
To: Mel Gorman
Cc: Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Ingo Molnar , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" ,
Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID:

On Apr 6, 2010, at 5:30 AM, Mel Gorman wrote: > > There is a good chance you could allocate a decent percentage of > memory as huge pages but as you are unlikely to have run hugeadm > --set-recommended-min_free_kbytes early in boot, it is also likely to thrash > heavily and the success rates will not be very impressive.

Can you explain how hugeadm --set-recommended-min_free_kbytes works and how it achieves this magic? Or can you send me a pointer to how this works? I've tried doing some Google searches, and I found the LWN article "Huge pages part 3: administration", but it doesn't go into a lot of detail on how increasing vm.min_free_kbytes helps the anti-fragmentation code. Thanks, -- Ted

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id EDA266B01EE for ; Tue, 6 Apr 2010 07:16:43 -0400 (EDT) Date: Tue, 6 Apr 2010 12:16:19 +0100 From: Mel Gorman Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100406111619.GD17882@csn.ul.ie> References: <20100405120906.0abe8e58.akpm@linux-foundation.org> <20100405193616.GA5125@elte.hu> <20100405232115.GM5825@random.random> <20100406093021.GC17882@csn.ul.ie> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: Theodore Tso Cc: Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Ingo Molnar , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID:

On Tue, Apr 06, 2010 at 06:32:28AM -0400, Theodore Tso wrote: > > On Apr 6, 2010, at 5:30 AM, Mel Gorman wrote: > > > > There is a good chance you could allocate a decent percentage of > > memory as huge pages but as you are unlikely to have run hugeadm > > --set-recommended-min_free_kbytes early in boot, it is also likely to thrash > > heavily and the success rates will not be very impressive. > > Can you explain how hugeadm --set-recommended-min_free_kbytes works and > how it achieves this magic? Or can you send me a pointer to how this works? > I've tried doing some Google searches, and I found the LWN article "Huge > pages part 3: administration", but it doesn't go into a lot of detail on how > increasing vm.min_free_kbytes helps the anti-fragmentation code.

Sure, the details of how and why it works are spread all over the place. It's fairly simple really, and related to how anti-fragmentation does its work. Anti-frag divides up a zone into "arenas" where an arena is usually the default huge page size - 2M on x86-64, 16M on ppc64 etc. Its objective is to keep UNMOVABLE, RECLAIMABLE and MOVABLE pages within the same arenas using multiple free lists. If a page within the desired arena is not available, it falls back to using one of the other arenas. A fallback is a "fragmentation event" as traced by the mm_page_alloc_extfrag event.
A severe event is when a small page is used; a benign event is when a large page (e.g. 2M) is moved to the desired list. It's benign because pages of the same "migrate type" continue to be allocated within the same arena.

How often these "fragmentation events" occur depends on pages of the desired type being always available. This in turn depends on free pages being available, which is easiest to control via min_free_kbytes and is where --set-recommended-min_free_kbytes comes in. By keeping a number of pages free, the probability of a page of the desired type being available increases. As there are three migrate-types we currently care about from an anti-frag perspective, the recommended min_free_kbytes value depends on the number of zones in the system, with 3 arenas' worth of pages kept free per zone. Once set, there will, in most cases, be a page free of the required type at allocation time. It can be observed in practice by tracing mm_page_alloc_extfrag.

The next part of min_free_kbytes is related to the "reserve" blocks which are only important to high-order atomic allocations. There is a maximum of two reserve blocks per zone. For example, on a flat-memory system with one grouping of memory, there would be a maximum of two reserve arenas. On a NUMA system with two nodes, there would be a maximum of four. With multiple groupings of memory such as 32-bit X86 with DMA, Normal and Highmem groups of free-lists, there might be five reserve pageblocks, two each for the Normal and HighMem groupings and just one for DMA as it is only 16MB worth of pages.

The final part of the recommended min_free_kbytes value is a sum of the reserve arenas and the migrate-type arenas to ensure that pages of the required type are free. The function that works this out in libhugetlbfs is

long recommended_minfreekbytes(void)
{
	FILE *f;
	char buf[ZONEINFO_LINEBUF];
	int nr_zones = 0;
	long recommended_min;
	long pageblock_kbytes = kernel_default_hugepage_size() / 1024;

	/* Detect the number of zones in the system */
	f = fopen(PROCZONEINFO, "r");
	if (f == NULL) {
		WARNING("Unable to open " PROCZONEINFO);
		return 0;
	}
	while (fgets(buf, ZONEINFO_LINEBUF, f) != NULL) {
		if (strncmp(buf, "Node ", 5) == 0)
			nr_zones++;
	}
	fclose(f);

	/* Make sure at least 2 pageblocks are free for MIGRATE_RESERVE */
	recommended_min = pageblock_kbytes * nr_zones * 2;

	/*
	 * Make sure that on average at least two pageblocks are almost free
	 * of another type, one for a migratetype to fall back to and a
	 * second to avoid subsequent fallbacks of other types. There are 3
	 * MIGRATE_TYPES we care about.
	 */
	recommended_min += pageblock_kbytes * nr_zones * 3 * 3;
	return recommended_min;
}

Does this clarify why min_free_kbytes helps and why the "recommended" value is what it is?

-- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
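To put rough numbers on that formula (my arithmetic, not from Mel's mail): on x86-64 the pageblock is 2M, so pageblock_kbytes = 2048. On a one-node machine whose /proc/zoneinfo shows three zones (DMA, DMA32, Normal):

    recommended_min = 2048 * 3 * 2          = 12288 kB   (reserve pageblocks)
                    + 2048 * 3 * 3 * 3      = 55296 kB   (migrate-type pageblocks)
                                            = 67584 kB

i.e. hugeadm would suggest a min_free_kbytes of roughly 66MB, against a stock default of only a few MB.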
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id 353CD6B01EE for ; Tue, 6 Apr 2010 07:38:35 -0400 (EDT) Date: Tue, 6 Apr 2010 07:35:42 -0400 From: Chris Mason Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100406113542.GC5218@think> References: <20100405120906.0abe8e58.akpm@linux-foundation.org> <20100405193616.GA5125@elte.hu> <20100405210133.GE21620@think> <4BBA53A0.8050608@redhat.com> <20100406083028.GA17882@csn.ul.ie> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100406083028.GA17882@csn.ul.ie> Sender: owner-linux-mm@kvack.org To: Mel Gorman Cc: Avi Kivity , Linus Torvalds , Pekka Enberg , Ingo Molnar , Andrew Morton , Andrea Arcangeli , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID:

On Tue, Apr 06, 2010 at 09:30:28AM +0100, Mel Gorman wrote: > On Tue, Apr 06, 2010 at 12:18:24AM +0300, Avi Kivity wrote: > > On 04/06/2010 12:01 AM, Chris Mason wrote: > >> On Mon, Apr 05, 2010 at 01:32:21PM -0700, Linus Torvalds wrote: > >> > >>> > >>> On Mon, 5 Apr 2010, Pekka Enberg wrote: > >>> > >>>> AFAIK, most modern GCs split memory in young and old generation > >>>> "zones" and _copy_ surviving objects from the former to the latter if > >>>> their lifetime exceeds some threshold. The JVM keeps scanning the > >>>> smaller young generation very aggressively which causes TLB pressure > >>>> and scans the larger old generation less often. > >>>> > >>> .. my only input to this is: numbers talk, bullsh*t walks. > >>> > >>> I'm not interested in micro-benchmarks, either. I can show infinite TLB > >>> walk improvement in a microbenchmark. > >>> > >> Ok, I'll bite. I should be able to get some database workloads with > >> hugepages, transparent hugepages, and without any hugepages at all. > >> > > > > Please run them in conjunction with Mel Gorman's memory compaction, > > otherwise fragmentation may prevent huge pages from being instantiated. > > > > Strictly speaking, compaction is not necessary to allocate huge pages. > What compaction gets you is > > o Lower latency and cost of huge page allocation > o Works on swapless systems > > What is important is that you run > hugeadm --set-recommended-min_free_kbytes > from the libhugetlbfs 2.8 package early in boot so that > anti-fragmentation is doing as good a job as possible.

Great, I'll make sure to do this.

> If one is very > curious, use the mm_page_alloc_extfrag tracepoint to trace how often severe > fragmentation-related events occur under default settings and with > min_free_kbytes set properly. > > Without the compaction patches, allocating huge pages will be occasionally > *very* expensive as a large number of pages will need to be reclaimed. > Most likely symptom is thrashing while the database starts up. Allocation > success rates will also be lower when under heavy load. > > Running make -j16 at the same time is unlikely to make much of a > difference from a hugepage allocation point of view.
The performance > figures will vary significantly of course as make competes with the > database for CPU time and other resources. Heh, Linus did actually say to run them concurrently with make -j16, but I read it as make -j16 before the database run. My goal will be to fragment the ram, then get a db in ram and see how fast it all goes. Fragmenting memory during the run is only interesting to test compaction, I'd throw out the resulting db benchmark numbers and only count the number of transparent hugepages we were able to allocate. > > Finally, benchmarking with databases is not new as such - > http://lwn.net/Articles/378641/ . This was on fairly simple hardware > though as I didn't have access to hardware more suitable for database > workloads. If you are running with transparent huge pages though, be > sure to double check that huge pages are actually being used > transparently. Will do. It'll take me a few days to get the machines setup and a baseline measurement. -chris -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with SMTP id 937F56B01F0 for ; Tue, 6 Apr 2010 07:57:42 -0400 (EDT) Message-ID: <4BBB2134.9090301@redhat.com> Date: Tue, 06 Apr 2010 14:55:32 +0300 From: Avi Kivity MIME-Version: 1.0 Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 References: <20100405120906.0abe8e58.akpm@linux-foundation.org> <20100405193616.GA5125@elte.hu> <20100405232115.GM5825@random.random> <20100406011345.GT5825@random.random> <4BBB052D.8040307@redhat.com> In-Reply-To: <4BBB052D.8040307@redhat.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Linus Torvalds Cc: Andrea Arcangeli , Pekka Enberg , Ingo Molnar , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On 04/06/2010 12:55 PM, Avi Kivity wrote: > > Here is a microbenchmark demonstrating the hit (non-virtualized); it > simulates a pointer-chasing application with a varying working set. > It is easy to see when the working set overflows the various caches, > and later when the page tables overflow the caches. For > virtualization the hit will be a factor of 3 instead of 2, and will > come earlier since the page tables are bigger. > And here is the same thing with guest latencies as well: Random memory read latency, in nanoseconds, according to working set and page size. 
        ------- host ------   ------------- guest -----------
                              --- hpage=4k ---   -- hpage=2M -
size       4k       2M       4k/4k    2M/4k     4k/2M    2M/2M
4k        4.9      4.9        5.0      4.9       4.9      4.9
16k       4.9      4.9        5.0      4.9       5.0      4.9
64k       7.6      7.6        7.9      7.8       7.8      7.8
256k     15.1      8.1       15.9     10.3      15.4      9.0
1M       28.5     23.9       29.3     37.9      29.3     24.6
4M       31.8     25.3       37.5     42.6      35.5     26.0
16M      94.8     79.0      110.7    107.3      92.0     77.3
64M     260.9    224.2      294.2    247.8     251.5    207.2
256M    269.8    248.8      313.9    253.1     260.1    230.3
1G      278.1    246.3      331.8    273.0     269.9    236.7
4G      330.9    252.6      545.6    346.0     341.6    256.5
16G     436.3    243.8      705.2    458.3     463.9    268.8
64G     486.0    253.3      767.3    532.5     516.9    274.7

It's easy to see how cache effects dominate the tlb walk. The only way hardware can reduce this is by increasing cache sizes dramatically.

-- error compiling committee.c: too many arguments to function -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with SMTP id 30C836B01FA for ; Tue, 6 Apr 2010 09:10:44 -0400 (EDT) Date: Tue, 6 Apr 2010 23:10:24 +1000 From: Nick Piggin Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100406131024.GA5288@laptop> References: <20100405232115.GM5825@random.random> <20100406011345.GT5825@random.random> <4BBB052D.8040307@redhat.com> <4BBB2134.9090301@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4BBB2134.9090301@redhat.com> Sender: owner-linux-mm@kvack.org To: Avi Kivity Cc: Linus Torvalds , Andrea Arcangeli , Pekka Enberg , Ingo Molnar , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID:

On Tue, Apr 06, 2010 at 02:55:32PM +0300, Avi Kivity wrote: > On 04/06/2010 12:55 PM, Avi Kivity wrote: > > > >Here is a microbenchmark demonstrating the hit (non-virtualized); > >it simulates a pointer-chasing application with a varying working > >set. It is easy to see when the working set overflows the various > >caches, and later when the page tables overflow the caches. For > >virtualization the hit will be a factor of 3 instead of 2, and > >will come earlier since the page tables are bigger. > > > > And here is the same thing with guest latencies as well: > > Random memory read latency, in nanoseconds, according to working > set and page size.
>
>         ------- host ------   ------------- guest -----------
>                               --- hpage=4k ---   -- hpage=2M -
> size       4k       2M       4k/4k    2M/4k     4k/2M    2M/2M
> 4k        4.9      4.9        5.0      4.9       4.9      4.9
> 16k       4.9      4.9        5.0      4.9       5.0      4.9
> 64k       7.6      7.6        7.9      7.8       7.8      7.8
> 256k     15.1      8.1       15.9     10.3      15.4      9.0
> 1M       28.5     23.9       29.3     37.9      29.3     24.6
> 4M       31.8     25.3       37.5     42.6      35.5     26.0
> 16M      94.8     79.0      110.7    107.3      92.0     77.3
> 64M     260.9    224.2      294.2    247.8     251.5    207.2
> 256M    269.8    248.8      313.9    253.1     260.1    230.3
> 1G      278.1    246.3      331.8    273.0     269.9    236.7
> 4G      330.9    252.6      545.6    346.0     341.6    256.5
> 16G     436.3    243.8      705.2    458.3     463.9    268.8
> 64G     486.0    253.3      767.3    532.5     516.9    274.7
>
> It's easy to see how cache effects dominate the tlb walk. The only
> way hardware can reduce this is by increasing cache sizes
> dramatically.
Well this is the best attainable speedup in a corner case where the whole memory hierarchy is being actively defeated. The numbers are not surprising. Actual workloads are infinitely more useful. And in most cases, quite possibly hardware improvements like asids will be more useful. I don't really agree with how the virtualization problem is characterised. Xen's way of doing memory virtualization maps directly to normal hardware page tables, so there doesn't seem to be a fundamental requirement for more memory accesses.

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with SMTP id 527276B01FC for ; Tue, 6 Apr 2010 09:13:38 -0400 (EDT) Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Mime-Version: 1.0 (Apple Message framework v1077) Content-Type: text/plain; charset=us-ascii From: Theodore Tso In-Reply-To: <20100406111619.GD17882@csn.ul.ie> Date: Tue, 6 Apr 2010 09:13:20 -0400 Content-Transfer-Encoding: quoted-printable Message-Id: <13812DAC-4B53-4B6B-8725-EBC9E735AF96@mit.edu> References: <20100405120906.0abe8e58.akpm@linux-foundation.org> <20100405193616.GA5125@elte.hu> <20100405232115.GM5825@random.random> <20100406093021.GC17882@csn.ul.ie> <20100406111619.GD17882@csn.ul.ie> Sender: owner-linux-mm@kvack.org To: Mel Gorman Cc: Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Ingo Molnar , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID:

On Apr 6, 2010, at 7:16 AM, Mel Gorman wrote: > > Does this clarify why min_free_kbytes helps and why the "recommended" > value is what it is?

Thanks, this is really helpful. I wonder if it might be a good idea to have a boot command-line option which automatically sets vm.min_free_kbytes to the right value? Most administrators who are used to using hugepages are most familiar with needing to set boot command-line options, and this way they won't need to try to find this new userspace utility. I was looking for hugeadm on Ubuntu, for example, and I couldn't find it. Regards, -- Ted

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with SMTP id 2A3DA6B01FE for ; Tue, 6 Apr 2010 09:23:26 -0400 (EDT) Message-ID: <4BBB359D.1020603@redhat.com> Date: Tue, 06 Apr 2010 16:22:37 +0300 From: Avi Kivity MIME-Version: 1.0 Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 References: <20100405232115.GM5825@random.random> <20100406011345.GT5825@random.random> <4BBB052D.8040307@redhat.com> <4BBB2134.9090301@redhat.com> <20100406131024.GA5288@laptop> In-Reply-To: <20100406131024.GA5288@laptop> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: Linus Torvalds , Andrea Arcangeli , Pekka Enberg , Ingo Molnar , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID:

On 04/06/2010 04:10 PM, Nick Piggin wrote: > On Tue, Apr 06, 2010 at 02:55:32PM +0300, Avi Kivity wrote: > >> On 04/06/2010 12:55 PM, Avi Kivity wrote: >> >>> Here is a microbenchmark demonstrating the hit (non-virtualized); >>> it simulates a pointer-chasing application with a varying working >>> set. It is easy to see when the working set overflows the various >>> caches, and later when the page tables overflow the caches. For >>> virtualization the hit will be a factor of 3 instead of 2, and >>> will come earlier since the page tables are bigger. >>> >>> >> And here is the same thing with guest latencies as well: >> >> Random memory read latency, in nanoseconds, according to working >> set and page size.
>>
>>         ------- host ------   ------------- guest -----------
>>                               --- hpage=4k ---   -- hpage=2M -
>> size       4k       2M       4k/4k    2M/4k     4k/2M    2M/2M
>> 4k        4.9      4.9        5.0      4.9       4.9      4.9
>> 16k       4.9      4.9        5.0      4.9       5.0      4.9
>> 64k       7.6      7.6        7.9      7.8       7.8      7.8
>> 256k     15.1      8.1       15.9     10.3      15.4      9.0
>> 1M       28.5     23.9       29.3     37.9      29.3     24.6
>> 4M       31.8     25.3       37.5     42.6      35.5     26.0
>> 16M      94.8     79.0      110.7    107.3      92.0     77.3
>> 64M     260.9    224.2      294.2    247.8     251.5    207.2
>> 256M    269.8    248.8      313.9    253.1     260.1    230.3
>> 1G      278.1    246.3      331.8    273.0     269.9    236.7
>> 4G      330.9    252.6      545.6    346.0     341.6    256.5
>> 16G     436.3    243.8      705.2    458.3     463.9    268.8
>> 64G     486.0    253.3      767.3    532.5     516.9    274.7
>>
>> It's easy to see how cache effects dominate the tlb walk. The only
>> way hardware can reduce this is by increasing cache sizes
>> dramatically.
>>
> Well this is the best attainable speedup in a corner case where the > whole memory hierarchy is being actively defeated. The numbers are > not surprising.

Of course this shows the absolute worst case and will never show up directly in any real workload. The point wasn't that we expect a 3x speedup from large pages (far from it), but to show the problem is due to page tables overflowing the cache, not to any miss handler inefficiency. It also shows that virtualization only increases the impact, but isn't the direct cause. The real problem is large active working sets.

> Actual workloads are infinitely more useful. And in > most cases, quite possibly hardware improvements like asids will > be more useful.
> This already has ASIDs for the guest; and for the host they wouldn't help much since there's only one process running. I don't see how hardware improvements can drastically change the numbers above, it's clear that for the 4k case the host takes a cache miss for the pte, and twice for the 4k/4k guest case. > I don't really agree with how virtualization problem is characterised. > Xen's way of doing memory virtualization maps directly to normal > hardware page tables so there doesn't seem like a fundamental > requirement for more memory accesses. > The Xen pv case only works for modified guests (so no Windows), and doesn't support host memory management like swapping or ksm. Xen hvm (which runs unmodified guests) has the same problems as kvm. Note kvm can use a single layer of translation (and does on older hardware), so it would behave like the host, but that increases the cost of pte updates dramatically. -- error compiling committee.c: too many arguments to function -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with SMTP id AB0786B01F4 for ; Tue, 6 Apr 2010 09:45:47 -0400 (EDT) Date: Tue, 6 Apr 2010 23:45:39 +1000 From: Nick Piggin Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100406134539.GC5288@laptop> References: <20100405232115.GM5825@random.random> <20100406011345.GT5825@random.random> <4BBB052D.8040307@redhat.com> <4BBB2134.9090301@redhat.com> <20100406131024.GA5288@laptop> <4BBB359D.1020603@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4BBB359D.1020603@redhat.com> Sender: owner-linux-mm@kvack.org To: Avi Kivity Cc: Linus Torvalds , Andrea Arcangeli , Pekka Enberg , Ingo Molnar , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Tue, Apr 06, 2010 at 04:22:37PM +0300, Avi Kivity wrote: > On 04/06/2010 04:10 PM, Nick Piggin wrote: > >Actual workloads are infinitely more useful. And in > >most cases, quite possibly hardware improvements like asids will > >be more useful. > > This already has ASIDs for the guest; and for the host they wouldn't > help much since there's only one process running. I didn't realize these improvements were directed completely at the virtualized case. > I don't see how > hardware improvements can drastically change the numbers above, it's > clear that for the 4k case the host takes a cache miss for the pte, > and twice for the 4k/4k guest case. It's because you're missing the point. You're taking the most unrealistic and pessimal cases and then showing that it has fundamental problems. Speedups like Linus is talking about would refer to ways to speed up actual workloads, not ways to avoid fundamental limitations. Prefetching, memory parallelism, caches. It's worked for 25 years :) > >I don't really agree with how virtualization problem is characterised. 
> >Xen's way of doing memory virtualization maps directly to normal > >hardware page tables so there doesn't seem like a fundamental > >requirement for more memory accesses. > > The Xen pv case only works for modified guests (so no Windows), and > doesn't support host memory management like swapping or ksm. Xen > hvm (which runs unmodified guests) has the same problems as kvm. > > Note kvm can use a single layer of translation (and does on older > hardware), so it would behave like the host, but that increases the > cost of pte updates dramatically. So it is fundamentally possible. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with SMTP id 580856B01EE for ; Tue, 6 Apr 2010 09:58:31 -0400 (EDT) Message-ID: <4BBB3DDB.7010101@redhat.com> Date: Tue, 06 Apr 2010 16:57:47 +0300 From: Avi Kivity MIME-Version: 1.0 Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 References: <20100405232115.GM5825@random.random> <20100406011345.GT5825@random.random> <4BBB052D.8040307@redhat.com> <4BBB2134.9090301@redhat.com> <20100406131024.GA5288@laptop> <4BBB359D.1020603@redhat.com> <20100406134539.GC5288@laptop> In-Reply-To: <20100406134539.GC5288@laptop> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: Linus Torvalds , Andrea Arcangeli , Pekka Enberg , Ingo Molnar , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On 04/06/2010 04:45 PM, Nick Piggin wrote: > On Tue, Apr 06, 2010 at 04:22:37PM +0300, Avi Kivity wrote: > >> On 04/06/2010 04:10 PM, Nick Piggin wrote: >> >>> Actual workloads are infinitely more useful. And in >>> most cases, quite possibly hardware improvements like asids will >>> be more useful. >>> >> This already has ASIDs for the guest; and for the host they wouldn't >> help much since there's only one process running. >> > I didn't realize these improvements were directed completely at the > virtualized case. > I've read somewhere that future x86 will get non virtualization ASIDs, but currently that's the case. They've been present for the virtualized case for a few years now on AMD and introduced recently (with Nehalem) on Intel (known as VPIDs). >> I don't see how >> hardware improvements can drastically change the numbers above, it's >> clear that for the 4k case the host takes a cache miss for the pte, >> and twice for the 4k/4k guest case. >> > It's because you're missing the point. You're taking the most > unrealistic and pessimal cases and then showing that it has fundamental > problems. That's just a demonstration. Again, I don't expect 3x speedups from large pages. > Speedups like Linus is talking about would refer to ways to > speed up actual workloads, not ways to avoid fundamental limitations. > > Prefetching, memory parallelism, caches. 
It's worked for 25 years :) >

Prefetching and memory parallelism are defeated by pointer chasing, which many workloads do. It's no accident that Java is a large beneficiary of large pages since Java programs are lots of small objects scattered around in memory. Caches don't scale as fast as memory, and are shared with data and other cores anyway. If you have 200ns of honest work per pointer dereference, then a 64GB working set will still see 300ns stalls with 4k pages vs 50 ns with large pages (both non-virtualized). 200ns is quite a bit of work per object.

>>> I don't really agree with how virtualization problem is characterised. >>> Xen's way of doing memory virtualization maps directly to normal >>> hardware page tables so there doesn't seem like a fundamental >>> requirement for more memory accesses. >>> >> The Xen pv case only works for modified guests (so no Windows), and >> doesn't support host memory management like swapping or ksm. Xen >> hvm (which runs unmodified guests) has the same problems as kvm. >> >> Note kvm can use a single layer of translation (and does on older >> hardware), so it would behave like the host, but that increases the >> cost of pte updates dramatically. >> > So it is fundamentally possible. >

The costs are much bigger than the gain, especially when scaling the number of vcpus.

-- error compiling committee.c: too many arguments to function -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with SMTP id B6D416B01EE for ; Tue, 6 Apr 2010 10:45:35 -0400 (EDT) Message-ID: <4BBB48D7.6080303@redhat.com> Date: Tue, 06 Apr 2010 10:44:39 -0400 From: Rik van Riel MIME-Version: 1.0 Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 References: <20100405232115.GM5825@random.random> <20100406011345.GT5825@random.random> <4BBB052D.8040307@redhat.com> <4BBB2134.9090301@redhat.com> <20100406131024.GA5288@laptop> In-Reply-To: <20100406131024.GA5288@laptop> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: Avi Kivity , Linus Torvalds , Andrea Arcangeli , Pekka Enberg , Ingo Molnar , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID:

On 04/06/2010 09:10 AM, Nick Piggin wrote: > I don't really agree with how virtualization problem is characterised. > Xen's way of doing memory virtualization maps directly to normal > hardware page tables so there doesn't seem like a fundamental > requirement for more memory accesses.

Xen also uses nested paging wherever possible, because shadow page tables are even slower than nested page tables.

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
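For reference, the "24 memory accesses" figure from earlier in the thread is the usual accounting for the two-dimensional walk that nested paging does (standard arithmetic, not quoted from anyone's mail). With 4-level guest and 4-level host page tables, every guest page-table reference must itself be translated by a full host walk, and the final guest-physical address needs one more:

    4 guest levels x 4 host levels = 16 loads to translate the guest table addresses
                                   +  4 loads for the guest entries themselves
                                   +  4 loads to translate the final guest-physical address
                                   = 24 loads, versus 4 native

A shadow scheme collapses this back to a single 4-step walk, which is Rik's point: it pays instead with expensive software maintenance of the shadow tables on every guest page-table update.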
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with ESMTP id 17CF06B01EE for ; Tue, 6 Apr 2010 10:55:34 -0400 (EDT) Date: Tue, 6 Apr 2010 15:55:13 +0100 From: Mel Gorman Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100406145512.GE17882@csn.ul.ie> References: <20100405193616.GA5125@elte.hu> <20100405232115.GM5825@random.random> <20100406093021.GC17882@csn.ul.ie> <20100406111619.GD17882@csn.ul.ie> <13812DAC-4B53-4B6B-8725-EBC9E735AF96@mit.edu> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <13812DAC-4B53-4B6B-8725-EBC9E735AF96@mit.edu> Sender: owner-linux-mm@kvack.org To: Theodore Tso Cc: Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Ingo Molnar , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID:

On Tue, Apr 06, 2010 at 09:13:20AM -0400, Theodore Tso wrote: > > On Apr 6, 2010, at 7:16 AM, Mel Gorman wrote: > > > > > Does this clarify why min_free_kbytes helps and why the "recommended" > > value is what it is? > > Thanks, this is really helpful. I wonder if it might be a good idea to > have a boot command-line option which automatically sets vm.min_free_kbytes > to the right value?

I considered automatically adjusting it the first time huge pages are used, as a command-line option or even a magic value written to proc. It's trivial to implement each option; I just haven't gotten around to doing it. There was less pressure once the tool existed.

> Most administrators who are used to using hugepages, > are most familiar with needing to set boot command-line options, and this way > they won't need to try to find this new userspace utility.

The utility covers a host of other use cases as well e.g. creates mount points, sets quota, sizes pools (both static and dynamic), reports on the current state of the system, can auto tune shmem settings etc.

> I was looking > for hugeadm on Ubuntu, for example, and I couldn't find it.

It's relatively recent and there isn't debian packaging for it (although an old one was sent to debian mentors once upon a time but never finished). It's on the TODO list of infinite woe to finish that packaging and go through Debian so it ends up in Ubuntu eventually.

-- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id 4AE846B01E3 for ; Tue, 6 Apr 2010 13:22:24 -0400 (EDT) Date: Tue, 6 Apr 2010 18:43:19 +0200 From: Andrea Arcangeli Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100406164319.GY5825@random.random> References: <20100405232115.GM5825@random.random> <20100406011345.GT5825@random.random> <4BBB052D.8040307@redhat.com> <4BBB2134.9090301@redhat.com> <20100406131024.GA5288@laptop> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100406131024.GA5288@laptop> Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: Avi Kivity , Linus Torvalds , Pekka Enberg , Ingo Molnar , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID:

Hi Nick, On Tue, Apr 06, 2010 at 11:10:24PM +1000, Nick Piggin wrote: > most cases, quite possibly hardware improvements like asids will > be more useful.

ASIDs already exist; they're not about preventing a vmexit for every tlb flush or, alternatively, for guest pagetable updates. In short, NPT/EPT is to ASID what x86-64 is to PAE, not the other way around. It simplifies things and speeds up server workloads tremendously. If you want ASIDs, you have to manage them in the guest OS, or in regular linux on the host, regardless of whether virtualization is on or off.

Anyway hugetlbfs existed in linux way before virtualization ever existed, so I guess we should keep the virtualization talk aside for now to make everyone happy; I already once said in this thread that this whole work has been done in a way not specific to virtualization, so let's focus on applications that have a larger working set than gcc/vi/make/git. Somebody should explain why exactly hugetlbfs is included in the 2.6.34 kernel if tlb miss cost doesn't matter, and why so much work keeps going in the hugetlbfs direction, including the 1g page size; java runs on hugetlbfs, oracle runs on hugetlbfs, etc... tons of apps are using libhugetlbfs, and hugetlbfs is growing like its own VM that eventually will be able to swap on its own.

> I don't really agree with how virtualization problem is characterised. > Xen's way of doing memory virtualization maps directly to normal > hardware page tables so there doesn't seem like a fundamental > requirement for more memory accesses.

Xen also takes advantage of NPT/EPT; when it does, it surely has the same hardware runtime cost as KVM without hugepages, unless Xen or the guest or both are using hugepages somewhere and trimming the pte level from the shadow or guest pagetables.

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with SMTP id C43CF6B01EF for ; Tue, 6 Apr 2010 13:27:21 -0400 (EDT) Date: Tue, 6 Apr 2010 18:50:31 +0200 From: Andrea Arcangeli Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100406165031.GA5825@random.random> References: <20100405232115.GM5825@random.random> <20100406011345.GT5825@random.random> <4BBB052D.8040307@redhat.com> <4BBB2134.9090301@redhat.com> <20100406131024.GA5288@laptop> <4BBB359D.1020603@redhat.com> <20100406134539.GC5288@laptop> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100406134539.GC5288@laptop> Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: Avi Kivity , Linus Torvalds , Pekka Enberg , Ingo Molnar , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Tue, Apr 06, 2010 at 11:45:39PM +1000, Nick Piggin wrote: > problems. Speedups like Linus is talking about would refer to ways to > speed up actual workloads, not ways to avoid fundamental limitations. > > Prefetching, memory parallelism, caches. It's worked for 25 years :) This will always give you a worst case additional 6% on top (gcc is a definitive worst case) of all other speedup of the actual workloads, for server loads more likely >=15% boost. It's plain underclocking your CPU not to run this. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id 655276B01F1 for ; Tue, 6 Apr 2010 13:32:12 -0400 (EDT) Message-ID: <4BBB6FEC.9050205@redhat.com> Date: Tue, 06 Apr 2010 20:31:24 +0300 From: Avi Kivity MIME-Version: 1.0 Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 References: <20100405232115.GM5825@random.random> <20100406011345.GT5825@random.random> <4BBB052D.8040307@redhat.com> <4BBB2134.9090301@redhat.com> <20100406131024.GA5288@laptop> <4BBB359D.1020603@redhat.com> <20100406134539.GC5288@laptop> <20100406165031.GA5825@random.random> In-Reply-To: <20100406165031.GA5825@random.random> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Andrea Arcangeli Cc: Nick Piggin , Linus Torvalds , Pekka Enberg , Ingo Molnar , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On 04/06/2010 07:50 PM, Andrea Arcangeli wrote: > On Tue, Apr 06, 2010 at 11:45:39PM +1000, Nick Piggin wrote: > >> problems. 
Speedups like Linus is talking about would refer to ways to >> speed up actual workloads, not ways to avoid fundamental limitations. >> >> Prefetching, memory parallelism, caches. It's worked for 25 years :) >> > This will always give you a worst case additional 6% on top (gcc is a > definitive worst case) of all other speedup of the actual workloads, > for server loads more likely >=15% boost. It's plain underclocking > your CPU not to run this. >

I don't think gcc is worst case. Workloads that benefit from large pages are those with bloated working sets that do a lot of pointer chasing and do little computation in between. gcc fits two out of three (just a partial score on the first).

-- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with SMTP id 965826B01E3 for ; Tue, 6 Apr 2010 13:42:33 -0400 (EDT) Date: Tue, 6 Apr 2010 18:46:25 +0200 From: Andrea Arcangeli Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100406164625.GZ5825@random.random> References: <20100405193616.GA5125@elte.hu> <20100405232115.GM5825@random.random> <20100406093021.GC17882@csn.ul.ie> <20100406111619.GD17882@csn.ul.ie> <13812DAC-4B53-4B6B-8725-EBC9E735AF96@mit.edu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <13812DAC-4B53-4B6B-8725-EBC9E735AF96@mit.edu> Sender: owner-linux-mm@kvack.org To: Theodore Tso Cc: Mel Gorman , Linus Torvalds , Pekka Enberg , Ingo Molnar , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID:

On Tue, Apr 06, 2010 at 09:13:20AM -0400, Theodore Ts'o wrote: > > On Apr 6, 2010, at 7:16 AM, Mel Gorman wrote: > > > > > Does this clarify why min_free_kbytes helps and why the "recommended" > > value is what it is? > > Thanks, this is really helpful. I wonder if it might be a good idea to have a boot command-line option which automatically sets vm.min_free_kbytes to the right value? Most administrators who are used to using hugepages are most familiar with needing to set boot command-line options, and this way they won't need to try to find this new userspace utility. I was looking for hugeadm on Ubuntu, for example, and I couldn't find it.

It's part of libhugetlbfs. I also suggested in an earlier email that this would be better as "echo 1 >/sys/kernel/vm/set-recommended-min_free_kbytes" or set-recommended-min_free_kbytes=1 at boot, considering it's a 10-liner piece of code that does the math to set it. But it's no big deal on my side, the important thing is that we have that feature.

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
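The 10-liner Andrea refers to could take roughly the following shape (a sketch patterned on the libhugetlbfs math quoted earlier in the thread; the initcall hook and the exact name are my guesses, not a quote from aa.git):

/* Sketch only: in-kernel version of hugeadm's min_free_kbytes math,
 * run once zones are up. Mirrors recommended_minfreekbytes() from
 * libhugetlbfs.                                                    */
static int __init set_recommended_min_free_kbytes(void)
{
	struct zone *zone;
	int nr_zones = 0;
	unsigned long recommended_min;

	for_each_populated_zone(zone)
		nr_zones++;

	/* 2 reserve pageblocks + 3*3 fallback pageblocks per zone */
	recommended_min = pageblock_nr_pages * nr_zones * 2;
	recommended_min += pageblock_nr_pages * nr_zones * 3 * 3;
	recommended_min <<= (PAGE_SHIFT - 10);	/* pages -> kilobytes */

	if (recommended_min > min_free_kbytes) {
		min_free_kbytes = recommended_min;
		setup_per_zone_wmarks();
	}
	return 0;
}
late_initcall(set_recommended_min_free_kbytes);

Only raising min_free_kbytes (never lowering it) keeps an administrator's larger hand-tuned value intact.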
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with SMTP id 8965D6B01F6 for ; Tue, 6 Apr 2010 14:01:58 -0400 (EDT) Date: Tue, 6 Apr 2010 13:00:43 -0500 (CDT) From: Christoph Lameter Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 In-Reply-To: <4BBB6FEC.9050205@redhat.com> Message-ID: References: <20100405232115.GM5825@random.random> <20100406011345.GT5825@random.random> <4BBB052D.8040307@redhat.com> <4BBB2134.9090301@redhat.com> <20100406131024.GA5288@laptop> <4BBB359D.1020603@redhat.com> <20100406134539.GC5288@laptop> <20100406165031.GA5825@random.random> <4BBB6FEC.9050205@redhat.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Avi Kivity Cc: Andrea Arcangeli , Nick Piggin , Linus Torvalds , Pekka Enberg , Ingo Molnar , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Tue, 6 Apr 2010, Avi Kivity wrote: > On 04/06/2010 07:50 PM, Andrea Arcangeli wrote: > > On Tue, Apr 06, 2010 at 11:45:39PM +1000, Nick Piggin wrote: > > > > > problems. Speedups like Linus is talking about would refer to ways to > > > speed up actual workloads, not ways to avoid fundamental limitations. > > > > > > Prefetching, memory parallelism, caches. It's worked for 25 years :) > > > > > This will always give you a worst case additional 6% on top (gcc is a > > definitive worst case) of all other speedup of the actual workloads, > > for server loads more likely>=15% boost. It's plain underclocking > > your CPU not to run this. > > > > I don't think gcc is worst case. Workloads that benefit from large pages are > those with bloated working sets that do a lot of pointer chasing and do little > computation in between. gcc fits two out of three (just a partial score on > the first). Once you have huge pages you will likely start to optimize for locality. Pointer chasing is bad even with huge pages if you go between multiple huge pages and you are beyond the number of huge tlb entries supported by the cpu. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
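Some rough reach numbers behind Christoph's caveat (typical figures for x86 CPUs of that era, my illustration rather than anything from his mail): a core with 512 4k DTLB entries covers 2MB of data, while 32 huge-page DTLB entries cover 64MB. So huge pages extend the no-miss window by roughly 32x, but a pointer chase across a multi-GB heap still misses the TLB either way - the difference is that the 2M fill is far more likely to be satisfied from cache, as Avi's table shows.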
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with SMTP id 246FE6B01F8 for ; Tue, 6 Apr 2010 14:05:25 -0400 (EDT) Message-ID: <4BBB77BC.60409@redhat.com> Date: Tue, 06 Apr 2010 21:04:44 +0300 From: Avi Kivity MIME-Version: 1.0 Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 References: <20100405232115.GM5825@random.random> <20100406011345.GT5825@random.random> <4BBB052D.8040307@redhat.com> <4BBB2134.9090301@redhat.com> <20100406131024.GA5288@laptop> <4BBB359D.1020603@redhat.com> <20100406134539.GC5288@laptop> <20100406165031.GA5825@random.random> <4BBB6FEC.9050205@redhat.com> In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Christoph Lameter Cc: Andrea Arcangeli , Nick Piggin , Linus Torvalds , Pekka Enberg , Ingo Molnar , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On 04/06/2010 09:00 PM, Christoph Lameter wrote: > >> I don't think gcc is worst case. Workloads that benefit from large pages are >> those with bloated working sets that do a lot of pointer chasing and do little >> computation in between. gcc fits two out of three (just a partial score on >> the first). >> > Once you have huge pages you will likely start to optimize for locality. > > Pointer chasing is bad even with huge pages if you go between multiple > huge pages and you are beyond the number of huge tlb entries supported by > the cpu. > A hugetlb miss is serviced from the L2 or L3 cache. A smalltlb miss is serviced from main memory. The miss rate is important, but not nearly as important as fill latency. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id E1C236B01F9 for ; Tue, 6 Apr 2010 14:48:32 -0400 (EDT) Message-ID: <4BBB81BB.9080206@redhat.com> Date: Tue, 06 Apr 2010 21:47:23 +0300 From: Avi Kivity MIME-Version: 1.0 Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 References: <20100405232115.GM5825@random.random> <20100406011345.GT5825@random.random> <4BBB052D.8040307@redhat.com> <4BBB2134.9090301@redhat.com> <20100406131024.GA5288@laptop> <4BBB359D.1020603@redhat.com> <20100406134539.GC5288@laptop> In-Reply-To: <20100406134539.GC5288@laptop> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: Linus Torvalds , Andrea Arcangeli , Pekka Enberg , Ingo Molnar , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. 
Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On 04/06/2010 04:45 PM, Nick Piggin wrote: > >> I don't see how >> hardware improvements can drastically change the numbers above, it's >> clear that for the 4k case the host takes a cache miss for the pte, >> and twice for the 4k/4k guest case. >> > It's because you're missing the point. You're taking the most > unrealistic and pessimal cases and then showing that it has fundamental > problems. Speedups like Linus is talking about would refer to ways to > speed up actual workloads, not ways to avoid fundamental limitations. > > Prefetching, memory parallelism, caches. It's worked for 25 years :) > btw, a workload that's known to benefit greatly from large pages is the kernel itself. It's very pointer-chasey and has a large working set (the whole of memory, in fact). But once you run it in a guest you've turned it into the 2M/4k case in the table which is basically a slightly slower version of host 4k pages. So, if we want good support for kernel intensive workloads in guests, or kernel-like workloads in the host (or kernel-like workloads in guest userspace), then we need good large page support. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id 234946B01F4 for ; Sat, 10 Apr 2010 14:49:18 -0400 (EDT) Date: Sat, 10 Apr 2010 20:47:50 +0200 From: Andrea Arcangeli Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100410184750.GJ5708@random.random> References: <20100405232115.GM5825@random.random> <20100406011345.GT5825@random.random> <20100406090813.GA14098@elte.hu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100406090813.GA14098@elte.hu> Sender: owner-linux-mm@kvack.org To: Ingo Molnar Cc: Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: Hi Ingo, On Tue, Apr 06, 2010 at 11:08:13AM +0200, Ingo Molnar wrote: > The goal of Andrea's and Mel's patch-set, to make this 'final performance > boost' more practical seems like a valid technical goal. The integration in my current git tree (#19+): git clone git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git git clone --reference linux-2.6 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git later -> git fetch; git checkout -f origin/master is working great and runs rock solid after the last integration bugfix in migrate.c, enjoy! ;) This is on my workstation, after building a ton of packages (including javac binaries and all sort of other random stuff), lots of kernels, mutt on large maildir folders, and running lots of ebuild that is super heavy in vfs terms. 
# free
             total       used       free     shared    buffers     cached
Mem:       3923408    2536380    1387028          0     482656    1194228
-/+ buffers/cache:     859496    3063912
Swap:      4200960        788    4200172
# uptime
 20:09:50 up 1 day, 13:19, 11 users,  load average: 0.00, 0.00, 0.00
# cat /proc/buddyinfo /proc/extfrag_index /proc/unusable_index
Node 0, zone      DMA      4      2      3      2      2      0      1      0      1      1      3
Node 0, zone    DMA32  10402  32864  10477   3729   2154   1156    471    136     22     50     41
Node 0, zone   Normal    196    155     40     21     16      7      4      1      0      2      0
Node 0, zone      DMA -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
Node 0, zone    DMA32 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
Node 0, zone   Normal -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000  0.992
Node 0, zone      DMA  0.000  0.001  0.002  0.005  0.009  0.017  0.017  0.033  0.033  0.097  0.226
Node 0, zone    DMA32  0.000  0.030  0.223  0.347  0.434  0.536  0.644  0.733  0.784  0.801  0.876
Node 0, zone   Normal  0.000  0.072  0.185  0.244  0.306  0.400  0.482  0.576  0.623  0.623  1.000
# time echo 3 > /proc/sys/vm/drop_caches
real    0m0.989s
user    0m0.000s
sys     0m0.984s
# time echo > /proc/sys/vm/compact_memory
real    0m0.195s
user    0m0.000s
sys     0m0.124s
# cat /proc/buddyinfo /proc/extfrag_index /proc/unusable_index
Node 0, zone      DMA      4      2      3      2      2      0      1      0      1      1      3
Node 0, zone    DMA32   1632   1444   1336   1065    748    449    229    128     59     50    685
Node 0, zone   Normal   1046    783    552    367    261    176    116     82     50     43     15
Node 0, zone      DMA -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
Node 0, zone    DMA32 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
Node 0, zone   Normal -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
Node 0, zone      DMA  0.000  0.001  0.002  0.005  0.009  0.017  0.017  0.033  0.033  0.097  0.226
Node 0, zone    DMA32  0.000  0.001  0.005  0.012  0.022  0.037  0.054  0.072  0.092  0.111  0.142
Node 0, zone   Normal  0.000  0.012  0.030  0.056  0.090  0.139  0.205  0.291  0.414  0.563  0.820
# free
             total       used       free     shared    buffers     cached
Mem:       3923408     295240    3628168          0       4636      23192
-/+ buffers/cache:     267412    3655996
Swap:      4200960        788    4200172
# grep Anon /proc/meminfo
AnonPages:        210472 kB
AnonHugePages:    102400 kB

(now AnonPages includes AnonHugePages, for backwards compatibility, sorry about not having done it earlier, so ~50% of anon ram is in hugepages)

MB of hugepages before drop_caches+compact_memory:

>>> (41)*4+(52)*2
268

MB of hugepages after drop_caches+compact_memory:

>>> (685+15)*4+(50+43)*2
2986

Total ram free: 3543 MB. 84% of the RAM not affected by unmovable stuff after huge vfs slab load for about 2 days.

On the laptop I got a huge swap storm that killed kdeinit4 with the oom killer while I was away (found the login back in kdm4 when I got back). That presumably split all hugepages, and after a while I got all hugepages back:

# grep Anon /proc/meminfo
AnonPages:        767680 kB
AnonHugePages:    395264 kB
# uptime
 20:33:33 up 1 day, 13:45,  9 users,  load average: 0.00, 0.00, 0.00
# dmesg|grep kill
Out of memory: kill process 8869 (kdeinit4) score 320362 or a child

(50% of ram in hugepages and 400M more of hugepages immediately available after invoking drop_caches/compact_memory manually with the two sysctl)

And if this isn't enough, kernelcore= can also provide an even stronger guarantee to prevent unmovable stuff from spilling over, and to start shrinking freeable slab before it's too late.
The drop_caches work would normally be done by try_to_free_pages internally, interleaved with the try_to_compact_pages calls of course; this is just to show the full potential of set_recommended_min_free_kbytes (run in-kernel automatically at late_initcall unless you boot with transparent_hugepage=0) and of memory compaction, on top of the already compound-aware try_to_free_pages (in addition to the movable/unmovable order fallback of set_recommended_min_free_kbytes). And this is without using kernelcore=, while allowing ebuild and other heavy unmovable slab users to grow as much as they want, with only 3G of RAM.

The sluggishness of invoking alloc_pages with __GFP_WAIT from hugepage page faults (synchronously in direct reclaim) is also completely gone after I tracked it down to lumpy reclaim, which I simply nuked.

This is already fully usable and works great: as Avi showed, it boosts even a plain sort on the host by 6% (think of HPC applications), and soon I hope to boost gcc on the host by 6% as well (and by >15% in guest with NPT/EPT) by extending vm_end in 2M chunks in glibc, at least for those huge gcc builds taking >200M like translate.o of qemu-kvm... (so I hope gcc running in a KVM guest, thanks to EPT/NPT, will soon run faster than on a mainline kernel without transparent hugepages on bare metal). Now I'll add NUMA awareness by adding alloc_pages_vma, and make a #20 release; that is the last relevant bit... Then we may want to teach smaps to show hugepages per process, instead of only the global count in /proc/meminfo.

The only tuning I might recommend to people benchmarking on top of current aa.git is to compare the workloads with:

echo always >/sys/kernel/mm/transparent_hugepage/defrag # default setting at boot
echo never >/sys/kernel/mm/transparent_hugepage/defrag

And also to speed up khugepaged by decreasing /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs (that works around vm_end not being extended in 2M chunks).

There's also one sysctl, /proc/sys/vm/extfrag_threshold, that allows tuning memory compaction aggressiveness, but I wouldn't twiddle with it; supposedly it'll go away, replaced by a future exponential-backoff logic that interleaves the try_to_compact_pages/try_to_free_pages calls optimally and more dynamically than the sysctl (discussion on linux-mm). But it's not a huge priority at the moment: it already works great like this, and it absolutely never becomes sluggish and is always responsive since I nuked lumpy reclaim. The half-jiffy average wait time is definitely not necessary, and it would be lost in the noise compared to the major problem we had in calling try_to_free_pages with order = 9 and __GFP_WAIT.

> In fact the whole maintenance thought process seems somewhat similar to the > TSO situation: the networking folks first rejected TSO based on complexity > arguments, but then was embraced after some time.

Full agreement! I think everyone wants transparent hugepages; the only complaint I've heard so far is from Christoph, who has a slight preference for not introducing split_huge_page and going fully huge everywhere, with native gup support immediately, where GUP only returns head pages and every caller has to check PageTransHuge on them to see whether a page is huge or not. Changing several hundred drivers in one go, with native swapping on hugepage-backed swapcache immediately (which means the pagecache also has to deal with hugepages immediately), is possible too, but I think this more gradual approach is easier to keep under control; Rome wasn't built in a day.
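As an aside, when twiddling the khugepaged knobs above it's handy to watch how fast the collapsed hugepage count reacts; a trivial watcher like the following (again only an illustrative sketch) just polls the two /proc/meminfo counters shown earlier:

/* thpwatch.c: print AnonPages/AnonHugePages from /proc/meminfo once
 * per second. Illustrative sketch only; AnonHugePages is the counter
 * exported by this patchset. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char line[256];

	for (;;) {
		FILE *f = fopen("/proc/meminfo", "r");

		if (!f) {
			perror("/proc/meminfo");
			return 1;
		}
		while (fgets(line, sizeof(line), f))
			if (!strncmp(line, "AnonPages:", 10) ||
			    !strncmp(line, "AnonHugePages:", 14))
				fputs(line, stdout);
		fclose(f);
		putchar('\n');
		sleep(1);
	}
}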
Later I surely also want tmpfs backed by hugepages, at least; and maybe pagecache, but that doesn't need to happen immediately. Also, keep in mind that on huge systems PAGE_SIZE should eventually become 2M, and those systems will be able to take advantage of transparent hugepages for the 1G pud_trans_huge, which will make HPC even faster. Anyway, nothing prevents taking Christoph's long-term direction later while starting self-contained. To me what is relevant is that everyone in the VM camp seems to want transparent hugepages in some shape or form, no matter the design, because of the roughly linear speedup they provide to everything running on them on bare metal (and a more-than-linear cumulative speedup with nested pagetables, for obvious reasons).

Thanks, Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id B73296B01F5 for ; Sat, 10 Apr 2010 15:03:34 -0400 (EDT) Date: Sat, 10 Apr 2010 21:02:33 +0200 From: Ingo Molnar Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100410190233.GA30882@elte.hu> References: <20100405232115.GM5825@random.random> <20100406011345.GT5825@random.random> <20100406090813.GA14098@elte.hu> <20100410184750.GJ5708@random.random> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100410184750.GJ5708@random.random> Sender: owner-linux-mm@kvack.org To: Andrea Arcangeli Cc: Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID:

* Andrea Arcangeli wrote: > [...] > > This is already fully usable and works great, and as Avi showed it boosts > even a sort on host by 6%, think about HPC applications, and soon I hope to > boost gcc on host by 6% (and of >15% in guest with NPT/EPT) by extending > vm_end in 2M chunks in glibc, at least for those huge gcc builds taking > >200M like translate.o of qemu-kvm... (so I hope soon gcc running on KVM > guest, thanks to EPT/NPT, will run faster than on mainline kernel without > transparent hugepages on bare metal).

I think what would be needed is some non-virtualization speedup example of a 'non-special' workload, running on the native/host kernel. 'sort' is an interesting usecase - could it be patched to use hugepages if it has to sort through lots of data?

Is it practical to run something like a plain make -jN kernel compile all in hugepages, and see a small but measurable speedup? Although it's not an ideal workload for computational speedups at all because a lot of the time we spend in a kernel build is really buildup/teardown of process state/context and similar 'administrative' overhead, while the true 'compilation work' is just a burst of a few dozen milliseconds and then we tear down all the state again. (It's very inefficient really.)

Something like GIMP calculations would be a lot more representative of the speedup potential.
Is it possible to run the GIMP with transparent hugepages enabled for it? Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with SMTP id 89E4C6B01F3 for ; Sat, 10 Apr 2010 15:23:39 -0400 (EDT) Message-ID: <4BC0CFF4.5000207@redhat.com> Date: Sat, 10 Apr 2010 22:22:28 +0300 From: Avi Kivity MIME-Version: 1.0 Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 References: <20100405232115.GM5825@random.random> <20100406011345.GT5825@random.random> <20100406090813.GA14098@elte.hu> <20100410184750.GJ5708@random.random> <20100410190233.GA30882@elte.hu> In-Reply-To: <20100410190233.GA30882@elte.hu> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Ingo Molnar Cc: Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On 04/10/2010 10:02 PM, Ingo Molnar wrote: > * Andrea Arcangeli wrote: > > >> [...] >> >> This is already fully usable and works great, and as Avi showed it boosts >> even a sort on host by 6%, think about HPC applications, and soon I hope to >> boost gcc on host by 6% (and of>15% in guest with NPT/EPT) by extending >> vm_end in 2M chunks in glibc, at least for those huge gcc builds taking >> >>> 200M like translate.o of qemu-kvm... (so I hope soon gcc running on KVM >>> >> guest, thanks to EPT/NPT, will run faster than on mainline kernel without >> transparent hugepages on bare metal). >> > I think what would be needed is some non-virtualization speedup example of a > 'non-special' workload, running on the native/host kernel. 'sort' is an > interesting usecase - could it be patched to use hugepages if it has to sort > through lots of data? > In fact it works well unpatched, the 6% I measured was with the system sort. Currently in order to use hugepages (with the 'always' option) the only requirement is that the application uses a few large vmas. > Is it practical to run something like a plain make -jN kernel compile all in > hugepages, and see a small but measurable speedup? > I doubt it - kernel builds run in relatively little memory. The link stage uses a lot of memory but is fairly fast (I guess due to the partial links before). Building a template-heavy C++ application might show some gains. > Although it's not an ideal workload for computational speedups at all because > a lot of the time we spend in a kernel build is really buildup/teardown of > process state/context and similar 'administrative' overhead, while the true > 'compilation work' is just a burst of a few dozen milliseconds and then we > tear down all the state again. (It's very inefficient really.) > > Something like GIMP calculations would be a lot more representative of the > speedup potential. Is it possible to run the GIMP with transparent hugepages > enabled for it? 
> I thought of it, but raster work is too regular so speculative execution should hide the tlb fill latency. It's also easy to code in a way which hides cache effects (no idea if it is actually coded that way). Sort showed a speedup since it defeats branch prediction and thus the processor cannot pipeline the loop. I thought ray tracers with large scenes should show a nice speedup, but setting this up is beyond my capabilities. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with ESMTP id 1FCE16B01E3 for ; Sat, 10 Apr 2010 15:48:15 -0400 (EDT) Date: Sat, 10 Apr 2010 21:47:51 +0200 From: Ingo Molnar Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100410194751.GA23751@elte.hu> References: <20100405232115.GM5825@random.random> <20100406011345.GT5825@random.random> <20100406090813.GA14098@elte.hu> <20100410184750.GJ5708@random.random> <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4BC0CFF4.5000207@redhat.com> Sender: owner-linux-mm@kvack.org To: Avi Kivity , Mike Galbraith , Jason Garrett-Glaser Cc: Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: * Avi Kivity wrote: > > I think what would be needed is some non-virtualization speedup example of > > a 'non-special' workload, running on the native/host kernel. 'sort' is an > > interesting usecase - could it be patched to use hugepages if it has to > > sort through lots of data? > > In fact it works well unpatched, the 6% I measured was with the system sort. Yes - but you intentionally sorted something large - the question is, how big is the slowdown with small sizes (if there's a slowdown), where is the break-even point (if any)? > > [...] > > > > Something like GIMP calculations would be a lot more representative of the > > speedup potential. Is it possible to run the GIMP with transparent > > hugepages enabled for it? > > I thought of it, but raster work is too regular so speculative execution > should hide the tlb fill latency. It's also easy to code in a way which > hides cache effects (no idea if it is actually coded that way). Sort showed > a speedup since it defeats branch prediction and thus the processor cannot > pipeline the loop. Would be nice to try because there's a lot of transformations within Gimp - and Gimp can be scripted. It's also a test for negatives: if there is an across-the-board _lack_ of speedups, it shows that it's not really general purpose but more specialistic. If the optimization is specialistic, then that's somewhat of an argument against automatic/transparent handling. 
(even though even if the beneficiaries turn out to be only special workloads then transparency still has advantages.)

> I thought ray tracers with large scenes should show a nice speedup, but > setting this up is beyond my capabilities.

Oh, this tickled some memories: x264 compressed encoding can be very cache and TLB intense. Something like the encoding of a 350 MB video file:

  wget http://media.xiph.org/video/derf/y4m/soccer_4cif.y4m       # NOTE: 350 MB!
  x264 --crf 20 --quiet soccer_4cif.y4m -o /dev/null --threads 4

would be another thing worth trying with transparent-hugetlb enabled.

(i've Cc:-ed x264 benchmarking experts - in case i missed something)

Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id 2515B6B01E3 for ; Sat, 10 Apr 2010 16:01:37 -0400 (EDT) Date: Sat, 10 Apr 2010 22:00:37 +0200 From: Andrea Arcangeli Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100410200037.GO5708@random.random> References: <20100405232115.GM5825@random.random> <20100406011345.GT5825@random.random> <20100406090813.GA14098@elte.hu> <20100410184750.GJ5708@random.random> <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100410194751.GA23751@elte.hu> Sender: owner-linux-mm@kvack.org To: Ingo Molnar Cc: Avi Kivity , Mike Galbraith , Jason Garrett-Glaser , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID:

On Sat, Apr 10, 2010 at 09:47:51PM +0200, Ingo Molnar wrote: > > * Avi Kivity wrote: > > > > I think what would be needed is some non-virtualization speedup example of > > > a 'non-special' workload, running on the native/host kernel. 'sort' is an > > > interesting usecase - could it be patched to use hugepages if it has to > > > sort through lots of data? > > > > In fact it works well unpatched, the 6% I measured was with the system sort. > > Yes - but you intentionally sorted something large - the question is, how big > is the slowdown with small sizes (if there's a slowdown), where is the > break-even point (if any)?

The only case where there's a chance of a slowdown is if try_to_compact_pages or try_to_free_pages takes longer and runs more frequently with order 9 allocations than try_to_free_pages would on an order 0 allocation. That is only a problem for short-lived, frequent allocations, in case memory compaction fails to provide some hugepage (as it'll run multiple times even if not needed, which is what the future exponential-backoff logic is about). This is why I recommended running any "real life DB" benchmark with transparent_hugepage/defrag set to both "always" and "never": "never" will practically make any slowdown impossible to measure.
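A trivial way to compare the two defrag settings is a first-touch microbenchmark along these lines (an illustrative sketch only; the 1G size is arbitrary), since the first write into each region is what goes through the order 9 allocation path discussed above:

/* faulttime.c: time first-touch faults on a large anonymous mapping.
 * Run once with transparent_hugepage/defrag set to "always" and once
 * with "never" to compare. Illustrative sketch; size is arbitrary. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define SIZE (1UL << 30) /* 1G */

int main(void)
{
	struct timespec t0, t1;
	char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	clock_gettime(CLOCK_MONOTONIC, &t0);
	memset(p, 1, SIZE); /* the page faults happen here */
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("faulted %lu MB in %.3f s\n", SIZE >> 20,
	       (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
	return 0;
}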
The only other case where there's a potential for a minor slowdown compared to 4k pages is COW: the 2M copy will trash the cache, and we need it to use non-temporal stores, but even that will be offset by the boost in TLB terms, saving memory accesses on the ptes. This is my reason for avoiding any optimistic prefault and for only going huge when we get the TLB benefit in return (not just the pagefault speedup; the pagefault speedup is a double-edged sword, it trashes more caches, so you need more than that for it to be worth it).

> Would be nice to try because there's a lot of transformations within Gimp - > and Gimp can be scripted. It's also a test for negatives: if there is an > across-the-board _lack_ of speedups, it shows that it's not really general > purpose but more specialistic. > > If the optimization is specialistic, then that's somewhat of an argument > against automatic/transparent handling. (even though even if the beneficiaries > turn out to be only special workloads then transparency still has advantages.) > > > I thought ray tracers with large scenes should show a nice speedup, but > > setting this up is beyond my capabilities. > > Oh, this tickled some memories: x264 compressed encoding can be very cache and > TLB intense. Something like the encoding of a 350 MB video file: > > wget http://media.xiph.org/video/derf/y4m/soccer_4cif.y4m # NOTE: 350 MB! > x264 --crf 20 --quiet soccer_4cif.y4m -o /dev/null --threads 4 > > would be another thing worth trying with transparent-hugetlb enabled. > > (i've Cc:-ed x264 benchmarking experts - in case i missed something)

It's definitely worth trying... nice idea. But we need glibc to increase vm_end in 2M-aligned chunks, otherwise we have to work around it in the kernel, for short-lived allocations like gcc's to take advantage of this. I managed to get 200M (of ~500M total) of gcc building translate.o into hugepages with two glibc params, but I want it all in transhuge before I measure it. I'm running this on the workstation that has a day and a half of uptime; it's still building more packages as I write this and running large vfs loads in /usr and maildir. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id E626D6B01E3 for ; Sat, 10 Apr 2010 16:12:05 -0400 (EDT) Date: Sat, 10 Apr 2010 22:10:57 +0200 From: Andrea Arcangeli Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100410201057.GP5708@random.random> References: <20100406011345.GT5825@random.random> <20100406090813.GA14098@elte.hu> <20100410184750.GJ5708@random.random> <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <20100410200037.GO5708@random.random> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100410200037.GO5708@random.random> Sender: owner-linux-mm@kvack.org To: Ingo Molnar Cc: Avi Kivity , Mike Galbraith , Jason Garrett-Glaser , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID:

On Sat, Apr 10, 2010 at 10:00:37PM +0200, Andrea Arcangeli wrote: > and we need it to use non temporal stores, but even that will be

To clarify: I mean using non-temporal stores only on the CPUs with <8M L2 caches; on some of the Xeons, preloading the cache instead may provide an even further boost to the child with hugepages, in addition to the longstanding benefits of hugetlb for long-lived allocations.

Furthermore, there is also an option (only available when DEBUG_VM is on, called transparent_hugepage/debug_cow) to COW with 4k copies (exactly like we have to do if COW fails to allocate a hugepage; it's the COW fallback) that already eliminates any chance of a slowdown in practice. But I don't recommend it at all: it may provide a minor speedup immediately after the COW with an L2 cache <4M, but then it slows down the child forever and gives up the more important longstanding benefits.

All of this is very much nitpicking at this point, but I just wanted to cover, for completeness, all the details I'm aware of on the subtopic you mentioned. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with SMTP id 80A616B01E3 for ; Sat, 10 Apr 2010 16:22:13 -0400 (EDT) Received: by pwi2 with SMTP id 2so3637772pwi.14 for ; Sat, 10 Apr 2010 13:22:12 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <20100410200037.GO5708@random.random> References: <20100405232115.GM5825@random.random> <20100406011345.GT5825@random.random> <20100406090813.GA14098@elte.hu> <20100410184750.GJ5708@random.random> <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <20100410200037.GO5708@random.random> From: Jason Garrett-Glaser Date: Sat, 10 Apr 2010 13:21:52 -0700 Message-ID: Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org To: Andrea Arcangeli Cc: Ingo Molnar , Avi Kivity , Mike Galbraith , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: >> (i've Cc:-ed x264 benchmarking experts - in case i missed something) > > It definitely worth trying... nice idea. But we need glibc to increase > vm_end in 2M aligned chunk, otherwise we've to workaround it in the > kernel, for short lived allocations like gcc to take advantage of > this. I managed to get 200M of gcc (of ~500M total) of translate.o > into hugepages with two glibc params, but I want it all in transhuge > before I measure it. I'm running it on the workstation that had 1 day > and half of uptime and it's still building more packages as I write > this and running large vfs loads in /usr and maildir. > Just an FYI on this--if you're testing x264, it performs _all_ memory allocation on init and never mallocs again, so it's a good testbed for something that uses a lot of memory but doesn't malloc/free a lot. Jason -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with SMTP id 413356B01E3 for ; Sat, 10 Apr 2010 16:25:34 -0400 (EDT) Message-ID: <4BC0DE84.3090305@redhat.com> Date: Sat, 10 Apr 2010 23:24:36 +0300 From: Avi Kivity MIME-Version: 1.0 Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 References: <20100405232115.GM5825@random.random> <20100406011345.GT5825@random.random> <20100406090813.GA14098@elte.hu> <20100410184750.GJ5708@random.random> <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> In-Reply-To: <20100410194751.GA23751@elte.hu> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Ingo Molnar Cc: Mike Galbraith , Jason Garrett-Glaser , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On 04/10/2010 10:47 PM, Ingo Molnar wrote: > * Avi Kivity wrote: > > >>> I think what would be needed is some non-virtualization speedup example of >>> a 'non-special' workload, running on the native/host kernel. 'sort' is an >>> interesting usecase - could it be patched to use hugepages if it has to >>> sort through lots of data? >>> >> In fact it works well unpatched, the 6% I measured was with the system sort. >> > Yes - but you intentionally sorted something large - the question is, how big > is the slowdown with small sizes (if there's a slowdown), where is the > break-even point (if any)? > There shouldn't be a slowdown as far as I can tell. The danger IMO is to pin down unused pages in a huge page and so increase memory pressure artificially. The point where this starts to win would be more or less when the page tables mapping the working set hit the size of the last-level cache, multiplied by some loading factor (guess: 0.5). So if you have a 4MB cache, the win should start at around 1GB working set. >>> Something like GIMP calculations would be a lot more representative of the >>> speedup potential. Is it possible to run the GIMP with transparent >>> hugepages enabled for it? >>> >> I thought of it, but raster work is too regular so speculative execution >> should hide the tlb fill latency. It's also easy to code in a way which >> hides cache effects (no idea if it is actually coded that way). Sort showed >> a speedup since it defeats branch prediction and thus the processor cannot >> pipeline the loop. >> > Would be nice to try because there's a lot of transformations within Gimp - > and Gimp can be scripted. It's also a test for negatives: if there is an > across-the-board _lack_ of speedups, it shows that it's not really general > purpose but more specialistic. > Right, but I don't think I can tell which transforms are likely to be sped up. Also, do people manipulate 500MB images regularly? A 20MB image won't see a significant improvement (40KB page tables, that's chickenfeed). > If the optimization is specialistic, then that's somewhat of an argument > against automatic/transparent handling. 
(even though even if the beneficiaries > turn out to be only special workloads then transparency still has advantages.) > Well, we know that databases, virtualization, and server-side java win from this. (Oracle won't benefit from this implementation since it wants shared, not anonymous, memory, but other databases may). I'm guessing large C++ compiles, and perhaps the new link-time optimization feature, will also see a nice speedup. Desktops will only benefit when they bloat to ~8GB RAM and 1-2GB firefox RSS, probably not so far in the future. >> I thought ray tracers with large scenes should show a nice speedup, but >> setting this up is beyond my capabilities. >> > Oh, this tickled some memories: x264 compressed encoding can be very cache and > TLB intense. Something like the encoding of a 350 MB video file: > > wget http://media.xiph.org/video/derf/y4m/soccer_4cif.y4m # NOTE: 350 MB! > x264 --crf 20 --quiet soccer_4cif.y4m -o /dev/null --threads 4 > > would be another thing worth trying with transparent-hugetlb enabled. > > I'll try it out. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id D2CC46B01E3 for ; Sat, 10 Apr 2010 16:44:01 -0400 (EDT) Message-ID: <4BC0E2C4.8090101@redhat.com> Date: Sat, 10 Apr 2010 23:42:44 +0300 From: Avi Kivity MIME-Version: 1.0 Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 References: <20100405232115.GM5825@random.random> <20100406011345.GT5825@random.random> <20100406090813.GA14098@elte.hu> <20100410184750.GJ5708@random.random> <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> In-Reply-To: <4BC0DE84.3090305@redhat.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Ingo Molnar Cc: Mike Galbraith , Jason Garrett-Glaser , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On 04/10/2010 11:24 PM, Avi Kivity wrote: >> Oh, this tickled some memories: x264 compressed encoding can be very >> cache and >> TLB intense. Something like the encoding of a 350 MB video file: >> >> wget http://media.xiph.org/video/derf/y4m/soccer_4cif.y4m # >> NOTE: 350 MB! >> x264 --crf 20 --quiet soccer_4cif.y4m -o /dev/null --threads 4 >> >> would be another thing worth trying with transparent-hugetlb enabled. >> > > I'll try it out. > 3-5% improvement. I had to tune khugepaged to scan more aggressively since the run is so short. The working set is only ~100MB here though. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with SMTP id 426006B01E3 for ; Sat, 10 Apr 2010 16:48:33 -0400 (EDT) Date: Sat, 10 Apr 2010 22:47:56 +0200 From: Andrea Arcangeli Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100410204756.GR5708@random.random> References: <20100406011345.GT5825@random.random> <20100406090813.GA14098@elte.hu> <20100410184750.GJ5708@random.random> <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <4BC0E2C4.8090101@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4BC0E2C4.8090101@redhat.com> Sender: owner-linux-mm@kvack.org To: Avi Kivity Cc: Ingo Molnar , Mike Galbraith , Jason Garrett-Glaser , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID:

On Sat, Apr 10, 2010 at 11:42:44PM +0300, Avi Kivity wrote: > 3-5% improvement. I had to tune khugepaged to scan more aggressively > since the run is so short. The working set is only ~100MB here though.

We need to either solve it with a kernel workaround or have an environment var to make glibc do the right thing... The best I got so far with gcc is with the two settings below: about half goes into hugepages with them, but that's not enough, as the mallocs invoked by libraries likely go into the heap, which is extended 1M at a time.

export MALLOC_MMAP_THRESHOLD_=$[1024*1024*1024]
export MALLOC_TOP_PAD_=$[1024*1024*1024]

Whatever we do, it has to be possible to disable it of course, with the malloc debug options or with electric fence, but it's not like the default 1M provides any benefit compared to growing the heap 2M-aligned ;) so it's quite an obvious thing to address in glibc in my view.

Then, if it takes too much RAM on small systems, echo madvise >/sys/kernel/mm/transparent_hugepage/enabled will retain the optimizations for the qemu guest physical address space and the other regions that are guaranteed not to waste memory; that is also a must-have on embedded systems, which have even smaller L2 caches and slower CPUs, where every optimization matters. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with SMTP id 715E66B01F1 for ; Sat, 10 Apr 2010 16:49:33 -0400 (EDT) Received: by pvg11 with SMTP id 11so2627022pvg.14 for ; Sat, 10 Apr 2010 13:49:32 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <4BC0E2C4.8090101@redhat.com> References: <20100406090813.GA14098@elte.hu> <20100410184750.GJ5708@random.random> <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <4BC0E2C4.8090101@redhat.com> From: Jason Garrett-Glaser Date: Sat, 10 Apr 2010 13:49:12 -0700 Message-ID: Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org To: Avi Kivity Cc: Ingo Molnar , Mike Galbraith , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID:

On Sat, Apr 10, 2010 at 1:42 PM, Avi Kivity wrote: > On 04/10/2010 11:24 PM, Avi Kivity wrote: >>> Oh, this tickled some memories: x264 compressed encoding can be very >>> cache and >>> TLB intense. Something like the encoding of a 350 MB video file: >>> >>> wget http://media.xiph.org/video/derf/y4m/soccer_4cif.y4m # NOTE: >>> 350 MB! >>> x264 --crf 20 --quiet soccer_4cif.y4m -o /dev/null --threads 4 >>> >>> would be another thing worth trying with transparent-hugetlb enabled. >>> >> >> I'll try it out. >> > > 3-5% improvement. I had to tune khugepaged to scan more aggressively since > the run is so short. The working set is only ~100MB here though.

I'd try some longer runs with larger datasets to do more testing.

Some things to try:

1) Pick a 1080p or even 2160p sequence from http://media.xiph.org/video/derf/

2) Use --preset ultrafast or similar to do a ridiculously memory-bandwidth-limited runthrough.

3) Use --preset veryslow or similar to do a very not-memory-limited runthrough.

Jason -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with SMTP id DCBD56B01E3 for ; Sat, 10 Apr 2010 16:55:05 -0400 (EDT) Message-ID: <4BC0E556.30304@redhat.com> Date: Sat, 10 Apr 2010 23:53:42 +0300 From: Avi Kivity MIME-Version: 1.0 Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 References: <20100406090813.GA14098@elte.hu> <20100410184750.GJ5708@random.random> <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <4BC0E2C4.8090101@redhat.com> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Jason Garrett-Glaser Cc: Ingo Molnar , Mike Galbraith , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID:

On 04/10/2010 11:49 PM, Jason Garrett-Glaser wrote: > >> 3-5% improvement. I had to tune khugepaged to scan more aggressively since >> the run is so short. The working set is only ~100MB here though. >> > I'd try some longer runs with larger datasets to do more testing. > > Some things to try: > > 1) Pick a 1080p or even 2160p sequence from http://media.xiph.org/video/derf/ >

Ok, I'm downloading crowd_run 2160p, but it will take a while.

> 2) Use --preset ultrafast or similar to do a ridiculously > memory-bandwidth-limited runthrough. >

Large pages improve random-access memory bandwidth but don't change sequential access. Which of these does --preset ultrafast change? -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id 260476B01E3 for ; Sat, 10 Apr 2010 16:58:44 -0400 (EDT) Received: by pzk30 with SMTP id 30so3906591pzk.12 for ; Sat, 10 Apr 2010 13:58:41 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <4BC0E556.30304@redhat.com> References: <20100406090813.GA14098@elte.hu> <20100410184750.GJ5708@random.random> <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <4BC0E2C4.8090101@redhat.com> <4BC0E556.30304@redhat.com> From: Jason Garrett-Glaser Date: Sat, 10 Apr 2010 13:58:21 -0700 Message-ID: Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org To: Avi Kivity Cc: Ingo Molnar , Mike Galbraith , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID:

On Sat, Apr 10, 2010 at 1:53 PM, Avi Kivity wrote: > On 04/10/2010 11:49 PM, Jason Garrett-Glaser wrote: >> >>> 3-5% improvement. I had to tune khugepaged to scan more aggressively >>> since >>> the run is so short. The working set is only ~100MB here though. >>> >> >> I'd try some longer runs with larger datasets to do more testing. >> >> Some things to try: >> >> 1) Pick a 1080p or even 2160p sequence from >> http://media.xiph.org/video/derf/ >> >> > > Ok, I'm downloading crowd_run 2160p, but it will take a while.

You can always cheat by synthesizing a fake sample like this:

ffmpeg -i input.y4m -s 3840x2160 output.y4m

Or something similar. Do be careful though; extremely fast presets combined with large input samples will be disk-bottlenecked, so make sure to keep it small enough to fit in disk cache and "prime" the cache before testing.

>> 2) Use --preset ultrafast or similar to do a ridiculously >> memory-bandwidth-limited runthrough. >> > > Large pages improve random-access memory bandwidth but don't change > sequential access. Which of these does --preset ultrafast change?

Hmm, I'm not quite sure. The process is strictly sequential, but there is clearly enough random access mixed in to cause some sort of change given your previous test. The main thing faster presets do is decrease the amount of "work" done at each step, resulting in roughly the same amount of memory bandwidth being required for each step--but in a much shorter period of time. Most "work" done at each step stays well within the L2 cache.

Jason -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id F41B26B01E3 for ; Sat, 10 Apr 2010 17:01:45 -0400 (EDT) Message-ID: <4BC0E6ED.7040100@redhat.com> Date: Sun, 11 Apr 2010 00:00:29 +0300 From: Avi Kivity MIME-Version: 1.0 Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 References: <20100406011345.GT5825@random.random> <20100406090813.GA14098@elte.hu> <20100410184750.GJ5708@random.random> <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <4BC0E2C4.8090101@redhat.com> <20100410204756.GR5708@random.random> In-Reply-To: <20100410204756.GR5708@random.random> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Andrea Arcangeli Cc: Ingo Molnar , Mike Galbraith , Jason Garrett-Glaser , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On 04/10/2010 11:47 PM, Andrea Arcangeli wrote: > On Sat, Apr 10, 2010 at 11:42:44PM +0300, Avi Kivity wrote: > >> 3-5% improvement. I had to tune khugepaged to scan more aggressively >> since the run is so short. The working set is only ~100MB here though. >> > We need to either solve it with a kernel workaround or have an > environment var for glibc to do the right thing... > > IMO, both. The kernel should align vmas on 2MB boundaries (good for small pages as well). glibc should use 2MB increments. Even on <2MB sized vmas, the kernel should reserve the large page frame for a while in the hope that the application will use it in a short while. > The best I got so far with gcc is with, about half goes in hugepages > with this but it's not enough as likely lib invoked mallocs goes into > heap and extended 1M at time. > There are also guard pages around stacks IIRC, we could make them 2MB on x86-64. > export MALLOC_MMAP_THRESHOLD_=$[1024*1024*1024] > export MALLOC_TOP_PAD_=$[1024*1024*1024] > > Whatever we do, it has to be possible to disable it of course with > malloc debug options, or with electric fence of course, but it's not > like the default 1M provides any benefit compared to growing it 2M > aligned ;) so it's quite an obvious thing to address in glibc in my > view. Well, but mapping a 2MB vma with a large page could be a considerable waste if the application doesn't eventually use it. I'd like to map the pages with small pages (belonging to a large frame) and if the application actually uses the pages, switch to a large pte. Something that can also improve small pages is to prefault the vma with small pages, but with the accessed and dirty bit cleared. Later, we check those bits and reclaim the pages if they're unused, or coalesce them if they were used. The nice thing is that we save tons of page faults in the common case where the pages are used. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id 0AC136B01E3 for ; Sat, 10 Apr 2010 17:48:31 -0400 (EDT) Date: Sat, 10 Apr 2010 23:47:26 +0200 From: Andrea Arcangeli Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100410214726.GS5708@random.random> References: <20100406090813.GA14098@elte.hu> <20100410184750.GJ5708@random.random> <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <4BC0E2C4.8090101@redhat.com> <20100410204756.GR5708@random.random> <4BC0E6ED.7040100@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4BC0E6ED.7040100@redhat.com> Sender: owner-linux-mm@kvack.org To: Avi Kivity Cc: Ingo Molnar , Mike Galbraith , Jason Garrett-Glaser , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID:

On Sun, Apr 11, 2010 at 12:00:29AM +0300, Avi Kivity wrote: > IMO, both. The kernel should align vmas on 2MB boundaries (good for > small pages as well). glibc should use 2MB increments. Even on <2MB

Agreed.

> sized vmas, the kernel should reserve the large page frame for a while > in the hope that the application will use it in a short while.

I don't see the need for this per-process; the buddy logic is already doing exactly that for us... (even without the movable/unmovable fallback logic)

> There are also guard pages around stacks IIRC, we could make them 2MB on > x86-64.

Agreed. That will provide little benefit though: stack usage is quite local near the top, and few apps store bulk data there (it's hard to reach even 512k of stack; Firefox has a 300k stack), so it'd waste >1M per process. If the stack grows and the application is long-lived, khugepaged takes care of it already. But personally I tend to like a black/white approach as much as possible, so I agree with making the vma large enough immediately if enabled = always.

> Well, but mapping a 2MB vma with a large page could be a considerable > waste if the application doesn't eventually use it. I'd like to map the > pages with small pages (belonging to a large frame) and if the > application actually uses the pages, switch to a large pte. > > Something that can also improve small pages is to prefault the vma with > small pages, but with the accessed and dirty bit cleared. Later, we > check those bits and reclaim the pages if they're unused, or coalesce > them if they were used. The nice thing is that we save tons of page > faults in the common case where the pages are used.

Yeah, we could do that. I'm not against it, but it's not my preference to do these things: anything that introduces the risk of performance regressions in corner cases frightens me, and I prefer to pay with RAM anytime. Again, I like to keep the design as black/white as possible: if somebody is RAM-constrained they shouldn't leave enabled=always but keep enabled=madvise.
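For reference, the opt-in under enabled=madvise is a single call on the region; a minimal sketch of what an application like qemu-kvm would do (illustration only; the hardcoded madvise value is just a guard for old userspace headers and is assumed to match the patchset's definition):

/* Opt one big anonymous region into transparent hugepages, as an
 * application would under enabled=madvise. Illustrative sketch only. */
#include <stdio.h>
#include <sys/mman.h>

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14 /* assumed to match the patchset's definition */
#endif

int main(void)
{
	size_t len = 256UL << 20; /* e.g. guest RAM or an encoder's buffers */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	if (madvise(p, len, MADV_HUGEPAGE)) /* fails on non-THP kernels */
		perror("madvise(MADV_HUGEPAGE)");
	/* ... use p: page faults and khugepaged can now go huge on it ... */
	return 0;
}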
That's the whole point of having added enabled = madvise: it's for whoever is RAM-constrained but wants to run faster anyway, with zero risk of wasted RAM. These days even desktop systems have more RAM than needed, so I don't see the big deal: we should squeeze every possible CPU cycle out of the RAM (even in the user stack, even if that's likely insignificant and just a RAM waste) and not waste CPU on prefaulting, or on migrating 4k pages to 2M pages when vm_end grows and then having to figure out which unmapped pages of a hugepage to reclaim after splitting it on the fly. I want to reduce to the minimum the risk of regressions anywhere when full transparency is enabled. This also has the benefit of keeping the kernel code simpler, with fewer special cases ;).

It may not be ideal if you have a 1G desktop system and you want to run faster when encoding a movie, but for that there's exactly madvise(MADV_HUGEPAGE): qemu-kvm/transcode/ffmpeg can all use a little madvise on their big chunks of memory. khugepaged should also learn to prioritize those VM_HUGEPAGE vmas before scanning the rest (which it doesn't right now, to keep it a bit simpler, but obviously there's room for improvement).

Anyway, I think we can start by aligning to 2M the vmas that don't pad themselves against the previous vma, and by having the stack aligned too, so the page faults will fill them automatically. Changing glibc to grow the heap by 2M instead of 1M is a one-liner change to a #define, and it'll also halve the number of mmap syscalls, so it's quite a straightforward next step. I also need to make it NUMA-aware with an alloc_pages_vma. Both are simple enough that I can do them right now without worries. Then we can rethink making the kernel more complex. I don't mean it's a bad idea, just less obvious than paying with RAM and staying simple... I want to be sure this is rock solid before we go ahead with more complex stuff. There have been zero problems so far (backing out the anon-vma changes solved the only bug that triggered without memory compaction, which showed a skew between the pmd_huge mappings and page_mapcount because of anon-vma errors; and memory compaction also works great now with the last integration fix ;). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with SMTP id 249296B01E3 for ; Sat, 10 Apr 2010 21:06:43 -0400 (EDT) Date: Sun, 11 Apr 2010 03:05:40 +0200 From: Andrea Arcangeli Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100411010540.GW5708@random.random> References: <20100406090813.GA14098@elte.hu> <20100410184750.GJ5708@random.random> <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <4BC0E2C4.8090101@redhat.com> <20100410204756.GR5708@random.random> <4BC0E6ED.7040100@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4BC0E6ED.7040100@redhat.com> Sender: owner-linux-mm@kvack.org To: Avi Kivity Cc: Ingo Molnar , Mike Galbraith , Jason Garrett-Glaser , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID:

> > export MALLOC_MMAP_THRESHOLD_=$[1024*1024*1024] > > export MALLOC_TOP_PAD_=$[1024*1024*1024]

With the above two params I get around 200M (around half) in hugepages with gcc building translate.o:

$ rm translate.o ; time make translate.o
  CC    translate.o
real    0m22.900s
user    0m22.601s
sys     0m0.260s
$ rm translate.o ; time make translate.o
  CC    translate.o
real    0m22.405s
user    0m22.125s
sys     0m0.240s
# echo never > /sys/kernel/mm/transparent_hugepage/enabled
# exit
$ rm translate.o ; time make translate.o
  CC    translate.o
real    0m24.128s
user    0m23.725s
sys     0m0.376s
$ rm translate.o ; time make translate.o
  CC    translate.o
real    0m24.126s
user    0m23.725s
sys     0m0.376s
$ uptime
 02:36:07 up 1 day, 19:45,  5 users,  load average: 0.01, 0.12, 0.08

1 second out of 24 means around 4% faster; hopefully, when glibc fully cooperates, we'll get better results than the above with gcc... I tried to emulate that by running khugepaged in a loop, and this way I get almost the whole of gcc's anon memory into hugepages (as expected):

# echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
# exit
$ rm translate.o ; time make translate.o
  CC    translate.o
real    0m21.950s
user    0m21.481s
sys     0m0.292s
$ rm translate.o ; time make translate.o
  CC    translate.o
real    0m21.992s
user    0m21.529s
sys     0m0.288s

So this reproducibly takes more than 2 seconds off the 24, which means gcc now runs 8% faster. It requires running khugepaged at 100% of one of the four cores, but with a slight change to glibc we'll be able to reach the exact same 8% speedup without that (or more, because the khugepaged way also involves copying ~200M and sending IPIs to unmap pages and stop userland during the memory copy, all of which won't be necessary anymore).

BTW, the current default for khugepaged is to scan 8 pmds every 10 seconds, which means collapsing at most 16M every 10 seconds. Checking 8 pmd pointers every 10 seconds, with 6 wakeups per minute for a kernel thread, is absolutely unmeasurable; despite the unmeasurable overhead it provides a very nice behavior for long-lived allocations that may have been swapped back in fragmented.
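The same two thresholds can also be raised from inside the process with glibc's standard mallopt interface, for programs that want this behavior without relying on the environment (illustrative sketch; the values match the exports quoted above):

/* Programmatic equivalent of MALLOC_MMAP_THRESHOLD_/MALLOC_TOP_PAD_,
 * so the heap grows in big chunks instead of 1M at a time.
 * Illustrative sketch using glibc's standard mallopt interface. */
#include <malloc.h>
#include <stdio.h>

int main(void)
{
	/* mallopt returns 0 on failure, nonzero on success */
	if (!mallopt(M_MMAP_THRESHOLD, 1024 * 1024 * 1024) ||
	    !mallopt(M_TOP_PAD, 1024 * 1024 * 1024))
		fprintf(stderr, "mallopt failed\n");
	/* ... run the allocation-heavy workload here ... */
	return 0;
}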
This is on a Phenom X4; I'd be interested if somebody could try it on other CPUs. To get the environment of the test, just:

git clone git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git
cd qemu-kvm
make
cd x86_64-softmmu
export MALLOC_MMAP_THRESHOLD_=$[1024*1024*1024]
export MALLOC_TOP_PAD_=$[1024*1024*1024]
rm translate.o; time make translate.o

Then you need to flip the above sysfs controls as I did.

Thanks, Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with SMTP id C16EF6B01E3 for ; Sun, 11 Apr 2010 05:30:22 -0400 (EDT) Message-ID: <4BC19663.8080001@redhat.com> Date: Sun, 11 Apr 2010 12:29:07 +0300 From: Avi Kivity MIME-Version: 1.0 Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 References: <20100406090813.GA14098@elte.hu> <20100410184750.GJ5708@random.random> <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <4BC0E2C4.8090101@redhat.com> <4BC0E556.30304@redhat.com> In-Reply-To: <4BC0E556.30304@redhat.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Jason Garrett-Glaser Cc: Ingo Molnar , Mike Galbraith , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID:

On 04/10/2010 11:53 PM, Avi Kivity wrote: > On 04/10/2010 11:49 PM, Jason Garrett-Glaser wrote: >> >>> 3-5% improvement. I had to tune khugepaged to scan more aggressively since >>> the run is so short. The working set is only ~100MB here though. >> I'd try some longer runs with larger datasets to do more testing. >> >> Some things to try: >> >> 1) Pick a 1080p or even 2160p sequence from >> http://media.xiph.org/video/derf/ >> > > Ok, I'm downloading crowd_run 2160p, but it will take a while. >

# time x264 --crf 20 --quiet crowd_run_2160p.y4m -o /dev/null --threads 2
yuv4mpeg: 3840x2160@50/1fps, 1:1
encoded 500 frames, 0.68 fps, 251812.80 kb/s

real    12m17.154s
user    20m39.151s
sys     0m11.727s

# echo never > /sys/kernel/mm/transparent_hugepage/enabled
# echo never > /sys/kernel/mm/transparent_hugepage/khugepaged/enabled
# time x264 --crf 20 --quiet crowd_run_2160p.y4m -o /dev/null --threads 2
yuv4mpeg: 3840x2160@50/1fps, 1:1
encoded 500 frames, 0.66 fps, 251812.80 kb/s

real    12m37.962s
user    21m13.506s
sys     0m11.696s

Just 2.7%, even though the working set was much larger. -- error compiling committee.c: too many arguments to function -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with SMTP id 5B7326B01E3 for ; Sun, 11 Apr 2010 05:37:49 -0400 (EDT)
Received: by pwi2 with SMTP id 2so3788176pwi.14 for ; Sun, 11 Apr 2010 02:37:47 -0700 (PDT)
MIME-Version: 1.0
In-Reply-To: <4BC19663.8080001@redhat.com>
References: <20100410184750.GJ5708@random.random> <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <4BC0E2C4.8090101@redhat.com> <4BC0E556.30304@redhat.com> <4BC19663.8080001@redhat.com>
From: Jason Garrett-Glaser
Date: Sun, 11 Apr 2010 02:37:27 -0700
Message-ID:
Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17
Content-Type: text/plain; charset=ISO-8859-1
Sender: owner-linux-mm@kvack.org
To: Avi Kivity
Cc: Ingo Molnar , Mike Galbraith , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura
List-ID:

On Sun, Apr 11, 2010 at 2:29 AM, Avi Kivity wrote:
> On 04/10/2010 11:53 PM, Avi Kivity wrote:
>> On 04/10/2010 11:49 PM, Jason Garrett-Glaser wrote:
>>>
>>>> 3-5% improvement. I had to tune khugepaged to scan more aggressively
>>>> since the run is so short. The working set is only ~100MB here though.
>>>
>>> I'd try some longer runs with larger datasets to do more testing.
>>>
>>> Some things to try:
>>>
>>> 1) Pick a 1080p or even 2160p sequence from
>>> http://media.xiph.org/video/derf/
>>>
>>
>> Ok, I'm downloading crown_run 2160p, but it will take a while.
>>
>
> # time x264 --crf 20 --quiet crowd_run_2160p.y4m -o /dev/null --threads 2
> yuv4mpeg: 3840x2160@50/1fps, 1:1
>
> encoded 500 frames, 0.68 fps, 251812.80 kb/s
>
> real    12m17.154s
> user    20m39.151s
> sys     0m11.727s
>
> # echo never > /sys/kernel/mm/transparent_hugepage/enabled
> # echo never > /sys/kernel/mm/transparent_hugepage/khugepaged/enabled
> # time x264 --crf 20 --quiet crowd_run_2160p.y4m -o /dev/null --threads 2
> yuv4mpeg: 3840x2160@50/1fps, 1:1
>
> encoded 500 frames, 0.66 fps, 251812.80 kb/s
>
> real    12m37.962s
> user    21m13.506s
> sys     0m11.696s
>
> Just 2.7%, even though the working set was much larger.

Did you make sure to check your stddev on those?

I'm also curious how it compares for --preset ultrafast and so forth.

Jason

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with SMTP id A77606B01EF for ; Sun, 11 Apr 2010 05:41:43 -0400 (EDT) Message-ID: <4BC19916.20100@redhat.com> Date: Sun, 11 Apr 2010 12:40:38 +0300 From: Avi Kivity MIME-Version: 1.0 Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 References: <20100410184750.GJ5708@random.random> <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <4BC0E2C4.8090101@redhat.com> <4BC0E556.30304@redhat.com> <4BC19663.8080001@redhat.com> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Jason Garrett-Glaser Cc: Ingo Molnar , Mike Galbraith , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On 04/11/2010 12:37 PM, Jason Garrett-Glaser wrote: > >> # time x264 --crf 20 --quiet crowd_run_2160p.y4m -o /dev/null --threads 2 >> yuv4mpeg: 3840x2160@50/1fps, 1:1 >> >> encoded 500 frames, 0.68 fps, 251812.80 kb/s >> >> real 12m17.154s >> user 20m39.151s >> sys 0m11.727s >> >> # echo never> /sys/kernel/mm/transparent_hugepage/enabled >> # echo never> /sys/kernel/mm/transparent_hugepage/khugepaged/enabled >> # time x264 --crf 20 --quiet crowd_run_2160p.y4m -o /dev/null --threads 2 >> yuv4mpeg: 3840x2160@50/1fps, 1:1 >> >> encoded 500 frames, 0.66 fps, 251812.80 kb/s >> >> real 12m37.962s >> user 21m13.506s >> sys 0m11.696s >> >> Just 2.7%, even though the working set was much larger. >> > Did you make sure to check your stddev on those? > I'm doing another run to look at variability. > I'm also curious how it compares for --preset ultrafast and so forth. > Is this something realistic or just a benchmark thing? -- error compiling committee.c: too many arguments to function -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
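For the record, the stddev question can be answered with stock tools. A sketch, assuming GNU time is installed at /usr/bin/time (so that -f '%e' works); the x264 command line is the one used above, and everything else is plain sh and awk:

  #!/bin/sh
  # Repeat the encode five times; awk reports mean and standard deviation
  # of the elapsed wall-clock seconds. GNU time prints its %e line after
  # the child exits, so it is the last line on the merged output.
  for i in 1 2 3 4 5; do
      /usr/bin/time -f '%e' x264 --crf 20 --quiet crowd_run_2160p.y4m \
          -o /dev/null --threads 2 2>&1 | tail -n 1
  done | awk '{ n++; s += $1; ss += $1 * $1 }
      END { m = s / n; printf "mean %.2fs  stddev %.2fs\n", m, sqrt(ss / n - m * m) }'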
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with SMTP id 630656B01E3 for ; Sun, 11 Apr 2010 06:22:46 -0400 (EDT)
Received: by pwi2 with SMTP id 2so3796289pwi.14 for ; Sun, 11 Apr 2010 03:22:44 -0700 (PDT)
MIME-Version: 1.0
In-Reply-To: <4BC19916.20100@redhat.com>
References: <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <4BC0E2C4.8090101@redhat.com> <4BC0E556.30304@redhat.com> <4BC19663.8080001@redhat.com> <4BC19916.20100@redhat.com>
From: Jason Garrett-Glaser
Date: Sun, 11 Apr 2010 03:22:24 -0700
Message-ID:
Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17
Content-Type: text/plain; charset=ISO-8859-1
Sender: owner-linux-mm@kvack.org
To: Avi Kivity
Cc: Ingo Molnar , Mike Galbraith , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura
List-ID:

On Sun, Apr 11, 2010 at 2:40 AM, Avi Kivity wrote:
> On 04/11/2010 12:37 PM, Jason Garrett-Glaser wrote:
>>
>>> # time x264 --crf 20 --quiet crowd_run_2160p.y4m -o /dev/null --threads 2
>>> yuv4mpeg: 3840x2160@50/1fps, 1:1
>>>
>>> encoded 500 frames, 0.68 fps, 251812.80 kb/s
>>>
>>> real    12m17.154s
>>> user    20m39.151s
>>> sys     0m11.727s
>>>
>>> # echo never > /sys/kernel/mm/transparent_hugepage/enabled
>>> # echo never > /sys/kernel/mm/transparent_hugepage/khugepaged/enabled
>>> # time x264 --crf 20 --quiet crowd_run_2160p.y4m -o /dev/null --threads 2
>>> yuv4mpeg: 3840x2160@50/1fps, 1:1
>>>
>>> encoded 500 frames, 0.66 fps, 251812.80 kb/s
>>>
>>> real    12m37.962s
>>> user    21m13.506s
>>> sys     0m11.696s
>>>
>>> Just 2.7%, even though the working set was much larger.
>>
>> Did you make sure to check your stddev on those?
>
> I'm doing another run to look at variability.
>
>> I'm also curious how it compares for --preset ultrafast and so forth.
>
> Is this something realistic or just a benchmark thing?

Well, at 2160p, we're already a bit beyond the bounds of ordinary applications. Ultrafast is generally an "unrealistically fast" setting, getting stupid performance levels like 200fps 1080p encoding (at the cost of incredibly bad compression). "veryfast" is probably a more realistic test case (I know many companies using similar levels of performance).

Jason

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
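A preset sweep along the lines Jason suggests could look like the following. A sketch only: the preset names are x264's own, the input file is the one from the thread, and the loop has to be run once per transparent_hugepage setting to make the comparison.

  #!/bin/sh
  # Walk part of the preset ladder with the same input; rerun after
  # flipping /sys/kernel/mm/transparent_hugepage/enabled to compare.
  for preset in ultrafast veryfast medium slow; do
      echo "== --preset $preset =="
      time x264 --preset "$preset" --quiet crowd_run_2160p.y4m \
          -o /dev/null --threads 2
  done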
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with ESMTP id B01EE6B01E3 for ; Sun, 11 Apr 2010 06:47:01 -0400 (EDT) Date: Sun, 11 Apr 2010 12:46:08 +0200 From: Ingo Molnar Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100411104608.GA12828@elte.hu> References: <20100406011345.GT5825@random.random> <20100406090813.GA14098@elte.hu> <20100410184750.GJ5708@random.random> <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4BC0DE84.3090305@redhat.com> Sender: owner-linux-mm@kvack.org To: Avi Kivity Cc: Mike Galbraith , Jason Garrett-Glaser , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: * Avi Kivity wrote: > On 04/10/2010 10:47 PM, Ingo Molnar wrote: > >* Avi Kivity wrote: > > > >>>I think what would be needed is some non-virtualization speedup example of > >>>a 'non-special' workload, running on the native/host kernel. 'sort' is an > >>>interesting usecase - could it be patched to use hugepages if it has to > >>>sort through lots of data? > >>In fact it works well unpatched, the 6% I measured was with the system sort. > >Yes - but you intentionally sorted something large - the question is, how big > >is the slowdown with small sizes (if there's a slowdown), where is the > >break-even point (if any)? > > There shouldn't be a slowdown as far as I can tell. [...] It does not hurt to double check the before/after micro-cost precisely - it would be nice to see a result of: perf stat -e instructions --repeat 100 sort /etc/passwd > /dev/null with and without hugetlb. Linus is right in that the patches are intrusive, and the answer to that isnt to insist that it isnt so (it evidently is so), the correct reply is to broaden the utility of the patches and to demonstrate that the feature is useful on a much wider spectrum of workloads. > > Would be nice to try because there's a lot of transformations within Gimp > > - and Gimp can be scripted. It's also a test for negatives: if there is an > > across-the-board _lack_ of speedups, it shows that it's not really general > > purpose but more specialistic. > > Right, but I don't think I can tell which transforms are likely to be sped > up. Also, do people manipulate 500MB images regularly? > > A 20MB image won't see a significant improvement (40KB page tables, that's > chickenfeed). > > If the optimization is specialistic, then that's somewhat of an argument > > against automatic/transparent handling. (even though even if the > > beneficiaries turn out to be only special workloads then transparency > > still has advantages.) > > Well, we know that databases, virtualization, and server-side java win from > this. (Oracle won't benefit from this implementation since it wants shared, > not anonymous, memory, but other databases may). 
I'm guessing large C++ > compiles, and perhaps the new link-time optimization feature, will also see > a nice speedup. > > Desktops will only benefit when they bloat to ~8GB RAM and 1-2GB firefox > RSS, probably not so far in the future. 1-2GB firefox RSS is reality for me. Btw., there's another workload that could be cache sensitive, 'git grep': aldebaran:~/linux> perf stat -e cycles -e instructions -e dtlb-loads -e dtlb-load-misses --repeat 5 git grep arca >/dev/null Performance counter stats for 'git grep arca' (5 runs): 1882712774 cycles ( +- 0.074% ) 1153649442 instructions # 0.613 IPC ( +- 0.005% ) 518815167 dTLB-loads ( +- 0.035% ) 3028951 dTLB-load-misses ( +- 1.223% ) 0.597161428 seconds time elapsed ( +- 0.065% ) At first sight, with 7 cycles per cold TLB there's about 1.12% of a speedup potential in that workload. With just 1 cycle it's 0.16%. The real speedup ought to be somewhere inbetween. Btw., instead of throwing random numbers like '3-4%' into this thread it would be nice if you could send 'perf stat --repeat' numbers like i did above - they have an error bar, they show the TLB details, they show the cycles and instructions proportion and they are also far more precise than 'time' based results. Thanks, Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id 4A8EA6B01EF for ; Sun, 11 Apr 2010 06:49:48 -0400 (EDT) Date: Sun, 11 Apr 2010 12:49:00 +0200 From: Ingo Molnar Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100411104900.GA5632@elte.hu> References: <20100406011345.GT5825@random.random> <20100406090813.GA14098@elte.hu> <20100410184750.GJ5708@random.random> <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <20100411104608.GA12828@elte.hu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100411104608.GA12828@elte.hu> Sender: owner-linux-mm@kvack.org To: Avi Kivity Cc: Mike Galbraith , Jason Garrett-Glaser , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: * Ingo Molnar wrote: > > Desktops will only benefit when they bloat to ~8GB RAM and 1-2GB firefox > > RSS, probably not so far in the future. > > 1-2GB firefox RSS is reality for me. 
> > Btw., there's another workload that could be cache sensitive, 'git grep': > > aldebaran:~/linux> perf stat -e cycles -e instructions -e dtlb-loads -e dtlb-load-misses --repeat 5 git grep arca >/dev/null > > Performance counter stats for 'git grep arca' (5 runs): > > 1882712774 cycles ( +- 0.074% ) > 1153649442 instructions # 0.613 IPC ( +- 0.005% ) > 518815167 dTLB-loads ( +- 0.035% ) > 3028951 dTLB-load-misses ( +- 1.223% ) > > 0.597161428 seconds time elapsed ( +- 0.065% ) Sidenote: you might want to try the cool new threaded git grep from upstream Git project: git clone git://git.kernel.org/pub/scm/git/git.git cd git make -j Beyond being faster, it will also probably show a bigger hugetlb speedup, as the effective per core (and per hyperthread) cache set is smaller than for a single-threaded git grep. Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id 83DB36B01E3 for ; Sun, 11 Apr 2010 07:00:36 -0400 (EDT) Date: Sun, 11 Apr 2010 13:00:15 +0200 From: Ingo Molnar Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100411110015.GA10149@elte.hu> References: <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <4BC0E2C4.8090101@redhat.com> <4BC0E556.30304@redhat.com> <4BC19663.8080001@redhat.com> <4BC19916.20100@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4BC19916.20100@redhat.com> Sender: owner-linux-mm@kvack.org To: Avi Kivity Cc: Jason Garrett-Glaser , Mike Galbraith , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: * Avi Kivity wrote: > On 04/11/2010 12:37 PM, Jason Garrett-Glaser wrote: > > > >># time x264 --crf 20 --quiet crowd_run_2160p.y4m -o /dev/null --threads 2 > >>yuv4mpeg: 3840x2160@50/1fps, 1:1 > >> > >>encoded 500 frames, 0.68 fps, 251812.80 kb/s > >> > >>real 12m17.154s > >>user 20m39.151s > >>sys 0m11.727s > >> > >># echo never> /sys/kernel/mm/transparent_hugepage/enabled > >># echo never> /sys/kernel/mm/transparent_hugepage/khugepaged/enabled > >># time x264 --crf 20 --quiet crowd_run_2160p.y4m -o /dev/null --threads 2 > >>yuv4mpeg: 3840x2160@50/1fps, 1:1 > >> > >>encoded 500 frames, 0.66 fps, 251812.80 kb/s > >> > >>real 12m37.962s > >>user 21m13.506s > >>sys 0m11.696s > >> > >>Just 2.7%, even though the working set was much larger. > >Did you make sure to check your stddev on those? > > I'm doing another run to look at variability. Sigh. Could you please stop using stone-age tools like /usr/bin/time and instead use: perf stat --repeat 3 x264 ... 
you can install it via: cd linux cd tools/perf/ make -j install That way you will see 'variability' (sttdev/error bars/fuzz), and a whole lot of other CPU details beyond much more precise measurements: $ perf stat --repeat 3 x264 --crf 20 --quiet soccer_4cif.y4m -o /dev/null --threads 2 yuv4mpeg: 704x576@60/1fps, 128:117 encoded 2 frames, 23.47 fps, 39824.64 kb/s yuv4mpeg: 704x576@60/1fps, 128:117 encoded 2 frames, 23.52 fps, 39824.64 kb/s yuv4mpeg: 704x576@60/1fps, 128:117 encoded 2 frames, 23.45 fps, 39824.64 kb/s Performance counter stats for 'x264 --crf 20 --quiet soccer_4cif.y4m -o /dev/null --threads 2' (3 runs): 130.624286 task-clock-msecs # 1.496 CPUs ( +- 0.081% ) 74 context-switches # 0.001 M/sec ( +- 7.151% ) 3 CPU-migrations # 0.000 M/sec ( +- 25.000% ) 2987 page-faults # 0.023 M/sec ( +- 0.162% ) 389234822 cycles # 2979.804 M/sec ( +- 0.081% ) 481360693 instructions # 1.237 IPC ( +- 0.036% ) 4206296 cache-references # 32.201 M/sec ( +- 0.387% ) 55732 cache-misses # 0.427 M/sec ( +- 0.529% ) 0.087336553 seconds time elapsed ( +- 0.100% ) Note that perf stat will run fine on older [pre-2.6.31] kernels too (it will measure elapsed time) and even there it will be much more precise than /usr/bin/time. For more dTLB details, use something like: perf stat -e cycles -e instructions -e dtlb-loads -e dtlb-load-misses --repeat 3 x264 ... Yes, i know we had a big flamewar about perf kvm, but IMHO that is no reason for you to pretend that this tool doesnt exist ;-) > > I'm also curious how it compares for --preset ultrafast and so forth. > > Is this something realistic or just a benchmark thing? I'd suggest for you to use the default settings, to make it realistic. (Maybe also 'advanced/high-quality' settings that an advanced user would utilize.) It is no doubt that benchmark advantages can be shown - the point of this exercise is to show that there are real-life speedups to various categories of non-server apps that hugetlb gives us. Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with SMTP id A145C6B01E3 for ; Sun, 11 Apr 2010 07:20:04 -0400 (EDT) Message-ID: <4BC1B034.4050302@redhat.com> Date: Sun, 11 Apr 2010 14:19:16 +0300 From: Avi Kivity MIME-Version: 1.0 Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 References: <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <4BC0E2C4.8090101@redhat.com> <4BC0E556.30304@redhat.com> <4BC19663.8080001@redhat.com> <4BC19916.20100@redhat.com> <20100411110015.GA10149@elte.hu> In-Reply-To: <20100411110015.GA10149@elte.hu> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Ingo Molnar Cc: Jason Garrett-Glaser , Mike Galbraith , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. 
Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On 04/11/2010 02:00 PM, Ingo Molnar wrote: >>> >>> Did you make sure to check your stddev on those? >>> >> I'm doing another run to look at variability. >> > Sigh. Could you please stop using stone-age tools like /usr/bin/time and > instead use: > I did one more run for each setting and got the same results (within a second). > Yes, i know we had a big flamewar about perf kvm, but IMHO that is no reason > for you to pretend that this tool doesnt exist ;-) > I use it almost daily, not sure why you think I pretend it doesn't exist. >> Is this something realistic or just a benchmark thing? >> > I'd suggest for you to use the default settings, to make it realistic. (Maybe > also 'advanced/high-quality' settings that an advanced user would utilize.) > In fact I'm guessing --ultrafast would reduce the gain. The lower the quality, the less time you spend looking at other frames to find commonality. Like bzip2 -1/-9 memory footprint. > It is no doubt that benchmark advantages can be shown - the point of this > exercise is to show that there are real-life speedups to various categories of > non-server apps that hugetlb gives us. > I think hugetlb will mostly help server apps. Desktop apps simply don't have working sets big enough to matter. There will be exceptions, but as a rule, desktop apps won't benefit much from this. -- error compiling committee.c: too many arguments to function -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with ESMTP id 294AF6B01E3 for ; Sun, 11 Apr 2010 07:24:43 -0400 (EDT) Date: Sun, 11 Apr 2010 13:24:24 +0200 From: Ingo Molnar Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100411112424.GA10952@elte.hu> References: <20100406090813.GA14098@elte.hu> <20100410184750.GJ5708@random.random> <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <4BC0E2C4.8090101@redhat.com> <20100410204756.GR5708@random.random> <4BC0E6ED.7040100@redhat.com> <20100411010540.GW5708@random.random> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100411010540.GW5708@random.random> Sender: owner-linux-mm@kvack.org To: Andrea Arcangeli Cc: Avi Kivity , Mike Galbraith , Jason Garrett-Glaser , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: * Andrea Arcangeli wrote: > So this takes more than 2 seconds away from 24 seconds reproducibly, and it > means gcc now runs 8% faster. [...] That's fantastic if systematic ... i'd give a limb for faster kbuild times in the >2% range. Would be nice to see a precise before/after 'perf stat' comparison: perf stat -e cycles -e instructions -e dtlb-loads -e dtlb-load-misses --repeat 3 ... 
that way we can see that the instruction count is roughly the same before/after, the cycle count goes down and we can also see the reduction in dTLB misses (and other advantages, if any). Plus, here's a hugetlb usability feature request if you dont mind me suggesting it. This current usage (as root): echo never > /sys/kernel/mm/transparent_hugepage/enabled is fine for testing but it would be also nice to give finegrained per workload tunability to such details. It would be _very_ nice to have app-inheritable hugetlb attributes plus have a 'hugetlb' tool in tools/hugetlb/, which would allow the per workload tuning of hugetlb uses. For example: hugetlb ctl --never ./my-workload.sh would disable hugetlb usage in my-workload.sh (and all sub-processes). Running: hugetlb ctl --always ./my-workload.sh would enable it. [or something like that - maybe there are better naming schemes] Other commands: hugetlb stat would show current allocation stats, etc. Currently you have the 'hugetlbctl' app but IMO it limits the useful command space to 'control' ops only - it would be _much_ better to use the Git model: to name the tool in a much more generic way ('hugetlb' - the project name), and then let sub-commands be added like Git (and perf ;-) does. Git has more than 70 subcommands currently, trend growing. That command model scales and works well for smaller projects like perf (or hugetlb) as well. Anyway, was just a suggestion. Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id 212716B01E3 for ; Sun, 11 Apr 2010 07:30:43 -0400 (EDT) Received: by pzk28 with SMTP id 28so4086810pzk.11 for ; Sun, 11 Apr 2010 04:30:40 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <4BC1B034.4050302@redhat.com> References: <20100410190233.GA30882@elte.hu> <4BC0DE84.3090305@redhat.com> <4BC0E2C4.8090101@redhat.com> <4BC0E556.30304@redhat.com> <4BC19663.8080001@redhat.com> <4BC19916.20100@redhat.com> <20100411110015.GA10149@elte.hu> <4BC1B034.4050302@redhat.com> From: Jason Garrett-Glaser Date: Sun, 11 Apr 2010 04:30:20 -0700 Message-ID: Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org To: Avi Kivity Cc: Ingo Molnar , Mike Galbraith , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Sun, Apr 11, 2010 at 4:19 AM, Avi Kivity wrote: > On 04/11/2010 02:00 PM, Ingo Molnar wrote: >>>> >>>> Did you make sure to check your stddev on those? >>>> >>> >>> I'm doing another run to look at variability. >>> >> >> Sigh. Could you please stop using stone-age tools like /usr/bin/time and >> instead use: >> > > I did one more run for each setting and got the same results (within a > second). 
> >> Yes, i know we had a big flamewar about perf kvm, but IMHO that is no
> >> reason for you to pretend that this tool doesnt exist ;-)
>
> I use it almost daily, not sure why you think I pretend it doesn't exist.
>
> >>> Is this something realistic or just a benchmark thing?
> >>
> >> I'd suggest for you to use the default settings, to make it realistic.
> >> (Maybe also 'advanced/high-quality' settings that an advanced user
> >> would utilize.)
>
> In fact I'm guessing --ultrafast would reduce the gain. The lower the
> quality, the less time you spend looking at other frames to find
> commonality. Like bzip2 -1/-9 memory footprint.

The main thing that controls how much obnoxious fetching of past frames you're doing is --ref. This is 3 by default, 1 at all the faster settings, and goes as high as 16 on the very slow ones. Do also note that at very slow settings, the lookahead eats up a phenomenal amount of memory and bandwidth due to its O(--bframes^2 * --rc-lookahead) Viterbi analysis.

Just for reference, since you're looking at practical applications, here are the approximate presets used by various companies I work with that care a lot about performance and run Linux:

The Criterion Collection (encoding web versions of films, blu-ray authoring): Veryslow
Zencoder (high-quality web transcoding service): Slow
Facebook (fast-turnaround web video): Medium
Avail Media (live, realtime HD television broadcast): Fast
Gaikai (interactive, ultra-low-latency, web video): Veryfast

Jason

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .

Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with SMTP id C6EDD6B01EF for ; Sun, 11 Apr 2010 07:31:34 -0400 (EDT)
Message-ID: <4BC1B2CA.8050208@redhat.com>
Date: Sun, 11 Apr 2010 14:30:18 +0300
From: Avi Kivity
MIME-Version: 1.0
Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17
References: <20100406011345.GT5825@random.random> <20100406090813.GA14098@elte.hu> <20100410184750.GJ5708@random.random> <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <20100411104608.GA12828@elte.hu>
In-Reply-To: <20100411104608.GA12828@elte.hu>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
To: Ingo Molnar
Cc: Mike Galbraith , Jason Garrett-Glaser , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura
List-ID:

On 04/11/2010 01:46 PM, Ingo Molnar wrote:
>
>> There shouldn't be a slowdown as far as I can tell. [...]
>
> It does not hurt to double check the before/after micro-cost precisely - it
> would be nice to see a result of:
>
>   perf stat -e instructions --repeat 100 sort /etc/passwd > /dev/null
>
> with and without hugetlb.
> With: 1036752 instructions # 0.000 IPC ( +- 0.092% ) Without: 1036844 instructions # 0.000 IPC ( +- 0.100% ) > Linus is right in that the patches are intrusive, and the answer to that isnt > to insist that it isnt so (it evidently is so), No one is insisting the patches aren't intrusive. We're insisting they bring a real benefit. I think Linus' main objection was that hugetlb wouldn't work due to fragmentation, and I think we've demonstrated that antifrag/compaction do allow hugetlb to work even during a fragmenting workload running in parallel. > the correct reply is to > broaden the utility of the patches and to demonstrate that the feature is > useful on a much wider spectrum of workloads. > That's probably not the case. I don't expect a significant improvement in desktop experience. The benefit will be for workloads with large working sets and random access to memory. >> Well, we know that databases, virtualization, and server-side java win from >> this. (Oracle won't benefit from this implementation since it wants shared, >> not anonymous, memory, but other databases may). I'm guessing large C++ >> compiles, and perhaps the new link-time optimization feature, will also see >> a nice speedup. >> >> Desktops will only benefit when they bloat to ~8GB RAM and 1-2GB firefox >> RSS, probably not so far in the future. >> > 1-2GB firefox RSS is reality for me. > Mine usually crashes sooner... interestingly, its vmas are heavily fragmented: 00007f97f1500000 2048K rw--- [ anon ] 00007f97f1800000 1024K rw--- [ anon ] 00007f97f1a00000 1024K rw--- [ anon ] 00007f97f1c00000 2048K rw--- [ anon ] 00007f97f1f00000 1024K rw--- [ anon ] 00007f97f2100000 1024K rw--- [ anon ] 00007f97f2300000 1024K rw--- [ anon ] 00007f97f2500000 1024K rw--- [ anon ] 00007f97f2700000 1024K rw--- [ anon ] 00007f97f2900000 1024K rw--- [ anon ] 00007f97f2b00000 2048K rw--- [ anon ] 00007f97f2e00000 2048K rw--- [ anon ] 00007f97f3100000 1024K rw--- [ anon ] 00007f97f3300000 1024K rw--- [ anon ] 00007f97f3500000 1024K rw--- [ anon ] 00007f97f3700000 1024K rw--- [ anon ] 00007f97f3900000 2048K rw--- [ anon ] 00007f97f3c00000 2048K rw--- [ anon ] 00007f97f3f00000 1024K rw--- [ anon ] So hugetlb won't work out-of-the-box on firefox. > Btw., there's another workload that could be cache sensitive, 'git grep': > > aldebaran:~/linux> perf stat -e cycles -e instructions -e dtlb-loads -e dtlb-load-misses --repeat 5 git grep arca>/dev/null > > Performance counter stats for 'git grep arca' (5 runs): > > 1882712774 cycles ( +- 0.074% ) > 1153649442 instructions # 0.613 IPC ( +- 0.005% ) > 518815167 dTLB-loads ( +- 0.035% ) > 3028951 dTLB-load-misses ( +- 1.223% ) > > 0.597161428 seconds time elapsed ( +- 0.065% ) > > At first sight, with 7 cycles per cold TLB there's about 1.12% of a speedup > potential in that workload. With just 1 cycle it's 0.16%. The real speedup > ought to be somewhere inbetween. > 'git grep' is a pagecache workload, not anonymous memory, so it shouldn't see any improvement. I imagine git will see a nice speedup if we get hugetlb for pagecache, at least for read-only workloads that don't hash all the time. > Btw., instead of throwing random numbers like '3-4%' into this thread it would > be nice if you could send 'perf stat --repeat' numbers like i did above - they > have an error bar, they show the TLB details, they show the cycles and > instructions proportion and they are also far more precise than 'time' based > results. > Sure. 
-- error compiling committee.c: too many arguments to function -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id C8E5A6B01E3 for ; Sun, 11 Apr 2010 07:34:19 -0400 (EDT) Message-ID: <4BC1B389.20803@redhat.com> Date: Sun, 11 Apr 2010 14:33:29 +0300 From: Avi Kivity MIME-Version: 1.0 Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 References: <20100406090813.GA14098@elte.hu> <20100410184750.GJ5708@random.random> <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <4BC0E2C4.8090101@redhat.com> <20100410204756.GR5708@random.random> <4BC0E6ED.7040100@redhat.com> <20100411010540.GW5708@random.random> <20100411112424.GA10952@elte.hu> In-Reply-To: <20100411112424.GA10952@elte.hu> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Ingo Molnar Cc: Andrea Arcangeli , Mike Galbraith , Jason Garrett-Glaser , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On 04/11/2010 02:24 PM, Ingo Molnar wrote: > * Andrea Arcangeli wrote: > > >> So this takes more than 2 seconds away from 24 seconds reproducibly, and it >> means gcc now runs 8% faster. [...] >> > That's fantastic if systematic ... i'd give a limb for faster kbuild times in > the>2% range. > > Would be nice to see a precise before/after 'perf stat' comparison: > > perf stat -e cycles -e instructions -e dtlb-loads -e dtlb-load-misses --repeat 3 ... > > that way we can see that the instruction count is roughly the same > before/after, the cycle count goes down and we can also see the reduction in > dTLB misses (and other advantages, if any). > > Plus, here's a hugetlb usability feature request if you dont mind me > suggesting it. > > This current usage (as root): > > echo never> /sys/kernel/mm/transparent_hugepage/enabled > > is fine for testing but it would be also nice to give finegrained per workload > tunability to such details. It would be _very_ nice to have app-inheritable > hugetlb attributes plus have a 'hugetlb' tool in tools/hugetlb/, which would > allow the per workload tuning of hugetlb uses. For example: > > hugetlb ctl --never ./my-workload.sh > > would disable hugetlb usage in my-workload.sh (and all sub-processes). > Running: > > hugetlb ctl --always ./my-workload.sh > > would enable it. [or something like that - maybe there are better naming schemes] > I would like to see transparent hugetlb enabled by default for all workloads, and good enough so that users don't need to tweak it at all. May not happen for the initial merge, but certainly later. -- error compiling committee.c: too many arguments to function -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
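Until something like a per-process attribute exists, the global knob can fake the proposed interface for serialized workloads. A sketch only: the script name is made up, a sysfs-wide toggle is process-global (nothing here is inheritable per task, unlike the real proposal), and the bracketed current-value format is assumed from the usual sysfs multi-choice convention.

  #!/bin/sh
  # thpctl.sh (hypothetical name)
  # usage: ./thpctl.sh never ./my-workload.sh
  knob=/sys/kernel/mm/transparent_hugepage/enabled
  mode="$1"; shift
  old=$(sed -e 's/.*\[\(.*\)\].*/\1/' "$knob")   # current mode is bracketed
  echo "$mode" > "$knob"
  "$@"; rc=$?
  echo "$old" > "$knob"                          # restore on the way out
  exit $rc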
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id 03D636B01E3 for ; Sun, 11 Apr 2010 07:53:53 -0400 (EDT)
Date: Sun, 11 Apr 2010 13:52:29 +0200
From: Ingo Molnar
Subject: hugepages will matter more in the future
Message-ID: <20100411115229.GB10952@elte.hu>
References: <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <4BC0E2C4.8090101@redhat.com> <4BC0E556.30304@redhat.com> <4BC19663.8080001@redhat.com> <4BC19916.20100@redhat.com> <20100411110015.GA10149@elte.hu> <4BC1B034.4050302@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <4BC1B034.4050302@redhat.com>
Sender: owner-linux-mm@kvack.org
To: Avi Kivity
Cc: Jason Garrett-Glaser , Mike Galbraith , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura , Arjan van de Ven
List-ID:

* Avi Kivity wrote:

> > It is no doubt that benchmark advantages can be shown - the point of this
> > exercise is to show that there are real-life speedups to various
> > categories of non-server apps that hugetlb gives us.
>
> I think hugetlb will mostly help server apps. Desktop apps simply don't
> have working sets big enough to matter. There will be exceptions, but as a
> rule, desktop apps won't benefit much from this.

Both Xorg, xterms and firefox have rather huge RSS's on my boxes. (Even a phone these days easily has more than 512 MB RAM.)

Andrea measured multi-percent improvement in gcc performance. I think it's real.

Also note that IMO hugetlbs will matter _more_ in the future, even if CPU designers do a perfect job and CPU caches stay well-balanced to typical working sets: because RAM size is increasing somewhat faster than CPU cache size, due to the different physical constraints that CPUs face.

A quick back-of-the-envelope estimation: 20 years ago the high-end desktop had 4MB of RAM and 64K of cache [1:64 proportion], today it has 16 GB of RAM and 8 MB of L2 cache on the CPU [1:2048 proportion].

App working sets track typical RAM sizes [it is their primary limit], not typical CPU cache sizes. So while RAM size is exploding, CPU cache sizes cannot grow that fast, and there's an increasing 'gap' between the pagetable size of higher-end RAM-filling workloads and CPU cache sizes - which gap the CPU itself cannot possibly close or mitigate in the future.

Also, the proportion of 4K:2MB is a fixed constant, and CPUs dont grow their TLB caches as much as typical RAM size grows: they'll grow it according to the _mean_ working set size - while the 'max' working set gets larger and larger due to the increasing [proportional] gap to RAM size.

Put in a different way: this slow, gradual physical process causes data-cache misses to become 'colder and colder': in essence a portion of the worst-case TLB miss cost gets added to the average data-cache miss cost on more and more workloads. (Even without any nested-pagetables or other virtualization considerations.) The CPU can do nothing about this - even if it stays in a golden balance with typical workloads.
Hugetlbs were ridiculous 10 years ago, but are IMO real today. My prediction is that in 5-10 years we'll be thinking about 1GB pages for certain HPC apps and 2MB pages will be common on the desktop.

This is why i think we should think about hugetlb support today, and this is why i think we should consider elevating hugetlbs to the next level of built-in Linux VM support.

Ingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .

Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with SMTP id 728836B01E3 for ; Sun, 11 Apr 2010 08:02:59 -0400 (EDT)
Message-ID: <4BC1BA0D.1050904@redhat.com>
Date: Sun, 11 Apr 2010 15:01:17 +0300
From: Avi Kivity
MIME-Version: 1.0
Subject: Re: hugepages will matter more in the future
References: <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <4BC0E2C4.8090101@redhat.com> <4BC0E556.30304@redhat.com> <4BC19663.8080001@redhat.com> <4BC19916.20100@redhat.com> <20100411110015.GA10149@elte.hu> <4BC1B034.4050302@redhat.com> <20100411115229.GB10952@elte.hu>
In-Reply-To: <20100411115229.GB10952@elte.hu>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
To: Ingo Molnar
Cc: Jason Garrett-Glaser , Mike Galbraith , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura , Arjan van de Ven
List-ID:

On 04/11/2010 02:52 PM, Ingo Molnar wrote:
>
> Put in a different way: this slow, gradual physical process causes data-cache
> misses to become 'colder and colder': in essence a portion of the worst-case
> TLB miss cost gets added to the average data-cache miss cost on more and more
> workloads. (Even without any nested-pagetables or other virtualization
> considerations.) The CPU can do nothing about this - even if it stays in a
> golden balance with typical workloads.

This is the essence of it, and why we really need transparent hugetlb. Both the TLB and the caches are way too small to handle the millions of pages that are common now.

> This is why i think we should think about hugetlb support today and this is
> why i think we should consider elevating hugetlbs to the next level of
> built-in Linux VM support.

Agreed, with s/today/yesterday/.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
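The arithmetic behind those proportions, spelled out in shell arithmetic; the 1990/2010 sizes are the ones quoted above, while the 512-entry dTLB in the last line is a hypothetical round number used only to illustrate TLB reach:

  $ echo $(( (4 * 1024) / 64 ))                  # 1990: 4MB RAM / 64K cache
  64
  $ echo $(( (16 * 1024 * 1024) / (8 * 1024) ))  # 2010: 16GB RAM / 8MB L2
  2048
  $ echo $(( 512 * 4 ))K vs $(( 512 * 2048 ))K   # dTLB reach, 4K vs 2MB pages
  2048K vs 1048576K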
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with ESMTP id A0BBA6B01E3 for ; Sun, 11 Apr 2010 08:08:18 -0400 (EDT) Date: Sun, 11 Apr 2010 14:08:00 +0200 From: Ingo Molnar Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100411120800.GC10952@elte.hu> References: <20100406090813.GA14098@elte.hu> <20100410184750.GJ5708@random.random> <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <20100411104608.GA12828@elte.hu> <4BC1B2CA.8050208@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4BC1B2CA.8050208@redhat.com> Sender: owner-linux-mm@kvack.org To: Avi Kivity Cc: Mike Galbraith , Jason Garrett-Glaser , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: * Avi Kivity wrote: > On 04/11/2010 01:46 PM, Ingo Molnar wrote: > > > >>There shouldn't be a slowdown as far as I can tell. [...] > >It does not hurt to double check the before/after micro-cost precisely - it > >would be nice to see a result of: > > > > perf stat -e instructions --repeat 100 sort /etc/passwd> /dev/null > > > >with and without hugetlb. > > With: > > 1036752 instructions # 0.000 IPC ( +- > 0.092% ) > > Without: > > 1036844 instructions # 0.000 IPC ( +- > 0.100% ) > > > Linus is right in that the patches are intrusive, and the answer to that > > isnt to insist that it isnt so (it evidently is so), > > No one is insisting the patches aren't intrusive. We're insisting they > bring a real benefit. I think Linus' main objection was that hugetlb > wouldn't work due to fragmentation, and I think we've demonstrated that > antifrag/compaction do allow hugetlb to work even during a fragmenting > workload running in parallel. As i understood it i think Linus had three main objections: 1- the improvements were only shown in specialistic environments (virtualization, servers) 2- complexity 3- futility: defrag is hard and theoretically impossible 1) numbers were too specialistic I think if some more numbers are gathered and if hugetlb/nohugetlb is made a bit more configurable (on a per workload basis) then this concern is fairly addressed. 2) complexity There's probably not much to be done about this. It's a cost/benefit tradeoff decision, i.e. depends on the other two factors. 3) futility I think Andrea and Mel and you demonstrated that while defrag is futile in theory (we can always fill up all of RAM with dentries and there's no 2MB allocation possible), it seems rather usable in practice. > > the correct reply is to broaden the utility of the patches and to > > demonstrate that the feature is useful on a much wider spectrum of > > workloads. > > That's probably not the case. I don't expect a significant improvement in > desktop experience. The benefit will be for workloads with large working > sets and random access to memory. See my previous mail about the 'RAM gap' - i think it matters more than you think. 
The important thing to realize is that the working set of the 'desktop' is _not_ independent of RAM size: it just fills up RAM to the 'typical average RAM size'. That is around 2 GB today. In 5-10 years it will be at 16 GB. Applications will just bloat up to that natural size. They'll use finer default resolutions, larger internal caches, etc. etc. So IMO it all matters to the desktop too and is not just a server feature. We saw this again and again: today's server scalability limitation is tomorrow's desktop scalability limitation. > Mine usually crashes sooner... interestingly, its vmas are heavily > fragmented: > > 00007f97f1500000 2048K rw--- [ anon ] > 00007f97f1800000 1024K rw--- [ anon ] > 00007f97f1a00000 1024K rw--- [ anon ] > 00007f97f1c00000 2048K rw--- [ anon ] > 00007f97f1f00000 1024K rw--- [ anon ] > 00007f97f2100000 1024K rw--- [ anon ] > 00007f97f2300000 1024K rw--- [ anon ] > 00007f97f2500000 1024K rw--- [ anon ] > 00007f97f2700000 1024K rw--- [ anon ] > 00007f97f2900000 1024K rw--- [ anon ] > 00007f97f2b00000 2048K rw--- [ anon ] > 00007f97f2e00000 2048K rw--- [ anon ] > 00007f97f3100000 1024K rw--- [ anon ] > 00007f97f3300000 1024K rw--- [ anon ] > 00007f97f3500000 1024K rw--- [ anon ] > 00007f97f3700000 1024K rw--- [ anon ] > 00007f97f3900000 2048K rw--- [ anon ] > 00007f97f3c00000 2048K rw--- [ anon ] > 00007f97f3f00000 1024K rw--- [ anon ] > > So hugetlb won't work out-of-the-box on firefox. Hm, seems to have 1MB holes between them. Half of them are 2MB in size, but half of them are not properly aligned. So about 33% of firefox's anon memory is hugepage-able straight away - still nonzero. (Plus maybe if this comes from glibc then it could be handled by patching glibc.) > 'git grep' is a pagecache workload, not anonymous memory, so it shouldn't > see any improvement. [...] Indeed, git grep is read() based. > [...] I imagine git will see a nice speedup if we get hugetlb for > pagecache, at least for read-only workloads that don't hash all the time. Shouldnt that already be the case today? The pagecache is in the kernel where we have things 2MB mapped. Git read()s it into the same [small] buffer again and again, so the only 'wide' address space access it does is within the kernel, to the 2MB mapped pagecache pages. Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
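That 'about 33%' eyeball estimate can be mechanized against the same pmap output. A sketch, assuming gawk (for strtonum) and the column layout shown above; it counts only anon regions whose start address is 2MB-aligned and that span at least 2MB, and the script name is made up:

  #!/bin/sh
  # anon-2mb.sh (hypothetical name); usage: ./anon-2mb.sh <pid>
  pmap "$1" | gawk '/anon/ {
      addr = strtonum("0x" $1)          # column 1: start address, hex
      kb = $2 + 0                       # column 2: size such as "2048K"
      total += kb
      if (addr % (2 * 1024 * 1024) == 0 && kb >= 2048)
          huge += int(kb / 2048) * 2048 # whole 2MB units in the region
  }
  END { printf "anon %dK, 2MB-mappable %dK (%.0f%%)\n",
        total, huge, total ? 100 * huge / total : 0 }'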
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with ESMTP id A2E906B01E3 for ; Sun, 11 Apr 2010 08:12:21 -0400 (EDT) Date: Sun, 11 Apr 2010 14:11:30 +0200 From: Ingo Molnar Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100411121130.GD10952@elte.hu> References: <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <4BC0E2C4.8090101@redhat.com> <20100410204756.GR5708@random.random> <4BC0E6ED.7040100@redhat.com> <20100411010540.GW5708@random.random> <20100411112424.GA10952@elte.hu> <4BC1B389.20803@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4BC1B389.20803@redhat.com> Sender: owner-linux-mm@kvack.org To: Avi Kivity Cc: Andrea Arcangeli , Mike Galbraith , Jason Garrett-Glaser , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: * Avi Kivity wrote: > I would like to see transparent hugetlb enabled by default for all > workloads, and good enough so that users don't need to tweak it at all. May > not happen for the initial merge, but certainly later. Definitely agreed with that - the feature doesnt make sense without that kind of automatic default. Either it _can_ handle to be the default just fine and give us advantages on a broad basis, or if not then it's not worth merging. Nevertheless allowing an opt-out on a finegrained basis would still be nice in general. The default is powerful enough of a force in itself - a finegrained opt-out does not hurt that advantage, it only improves utility. Thanks, Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with SMTP id 05CA16B01E3 for ; Sun, 11 Apr 2010 08:25:34 -0400 (EDT) Message-ID: <4BC1BF93.60807@redhat.com> Date: Sun, 11 Apr 2010 15:24:51 +0300 From: Avi Kivity MIME-Version: 1.0 Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 References: <20100406090813.GA14098@elte.hu> <20100410184750.GJ5708@random.random> <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <20100411104608.GA12828@elte.hu> <4BC1B2CA.8050208@redhat.com> <20100411120800.GC10952@elte.hu> In-Reply-To: <20100411120800.GC10952@elte.hu> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Ingo Molnar Cc: Mike Galbraith , Jason Garrett-Glaser , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On 04/11/2010 03:08 PM, Ingo Molnar wrote: > >> No one is insisting the patches aren't intrusive. We're insisting they >> bring a real benefit. I think Linus' main objection was that hugetlb >> wouldn't work due to fragmentation, and I think we've demonstrated that >> antifrag/compaction do allow hugetlb to work even during a fragmenting >> workload running in parallel. >> > As i understood it i think Linus had three main objections: > > 1- the improvements were only shown in specialistic environments > (virtualization, servers) > Servers are not specialized workloads, and neither is virtualization. If we have to justify everything based on the desktop experience we'd have no 4096 core support, fibre channel and 10GbE drivers, a zillion architectures etc. > 2- complexity > No arguing with that. > The important thing to realize is that the working set of the 'desktop' is > _not_ independent of RAM size: it just fills up RAM to the 'typical average > RAM size'. That is around 2 GB today. In 5-10 years it will be at 16 GB. > > Applications will just bloat up to that natural size. They'll use finer > default resolutions, larger internal caches, etc. etc. > Well, if this happens we'll be ready. >> 'git grep' is a pagecache workload, not anonymous memory, so it shouldn't >> see any improvement. [...] >> > Indeed, git grep is read() based. > Right. >> [...] I imagine git will see a nice speedup if we get hugetlb for >> pagecache, at least for read-only workloads that don't hash all the time. >> > Shouldnt that already be the case today? The pagecache is in the kernel where > we have things 2MB mapped. Git read()s it into the same [small] buffer again > and again, so the only 'wide' address space access it does is within the > kernel, to the 2MB mapped pagecache pages. > If you 'git grep pattern $commit' instead, you'll be reading out of mmap()ed git packs. Much of git memory access goes through that. To get the benefit of hugetlb there, we'd need to run khugepaged on pagecache, and align file vmas on 2MB boundaries. 
We'll also get executables and shared objects mapped via large pages this way; the ELF ABI is already set up to align sections on 2MB boundaries. -- error compiling committee.c: too many arguments to function -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with ESMTP id 1C1FF6B01E3 for ; Sun, 11 Apr 2010 08:36:37 -0400 (EDT) Date: Sun, 11 Apr 2010 14:35:14 +0200 From: Ingo Molnar Subject: Re: hugepages will matter more in the future Message-ID: <20100411123514.GA19676@elte.hu> References: <4BC0E2C4.8090101@redhat.com> <4BC0E556.30304@redhat.com> <4BC19663.8080001@redhat.com> <4BC19916.20100@redhat.com> <20100411110015.GA10149@elte.hu> <4BC1B034.4050302@redhat.com> <20100411115229.GB10952@elte.hu> <4BC1BA0D.1050904@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4BC1BA0D.1050904@redhat.com> Sender: owner-linux-mm@kvack.org To: Avi Kivity Cc: Jason Garrett-Glaser , Mike Galbraith , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura , Arjan van de Ven List-ID: * Avi Kivity wrote: > On 04/11/2010 02:52 PM, Ingo Molnar wrote: > > > > Put in a different way: this slow, gradual physical process causes > > data-cache misses to become 'colder and colder': in essence a portion of > > the worst-case TLB miss cost gets added to the average data-cache miss > > cost on more and more workloads. (Even without any nested-pagetables or > > other virtualization considerations.) The CPU can do nothing about this - > > even if it stays in a golden balance with typical workloads. > > This is the essence, and it is why we really need transparent hugetlb. > Both the tlb and the caches are way too small to handle the millions of pages > that are common now. > > > This is why I think we should think about hugetlb support today and this > > is why I think we should consider elevating hugetlbs to the next level of > > built-in Linux VM support. > > Agreed, with s/today/yesterday/. Well, yes - with the caveat that I think yesterday's hugetlb patches were nowhere close to being mergeable. (and were nowhere close to addressing the problems to begin with) Andrea's patches are IMHO a game changer because they are the first thing that has the chance to improve a large category of workloads. We saw that the 10-year-old hugetlbfs and libhugetlb experiments alone helped very little: a Linux-only opt-in performance feature that takes effort [and admin space configuration ...] on the app side will almost never be taken advantage of to make a visible difference to the end result - it simply doesn't scale as a development and deployment model. The most important thing the past 10 years of kernel development have taught us is that transparent, always-available, zero-app-effort kernel features are king. The rest barely exists.
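For contrast, this is roughly what the opt-in model criticized above demands of every application: an explicit hugetlb mapping (MAP_HUGETLB as merged in 2.6.32; the fallback #define is an assumption for older x86 headers), which fails unless the admin pre-reserved hugepages:

#include <stdio.h>
#include <sys/mman.h>

#ifndef MAP_HUGETLB
#define MAP_HUGETLB 0x40000     /* assumed x86 value for older headers */
#endif

int main(void)
{
        size_t len = 2UL << 20;
        /* Needs prior admin setup, e.g. echoing into
         * /proc/sys/vm/nr_hugepages -- the deployment cost that a
         * transparent default is meant to eliminate. */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
                perror("mmap(MAP_HUGETLB)");    /* typically ENOMEM */
                return 1;
        }
        return 0;
}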
Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with ESMTP id 36FD66B01E3 for ; Sun, 11 Apr 2010 08:46:40 -0400 (EDT) Date: Sun, 11 Apr 2010 14:46:24 +0200 From: Ingo Molnar Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100411124624.GC19676@elte.hu> References: <20100406090813.GA14098@elte.hu> <20100410184750.GJ5708@random.random> <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <20100411104608.GA12828@elte.hu> <4BC1B2CA.8050208@redhat.com> <20100411120800.GC10952@elte.hu> <4BC1BF93.60807@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4BC1BF93.60807@redhat.com> Sender: owner-linux-mm@kvack.org To: Avi Kivity Cc: Mike Galbraith , Jason Garrett-Glaser , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: * Avi Kivity wrote: > On 04/11/2010 03:08 PM, Ingo Molnar wrote: > > > >> No one is insisting the patches aren't intrusive. We're insisting they > >> bring a real benefit. I think Linus' main objection was that hugetlb > >> wouldn't work due to fragmentation, and I think we've demonstrated that > >> antifrag/compaction do allow hugetlb to work even during a fragmenting > >> workload running in parallel. > > > > As I understood it, I think Linus had three main objections: > > > > 1- the improvements were only shown in specialized environments > > (virtualization, servers) > > Servers are not specialized workloads, and neither is virtualization. [...] As far as kernel development goes they are. ( In fact in the past few years virtualization has grown the nasty habit of sometimes _hindering_ upstream kernel development ... I hope that will change. ) > > Applications will just bloat up to that natural size. They'll use finer > > default resolutions, larger internal caches, etc. etc. > > Well, if this happens we'll be ready. That's what happened in the past 20 years, and I can see no signs of that process stopping anytime soon. [ Note, 'apps bloat up to natural RAM size' is a heavy simplification with a somewhat derogatory undertone: in reality what happens is that apps just grow along what are basically random vectors, and if a vector hits across the RAM limit [causing a visible slowdown due to bloat] there is a _pushback_ from developers/testers/users. The end result is that app working sets are clipped to somewhat below the typical desktop RAM size, but rarely are they debloated to much below that practical average threshold. So in essence 'apps fill up available RAM'. ] Just like car traffic 'fills up' available road capacity. If there's enough road capacity [and fuel prices are not too high] then families (and businesses) will have second and third cars and won't bother optimizing their driving patterns.
Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id EABF96B01E3 for ; Sun, 11 Apr 2010 11:28:24 -0400 (EDT) Date: Sun, 11 Apr 2010 08:22:04 -0700 (PDT) From: Linus Torvalds Subject: Re: hugepages will matter more in the future In-Reply-To: <20100411115229.GB10952@elte.hu> Message-ID: References: <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <4BC0E2C4.8090101@redhat.com> <4BC0E556.30304@redhat.com> <4BC19663.8080001@redhat.com> <4BC19916.20100@redhat.com> <20100411110015.GA10149@elte.hu> <4BC1B034.4050302@redhat.com> <20100411115229.GB10952@elte.hu> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Ingo Molnar Cc: Avi Kivity , Jason Garrett-Glaser , Mike Galbraith , Andrea Arcangeli , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura , Arjan van de Ven List-ID: On Sun, 11 Apr 2010, Ingo Molnar wrote: > > Both Xorg, xterms and firefox have rather huge RSS's on my boxes. (Even a > phone these days easily has more than 512 MB RAM.) Andrea measured > multi-percent improvement in gcc performance. I think it's real. Reality check: he got multiple percent with - one huge badly written file being compiled that took 22s because it's such a horrible monster. - magic libc malloc flags that are totally and utterly unrealistic in anything but a benchmark - by basically keeping one CPU totally busy doing defragmentation. Quite frankly, that kind of "performance analysis" makes me _less_ interested rather than more. Because all it shows is that you're willing to do anything at all to get better numbers, regardless of whether it is _realistic_ or not. Seriously, guys. Get a grip. If you start talking about special malloc algorithms, you have ALREADY LOST. Google for memory fragmentation with various malloc implementations in multi-threaded applications. Thinking that you can just allocate in 2MB chunks is so _fundamentally_ broken that this whole thread should have been laughed out of the room. Instead, you guys egg each other on. Stop the f*cking circle-jerk already. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with SMTP id 544276B01EF for ; Sun, 11 Apr 2010 11:44:59 -0400 (EDT) Message-ID: <4BC1EE13.7080702@redhat.com> Date: Sun, 11 Apr 2010 18:43:15 +0300 From: Avi Kivity MIME-Version: 1.0 Subject: Re: hugepages will matter more in the future References: <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <4BC0E2C4.8090101@redhat.com> <4BC0E556.30304@redhat.com> <4BC19663.8080001@redhat.com> <4BC19916.20100@redhat.com> <20100411110015.GA10149@elte.hu> <4BC1B034.4050302@redhat.com> <20100411115229.GB10952@elte.hu> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Linus Torvalds Cc: Ingo Molnar , Jason Garrett-Glaser , Mike Galbraith , Andrea Arcangeli , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura , Arjan van de Ven List-ID: On 04/11/2010 06:22 PM, Linus Torvalds wrote: > > On Sun, 11 Apr 2010, Ingo Molnar wrote: > >> Both Xorg, xterms and firefox have rather huge RSS's on my boxes. (Even a >> phone these days easily has more than 512 MB RAM.) Andrea measured >> multi-percent improvement in gcc performance. I think it's real. >> > Reality check: he got multiple percent with > > - one huge badly written file being compiled that took 22s because it's > such a horrible monster. > Not everything is a kernel build. Template-heavy C++ code will also allocate tons of memory. gcc -flto will also want lots of memory. > - magic libc malloc flags that are totally and utterly unrealistic in > anything but a benchmark > Having glibc allocate in chunks of 2MB instead of 1MB is not unrealistic. I agree about MMAP_THRESHOLD. > - by basically keeping one CPU totally busy doing defragmentation. > I never saw khugepaged take any significant amount of CPU. > Quite frankly, that kind of "performance analysis" makes me _less_ > interested rather than more. Because all it shows is that you're willing > to do anything at all to get better numbers, regardless of whether it is > _realistic_ or not. > > Seriously, guys. Get a grip. If you start talking about special malloc > algorithms, you have ALREADY LOST. Google for memory fragmentation with > various malloc implementations in multi-threaded applications. Thinking > that you can just allocate in 2MB chunks is so _fundamentally_ broken that > this whole thread should have been laughed out of the room. > And yet Oracle and java have options to use large pages, and we know google and HPC like 'em. Maybe they just haven't noticed the fundamental brokenness yet. -- error compiling committee.c: too many arguments to function -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
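The glibc tuning being debated here is reachable through mallopt(); a sketch of the 2MB-chunk behaviour Avi calls not unrealistic. The knobs are standard glibc parameters; the 2MB values are illustrative, not a recommendation:

#include <malloc.h>
#include <stdlib.h>

int main(void)
{
        /* Route only allocations above 2MB to mmap, so the main heap
         * stays contiguous and hugepage-collapsible ... */
        mallopt(M_MMAP_THRESHOLD, 2 * 1024 * 1024);
        /* ... and grow/shrink the heap in 2MB steps instead of the
         * much smaller defaults, keeping its top 2MB-friendly. */
        mallopt(M_TOP_PAD, 2 * 1024 * 1024);
        mallopt(M_TRIM_THRESHOLD, 2 * 1024 * 1024);

        void *p = malloc(1 << 20);      /* now heap-backed, THP-eligible */
        free(p);
        return 0;
}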
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with ESMTP id 9E7FF6B01E3 for ; Sun, 11 Apr 2010 11:58:14 -0400 (EDT) Date: Sun, 11 Apr 2010 08:52:10 -0700 (PDT) From: Linus Torvalds Subject: Re: hugepages will matter more in the future In-Reply-To: <4BC1EE13.7080702@redhat.com> Message-ID: References: <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <4BC0E2C4.8090101@redhat.com> <4BC0E556.30304@redhat.com> <4BC19663.8080001@redhat.com> <4BC19916.20100@redhat.com> <20100411110015.GA10149@elte.hu> <4BC1B034.4050302@redhat.com> <20100411115229.GB10952@elte.hu> <4BC1EE13.7080702@redhat.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Avi Kivity Cc: Ingo Molnar , Jason Garrett-Glaser , Mike Galbraith , Andrea Arcangeli , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura , Arjan van de Ven List-ID: On Sun, 11 Apr 2010, Avi Kivity wrote: > > And yet Oracle and java have options to use large pages, and we know google > and HPC like 'em. Maybe they just haven't noticed the fundamental brokenness > yet. The thing is, what you are advocating is what traditional UNIX did. Prioritizing the special cases rather than the generic workloads. And I'm telling you, it's wrong. Traditional Unix is dead, and it's dead exactly _because_ it prioritized those kinds of loads. I'm perfectly happy to take specialized workloads into account, but it needs to help the _normal_ case too. Somebody mentioned 4k CPU support as an example, and that's a good example. The only reason we support 4k CPU's is that the code was made clean enough to work with them and actually help clean up the SMP code in general. I've also seen Andrea talk about how it's all rock solid. We _know_ that is wrong, because the anon_vma bug is not solved. That bug apparently happens under low-memory situations, so clearly nobody has really stressed the low-memory case. So here's the deal: make the code cleaner, and it's fine. And stop trying to sell it with _crap_. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id C1C2C6B01E3 for ; Sun, 11 Apr 2010 12:06:12 -0400 (EDT) Message-ID: <4BC1F31E.2050009@redhat.com> Date: Sun, 11 Apr 2010 19:04:46 +0300 From: Avi Kivity MIME-Version: 1.0 Subject: Re: hugepages will matter more in the future References: <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <4BC0E2C4.8090101@redhat.com> <4BC0E556.30304@redhat.com> <4BC19663.8080001@redhat.com> <4BC19916.20100@redhat.com> <20100411110015.GA10149@elte.hu> <4BC1B034.4050302@redhat.com> <20100411115229.GB10952@elte.hu> <4BC1EE13.7080702@redhat.com> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Linus Torvalds Cc: Ingo Molnar , Jason Garrett-Glaser , Mike Galbraith , Andrea Arcangeli , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura , Arjan van de Ven List-ID: On 04/11/2010 06:52 PM, Linus Torvalds wrote: > > On Sun, 11 Apr 2010, Avi Kivity wrote: > >> And yet Oracle and java have options to use large pages, and we know google >> and HPC like 'em. Maybe they just haven't noticed the fundamental brokenness >> yet. >> > The thing is, what you are advocating is what traditional UNIX did. > Prioritizing the special cases rather than the generic workloads. > > And I'm telling you, it's wrong. Traditional Unix is dead, and it's dead > exactly _because_ it prioritized those kinds of loads. > This is not a specialized workload. Plenty of sites are running java, plenty of sites are running Oracle (though that won't benefit from anonymous hugepages), and plenty of sites are running virtualization. Not everyone does two kernel builds before breakfast. > I'm perfectly happy to take specialized workloads into account, but it > needs to help the _normal_ case too. Somebody mentioned 4k CPU support as > an example, and that's a good example. The only reason we support 4k CPU's > is that the code was made clean enough to work with them and actually > help clean up the SMP code in general. > > I've also seen Andrea talk about how it's all rock solid. We _know_ that > is wrong, because the anon_vma bug is not solved. That bug apparently > happens under low-memory situations, so clearly nobody has really stressed > the low-memory case. > Well, nothing is rock solid until it's had a few months in the hands of users. > So here's the deal: make the code cleaner, and it's fine. And stop trying > to sell it with _crap_. > That's perfectly reasonable. -- error compiling committee.c: too many arguments to function -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id 65EFF6B01E3 for ; Sun, 11 Apr 2010 15:37:39 -0400 (EDT) Date: Sun, 11 Apr 2010 21:35:31 +0200 From: Andrea Arcangeli Subject: Re: hugepages will matter more in the future Message-ID: <20100411193531.GB5656@random.random> References: <4BC0E556.30304@redhat.com> <4BC19663.8080001@redhat.com> <4BC19916.20100@redhat.com> <20100411110015.GA10149@elte.hu> <4BC1B034.4050302@redhat.com> <20100411115229.GB10952@elte.hu> <4BC1EE13.7080702@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: Linus Torvalds Cc: Avi Kivity , Ingo Molnar , Jason Garrett-Glaser , Mike Galbraith , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura , Arjan van de Ven List-ID: On Sun, Apr 11, 2010 at 08:52:10AM -0700, Linus Torvalds wrote: > is wrong, because the anon_vma bug is not solved. That bug apparently http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=commit;h=2acc64e8da017045039f30b926efac1f5c4bd82a -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with SMTP id 36BBB6B01E3 for ; Sun, 11 Apr 2010 15:41:54 -0400 (EDT) Date: Sun, 11 Apr 2010 21:40:10 +0200 From: Andrea Arcangeli Subject: Re: hugepages will matter more in the future Message-ID: <20100411194010.GC5656@random.random> References: <4BC0E2C4.8090101@redhat.com> <4BC0E556.30304@redhat.com> <4BC19663.8080001@redhat.com> <4BC19916.20100@redhat.com> <20100411110015.GA10149@elte.hu> <4BC1B034.4050302@redhat.com> <20100411115229.GB10952@elte.hu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: Linus Torvalds Cc: Ingo Molnar , Avi Kivity , Jason Garrett-Glaser , Mike Galbraith , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura , Arjan van de Ven List-ID: On Sun, Apr 11, 2010 at 08:22:04AM -0700, Linus Torvalds wrote: > - magic libc malloc flags that are totally and utterly unrealistic in > anything but a benchmark > > - by basically keeping one CPU totally busy doing defragmentation. This is a red herring. This is the last thing we want, and we'll run even faster if we could make current glibc binaries cooperate. But this is a new feature and it'll require changing glibc slightly. Future glibc will be optimal and it won't require khugepaged, don't worry (a sketch of such cooperation follows below).
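What "cooperating" means concretely for an allocator: carve arenas out of 2MB-aligned, 2MB-multiple extents so anonymous faults can install huge pmds immediately and khugepaged has nothing to fix up afterwards. A hypothetical helper, not actual glibc code:

#include <stdint.h>
#include <sys/mman.h>

#define HPAGE (2UL << 20)

/* Return a 2MB-aligned extent rounded up to a 2MB multiple, by
 * over-reserving and trimming the misaligned head and tail. */
static void *thp_friendly_extent(size_t len)
{
        uint8_t *raw;
        uintptr_t start, head;

        len = (len + HPAGE - 1) & ~(HPAGE - 1);
        raw = mmap(NULL, len + HPAGE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (raw == MAP_FAILED)
                return NULL;
        start = ((uintptr_t)raw + HPAGE - 1) & ~(HPAGE - 1);
        head = start - (uintptr_t)raw;
        if (head)
                munmap(raw, head);                      /* trim head */
        munmap((uint8_t *)(start + len), HPAGE - head); /* trim tail */
        return (void *)start;
}

With extents shaped like this, the very first write fault to each 2MB piece can be served by a huge pmd directly, with no background collapsing required.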
I got crashes in split_huge_page (page_mapcount not matching the number of huge_pmds mapping the page) because of the anon-vma bug, so I had to back it out; this is why it's stable now. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with ESMTP id A31496B01EE for ; Mon, 12 Apr 2010 02:10:00 -0400 (EDT) Date: Mon, 12 Apr 2010 16:09:31 +1000 From: Nick Piggin Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100412060931.GP5683@laptop> References: <20100406090813.GA14098@elte.hu> <20100410184750.GJ5708@random.random> <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <20100411104608.GA12828@elte.hu> <4BC1B2CA.8050208@redhat.com> <20100411120800.GC10952@elte.hu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100411120800.GC10952@elte.hu> Sender: owner-linux-mm@kvack.org To: Ingo Molnar Cc: Avi Kivity , Mike Galbraith , Jason Garrett-Glaser , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Sun, Apr 11, 2010 at 02:08:00PM +0200, Ingo Molnar wrote: > > * Avi Kivity wrote: > > 3) futility > > I think Andrea and Mel and you demonstrated that while defrag is futile in > theory (we can always fill up all of RAM with dentries and there's no 2MB > allocation possible), it seems rather usable in practice. One problem is that you need to keep a lot more memory free in order for it to be reasonably effective. Another thing is that the problem of fragmentation breakdown is not just a one-shot event that fills memory with pinned objects. It is a slow degradation. Especially when you use something like SLUB as the memory allocator which requires higher order allocations for objects which are pinned in kernel memory. Just running a few minutes of testing with a kernel compile in the background does not show the full picture. You really need a box that has been up for days running a proper workload before you are likely to see any breakdown. I'm sure it's horrible for planning if the RDBMS or VM boxes gradually get slower after X days of uptime. It's better to have consistent performance really, for anything except pure benchmark setups. Defrag is not futile in theory, you just have to either have a reserve of movable pages (and never allow pinned kernel pages in there), or you need to allocate pinned kernel memory in units of the chunk size goal (which just gives you different types of fragmentation problems) or you need to do non-linear kernel mappings so you can defrag pinned kernel memory (with *lots* of other problems of course). So you just have a lot of downsides. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
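Nick's "box up for days" point is straightforward to instrument. A small watcher over /proc/buddyinfo (format: node, zone, then free-block counts per order; this assumes MAX_ORDER == 11, the x86 default), run periodically to see whether >=2MB blocks really erode over uptime:

#include <stdio.h>

int main(void)
{
        char node[16], zone[16], line[512];
        FILE *f = fopen("/proc/buddyinfo", "r");

        if (!f)
                return 1;
        while (fgets(line, sizeof(line), f)) {
                unsigned long c[11], kb = 0;
                int o;

                if (sscanf(line, "Node %15[^,], zone %15s %lu %lu %lu %lu"
                           " %lu %lu %lu %lu %lu %lu %lu", node, zone,
                           &c[0], &c[1], &c[2], &c[3], &c[4], &c[5],
                           &c[6], &c[7], &c[8], &c[9], &c[10]) < 13)
                        continue;
                for (o = 9; o <= 10; o++)       /* order 9 == 2MB with 4K pages */
                        kb += c[o] << (o + 2);  /* blocks -> KB */
                printf("node %s zone %-8s: %lu KB free in >=2MB blocks\n",
                       node, zone, kb);
        }
        fclose(f);
        return 0;
}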
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with SMTP id 066E36B01EE for ; Mon, 12 Apr 2010 02:18:58 -0400 (EDT) Received: by fg-out-1718.google.com with SMTP id l26so123813fgb.8 for ; Sun, 11 Apr 2010 23:18:57 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <20100412060931.GP5683@laptop> References: <20100410184750.GJ5708@random.random> <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <20100411104608.GA12828@elte.hu> <4BC1B2CA.8050208@redhat.com> <20100411120800.GC10952@elte.hu> <20100412060931.GP5683@laptop> Date: Mon, 12 Apr 2010 09:18:56 +0300 Message-ID: Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 From: Pekka Enberg Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: Ingo Molnar , Avi Kivity , Mike Galbraith , Jason Garrett-Glaser , Andrea Arcangeli , Linus Torvalds , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Mon, Apr 12, 2010 at 9:09 AM, Nick Piggin wrote: >> I think Andrea and Mel and you demonstrated that while defrag is futile in >> theory (we can always fill up all of RAM with dentries and there's no 2MB >> allocation possible), it seems rather usable in practice. > > One problem is that you need to keep a lot more memory free in order > for it to be reasonably effective. Another thing is that the problem > of fragmentation breakdown is not just a one-shot event that fills > memory with pinned objects. It is a slow degradation. > > Especially when you use something like SLUB as the memory allocator > which requires higher order allocations for objects which are pinned > in kernel memory. I guess we'd need to merge the SLUB defragmentation patches to fix that? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
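Independently of the SLUB defragmentation patches Pekka mentions, it is easy to see which SLUB caches already depend on higher-order pages, i.e. the ones that would feel fragmentation first. A sketch against SLUB's sysfs layout (/sys/kernel/slab/<cache>/order, as in mainline):

#include <dirent.h>
#include <stdio.h>

int main(void)
{
        DIR *d = opendir("/sys/kernel/slab");
        struct dirent *e;

        if (!d)
                return 1;       /* not a SLUB kernel, or sysfs unmounted */
        while ((e = readdir(d))) {
                char path[512];
                FILE *f;
                int order;

                if (e->d_name[0] == '.')
                        continue;
                snprintf(path, sizeof(path),
                         "/sys/kernel/slab/%s/order", e->d_name);
                f = fopen(path, "r");
                if (!f)
                        continue;
                if (fscanf(f, "%d", &order) == 1 && order > 0)
                        printf("%-30s order %d\n", e->d_name, order);
                fclose(f);
        }
        closedir(d);
        return 0;
}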
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id 180F06B01EE for ; Mon, 12 Apr 2010 02:37:35 -0400 (EDT) Message-ID: <4BC2BF67.80903@redhat.com> Date: Mon, 12 Apr 2010 09:36:23 +0300 From: Avi Kivity MIME-Version: 1.0 Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 References: <20100406090813.GA14098@elte.hu> <20100410184750.GJ5708@random.random> <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <20100411104608.GA12828@elte.hu> <4BC1B2CA.8050208@redhat.com> <20100411120800.GC10952@elte.hu> <20100412060931.GP5683@laptop> In-Reply-To: <20100412060931.GP5683@laptop> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: Ingo Molnar , Mike Galbraith , Jason Garrett-Glaser , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On 04/12/2010 09:09 AM, Nick Piggin wrote: > On Sun, Apr 11, 2010 at 02:08:00PM +0200, Ingo Molnar wrote: > >> * Avi Kivity wrote: >> >> 3) futility >> >> I think Andrea and Mel and you demonstrated that while defrag is futile in >> theory (we can always fill up all of RAM with dentries and there's no 2MB >> allocation possible), it seems rather usable in practice. >> > One problem is that you need to keep a lot more memory free in order > for it to be reasonably effective. It's the usual space-time tradeoff. You don't want to do it on a netbook, but it's worth it on a 16GB server, which is already not very high end. > Another thing is that the problem > of fragmentation breakdown is not just a one-shot event that fills > memory with pinned objects. It is a slow degradation. > > Especially when you use something like SLUB as the memory allocator > which requires higher order allocations for objects which are pinned > in kernel memory. > Won't the usual antifrag tactics apply? Try to allocate those objects from the same block. > Just running a few minutes of testing with a kernel compile in the > background does not show the full picture. You really need a box that > has been up for days running a proper workload before you are likely > to see any breakdown. > I'm sure we'll be able to generate worst-case scenarios. I'm also reasonably sure we'll be able to deal with them. I hope we won't need to, but it's even possible to move dentries around. > I'm sure it's horrible for planning if the RDBMS or VM boxes gradually > get slower after X days of uptime. It's better to have consistent > performance really, for anything except pure benchmark setups. > If that were the case we'd disable caches everywhere. General purpose computing is a best effort thing, we try to be fast on the common case but we'll be slow on the uncommon case. Access to a bit of memory can take 3 ns if it's in cache, 100 ns if not, and 3 ms if it's on disk.
Here, the uncommon case will be really uncommon: most applications (that can benefit from large pages) I'm aware of don't switch from large anonymous working sets to a dcache load of many tiny files. They tend to keep doing the same thing over and over again. I'm not saying we don't need to adapt to changing conditions (we do, especially for kvm, that's what khugepaged is for), but as long as we have a graceful fallback, we don't need to worry too much about failure in extreme conditions. > Defrag is not futile in theory, you just have to either have a reserve > of movable pages (and never allow pinned kernel pages in there), or > you need to allocate pinned kernel memory in units of the chunk size > goal (which just gives you different types of fragmentation problems) > or you need to do non-linear kernel mappings so you can defrag pinned > kernel memory (with *lots* of other problems of course). So you just > have a lot of downsides. > Non-linear kernel mapping moves the small page problem from userspace back to the kernel, a really unhappy solution. Very large (object count, not object size) kernel caches can be addressed by compacting them, but I hope we won't need to do that. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with ESMTP id 134A56B01EE for ; Mon, 12 Apr 2010 02:48:58 -0400 (EDT) Date: Mon, 12 Apr 2010 16:48:51 +1000 From: Nick Piggin Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100412064850.GQ5683@laptop> References: <20100410184750.GJ5708@random.random> <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <20100411104608.GA12828@elte.hu> <4BC1B2CA.8050208@redhat.com> <20100411120800.GC10952@elte.hu> <20100412060931.GP5683@laptop> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: Pekka Enberg Cc: Ingo Molnar , Avi Kivity , Mike Galbraith , Jason Garrett-Glaser , Andrea Arcangeli , Linus Torvalds , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Mon, Apr 12, 2010 at 09:18:56AM +0300, Pekka Enberg wrote: > On Mon, Apr 12, 2010 at 9:09 AM, Nick Piggin wrote: > >> I think Andrea and Mel and you demonstrated that while defrag is futile in > >> theory (we can always fill up all of RAM with dentries and there's no 2MB > >> allocation possible), it seems rather usable in practice. > > > > One problem is that you need to keep a lot more memory free in order > > for it to be reasonably effective. Another thing is that the problem > > of fragmentation breakdown is not just a one-shot event that fills > > memory with pinned objects. It is a slow degradation.
> > > > Especially when you use something like SLUB as the memory allocator > > which requires higher order allocations for objects which are pinned > > in kernel memory. > > I guess we'd need to merge the SLUB defragmentation patches to fix that? No, that's a different problem. And SLUB 'defragmentation' isn't really defragmentation, it is just selective reclaim. Reclaimable slab memory allocations are not the problem. The problem is the allocations that you can't reclaim. The problem is this: - Memory gets fragmented by allocation of pinned pages within larger ranges so that we cannot allocate that large range. - Anti-frag improves this by putting pinned pages in different ranges and unpinned pages in different ranges. So the ranges of unpinned pages can get reclaimed to use a larger range. - However there is still an underlying problem of pinned pages causing fragmentation within their ranges. - If you require higher order allocations for pinned pages especially, then you will end up with your pinned ranges becoming fragmented and unable to satisfy the higher order allocation. So you must expand your pinned ranges into unpinned. If you only do 4K slab allocations, then things get better; however, it can of course still break down if the pinned allocation requirement grows large. It's really hard to control this because it includes anything from open files to radix tree nodes to page tables and anything that any driver or subsystem allocates with kmalloc. Basically, if you were going to add another level of indirection to solve that, you may as well just go ahead and do nonlinear mappings of the kernel memory with page tables, so you'd only have to fix up places that require translated addresses rather than everything that touches KVA. This would still be a big headache. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with ESMTP id B9E526B01EE for ; Mon, 12 Apr 2010 02:50:42 -0400 (EDT) Date: Mon, 12 Apr 2010 08:49:40 +0200 From: Ingo Molnar Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100412064940.GA7745@elte.hu> References: <20100406090813.GA14098@elte.hu> <20100410184750.GJ5708@random.random> <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <20100411104608.GA12828@elte.hu> <4BC1B2CA.8050208@redhat.com> <20100411120800.GC10952@elte.hu> <20100412060931.GP5683@laptop> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100412060931.GP5683@laptop> Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: Avi Kivity , Mike Galbraith , Jason Garrett-Glaser , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: * Nick Piggin wrote: > [...] > > Just running a few minutes of testing with a kernel compile in the > background does not show the full picture.
> You really need a box that has > been up for days running a proper workload before you are likely to see any > breakdown. AFAIK that's what Andrea has done as a test - but yes, I agree that fragmentation is the main design worry. Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with ESMTP id 013656B01EE for ; Mon, 12 Apr 2010 02:55:51 -0400 (EDT) Date: Mon, 12 Apr 2010 08:55:05 +0200 From: Ingo Molnar Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100412065505.GB7745@elte.hu> References: <20100410184750.GJ5708@random.random> <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <20100411104608.GA12828@elte.hu> <4BC1B2CA.8050208@redhat.com> <20100411120800.GC10952@elte.hu> <20100412060931.GP5683@laptop> <4BC2BF67.80903@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4BC2BF67.80903@redhat.com> Sender: owner-linux-mm@kvack.org To: Avi Kivity Cc: Nick Piggin , Mike Galbraith , Jason Garrett-Glaser , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: * Avi Kivity wrote: > > Defrag is not futile in theory, you just have to either have a reserve of > > movable pages (and never allow pinned kernel pages in there), or you need > > to allocate pinned kernel memory in units of the chunk size goal (which > > just gives you different types of fragmentation problems) or you need to > > do non-linear kernel mappings so you can defrag pinned kernel memory (with > > *lots* of other problems of course). So you just have a lot of downsides. > > Non-linear kernel mapping moves the small page problem from userspace back > to the kernel, a really unhappy solution. Note that in a theoretical sense a specific variant of non-linear kernel mappings is already implemented here and today, and is productized: it's called virtualization. Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with SMTP id 079E86B01EE for ; Mon, 12 Apr 2010 03:09:10 -0400 (EDT) Date: Mon, 12 Apr 2010 09:08:11 +0200 From: Andrea Arcangeli Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100412070811.GD5656@random.random> References: <20100406090813.GA14098@elte.hu> <20100410184750.GJ5708@random.random> <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <20100411104608.GA12828@elte.hu> <4BC1B2CA.8050208@redhat.com> <20100411120800.GC10952@elte.hu> <20100412060931.GP5683@laptop> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100412060931.GP5683@laptop> Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: Ingo Molnar , Avi Kivity , Mike Galbraith , Jason Garrett-Glaser , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Mon, Apr 12, 2010 at 04:09:31PM +1000, Nick Piggin wrote: > One problem is that you need to keep a lot more memory free in order > for it to be reasonably effective. Another thing is that the problem > of fragmentation breakdown is not just a one-shot event that fills > memory with pinned objects. It is a slow degradation. set_recommended_min_free_kbytes doesn't seem to be a function of RAM size; 60MB isn't such a big deal. > Especially when you use something like SLUB as the memory allocator > which requires higher order allocations for objects which are pinned > in kernel memory. > > Just running a few minutes of testing with a kernel compile in the > background does not show the full picture. You really need a box that > has been up for days running a proper workload before you are likely > to see any breakdown. > > I'm sure it's horrible for planning if the RDBMS or VM boxes gradually > get slower after X days of uptime. It's better to have consistent > performance really, for anything except pure benchmark setups. All data I provided is very real, in addition to building a ton of packages and running emerge on /usr/portage I've been running all my real loads. The only problem is that I only ran it for a day and a half, but the load I kept it under was significant (surely a lot bigger inode/dentry load than any hypervisor usage would ever generate). > Defrag is not futile in theory, you just have to either have a reserve > of movable pages (and never allow pinned kernel pages in there), or > you need to allocate pinned kernel memory in units of the chunk size > goal (which just gives you different types of fragmentation problems) > or you need to do non-linear kernel mappings so you can defrag pinned > kernel memory (with *lots* of other problems of course). So you just > have a lot of downsides. That's what the kernelcore= option does, no? Isn't that a good enough math guarantee? Probably we should use it in hypervisor products just in case, to be math-guaranteed to never have to use VM migration as a fallback (but definitive) defrag algorithm.
-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with SMTP id 1BE126B01EE for ; Mon, 12 Apr 2010 03:15:39 -0400 (EDT) Date: Mon, 12 Apr 2010 17:15:25 +1000 From: Nick Piggin Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100412071525.GR5683@laptop> References: <20100410184750.GJ5708@random.random> <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <20100411104608.GA12828@elte.hu> <4BC1B2CA.8050208@redhat.com> <20100411120800.GC10952@elte.hu> <20100412060931.GP5683@laptop> <4BC2BF67.80903@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4BC2BF67.80903@redhat.com> Sender: owner-linux-mm@kvack.org To: Avi Kivity Cc: Ingo Molnar , Mike Galbraith , Jason Garrett-Glaser , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Mon, Apr 12, 2010 at 09:36:23AM +0300, Avi Kivity wrote: > On 04/12/2010 09:09 AM, Nick Piggin wrote: > >On Sun, Apr 11, 2010 at 02:08:00PM +0200, Ingo Molnar wrote: > >>* Avi Kivity wrote: > >> > >>3) futility > >> > >>I think Andrea and Mel and you demonstrated that while defrag is futile in > >>theory (we can always fill up all of RAM with dentries and there's no 2MB > >>allocation possible), it seems rather usable in practice. > >One problem is that you need to keep a lot more memory free in order > >for it to be reasonably effective. > > It's the usual space-time tradeoff. You don't want to do it on a > netbook, but it's worth it on a 16GB server, which is already not > very high end. Possibly. > >Another thing is that the problem > >of fragmentation breakdown is not just a one-shot event that fills > >memory with pinned objects. It is a slow degradation. > > > >Especially when you use something like SLUB as the memory allocator > >which requires higher order allocations for objects which are pinned > >in kernel memory. > > Won't the usual antifrag tactics apply? Try to allocate those > objects from the same block. "try" is the key point. > >Just running a few minutes of testing with a kernel compile in the > >background does not show the full picture. You really need a box that > >has been up for days running a proper workload before you are likely > >to see any breakdown. > > I'm sure we'll be able to generate worst-case scenarios. I'm also > reasonably sure we'll be able to deal with them. I hope we won't > need to, but it's even possible to move dentries around. Pinned dentries? (which are the problem) That would be insane. > >I'm sure it's horrible for planning if the RDBMS or VM boxes gradually > >get slower after X days of uptime. It's better to have consistent > >performance really, for anything except pure benchmark setups. > > If that were the case we'd disable caches everywhere. General No we wouldn't.
You can have consistent, predictable performance with caches. > purpose computing is a best effort thing, we try to be fast on the > common case but we'll be slow on the uncommon case. Access to a bit Sure. And the common case for production systems like VM or database servers that are up for hundreds of days is when they are running with a lot of uptime. The common case is not a fresh reboot into a 3 hour benchmark setup. > of memory can take 3 ns if it's in cache, 100 ns if not, and 3 ms if > it's on disk. > > Here, the uncommon case will be really uncommon, most applications > (that can benefit from large pages) I'm aware of don't switch from > large anonymous working sets to a dcache load of many tiny files. > They tend to keep doing the same thing over and over again. > > I'm not saying we don't need to adapt to changing conditions (we do, > especially for kvm, that's what khugepaged is for), but as long as > we have a graceful fallback, we don't need to worry too much about > failure in extreme conditions. > > >Defrag is not futile in theory, you just have to either have a reserve > >of movable pages (and never allow pinned kernel pages in there), or > >you need to allocate pinned kernel memory in units of the chunk size > >goal (which just gives you different types of fragmentation problems) > >or you need to do non-linear kernel mappings so you can defrag pinned > >kernel memory (with *lots* of other problems of course). So you just > >have a lot of downsides. > > Non-linear kernel mapping moves the small page problem from > userspace back to the kernel, a really unhappy solution. Not unhappy for userspace-intensive workloads. And user working sets I'm sure are growing faster than kernel working sets. Also there would be nothing against compacting and merging kernel memory into larger pages. > Very large (object count, not object size) kernel caches can be > addressed by compacting them, but I hope we won't need to do that. You can't say that fragmentation is not a fundamental problem. And adding things like indirect pointers or weird crap that adds complexity to code that deals with KVA is, IMO, not acceptable. So you can't just assert that you can "address" the problem. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with SMTP id 8911B6B01EE for ; Mon, 12 Apr 2010 03:19:51 -0400 (EDT) Date: Mon, 12 Apr 2010 09:18:56 +0200 From: Andrea Arcangeli Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100412071856.GE5656@random.random> References: <20100410184750.GJ5708@random.random> <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <20100411104608.GA12828@elte.hu> <4BC1B2CA.8050208@redhat.com> <20100411120800.GC10952@elte.hu> <20100412060931.GP5683@laptop> <4BC2BF67.80903@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4BC2BF67.80903@redhat.com> Sender: owner-linux-mm@kvack.org To: Avi Kivity Cc: Nick Piggin , Ingo Molnar , Mike Galbraith , Jason Garrett-Glaser , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Mon, Apr 12, 2010 at 09:36:23AM +0300, Avi Kivity wrote: > On 04/12/2010 09:09 AM, Nick Piggin wrote: > > On Sun, Apr 11, 2010 at 02:08:00PM +0200, Ingo Molnar wrote: > > > >> * Avi Kivity wrote: > >> > >> 3) futility > >> > >> I think Andrea and Mel and you demonstrated that while defrag is futile in > >> theory (we can always fill up all of RAM with dentries and there's no 2MB > >> allocation possible), it seems rather usable in practice. > >> > > One problem is that you need to keep a lot more memory free in order > > for it to be reasonably effective. > > It's the usual space-time tradeoff. You don't want to do it on a > netbook, but it's worth it on a 16GB server, which is already not very > high end. Agreed. BTW, if booting with transparent_hugepage=0, the set_recommended_min_free_kbytes in-kernel logic won't run automatically during the late_initcall invocation. > Non-linear kernel mapping moves the small page problem from userspace > back to the kernel, a really unhappy solution. Yeah, so we have hugepages in userland but we lose them in kernel ;) and we run kmalloc as slow as vmalloc ;). I think kernelcore= here is the answer when somebody asks for the math guarantee. We should just focus on providing a math guarantee with kernelcore= and be done with it. Limiting the unmovable caches to a certain amount of RAM is orders of magnitude more flexible and transparent (and absolutely unnoticeable) than having to limit only hugepages (so unusable as regular anon memory, or regular pagecache, or any other movable entity) to a certain amount at boot (plus not being able to swap them, having to mount filesystems, using LD_PRELOAD tricks etc...). Furthermore with hypervisor usage the unmovable stuff really isn't a big deal (1G is more than enough for that even on monster servers) and we'll never care or risk to hit on the limit. All we need is the movable memory to grow freely and dynamically and being able to spread all over the RAM of the system automatically as needed. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id 1BB6C6B01EF for ; Mon, 12 Apr 2010 03:21:59 -0400 (EDT) Date: Mon, 12 Apr 2010 17:21:44 +1000 From: Nick Piggin Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100412072144.GS5683@laptop> References: <20100410184750.GJ5708@random.random> <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <20100411104608.GA12828@elte.hu> <4BC1B2CA.8050208@redhat.com> <20100411120800.GC10952@elte.hu> <20100412060931.GP5683@laptop> <20100412070811.GD5656@random.random> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100412070811.GD5656@random.random> Sender: owner-linux-mm@kvack.org To: Andrea Arcangeli Cc: Ingo Molnar , Avi Kivity , Mike Galbraith , Jason Garrett-Glaser , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Mon, Apr 12, 2010 at 09:08:11AM +0200, Andrea Arcangeli wrote: > On Mon, Apr 12, 2010 at 04:09:31PM +1000, Nick Piggin wrote: > > One problem is that you need to keep a lot more memory free in order > > for it to be reasonably effective. Another thing is that the problem > > of fragmentation breakdown is not just a one-shot event that fills > > memory with pinned objects. It is a slow degradation. > > set_recommended_min_free_kbytes doesn't seem to be a function of RAM > size; 60MB isn't such a big deal. > > > Especially when you use something like SLUB as the memory allocator > > which requires higher order allocations for objects which are pinned > > in kernel memory. > > > > Just running a few minutes of testing with a kernel compile in the > > background does not show the full picture. You really need a box that > > has been up for days running a proper workload before you are likely > > to see any breakdown. > > > > I'm sure it's horrible for planning if the RDBMS or VM boxes gradually > > get slower after X days of uptime. It's better to have consistent > > performance really, for anything except pure benchmark setups. > > All data I provided is very real, in addition to building a ton of > packages and running emerge on /usr/portage I've been running all my > real loads. The only problem is that I only ran it for a day and a half, but the > load I kept it under was significant (surely a lot bigger inode/dentry > load than any hypervisor usage would ever generate). OK, but as a solution for some kind of very specific and already highly optimized application like RDBMS, HPC, hypervisor or JVM, they could just be using hugepages themselves, couldn't they? It seems more interesting as a more general speedup for applications that can't afford such optimizations? (eg.
the common case for most people) > > Defrag is not futile in theory, you just have to either have a reserve > > of movable pages (and never allow pinned kernel pages in there), or > > you need to allocate pinned kernel memory in units of the chunk size > > goal (which just gives you different types of fragmentation problems) > > or you need to do non-linear kernel mappings so you can defrag pinned > > kernel memory (with *lots* of other problems of course). So you just > > have a lot of downsides. > > That's what the kernelcore= option does, no? Isn't that a good enough > math guarantee? Probably we should use it in hypervisor products just > in case, to be math-guaranteed to never have to use VM migration as > fallback (but definitive) defrag algorithm. Yes we do have the option to reserve pages and as far as I know it should work, although I can't remember whether it deals with mlock. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with SMTP id 3C1BE6B01EE for ; Mon, 12 Apr 2010 03:36:29 -0400 (EDT) Date: Mon, 12 Apr 2010 09:35:30 +0200 From: Andrea Arcangeli Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100412073530.GF5656@random.random> References: <20100410184750.GJ5708@random.random> <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <20100411104608.GA12828@elte.hu> <4BC1B2CA.8050208@redhat.com> <20100411120800.GC10952@elte.hu> <20100412060931.GP5683@laptop> <20100412064940.GA7745@elte.hu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100412064940.GA7745@elte.hu> Sender: owner-linux-mm@kvack.org To: Ingo Molnar Cc: Nick Piggin , Avi Kivity , Mike Galbraith , Jason Garrett-Glaser , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Mon, Apr 12, 2010 at 08:49:40AM +0200, Ingo Molnar wrote: > AFAIK that's what Andrea has done as a test - but yes, i agree that > fragmentation is the main design worry. Well, I didn't only run a kernel compile for a couple of minutes to show how memory compaction + in-kernel set_recommended_min_free_kbytes behaved on my system. I can't claim my numbers are conclusive as it only ran for a day and a half, but there was some real unmovable load on it. Plus uptime isn't the only variable: if you use the kernel to create a hypervisor product, you can leave it running VMs for much longer than a day, and it won't ever generate the amount of unmovable load that I generated in a day and a half, I guess. I built a ton of packages including gcc, bison (which in javac triggered the anon-vma bug before I backed it out) plus quite some other stuff that comes as a regular update with a couple of emerge world like kvirc and stuff like that.
There was mutt on lkml and linux-mm maildir with some hundred thousand inodes for the email, and a dozen kernel builds and git checkouts to verify my aa.git tree. That's what I can recall. After a day and a half I still had ~80% of the unallocated ram in order 9 and maybe ~75% (from memory, could have been more or less, I don't remember exactly, but I posted the exact buddyinfo so you can calculate yourself if curious) in order 10 == MAX_ORDER. The vast majority of the free ram was in order 10 after echo 3 >drop_caches and echo >compact_memory, which simulates the maximum ability of the VM to generate hugepages dynamically (of course it won't ever create such a totally compacted buddyinfo at runtime as we don't want to shrink or compact stuff unless it's really needed). Likely if I had killed mutt and the other running apps and run drop_caches and memory compaction again, I would have gotten an even higher ratio as a result of more memory being freeable. A day and a half isn't enough, but it was initial data, and then I had to reboot into a new #20 release to test a memleak fix I did in do_huge_pmd_wp_page_fallback... I'll try to run it for a longer time now. I guess I'll be rebuilding quite some glibc on my system as we optimize it for the kernel. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with SMTP id A52956B01EE for ; Mon, 12 Apr 2010 03:46:22 -0400 (EDT) Message-ID: <4BC2CF8C.5090108@redhat.com> Date: Mon, 12 Apr 2010 10:45:16 +0300 From: Avi Kivity MIME-Version: 1.0 Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 References: <20100410184750.GJ5708@random.random> <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <20100411104608.GA12828@elte.hu> <4BC1B2CA.8050208@redhat.com> <20100411120800.GC10952@elte.hu> <20100412060931.GP5683@laptop> <4BC2BF67.80903@redhat.com> <20100412071525.GR5683@laptop> In-Reply-To: <20100412071525.GR5683@laptop> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: Ingo Molnar , Mike Galbraith , Jason Garrett-Glaser , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On 04/12/2010 10:15 AM, Nick Piggin wrote: > >>> Another thing is that the problem >>> of fragmentation breakdown is not just a one-shot event that fills >>> memory with pinned objects. It is a slow degradation. >>> >>> Especially when you use something like SLUB as the memory allocator >>> which requires higher order allocations for objects which are pinned >>> in kernel memory. >>> >> Won't the usual antifrag tactics apply? Try to allocate those >> objects from the same block. >> > "try" is the key point. > We use the "try" tactic extensively. So long as there's a reasonable chance of success, and a reasonable fallback on failure, it's fine.
Do you think we won't have reasonable success rates? Why? > > >>> Just running a few minutes of testing with a kernel compile in the >>> background does not show the full picture. You really need a box that >>> has been up for days running a proper workload before you are likely >>> to see any breakdown. >>> >> I'm sure we'll be able to generate worst-case scenarios. I'm also >> reasonably sure we'll be able to deal with them. I hope we won't >> need to, but it's even possible to move dentries around. >> > Pinned dentries? (which are the problem) That would be insane. > Why? If you can isolate all the pointers into the dentry, allocate the new dentry, make the old one point into the new one, hash it, move the pointers, drop the old dentry. Difficult, yes, but insane? >>> I'm sure it's horrible for planning if the RDBMS or VM boxes gradually >>> get slower after X days of uptime. It's better to have consistent >>> performance really, for anything except pure benchmark setups. >>> >> If that were the case we'd disable caches everywhere. General >> > No we wouldn't. You can have consistent, predictable performance with > caches. > Caches have statistical performance. In the long run they average out. In the short run they can behave badly. Same thing with large pages, except the runs are longer and the wins are smaller. >> purpose computing is a best effort thing, we try to be fast on the >> common case but we'll be slow on the uncommon case. Access to a bit >> > Sure. And the common case for production systems like VM or databse > servers that are up for hundreds of days is when they are running with > a lot of uptime. Common case is not a fresh reboot into a 3 hour > benchmark setup. > Database are the easiest case, they allocate memory up front and don't give it up. We'll coalesce their memory immediately and they'll run happily ever after. Virtualization will fragment on overcommit, but the load is all anonymous memory, so it's easy to defragment. Very little dcache on the host. >> Non-linear kernel mapping moves the small page problem from >> userspace back to the kernel, a really unhappy solution. >> > Not unhappy for userspace intensive workloads. And user working sets > I'm sure are growing faster than kernel working set. Also there would > be nothing against compacting and merging kernel memory into larger > pages. > Well, I'm not against it, but that would be a much more intrusive change than what this thread is about. Also, you'd need 4K dentries etc, no? >> Very large (object count, not object size) kernel caches can be >> addressed by compacting them, but I hope we won't need to do that. >> > You can't say that fragmentation is not a fundamental problem. And > adding things like indirect pointers or weird crap adding complexity > to code that deals with KVA IMO is not acceptable. So you can't > just assert that you can "address" the problem. > Mostly we need a way of identifying pointers into a data structure, like rmap (after all that's what makes transparent hugepages work). -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
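For reference, the buddyinfo measurement Andrea describes earlier in this thread can be restated as a concrete recipe. The drop_caches and buddyinfo interfaces are the standard ones; the compact_memory knob is the one added by the memory compaction patches, so the exact spelling below is a hedged assumption (in later kernels it is /proc/sys/vm/compact_memory and takes a 1):

  sync
  echo 3 > /proc/sys/vm/drop_caches      # drop clean pagecache, dentries and inodes
  echo 1 > /proc/sys/vm/compact_memory   # compact all zones
  cat /proc/buddyinfo                    # then count the free pages left in orders 9 and 10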
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with ESMTP id 78A7B6B01EF for ; Mon, 12 Apr 2010 03:47:19 -0400 (EDT) Date: Mon, 12 Apr 2010 09:45:57 +0200 From: Ingo Molnar Subject: Re: hugepages will matter more in the future Message-ID: <20100412074557.GA18485@elte.hu> References: <4BC19663.8080001@redhat.com> <4BC19916.20100@redhat.com> <20100411110015.GA10149@elte.hu> <4BC1B034.4050302@redhat.com> <20100411115229.GB10952@elte.hu> <4BC1EE13.7080702@redhat.com> <4BC1F31E.2050009@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4BC1F31E.2050009@redhat.com> Sender: owner-linux-mm@kvack.org To: Avi Kivity Cc: Linus Torvalds , Jason Garrett-Glaser , Mike Galbraith , Andrea Arcangeli , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura , Arjan van de Ven List-ID: * Avi Kivity wrote: > On 04/11/2010 06:52 PM, Linus Torvalds wrote: > > > >On Sun, 11 Apr 2010, Avi Kivity wrote: > >> > >> And yet Oracle and java have options to use large pages, and we know > >> google and HPC like 'em. Maybe they just haven't noticed the fundamental > >> brokenness yet. ( Add Firefox to the mix too - it too allocates in 1MB/2MB chunks. Perhaps Xorg as well. ) > > The thing is, what you are advocating is what traditional UNIX did. > > Prioritizing the special cases rather than the generic workloads. > > > > And I'm telling you, it's wrong. Traditional Unix is dead, and it's dead > > exactly _because_ it prioritized those kinds of loads. > > This is not a specialized workload. Plenty of sites are running java, > plenty of sites are running Oracle (though that won't benefit from anonymous > hugepages), and plenty of sites are running virtualization. Not everyone > does two kernel builds before breakfast. Java/virtualization/DBs, and, to a certain sense Firefox have basically become meta-kernels: they offer their own intermediate APIs to their own style of apps - and those apps generally have no direct access to the native Linux kernel. And just like the native kernel has been enjoying the benefits of 2MB pages for more than a decade, do these other entities want to enjoy similar benefits as well. Fair is fair. Like it or not, combined end-user attention/work spent in these meta-kernels is rising steadily, while apps written in raw C are becoming the exception. So IMHO we really have roughly three logical choices: 1) either we accept that the situation is the fault of our technology and subsequently we reform and modernize the Linux syscall ABIs to be more friendly to apps (offer built-in GC and perhaps JIT concepts, perhaps offer a compiler, offer a wider range of libraries with better integration, etc.) 2) or we accept the fact that the application space is shifting to the meta-kernels - and then we should agressively optimize Linux for those meta-kernels and not pretend that they are 'specialized'. They literally represent tens of thousands of applications apiece. 
3) or we should continue to muddle through somewhere in the middle, hoping that the 'pure C apps' win in the end (despite 10 years of a decline) and pretend that the meta-kernels are just 'specialized' workloads. Right now we are doing 3) and i think it's delusive and a mistake. I think we should be doing 1) - but failing that we have to be honest and do 2). Thanks, Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id 6E1286B01EE for ; Mon, 12 Apr 2010 03:51:41 -0400 (EDT) Message-ID: <4BC2D0C9.3060201@redhat.com> Date: Mon, 12 Apr 2010 10:50:33 +0300 From: Avi Kivity MIME-Version: 1.0 Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 References: <20100410184750.GJ5708@random.random> <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <20100411104608.GA12828@elte.hu> <4BC1B2CA.8050208@redhat.com> <20100411120800.GC10952@elte.hu> <20100412060931.GP5683@laptop> <20100412070811.GD5656@random.random> <20100412072144.GS5683@laptop> In-Reply-To: <20100412072144.GS5683@laptop> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: Andrea Arcangeli , Ingo Molnar , Mike Galbraith , Jason Garrett-Glaser , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On 04/12/2010 10:21 AM, Nick Piggin wrote: >> >> All data I provided is very real, in addition to building a ton of >> packages and running emerge on /usr/portage I've been running all my >> real loads. Only problem I only run it for 1 day and half, but the >> load I kept it under was significant (surely a lot bigger inode/dentry >> load that any hypervisor usage would ever generate). >> > OK, but as a solution for some kind of very specific and highly > optimized application already like RDBMS, HPC, hypervisor or JVM, > they could just be using hugepages themselves, couldn't they? > > It seems more interesting as a more general speedup for applications > that can't afford such optimizations? (eg. the common case for > most people) > The problem with hugetlbfs is that you need to commit upfront to using it, and that you need to be the admin. For virtualization, you want to use hugepages when there is no memory pressure, but you want to use ksm, ballooning, and swapping when there is (and then go back to large pages when pressure is relieved, e.g. by live migration). HPC and databases can probably live with hugetlbfs. JVM is somewhere in the middle, they do allocate memory dynamically. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
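To make Avi's "commit upfront and be the admin" point concrete, this is roughly what static hugetlbfs usage looks like (standard interfaces of that era; the pool size and mount point are illustrative):

  # as root: reserve 512 2M pages -- this may fail once memory is fragmented
  echo 512 > /proc/sys/vm/nr_hugepages
  # expose the pool to applications
  mount -t hugetlbfs none /mnt/huge

The application then still has to be modified to map files under /mnt/huge (or to pass SHM_HUGETLB / MAP_HUGETLB), which is exactly the opt-in barrier transparent hugepages is meant to remove.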
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id 996B06B01F1 for ; Mon, 12 Apr 2010 03:52:12 -0400 (EDT) Date: Mon, 12 Apr 2010 09:51:49 +0200 From: Ingo Molnar Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100412075149.GB18485@elte.hu> References: <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <20100411104608.GA12828@elte.hu> <4BC1B2CA.8050208@redhat.com> <20100411120800.GC10952@elte.hu> <20100412060931.GP5683@laptop> <4BC2BF67.80903@redhat.com> <20100412071525.GR5683@laptop> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100412071525.GR5683@laptop> Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: Avi Kivity , Mike Galbraith , Jason Garrett-Glaser , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: * Nick Piggin wrote: > [...] Common case is not a fresh reboot into a 3 hour benchmark setup. Again - that's not what Andrea has done as a test: he has tested an atypically intense workload for more than a day. Which, if it's true, is good enough as far as i'm concerned - even if we assume that it deteriorates after 2 days of uptime. If after a day of intense uptime it's still usable then a few seconds of a dcache compaction run (spread out over a day) doesnt look unrealistic. Thanks, Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id 3ABF76B01EE for ; Mon, 12 Apr 2010 04:07:10 -0400 (EDT) Date: Mon, 12 Apr 2010 10:06:26 +0200 From: Andrea Arcangeli Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100412080626.GG5656@random.random> References: <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <20100411104608.GA12828@elte.hu> <4BC1B2CA.8050208@redhat.com> <20100411120800.GC10952@elte.hu> <20100412060931.GP5683@laptop> <20100412070811.GD5656@random.random> <20100412072144.GS5683@laptop> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100412072144.GS5683@laptop> Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: Ingo Molnar , Avi Kivity , Mike Galbraith , Jason Garrett-Glaser , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. 
Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Mon, Apr 12, 2010 at 05:21:44PM +1000, Nick Piggin wrote: > On Mon, Apr 12, 2010 at 09:08:11AM +0200, Andrea Arcangeli wrote: > > On Mon, Apr 12, 2010 at 04:09:31PM +1000, Nick Piggin wrote: > > > One problem is that you need to keep a lot more memory free in order > > > for it to be reasonably effective. Another thing is that the problem > > > of fragmentation breakdown is not just a one-shot event that fills > > > memory with pinned objects. It is a slow degradation. > > > > set_recommended_min_free_kbytes doesn't seem to be a function of ram > > size, 60MB aren't such a big deal. > > > > > Especially when you use something like SLUB as the memory allocator > > > which requires higher order allocations for objects which are pinned > > > in kernel memory. > > > > > > Just running a few minutes of testing with a kernel compile in the > > > background does not show the full picture. You really need a box that > > > has been up for days running a proper workload before you are likely > > > to see any breakdown. > > > > > > I'm sure it's horrible for planning if the RDBMS or VM boxes gradually > > > get slower after X days of uptime. It's better to have consistent > > > performance really, for anything except pure benchmark setups. > > > > All data I provided is very real, in addition to building a ton of > > packages and running emerge on /usr/portage I've been running all my > > real loads. Only problem is I only ran it for a day and a half, but the > > load I kept it under was significant (surely a lot bigger inode/dentry > > load than any hypervisor usage would ever generate). > > OK, but as a solution for some kind of very specific and highly > optimized application already like RDBMS, HPC, hypervisor or JVM, > they could just be using hugepages themselves, couldn't they? > > It seems more interesting as a more general speedup for applications > that can't afford such optimizations? (eg. the common case for > most people) The reality is that very few are using hugetlbfs. I guess maybe 0.1% of KVM instances on Phenom/Nehalem chips are running on hugetlbfs for example (hugetlbfs boot reservation doesn't fit the cloud where you need all ram available in hugetlbfs and you still need 100% of unused ram as host pagecache for VDI), despite the fact it would provide a >=6% boost to every VM no matter what's running in the guest. Same goes for the JVM, maybe 0.1% of those run on hugetlbfs. The commercial DBMS are the exception and they're probably closer to 99% running on hugetlbfs (and they have to keep using hugetlbfs until we move transparent hugepages into tmpfs). So there's a ton of wasted energy in my view. Like Ingo said, the faster they make the chips and the cheaper the RAM becomes, the more wasted energy as a result of not using hugetlbfs. There's an ever bigger difference between cache sizes and ram sizes, and also between cache speeds and ram speeds. I don't see this trend ending, and I can't imagine the better CPU that would make hugetlbfs worthless and unselectable at kernel configure time on the x86 arch (if you build without generic). And I don't think it's feasible to ship a distro where 99% of apps that can benefit from hugepages are running with LD_PRELOAD=libhugetlbfs.so. It has to be transparent if we want to stop the waste. The main reason I've always been skeptical about transparent hugepages before I started working on this is the mess they generate on the whole kernel.
So my priority of course has been to keep it as self contained as possible. It kept spilling over and over until I managed to confine it to anonymous pages and fix whole mm/*.c files with just a one liner (even the hugepage aware implementation that Johannes did still takes advantage of split_huge_page_pmd if the mprotect start/end isn't 2M naturally aligned, just to show how complex it would be to do it all at once). This will allow us to reach a solid base, and then later move to tmpfs and maybe later to pagecache and swapcache too. Expecting the whole kernel to become hugepage aware at once is a total mess: gup would need to return only head pages for example, breaking hundreds of drivers in just that change. The compound_lock can be removed after you fix all those hundreds of drivers and subsystems using gup... No big deal to remove it later, much like the big kernel lock is being removed these days, 14 years after it was introduced. Plus I did all I could to try to keep it as black and white as possible. I think other OSes are more gray in their approaches; my priority has been to pay for RAM anywhere I could if you set enabled=always, and to decrease as much as I could any risk of performance regressions in any workload. These days we can afford to lose 1G without much worry if it speeds up the workload 8%, so I think the other designs are better suited to old, RAM-constrained hardware that is no longer very relevant. On embedded with my patchset one should set enabled=madvise. Ingo suggested a per-process tweak to enable it selectively on certain apps; that is feasible too in the future (so people won't be forced to modify binaries to add madvise if they can't leave enabled=always). > Yes we do have the option to reserve pages and as far as I know it > should work, although I can't remember whether it deals with mlock. I think that is the right route to take for whoever needs the math guarantees, and for many products it won't even be noticeable to enforce the math guarantee. It's kind of like overcommit: some prefer the overcommit_memory=2 strict mode and maybe they don't even notice it allows them to allocate less memory. Others prefer to be able to allocate ram without accounting for the unused virtual regions despite the bigger chance to run into the oom killer (and I'm in the latter camp for both the overcommit sysctl and kernelcore= ;). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
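A minimal userland sketch of the enabled=madvise opt-in path Andrea mentions above. MADV_HUGEPAGE is the advice introduced by this patchset, so a contemporary libc won't define it yet and its availability is an assumption here; the surrounding calls are plain POSIX:

  #include <sys/mman.h>
  #include <stddef.h>

  /* With enabled=madvise, only regions flagged like this get hugepages. */
  static void *alloc_thp_region(size_t len)
  {
          void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          if (p != MAP_FAILED)
                  madvise(p, len, MADV_HUGEPAGE); /* advisory: failure is harmless */
          return p;
  }

Under enabled=always the madvise call is unnecessary; under enabled=never it is ignored, which is what makes the knob safe to flip at runtime.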
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with ESMTP id EDDBC6B01EE for ; Mon, 12 Apr 2010 04:08:35 -0400 (EDT) Date: Mon, 12 Apr 2010 10:07:48 +0200 From: Ingo Molnar Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100412080748.GC18485@elte.hu> References: <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <20100411104608.GA12828@elte.hu> <4BC1B2CA.8050208@redhat.com> <20100411120800.GC10952@elte.hu> <20100412060931.GP5683@laptop> <20100412070811.GD5656@random.random> <20100412072144.GS5683@laptop> <4BC2D0C9.3060201@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4BC2D0C9.3060201@redhat.com> Sender: owner-linux-mm@kvack.org To: Avi Kivity Cc: Nick Piggin , Andrea Arcangeli , Mike Galbraith , Jason Garrett-Glaser , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: * Avi Kivity wrote: > On 04/12/2010 10:21 AM, Nick Piggin wrote: > >> > >>All data I provided is very real, in addition to building a ton of > >>packages and running emerge on /usr/portage I've been running all my > >>real loads. Only problem I only run it for 1 day and half, but the > >>load I kept it under was significant (surely a lot bigger inode/dentry > >>load that any hypervisor usage would ever generate). > >OK, but as a solution for some kind of very specific and highly > >optimized application already like RDBMS, HPC, hypervisor or JVM, > >they could just be using hugepages themselves, couldn't they? > > > > It seems more interesting as a more general speedup for applications that > > can't afford such optimizations? (eg. the common case for most people) > > The problem with hugetlbfs is that you need to commit upfront to using it, > and that you need to be the admin. For virtualization, you want to use > hugepages when there is no memory pressure, but you want to use ksm, > ballooning, and swapping when there is (and then go back to large pages when > pressure is relieved, e.g. by live migration). > > HPC and databases can probably live with hugetlbfs. JVM is somewhere in the > middle, they do allocate memory dynamically. Even for HPC hugetlbfs is often not good enough: if the data is being constantly acquired and put into a file and if it needs to be in persistent storage then you dont want to (and cannot) copy it to hugetlbfs (on a poweroff you would lose the file). Furthermore there's also the deployment barrier of marginal improvements: not many apps are willing to change for a +0.1% improvement - or even for a +0.9% improvement - _especially_ if that improvement also needs admin access and per distribution hackery. (each distribution tends to have their own slightly different way of handing filesystems and other permission/configuration matters) We've seen that with sendfile() and splice() an it's no different with hugetlbs either. hugetlbfs is basically a non-default poor-man's solution for something that the kernel should be providing transparently. 
It's a bad hack that is good enough to prototype that something works, but it has serious deployment, configuration and usage limitations. Only a kernel hacker detached from everyday application development and packaging constraints can believe that it's a high-quality technical solution. Transparent hugepages eliminates most of the app-visible disadvantages by shuffling the problems into the kernel [and no doubt causing follow-on headaches there] and by utilizing the 'power of the default' - and thus opening up hugetlbs to far more apps. [*] It's a really simple mechanism. Thanks, Ingo [*] Note, it would be even better if the kernel provided the C library [a'ka klibc] and if hugetlbs could be utilized via malloc() et al more transparently by us changing the user-space library in the kernel repo and deploying it to apps via a new kernel that provides an updated C library. We dont do that so we are stuck with crappier solutions and slower propagation of changes. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with SMTP id 76A8E6B01EE for ; Mon, 12 Apr 2010 04:14:38 -0400 (EDT) Date: Mon, 12 Apr 2010 18:14:31 +1000 From: Nick Piggin Subject: Re: hugepages will matter more in the future Message-ID: <20100412081431.GT5683@laptop> References: <4BC19916.20100@redhat.com> <20100411110015.GA10149@elte.hu> <4BC1B034.4050302@redhat.com> <20100411115229.GB10952@elte.hu> <4BC1EE13.7080702@redhat.com> <4BC1F31E.2050009@redhat.com> <20100412074557.GA18485@elte.hu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100412074557.GA18485@elte.hu> Sender: owner-linux-mm@kvack.org To: Ingo Molnar Cc: Avi Kivity , Linus Torvalds , Jason Garrett-Glaser , Mike Galbraith , Andrea Arcangeli , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura , Arjan van de Ven List-ID: On Mon, Apr 12, 2010 at 09:45:57AM +0200, Ingo Molnar wrote: > > * Avi Kivity wrote: > > > On 04/11/2010 06:52 PM, Linus Torvalds wrote: > > > > > >On Sun, 11 Apr 2010, Avi Kivity wrote: > > >> > > >> And yet Oracle and java have options to use large pages, and we know > > >> google and HPC like 'em. Maybe they just haven't noticed the fundamental > > >> brokenness yet. > > ( Add Firefox to the mix too - it too allocates in 1MB/2MB chunks. Perhaps > Xorg as well. ) > > > > The thing is, what you are advocating is what traditional UNIX did. > > > Prioritizing the special cases rather than the generic workloads. > > > > > > And I'm telling you, it's wrong. Traditional Unix is dead, and it's dead > > > exactly _because_ it prioritized those kinds of loads. > > > > This is not a specialized workload. Plenty of sites are running java, > > plenty of sites are running Oracle (though that won't benefit from anonymous > > hugepages), and plenty of sites are running virtualization. Not everyone > > does two kernel builds before breakfast. 
> > Java/virtualization/DBs, and, to a certain sense Firefox have basically become > meta-kernels: they offer their own intermediate APIs to their own style of > apps - and those apps generally have no direct access to the native Linux > kernel. > > And just like the native kernel has been enjoying the benefits of 2MB pages > for more than a decade, do these other entities want to enjoy similar benefits > as well. Fair is fair. > > Like it or not, combined end-user attention/work spent in these meta-kernels > is rising steadily, while apps written in raw C are becoming the exception. > > So IMHO we really have roughly three logical choices: I don't see how these are the logical choices. I don't really see how they are even logical in some ways. Let's say that Andrea's patches offer 5% improvement in best-cases (that are not stupid microbenchmarks) and 0% in worst cases, and X% "on average" (whatever that means). Then it is simply a set of things to weigh against the added complexity (both in terms of code and performance characteristics of the system) that it is introduced. I don't really see how it is fundamentally different to any other patch that speeds things up. > 1) either we accept that the situation is the fault of our technology and > subsequently we reform and modernize the Linux syscall ABIs to be more > friendly to apps (offer built-in GC and perhaps JIT concepts, perhaps > offer a compiler, offer a wider range of libraries with better > integration, etc.) I don't see how this would bring transparent hugepages to userspace. We may offload some services to the kernel, but the *memory mappings* that get used by userspace obviously still go through TLBs. > 2) or we accept the fact that the application space is shifting to the > meta-kernels - and then we should agressively optimize Linux for those > meta-kernels and not pretend that they are 'specialized'. They literally > represent tens of thousands of applications apiece. And if meta-kernels (or whatever you want to call a common or important workload) see some speedup that is deemed to be worth the cost of the patch, then it will probably get merged. Same as anything else. > 3) or we should continue to muddle through somewhere in the middle, hoping > that the 'pure C apps' win in the end (despite 10 years of a decline) and > pretend that the meta-kernels are just 'specialized' workloads. 'pure C apps' (I don't know what you mean by this, but just non-GC memory?) can still see benefits from using hugepages. And I wouldn't say we're muddling through. Linux has been one of the if not the most successful OS kernel of the last 10 years not because of muddling. IMO in large part it is because we haven't been forced to tick boxes for marketing idiots or be pressured by special interests to the detriment of the common cases. > Right now we are doing 3) and i think it's delusive and a mistake. I think we > should be doing 1) - but failing that we have to be honest and do 2). Nothing wrong with carefully evaluating a performance improvement, but there is nothing urgent or huge fundamental reason we need to lose our heads and be irrational about it. If the world was coming to an end without hugepages, then we'd see more than 5% improvement I would have thought. Fact is that computing is based on locality of reference, and performance has continued to scale long past the big bad "memory wall" because real working set sizes (on the scale of CPU instructions, not on the scale of page reclaim) have not grown linearly with RAM sizes. 
Probably logarithmically or something. Sure there are some pointer chasing apps that will always (and ~have always) suck. We are also irreversibly getting into explicit parallelism (like multi core and multi threading) to work around all sorts of fundamental limits to single thread performance, not just TLB filling. So let's not be melodramatic about this :) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with SMTP id 3BCBA6B01EE for ; Mon, 12 Apr 2010 04:19:10 -0400 (EDT) Date: Mon, 12 Apr 2010 10:18:13 +0200 From: Andrea Arcangeli Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100412081813.GH5656@random.random> References: <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <20100411104608.GA12828@elte.hu> <4BC1B2CA.8050208@redhat.com> <20100411120800.GC10952@elte.hu> <20100412060931.GP5683@laptop> <20100412070811.GD5656@random.random> <20100412072144.GS5683@laptop> <4BC2D0C9.3060201@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4BC2D0C9.3060201@redhat.com> Sender: owner-linux-mm@kvack.org To: Avi Kivity Cc: Nick Piggin , Ingo Molnar , Mike Galbraith , Jason Garrett-Glaser , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Mon, Apr 12, 2010 at 10:50:33AM +0300, Avi Kivity wrote: > The problem with hugetlbfs is that you need to commit upfront to using > it, and that you need to be the admin. For virtualization, you want to > use hugepages when there is no memory pressure, but you want to use ksm, > ballooning, and swapping when there is (and then go back to large pages > when pressure is relieved, e.g. by live migration). > > HPC and databases can probably live with hugetlbfs. JVM is somewhere in > the middle, they do allocate memory dynamically. I guess lots of the recent work on hugetlbfs has been exactly meant to try to make hugetlbfs more palatable to things like the JVM; the end result is that it's growing its own parallel VM that is still very crippled compared to the real kernel VM. I see very long term value in hugetlbfs, for example for CPUs that can't mix different page sizes in the same VMA, or for the 1G page reservation (no way we're going to slow down everything by increasing MAX_ORDER so much by default, even if fragmentation issues didn't grow exponentially with the order), but I think hugetlbfs should remain simple and cover these use cases optimally, without trying to expand itself into the dynamic area of transparent usages where it wasn't designed to be used in the first place and where it's not a very good fit. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
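A quick back-of-the-envelope check of the MAX_ORDER point above, using x86 numbers: with 4k base pages, a 2M hugepage is 2^21 / 2^12 = 2^9 pages, i.e. order 9, which fits under the default MAX_ORDER of 11 (largest buddy block = order 10 = 4M). A 1G page is 2^30 / 2^12 = 2^18 pages, so serving those dynamically from the buddy allocator would need MAX_ORDER raised to 19, which is why boot-time reservation through hugetlbfs remains the only realistic way to get them.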
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with SMTP id 45F566B01EE for ; Mon, 12 Apr 2010 04:22:19 -0400 (EDT) Date: Mon, 12 Apr 2010 10:21:43 +0200 From: Andrea Arcangeli Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100412082143.GI5656@random.random> References: <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <20100411104608.GA12828@elte.hu> <4BC1B2CA.8050208@redhat.com> <20100411120800.GC10952@elte.hu> <20100412060931.GP5683@laptop> <20100412070811.GD5656@random.random> <20100412072144.GS5683@laptop> <4BC2D0C9.3060201@redhat.com> <20100412080748.GC18485@elte.hu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100412080748.GC18485@elte.hu> Sender: owner-linux-mm@kvack.org To: Ingo Molnar Cc: Avi Kivity , Nick Piggin , Mike Galbraith , Jason Garrett-Glaser , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Mon, Apr 12, 2010 at 10:07:48AM +0200, Ingo Molnar wrote: > configuration and usage limitations. Only a kernel hacker detached from > everyday application development and packaging constraints can believe that > it's a high-quality technical solution. That made my day ;) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with ESMTP id 60DA36B01EF for ; Mon, 12 Apr 2010 04:22:58 -0400 (EDT) Date: Mon, 12 Apr 2010 10:22:18 +0200 From: Ingo Molnar Subject: Re: hugepages will matter more in the future Message-ID: <20100412082218.GA7380@elte.hu> References: <4BC19916.20100@redhat.com> <20100411110015.GA10149@elte.hu> <4BC1B034.4050302@redhat.com> <20100411115229.GB10952@elte.hu> <4BC1EE13.7080702@redhat.com> <4BC1F31E.2050009@redhat.com> <20100412074557.GA18485@elte.hu> <20100412081431.GT5683@laptop> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100412081431.GT5683@laptop> Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: Avi Kivity , Linus Torvalds , Jason Garrett-Glaser , Mike Galbraith , Andrea Arcangeli , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura , Arjan van de Ven List-ID: * Nick Piggin wrote: > > 2) or we accept the fact that the application space is shifting to the > > meta-kernels - and then we should agressively optimize Linux for those > > meta-kernels and not pretend that they are 'specialized'. They literally > > represent tens of thousands of applications apiece. 
> > And if meta-kernels (or whatever you want to call a common or important > workload) see some speedup that is deemed to be worth the cost of the patch, > then it will probably get merged. Same as anything else. I call a 'meta kernel' something that people code thousands of apps for, instead of coding on the native kernel. JVM/DBs/Firefox are such frameworks. (you can call it middleware i guess) By all means they are not a 'single special-purpose workload' but represent literally tens of thousands of apps. Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with SMTP id 8823A6B01EE for ; Mon, 12 Apr 2010 04:28:57 -0400 (EDT) Date: Mon, 12 Apr 2010 18:28:44 +1000 From: Nick Piggin Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100412082844.GU5683@laptop> References: <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <20100411104608.GA12828@elte.hu> <4BC1B2CA.8050208@redhat.com> <20100411120800.GC10952@elte.hu> <20100412060931.GP5683@laptop> <4BC2BF67.80903@redhat.com> <20100412071525.GR5683@laptop> <4BC2CF8C.5090108@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4BC2CF8C.5090108@redhat.com> Sender: owner-linux-mm@kvack.org To: Avi Kivity Cc: Ingo Molnar , Mike Galbraith , Jason Garrett-Glaser , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Mon, Apr 12, 2010 at 10:45:16AM +0300, Avi Kivity wrote: > On 04/12/2010 10:15 AM, Nick Piggin wrote: > > > >>>Another thing is that the problem > >>>of fragmentation breakdown is not just a one-shot event that fills > >>>memory with pinned objects. It is a slow degredation. > >>> > >>>Especially when you use something like SLUB as the memory allocator > >>>which requires higher order allocations for objects which are pinned > >>>in kernel memory. > >>Won't the usual antifrag tactics apply? Try to allocate those > >>objects from the same block. > >"try" is the key point. > > We use the "try" tactic extensively. So long as there's a > reasonable chance of success, and a reasonable fallback on failure, > it's fine. > > Do you think we won't have reasonable success rates? Why? After the memory is fragmented? It's more or less irriversable. So success rates (to fill a specific number of huges pages) will be fine up to a point. Then it will be a continual failure. Sure, some workloads simply won't trigger fragmentation problems. Others will. > >>>Just running a few minutes of testing with a kernel compile in the > >>>background does not show the full picture. You really need a box that > >>>has been up for days running a proper workload before you are likely > >>>to see any breakdown. > >>I'm sure we'll be able to generate worst-case scenarios. I'm also > >>reasonably sure we'll be able to deal with them. 
I hope we won't > >>need to, but it's even possible to move dentries around. > >Pinned dentries? (which are the problem) That would be insane. > > Why? If you can isolate all the pointers into the dentry, allocate > the new dentry, make the old one point into the new one, hash it, > move the pointers, drop the old dentry. > > Difficult, yes, but insane? Yes. > >>>I'm sure it's horrible for planning if the RDBMS or VM boxes gradually > >>>get slower after X days of uptime. It's better to have consistent > >>>performance really, for anything except pure benchmark setups. > >>If that were the case we'd disable caches everywhere. General > >No we wouldn't. You can have consistent, predictable performance with > >caches. > > Caches have statistical performance. In the long run they average > out. In the short run they can behave badly. Same thing with large > pages, except the runs are longer and the wins are smaller. You don't understand. Caches don't suddenly or slowly stop working. For a particular pattern of workload, they statistically pretty much work the same all the time. > >>purpose computing is a best effort thing, we try to be fast on the > >>common case but we'll be slow on the uncommon case. Access to a bit > >Sure. And the common case for production systems like VM or databse > >servers that are up for hundreds of days is when they are running with > >a lot of uptime. Common case is not a fresh reboot into a 3 hour > >benchmark setup. > > Database are the easiest case, they allocate memory up front and > don't give it up. We'll coalesce their memory immediately and > they'll run happily ever after. Again, you're thinking about a benchmark setup. If you've got various admin things, backups, scripts running, probably web servers, application servers etc. Then it's not all that simple. And yes, Linux works pretty well for a multi-workload platform. You might be thinking too much about virtualization where you put things in sterile little boxes and take the performance hit. > Virtualization will fragment on overcommit, but the load is all > anonymous memory, so it's easy to defragment. Very little dcache on > the host. If virtualization is the main worry (which it seems that it is seeing as your TLB misses cost like 6 times more cachelines), then complexity should be pushed into the hypervisor, not the core kernel. > >>Non-linear kernel mapping moves the small page problem from > >>userspace back to the kernel, a really unhappy solution. > >Not unhappy for userspace intensive workloads. And user working sets > >I'm sure are growing faster than kernel working set. Also there would > >be nothing against compacting and merging kernel memory into larger > >pages. > > Well, I'm not against it, but that would be a much more intrusive > change than what this thread is about. Also, you'd need 4K dentries > etc, no? No. You'd just be defragmenting 4K worth of dentries at a time. Dentries (and anything that doesn't care about untranslated KVA) are trivial. Zero change for users of the code. This is going off-topic though, I don't want to hijack the thread with talk of nonlinear kernel. > >>Very large (object count, not object size) kernel caches can be > >>addressed by compacting them, but I hope we won't need to do that. > >You can't say that fragmentation is not a fundamental problem. And > >adding things like indirect pointers or weird crap adding complexity > >to code that deals with KVA IMO is not acceptable. So you can't > >just assert that you can "address" the problem. 
> > Mostly we need a way of identifying pointers into a data structure, > like rmap (after all that's what makes transparent hugepages work). And that involves auditing and rewriting anything that allocates and pins kernel memory. It's not only dentries. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with SMTP id 635D56B01EE for ; Mon, 12 Apr 2010 04:34:28 -0400 (EDT) Date: Mon, 12 Apr 2010 18:34:20 +1000 From: Nick Piggin Subject: Re: hugepages will matter more in the future Message-ID: <20100412083420.GV5683@laptop> References: <20100411110015.GA10149@elte.hu> <4BC1B034.4050302@redhat.com> <20100411115229.GB10952@elte.hu> <4BC1EE13.7080702@redhat.com> <4BC1F31E.2050009@redhat.com> <20100412074557.GA18485@elte.hu> <20100412081431.GT5683@laptop> <20100412082218.GA7380@elte.hu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100412082218.GA7380@elte.hu> Sender: owner-linux-mm@kvack.org To: Ingo Molnar Cc: Avi Kivity , Linus Torvalds , Jason Garrett-Glaser , Mike Galbraith , Andrea Arcangeli , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura , Arjan van de Ven List-ID: On Mon, Apr 12, 2010 at 10:22:18AM +0200, Ingo Molnar wrote: > > * Nick Piggin wrote: > > > > 2) or we accept the fact that the application space is shifting to the > > > meta-kernels - and then we should agressively optimize Linux for those > > > meta-kernels and not pretend that they are 'specialized'. They literally > > > represent tens of thousands of applications apiece. > > > > And if meta-kernels (or whatever you want to call a common or important > > workload) see some speedup that is deemed to be worth the cost of the patch, > > then it will probably get merged. Same as anything else. > > I call a 'meta kernel' something that people code thousands of apps for, > instead of coding on the native kernel. JVM/DBs/Firefox are such frameworks. > (you can call it middleware i guess) > > By all means they are not a 'single special-purpose workload' but represent > literally tens of thousands of apps. I don't think I said anything like 'single special-purpose workload'. I said 'common or important workload'. And they are not fundamentally different (in context of evaluating and accepting a performance improvement) than any other workload. I'm not saying they don't matter. The interesting fact is also that such type of thing is also much more suitable for doing optimisation tricks. JVMs and RDBMS typically can make use of hugepages already, for example. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
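As one example of the existing opt-in route Nick refers to, a contemporary HotSpot JVM can be pointed at an admin-prepared hugetlbfs pool with a single flag (the class name is a placeholder):

  java -XX:+UseLargePages -Xmx2g MyApp

which cuts both ways: the support exists, but only for the handful of runtimes whose vendors added it, and only on machines whose admins configured the pool beforehand.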
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with SMTP id 9D7876B01EE for ; Mon, 12 Apr 2010 04:47:32 -0400 (EDT) Date: Mon, 12 Apr 2010 10:45:39 +0200 From: Andrea Arcangeli Subject: Re: hugepages will matter more in the future Message-ID: <20100412084539.GJ5656@random.random> References: <4BC19916.20100@redhat.com> <20100411110015.GA10149@elte.hu> <4BC1B034.4050302@redhat.com> <20100411115229.GB10952@elte.hu> <4BC1EE13.7080702@redhat.com> <4BC1F31E.2050009@redhat.com> <20100412074557.GA18485@elte.hu> <20100412081431.GT5683@laptop> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100412081431.GT5683@laptop> Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: Ingo Molnar , Avi Kivity , Linus Torvalds , Jason Garrett-Glaser , Mike Galbraith , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura , Arjan van de Ven List-ID: On Mon, Apr 12, 2010 at 06:14:31PM +1000, Nick Piggin wrote: > I don't see how these are the logical choices. I don't really see how > they are even logical in some ways. Let's say that Andrea's patches > offer 5% improvement in best-cases (that are not stupid microbenchmarks) > and 0% in worst cases, and X% "on average" (whatever that means). Then > it is simply a set of things to weigh against the added complexity (both > in terms of code and performance characteristics of the system) that it > is introduced. The gcc 8% boost with translate.o has been ruled out as a useless benchmark, but note that it definitely isn't. Yeah maybe one can write that .c file so it won't take 22 seconds to build, but that's not the point. I wanted to demonstrate there will be lots of other apps taking advantage of this. Linux isn't used only to run gcc; people run simulations that grow to unknown amounts of memory on a daily basis on Linux. It's just that I used gcc as an example to show that even a gcc file we build maybe 2 times a day gets an 8% boost, and because gcc is the most commonly run compute intensive, purely CPU bound program that we're familiar with. If I was building chips instead of writing kernel code, I would have run one of those simulations instead of gcc building qemu-kvm translate. And once I no longer have to rely on khugepaged to move all of gcc's memory into hugepages, maybe even the kernel build will get a boost ("maybe" because I'm not convinced, it sounds too good to be true, but I'll try it out later out of curiosity ;). So I think what we can so far quite safely claim is that in real life the "_best_ case" improvement on a host without virt is really ~8% (with a much bigger boost already measured with virt, >15%; the best case for virt I don't know yet). > I don't really see how it is fundamentally different to any other patch > that speeds things up. This is exactly true, the speedup has to be balanced against the complexity introduced. I'll add a few more points that can help the evaluation. You can be 100% sure this can't destabilize *anything* if you echo never >enabled or boot with transparent_hugepage=0 (both spelled out below).
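Spelled out, the switches in the previous sentence are (sysfs layout as used by this patchset):

  echo always  > /sys/kernel/mm/transparent_hugepage/enabled
  echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
  echo never   > /sys/kernel/mm/transparent_hugepage/enabled
  # or disable for the whole boot: pass transparent_hugepage=0 on the kernel command line

They also give a hedged recipe for reproducing the translate.o comparison cited above, inside a qemu-kvm build tree (the target name is assumed from the file cited in the thread; numbers will vary by machine):

  echo never  > /sys/kernel/mm/transparent_hugepage/enabled
  time make translate.o
  echo always > /sys/kernel/mm/transparent_hugepage/enabled
  time make translate.o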
Furthermore, if you're building for embedded and set CONFIG_TRANSPARENT_HUGEPAGE=n, 99% of the new code won't even be built. The 8% best-case speedup should be reproducible on all hardware, from my $150 workstation (maybe even on UP x86 32bit, and even Atom) up to the 4096-CPU NUMA system (there it'll hopefully be more than 8% because of the much bigger skew between the in-core l2 cache and remote NUMA memory). The 8% boost will surely be reproducible with really optimally written apps, and it's not only AIM. It's not like the anon-vma design change, which micro-slows-down the fast paths, makes heads hurt, cannot be disabled at runtime, and only shows a boost in badly designed apps (Avi once told me fork is useless; well, I don't entirely agree, but surely it's not something good apps should be heavy users of, it's more about keeping things simple for something not really enterprise or performance critical; the fact certain DBs use fork is, I think, caused by proprietary-source designs and not technical issues). It's not like speculative pagecache, which boosts only certain workloads, and only if you have that many CPUs on a large SMP, and cannot be opted out of or disabled if it's unstable. So it's more complex, maybe, but it's zero risk if disabled at runtime or compile time, and it provides a constant speedup to optimally written apps (a huge speedup in the EPT/NPT case). And yeah, it'd be cool if there were a better CPU than the ones with EPT/NPT; surely if somebody can invent something better than that, tons of people would be interested, considering how little (Google being one of the exceptions) runs on bare metal these days. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with SMTP id C27A06B01EF for ; Mon, 12 Apr 2010 04:49:23 -0400 (EDT) Message-ID: <4BC2DE2C.8070707@redhat.com> Date: Mon, 12 Apr 2010 11:47:40 +0300 From: Avi Kivity MIME-Version: 1.0 Subject: Re: hugepages will matter more in the future References: <20100411110015.GA10149@elte.hu> <4BC1B034.4050302@redhat.com> <20100411115229.GB10952@elte.hu> <4BC1EE13.7080702@redhat.com> <4BC1F31E.2050009@redhat.com> <20100412074557.GA18485@elte.hu> <20100412081431.GT5683@laptop> <20100412082218.GA7380@elte.hu> <20100412083420.GV5683@laptop> In-Reply-To: <20100412083420.GV5683@laptop> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: Ingo Molnar , Linus Torvalds , Jason Garrett-Glaser , Mike Galbraith , Andrea Arcangeli , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura , Arjan van de Ven List-ID: On 04/12/2010 11:34 AM, Nick Piggin wrote: > The interesting fact is also that such type of thing is also much more > suitable for doing optimisation tricks. JVMs and RDBMS typically can > make use of hugepages already, for example. > That just shows they're important enough for people to care.
What transparent hugepages does is remove the tradeoff between performance and flexibility that people have to make now, and also allow opportunistic speedup on apps that don't have a userbase large enough to care. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id 7ED896B01EE for ; Mon, 12 Apr 2010 05:02:16 -0400 (EDT) Date: Mon, 12 Apr 2010 11:01:21 +0200 From: Andrea Arcangeli Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100412090121.GK5656@random.random> References: <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <20100411104608.GA12828@elte.hu> <4BC1B2CA.8050208@redhat.com> <20100411120800.GC10952@elte.hu> <20100412060931.GP5683@laptop> <4BC2BF67.80903@redhat.com> <20100412071525.GR5683@laptop> <4BC2CF8C.5090108@redhat.com> <20100412082844.GU5683@laptop> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100412082844.GU5683@laptop> Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: Avi Kivity , Ingo Molnar , Mike Galbraith , Jason Garrett-Glaser , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Mon, Apr 12, 2010 at 06:28:44PM +1000, Nick Piggin wrote: > If virtualization is the main worry (which it seems that it is > seeing as your TLB misses cost like 6 times more cachelines), > then complexity should be pushed into the hypervisor, not the > core kernel. It's not just about virtualization on the host, or I could have done a much smaller patch without bothering so much to make something as universal as possible, with COWs and all. Also, about virtualization, you forget that the CPU can establish 2M TLB entries in the guest only if the guest and the host shadow pagetables are both pmd_huge; if one of the two pmds isn't huge then the guest-virtual to host-physical translation won't be the same for all 512 4k pages (well, it might be if you're extremely lucky, but I strongly doubt the CPU bothers to check that the host pfns are contiguous when the guest pmd and shadow pmd aren't both huge). In other words we have to do something that is totally disconnected from virtualization in order to take advantage of it to the maximum extent with virt ;). This lets us leverage the KVM design compared to vmware and the other inferior virtualization designs. We make gcc run 8% faster on a cheap single socket workstation without virt, and we get an even bigger cumulative boost in virtualized gcc without changing anything at all in KVM. If this isn't the obviously best way to go, I don't know what is! ;) > And that involves auditing and rewriting anything that allocates > and pins kernel memory. It's not only dentries.
All not short lived gup pins have to use mmu notifier, no piece of the kernel is allowed to keep movable pages pinned for more than the time it takes to complete the DMA. It has to be fixed to provide all other benefits with GRU, XPMEM now that VM locks are switching to mutex (and as usual to KVM too). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id E6A7D6B01F1 for ; Mon, 12 Apr 2010 05:04:01 -0400 (EDT) Message-ID: <4BC2E1D6.9040702@redhat.com> Date: Mon, 12 Apr 2010 12:03:18 +0300 From: Avi Kivity MIME-Version: 1.0 Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 References: <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <20100411104608.GA12828@elte.hu> <4BC1B2CA.8050208@redhat.com> <20100411120800.GC10952@elte.hu> <20100412060931.GP5683@laptop> <4BC2BF67.80903@redhat.com> <20100412071525.GR5683@laptop> <4BC2CF8C.5090108@redhat.com> <20100412082844.GU5683@laptop> In-Reply-To: <20100412082844.GU5683@laptop> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: Ingo Molnar , Mike Galbraith , Jason Garrett-Glaser , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On 04/12/2010 11:28 AM, Nick Piggin wrote: > >> We use the "try" tactic extensively. So long as there's a >> reasonable chance of success, and a reasonable fallback on failure, >> it's fine. >> >> Do you think we won't have reasonable success rates? Why? >> > After the memory is fragmented? It's more or less irriversable. So > success rates (to fill a specific number of huges pages) will be fine > up to a point. Then it will be a continual failure. > So we get just a part of the win, not all of it. > Sure, some workloads simply won't trigger fragmentation problems. > Others will. > Some workloads benefit from readahead. Some don't. In fact, readahead has a higher potential to reduce performance. Same as with many other optimizations. >> Why? If you can isolate all the pointers into the dentry, allocate >> the new dentry, make the old one point into the new one, hash it, >> move the pointers, drop the old dentry. >> >> Difficult, yes, but insane? >> > Yes. > Well, I'll accept what you say since I'm nowhere near as familiar with the code. But maybe someone insane will come along and do it. >> Caches have statistical performance. In the long run they average >> out. In the short run they can behave badly. Same thing with large >> pages, except the runs are longer and the wins are smaller. >> > You don't understand. Caches don't suddenly or slowly stop working. > For a particular pattern of workload, they statistically pretty much > work the same all the time. > Yet your effective cache size can be reduced by unhappy aliasing of physical pages in your working set. It's unlikely but it can happen. 
For a statistical mix of workloads, huge pages will also work just fine. Perhaps not all of them, but most (those that don't fill _all_ of memory with dentries). >> Database are the easiest case, they allocate memory up front and >> don't give it up. We'll coalesce their memory immediately and >> they'll run happily ever after. >> > Again, you're thinking about a benchmark setup. If you've got various > admin things, backups, scripts running, probably web servers, > application servers etc. Then it's not all that simple. > These are all anonymous/pagecache loads, which we deal with well. > And yes, Linux works pretty well for a multi-workload platform. You > might be thinking too much about virtualization where you put things > in sterile little boxes and take the performance hit. > > People do it for a reason. >> Virtualization will fragment on overcommit, but the load is all >> anonymous memory, so it's easy to defragment. Very little dcache on >> the host. >> > If virtualization is the main worry (which it seems that it is > seeing as your TLB misses cost like 6 times more cachelines), > (just 2x) > then complexity should be pushed into the hypervisor, not the > core kernel. > The whole point behind kvm is to reuse the Linux core. If we have to reimplement Linux memory management and scheduling, then it's a failure. >> Well, I'm not against it, but that would be a much more intrusive >> change than what this thread is about. Also, you'd need 4K dentries >> etc, no? >> > No. You'd just be defragmenting 4K worth of dentries at a time. > Dentries (and anything that doesn't care about untranslated KVA) > are trivial. Zero change for users of the code. > I see. > This is going off-topic though, I don't want to hijack the thread > with talk of nonlinear kernel. > Too bad, it's interesting. >> Mostly we need a way of identifying pointers into a data structure, >> like rmap (after all that's what makes transparent hugepages work). >> > And that involves auditing and rewriting anything that allocates > and pins kernel memory. It's not only dentries. > Not everything, just the major users that can scale with the amount of memory in the machine. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
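To pin down what is actually being debated, here is the relocation scheme from the quoted "isolate all the pointers into the dentry" paragraph spelled out as purely hypothetical C pseudocode. Every helper below is an invented name with no kernel implementation, and Nick's objection is precisely that making these steps safe against all the pinned users is the insane part:

/* Hypothetical sketch of the dentry relocation scheme described above. */
static int relocate_dentry(struct dentry *old)
{
	struct dentry *new;

	/* "allocate the new dentry" (near existing dcache) */
	new = alloc_dentry_for_compaction();		/* hypothetical */
	if (!new)
		return -ENOMEM;
	/* "isolate all the pointers into the dentry" */
	if (!isolate_dentry_pointers(old)) {		/* hypothetical; fails if in use */
		free_dentry(new);			/* hypothetical */
		return -EBUSY;
	}
	copy_dentry_contents(new, old);			/* hypothetical */
	/* "make the old one point into the new one" */
	set_forwarding_pointer(old, new);		/* hypothetical */
	/* "hash it" */
	rehash_dentry(new);				/* hypothetical */
	/* "move the pointers, drop the old dentry" */
	redirect_pointers_to(old, new);			/* hypothetical */
	drop_old_dentry(old);				/* hypothetical */
	return 0;
}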
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id A4E726B01EF for ; Mon, 12 Apr 2010 05:26:25 -0400 (EDT) Date: Mon, 12 Apr 2010 19:26:15 +1000 From: Nick Piggin Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100412092615.GY5683@laptop> References: <4BC0DE84.3090305@redhat.com> <20100411104608.GA12828@elte.hu> <4BC1B2CA.8050208@redhat.com> <20100411120800.GC10952@elte.hu> <20100412060931.GP5683@laptop> <4BC2BF67.80903@redhat.com> <20100412071525.GR5683@laptop> <4BC2CF8C.5090108@redhat.com> <20100412082844.GU5683@laptop> <4BC2E1D6.9040702@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4BC2E1D6.9040702@redhat.com> Sender: owner-linux-mm@kvack.org To: Avi Kivity Cc: Ingo Molnar , Mike Galbraith , Jason Garrett-Glaser , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Mon, Apr 12, 2010 at 12:03:18PM +0300, Avi Kivity wrote: > On 04/12/2010 11:28 AM, Nick Piggin wrote: > > > >>We use the "try" tactic extensively. So long as there's a > >>reasonable chance of success, and a reasonable fallback on failure, > >>it's fine. > >> > >>Do you think we won't have reasonable success rates? Why? > >After the memory is fragmented? It's more or less irriversable. So > >success rates (to fill a specific number of huges pages) will be fine > >up to a point. Then it will be a continual failure. > > So we get just a part of the win, not all of it. It can degrade over time. This is the difference. Two identical workloads may have performance X and Y depending on whether uptime is 1 day or 20 days. > >Sure, some workloads simply won't trigger fragmentation problems. > >Others will. > > Some workloads benefit from readahead. Some don't. In fact, > readahead has a higher potential to reduce performance. > > Same as with many other optimizations. Do you see any difference between your examples and this issue? > >>Why? If you can isolate all the pointers into the dentry, allocate > >>the new dentry, make the old one point into the new one, hash it, > >>move the pointers, drop the old dentry. > >> > >>Difficult, yes, but insane? > >Yes. > > Well, I'll accept what you say since I'm nowhere near as familiar > with the code. But maybe someone insane will come along and do it. And it'll get nacked :) And it's not only dcache that can cause a problem. This is part of the whole reason it is insane. It is insane to only fix the dcache, because if you accept the dcache is a problem that needs such complexity to fix, then you must accept the same for the inode caches, the buffer head caches, vmas, radix tree nodes, files etc. no? > >>Caches have statistical performance. In the long run they average > >>out. In the short run they can behave badly. Same thing with large > >>pages, except the runs are longer and the wins are smaller. > >You don't understand. Caches don't suddenly or slowly stop working. > >For a particular pattern of workload, they statistically pretty much > >work the same all the time.
> > Yet your effective cache size can be reduced by unhappy aliasing of > physical pages in your working set. It's unlikely but it can > happen. > > For a statistical mix of workloads, huge pages will also work just > fine. Perhaps not all of them, but most (those that don't fill > _all_ of memory with dentries). Like I said, you don't need to fill all memory with dentries, you just need to be allocating higher order kernel memory and end up fragmenting your reclaimable pools. And it's not a statistical mix that is the problem. The problem is that the workloads that do cause fragmentation problems will run well for 1 day or 5 days and then degrade. And it is impossible to know what will degrade and what won't and by how much. I'm not saying this is a showstopper, but it does really suck. > >>Database are the easiest case, they allocate memory up front and > >>don't give it up. We'll coalesce their memory immediately and > >>they'll run happily ever after. > >Again, you're thinking about a benchmark setup. If you've got various > >admin things, backups, scripts running, probably web servers, > >application servers etc. Then it's not all that simple. > > These are all anonymous/pagecache loads, which we deal with well. Huh? They also involve sockets, files, and involve all of the above data structures I listed and many more. > >And yes, Linux works pretty well for a multi-workload platform. You > >might be thinking too much about virtualization where you put things > >in sterile little boxes and take the performance hit. > > > > People do it for a reason. The reasoning is not always sound though. And also people do other things. Including increasingly better containers and workload management in the single kernel. > >>Virtualization will fragment on overcommit, but the load is all > >>anonymous memory, so it's easy to defragment. Very little dcache on > >>the host. > >If virtualization is the main worry (which it seems that it is > >seeing as your TLB misses cost like 6 times more cachelines), > > (just 2x) > > >then complexity should be pushed into the hypervisor, not the > >core kernel. > > The whole point behind kvm is to reuse the Linux core. If we have > to reimplement Linux memory management and scheduling, then it's a > failure. And if you need to add complexity to the Linux core for it, it's also a failure. I'm not saying to reimplement things, but if you had a little bit more support perhaps. Anyway it's just ideas, I'm not saying that transparent hugepages is wrong simply because KVM is a big user and it could be implemented in another way. But if it is possible for KVM to use libhugetlb with just a bit of support from the kernel, then it goes some way to reducing the need for transparent hugepages. > >>Well, I'm not against it, but that would be a much more intrusive > >>change than what this thread is about. Also, you'd need 4K dentries > >>etc, no? > >No. You'd just be defragmenting 4K worth of dentries at a time. > >Dentries (and anything that doesn't care about untranslated KVA) > >are trivial. Zero change for users of the code. > > I see. > > >This is going off-topic though, I don't want to hijack the thread > >with talk of nonlinear kernel. > > Too bad, it's interesting. It sure is, we can start another thread. > >>Mostly we need a way of identifying pointers into a data structure, > >>like rmap (after all that's what makes transparent hugepages work). > >And that involves auditing and rewriting anything that allocates > >and pins kernel memory. It's not only dentries. 
> > Not everything, just the major users that can scale with the amount > of memory in the machine. Well, you need to audit to determine whether it is going to be a problem or not, and it is more than only dentries. (but even dentries would be a nightmare considering how widely they're used and how much they're passed around the vfs and filesystems). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id 8429D6B01EE for ; Mon, 12 Apr 2010 05:40:51 -0400 (EDT) Date: Mon, 12 Apr 2010 11:39:52 +0200 From: Andrea Arcangeli Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100412093952.GR5656@random.random> References: <20100411104608.GA12828@elte.hu> <4BC1B2CA.8050208@redhat.com> <20100411120800.GC10952@elte.hu> <20100412060931.GP5683@laptop> <4BC2BF67.80903@redhat.com> <20100412071525.GR5683@laptop> <4BC2CF8C.5090108@redhat.com> <20100412082844.GU5683@laptop> <4BC2E1D6.9040702@redhat.com> <20100412092615.GY5683@laptop> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100412092615.GY5683@laptop> Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: Avi Kivity , Ingo Molnar , Mike Galbraith , Jason Garrett-Glaser , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Mon, Apr 12, 2010 at 07:26:15PM +1000, Nick Piggin wrote: > But if it is possible for KVM to use libhugetlb with just a bit of > support from the kernel, then it goes some way to reducing the > need for transparent hugepages. KVM has had full hugetlbfs support for a long time. There are some people using it, and it remains a must-have for 1G pages, but it's not manageable that way in the cloud. It's ok for a special instance only. Right now all my VMs by default are running on hugepages without changing a single bit (with a few-line patch to qemu to add an alignment, because the gfn bits in the range HPAGE_PMD_SHIFT..PAGE_SHIFT have to match the host pfn bits for NPT shadows to go pmd_huge). For qemu itself to run on hugepages not even the alignment is needed (but it's better to align there too, to make sure the guest kernel, which is usually mapped in the first megabytes, lives in hugepages). This is the single change I had to apply to KVM for it to take advantage of transparent hugepages because it was already working fine with hugetlbfs: http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=commit;h=d249c189870896b3f275987b70702d2b8c7705d4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
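The qemu side of that alignment is tiny; the idea (a hedged sketch of the concept, not the actual commit linked above) is just to make guest RAM start on a 2M boundary in the host address space, so the low guest-physical bits can line up with the host pfn bits:

#include <stdlib.h>

#define HOST_HPAGE_SIZE	(2UL * 1024 * 1024)	/* x86 pmd-mapped hugepage */

/* Allocate guest RAM 2M-aligned so a guest-physical 2M page can sit on
 * a host 2M page, letting the NPT/EPT shadow pmds go huge as well. */
static void *alloc_guest_ram(size_t size)
{
	void *ram;

	if (posix_memalign(&ram, HOST_HPAGE_SIZE, size))
		return NULL;
	return ram;
}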
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with SMTP id 497536B01EE for ; Mon, 12 Apr 2010 06:03:44 -0400 (EDT) Message-ID: <4BC2EFBA.5080404@redhat.com> Date: Mon, 12 Apr 2010 13:02:34 +0300 From: Avi Kivity MIME-Version: 1.0 Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 References: <4BC0DE84.3090305@redhat.com> <20100411104608.GA12828@elte.hu> <4BC1B2CA.8050208@redhat.com> <20100411120800.GC10952@elte.hu> <20100412060931.GP5683@laptop> <4BC2BF67.80903@redhat.com> <20100412071525.GR5683@laptop> <4BC2CF8C.5090108@redhat.com> <20100412082844.GU5683@laptop> <4BC2E1D6.9040702@redhat.com> <20100412092615.GY5683@laptop> In-Reply-To: <20100412092615.GY5683@laptop> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: Ingo Molnar , Mike Galbraith , Jason Garrett-Glaser , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On 04/12/2010 12:26 PM, Nick Piggin wrote: > On Mon, Apr 12, 2010 at 12:03:18PM +0300, Avi Kivity wrote: > >> On 04/12/2010 11:28 AM, Nick Piggin wrote: >> >>> >>>> We use the "try" tactic extensively. So long as there's a >>>> reasonable chance of success, and a reasonable fallback on failure, >>>> it's fine. >>>> >>>> Do you think we won't have reasonable success rates? Why? >>>> >>> After the memory is fragmented? It's more or less irriversable. So >>> success rates (to fill a specific number of huges pages) will be fine >>> up to a point. Then it will be a continual failure. >>> >> So we get just a part of the win, not all of it. >> > It can degrade over time. This is the difference. Two idencial workloads > may have performance X and Y depending on whether uptime is 1 day or 20 > days. > I don't see why it will degrade. Antifrag will prefer to allocate dcache near existing dcache. The only scenario I can see where it degrades is that you have a dcache load that spills over to all of memory, then falls back leaving a pinned page in every huge frame. It can happen, but I don't see it as a likely scenario. But maybe I'm missing something. >>> Sure, some workloads simply won't trigger fragmentation problems. >>> Others will. >>> >> Some workloads benefit from readahead. Some don't. In fact, >> readahead has a higher potential to reduce performance. >> >> Same as with many other optimizations. >> > Do you see any difference with your examples and this issue? > Memory layout is more persistent. Well, disk layout is even more persistent. Still we do extents, and if our disk is fragmented, we take the hit. >> Well, I'll accept what you say since I'm nowhere near as familiar >> with the code. But maybe someone insane will come along and do it. >> > And it'll get nacked :) And it's not only dcache that can cause a > problem. This is part of the whole reason it is insane. 
It is insane > to only fix the dcache, because if you accept the dcache is a problem > that needs such complexity to fix, then you must accept the same for > the inode caches, the buffer head caches, vmas, radix tree nodes, files > etc. no? > inodes come with dcache, yes. I thought buffer heads are now a much smaller load. vmas usually don't scale up with memory. If you have a lot of radix tree nodes, then you also have a lot of pagecache, so the radix tree nodes can be contained. Open files also don't scale with memory. >> Yet your effective cache size can be reduced by unhappy aliasing of >> physical pages in your working set. It's unlikely but it can >> happen. >> >> For a statistical mix of workloads, huge pages will also work just >> fine. Perhaps not all of them, but most (those that don't fill >> _all_ of memory with dentries). >> > Like I said, you don't need to fill all memory with dentries, you > just need to be allocating higher order kernel memory and end up > fragmenting your reclaimable pools. > Allocate those higher order pages from the same huge frame. > And it's not a statistical mix that is the problem. The problem is > that the workloads that do cause fragmentation problems will run well > for 1 day or 5 days and then degrade. And it is impossible to know > what will degrade and what won't and by how much. > > I'm not saying this is a showstopper, but it does really suck. > > Can you suggest a real life test workload so we can investigate it? >> These are all anonymous/pagecache loads, which we deal with well. >> > Huh? They also involve sockets, files, and involve all of the above > data structures I listed and many more. > A few thousand sockets and open files is chickenfeed for a server. They'll kill a few huge frames but won't significantly affect the rest of memory. > > >>> And yes, Linux works pretty well for a multi-workload platform. You >>> might be thinking too much about virtualization where you put things >>> in sterile little boxes and take the performance hit. >>> >>> >> People do it for a reason. >> > The reasoning is not always sound though. And also people do other > things. Including increasingly better containers and workload > management in the single kernel. > Containers are wonderful but still a future thing, and even when fully implemented they still don't offer the same isolation as virtualization. For example, the owner of workload A might want to upgrade the kernel to fix a bug he's hitting, while the owner of workload B needs three months to test it. >> The whole point behind kvm is to reuse the Linux core. If we have >> to reimplement Linux memory management and scheduling, then it's a >> failure. >> > And if you need to add complexity to the Linux core for it, it's > also a failure. > Well, we need to add complexity, and we already have. If the acceptance criteria for a feature would be 'no new complexity', then the kernel would be a lot smaller than it is now. Everything has to be evaluated on the basis of its generality, the benefit, the importance of the subsystem that needs it, and impact on the code. Huge pages are already used in server loads so they're not specific to kvm. The benefit, 5-15%, is significant. You and Linus might not be interested in virtualization, but a significant and growing fraction of hosts are virtualized, it's up to us if they run Linux or something else. And I trust Andrea and the reviewers here to keep the code impact sane. 
> I'm not saying to reimplement things, but if you had a little bit > more support perhaps. Anyway it's just ideas, I'm not saying that > transparent hugepages is wrong simply because KVM is a big user and it > could be implemented in another way. > What do you mean by 'more support'? > But if it is possible for KVM to use libhugetlb with just a bit of > support from the kernel, then it goes some way to reducing the > need for transparent hugepages. > kvm already works with hugetlbfs. But it's brittle, it means we have to choose between performance and overcommit. >> Not everything, just the major users that can scale with the amount >> of memory in the machine. >> > Well you need to audit, to determine if it is going to be a problem or > not, and it is more than only dentries. (but even dentries would be a > nightmare considering how widely they're used and how much they're > passed around the vfs and filesystems). > pages are passed around everywhere as well. When something is locked or its reference count doesn't match the reachable pointer count, you give up. Only a small number of objects are in active use at any one time. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with SMTP id 49AAC6B01EE for ; Mon, 12 Apr 2010 06:09:01 -0400 (EDT) Date: Mon, 12 Apr 2010 12:08:06 +0200 From: Andrea Arcangeli Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100412100806.GU5656@random.random> References: <4BC1B2CA.8050208@redhat.com> <20100411120800.GC10952@elte.hu> <20100412060931.GP5683@laptop> <4BC2BF67.80903@redhat.com> <20100412071525.GR5683@laptop> <4BC2CF8C.5090108@redhat.com> <20100412082844.GU5683@laptop> <4BC2E1D6.9040702@redhat.com> <20100412092615.GY5683@laptop> <4BC2EFBA.5080404@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4BC2EFBA.5080404@redhat.com> Sender: owner-linux-mm@kvack.org To: Avi Kivity Cc: Nick Piggin , Ingo Molnar , Mike Galbraith , Jason Garrett-Glaser , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Mon, Apr 12, 2010 at 01:02:34PM +0300, Avi Kivity wrote: > The only scenario I can see where it degrades is that you have a dcache > load that spills over to all of memory, then falls back leaving a pinned > page in every huge frame. It can happen, but I don't see it as a likely > scenario. But maybe I'm missing something. And in my understanding this is exactly the scenario that kernelcore= should prevent from ever materialize. Providing math guarantees without kernelcore= is probably futile. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
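For reference, kernelcore= is an existing boot-time knob: it caps how much memory may ever hold non-movable kernel allocations and turns the remainder into ZONE_MOVABLE, which only accepts allocations that can be migrated or reclaimed, so pinned objects like dentries can never spill into (and fragment) that region. E.g., on the kernel command line (the size here is purely illustrative):

kernelcore=4G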
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with SMTP id DCFB96B01EF for ; Mon, 12 Apr 2010 06:11:25 -0400 (EDT) Message-ID: <4BC2F1A6.3070202@redhat.com> Date: Mon, 12 Apr 2010 13:10:46 +0300 From: Avi Kivity MIME-Version: 1.0 Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 References: <4BC1B2CA.8050208@redhat.com> <20100411120800.GC10952@elte.hu> <20100412060931.GP5683@laptop> <4BC2BF67.80903@redhat.com> <20100412071525.GR5683@laptop> <4BC2CF8C.5090108@redhat.com> <20100412082844.GU5683@laptop> <4BC2E1D6.9040702@redhat.com> <20100412092615.GY5683@laptop> <4BC2EFBA.5080404@redhat.com> <20100412100806.GU5656@random.random> In-Reply-To: <20100412100806.GU5656@random.random> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Andrea Arcangeli Cc: Nick Piggin , Ingo Molnar , Mike Galbraith , Jason Garrett-Glaser , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On 04/12/2010 01:08 PM, Andrea Arcangeli wrote: > On Mon, Apr 12, 2010 at 01:02:34PM +0300, Avi Kivity wrote: > >> The only scenario I can see where it degrades is that you have a dcache >> load that spills over to all of memory, then falls back leaving a pinned >> page in every huge frame. It can happen, but I don't see it as a likely >> scenario. But maybe I'm missing something. >> > And in my understanding this is exactly the scenario that kernelcore= > should prevent from ever materialize. Providing math guarantees > without kernelcore= is probably futile. > Well, that forces the user to make a different boot-time tradeoff. It's unsatisfying. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with SMTP id EC9E06B01EF for ; Mon, 12 Apr 2010 06:25:01 -0400 (EDT) Date: Mon, 12 Apr 2010 12:23:53 +0200 From: Andrea Arcangeli Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100412102353.GV5656@random.random> References: <20100412060931.GP5683@laptop> <4BC2BF67.80903@redhat.com> <20100412071525.GR5683@laptop> <4BC2CF8C.5090108@redhat.com> <20100412082844.GU5683@laptop> <4BC2E1D6.9040702@redhat.com> <20100412092615.GY5683@laptop> <4BC2EFBA.5080404@redhat.com> <20100412100806.GU5656@random.random> <4BC2F1A6.3070202@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4BC2F1A6.3070202@redhat.com> Sender: owner-linux-mm@kvack.org To: Avi Kivity Cc: Nick Piggin , Ingo Molnar , Mike Galbraith , Jason Garrett-Glaser , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Mon, Apr 12, 2010 at 01:10:46PM +0300, Avi Kivity wrote: > On 04/12/2010 01:08 PM, Andrea Arcangeli wrote: > > On Mon, Apr 12, 2010 at 01:02:34PM +0300, Avi Kivity wrote: > > > >> The only scenario I can see where it degrades is that you have a dcache > >> load that spills over to all of memory, then falls back leaving a pinned > >> page in every huge frame. It can happen, but I don't see it as a likely > >> scenario. But maybe I'm missing something. > >> > > And in my understanding this is exactly the scenario that kernelcore= > > should prevent from ever materialize. Providing math guarantees > > without kernelcore= is probably futile. > > > > Well, that forces the user to make a different boot-time tradeoff. It's > unsatisfying. Well this is just about the math guarantee, like disabling memory overcommit to have better guarantee not to run into the oom killer... most people won't need this but it can address the math concerns. I think it's enough if people wants a guarantee and it won't require using nonlinear mapping for kernel. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
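The overcommit analogy maps onto an existing knob the same way (mode 2 is the strict-accounting, no-overcommit setting):

echo 2 > /proc/sys/vm/overcommit_memory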
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with ESMTP id 8380C6B01EF for ; Mon, 12 Apr 2010 06:27:58 -0400 (EDT) Date: Mon, 12 Apr 2010 11:27:35 +0100 From: Mel Gorman Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100412102734.GN25756@csn.ul.ie> References: <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <20100411104608.GA12828@elte.hu> <4BC1B2CA.8050208@redhat.com> <20100411120800.GC10952@elte.hu> <20100412060931.GP5683@laptop> <20100412070811.GD5656@random.random> <20100412072144.GS5683@laptop> <4BC2D0C9.3060201@redhat.com> <20100412080748.GC18485@elte.hu> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20100412080748.GC18485@elte.hu> Sender: owner-linux-mm@kvack.org To: Ingo Molnar Cc: Avi Kivity , Nick Piggin , Andrea Arcangeli , Mike Galbraith , Jason Garrett-Glaser , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Mon, Apr 12, 2010 at 10:07:48AM +0200, Ingo Molnar wrote: > > > > [*] Note, it would be even better if the kernel provided the C library [a'ka > klibc] and if hugetlbs could be utilized via malloc() et al more hugectl --heap does this. It uses the __morecore hook in glibc to back malloc with files on hugetlbfs. There is also a programming API with some basic usage at http://www.csn.ul.ie/~mel/docs/stream-api/ The difference in distributions will hopefully be ironed out by replacing custom scripts with calls to hugeadm to do the bulk of the configuration work - e.g. creating mount points and permissions. There is no need to be creating a new user-space library in the kernel repo. > transparently by us changing the user-space library in the kernel repo and > deploying it to apps via a new kernel that provides an updated C library. > We dont do that so we are stuck with crappier solutions and slower > propagation of changes. > -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
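For completeness, the programming API Mel mentions can also be called directly - a minimal sketch assuming libhugetlbfs 2.x's get_huge_pages()/free_huge_pages() interface (link with -lhugetlbfs):

#include <stdlib.h>
#include <hugetlbfs.h>

/* len must be a multiple of the default huge page size (see
 * gethugepagesize()); buffers from the huge path are released with
 * free_huge_pages(), not free(). */
static void *alloc_hugepage_buffer(long len)
{
	void *p = get_huge_pages(len, GHP_DEFAULT);

	if (!p)			/* pool empty: recover with small pages */
		p = malloc(len);
	return p;
}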
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with ESMTP id 9596C6B01EF for ; Mon, 12 Apr 2010 06:37:12 -0400 (EDT) Date: Mon, 12 Apr 2010 20:37:01 +1000 From: Nick Piggin Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100412103701.GZ5683@laptop> References: <4BC1B2CA.8050208@redhat.com> <20100411120800.GC10952@elte.hu> <20100412060931.GP5683@laptop> <4BC2BF67.80903@redhat.com> <20100412071525.GR5683@laptop> <4BC2CF8C.5090108@redhat.com> <20100412082844.GU5683@laptop> <4BC2E1D6.9040702@redhat.com> <20100412092615.GY5683@laptop> <4BC2EFBA.5080404@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4BC2EFBA.5080404@redhat.com> Sender: owner-linux-mm@kvack.org To: Avi Kivity Cc: Ingo Molnar , Mike Galbraith , Jason Garrett-Glaser , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Mon, Apr 12, 2010 at 01:02:34PM +0300, Avi Kivity wrote: > On 04/12/2010 12:26 PM, Nick Piggin wrote: > >On Mon, Apr 12, 2010 at 12:03:18PM +0300, Avi Kivity wrote: > >>On 04/12/2010 11:28 AM, Nick Piggin wrote: > >>>>We use the "try" tactic extensively. So long as there's a > >>>>reasonable chance of success, and a reasonable fallback on failure, > >>>>it's fine. > >>>> > >>>>Do you think we won't have reasonable success rates? Why? > >>>After the memory is fragmented? It's more or less irriversable. So > >>>success rates (to fill a specific number of huges pages) will be fine > >>>up to a point. Then it will be a continual failure. > >>So we get just a part of the win, not all of it. > >It can degrade over time. This is the difference. Two idencial workloads > >may have performance X and Y depending on whether uptime is 1 day or 20 > >days. > > I don't see why it will degrade. Antifrag will prefer to allocate > dcache near existing dcache. > > The only scenario I can see where it degrades is that you have a > dcache load that spills over to all of memory, then falls back > leaving a pinned page in every huge frame. It can happen, but I > don't see it as a likely scenario. But maybe I'm missing something. No, it doesn't need to make all hugepages unavailable in order to start degrading. The moment that fewer huge pages are available than can be used, due to fragmentation, is when you could start seeing degradation. If you're using higher order allocations in the kernel, like SLUB will especially (and SLAB will for some things) then the requirement for fragmentation basically gets smaller by, I think, about the same factor as the page size. So order-2 slabs only need to fill 1/4 of memory in order to be able to fragment entire memory. But fragmenting entire memory is not the start of the degradation, it is the end. > >>>Sure, some workloads simply won't trigger fragmentation problems. > >>>Others will. > >>Some workloads benefit from readahead. Some don't. In fact, > >>readahead has a higher potential to reduce performance. > >> > >>Same as with many other optimizations.
> >Do you see any difference with your examples and this issue? > > Memory layout is more persistent. Well, disk layout is even more > persistent. Still we do extents, and if our disk is fragmented, we > take the hit. Sure, and that's not a good thing either. > >>Well, I'll accept what you say since I'm nowhere near as familiar > >>with the code. But maybe someone insane will come along and do it. > >And it'll get nacked :) And it's not only dcache that can cause a > >problem. This is part of the whole reason it is insane. It is insane > >to only fix the dcache, because if you accept the dcache is a problem > >that needs such complexity to fix, then you must accept the same for > >the inode caches, the buffer head caches, vmas, radix tree nodes, files > >etc. no? > > inodes come with dcache, yes. I thought buffer heads are now a much > smaller load. vmas usually don't scale up with memory. If you have > a lot of radix tree nodes, then you also have a lot of pagecache, so > the radix tree nodes can be contained. Open files also don't scale > with memory. See above; we don't need to fill all memory, especially with higher order allocations. Definitely some workloads that never use much kernel memory will probably not see fragmentation problems. > >>Yet your effective cache size can be reduced by unhappy aliasing of > >>physical pages in your working set. It's unlikely but it can > >>happen. > >> > >>For a statistical mix of workloads, huge pages will also work just > >>fine. Perhaps not all of them, but most (those that don't fill > >>_all_ of memory with dentries). > >Like I said, you don't need to fill all memory with dentries, you > >just need to be allocating higher order kernel memory and end up > >fragmenting your reclaimable pools. > > Allocate those higher order pages from the same huge frame. We don't keep different pools of different frame sizes around to allocate different object sizes in. That would get even weirder than the existing anti-frag stuff with overflow and fallback rules. > >And it's not a statistical mix that is the problem. The problem is > >that the workloads that do cause fragmentation problems will run well > >for 1 day or 5 days and then degrade. And it is impossible to know > >what will degrade and what won't and by how much. > > > >I'm not saying this is a showstopper, but it does really suck. > > > > Can you suggest a real life test workload so we can investigate it? > > >>These are all anonymous/pagecache loads, which we deal with well. > >Huh? They also involve sockets, files, and involve all of the above > >data structures I listed and many more. > > A few thousand sockets and open files is chickenfeed for a server. > They'll kill a few huge frames but won't significantly affect the > rest of memory. Lots of small files is very common for a web server for example. > >>>And yes, Linux works pretty well for a multi-workload platform. You > >>>might be thinking too much about virtualization where you put things > >>>in sterile little boxes and take the performance hit. > >>> > >>People do it for a reason. > >The reasoning is not always sound though. And also people do other > >things. Including increasingly better containers and workload > >management in the single kernel. > > Containers are wonderful but still a future thing, and even when > fully implemented they still don't offer the same isolation as > virtualization. 
For example, the owner of workload A might want to > upgrade the kernel to fix a bug he's hitting, while the owner of > workload B needs three months to test it. But it's better for performance in general. > >>The whole point behind kvm is to reuse the Linux core. If we have > >>to reimplement Linux memory management and scheduling, then it's a > >>failure. > >And if you need to add complexity to the Linux core for it, it's > >also a failure. > > Well, we need to add complexity, and we already have. If the > acceptance criteria for a feature would be 'no new complexity', then > the kernel would be a lot smaller than it is now. > > Everything has to be evaluated on the basis of its generality, the > benefit, the importance of the subsystem that needs it, and impact > on the code. Huge pages are already used in server loads so they're > not specific to kvm. The benefit, 5-15%, is significant. You and > Linus might not be interested in virtualization, but a significant > and growing fraction of hosts are virtualized, it's up to us if they > run Linux or something else. And I trust Andrea and the reviewers > here to keep the code impact sane. I'm being realistic. I know, sure, it is just to be evaluated based on gains, complexity, alternatives, etc. When I hear arguments like we must do this because the memory-to-cache ratio has got 100 times worse and ergo we're on the brink of catastrophe, that's when things get silly. > >I'm not saying to reimplement things, but if you had a little bit > >more support perhaps. Anyway it's just ideas, I'm not saying that > >transparent hugepages is wrong simply because KVM is a big user and it > >could be implemented in another way. > > What do you mean by 'more support'? > > >But if it is possible for KVM to use libhugetlb with just a bit of > >support from the kernel, then it goes some way to reducing the > >need for transparent hugepages. > > kvm already works with hugetlbfs. But it's brittle, it means we > have to choose between performance and overcommit. Overcommit because it doesn't work with swapping? Or something more? > >>Not everything, just the major users that can scale with the amount > >>of memory in the machine. > >Well you need to audit, to determine if it is going to be a problem or > >not, and it is more than only dentries. (but even dentries would be a > >nightmare considering how widely they're used and how much they're > >passed around the vfs and filesystems). > > pages are passed around everywhere as well. When something is > locked or its reference count doesn't match the reachable pointer > count, you give up. Only a small number of objects are in active > use at any one time. Easier said than done, I suspect. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
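To put rough numbers on the order-2 point above (a back-of-the-envelope worked example, assuming 4K base pages and 2M huge frames): a 2M frame holds 512 base pages or 128 order-2 (16K) blocks, so a single pinned allocation per frame is enough to make that frame unusable for huge pages while pinning only 4K/2M ~= 0.2% (or 16K/2M ~= 0.8%) of memory; what the two sides dispute is how evenly real allocation and reclaim patterns spread such pins across frames.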
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id C63656B01EE for ; Mon, 12 Apr 2010 06:45:13 -0400 (EDT) Date: Mon, 12 Apr 2010 11:44:51 +0100 From: Mel Gorman Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100412104451.GO25756@csn.ul.ie> References: <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <20100411104608.GA12828@elte.hu> <4BC1B2CA.8050208@redhat.com> <20100411120800.GC10952@elte.hu> <20100412060931.GP5683@laptop> <20100412070811.GD5656@random.random> <20100412072144.GS5683@laptop> <20100412080626.GG5656@random.random> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20100412080626.GG5656@random.random> Sender: owner-linux-mm@kvack.org To: Andrea Arcangeli Cc: Nick Piggin , Ingo Molnar , Avi Kivity , Mike Galbraith , Jason Garrett-Glaser , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Mon, Apr 12, 2010 at 10:06:26AM +0200, Andrea Arcangeli wrote: > On Mon, Apr 12, 2010 at 05:21:44PM +1000, Nick Piggin wrote: > > On Mon, Apr 12, 2010 at 09:08:11AM +0200, Andrea Arcangeli wrote: > > > On Mon, Apr 12, 2010 at 04:09:31PM +1000, Nick Piggin wrote: > > > > One problem is that you need to keep a lot more memory free in order > > > > for it to be reasonably effective. Another thing is that the problem > > > > of fragmentation breakdown is not just a one-shot event that fills > > > > memory with pinned objects. It is a slow degredation. > > > > > > set_recommended_min_free_kbytes seems to not be in function of ram > > > size, 60MB aren't such a big deal. > > > > > > > Especially when you use something like SLUB as the memory allocator > > > > which requires higher order allocations for objects which are pinned > > > > in kernel memory. > > > > > > > > Just running a few minutes of testing with a kernel compile in the > > > > background does not show the full picture. You really need a box that > > > > has been up for days running a proper workload before you are likely > > > > to see any breakdown. > > > > > > > > I'm sure it's horrible for planning if the RDBMS or VM boxes gradually > > > > get slower after X days of uptime. It's better to have consistent > > > > performance really, for anything except pure benchmark setups. > > > > > > All data I provided is very real, in addition to building a ton of > > > packages and running emerge on /usr/portage I've been running all my > > > real loads. Only problem I only run it for 1 day and half, but the > > > load I kept it under was significant (surely a lot bigger inode/dentry > > > load that any hypervisor usage would ever generate). > > > > OK, but as a solution for some kind of very specific and highly > > optimized application already like RDBMS, HPC, hypervisor or JVM, > > they could just be using hugepages themselves, couldn't they? > > > > It seems more interesting as a more general speedup for applications > > that can't afford such optimizations? (eg. 
the common case for > most people) > > The reality is that very few are using hugetlbfs. I guess maybe 0.1% > of KVM instances on phenom/nahlem chips are running on hugetlbfs for > example (hugetlbfs boot reservation doesn't fit the cloud where you > need all ram available in hugetlbfs and you still need 100% of unused > ram as host pagecache for VDI), As a side-note, this is what dynamic hugepage pool resizing was for. hugeadm --pool-pages-max <size>:[+|-]<pagecount> The hugepage pool grows and shrinks as required if the system is able to allocate the huge pages. If the huge pages are not available, mmap() fails and userspace is expected to recover by retrying the allocation with small pages (something libhugetlbfs does automatically). In the virtualisation context, the greater problem with such an approach is that no overcommit is possible. I am given to understand that this is a major problem because hosts of virtual machines are often overcommitted on the assumption they don't all peak at the same time. > despite it would provide a >=6% boosts > to all VM no matter what's running on the guest. Same goes for the > JVM, maybe 0.1% of those runs on hugetlbfs. The commercial DBMS are > the exception and they're probably closer to 99% running on hugetlbfs > (and they've to keep using hugetlbfs until we move transparent > hugepages in tmpfs). But as > The DBMS documentation often appears to put a greater emphasis on huge page tuning than the applications that depend on the JVM. > So there's a ton of wasted energy in my view. Like Ingo said, the > faster they make the chips and the cheaper the RAM becomes, the more > wasted energy as result of not using hugetlbfs. There's always more > difference between cache sizes and ram sizes and also more difference > between cache speeds and ram speeds. I don't see this trend ending and > I'm not sure what is the better CPU that will make hugetlbfs worthless > and unselectable at kernel configure time on x86 arch (if you build > without generic). > > And I don't think it's feasible to ship a distro where 99% of apps > that can benefit from hugepages are running with > LD_PRELOAD=libhugetlbfs.so. It has to be transparent if we want to > stop the waste. > I don't see such a thing happening. Huge pages on hugetlbfs do not swap and would be like calling mlock aggressively. > The main reason I've always been skeptical about transparent hugepages > before I started working on this is the mess they generate on the > whole kernel. So my priority of course has been to keep it self > contained as much as possible. It kept spilling over and over until I > managed to confine it to anonymous pages and fix whole mm/.c files > with just a one liner (even the hugepage aware implementation that > Johannes did still takes advantage of split_huge_page_pmd if the > mprotect start/end isn't 2M naturally aligned, just to show how > complex it would be to do it all at once). This will allow us to reach > a solid base, and then later move to tmpfs and maybe later to > pagecache and swapcache too. Pretending the whole kernel to become > hugepage aware at once is a total mess, gup would need to return only > head pages for example and breaking hundred of drivers in just that > change. The compound_lock can be removed after you fix all those > hundred of drivers and subsystems using gup... No big deal to remove > it later, kind of you're removing the big kernel lock these days after > 14 years of when it has been introduced.
> > Plus I did all I could to try to keep it as black and white as > possible. I think other OSes are more gray in their approaches, my > priority has been to pay for RAM anywhere I could if you set > enabled=always, and to decrease as much as I could any risk of > performance regressions in any workload. These days we can afford to > lose 1G without much worry if it speeds up the workload 8%, so I think > the other designs are better for old, RAM-constrained hardware and are not > very relevant today. On embedded systems, with my patchset, one should set > enabled=madvise. Ingo suggested a per-process tweak to enable it > selectively on certain apps, that is feasible too in the future (so > people won't be forced to modify binaries to add madvise if they can't > leave enabled=always). > > > Yes we do have the option to reserve pages and as far as I know it > > should work, although I can't remember whether it deals with mlock. > > I think that is the right route to take for those who need the > math guarantees, and for many products it won't even be noticeable to > enforce the math guarantee. It's kind of like overcommit: somebody prefers > the overcommit_memory = 2 version and maybe they don't even notice it allows them to > allocate less memory. Others prefer to be able to allocate ram > without accounting for the unused virtual regions despite the bigger > chance to run into the oom killer (and I'm in the latter camp for both > overcommit sysctl and kernelcore= ;). > -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id 3995D6B01EE for ; Mon, 12 Apr 2010 07:00:35 -0400 (EDT) Message-ID: <4BC2FCFA.5080004@redhat.com> Date: Mon, 12 Apr 2010 13:59:06 +0300 From: Avi Kivity MIME-Version: 1.0 Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 References: <4BC1B2CA.8050208@redhat.com> <20100411120800.GC10952@elte.hu> <20100412060931.GP5683@laptop> <4BC2BF67.80903@redhat.com> <20100412071525.GR5683@laptop> <4BC2CF8C.5090108@redhat.com> <20100412082844.GU5683@laptop> <4BC2E1D6.9040702@redhat.com> <20100412092615.GY5683@laptop> <4BC2EFBA.5080404@redhat.com> <20100412103701.GZ5683@laptop> In-Reply-To: <20100412103701.GZ5683@laptop> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: Ingo Molnar , Mike Galbraith , Jason Garrett-Glaser , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On 04/12/2010 01:37 PM, Nick Piggin wrote: > >> I don't see why it will degrade. Antifrag will prefer to allocate >> dcache near existing dcache. >> >> The only scenario I can see where it degrades is that you have a >> dcache load that spills over to all of memory, then falls back >> leaving a pinned page in every huge frame.
It can happen, but I >> don't see it as a likely scenario. But maybe I'm missing something. >> > No, it doesn't need to make all hugepages unavailable in order to > start degrading. The moment that fewer huge pages are available than > can be used, due to fragmentation, is when you could start seeing > degradation. > Graceful degradation is fine. We're degrading to the current situation here, not something worse. > If you're using higher order allocations in the kernel, like SLUB > especially will (and SLAB will for some things), then the requirement > for fragmentation basically gets smaller by I think about the same > factor as the page size. So order-2 slabs only need to fill 1/4 of > memory in order to be able to fragment entire memory. But fragmenting > entire memory is not the start of the degradation, it is the end. > Those order-2 slabs should be allocated in the same page frame. If they're allocated randomly, sure, you need 1 allocation per huge page frame. If you're filling up huge page frames, things look a lot better. > > >>>>> Sure, some workloads simply won't trigger fragmentation problems. >>>>> Others will. >>>>> >>>> Some workloads benefit from readahead. Some don't. In fact, >>>> readahead has a higher potential to reduce performance. >>>> >>>> Same as with many other optimizations. >>>> >>> Do you see any difference between your examples and this issue? >>> >> Memory layout is more persistent. Well, disk layout is even more >> persistent. Still we do extents, and if our disk is fragmented, we >> take the hit. >> > Sure, and that's not a good thing either. > And yet we have lived with it for decades; and we use more or less the same techniques to avoid it. >> inodes come with dcache, yes. I thought buffer heads are now a much >> smaller load. vmas usually don't scale up with memory. If you have >> a lot of radix tree nodes, then you also have a lot of pagecache, so >> the radix tree nodes can be contained. Open files also don't scale >> with memory. >> > See above; we don't need to fill all memory, especially with higher > order allocations. > Not if you allocate carefully. > Definitely some workloads that never use much kernel memory will > probably not see fragmentation problems. > > Right; and on a 16-64GB machine you'll have a hard time filling kernel memory with objects. >>> Like I said, you don't need to fill all memory with dentries, you >>> just need to be allocating higher order kernel memory and end up >>> fragmenting your reclaimable pools. >>> >> Allocate those higher order pages from the same huge frame. >> > We don't keep different pools of different frame sizes around > to allocate different object sizes in. That would get even weirder > than the existing anti-frag stuff with overflow and fallback rules. > Maybe we should, once we start to use a lot of such objects. Once you have 10MB worth of inodes, you don't lose anything by allocating their slabs from 2MB units. >> A few thousand sockets and open files is chickenfeed for a server. >> They'll kill a few huge frames but won't significantly affect the >> rest of memory. >> > Lots of small files is very common for a web server for example. > 10k files? 100k files? how many open at once? Even 1M files is ~1GB, not touching our 64GB server. Most content is dynamic these days anyway. >> Containers are wonderful but still a future thing, and even when >> fully implemented they still don't offer the same isolation as >> virtualization.
For example, the owner of workload A might want to >> upgrade the kernel to fix a bug he's hitting, while the owner of >> workload B needs three months to test it. >> > But better for performance in general. > > True. But virtualization has the advantage of actually being there. Note that kvm is also benefiting from containers to improve resource isolation. >> Everything has to be evaluated on the basis of its generality, the >> benefit, the importance of the subsystem that needs it, and impact >> on the code. Huge pages are already used in server loads so they're >> not specific to kvm. The benefit, 5-15%, is significant. You and >> Linus might not be interested in virtualization, but a significant >> and growing fraction of hosts are virtualized, it's up to us if they >> run Linux or something else. And I trust Andrea and the reviewers >> here to keep the code impact sane. >> > I'm being realistic. I know sure it is just to be evaluated based > on gains, complexity, alternatives, etc. > > When I hear arguments like we must do this because memory to cache > ratio has got 100 times worse and ergo we're on the brink of > catastrophe, that's when things get silly. > That wasn't me. It's 5-15%, not earth shattering, but significant. Especially when we hear things like 1% performance regression per kernel release on average. And it's true that the gain will grow as machines grow. >>> But if it is possible for KVM to use libhugetlb with just a bit of >>> support from the kernel, then it goes some way to reducing the >>> need for transparent hugepages. >>> >> kvm already works with hugetlbfs. But it's brittle, it means we >> have to choose between performance and overcommit. >> > Overcommit because it doesn't work with swapping? Or something more? > kvm overcommit uses ballooning, page merging, and swapping. None of these work well with large pages (well, ballooning might). >> pages are passed around everywhere as well. When something is >> locked or its reference count doesn't match the reachable pointer >> count, you give up. Only a small number of objects are in active >> use at any one time. >> > Easier said than done, I suspect. > No doubt it's very tricky code. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
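A quick back-of-envelope on the order-2 slab point in the exchange above: an order-2 slab page is 16k and a huge frame is 2M, so in the worst-case spread a single pinned 16k slab per frame -- 16k/2M, well under 1% of RAM -- is enough to leave no 2M frame assemblable at all. That is Nick's argument that the amount of pinned memory needed to fragment everything shrinks roughly with the allocation order; Avi's counter is that if those order-2 slabs are packed into the same frames rather than spread randomly, the number of spoiled frames stays proportional to the actual slab footprint.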
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with SMTP id 9116D6B01EE for ; Mon, 12 Apr 2010 07:12:53 -0400 (EDT) Message-ID: <4BC30001.2040205@redhat.com> Date: Mon, 12 Apr 2010 14:12:01 +0300 From: Avi Kivity MIME-Version: 1.0 Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 References: <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <20100411104608.GA12828@elte.hu> <4BC1B2CA.8050208@redhat.com> <20100411120800.GC10952@elte.hu> <20100412060931.GP5683@laptop> <20100412070811.GD5656@random.random> <20100412072144.GS5683@laptop> <20100412080626.GG5656@random.random> <20100412104451.GO25756@csn.ul.ie> In-Reply-To: <20100412104451.GO25756@csn.ul.ie> Content-Type: text/plain; charset=ISO-8859-15; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Mel Gorman Cc: Andrea Arcangeli , Nick Piggin , Ingo Molnar , Mike Galbraith , Jason Garrett-Glaser , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On 04/12/2010 01:44 PM, Mel Gorman wrote: > I don't see such a thing happening. Huge pages on hugetlbfs do not swap and > would be like calling mlock aggressively. > Yes, we keep talking about defragmentation, but the nice thing about transparent huge pages is the ability to fragment when needed. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id 5B7266B01EE for ; Mon, 12 Apr 2010 07:21:39 -0400 (EDT) Date: Mon, 12 Apr 2010 04:22:30 -0700 From: Arjan van de Ven Subject: Re: hugepages will matter more in the future Message-ID: <20100412042230.5d974e5d@infradead.org> In-Reply-To: <20100411115229.GB10952@elte.hu> References: <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <4BC0E2C4.8090101@redhat.com> <4BC0E556.30304@redhat.com> <4BC19663.8080001@redhat.com> <4BC19916.20100@redhat.com> <20100411110015.GA10149@elte.hu> <4BC1B034.4050302@redhat.com> <20100411115229.GB10952@elte.hu> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Ingo Molnar Cc: Avi Kivity , Jason Garrett-Glaser , Mike Galbraith , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. 
Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Sun, 11 Apr 2010 13:52:29 +0200 Ingo Molnar wrote: > > Also, the proportion of 4K:2MB is a fixed constant, and CPUs dont > grow their TLB caches as much as typical RAM size grows: they'll grow > it according to the _mean_ working set size - while the 'max' working > set gets larger and larger due to the increasing [proportional] gap > to RAM size. > This is why i think we should think about hugetlb support today and > this is why i think we should consider elevating hugetlbs to the next > level of built-in Linux VM support. I respectfully disagree with your analysis. While it is true that the number of "level 1" tlb entries has not kept up with ram or application size, the CPU designers have made it so that there effectively is a "level 2" (or technically, level 3) in the cache. A tlb miss from cache is so cheap that in almost all cases (you can cheat it by using only 1 byte per page, walking randomly through memory and having a strict ordering between those 1 byte accesses) it is hidden in the out of order engine. So in practice, for many apps, as long as the CPU cache scales with application size the TLB more or less scales too. Now hugepages have some interesting other advantages, namely they save pagetable memory..which for something like TPC-C on a fork based database can be a measureable win. -- Arjan van de Ven Intel Open Source Technology Centre For development, discussion and tips for power savings, visit http://www.lesswatts.org -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id C1E1C6B01EE for ; Mon, 12 Apr 2010 07:31:31 -0400 (EDT) Message-ID: <4BC30436.8070001@redhat.com> Date: Mon, 12 Apr 2010 14:29:58 +0300 From: Avi Kivity MIME-Version: 1.0 Subject: Re: hugepages will matter more in the future References: <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <4BC0E2C4.8090101@redhat.com> <4BC0E556.30304@redhat.com> <4BC19663.8080001@redhat.com> <4BC19916.20100@redhat.com> <20100411110015.GA10149@elte.hu> <4BC1B034.4050302@redhat.com> <20100411115229.GB10952@elte.hu> <20100412042230.5d974e5d@infradead.org> In-Reply-To: <20100412042230.5d974e5d@infradead.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Arjan van de Ven Cc: Ingo Molnar , Jason Garrett-Glaser , Mike Galbraith , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On 04/12/2010 02:22 PM, Arjan van de Ven wrote: > >> This is why i think we should think about hugetlb support today and >> this is why i think we should consider elevating hugetlbs to the next >> level of built-in Linux VM support. >> > > I respectfully disagree with your analysis. 
> While it is true that the number of "level 1" tlb entries has not kept > up with ram or application size, the CPU designers have made it so that > there effectively is a "level 2" (or technically, level 3) in the cache. > > A tlb miss from cache is so cheap that in almost all cases (you can > cheat it by using only 1 byte per page, walking randomly through memory > and having a strict ordering between those 1 byte accesses) it is > hidden in the out of order engine. > Pointer chasing defeats OoO. The cpu is limited in the amount of speculation it can do. Since you will likely miss on the data access, you have two memory accesses to hide (3 for virt). > So in practice, for many apps, as long as the CPU cache scales with > application size the TLB more or less scales too. > A 16MB cache maps 8GB of memory (4GB with virtualization), leaving nothing for data. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id 3036A6B01EE for ; Mon, 12 Apr 2010 08:24:51 -0400 (EDT) Message-ID: <4BC310D2.6030703@redhat.com> Date: Mon, 12 Apr 2010 15:23:46 +0300 From: Avi Kivity MIME-Version: 1.0 Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 References: <4BC1B2CA.8050208@redhat.com> <20100411120800.GC10952@elte.hu> <20100412060931.GP5683@laptop> <4BC2BF67.80903@redhat.com> <20100412071525.GR5683@laptop> <4BC2CF8C.5090108@redhat.com> <20100412082844.GU5683@laptop> <4BC2E1D6.9040702@redhat.com> <20100412092615.GY5683@laptop> <4BC2EFBA.5080404@redhat.com> <20100412103701.GZ5683@laptop> <4BC2FCFA.5080004@redhat.com> In-Reply-To: <4BC2FCFA.5080004@redhat.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: Ingo Molnar , Mike Galbraith , Jason Garrett-Glaser , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On 04/12/2010 01:59 PM, Avi Kivity wrote: >>> Containers are wonderful but still a future thing, and even when >>> fully implemented they still don't offer the same isolation as >>> virtualization. For example, the owner of workload A might want to >>> upgrade the kernel to fix a bug he's hitting, while the owner of >>> workload B needs three months to test it. >> But better for performance in general. >> > > True. But virtualization has the advantage of actually being there. btw, containers are way more intrusive than all the kvm related changes put together, and still not done. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
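The arithmetic behind that last figure: with 8-byte page table entries, each 4k page costs 8 bytes at the leaf level, so 16M of cache can hold at most 16M/8 = 2M entries, covering 2M x 4k = 8G, and a two-dimensional (nested paging) walk roughly halves that coverage to 4G. One 8-byte pmd entry covers a whole 2M page instead, so mapping the same 8G takes only (8G/2M) x 8 = 32k of entries, leaving essentially the whole cache for data.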
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with SMTP id E0C326B01EE for ; Mon, 12 Apr 2010 09:18:21 -0400 (EDT) Date: Mon, 12 Apr 2010 15:17:11 +0200 From: Andrea Arcangeli Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100412131711.GX5656@random.random> References: <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <20100411104608.GA12828@elte.hu> <4BC1B2CA.8050208@redhat.com> <20100411120800.GC10952@elte.hu> <20100412060931.GP5683@laptop> <20100412070811.GD5656@random.random> <20100412072144.GS5683@laptop> <20100412080626.GG5656@random.random> <20100412104451.GO25756@csn.ul.ie> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100412104451.GO25756@csn.ul.ie> Sender: owner-linux-mm@kvack.org To: Mel Gorman Cc: Nick Piggin , Ingo Molnar , Avi Kivity , Mike Galbraith , Jason Garrett-Glaser , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Mon, Apr 12, 2010 at 11:44:51AM +0100, Mel Gorman wrote: > As a side-note, this is what dynamic hugepage pool resizing was for. > > hugeadm --pool-pages-max <size>:[+|-]<pagecount> > > The hugepage pool grows and shrinks as required if the system is able to > allocate the huge pages. If the huge pages are not available, mmap() fails > and userspace is expected to recover by retrying the allocation with > small pages (something libhugetlbfs does automatically). If 99% of the virtual space is backed by hugepages and just the last 2M have to be backed by regular pages that's fine with us, we want to use hugepages for 99% of the memory. > In the virtualisation context, the greater problem with such an approach > is that no overcommit is possible. I am given to understand that this is a > major problem because hosts of virtual machines are often overcommitted > on the assumption they don't all peak at the same time. Yep, another thing that comes to mind is that we need KSM to split and merge hugepages when they're found equal. That's not working yet, but it's more natural to do it in the core VM, as KSM pages then have to be swapped too and mixed in the same vma with regular pages and hugepages. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id 8C87B6B01E3 for ; Mon, 12 Apr 2010 09:26:03 -0400 (EDT) Date: Mon, 12 Apr 2010 15:25:02 +0200 From: Andrea Arcangeli Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100412132502.GY5656@random.random> References: <20100412060931.GP5683@laptop> <4BC2BF67.80903@redhat.com> <20100412071525.GR5683@laptop> <4BC2CF8C.5090108@redhat.com> <20100412082844.GU5683@laptop> <4BC2E1D6.9040702@redhat.com> <20100412092615.GY5683@laptop> <4BC2EFBA.5080404@redhat.com> <20100412103701.GZ5683@laptop> <4BC2FCFA.5080004@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4BC2FCFA.5080004@redhat.com> Sender: owner-linux-mm@kvack.org To: Avi Kivity Cc: Nick Piggin , Ingo Molnar , Mike Galbraith , Jason Garrett-Glaser , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Mon, Apr 12, 2010 at 01:59:06PM +0300, Avi Kivity wrote: > Right; and on a 16-64GB machine you'll have a hard time filling kernel > memory with objects. Yep, this is worth mentioning, the more RAM there is, the higher percentage of the freeable memory won't be fragmented, even without kernelcore=. Which is probably why we won't ever need to worry about kernelcore=. > kvm overcommit uses ballooning, page merging, and swapping. None of > these work well with large pages (well, ballooning might). KSM is the only one that will need some further modification to be able to merge the equal contents inside hugepages. It already can co-exist (I tested it) but right now it will skip over hugepages and it's only able to merge regular pages if there's any. We need to make it hugepage aware and to split the hugepages when it finds stuff to merge. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
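In scanner terms the needed KSM change is small; an illustrative fragment of the idea (an assumption about the eventual ksm.c shape, not an actual diff from the series):

    /* before treating 'page' as a normal 4k anonymous page */
    if (PageTransCompound(page)) {
            split_huge_page(page);  /* the split succeeds by design in this series */
            /* every subpage is now a regular page KSM can compare and merge */
    }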
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with SMTP id 741DC6B01E3 for ; Mon, 12 Apr 2010 09:32:43 -0400 (EDT) Date: Mon, 12 Apr 2010 15:30:19 +0200 From: Andrea Arcangeli Subject: Re: hugepages will matter more in the future Message-ID: <20100412133019.GZ5656@random.random> References: <4BC0E2C4.8090101@redhat.com> <4BC0E556.30304@redhat.com> <4BC19663.8080001@redhat.com> <4BC19916.20100@redhat.com> <20100411110015.GA10149@elte.hu> <4BC1B034.4050302@redhat.com> <20100411115229.GB10952@elte.hu> <20100412042230.5d974e5d@infradead.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100412042230.5d974e5d@infradead.org> Sender: owner-linux-mm@kvack.org To: Arjan van de Ven Cc: Ingo Molnar , Avi Kivity , Jason Garrett-Glaser , Mike Galbraith , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Mon, Apr 12, 2010 at 04:22:30AM -0700, Arjan van de Ven wrote: > Now hugepages have some interesting other advantages, namely they save > pagetable memory... which for something like TPC-C on a fork-based > database can be a measurable win. It doesn't save pagetable memory (as in `grep MemFree /proc/meminfo`). To achieve that we'd need to return -ENOMEM from split_huge_page_pmd and split_huge_page, which would complicate things significantly. I'd prefer to gradually get rid of split_huge_page_pmd calls instead of having to handle a return value in several inner nested functions that, like all their callers, aren't prepared to return an error. I think the saving in pagetables isn't really interesting... it's a couple of gigabytes but it doesn't move the needle as much as being able to boost CPU performance. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
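The reason no -ENOMEM path is needed: the pte page is preallocated with GFP_KERNEL when the hugepage is faulted in and parked on the mm, so the later split only consumes it (Andrea elaborates on this a few messages below). A sketch of the idea only -- the names and the single-slot simplification are illustrative, not the series' exact code, which must keep one preallocated pte page per huge pmd:

    /* fault path, mm->page_table_lock held: park the spare pte page */
    static void huge_pte_deposit(struct mm_struct *mm, pgtable_t pgtable)
    {
            mm->pmd_huge_pte = pgtable;     /* the per-mm stash the series adds */
    }

    /* split path: take it back -- cannot fail, no allocation here */
    static pgtable_t huge_pte_withdraw(struct mm_struct *mm)
    {
            pgtable_t pgtable = mm->pmd_huge_pte;
            mm->pmd_huge_pte = NULL;
            return pgtable;
    }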
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id 2A9476B01E3 for ; Mon, 12 Apr 2010 09:34:55 -0400 (EDT) Message-ID: <4BC3213E.40409@redhat.com> Date: Mon, 12 Apr 2010 16:33:50 +0300 From: Avi Kivity MIME-Version: 1.0 Subject: Re: hugepages will matter more in the future References: <4BC0E2C4.8090101@redhat.com> <4BC0E556.30304@redhat.com> <4BC19663.8080001@redhat.com> <4BC19916.20100@redhat.com> <20100411110015.GA10149@elte.hu> <4BC1B034.4050302@redhat.com> <20100411115229.GB10952@elte.hu> <20100412042230.5d974e5d@infradead.org> <20100412133019.GZ5656@random.random> In-Reply-To: <20100412133019.GZ5656@random.random> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Andrea Arcangeli Cc: Arjan van de Ven , Ingo Molnar , Jason Garrett-Glaser , Mike Galbraith , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On 04/12/2010 04:30 PM, Andrea Arcangeli wrote: > On Mon, Apr 12, 2010 at 04:22:30AM -0700, Arjan van de Ven wrote: > >> Now hugepages have some interesting other advantages, namely they save >> pagetable memory..which for something like TPC-C on a fork based >> database can be a measureable win. >> > It doesn't save pagetable memory (as in `grep MemFree > /proc/meminfo`). So where does the pagetable go? > To achive that we'd need to return -ENOMEM from > split_huge_page_pmd and split_huge_page, which would complicate things > significantly. I'd prefer if we could get rid gradually of > split_huge_page_pmd calls instead of having to handle a retval in > several inner nested functions that don't contemplate returning error > like all their callers. > > I think the saving in pagetables isn't really interesting... it's a > couple of gigabytes but it doesn't move the needle as much as being > able to boost CPU performance. > Fork-based (or process+shm based, like Oracle) replicate the page tables per process, so it's N * 0.2%, which would be quite large. We could share pmds for large shared memory areas, but it wouldn't be easy. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
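The 0.2% figure worked out: one 8-byte pte per 4k page is 8/4096, roughly 0.2% of mapped memory, so each process touching an 8G shared segment builds ~16M of page tables of its own, and a few hundred such processes replicate gigabytes of them. With 2M pages the pte level disappears entirely: 8 bytes per 2M is about 0.0004%, or ~32k per 8G per process.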
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id 4B8026B01E3 for ; Mon, 12 Apr 2010 09:40:55 -0400 (EDT) Date: Mon, 12 Apr 2010 15:39:52 +0200 From: Andrea Arcangeli Subject: Re: hugepages will matter more in the future Message-ID: <20100412133952.GA5656@random.random> References: <4BC0E556.30304@redhat.com> <4BC19663.8080001@redhat.com> <4BC19916.20100@redhat.com> <20100411110015.GA10149@elte.hu> <4BC1B034.4050302@redhat.com> <20100411115229.GB10952@elte.hu> <20100412042230.5d974e5d@infradead.org> <20100412133019.GZ5656@random.random> <4BC3213E.40409@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4BC3213E.40409@redhat.com> Sender: owner-linux-mm@kvack.org To: Avi Kivity Cc: Arjan van de Ven , Ingo Molnar , Jason Garrett-Glaser , Mike Galbraith , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Mon, Apr 12, 2010 at 04:33:50PM +0300, Avi Kivity wrote: > So where does the pagetable go? They're preallocated together with the hugepage and queued into the mm to retain locality. This way a huge pmd can be converted to a regular pmd pointing to the preallocated pte on the fly without GFP_KERNEL allocations. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with SMTP id 3BD826B01E3 for ; Mon, 12 Apr 2010 09:55:20 -0400 (EDT) Message-ID: <4BC325D9.7030203@redhat.com> Date: Mon, 12 Apr 2010 16:53:29 +0300 From: Avi Kivity MIME-Version: 1.0 Subject: Re: hugepages will matter more in the future References: <4BC0E556.30304@redhat.com> <4BC19663.8080001@redhat.com> <4BC19916.20100@redhat.com> <20100411110015.GA10149@elte.hu> <4BC1B034.4050302@redhat.com> <20100411115229.GB10952@elte.hu> <20100412042230.5d974e5d@infradead.org> <20100412133019.GZ5656@random.random> <4BC3213E.40409@redhat.com> <20100412133952.GA5656@random.random> In-Reply-To: <20100412133952.GA5656@random.random> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Andrea Arcangeli Cc: Arjan van de Ven , Ingo Molnar , Jason Garrett-Glaser , Mike Galbraith , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On 04/12/2010 04:39 PM, Andrea Arcangeli wrote: > On Mon, Apr 12, 2010 at 04:33:50PM +0300, Avi Kivity wrote: > >> So where does the pagetable go? 
>> > They're preallocated together with the hugepage and queued into the mm > to retain locality. This way a huge pmd can be converted to a regular > pmd pointing to the preallocated pte on the fly without GFP_KERNEL > allocations. > Oh. Well I hope this can be eliminated in the future somehow. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with SMTP id E7F2A6B01E3 for ; Mon, 12 Apr 2010 10:29:25 -0400 (EDT) Date: Mon, 12 Apr 2010 09:24:52 -0500 (CDT) From: Christoph Lameter Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 In-Reply-To: <20100410184750.GJ5708@random.random> Message-ID: References: <20100405232115.GM5825@random.random> <20100406011345.GT5825@random.random> <20100406090813.GA14098@elte.hu> <20100410184750.GJ5708@random.random> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrea Arcangeli Cc: Ingo Molnar , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Sat, 10 Apr 2010, Andrea Arcangeli wrote: > Full agreement! I think everyone wants transparent hugepage, the only > compliant I ever heard so far is from Christoph that has some slight > preference on not introducing split_huge_page and going full hugepage > everywhere, with native in gup immediately where GUP only returns head > pages and every caller has to check PageTransHuge on them to see if > it's huge or not. Changing several hundred of drivers in one go and > with native swapping with hugepage backed swapcache immediately, which > means also pagecache has to deal with hugepages immediately, is > possible too, but I think this more gradual approach is easier to keep > under control, Rome wasn't built in a day. Surely in a second time I > want tmpfs backed by hugepages too at least. And maybe pagecache, but > it doesn't need to happen immediately. Also we've to keep in mind for > huge systems the PAGE_SIZE should eventually become 2M and those will > be able to take advantage of transparent hugepages for the 1G > pud_trans_huge, that will make HPC even faster. Anyway nothing > prevents to take Christoph's long term direction also by starting self > contained. I want hugepages but not the way you have done it here. Follow conventions and do not introduce on the fly conversion of page size and do not treat a huge page as a 2M page while also handling the 4k components as separate pages. Those create additional synchronization issues (like the compound lock and the refcounting of tail pages). There are existing ways to convert from 2M to 4k without these issues (see reclaim logic and page migration). This would be much cleaner. 
I am not sure where your imagination ran wild to make the claim that hundreds of drivers would have to be changed only because of the use of proper synchronization methods. I have never said that everything has to be converted in one go but that it would have to be an incremental process. Would you please stop building strawmen and telling wild stories? > To me what is relevant is that everyone in the VM camp seems to want > transparent hugepages in some shape or form, because of the about > linear speedup they provide to everything running on them on bare > metal (and a more than linear cumulative speedup in case of nested > pagetables for obvious reasons), no matter what design it is. We want huge pages yes. But transparent? If you can define transparent then we may agree at some point. Certainly not transparent in the sense of volatile objects that suddenly convert from 2M to 4K sizes causing breakage. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id 561116B01E3 for ; Mon, 12 Apr 2010 10:32:44 -0400 (EDT) Date: Mon, 12 Apr 2010 09:29:03 -0500 (CDT) From: Christoph Lameter Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 In-Reply-To: Message-ID: References: <20100410184750.GJ5708@random.random> <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <20100411104608.GA12828@elte.hu> <4BC1B2CA.8050208@redhat.com> <20100411120800.GC10952@elte.hu> <20100412060931.GP5683@laptop> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Pekka Enberg Cc: Nick Piggin , Ingo Molnar , Avi Kivity , Mike Galbraith , Jason Garrett-Glaser , Andrea Arcangeli , Linus Torvalds , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Mon, 12 Apr 2010, Pekka Enberg wrote: > > Especially when you use something like SLUB as the memory allocator > > which requires higher order allocations for objects which are pinned > > in kernel memory. > > I guess we'd need to merge the SLUB defragmentation patches to fix that? 1. SLUB does not require higher order allocations. 2. SLUB defrag patches would allow reclaim / moving of slab memory but would require callbacks to be provided by slab users to remove references to objects. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
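For reference on point 1: the order SLUB uses is tunable, and it already falls back to the minimum order when a higher-order allocation fails; the ceiling can also be capped at boot, e.g.:

    slub_max_order=0      cap slab pages at order 0
    slub_min_objects=1    don't force higher orders just to pack more objects per slab

Whether capping the order this way gives up the very efficiencies SLUB gets from bigger slabs is, of course, part of the disagreement above.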
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with SMTP id 527406B01E3 for ; Mon, 12 Apr 2010 10:51:08 -0400 (EDT) Message-ID: <4BC33314.4090809@redhat.com> Date: Mon, 12 Apr 2010 17:49:56 +0300 From: Avi Kivity MIME-Version: 1.0 Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 References: <20100405232115.GM5825@random.random> <20100406011345.GT5825@random.random> <20100406090813.GA14098@elte.hu> <20100410184750.GJ5708@random.random> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Christoph Lameter Cc: Andrea Arcangeli , Ingo Molnar , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On 04/12/2010 05:24 PM, Christoph Lameter wrote: > >> To me what is relevant is that everyone in the VM camp seems to want >> transparent hugepages in some shape or form, because of the about >> linear speedup they provide to everything running on them on bare >> metal (and an more than linear cumulative speedup in case of nested >> pagetables for obvious reasons), no matter what design that it is. >> > We want huge pages yes. But transparent? If you can define transparent > then we may agree at some point. Certainly not transparent in the sense of > volatile objects that suddenly convert from 2M to 4K sizes causing > breakage. > Suddenly converting from 2M to 4k is a requirement, otherwise we could just use hugetlbfs. It's simple, we want huge pages when we have the memory and small pages when we don't. Only the kernel knows about memory pressure, so it's up to the kernel to break apart and put together those huge pages. If you have other requirements, they have to come on top, not replace our requirements. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id D8D486B01E3 for ; Mon, 12 Apr 2010 11:46:43 -0400 (EDT) Date: Mon, 12 Apr 2010 08:41:01 -0700 (PDT) From: Linus Torvalds Subject: Re: hugepages will matter more in the future In-Reply-To: <20100411194010.GC5656@random.random> Message-ID: References: <4BC0E2C4.8090101@redhat.com> <4BC0E556.30304@redhat.com> <4BC19663.8080001@redhat.com> <4BC19916.20100@redhat.com> <20100411110015.GA10149@elte.hu> <4BC1B034.4050302@redhat.com> <20100411115229.GB10952@elte.hu> <20100411194010.GC5656@random.random> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrea Arcangeli Cc: Ingo Molnar , Avi Kivity , Jason Garrett-Glaser , Mike Galbraith , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura , Arjan van de Ven List-ID: On Sun, 11 Apr 2010, Andrea Arcangeli wrote: > On Sun, Apr 11, 2010 at 08:22:04AM -0700, Linus Torvalds wrote: > > - magic libc malloc flags that are totally and utterly unrealistic in > > anything but a benchmark > > > > - by basically keeping one CPU totally busy doing defragmentation. > > This is a red herring. This is the last thing we want, and we'll run > even faster if we could make current glibc binaries cooperate. But > this is a new feature and it'll require changing glibc slightly. So if it is a red herring, why the hell did you do your numbers with it? Also, talking about "changing glibc slightly" is another sign of just denial of reality. You realize that a lot of apps (especially the ones with large VM footprints) do not use glibc malloc at all, exactly because it has some bad properties particularly with threading? I saw people quote firefox mappings in this thread. You realize that firefox is one such application? > Future glibc will be optimal and it won't require khugepaged don't > worry. Sure. "All problems are imaginary". > I got crashes in page_mapcount != number of huge_pmd mapping the page > in split_huge_page because of the anon-vma bug, so I had to back it > out, this is why it's stable now. Ok. My deeper point really was that all the VM people seem to be in this circlejerk to improve performance, and it looks like nobody is even trying to fix the _existing_ problem (caused by another try to improve performance). I'm totally unimpressed with the whole circus partly exactly due to that. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with SMTP id C77786B01EE for ; Mon, 12 Apr 2010 12:07:08 -0400 (EDT) Date: Tue, 13 Apr 2010 02:06:50 +1000 From: Nick Piggin Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100412160650.GB5683@laptop> References: <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <20100411104608.GA12828@elte.hu> <4BC1B2CA.8050208@redhat.com> <20100411120800.GC10952@elte.hu> <20100412060931.GP5683@laptop> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: Christoph Lameter Cc: Pekka Enberg , Ingo Molnar , Avi Kivity , Mike Galbraith , Jason Garrett-Glaser , Andrea Arcangeli , Linus Torvalds , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Mon, Apr 12, 2010 at 09:29:03AM -0500, Christoph Lameter wrote: > On Mon, 12 Apr 2010, Pekka Enberg wrote: > > > > Especially when you use something like SLUB as the memory allocator > > > which requires higher order allocations for objects which are pinned > > > in kernel memory. > > > > I guess we'd need to merge the SLUB defragmentation patches to fix that? > > 1. SLUB does not require higher order allocations. The problem is not that it requires higher order allocations. The problem is that it uses them. It is not a failing higher order allocation attempt in SLUB that we're worried about here. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id 4C2C66B01EE for ; Mon, 12 Apr 2010 12:21:51 -0400 (EDT) Message-ID: <4BC34837.7020108@redhat.com> Date: Mon, 12 Apr 2010 12:20:07 -0400 From: Rik van Riel MIME-Version: 1.0 Subject: Re: hugepages will matter more in the future References: <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <4BC0E2C4.8090101@redhat.com> <4BC0E556.30304@redhat.com> <4BC19663.8080001@redhat.com> <4BC19916.20100@redhat.com> <20100411110015.GA10149@elte.hu> <4BC1B034.4050302@redhat.com> <20100411115229.GB10952@elte.hu> <4BC1EE13.7080702@redhat.com> In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Linus Torvalds Cc: Avi Kivity , Ingo Molnar , Jason Garrett-Glaser , Mike Galbraith , Andrea Arcangeli , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. 
Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura , Arjan van de Ven List-ID: On 04/11/2010 11:52 AM, Linus Torvalds wrote: > So here's the deal: make the code cleaner, and it's fine. And stop trying > to sell it with _crap_. Since none of the hugepages proponents in this thread seem to have asked this question: What would you like the code to look like, in order for hugepages code to be acceptable to you? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id F33026B01E3 for ; Mon, 12 Apr 2010 12:46:34 -0400 (EDT) Date: Mon, 12 Apr 2010 09:40:54 -0700 (PDT) From: Linus Torvalds Subject: Re: hugepages will matter more in the future In-Reply-To: <4BC34837.7020108@redhat.com> Message-ID: References: <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <4BC0E2C4.8090101@redhat.com> <4BC0E556.30304@redhat.com> <4BC19663.8080001@redhat.com> <4BC19916.20100@redhat.com> <20100411110015.GA10149@elte.hu> <4BC1B034.4050302@redhat.com> <20100411115229.GB10952@elte.hu> <4BC1EE13.7080702@redhat.com> <4BC34837.7020108@redhat.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Rik van Riel Cc: Avi Kivity , Ingo Molnar , Jason Garrett-Glaser , Mike Galbraith , Andrea Arcangeli , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura , Arjan van de Ven List-ID: On Mon, 12 Apr 2010, Rik van Riel wrote: > On 04/11/2010 11:52 AM, Linus Torvalds wrote: > > > So here's the deal: make the code cleaner, and it's fine. And stop trying > > to sell it with _crap_. > > Since none of the hugepages proponents in this thread seem to have > asked this question: > > What would you like the code to look like, in order for hugepages > code to be acceptable to you? So as I already commented to Andrew, the code has no comments about the "big picture", and the largest comment I found was about a totally _trivial_ issue about replacing the hugepage by first clearing the entry, then flushing the tlb, and then filling it. That needs hardly any comment at all, since that's what we do for _normal_ page table entries too when we change anything non-trivial about them. That's the anti-thesis of rocket science. Yet that was apparently considered the most important thing in the whole core patch to talk about! And quite frankly, I've been irritated by the "timings" used to sell this thing from the start. The changelog for the entry makes a big deal out of the fact that there's just a single page fault per 2MB, and that the page timing for clearing a huge region is faster the first time because you don't take a lot of page faults. That's a "Duh!" moment too, but it never even talks about the issue of "oh, well, we did allocate all those 2M chunks, not knowing whether they were going to be used or not". Sure, it's going to help programs that actually use all of it. Nobody is surprised. 
What I still care about, what makes _all_ the timings I've seen in this whole insane thread pretty much totally useless, is the fact that we used to know that what _really_ speeds up a machine is caching. Keeping _relevant_ data around so that you don't do IO. And the mantra from pretty much day one has been "free memory is wasted memory". Yet now, the possibility of _truly_ wasting memory isn't apparently even a blip on anybody's radar. People blithely talk about changing glibc default behavior as if there are absolutely no issues, and 2MB chunks are pocket change. I can pretty much guarantee that every single developer on this list has a machine with excessive amounts of memory compared to what the machine is actually required to do. And I just do not think that is true in general. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with ESMTP id 593656B01E3 for ; Mon, 12 Apr 2010 13:01:06 -0400 (EDT) Date: Mon, 12 Apr 2010 09:56:20 -0700 (PDT) From: Linus Torvalds Subject: Re: hugepages will matter more in the future In-Reply-To: Message-ID: References: <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <4BC0E2C4.8090101@redhat.com> <4BC0E556.30304@redhat.com> <4BC19663.8080001@redhat.com> <4BC19916.20100@redhat.com> <20100411110015.GA10149@elte.hu> <4BC1B034.4050302@redhat.com> <20100411115229.GB10952@elte.hu> <4BC1EE13.7080702@redhat.com> <4BC34837.7020108@redhat.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Rik van Riel Cc: Avi Kivity , Ingo Molnar , Jason Garrett-Glaser , Mike Galbraith , Andrea Arcangeli , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura , Arjan van de Ven List-ID: On Mon, 12 Apr 2010, Linus Torvalds wrote: > > So as I already commented to Andrew, the code has no comments about the > "big picture", and the largest comment I found was about a totally > _trivial_ issue about replacing the hugepage by first clearing the entry, > then flushing the tlb, and then filling it. Btw, this is the same complaint I had about the anon_vma code. There were no overview comments, and some of my fixes to that came directly from writing a big-picture "what should happen" flow chart, and either noticing that the code didn't do what it should have done, or that even the big picture was not clear. And yes, I do realize that historically we (I) haven't been good at those things. It's just that the VM has gotten _so_ complicated that we damn well need them, at least when we add new features that the rest of the VM team doesn't know by rote. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with SMTP id 8FC696B01E3 for ; Mon, 12 Apr 2010 13:07:05 -0400 (EDT) Received: from chimera.site ([71.245.98.113]) by xenotime.net for ; Mon, 12 Apr 2010 10:07:00 -0700 Message-ID: <4BC35333.5030704@xenotime.net> Date: Mon, 12 Apr 2010 10:06:59 -0700 From: Randy Dunlap MIME-Version: 1.0 Subject: Re: hugepages will matter more in the future References: <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <4BC0E2C4.8090101@redhat.com> <4BC0E556.30304@redhat.com> <4BC19663.8080001@redhat.com> <4BC19916.20100@redhat.com> <20100411110015.GA10149@elte.hu> <4BC1B034.4050302@redhat.com> <20100411115229.GB10952@elte.hu> <4BC1EE13.7080702@redhat.com> <4BC34837.7020108@redhat.com> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Linus Torvalds Cc: Rik van Riel , Avi Kivity , Ingo Molnar , Jason Garrett-Glaser , Mike Galbraith , Andrea Arcangeli , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura , Arjan van de Ven List-ID: On 04/12/10 09:56, Linus Torvalds wrote: > > > On Mon, 12 Apr 2010, Linus Torvalds wrote: >> >> So as I already commented to Andrew, the code has no comments about the >> "big picture", and the largest comment I found was about a totally >> _trivial_ issue about replacing the hugepage by first clearing the entry, >> then flushing the tlb, and then filling it. > > Btw, this is the same complaint I had about the anon_vma code. There was > no overview comments, and some of my fixes to that came directly from > writing a big-picture "what should happen" flow chart, and either noticing > that the code didn't do what it should have done, or that even the big > picture was not clear. > > And yes, I do realize that historically we (I) haven't been good at those > things. It's just that the VM has gotten _so_ complicated that we damn > well need them, at least when we add new features that the rest of the VM > team doesn't know by rote. and we can't expect Mel (or anyone) to write MM/VM books continuously, which is what it would take since it's always changing, so useful comments are the way to go. -- ~Randy -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with SMTP id 14ED96B01E3 for ; Mon, 12 Apr 2010 13:37:43 -0400 (EDT) Date: Mon, 12 Apr 2010 19:36:32 +0200 From: Andrea Arcangeli Subject: Re: hugepages will matter more in the future Message-ID: <20100412173632.GB5583@random.random> References: <4BC19916.20100@redhat.com> <20100411110015.GA10149@elte.hu> <4BC1B034.4050302@redhat.com> <20100411115229.GB10952@elte.hu> <4BC1EE13.7080702@redhat.com> <4BC34837.7020108@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: Linus Torvalds Cc: Rik van Riel , Avi Kivity , Ingo Molnar , Jason Garrett-Glaser , Mike Galbraith , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura , Arjan van de Ven List-ID: On Mon, Apr 12, 2010 at 09:40:54AM -0700, Linus Torvalds wrote: > Yet now, the possibility of _truly_ wasting memory isn't apparently even a > blip on anybody's radar. People blithely talk about changing glibc default > behavior as if there are absolutely no issues, and 2MB chunks are pocket > change. This is about enabled=always; in some cases we'll waste memory in the hope of running faster, correct. > I can pretty much guarantee that every single developer on this list has a > machine with excessive amounts of memory compared to what the machine is > actually required to do. And I just do not think that is true in general. If this is the concern about general use, it's enough to make the default: echo madvise >/sys/kernel/mm/transparent_hugepage/enabled and then only madvise(MADV_HUGEPAGE) regions (like qemu guest physical memory) will use it, and khugepaged will _only_ scan madvise regions. That guarantees zero RAM waste, and even a 128M embedded system definitely should enable it and take advantage of it to squeeze a few cycles away from a slow CPU. It's a one-liner change. I should make the default selectable at kernel config time, so developers can keep it =always and distros can set it =madvise (trivial to switch to "always" during boot or with the kernel command line). Right now it's =always also to give it more testing btw. Also note about glibc: our target is to replace libhugetlbfs, practically making what libhugetlbfs does the default. Applications that call mmap without passing through malloc, or that use libs that can't be overridden, can't take advantage of libhugetlbfs either, so that's ok. If somebody scatters 4k mappings all over the virtual address space of a task, I don't want to allocate 2M pages for those 4k virtual mappings (even if it'd be possible to reclaim them pretty fast without I/O), though even that is theoretically possible. I just prefer to have a glibc that cooperates, just like libhugetlbfs cooperates with hugetlbfs. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
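To make the =madvise policy concrete, here is a minimal user-space sketch of the opt-in Andrea describes. MADV_HUGEPAGE only exists on kernels carrying this patchset; the fallback value 14 below (the value mainline later settled on) is an assumption for illustration, and the call simply fails with EINVAL on kernels without it:

/* Hedged sketch: opt one large anonymous mapping into transparent
 * hugepages, the way qemu would for guest RAM under enabled=madvise.
 * MADV_HUGEPAGE requires the THP patches; 14 is illustrative. */
#include <stdio.h>
#include <sys/mman.h>

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14
#endif

#define REGION_SIZE (256UL * 1024 * 1024)	/* 256M, a multiple of 2M */

int main(void)
{
	void *ram = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (ram == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* With enabled=madvise, only regions flagged like this are
	 * eligible for hugepage faults and khugepaged collapsing. */
	if (madvise(ram, REGION_SIZE, MADV_HUGEPAGE))
		perror("madvise(MADV_HUGEPAGE)");
	return 0;
}

Regions not flagged this way keep 4k pages under that policy, which is what makes the zero-RAM-waste default possible.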
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with SMTP id A736E6B01EF for ; Mon, 12 Apr 2010 13:47:22 -0400 (EDT) Message-ID: <4BC35C61.7010201@redhat.com> Date: Mon, 12 Apr 2010 13:46:09 -0400 From: Rik van Riel MIME-Version: 1.0 Subject: Re: hugepages will matter more in the future References: <4BC19916.20100@redhat.com> <20100411110015.GA10149@elte.hu> <4BC1B034.4050302@redhat.com> <20100411115229.GB10952@elte.hu> <4BC1EE13.7080702@redhat.com> <4BC34837.7020108@redhat.com> <20100412173632.GB5583@random.random> In-Reply-To: <20100412173632.GB5583@random.random> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Andrea Arcangeli Cc: Linus Torvalds , Avi Kivity , Ingo Molnar , Jason Garrett-Glaser , Mike Galbraith , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura , Arjan van de Ven List-ID: On 04/12/2010 01:36 PM, Andrea Arcangeli wrote: > I should make the default selectable at kernel config time, so > developers can keep it =always and distro can set it =madvise (trivial > to switch to "always" during boot or with kernel command line). Right > now it's =always also to give it more testing btw. That still means the code will not benefit most applications. Surely a more benign default behaviour is possible? For example, instantiating hugepages on pagefault only in VMAs that are significantly larger than a hugepage (say, 16MB or larger?) and not VM_GROWSDOWN (stack starts small). We can still collapse the small pages into a large page if the process starts actually using the memory in the VMA. Memory use is a serious concern for some people, even people who could really benefit from the hugepages. For example, my home desktop system has 12GB RAM, but also runs 3 production virtual machines (kernelnewbies, PSBL, etc) and often has a test virtual machine as well. Not wasting memory is important, since the system is constantly doing disk IO. Any memory that is taken away from the page cache could hurt things. On the other hand, speeding up the virtual machines by 6% could be a big help too... I'd like to think we can find a way to get the best of both worlds. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
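A kernel-style sketch may make Rik's proposed middle ground concrete; the helper name and the 16M cutoff are hypothetical, not code from the patchset:

/* Hedged sketch of the heuristic proposed above: instantiate
 * hugepages at fault time only for VMAs much bigger than a hugepage
 * and never for growing stacks. Names and the cutoff are invented. */
#define HPAGE_PMD_SIZE		(2UL << 20)		/* 2M */
#define HUGEPAGE_MIN_VMA_SIZE	(8 * HPAGE_PMD_SIZE)	/* 16M */

static inline int hugepage_vma_suitable(struct vm_area_struct *vma)
{
	if (vma->vm_flags & VM_GROWSDOWN)	/* stack starts small */
		return 0;
	return vma->vm_end - vma->vm_start >= HUGEPAGE_MIN_VMA_SIZE;
}

Under such a policy khugepaged would still be free to collapse smaller VMAs later, once the process actually uses the memory, which is the "best of both worlds" Rik is asking for.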
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id 8A7346B0202 for ; Mon, 12 Apr 2010 23:41:47 -0400 (EDT) Date: Mon, 12 Apr 2010 20:38:29 -0400 From: Andrew Morton Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-Id: <20100412203829.871f1dee.akpm@linux-foundation.org> In-Reply-To: <4BC2EFBA.5080404@redhat.com> References: <4BC0DE84.3090305@redhat.com> <20100411104608.GA12828@elte.hu> <4BC1B2CA.8050208@redhat.com> <20100411120800.GC10952@elte.hu> <20100412060931.GP5683@laptop> <4BC2BF67.80903@redhat.com> <20100412071525.GR5683@laptop> <4BC2CF8C.5090108@redhat.com> <20100412082844.GU5683@laptop> <4BC2E1D6.9040702@redhat.com> <20100412092615.GY5683@laptop> <4BC2EFBA.5080404@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Avi Kivity Cc: Nick Piggin , Ingo Molnar , Mike Galbraith , Jason Garrett-Glaser , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Mon, 12 Apr 2010 13:02:34 +0300 Avi Kivity wrote: > The only scenario I can see where it degrades is that you have a dcache > load that spills over to all of memory, then falls back leaving a pinned > page in every huge frame. It can happen, but I don't see it as a likely > scenario. But maybe I'm missing something. This used to happen fairly easily. You have a directory tree and some app which walks down and across it, stat()ing regular files therein. So you end up with dentries and inodes which are laid out in memory as dir-file-file-file-file-...-file-dir-file-... Then the file dentries/inodes get reclaimed and you're left with a sparse collection of directory dcache/icache entries - massively fragmented. I forget _why_ it happened. Perhaps because S_ISREG cache items aren't pinned by anything, but S_ISDIR cache items are pinned by their children so it takes many more expiry rounds to get rid of them. There was talk about fixing this, perhaps by using different slab caches for dirs vs files. Hard, because the type of the file/inode isn't known at allocation time. Nothing happened about it. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id 5FE106B0211 for ; Tue, 13 Apr 2010 02:18:38 -0400 (EDT) Date: Tue, 13 Apr 2010 16:18:02 +1000 From: Neil Brown Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100413161802.498336ca@notabene.brown> In-Reply-To: <20100412203829.871f1dee.akpm@linux-foundation.org> References: <4BC0DE84.3090305@redhat.com> <20100411104608.GA12828@elte.hu> <4BC1B2CA.8050208@redhat.com> <20100411120800.GC10952@elte.hu> <20100412060931.GP5683@laptop> <4BC2BF67.80903@redhat.com> <20100412071525.GR5683@laptop> <4BC2CF8C.5090108@redhat.com> <20100412082844.GU5683@laptop> <4BC2E1D6.9040702@redhat.com> <20100412092615.GY5683@laptop> <4BC2EFBA.5080404@redhat.com> <20100412203829.871f1dee.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Avi Kivity , Nick Piggin , Ingo Molnar , Mike Galbraith , Jason Garrett-Glaser , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Mon, 12 Apr 2010 20:38:29 -0400 Andrew Morton wrote: > On Mon, 12 Apr 2010 13:02:34 +0300 Avi Kivity wrote: > > > The only scenario I can see where it degrades is that you have a dcache > > load that spills over to all of memory, then falls back leaving a pinned > > page in every huge frame. It can happen, but I don't see it as a likely > > scenario. But maybe I'm missing something. > > > > This used to happen fairly easily. You have a directory tree and some > app which walks down and across it, stat()ing regular files therein. > So you end up with dentries and inodes which are laid out in memory as > dir-file-file-file-file-...-file-dir-file-... Then the file > dentries/inodes get reclaimed and you're left with a sparse collection > of directory dcache/icache entries - massively fragmented. > > I forget _why_ it happened. Perhaps because S_ISREG cache items aren't > pinned by anything, but S_ISDIR cache items are pinned by their children > so it takes many more expiry rounds to get rid of them. > > There was talk about fixing this, perhaps by using different slab > caches for dirs vs files. Hard, because the type of the file/inode > isn't known at allocation time. Nothing happened about it. Actually I don't think that would be hard at all. ->lookup can return a different dentry than the one passed in, usually using d_splice_alias to find it. So when you create an inode for a directory, create an anonymous dentry, attach it via i_dentry, and it should "just work". That is assuming this is still a "problem" that needs to be "fixed". NeilBrown -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
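For readers less familiar with the VFS detail Neil is relying on: d_splice_alias() lets ->lookup hand back a different dentry than the one it was called with, which is what would let directory inodes carry pre-created anonymous dentries. A hedged sketch of the idiom, with myfs_* names invented for illustration:

/* Hedged sketch of the ->lookup idiom mentioned above: return the
 * dentry chosen by d_splice_alias(), which may splice in an alias
 * already attached to the inode. myfs_find_inode() is hypothetical. */
static struct dentry *myfs_lookup(struct inode *dir,
				  struct dentry *dentry,
				  struct nameidata *nd)
{
	struct inode *inode = myfs_find_inode(dir, &dentry->d_name);

	if (IS_ERR(inode))
		return ERR_CAST(inode);
	/* NULL inode gives a negative dentry; otherwise the caller
	 * gets back whichever dentry d_splice_alias() picked. */
	return d_splice_alias(inode, dentry);
}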
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id B53DE6B01EF for ; Tue, 13 Apr 2010 07:39:39 -0400 (EDT) Date: Tue, 13 Apr 2010 13:38:25 +0200 From: Ingo Molnar Subject: Re: hugepages will matter more in the future Message-ID: <20100413113825.GD19757@elte.hu> References: <4BC0E556.30304@redhat.com> <4BC19663.8080001@redhat.com> <4BC19916.20100@redhat.com> <20100411110015.GA10149@elte.hu> <4BC1B034.4050302@redhat.com> <20100411115229.GB10952@elte.hu> <20100412042230.5d974e5d@infradead.org> <20100412133019.GZ5656@random.random> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100412133019.GZ5656@random.random> Sender: owner-linux-mm@kvack.org To: Andrea Arcangeli Cc: Arjan van de Ven , Avi Kivity , Jason Garrett-Glaser , Mike Galbraith , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: * Andrea Arcangeli wrote: > On Mon, Apr 12, 2010 at 04:22:30AM -0700, Arjan van de Ven wrote: > > > > Now hugepages have some interesting other advantages, namely they save > > pagetable memory..which for something like TPC-C on a fork based database > > can be a measureable win. > > It doesn't save pagetable memory (as in `grep MemFree /proc/meminfo`). [...] It does save in terms of CPU cache footprint. (which the argument was about) The RAM is wasted, but are always cache cold. > [...] I think the saving in pagetables isn't really interesting... [...] i think it's very much interesting for 'pure' hugetlb mappings, as a next-step thing. It amounts to 8 bytes wasted per 4K page [0.2% of RAM wasted] - much more with the kind of aliasing that DBs frequently do - for hugetlb workloads it is basically roughly equivalent to a +8 bytes increase in struct page size - few MM hackers would accept that. So it will have to be fixed down the line. Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
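Ingo's 0.2% figure is easy to verify; assuming x86-64's 8-byte pagetable entries:

\[ \frac{8\,\mathrm{B\ (PTE)}}{4\,\mathrm{KB\ mapped}} \approx 0.2\%\ \text{of RAM}, \qquad \frac{8\,\mathrm{B\ (PMD\ entry)}}{2\,\mathrm{MB\ mapped}} \approx 0.0004\%. \]

So a fully mapped 4GB of anonymous memory costs 8MB in last-level pagetables with 4k pages, versus 16KB of pmd entries if the preallocated ptes could eventually be dropped.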
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id 73A076B01E3 for ; Tue, 13 Apr 2010 09:18:21 -0400 (EDT) Date: Tue, 13 Apr 2010 15:17:05 +0200 From: Andrea Arcangeli Subject: Re: hugepages will matter more in the future Message-ID: <20100413131705.GK5583@random.random> References: <4BC0E556.30304@redhat.com> <4BC19663.8080001@redhat.com> <4BC19916.20100@redhat.com> <20100411110015.GA10149@elte.hu> <4BC1B034.4050302@redhat.com> <20100411115229.GB10952@elte.hu> <20100412042230.5d974e5d@infradead.org> <20100412133019.GZ5656@random.random> <20100413113825.GD19757@elte.hu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100413113825.GD19757@elte.hu> Sender: owner-linux-mm@kvack.org To: Ingo Molnar Cc: Arjan van de Ven , Avi Kivity , Jason Garrett-Glaser , Mike Galbraith , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Tue, Apr 13, 2010 at 01:38:25PM +0200, Ingo Molnar wrote: > > * Andrea Arcangeli wrote: > > > On Mon, Apr 12, 2010 at 04:22:30AM -0700, Arjan van de Ven wrote: > > > > > > Now hugepages have some interesting other advantages, namely they save > > > pagetable memory..which for something like TPC-C on a fork based database > > > can be a measureable win. > > > > It doesn't save pagetable memory (as in `grep MemFree /proc/meminfo`). [...] > > It does save in terms of CPU cache footprint. (which the argument was about) > The RAM is wasted, but are always cache cold. Definitely, thanks for further clarifying this, and this is why I've been careful to specify "as in `grep MemFree..". > i think it's very much interesting for 'pure' hugetlb mappings, as a next-step > thing. It amounts to 8 bytes wasted per 4K page [0.2% of RAM wasted] - much > more with the kind of aliasing that DBs frequently do - for hugetlb workloads > it is basically roughly equivalent to a +8 bytes increase in struct page size > - few MM hackers would accept that. > > So it will have to be fixed down the line. It's exactly 4k wasted for each pmd set as pmd_trans_huge. Removing the pagetable preallocation will be absolutely trivial as far as huge_memory.c is concerned (takes like 1 minute of hacking), and in fact it simplifies a bit of the code; what will not be trivial is handling the -ENOMEM retval from every place that calls split_huge_page_pmd, which we can definitely address down the line (ideally by removing split_huge_page_pmd). The other benefit the current preallocation provides is that it doesn't increase requirements from the PF_MEMALLOC pool: until we can swap hugepages natively with a huge-swapcache, in order to swap we need to allocate the pte. Whoever tried this before (Dave, IIRC) answered a few emails ago that he also had to preallocate the pte to avoid running into the above issue. When he said that, it further confirmed to me that it's worth going this way initially. 
Also note: we're not wasting memory compared to when the pmd is not huge; we just don't take advantage of the full potential of hugepages, in order to keep things more manageable initially. Thanks, Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with SMTP id E1F026B01E3 for ; Tue, 13 Apr 2010 09:32:39 -0400 (EDT) Date: Tue, 13 Apr 2010 15:31:54 +0200 From: Andrea Arcangeli Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100413133153.GO5583@random.random> References: <20100412060931.GP5683@laptop> <4BC2BF67.80903@redhat.com> <20100412071525.GR5683@laptop> <4BC2CF8C.5090108@redhat.com> <20100412082844.GU5683@laptop> <4BC2E1D6.9040702@redhat.com> <20100412092615.GY5683@laptop> <4BC2EFBA.5080404@redhat.com> <20100412203829.871f1dee.akpm@linux-foundation.org> <20100413161802.498336ca@notabene.brown> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100413161802.498336ca@notabene.brown> Sender: owner-linux-mm@kvack.org To: Neil Brown Cc: Andrew Morton , Avi Kivity , Nick Piggin , Ingo Molnar , Mike Galbraith , Jason Garrett-Glaser , Linus Torvalds , Pekka Enberg , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: Hi Neil! On Tue, Apr 13, 2010 at 04:18:02PM +1000, Neil Brown wrote: > Actually I don't think that would be hard at all. > ->lookup can return a different dentry than the one passed in, usually using > d_splice_alias to find it. > So when you create an inode for a directory, create an anonymous dentry, > attach it via i_dentry, and it should "just work". > That is assuming this is still a "problem" that needs to be "fixed". I'm not sure changing the slab object will make a whole lot of difference, because antifrag will treat all unmovable stuff the same. To make a difference, directories would have to go in a different 2M page from the inodes, and that would require changes to the slab code to achieve, I guess. However, while I doubt it helps with hugepage fragmentation because of the above, it still sounds like a good idea to provide more "free memory" to the system with less effort and while preserving more cache. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
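The preallocation Andrea defends two mails up can be pictured as a deposit/withdraw pair on the mm, taken under page_table_lock. This is a simplified, hedged sketch with invented helper names; the patchset itself keeps one preallocated pagetable per mapped hugepage rather than the single slot shown here:

/* Hedged sketch of the no-fail-path guarantee discussed above: a pte
 * page is set aside when the hugepage is mapped and handed back when
 * the pmd must be split, so splitting never allocates memory. */
static void huge_pmd_deposit_pte(struct mm_struct *mm, pgtable_t pgtable)
{
	/* caller holds mm->page_table_lock */
	mm->pmd_huge_pte = pgtable;
}

static pgtable_t huge_pmd_withdraw_pte(struct mm_struct *mm)
{
	/* caller holds mm->page_table_lock; the pte was charged at
	 * hugepage fault time, so a split cannot hit -ENOMEM */
	pgtable_t pgtable = mm->pmd_huge_pte;

	mm->pmd_huge_pte = NULL;
	return pgtable;
}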
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with ESMTP id 6FB736B01E3 for ; Tue, 13 Apr 2010 09:40:59 -0400 (EDT) Date: Tue, 13 Apr 2010 14:40:35 +0100 From: Mel Gorman Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100413134035.GY25756@csn.ul.ie> References: <4BC2BF67.80903@redhat.com> <20100412071525.GR5683@laptop> <4BC2CF8C.5090108@redhat.com> <20100412082844.GU5683@laptop> <4BC2E1D6.9040702@redhat.com> <20100412092615.GY5683@laptop> <4BC2EFBA.5080404@redhat.com> <20100412203829.871f1dee.akpm@linux-foundation.org> <20100413161802.498336ca@notabene.brown> <20100413133153.GO5583@random.random> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20100413133153.GO5583@random.random> Sender: owner-linux-mm@kvack.org To: Andrea Arcangeli Cc: Neil Brown , Andrew Morton , Avi Kivity , Nick Piggin , Ingo Molnar , Mike Galbraith , Jason Garrett-Glaser , Linus Torvalds , Pekka Enberg , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Tue, Apr 13, 2010 at 03:31:54PM +0200, Andrea Arcangeli wrote: > Hi Neil! > > On Tue, Apr 13, 2010 at 04:18:02PM +1000, Neil Brown wrote: > > Actually I don't think that would be hard at all. > > ->lookup can return a different dentry than the one passed in, usually using > > d_splice_alias to find it. > > So when you create an inode for a directory, create an anonymous dentry, > > attach it via i_dentry, and it should "just work". > > That is assuming this is still a "problem" that needs to be "fixed". > > I'm not sure if changing the slab object will make a whole lot of > difference, because antifrag will threat all unmovable stuff the > same. Anti-frag considers reclaimable slab caches to be different to unmovable allocations. Slabs with SLAB_RECLAIM_ACCOUNT set use the __GFP_RECLAIMABLE flag. It was to keep truly unmovable allocations in the same 2M pages where possible. It also means that even with large bursts of kernel allocations due to big filesystem loads, the system will still get some of those 2M blocks back eventually when slab ages and shrinks. You can use /proc/pagetypeinfo to get a count of the 2M blocks of each type for different types of workloads to see what the scenarios look like from an anti-frag and compaction perspective, but very loosely speaking, with compaction applied, you'd expect to be able to convert all "Movable" blocks to huge pages by either compacting or paging. You'll get some of the "Reclaimable" blocks if slab is shrunk enough; the unmovable blocks depend on how many of the allocations are due to pagetables. > To make a difference directories should go in a different 2M > page of the inodes, and that would require changes to the slab code to > achieve I guess. > > However while I doubt it helps with hugepage fragmentation because of > the above, it still sounds a good idea to provide more "free memory" > to the system with less effort and while preserving more cache. 
> -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with SMTP id 528216B01E3 for ; Tue, 13 Apr 2010 09:46:16 -0400 (EDT) Date: Tue, 13 Apr 2010 15:44:56 +0200 From: Andrea Arcangeli Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100413134456.GQ5583@random.random> References: <20100412071525.GR5683@laptop> <4BC2CF8C.5090108@redhat.com> <20100412082844.GU5683@laptop> <4BC2E1D6.9040702@redhat.com> <20100412092615.GY5683@laptop> <4BC2EFBA.5080404@redhat.com> <20100412203829.871f1dee.akpm@linux-foundation.org> <20100413161802.498336ca@notabene.brown> <20100413133153.GO5583@random.random> <20100413134035.GY25756@csn.ul.ie> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100413134035.GY25756@csn.ul.ie> Sender: owner-linux-mm@kvack.org To: Mel Gorman Cc: Neil Brown , Andrew Morton , Avi Kivity , Nick Piggin , Ingo Molnar , Mike Galbraith , Jason Garrett-Glaser , Linus Torvalds , Pekka Enberg , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Tue, Apr 13, 2010 at 02:40:35PM +0100, Mel Gorman wrote: > On Tue, Apr 13, 2010 at 03:31:54PM +0200, Andrea Arcangeli wrote: > > Hi Neil! > > > > On Tue, Apr 13, 2010 at 04:18:02PM +1000, Neil Brown wrote: > > > Actually I don't think that would be hard at all. > > > ->lookup can return a different dentry than the one passed in, usually using > > > d_splice_alias to find it. > > > So when you create an inode for a directory, create an anonymous dentry, > > > attach it via i_dentry, and it should "just work". > > > That is assuming this is still a "problem" that needs to be "fixed". > > > > I'm not sure if changing the slab object will make a whole lot of > > difference, because antifrag will threat all unmovable stuff the > > same. > > Anti-frag considers reclaimable slab caches to be different to unmovable > allocations. Slabs with the SLAB_RECLAIM_ACCOUNT use the __GFP_RECLAIMABLE > flag. It was to keep truly unmovable allocations in the same 2M pages where > possible. As long as we keep the reclaimable separated from the "movable" that's fine. > It also means that even with large bursts of kernel allocations due to big > filesystem loads, the system will still get some of those 2M blocks back > eventually when slab eventually ages and shrinks. Only if the file isn't open... it's not really certain it's reclaimable. > You can use /proc/pagetypeinfo to get a count of the 2M blocks of each > type for different types of workloads to see what the scenarios look like > from an anti-frag and compaction perspective but very loosly speaking, > with compaction applied, you'd expect to be able to covert all "Movable" > blocks to huge pages by either compacting or paging. 
You'll get some of the > "Reclaimable" blocks if slab is shrunk enough the unmovable blocks depends > on how many of the allocations are due to pagetables. Awesome statistic! -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id 85AE86B01E3 for ; Tue, 13 Apr 2010 09:56:05 -0400 (EDT) Date: Tue, 13 Apr 2010 14:55:43 +0100 From: Mel Gorman Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100413135543.GA25756@csn.ul.ie> References: <4BC2CF8C.5090108@redhat.com> <20100412082844.GU5683@laptop> <4BC2E1D6.9040702@redhat.com> <20100412092615.GY5683@laptop> <4BC2EFBA.5080404@redhat.com> <20100412203829.871f1dee.akpm@linux-foundation.org> <20100413161802.498336ca@notabene.brown> <20100413133153.GO5583@random.random> <20100413134035.GY25756@csn.ul.ie> <20100413134456.GQ5583@random.random> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20100413134456.GQ5583@random.random> Sender: owner-linux-mm@kvack.org To: Andrea Arcangeli Cc: Neil Brown , Andrew Morton , Avi Kivity , Nick Piggin , Ingo Molnar , Mike Galbraith , Jason Garrett-Glaser , Linus Torvalds , Pekka Enberg , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Tue, Apr 13, 2010 at 03:44:56PM +0200, Andrea Arcangeli wrote: > On Tue, Apr 13, 2010 at 02:40:35PM +0100, Mel Gorman wrote: > > On Tue, Apr 13, 2010 at 03:31:54PM +0200, Andrea Arcangeli wrote: > > > Hi Neil! > > > > > > On Tue, Apr 13, 2010 at 04:18:02PM +1000, Neil Brown wrote: > > > > Actually I don't think that would be hard at all. > > > > ->lookup can return a different dentry than the one passed in, usually using > > > > d_splice_alias to find it. > > > > So when you create an inode for a directory, create an anonymous dentry, > > > > attach it via i_dentry, and it should "just work". > > > > That is assuming this is still a "problem" that needs to be "fixed". > > > > > > I'm not sure if changing the slab object will make a whole lot of > > > difference, because antifrag will threat all unmovable stuff the > > > same. > > > > Anti-frag considers reclaimable slab caches to be different to unmovable > > allocations. Slabs with the SLAB_RECLAIM_ACCOUNT use the __GFP_RECLAIMABLE > > flag. It was to keep truly unmovable allocations in the same 2M pages where > > possible. > > As long as we keep the reclaimable separated from the "movable" that's > fine. > That already happens. > > It also means that even with large bursts of kernel allocations due to big > > filesystem loads, the system will still get some of those 2M blocks back > > eventually when slab eventually ages and shrinks. > > Only if the file isn't open... it's not really certain it's reclaimable. > True. Christoph made a few stabs at being able to slab targetted reclaim (called defragmentation, but it was about reclaim) but it was never completed and merged. 
Even if it was merged, the slab reclaimable objects would still be kept in their own 2M pageblocks though. > > You can use /proc/pagetypeinfo to get a count of the 2M blocks of each > > type for different types of workloads to see what the scenarios look like > > from an anti-frag and compaction perspective but very loosly speaking, > > with compaction applied, you'd expect to be able to covert all "Movable" > > blocks to huge pages by either compacting or paging. You'll get some of the > > "Reclaimable" blocks if slab is shrunk enough the unmovable blocks depends > > on how many of the allocations are due to pagetables. > > Awesome statistic! > -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with SMTP id 9BF896B01E3 for ; Tue, 13 Apr 2010 10:05:21 -0400 (EDT) Date: Tue, 13 Apr 2010 16:03:30 +0200 From: Andrea Arcangeli Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100413140330.GR5583@random.random> References: <20100412082844.GU5683@laptop> <4BC2E1D6.9040702@redhat.com> <20100412092615.GY5683@laptop> <4BC2EFBA.5080404@redhat.com> <20100412203829.871f1dee.akpm@linux-foundation.org> <20100413161802.498336ca@notabene.brown> <20100413133153.GO5583@random.random> <20100413134035.GY25756@csn.ul.ie> <20100413134456.GQ5583@random.random> <20100413135543.GA25756@csn.ul.ie> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100413135543.GA25756@csn.ul.ie> Sender: owner-linux-mm@kvack.org To: Mel Gorman Cc: Neil Brown , Andrew Morton , Avi Kivity , Nick Piggin , Ingo Molnar , Mike Galbraith , Jason Garrett-Glaser , Linus Torvalds , Pekka Enberg , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Rik van Riel , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Tue, Apr 13, 2010 at 02:55:43PM +0100, Mel Gorman wrote: > That already happens. Yep as shown by /proc/pagetypeinfo. > True. Christoph made a few stabs at being able to slab targetted reclaim > (called defragmentation, but it was about reclaim) but it was never completed > and merged. Even if it was merged, the slab reclaimable objects would > still be kept in their own 2M pageblocks though. I guess it's not easy and more expensive to reclaim in use object, I didn't see the targetted reclaim patches. So it sounds ok if they stay in their own pageblocks separated from the movable pageblocks, even if they become fully reclaimable. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
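Mechanically, the "reclaimable" grouping Mel describes is opt-in per slab cache: a cache created with SLAB_RECLAIM_ACCOUNT gets its backing pages allocated __GFP_RECLAIMABLE, so they land in the "Reclaimable" pageblocks that /proc/pagetypeinfo reports. A minimal sketch, with the myfs_* names invented for illustration:

/* Hedged sketch: a shrinkable cache marked SLAB_RECLAIM_ACCOUNT so
 * anti-frag groups its pages into "Reclaimable" pageblocks instead
 * of polluting "Unmovable" ones. myfs_* names are made up. */
#include <linux/slab.h>

struct myfs_inode_info {
	unsigned long flags;
	/* ... per-inode state ... */
};

static struct kmem_cache *myfs_inode_cachep;

static int __init myfs_init_inodecache(void)
{
	myfs_inode_cachep = kmem_cache_create("myfs_inode_cache",
					      sizeof(struct myfs_inode_info),
					      0,
					      SLAB_RECLAIM_ACCOUNT,
					      NULL);
	return myfs_inode_cachep ? 0 : -ENOMEM;
}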
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with ESMTP id 02E456B01EF for ; Sat, 17 Apr 2010 11:11:20 -0400 (EDT) Date: Sat, 17 Apr 2010 08:12:18 -0700 From: Arjan van de Ven Subject: Re: hugepages will matter more in the future Message-ID: <20100417081218.4160f36b@infradead.org> In-Reply-To: <4BC30436.8070001@redhat.com> References: <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <4BC0E2C4.8090101@redhat.com> <4BC0E556.30304@redhat.com> <4BC19663.8080001@redhat.com> <4BC19916.20100@redhat.com> <20100411110015.GA10149@elte.hu> <4BC1B034.4050302@redhat.com> <20100411115229.GB10952@elte.hu> <20100412042230.5d974e5d@infradead.org> <4BC30436.8070001@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Avi Kivity Cc: Ingo Molnar , Jason Garrett-Glaser , Mike Galbraith , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Mon, 12 Apr 2010 14:29:58 +0300 Avi Kivity wrote: > Pointer chasing defeats OoO. The cpu is limited in the amount of > speculation it can do. Pointer chasing defeats the CPU cache as well. As long as the CPU cache contains, mostly, the page tables for all the data in the cache, applications that try to work good with a cache don't notice too much. Sure, once you start doing pointer chasing cache misses things suck. they do very much so. -- Arjan van de Ven Intel Open Source Technology Centre For development, discussion and tips for power savings, visit http://www.lesswatts.org -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id 570596B01EF for ; Sat, 17 Apr 2010 14:19:28 -0400 (EDT) Message-ID: <4BC9FB64.5040009@redhat.com> Date: Sat, 17 Apr 2010 21:18:12 +0300 From: Avi Kivity MIME-Version: 1.0 Subject: Re: hugepages will matter more in the future References: <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <4BC0E2C4.8090101@redhat.com> <4BC0E556.30304@redhat.com> <4BC19663.8080001@redhat.com> <4BC19916.20100@redhat.com> <20100411110015.GA10149@elte.hu> <4BC1B034.4050302@redhat.com> <20100411115229.GB10952@elte.hu> <20100412042230.5d974e5d@infradead.org> <4BC30436.8070001@redhat.com> <20100417081218.4160f36b@infradead.org> In-Reply-To: <20100417081218.4160f36b@infradead.org> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Arjan van de Ven Cc: Ingo Molnar , Jason Garrett-Glaser , Mike Galbraith , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On 04/17/2010 06:12 PM, Arjan van de Ven wrote: > On Mon, 12 Apr 2010 14:29:58 +0300 > Avi Kivity wrote: > >> Pointer chasing defeats OoO. The cpu is limited in the amount of >> speculation it can do. >> > Pointer chasing defeats the CPU cache as well. > True. > As long as the CPU cache contains, mostly, the page tables for all the > data in the cache, applications that try to work good with a cache > don't notice too much. Sure, once you start doing pointer chasing cache > misses things suck. they do very much so. > Correct. We're trying to reduce suckage from 2 cache misses per access (3 for virt), to 1 cache miss per access. We're also freeing up space in the cache for data. Saying the application already sucks isn't helping anything. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
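A back-of-envelope number behind the two-misses-to-one claim, assuming an illustrative 1GB randomly accessed working set and 8-byte entries:

\[ \underbrace{\frac{1\,\mathrm{GB}}{4\,\mathrm{KB}} \times 8\,\mathrm{B}}_{\text{PTEs}} = 2\,\mathrm{MB}, \qquad \underbrace{\frac{1\,\mathrm{GB}}{2\,\mathrm{MB}} \times 8\,\mathrm{B}}_{\text{PMD entries}} = 4\,\mathrm{KB}. \]

2MB of PTEs competes with the data itself for cache, so a random access tends to miss once for the translation and once for the data; 4KB of pmd entries stays resident, leaving only the data miss.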
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with ESMTP id A7C396B01EF for ; Sat, 17 Apr 2010 15:04:30 -0400 (EDT) Date: Sat, 17 Apr 2010 12:05:31 -0700 From: Arjan van de Ven Subject: Re: hugepages will matter more in the future Message-ID: <20100417120531.0b86e959@infradead.org> In-Reply-To: <4BC9FB64.5040009@redhat.com> References: <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <4BC0E2C4.8090101@redhat.com> <4BC0E556.30304@redhat.com> <4BC19663.8080001@redhat.com> <4BC19916.20100@redhat.com> <20100411110015.GA10149@elte.hu> <4BC1B034.4050302@redhat.com> <20100411115229.GB10952@elte.hu> <20100412042230.5d974e5d@infradead.org> <4BC30436.8070001@redhat.com> <20100417081218.4160f36b@infradead.org> <4BC9FB64.5040009@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Avi Kivity Cc: Ingo Molnar , Jason Garrett-Glaser , Mike Galbraith , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Sat, 17 Apr 2010 21:18:12 +0300 > > Correct. We're trying to reduce suckage from 2 cache misses per > access (3 for virt), to 1 cache miss per access. We're also freeing > up space in the cache for data. > > Saying the application already sucks isn't helping anything. but the guy who's writing the application will already optimize for this case... -- Arjan van de Ven Intel Open Source Technology Centre For development, discussion and tips for power savings, visit http://www.lesswatts.org -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id C6ABA6B01EF for ; Sat, 17 Apr 2010 15:06:45 -0400 (EDT) Message-ID: <4BCA067B.1020008@redhat.com> Date: Sat, 17 Apr 2010 22:05:31 +0300 From: Avi Kivity MIME-Version: 1.0 Subject: Re: hugepages will matter more in the future References: <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <4BC0E2C4.8090101@redhat.com> <4BC0E556.30304@redhat.com> <4BC19663.8080001@redhat.com> <4BC19916.20100@redhat.com> <20100411110015.GA10149@elte.hu> <4BC1B034.4050302@redhat.com> <20100411115229.GB10952@elte.hu> <20100412042230.5d974e5d@infradead.org> <4BC30436.8070001@redhat.com> <20100417081218.4160f36b@infradead.org> <4BC9FB64.5040009@redhat.com> <20100417120531.0b86e959@infradead.org> In-Reply-To: <20100417120531.0b86e959@infradead.org> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Arjan van de Ven Cc: Ingo Molnar , Jason Garrett-Glaser , Mike Galbraith , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On 04/17/2010 10:05 PM, Arjan van de Ven wrote: > On Sat, 17 Apr 2010 21:18:12 +0300 > >> Correct. We're trying to reduce suckage from 2 cache misses per >> access (3 for virt), to 1 cache miss per access. We're also freeing >> up space in the cache for data. >> >> Saying the application already sucks isn't helping anything. >> > but the guy who's writing the application will already optimize for > this case... > I lost you. What is he optimizing for? 4k pages? -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with ESMTP id EB9AE6B01EF for ; Sat, 17 Apr 2010 15:16:44 -0400 (EDT) Date: Sat, 17 Apr 2010 12:18:03 -0700 From: Arjan van de Ven Subject: Re: hugepages will matter more in the future Message-ID: <20100417121803.654fb34e@infradead.org> In-Reply-To: <4BCA067B.1020008@redhat.com> References: <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <4BC0E2C4.8090101@redhat.com> <4BC0E556.30304@redhat.com> <4BC19663.8080001@redhat.com> <4BC19916.20100@redhat.com> <20100411110015.GA10149@elte.hu> <4BC1B034.4050302@redhat.com> <20100411115229.GB10952@elte.hu> <20100412042230.5d974e5d@infradead.org> <4BC30436.8070001@redhat.com> <20100417081218.4160f36b@infradead.org> <4BC9FB64.5040009@redhat.com> <20100417120531.0b86e959@infradead.org> <4BCA067B.1020008@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Avi Kivity Cc: Ingo Molnar , Jason Garrett-Glaser , Mike Galbraith , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On Sat, 17 Apr 2010 22:05:31 +0300 Avi Kivity wrote: > On 04/17/2010 10:05 PM, Arjan van de Ven wrote: > > On Sat, 17 Apr 2010 21:18:12 +0300 > > > >> Correct. We're trying to reduce suckage from 2 cache misses per > >> access (3 for virt), to 1 cache miss per access. We're also > >> freeing up space in the cache for data. > >> > >> Saying the application already sucks isn't helping anything. > >> > > but the guy who's writing the application will already optimize for > > this case... > > > > I lost you. What is he optimizing for? 4k pages? not totally sucking on cache misses, eg trying to do data locality etc -- Arjan van de Ven Intel Open Source Technology Centre For development, discussion and tips for power savings, visit http://www.lesswatts.org -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with SMTP id 5AC136B01EF for ; Sat, 17 Apr 2010 15:21:15 -0400 (EDT) Message-ID: <4BCA09E3.9010003@redhat.com> Date: Sat, 17 Apr 2010 22:20:03 +0300 From: Avi Kivity MIME-Version: 1.0 Subject: Re: hugepages will matter more in the future References: <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <4BC0E2C4.8090101@redhat.com> <4BC0E556.30304@redhat.com> <4BC19663.8080001@redhat.com> <4BC19916.20100@redhat.com> <20100411110015.GA10149@elte.hu> <4BC1B034.4050302@redhat.com> <20100411115229.GB10952@elte.hu> <20100412042230.5d974e5d@infradead.org> <4BC30436.8070001@redhat.com> <20100417081218.4160f36b@infradead.org> <4BC9FB64.5040009@redhat.com> <20100417120531.0b86e959@infradead.org> <4BCA067B.1020008@redhat.com> <20100417121803.654fb34e@infradead.org> In-Reply-To: <20100417121803.654fb34e@infradead.org> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Arjan van de Ven Cc: Ingo Molnar , Jason Garrett-Glaser , Mike Galbraith , Andrea Arcangeli , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura List-ID: On 04/17/2010 10:18 PM, Arjan van de Ven wrote: > On Sat, 17 Apr 2010 22:05:31 +0300 > Avi Kivity wrote: > > >> On 04/17/2010 10:05 PM, Arjan van de Ven wrote: >> >>> On Sat, 17 Apr 2010 21:18:12 +0300 >>> >>> >>>> Correct. We're trying to reduce suckage from 2 cache misses per >>>> access (3 for virt), to 1 cache miss per access. We're also >>>> freeing up space in the cache for data. >>>> >>>> Saying the application already sucks isn't helping anything. >>>> >>>> >>> but the guy who's writing the application will already optimize for >>> this case... >>> >>> >> I lost you. What is he optimizing for? 4k pages? >> > not totally sucking on cache misses, eg trying to do data locality etc > Of course, but it's not always possible. Hence Java and Oracle (and Linux itself) try to map their data with large pages. Things like a garbage collector, an LRU, or large object trees that are traversed by semi-random input are hard/impossible to localize. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
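The access pattern being argued about is easy to reproduce. A self-contained pointer-chase sketch follows (sizes and step count are arbitrary); timing it under perf stat with transparent hugepages on and off shows the effect the thread is debating:

/* Hedged sketch: a dependent pointer chase, the GC/LRU/object-graph
 * pattern Avi argues cannot be localized. Each hop lands on a
 * different 4k page, so it can cost a TLB miss plus a cache miss,
 * and out-of-order execution cannot hide the serialized latency. */
#include <stdio.h>
#include <stdlib.h>

#define NODES  (8UL * 1024 * 1024)	/* 8M pointers = 64M working set */
#define STRIDE 100003UL			/* odd, so the walk is one full cycle */

int main(void)
{
	void **chase = malloc(NODES * sizeof(void *));
	size_t i, steps;
	void **p;

	if (!chase)
		return 1;
	/* Link slot i to slot (i + STRIDE) % NODES; gcd(STRIDE, NODES)
	 * is 1, so every slot is visited before the walk repeats. */
	for (i = 0; i < NODES; i++)
		chase[i] = &chase[(i + STRIDE) % NODES];
	p = (void **)chase[0];
	for (steps = 0; steps < 100000000UL; steps++)
		p = *p;
	printf("%p\n", (void *)p);	/* keep the chain observable */
	return 0;
}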
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id 303796B01EE for ; Sun, 25 Apr 2010 15:29:47 -0400 (EDT) Date: Sun, 25 Apr 2010 21:27:39 +0200 From: Andrea Arcangeli Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100425192739.GG5789@random.random> References: <20100406090813.GA14098@elte.hu> <20100410184750.GJ5708@random.random> <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <4BC0E2C4.8090101@redhat.com> <20100410204756.GR5708@random.random> <4BC0E6ED.7040100@redhat.com> <20100411010540.GW5708@random.random> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable In-Reply-To: <20100411010540.GW5708@random.random> Sender: owner-linux-mm@kvack.org To: Avi Kivity Cc: Ingo Molnar , Mike Galbraith , Jason Garrett-Glaser , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura , Ulrich Drepper List-ID: On Sun, Apr 11, 2010 at 03:05:40AM +0200, Andrea Arcangeli wrote: > With the above two params I get around 200M (around half) in > hugepages with gcc building translate.o: >=20 > $ rm translate.o ; time make translate.o > CC translate.o >=20 > real 0m22.900s > user 0m22.601s > sys 0m0.260s > $ rm translate.o ; time make translate.o > CC translate.o >=20 > real 0m22.405s > user 0m22.125s > sys 0m0.240s > # echo never > /sys/kernel/mm/transparent_hugepage/enabled > # exit > $ rm translate.o ; time make translate.o > CC translate.o >=20 > real 0m24.128s > user 0m23.725s > sys 0m0.376s > $ rm translate.o ; time make translate.o > CC translate.o >=20 > real 0m24.126s > user 0m23.725s > sys 0m0.376s > $ uptime > 02:36:07 up 1 day, 19:45, 5 users, load average: 0.01, 0.12, 0.08 >=20 > 1 sec in 24 means around 4% faster, hopefully when glibc will fully > cooperate we'll get better results than the above with gcc... >=20 > I tried to emulate it with khugepaged running in a loop and I get > almost the whole gcc anon memory in hugepages this way (as expected): >=20 > # echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_mill= isecs > # exit > rm translate.o ; time make translate.o > CC translate.o >=20 > real 0m21.950s > user 0m21.481s > sys 0m0.292s > $ rm translate.o ; time make translate.o > CC translate.o >=20 > real 0m21.992s > user 0m21.529s > sys 0m0.288s > $=20 >=20 > So this takes more than 2 seconds away from 24 seconds reproducibly, > and it means gcc now runs 8% faster. This requires running khugepaged > at 100% of one of the four cores but with a slight chance to glibc > we'll be able reach the exact same 8% speedup (or more because this > also involves copying ~200M and sending IPIs to unmap pages and stop > userland during the memory copy that won't be necessary anymore). >=20 > BTW, the current default for khugepaged is to scan 8 pmd every 10 > seconds, that means collapsing at most 16M every 10 seconds. 
> Checking 8 pmd pointers every 10 seconds and 6 wakeups per minute for a
> kernel thread is absolutely unmeasurable, but despite the unmeasurable
> overhead it provides for a very nice behavior for long lived
> allocations that may have been swapped in fragmented.
>
> This is on phenom X4, I'd be interested if somebody can try on other cpus.
>
> To get the environment of the test just:
>
> git clone git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git
> cd qemu-kvm
> make
> cd x86_64-softmmu
>
> export MALLOC_MMAP_THRESHOLD_=$[1024*1024*1024]
> export MALLOC_TOP_PAD_=$[1024*1024*1024]
> rm translate.o; time make translate.o
>
> Then you need to flip the above sysfs controls as I did.

I patched gcc with the few-liner change, without tweaking glibc, and with khugepaged killed at all times. The system already had heavy load from building glibc a couple of times plus my usual kernel build load for about 12 hours. Shutting down khugepaged isn't really necessary considering how slow the scan is, but I did it anyway.

$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
$ cat /sys/kernel/mm/transparent_hugepage/khugepaged/enabled
always madvise [never]
$ pgrep khugepaged
$ ~/bin/x86_64/perf stat -e cycles -e instructions -e dtlb-loads -e dtlb-load-misses -e l1-dcache-loads -e l1-dcache-load-misses --repeat 3 gcc -I/crypto/home/andrea/kernel/qemu-kvm/slirp -Werror -m64 -fstack-protector-all -Wold-style-definition -Wold-style-declaration -I. -I/crypto/home/andrea/kernel/qemu-kvm -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -Wstrict-prototypes -Wredundant-decls -Wall -Wundef -Wendif-labels -Wwrite-strings -Wmissing-prototypes -fno-strict-aliasing -DHAS_AUDIO -DHAS_AUDIO_CHOICE -I/crypto/home/andrea/kernel/qemu-kvm/fpu -I/crypto/home/andrea/kernel/qemu-kvm/tcg -I/crypto/home/andrea/kernel/qemu-kvm/tcg/x86_64 -DTARGET_PHYS_ADDR_BITS=64 -I.. -I/crypto/home/andrea/kernel/qemu-kvm/target-i386 -DNEED_CPU_H -MMD -MP -MT translate.o -O2 -g -I/crypto/home/andrea/kernel/qemu-kvm/kvm/include -include /crypto/home/andrea/kernel/qemu-kvm/kvm/include/linux/config.h -I/crypto/home/andrea/kernel/qemu-kvm/kvm/include/x86 -idirafter /crypto/home/andrea/kernel/qemu-kvm/compat -c -o translate.o /crypto/home/andrea/kernel/qemu-kvm/target-i386/translate.c

 Performance counter stats for 'gcc -I/crypto/home/andrea/kernel/qemu-kvm/slirp -Werror -m64 -fstack-protector-all -Wold-style-definition -Wold-style-declaration -I. -I/crypto/home/andrea/kernel/qemu-kvm -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -Wstrict-prototypes -Wredundant-decls -Wall -Wundef -Wendif-labels -Wwrite-strings -Wmissing-prototypes -fno-strict-aliasing -DHAS_AUDIO -DHAS_AUDIO_CHOICE -I/crypto/home/andrea/kernel/qemu-kvm/fpu -I/crypto/home/andrea/kernel/qemu-kvm/tcg -I/crypto/home/andrea/kernel/qemu-kvm/tcg/x86_64 -DTARGET_PHYS_ADDR_BITS=64 -I.. -I/crypto/home/andrea/kernel/qemu-kvm/target-i386 -DNEED_CPU_H -MMD -MP -MT translate.o -O2 -g -I/crypto/home/andrea/kernel/qemu-kvm/kvm/include -include /crypto/home/andrea/kernel/qemu-kvm/kvm/include/linux/config.h -I/crypto/home/andrea/kernel/qemu-kvm/kvm/include/x86 -idirafter /crypto/home/andrea/kernel/qemu-kvm/compat -c -o translate.o /crypto/home/andrea/kernel/qemu-kvm/target-i386/translate.c' (3 runs):

    55365925618  cycles                   ( +-  0.038% )  (scaled from 66.67%)
    36558135065  instructions             #  0.660 IPC    ( +-  0.061% )  (scaled from 66.66%)
    16103841974  dTLB-loads               ( +-  0.109% )  (scaled from 66.68%)
            823  dTLB-load-misses         ( +-  0.081% )  (scaled from 66.70%)
    16080393958  L1-dcache-loads          ( +-  0.030% )  (scaled from 66.69%)
      357523292  L1-dcache-load-misses    ( +-  0.099% )  (scaled from 66.68%)

   23.129143516  seconds time elapsed     ( +-  0.035% )

If I tweak glibc:

$ export MALLOC_TOP_PAD_=100000000
$ export MALLOC_MMAP_THRESHOLD_=1000000000
$ ~/bin/x86_64/perf stat -e cycles -e instructions -e dtlb-loads -e dtlb-load-misses -e l1-dcache-loads -e l1-dcache-load-misses --repeat 3 gcc -I/crypto/home/andrea/kernel/qemu-kvm/slirp -Werror -m64 -fstack-protector-all -Wold-style-definition -Wold-style-declaration -I. -I/crypto/home/andrea/kernel/qemu-kvm -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -Wstrict-prototypes -Wredundant-decls -Wall -Wundef -Wendif-labels -Wwrite-strings -Wmissing-prototypes -fno-strict-aliasing -DHAS_AUDIO -DHAS_AUDIO_CHOICE -I/crypto/home/andrea/kernel/qemu-kvm/fpu -I/crypto/home/andrea/kernel/qemu-kvm/tcg -I/crypto/home/andrea/kernel/qemu-kvm/tcg/x86_64 -DTARGET_PHYS_ADDR_BITS=64 -I.. -I/crypto/home/andrea/kernel/qemu-kvm/target-i386 -DNEED_CPU_H -MMD -MP -MT translate.o -O2 -g -I/crypto/home/andrea/kernel/qemu-kvm/kvm/include -include /crypto/home/andrea/kernel/qemu-kvm/kvm/include/linux/config.h -I/crypto/home/andrea/kernel/qemu-kvm/kvm/include/x86 -idirafter /crypto/home/andrea/kernel/qemu-kvm/compat -c -o translate.o /crypto/home/andrea/kernel/qemu-kvm/target-i386/translate.c

 Performance counter stats for 'gcc -I/crypto/home/andrea/kernel/qemu-kvm/slirp -Werror -m64 -fstack-protector-all -Wold-style-definition -Wold-style-declaration -I. -I/crypto/home/andrea/kernel/qemu-kvm -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -Wstrict-prototypes -Wredundant-decls -Wall -Wundef -Wendif-labels -Wwrite-strings -Wmissing-prototypes -fno-strict-aliasing -DHAS_AUDIO -DHAS_AUDIO_CHOICE -I/crypto/home/andrea/kernel/qemu-kvm/fpu -I/crypto/home/andrea/kernel/qemu-kvm/tcg -I/crypto/home/andrea/kernel/qemu-kvm/tcg/x86_64 -DTARGET_PHYS_ADDR_BITS=64 -I.. -I/crypto/home/andrea/kernel/qemu-kvm/target-i386 -DNEED_CPU_H -MMD -MP -MT translate.o -O2 -g -I/crypto/home/andrea/kernel/qemu-kvm/kvm/include -include /crypto/home/andrea/kernel/qemu-kvm/kvm/include/linux/config.h -I/crypto/home/andrea/kernel/qemu-kvm/kvm/include/x86 -idirafter /crypto/home/andrea/kernel/qemu-kvm/compat -c -o translate.o /crypto/home/andrea/kernel/qemu-kvm/target-i386/translate.c' (3 runs):

    52684457919  cycles                   ( +-  0.059% )  (scaled from 66.67%)
    36392861901  instructions             #  0.691 IPC    ( +-  0.130% )  (scaled from 66.68%)
    16014094544  dTLB-loads               ( +-  0.152% )  (scaled from 66.67%)
            784  dTLB-load-misses         ( +-  0.450% )  (scaled from 66.69%)
    16030576638  L1-dcache-loads          ( +-  0.161% )  (scaled from 66.70%)
      353904925  L1-dcache-load-misses    ( +-  0.510% )  (scaled from 66.68%)

   22.048837226  seconds time elapsed     ( +-  0.224% )

Then I disabled transparent hugepage (I left the glibc tweak in place in case anyone wonders: with those environment vars set fewer brk syscalls run, but without transparent hugepage it makes no difference either way).

$ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
$ set|grep MALLOC
MALLOC_MMAP_THRESHOLD_=1000000000
MALLOC_TOP_PAD_=100000000
_=MALLOC_TOP_PAD_
$ ~/bin/x86_64/perf stat -e cycles -e instructions -e dtlb-loads -e dtlb-load-misses -e l1-dcache-loads -e l1-dcache-load-misses --repeat 3 gcc -I/crypto/home/andrea/kernel/qemu-kvm/slirp -Werror -m64 -fstack-protector-all -Wold-style-definition -Wold-style-declaration -I. -I/crypto/home/andrea/kernel/qemu-kvm -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -Wstrict-prototypes -Wredundant-decls -Wall -Wundef -Wendif-labels -Wwrite-strings -Wmissing-prototypes -fno-strict-aliasing -DHAS_AUDIO -DHAS_AUDIO_CHOICE -I/crypto/home/andrea/kernel/qemu-kvm/fpu -I/crypto/home/andrea/kernel/qemu-kvm/tcg -I/crypto/home/andrea/kernel/qemu-kvm/tcg/x86_64 -DTARGET_PHYS_ADDR_BITS=64 -I.. -I/crypto/home/andrea/kernel/qemu-kvm/target-i386 -DNEED_CPU_H -MMD -MP -MT translate.o -O2 -g -I/crypto/home/andrea/kernel/qemu-kvm/kvm/include -include /crypto/home/andrea/kernel/qemu-kvm/kvm/include/linux/config.h -I/crypto/home/andrea/kernel/qemu-kvm/kvm/include/x86 -idirafter /crypto/home/andrea/kernel/qemu-kvm/compat -c -o translate.o /crypto/home/andrea/kernel/qemu-kvm/target-i386/translate.c

 Performance counter stats for 'gcc -I/crypto/home/andrea/kernel/qemu-kvm/slirp -Werror -m64 -fstack-protector-all -Wold-style-definition -Wold-style-declaration -I. -I/crypto/home/andrea/kernel/qemu-kvm -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -Wstrict-prototypes -Wredundant-decls -Wall -Wundef -Wendif-labels -Wwrite-strings -Wmissing-prototypes -fno-strict-aliasing -DHAS_AUDIO -DHAS_AUDIO_CHOICE -I/crypto/home/andrea/kernel/qemu-kvm/fpu -I/crypto/home/andrea/kernel/qemu-kvm/tcg -I/crypto/home/andrea/kernel/qemu-kvm/tcg/x86_64 -DTARGET_PHYS_ADDR_BITS=64 -I.. -I/crypto/home/andrea/kernel/qemu-kvm/target-i386 -DNEED_CPU_H -MMD -MP -MT translate.o -O2 -g -I/crypto/home/andrea/kernel/qemu-kvm/kvm/include -include /crypto/home/andrea/kernel/qemu-kvm/kvm/include/linux/config.h -I/crypto/home/andrea/kernel/qemu-kvm/kvm/include/x86 -idirafter /crypto/home/andrea/kernel/qemu-kvm/compat -c -o translate.o /crypto/home/andrea/kernel/qemu-kvm/target-i386/translate.c' (3 runs):

    58193692408  cycles                   ( +-  0.129% )  (scaled from 66.66%)
    36565168786  instructions             #  0.628 IPC    ( +-  0.052% )  (scaled from 66.68%)
    16098510972  dTLB-loads               ( +-  0.223% )  (scaled from 66.69%)
            867  dTLB-load-misses         ( +-  0.168% )  (scaled from 66.69%)
    16186049665  L1-dcache-loads          ( +-  0.112% )  (scaled from 66.69%)
      364792323  L1-dcache-load-misses    ( +-  0.145% )  (scaled from 66.66%)

   24.313032086  seconds time elapsed     ( +-  0.154% )

(24.31-22.04)/22.04 = 10.3% boost (or 9.3% faster if you divide by 24.31 ;).

Ulrich also sent me a snippet to align the region in glibc. I tried it, but it doesn't get faster than with the environment vars above, so the env vars are simpler than having to rebuild glibc for benchmarking (plus I was unsure whether the snippet really works as well as the two env variables, so I used an unmodified stock glibc for this test).

diff --git a/malloc/malloc.c b/malloc/malloc.c
index 722b1d4..b067b65 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -3168,6 +3168,10 @@ static Void_t* sYSMALLOc(nb, av) INTERNAL_SIZE_T nb; mstate av;
 
   size = nb + mp_.top_pad + MINSIZE;
 
+#define TWOM (2*1024*1024)
+  char *cur = (char*)MORECORE(0);
+  size = (char*)((size_t)(cur + size + TWOM - 1)&~(TWOM-1))-cur;
+
   /* If contiguous, we can subtract out existing space that we hope to
      combine with new space.  We add it back later only if

Now that the gcc on my workstation is hugepage-friendly, I can test a kernel compile and see if I get any boost with that too; before it was just impossible.

Also note: if you read ggc-page.c or glibc malloc.c you'll notice things like GGC_QUIRE_SIZE and all sorts of other alignment and multipage heuristics in there. So it's absolutely guaranteed that the moment the kernel gets transparent hugepages, they will add the few-liner change to get the guaranteed boost, at least for the 2M-sized allocations, just like they already rate-limit the number of syscalls and apply all the other alignment tricks for the cache etc. Talking about gcc and glibc changes in this context is very real IMHO, and I think it's a much superior solution to having mmap(4k) backed by 2M pages, with all the complexity and additional branches that would introduce in every page fault (rather than in a single large mmap, which is a slow path).

What we can add to the kernel, an idea that Ulrich proposed, is a MAP_ALIGN parameter to mmap, so that the first argument of mmap becomes the alignment. That creates more vmas, but so does the munmap trimming below. It's simply mandatory that allocations 2M in size start 2M aligned from now on (the rest is handled by khugepaged already, including the user stack itself). To avoid fragmenting the virtual address space and in turn creating more vmas (and potentially micro-slowing-down the page faults), these allocations that are multiples of 2M in size and 2M aligned could probably go in their own address range, something a MAP_ALIGN param can achieve inside the kernel transparently. Of course if userland munmaps in 4k chunks it'll fragment, but then it's up to userland to also munmap in aligned chunks that are multiples of 2M, if it wants to be optimal and avoid vma creation.
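For reference, the over-allocate-and-trim trick that both the glibc snippet above and the gcc patch below rely on can be written as a standalone helper. This is a minimal sketch, not part of any posted patch: the hpage_alloc() name and the hardcoded x86-64 2M huge page size are illustrative assumptions.

#define _GNU_SOURCE		/* for MAP_ANONYMOUS */
#include <stdio.h>
#include <stdint.h>
#include <sys/mman.h>

#define HPAGE_SIZE (2UL*1024*1024)	/* x86-64 huge page size assumed */

/*
 * Allocate "size" bytes (a multiple of HPAGE_SIZE) starting on a
 * HPAGE_SIZE boundary: over-allocate by HPAGE_SIZE-1 bytes, round the
 * start up to the next 2M boundary, then munmap the unused head and
 * tail so only the aligned region stays mapped.
 */
static void *hpage_alloc(size_t size)
{
	char *map, *aligned;

	if (size & (HPAGE_SIZE - 1))
		return NULL;	/* only handle multiples of 2M */

	map = mmap(NULL, size + HPAGE_SIZE - 1, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (map == MAP_FAILED)
		return NULL;

	aligned = (char *)(((uintptr_t)map + HPAGE_SIZE - 1) &
			   ~(HPAGE_SIZE - 1));
	if (aligned != map)
		munmap(map, aligned - map);		/* trim head */
	if (aligned + size != map + size + HPAGE_SIZE - 1)
		munmap(aligned + size,			/* trim tail */
		       map + size + HPAGE_SIZE - 1 - (aligned + size));
	return aligned;
}

int main(void)
{
	void *p = hpage_alloc(4 * HPAGE_SIZE);
	printf("aligned region at %p\n", p);
	return p ? 0 : 1;
}

The extra munmap calls are what create the additional vmas mentioned above, which is exactly the cost a kernel-side MAP_ALIGN would avoid.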
The kernel used is aa.git fb6122f722c9e07da384c1309a5036a5f1c80a77 on a single socket, 4 core phenom X4 with 4G of 800mhz ddr2, as before (and no virt).

Signed-off-by: Andrea Arcangeli
---
--- /var/tmp/portage/sys-devel/gcc-4.4.2/work/gcc-4.4.2/gcc/ggc-page.c	2008-07-28 16:33:56.000000000 +0200
+++ /tmp/gcc-4.4.2/gcc/ggc-page.c	2010-04-25 06:01:32.829753566 +0200
@@ -450,6 +450,11 @@
 #define BITMAP_SIZE(Num_objects) \
   (CEIL ((Num_objects), HOST_BITS_PER_LONG) * sizeof(long))
 
+#ifdef __x86_64__
+#define HPAGE_SIZE (2*1024*1024)
+#define GGC_QUIRE_SIZE 512
+#endif
+
 /* Allocate pages in chunks of this size, to throttle calls to memory
    allocation routines.  The first page is used, the rest go onto the
    free list.  This cannot be larger than HOST_BITS_PER_INT for the
@@ -654,6 +659,23 @@
 #ifdef HAVE_MMAP_ANON
   char *page = (char *) mmap (pref, size, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+#ifdef HPAGE_SIZE
+  if (!(size & (HPAGE_SIZE-1)) &&
+      page != (char *) MAP_FAILED && (size_t) page & (HPAGE_SIZE-1)) {
+    char *old_page;
+    munmap(page, size);
+    page = (char *) mmap (pref, size + HPAGE_SIZE-1,
+                          PROT_READ | PROT_WRITE,
+                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+    old_page = page;
+    page = (char *) (((size_t)page + HPAGE_SIZE-1)
+                     & ~(HPAGE_SIZE-1));
+    if (old_page != page)
+      munmap(old_page, page-old_page);
+    if (page != old_page + HPAGE_SIZE-1)
+      munmap(page+size, old_page+HPAGE_SIZE-1-page);
+  }
+#endif
 #endif
 #ifdef HAVE_MMAP_DEV_ZERO
   char *page = (char *) mmap (pref, size, PROT_READ | PROT_WRITE,

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with SMTP id 40C5E6B01FA for ; Mon, 26 Apr 2010 14:01:30 -0400 (EDT) Date: Mon, 26 Apr 2010 20:01:10 +0200 From: Andrea Arcangeli Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100426180110.GC8860@random.random> References: <20100410184750.GJ5708@random.random> <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <4BC0E2C4.8090101@redhat.com> <20100410204756.GR5708@random.random> <4BC0E6ED.7040100@redhat.com> <20100411010540.GW5708@random.random> <20100425192739.GG5789@random.random> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100425192739.GG5789@random.random> Sender: owner-linux-mm@kvack.org To: Avi Kivity Cc: Ingo Molnar , Mike Galbraith , Jason Garrett-Glaser , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura , Ulrich Drepper List-ID:

Now tried a kernel compile with gcc patched as in the previous email (stock glibc and no glibc environment parameters). Without rebooting (still plenty of hugepages as usual).

always:

real 4m7.280s
real 4m7.520s

never:

real 4m13.754s
real 4m14.095s

So the kernel now builds 2.3% faster.
As expected nothing huge here, because gcc doesn't use several hundred megabytes of ram (unlike translate.o or other more pathological files), and lots of the cpu time is spent outside gcc.

Clearly this is not done for gcc (but for the JVM and other workloads with larger working sets), but even a kernel build running more than 2% faster is worth mentioning, I think, as it confirms we're heading in the right direction.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with ESMTP id 4DF0D6B0237 for ; Fri, 30 Apr 2010 05:56:54 -0400 (EDT) Date: Fri, 30 Apr 2010 11:55:43 +0200 From: Ingo Molnar Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100430095543.GB3423@elte.hu> References: <20100410190233.GA30882@elte.hu> <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <4BC0E2C4.8090101@redhat.com> <20100410204756.GR5708@random.random> <4BC0E6ED.7040100@redhat.com> <20100411010540.GW5708@random.random> <20100425192739.GG5789@random.random> <20100426180110.GC8860@random.random> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100426180110.GC8860@random.random> Sender: owner-linux-mm@kvack.org To: Andrea Arcangeli Cc: Avi Kivity , Mike Galbraith , Jason Garrett-Glaser , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura , Ulrich Drepper List-ID:

* Andrea Arcangeli wrote:

> Now tried a kernel compile with gcc patched as in the previous email
> (stock glibc and no glibc environment parameters). Without rebooting
> (still plenty of hugepages as usual).
>
> always:
>
> real 4m7.280s
> real 4m7.520s
>
> never:
>
> real 4m13.754s
> real 4m14.095s
>
> So the kernel now builds 2.3% faster. As expected nothing huge here,
> because gcc doesn't use several hundred megabytes of ram (unlike
> translate.o or other more pathological files), and lots of the cpu
> time is spent outside gcc.
>
> Clearly this is not done for gcc (but for the JVM and other workloads
> with larger working sets), but even a kernel build running more than 2%
> faster is worth mentioning, I think, as it confirms we're heading in
> the right direction.

Was this done on a native/host kernel? I.e. do everyday kernel hackers gain 2.3% of kbuild performance from this?

I find that a very large speedup - it's much more than what I'd have expected.

Are you absolutely 100% sure it's real? If yes, it would be nice to underline that by gathering some sort of 'perf stat --repeat 3 --all' kind of always/never comparison of those kernel builds, so that we can see where the +2.3% comes from.

I'd expect to see roughly the same instruction count (within noise), but a ~3% reduced cycle count (due to fewer/faster TLB fills).

	Ingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with SMTP id 0082B6B0238 for ; Fri, 30 Apr 2010 11:21:21 -0400 (EDT) Date: Fri, 30 Apr 2010 17:19:40 +0200 From: Andrea Arcangeli Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100430151940.GG22108@random.random> References: <4BC0CFF4.5000207@redhat.com> <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <4BC0E2C4.8090101@redhat.com> <20100410204756.GR5708@random.random> <4BC0E6ED.7040100@redhat.com> <20100411010540.GW5708@random.random> <20100425192739.GG5789@random.random> <20100426180110.GC8860@random.random> <20100430095543.GB3423@elte.hu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100430095543.GB3423@elte.hu> Sender: owner-linux-mm@kvack.org To: Ingo Molnar Cc: Avi Kivity , Mike Galbraith , Jason Garrett-Glaser , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura , Ulrich Drepper , Paolo Bonzini List-ID:

On Fri, Apr 30, 2010 at 11:55:43AM +0200, Ingo Molnar wrote:
>
> * Andrea Arcangeli wrote:
>
> > Now tried a kernel compile with gcc patched as in the previous email
> > (stock glibc and no glibc environment parameters). Without rebooting
> > (still plenty of hugepages as usual).
> >
> > always:
> >
> > real 4m7.280s
> > real 4m7.520s
> >
> > never:
> >
> > real 4m13.754s
> > real 4m14.095s
> >
> > So the kernel now builds 2.3% faster. As expected nothing huge here,
> > because gcc doesn't use several hundred megabytes of ram (unlike
> > translate.o or other more pathological files), and lots of the cpu
> > time is spent outside gcc.
> >
> > Clearly this is not done for gcc (but for the JVM and other workloads
> > with larger working sets), but even a kernel build running more than 2%
> > faster is worth mentioning, I think, as it confirms we're heading in
> > the right direction.
>
> Was this done on a native/host kernel?

Correct, no virt, just bare metal.

> I.e. do everyday kernel hackers gain 2.3% of kbuild performance from this?

Yes, I already get benefit from this in my work.

> I find that a very large speedup - it's much more than what I'd have
> expected.
>
> Are you absolutely 100% sure it's real? If yes, it would be nice to
> underline

200% sure, at least on the phenom X4 with 1 socket, 4 cores and 800mhz ddr2 ram! Why don't you try yourself? You just have to use aa.git plus the gcc patch I posted, and nothing else. This is what I'm using on all my systems to actively benefit from it already.

I've also seen numbers much bigger than 10% on JVM benchmarks, even on host, with zero userland modifications (as long as the allocation is done in big chunks everything works automatically, and critical regions are usually allocated in big chunks; even gcc has the GGC_QUIRE_SIZE, but it had to be tuned from 1m to 2m and aligned).
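To make the "big chunks" point concrete, here is a hedged sketch of what such a userland allocation could look like under transparent_hugepage/enabled=madvise, using the MADV_HUGEPAGE hint this patchset introduces for the madvise mode. The fallback #define for MADV_HUGEPAGE and the quire sizing are illustrative assumptions; a real build would take the constant from the patched kernel headers, and the chunk would additionally need the 2M alignment sketched earlier for the fault path to install a huge pmd.

#define _GNU_SOURCE		/* for MAP_ANONYMOUS */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14	/* assumed value; take it from the patched headers */
#endif

#define GGC_QUIRE_SIZE 512			/* pages per chunk, as tuned in the gcc patch */
#define QUIRE_BYTES (GGC_QUIRE_SIZE * 4096UL)	/* 512 * 4k = 2M, one huge page */

int main(void)
{
	/* One big 2M allocation instead of 512 separate 4k ones. */
	char *chunk = mmap(NULL, QUIRE_BYTES, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (chunk == MAP_FAILED)
		return 1;

	/*
	 * With enabled=madvise the region becomes a hugepage candidate
	 * only after this hint; with enabled=always it already is one.
	 * Failure is non-fatal: the region just stays backed by 4k pages.
	 */
	if (madvise(chunk, QUIRE_BYTES, MADV_HUGEPAGE))
		perror("madvise");

	memset(chunk, 1, QUIRE_BYTES);	/* first touch can fault in a 2M page */
	printf("quire at %p\n", chunk);
	return 0;
}

The kernel interface stays trivial here; the only userland duty is to allocate in 2M multiples at 2M-aligned addresses.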
The only crash I had was the one I fixed in the last release: a race between migrate.c and exec.c that would trigger even without THP or memory compaction. Other than that I have had zero problems so far.

> that by gathering some sort of 'perf stat --repeat 3 --all' kind of
> always/never comparison of those kernel builds, so that we can see where
> the +2.3% comes from.

I can do that. I wasn't sure if perf would deal well with such a macro benchmark, so I didn't try yet.

> I'd expect to see roughly the same instruction count (within noise), but
> a ~3% reduced cycle count (due to fewer/faster TLB fills).

Also note: before I did the few-liner patch to gcc so that it always uses transparent hugepages in its garbage collector code, the kernel build was a little slower with transparent hugepage = always. The likely reason is that make or cpp or gcc itself were trashing the cache in hugepage cows for data accesses that didn't benefit from the hugetlb; that's my best estimate. Faulting more than 4k at a time is not always beneficial for cows, which is why it's pointless to try to implement any optimistic prefault logic: it can backfire on you by just trashing the cache more. My design ensures that every single time we optimistically fault 2m at once, we don't just get the initial optimistic-fault speedup (along with unwanted cache trashing and more latency in the fault because of the larger clear-page/copy-page), we get _much_ more, and longstanding: the hugetlb and the faster tlb miss. I never pay the cost of the optimistic fault unless I get a _lot_ more in return than just entering/exiting the kernel fewer times.

In fact the moment gcc uses hugepages it's not that the cow-cache-trashing cost goes away, but the hugepage TLB effect likely leads to a gain of more than 2.3%, part of which is spent offsetting the minor slowdown in the cows. I also suspect that with enabled=madvise and madvise called by gcc's ggc-page.c, things may be even faster than 2.3%. But it entirely depends on the cpu cache sizes; on a xeon the gain may be bigger than 2.3%, as the cache trashing may not materialize there at all. So I'm sticking with the always option.

Paolo has been very nice in sending the gcc extreme tests too; those may achieve >10% speedups (considering translate.o of qemu is at a 10% speedup already). I just didn't run them yet because translate.o was much closer to a real life scenario (in fact it is real life, for better or worse), but in the future I'll try those gcc tests too, as they emulate what a real app will have to do in similar circumstances. They're pathological for gcc, but business as usual for everything else in HPC.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with ESMTP id 3EDDE6004C0 for ; Sun, 2 May 2010 08:18:35 -0400 (EDT) Date: Sun, 2 May 2010 14:17:15 +0200 From: Ingo Molnar Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #17 Message-ID: <20100502121715.GA16754@elte.hu> References: <20100410194751.GA23751@elte.hu> <4BC0DE84.3090305@redhat.com> <4BC0E2C4.8090101@redhat.com> <20100410204756.GR5708@random.random> <4BC0E6ED.7040100@redhat.com> <20100411010540.GW5708@random.random> <20100425192739.GG5789@random.random> <20100426180110.GC8860@random.random> <20100430095543.GB3423@elte.hu> <20100430151940.GG22108@random.random> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100430151940.GG22108@random.random> Sender: owner-linux-mm@kvack.org To: Andrea Arcangeli Cc: Avi Kivity , Mike Galbraith , Jason Garrett-Glaser , Linus Torvalds , Pekka Enberg , Andrew Morton , linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Mel Gorman , Dave Hansen , Benjamin Herrenschmidt , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , bpicco@redhat.com, KOSAKI Motohiro , Balbir Singh , Arnd Bergmann , "Michael S. Tsirkin" , Peter Zijlstra , Johannes Weiner , Daisuke Nishimura , Ulrich Drepper , Paolo Bonzini List-ID:

* Andrea Arcangeli wrote:

> > I find that a very large speedup - it's much more than what I'd have
> > expected.
> >
> > Are you absolutely 100% sure it's real? If yes, it would be nice to
> > underline
>
> 200% sure, at least on the phenom X4 with 1 socket, 4 cores and 800mhz
> ddr2 ram! Why don't you try yourself? You just have to use aa.git plus
> the gcc patch I posted, and nothing else. This is what I'm using on all
> my systems to actively benefit from it already.

Well, patching GCC (and then praying for GCC to actually build & work in a full toolchain) is not something that's done easily within a few minutes.

I might try it; I just wanted to point out ways you can make the numbers more convincing to people who don't try out your patches first hand. Was just a suggestion.

	Thanks,

	Ingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org