* [patch] thp, memcg: split hugepage for memcg oom on cow
From: David Rientjes @ 2012-04-04  1:56 UTC
To: Andrew Morton
Cc: Andrea Arcangeli, Johannes Weiner, KAMEZAWA Hiroyuki, linux-mm

On COW, a new hugepage is allocated and charged to the memcg. If the
memcg is oom, however, this charge will fail and return VM_FAULT_OOM to
the page fault handler, which results in an oom kill.

Instead, it's possible to fall back to splitting the hugepage so that the
COW results only in an order-0 page being charged to the memcg, which has
a higher likelihood of succeeding. This is expensive because the hugepage
must be split in the page fault handler, but it is much better than
unnecessarily oom killing a process.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/huge_memory.c |    1 +
 mm/memory.c      |   18 +++++++++++++++---
 2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -959,6 +959,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
 		put_page(new_page);
+		split_huge_page(page);
 		put_page(page);
 		ret |= VM_FAULT_OOM;
 		goto out;
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3489,6 +3489,7 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (unlikely(is_vm_hugetlb_page(vma)))
 		return hugetlb_fault(mm, vma, address, flags);
 
+retry:
 	pgd = pgd_offset(mm, address);
 	pud = pud_alloc(mm, pgd, address);
 	if (!pud)
@@ -3502,13 +3503,24 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 						pmd, flags);
 	} else {
 		pmd_t orig_pmd = *pmd;
+		int ret;
+
 		barrier();
 		if (pmd_trans_huge(orig_pmd)) {
 			if (flags & FAULT_FLAG_WRITE &&
 			    !pmd_write(orig_pmd) &&
-			    !pmd_trans_splitting(orig_pmd))
-				return do_huge_pmd_wp_page(mm, vma, address,
-							   pmd, orig_pmd);
+			    !pmd_trans_splitting(orig_pmd)) {
+				ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
+							  orig_pmd);
+				/*
+				 * If COW results in an oom memcg, the huge pmd
+				 * will already have been split, so retry the
+				 * fault on the pte for a smaller charge.
+				 */
+				if (unlikely(ret & VM_FAULT_OOM))
+					goto retry;
+				return ret;
+			}
 			return 0;
 		}
 	}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org. For more info on Linux MM, see http://www.linux-mm.org/.
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
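[Editor's note] For readers outside the kernel tree, the control flow of the patch above can be mocked in plain userspace C. Everything below is a hypothetical simulation, not the kernel API: the names mirror the kernel functions, but the bodies just model "a failed hugepage charge splits the mapping, and the fault is retried as a single order-0 charge".

```c
#include <stdbool.h>

#define VM_FAULT_OOM   0x0001   /* mirrors the kernel's fault-status bit */
#define HPAGE_NR_PAGES 512      /* order-0 pages per 2M hugepage on x86 */

/* Mock memcg state: how many order-0 pages the cgroup may still charge. */
static int memcg_pages_left = 4;
/* Mock pmd state: does the faulting pmd still map a hugepage? */
static bool pmd_is_huge = true;

/* Charge attempt: fails when the cgroup cannot fit the request. */
static bool memcg_charge_mock(int nr_pages)
{
        if (nr_pages > memcg_pages_left)
                return false;
        memcg_pages_left -= nr_pages;
        return true;
}

/* do_huge_pmd_wp_page() after this patch: if the huge charge fails,
 * split the hugepage before reporting OOM to the caller. */
static int do_huge_pmd_wp_page_mock(void)
{
        if (memcg_charge_mock(HPAGE_NR_PAGES))
                return 0;             /* huge COW succeeded */
        pmd_is_huge = false;          /* models split_huge_page(page) */
        return VM_FAULT_OOM;
}

/* The pte-level COW path only needs a single order-0 charge. */
static int do_wp_page_mock(void)
{
        return memcg_charge_mock(1) ? 0 : VM_FAULT_OOM;
}

/* handle_mm_fault() with the retry label added by this patch. */
static int handle_mm_fault_mock(void)
{
retry:
        if (pmd_is_huge) {
                int ret = do_huge_pmd_wp_page_mock();
                /* The pmd was split on OOM; retry for a smaller charge. */
                if (ret & VM_FAULT_OOM)
                        goto retry;
                return ret;
        }
        return do_wp_page_mock();
}
```

In this mock, a cgroup with room for only 4 order-0 pages cannot satisfy the 512-page huge COW, so the first pass splits and the retried fault succeeds with a 1-page charge instead of an oom kill.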
* Re: [patch] thp, memcg: split hugepage for memcg oom on cow
From: KAMEZAWA Hiroyuki @ 2012-04-09  9:10 UTC
To: David Rientjes
Cc: Andrew Morton, Andrea Arcangeli, Johannes Weiner, linux-mm

(2012/04/04 10:56), David Rientjes wrote:
> On COW, a new hugepage is allocated and charged to the memcg. If the
> memcg is oom, however, this charge will fail and return VM_FAULT_OOM to
> the page fault handler, which results in an oom kill.
>
> Instead, it's possible to fall back to splitting the hugepage so that the
> COW results only in an order-0 page being charged to the memcg, which has
> a higher likelihood of succeeding. This is expensive because the hugepage
> must be split in the page fault handler, but it is much better than
> unnecessarily oom killing a process.
>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  mm/huge_memory.c |    1 +
>  mm/memory.c      |   18 +++++++++++++++---
>  2 files changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -959,6 +959,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>
>  	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
>  		put_page(new_page);
> +		split_huge_page(page);
>  		put_page(page);
>  		ret |= VM_FAULT_OOM;
>  		goto out;

??

how about
==
	if (transparent_hugepage_enabled(vma) &&
	    !transparent_hugepage_debug_cow())
		new_page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
					      vma, haddr, numa_node_id(), 0);
	else
		new_page = NULL;

	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
		put_page(new_page);
		new_page = NULL;	/* never OOM, just cause fallback */
	}

	if (unlikely(!new_page)) {
		count_vm_event(THP_FAULT_FALLBACK);
		ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
						   pmd, orig_pmd, page, haddr);
		put_page(page);
		goto out;
	}
==
?

Thanks,
-Kame
* Re: [patch] thp, memcg: split hugepage for memcg oom on cow
From: David Rientjes @ 2012-04-10  0:23 UTC
To: KAMEZAWA Hiroyuki
Cc: Andrew Morton, Andrea Arcangeli, Johannes Weiner, linux-mm

On Mon, 9 Apr 2012, KAMEZAWA Hiroyuki wrote:

> 	if (transparent_hugepage_enabled(vma) &&
> 	    !transparent_hugepage_debug_cow())
> 		new_page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
> 					      vma, haddr, numa_node_id(), 0);
> 	else
> 		new_page = NULL;
>
> 	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
> 		put_page(new_page);
> 		new_page = NULL;	/* never OOM, just cause fallback */
> 	}
>
> 	if (unlikely(!new_page)) {
> 		count_vm_event(THP_FAULT_FALLBACK);
> 		ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
> 						   pmd, orig_pmd, page, haddr);
> 		put_page(page);
> 		goto out;
> 	}

This would result in the same error, since do_huge_pmd_wp_page_fallback()
would fail to charge the necessary memory to the memcg.

Are you still including my change to handle_mm_fault() to retry if this
returns VM_FAULT_OOM?
* Re: [patch] thp, memcg: split hugepage for memcg oom on cow
From: KAMEZAWA Hiroyuki @ 2012-04-10  0:43 UTC
To: David Rientjes
Cc: Andrew Morton, Andrea Arcangeli, Johannes Weiner, linux-mm

(2012/04/10 9:23), David Rientjes wrote:
> On Mon, 9 Apr 2012, KAMEZAWA Hiroyuki wrote:
>
>> 	if (unlikely(!new_page)) {
>> 		count_vm_event(THP_FAULT_FALLBACK);
>> 		ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
>> 						   pmd, orig_pmd, page, haddr);
>> 		put_page(page);
>> 		goto out;
>> 	}
>
> This would result in the same error since do_huge_pmd_wp_page_fallback()
> would fail to charge the necessary memory to the memcg.

Ah, I see. That path will charge 1024 pages anyway, and memcg easily
returns failure when many pages are requested at once.

And I misunderstood your patch: you split the hugepage and allocate a
single page at fault time. OK, that seems reasonable, I'm sorry.

Thanks,
-Kame

> Are you still including my change to handle_mm_fault() to retry if this
> returns VM_FAULT_OOM?
* Re: [patch] thp, memcg: split hugepage for memcg oom on cow
From: KAMEZAWA Hiroyuki @ 2012-04-10  0:49 UTC
To: David Rientjes
Cc: Andrew Morton, Andrea Arcangeli, Johannes Weiner, linux-mm

(2012/04/04 10:56), David Rientjes wrote:
> On COW, a new hugepage is allocated and charged to the memcg. If the
> memcg is oom, however, this charge will fail and return VM_FAULT_OOM to
> the page fault handler, which results in an oom kill.
>
> Instead, it's possible to fall back to splitting the hugepage so that the
> COW results only in an order-0 page being charged to the memcg, which has
> a higher likelihood of succeeding. This is expensive because the hugepage
> must be split in the page fault handler, but it is much better than
> unnecessarily oom killing a process.
>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  mm/huge_memory.c |    1 +
>  mm/memory.c      |   18 +++++++++++++++---
>  2 files changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -959,6 +959,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>
>  	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
>  		put_page(new_page);
> +		split_huge_page(page);
>  		put_page(page);
>  		ret |= VM_FAULT_OOM;
>  		goto out;
> diff --git a/mm/memory.c b/mm/memory.c
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3489,6 +3489,7 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	if (unlikely(is_vm_hugetlb_page(vma)))
>  		return hugetlb_fault(mm, vma, address, flags);
>
> +retry:
>  	pgd = pgd_offset(mm, address);
>  	pud = pud_alloc(mm, pgd, address);
>  	if (!pud)
> @@ -3502,13 +3503,24 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  						pmd, flags);
>  	} else {
>  		pmd_t orig_pmd = *pmd;
> +		int ret;
> +
>  		barrier();
>  		if (pmd_trans_huge(orig_pmd)) {
>  			if (flags & FAULT_FLAG_WRITE &&
>  			    !pmd_write(orig_pmd) &&
> -			    !pmd_trans_splitting(orig_pmd))
> -				return do_huge_pmd_wp_page(mm, vma, address,
> -							   pmd, orig_pmd);
> +			    !pmd_trans_splitting(orig_pmd)) {
> +				ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
> +							  orig_pmd);
> +				/*
> +				 * If COW results in an oom memcg, the huge pmd
> +				 * will already have been split, so retry the
> +				 * fault on the pte for a smaller charge.
> +				 */

IIUC, do_huge_pmd_wp_page_fallback() can return VM_FAULT_OOM, so this
check is not related only to memcg.

> +				if (unlikely(ret & VM_FAULT_OOM))
> +					goto retry;
> +				return ret;
> +			}
>  			return 0;

Anyway, seems reasonable to me.

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
* Re: [patch] thp, memcg: split hugepage for memcg oom on cow
From: David Rientjes @ 2012-04-10  5:41 UTC
To: KAMEZAWA Hiroyuki
Cc: Andrew Morton, Andrea Arcangeli, Johannes Weiner, linux-mm

On Tue, 10 Apr 2012, KAMEZAWA Hiroyuki wrote:

> > +				ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
> > +							  orig_pmd);
> > +				/*
> > +				 * If COW results in an oom memcg, the huge pmd
> > +				 * will already have been split, so retry the
> > +				 * fault on the pte for a smaller charge.
> > +				 */
>
> IIUC, do_huge_pmd_wp_page_fallback() can return VM_FAULT_OOM, so this
> check is not related only to memcg.

You're right, and if we do that then we infinitely loop trying to handle
the pagefault instead of returning. I'll post a v2 of the patch that
fixes this, thanks for catching it.
* [patch v2] thp, memcg: split hugepage for memcg oom on cow
From: David Rientjes @ 2012-04-10  5:42 UTC
To: Andrew Morton
Cc: KAMEZAWA Hiroyuki, Andrea Arcangeli, Johannes Weiner, linux-mm

On COW, a new hugepage is allocated and charged to the memcg. If the
system is oom or the charge to the memcg fails, however, the fault
handler will return VM_FAULT_OOM, which results in an oom kill.

Instead, it's possible to fall back to splitting the hugepage so that the
COW results only in an order-0 page being allocated and charged to the
memcg, which has a higher likelihood of succeeding. This is expensive
because the hugepage must be split in the page fault handler, but it is
much better than unnecessarily oom killing a process.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/huge_memory.c |    3 +++
 mm/memory.c      |   18 +++++++++++++++---
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -950,6 +950,8 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		count_vm_event(THP_FAULT_FALLBACK);
 		ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
 						   pmd, orig_pmd, page, haddr);
+		if (ret & VM_FAULT_OOM)
+			split_huge_page(page);
 		put_page(page);
 		goto out;
 	}
@@ -957,6 +959,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
 		put_page(new_page);
+		split_huge_page(page);
 		put_page(page);
 		ret |= VM_FAULT_OOM;
 		goto out;
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3489,6 +3489,7 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (unlikely(is_vm_hugetlb_page(vma)))
 		return hugetlb_fault(mm, vma, address, flags);
 
+retry:
 	pgd = pgd_offset(mm, address);
 	pud = pud_alloc(mm, pgd, address);
 	if (!pud)
@@ -3502,13 +3503,24 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 						pmd, flags);
 	} else {
 		pmd_t orig_pmd = *pmd;
+		int ret;
+
 		barrier();
 		if (pmd_trans_huge(orig_pmd)) {
 			if (flags & FAULT_FLAG_WRITE &&
 			    !pmd_write(orig_pmd) &&
-			    !pmd_trans_splitting(orig_pmd))
-				return do_huge_pmd_wp_page(mm, vma, address,
-							   pmd, orig_pmd);
+			    !pmd_trans_splitting(orig_pmd)) {
+				ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
+							  orig_pmd);
+				/*
+				 * If COW results in an oom, the huge pmd will
+				 * have been split, so retry the fault on the
+				 * pte for a smaller charge.
+				 */
+				if (unlikely(ret & VM_FAULT_OOM))
+					goto retry;
+				return ret;
+			}
 			return 0;
 		}
 	}
* Re: [patch v2] thp, memcg: split hugepage for memcg oom on cow
From: KAMEZAWA Hiroyuki @ 2012-04-10  5:59 UTC
To: David Rientjes
Cc: Andrew Morton, Andrea Arcangeli, Johannes Weiner, linux-mm

(2012/04/10 14:42), David Rientjes wrote:
> On COW, a new hugepage is allocated and charged to the memcg. If the
> system is oom or the charge to the memcg fails, however, the fault
> handler will return VM_FAULT_OOM, which results in an oom kill.
>
> Instead, it's possible to fall back to splitting the hugepage so that the
> COW results only in an order-0 page being allocated and charged to the
> memcg, which has a higher likelihood of succeeding. This is expensive
> because the hugepage must be split in the page fault handler, but it is
> much better than unnecessarily oom killing a process.
>
> Signed-off-by: David Rientjes <rientjes@google.com>

Seems nice to me.

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
* Re: [patch v2] thp, memcg: split hugepage for memcg oom on cow
From: Johannes Weiner @ 2012-04-11 14:20 UTC
To: David Rientjes
Cc: Andrew Morton, KAMEZAWA Hiroyuki, Andrea Arcangeli, linux-mm

On Mon, Apr 09, 2012 at 10:42:31PM -0700, David Rientjes wrote:
> On COW, a new hugepage is allocated and charged to the memcg. If the
> system is oom or the charge to the memcg fails, however, the fault
> handler will return VM_FAULT_OOM, which results in an oom kill.
>
> Instead, it's possible to fall back to splitting the hugepage so that the
> COW results only in an order-0 page being allocated and charged to the
> memcg, which has a higher likelihood of succeeding. This is expensive
> because the hugepage must be split in the page fault handler, but it is
> much better than unnecessarily oom killing a process.
>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  mm/huge_memory.c |    3 +++
>  mm/memory.c      |   18 +++++++++++++++---
>  2 files changed, 18 insertions(+), 3 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -950,6 +950,8 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		count_vm_event(THP_FAULT_FALLBACK);
>  		ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
>  						   pmd, orig_pmd, page, haddr);
> +		if (ret & VM_FAULT_OOM)
> +			split_huge_page(page);
>  		put_page(page);
>  		goto out;
>  	}
> @@ -957,6 +959,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>
>  	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
>  		put_page(new_page);
> +		split_huge_page(page);
>  		put_page(page);
>  		ret |= VM_FAULT_OOM;
>  		goto out;
> diff --git a/mm/memory.c b/mm/memory.c
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3489,6 +3489,7 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	if (unlikely(is_vm_hugetlb_page(vma)))
>  		return hugetlb_fault(mm, vma, address, flags);
>
> +retry:
>  	pgd = pgd_offset(mm, address);
>  	pud = pud_alloc(mm, pgd, address);
>  	if (!pud)
> @@ -3502,13 +3503,24 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  						pmd, flags);
>  	} else {
>  		pmd_t orig_pmd = *pmd;
> +		int ret;
> +
>  		barrier();
>  		if (pmd_trans_huge(orig_pmd)) {
>  			if (flags & FAULT_FLAG_WRITE &&
>  			    !pmd_write(orig_pmd) &&
> -			    !pmd_trans_splitting(orig_pmd))
> -				return do_huge_pmd_wp_page(mm, vma, address,
> -							   pmd, orig_pmd);
> +			    !pmd_trans_splitting(orig_pmd)) {
> +				ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
> +							  orig_pmd);
> +				/*
> +				 * If COW results in an oom, the huge pmd will
> +				 * have been split, so retry the fault on the
> +				 * pte for a smaller charge.
> +				 */
> +				if (unlikely(ret & VM_FAULT_OOM))
> +					goto retry;

Can you instead put a __split_huge_page_pmd(mm, pmd) here? It has to
redo the get-page-ref-through-pagetable dance, but it's more robust
and obvious than splitting the COW page before returning OOM in the
thp wp handler.
* Re: [patch v2] thp, memcg: split hugepage for memcg oom on cow
From: David Rientjes @ 2012-04-23 23:15 UTC
To: Johannes Weiner
Cc: Andrew Morton, KAMEZAWA Hiroyuki, Andrea Arcangeli, linux-mm

On Wed, 11 Apr 2012, Johannes Weiner wrote:

> > +			    !pmd_trans_splitting(orig_pmd)) {
> > +				ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
> > +							  orig_pmd);
> > +				/*
> > +				 * If COW results in an oom, the huge pmd will
> > +				 * have been split, so retry the fault on the
> > +				 * pte for a smaller charge.
> > +				 */
> > +				if (unlikely(ret & VM_FAULT_OOM))
> > +					goto retry;
>
> Can you instead put a __split_huge_page_pmd(mm, pmd) here? It has to
> redo the get-page-ref-through-pagetable dance, but it's more robust
> and obvious than splitting the COW page before returning OOM in the
> thp wp handler.

I agree it's more robust if do_huge_pmd_wp_page() were modified later and
mistakenly returned VM_FAULT_OOM without the page being split, but
__split_huge_page_pmd() has the drawback of also requiring us to retake
mm->page_table_lock to test whether orig_pmd is still legitimate, so it
will be slower. Do you feel strongly about the way it's currently
written, which will be faster at runtime?
* Re: [patch v2] thp, memcg: split hugepage for memcg oom on cow
From: David Rientjes @ 2012-04-25 21:01 UTC
To: Johannes Weiner
Cc: Andrew Morton, KAMEZAWA Hiroyuki, Andrea Arcangeli, linux-mm

On Mon, 23 Apr 2012, David Rientjes wrote:

> > Can you instead put a __split_huge_page_pmd(mm, pmd) here? It has to
> > redo the get-page-ref-through-pagetable dance, but it's more robust
> > and obvious than splitting the COW page before returning OOM in the
> > thp wp handler.
>
> I agree it's more robust if do_huge_pmd_wp_page() were modified later and
> mistakenly returned VM_FAULT_OOM without the page being split, but
> __split_huge_page_pmd() has the drawback of also requiring us to retake
> mm->page_table_lock to test whether orig_pmd is still legitimate, so it
> will be slower. Do you feel strongly about the way it's currently
> written, which will be faster at runtime?

Andrew, please merge this patch. I'd rather not take another reference
on the COW page and retake mm->page_table_lock in the page fault handler
just so the code is cleaner; it's faster this way.
* Re: [patch v2] thp, memcg: split hugepage for memcg oom on cow
From: Johannes Weiner @ 2012-04-26  9:06 UTC
To: David Rientjes
Cc: Andrew Morton, KAMEZAWA Hiroyuki, Andrea Arcangeli, linux-mm

[ Sorry, my responsiveness is horrible these days... ]

On Mon, Apr 23, 2012 at 04:15:06PM -0700, David Rientjes wrote:
> On Wed, 11 Apr 2012, Johannes Weiner wrote:
>
> > Can you instead put a __split_huge_page_pmd(mm, pmd) here? It has to
> > redo the get-page-ref-through-pagetable dance, but it's more robust
> > and obvious than splitting the COW page before returning OOM in the
> > thp wp handler.
>
> I agree it's more robust if do_huge_pmd_wp_page() were modified later and
> mistakenly returned VM_FAULT_OOM without the page being split, but
> __split_huge_page_pmd() has the drawback of also requiring us to retake
> mm->page_table_lock to test whether orig_pmd is still legitimate, so it
> will be slower. Do you feel strongly about the way it's currently
> written, which will be faster at runtime?

If you can't accommodate a hugepage, this code runs 511 times in the
worst case before you also can't fit a regular page anymore. And
compare it to the cost of the splitting itself and the subsequent 4k
COW break faults...

I don't think it's a path worth optimizing for at all, especially if
it includes sprinkling undocumented split_huge_pages around, and the
fix could be as self-contained as something like this...

diff --git a/mm/memory.c b/mm/memory.c
index 706a274..dae0afc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3505,14 +3505,29 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 						pmd, flags);
 	} else {
 		pmd_t orig_pmd = *pmd;
+		int ret;
+
 		barrier();
 		if (pmd_trans_huge(orig_pmd)) {
 			if (flags & FAULT_FLAG_WRITE &&
 			    !pmd_write(orig_pmd) &&
-			    !pmd_trans_splitting(orig_pmd))
-				return do_huge_pmd_wp_page(mm, vma, address,
+			    !pmd_trans_splitting(orig_pmd)) {
+				ret = do_huge_pmd_wp_page(mm, vma, address,
 							   pmd, orig_pmd);
-			return 0;
+				if (unlikely(ret & VM_FAULT_OOM)) {
+					/*
+					 * It's not worth going OOM
+					 * over not being able to
+					 * allocate or charge a full
+					 * copy of the huge page.
+					 * Split it up and handle as
+					 * single page COW break below.
+					 */
+					__split_huge_page_pmd(mm, pmd);
+				} else
+					return ret;
+			} else
+				return 0;
 		}
 	}
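[Editor's note] The placement being debated, split inside the wp handler versus split at the call site, can also be mocked in userspace. A hypothetical sketch of Johannes's caller-side variant follows; none of these names are the real kernel API, they only model the control flow: the wp handler reports OOM without splitting, and the caller splits the pmd and falls through to the pte-level COW path instead of retrying from the top.

```c
#include <stdbool.h>

#define VM_FAULT_OOM 0x0001            /* mirrors the kernel's fault-status bit */

static bool pmd_maps_hugepage = true;  /* mock faulting pmd state */
static bool huge_charge_fits = false;  /* memcg cannot fit a 2M charge */

/* In this variant the wp handler stays untouched: it only reports OOM
 * and does not split anything itself. */
static int wp_page_huge_mock(void)
{
        return huge_charge_fits ? 0 : VM_FAULT_OOM;
}

/* Stand-in for __split_huge_page_pmd(mm, pmd): the caller splits. */
static void split_pmd_mock(void)
{
        pmd_maps_hugepage = false;
}

/* handle_mm_fault() per the sketch above: on OOM, split at the caller
 * and fall through to the single-page COW break. */
static int fault_mock(void)
{
        if (pmd_maps_hugepage) {
                int ret = wp_page_huge_mock();
                if (!(ret & VM_FAULT_OOM))
                        return ret;
                split_pmd_mock();
        }
        return 0;   /* a single order-0 COW break succeeds */
}
```

The observable difference from the v2 patch is only where the split happens; either way the fault completes with an order-0 copy instead of an oom kill.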
* Re: [patch v2] thp, memcg: split hugepage for memcg oom on cow
From: David Rientjes @ 2012-04-26 21:05 UTC
To: Johannes Weiner
Cc: Andrew Morton, KAMEZAWA Hiroyuki, Andrea Arcangeli, linux-mm

On Thu, 26 Apr 2012, Johannes Weiner wrote:

> > I agree it's more robust if do_huge_pmd_wp_page() were modified later and
> > mistakenly returned VM_FAULT_OOM without the page being split, but
> > __split_huge_page_pmd() has the drawback of also requiring us to retake
> > mm->page_table_lock to test whether orig_pmd is still legitimate, so it
> > will be slower. Do you feel strongly about the way it's currently
> > written, which will be faster at runtime?
>
> If you can't accommodate a hugepage, this code runs 511 times in the
> worst case before you also can't fit a regular page anymore. And
> compare it to the cost of the splitting itself and the subsequent 4k
> COW break faults...
>
> I don't think it's a path worth optimizing for at all, especially if
> it includes sprinkling undocumented split_huge_pages around, and the
> fix could be as self-contained as something like this...

I disagree that we should be unnecessarily taking mm->page_table_lock,
which is already strongly contended when all cpus are pagefaulting on the
same process (and I'll be posting a patch to address specifically those
slowdowns, since thp is _much_ slower on page fault tests), when we can
already do it in do_huge_pmd_wp_page(). If you'd like to add a comment
for the split_huge_page() in that function because it's not clear enough
from my VM_FAULT_OOM comment in handle_mm_fault(), then feel free to add
it, but I thought it was rather trivial to understand.
* Re: [patch v2] thp, memcg: split hugepage for memcg oom on cow
  2012-04-26 21:05 ` David Rientjes
@ 2012-04-27  0:15 ` Johannes Weiner
  0 siblings, 0 replies; 14+ messages in thread

From: Johannes Weiner @ 2012-04-27 0:15 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, KAMEZAWA Hiroyuki, Andrea Arcangeli, linux-mm

On Thu, Apr 26, 2012 at 02:05:11PM -0700, David Rientjes wrote:
> On Thu, 26 Apr 2012, Johannes Weiner wrote:
>
> > > I agree it's more robust if do_huge_pmd_wp_page() were modified
> > > later and mistakenly returned VM_FAULT_OOM without the page being
> > > split, but __split_huge_page_pmd() has the drawback of also
> > > requiring to retake mm->page_table_lock to test whether orig_pmd is
> > > still legitimate, so it will be slower.  Do you feel strongly about
> > > the way it's currently written, which will be faster at runtime?
> >
> > If you can't accommodate a hugepage, this code runs 511 times in the
> > worst case before you also can't fit a regular page anymore.  And
> > compare it to the cost of the splitting itself and the subsequent 4k
> > COW break faults...
> >
> > I don't think it's a path worth optimizing for at all, especially if
> > it includes sprinkling undocumented split_huge_pages around, and the
> > fix could be as self-contained as something like this...
>
> I disagree that we should unnecessarily take mm->page_table_lock, which
> is already strongly contended when all cpus are pagefaulting in the
> same process (and I'll be posting a patch to address specifically those
> slowdowns, since thp is _much_ slower on page fault tests), when we can
> already do the split in do_huge_pmd_wp_page().  If you'd like to add a
> comment for the split_huge_page() in that function, in case it's not
> clear enough from my VM_FAULT_OOM comment in handle_mm_fault(), then
> feel free to add it, but I thought it was rather trivial to understand.

Come on, it's not "trivial to understand" why the page in the parent is
split because the child failed to allocate a replacement, shortly before
returning "out of memory".  You have to look at a different file to make
sense of it.  Such cross-dependencies between functions simply suck and
have caused problems in the past.  The least you could do is properly
document them in _both_ places if you insist on adding them in the first
place.

Btw, is restarting the full page table walk even necessary?  You already
have the pmd, know/hope it's been split, and hold mmap_sem for reading,
so it can't go back to being huge or none.  You should be able to fall
through to the pte lookup and handle_pte_fault(), no?
end of thread, other threads:[~2012-04-27 0:15 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-04-04  1:56 [patch] thp, memcg: split hugepage for memcg oom on cow David Rientjes
2012-04-09  9:10 ` KAMEZAWA Hiroyuki
2012-04-10  0:23   ` David Rientjes
2012-04-10  0:43     ` KAMEZAWA Hiroyuki
2012-04-10  0:49 ` KAMEZAWA Hiroyuki
2012-04-10  5:41   ` David Rientjes
2012-04-10  5:42     ` [patch v2] thp, memcg: split hugepage for memcg oom on cow David Rientjes
2012-04-10  5:59       ` KAMEZAWA Hiroyuki
2012-04-11 14:20       ` Johannes Weiner
2012-04-23 23:15         ` David Rientjes
2012-04-25 21:01         ` David Rientjes
2012-04-26  9:06           ` Johannes Weiner
2012-04-26 21:05             ` David Rientjes
2012-04-27  0:15               ` Johannes Weiner