* [PATCH 0/3] uprobes fixes for 3.5 @ 2012-06-07 16:59 Oleg Nesterov 2012-06-07 17:00 ` [PATCH 1/3] uprobes: valid_vma() should reject VM_HUGETLB Oleg Nesterov ` (3 more replies) 0 siblings, 4 replies; 19+ messages in thread From: Oleg Nesterov @ 2012-06-07 16:59 UTC (permalink / raw) To: Hugh Dickins, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju Cc: Ananth N Mavinakayanahalli, Anton Arapov, Linus Torvalds, Masami Hiramatsu, linux-kernel Hello, This doesn't depend on other uprobes patches I sent, and I think this is 3.5 material. However, please remember I do not understand vm, iow please review. write_opcode() becomes really ugly after 3/3, I'll try to cleanup it later. In fact I think it needs some cleanups even without this fix. Oleg. ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH 1/3] uprobes: valid_vma() should reject VM_HUGETLB 2012-06-07 16:59 [PATCH 0/3] uprobes fixes for 3.5 Oleg Nesterov @ 2012-06-07 17:00 ` Oleg Nesterov 2012-06-15 6:22 ` Srikar Dronamraju 2012-06-07 17:00 ` [PATCH 2/3] uprobes: __copy_insn() should ensure a_ops->readpage != NULL Oleg Nesterov ` (2 subsequent siblings) 3 siblings, 1 reply; 19+ messages in thread From: Oleg Nesterov @ 2012-06-07 17:00 UTC (permalink / raw) To: Hugh Dickins, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju Cc: Ananth N Mavinakayanahalli, Anton Arapov, Linus Torvalds, Masami Hiramatsu, linux-kernel __replace_page() obviously can't work with the hugetlbfs mappings, uprobe_register() will likely crash the kernel. Change valid_vma() to check VM_HUGETLB as well. As for PageTransHuge() no need to worry, vma->vm_file != NULL. Signed-off-by: Oleg Nesterov <oleg@redhat.com> --- kernel/events/uprobes.c | 3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 6e3b181..48d53af 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -105,7 +105,8 @@ static bool valid_vma(struct vm_area_struct *vma, bool is_register) if (!is_register) return true; - if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) == (VM_READ|VM_EXEC)) + if ((vma->vm_flags & (VM_HUGETLB|VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) + == (VM_READ|VM_EXEC)) return true; return false; -- 1.5.5.1 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH 1/3] uprobes: valid_vma() should reject VM_HUGETLB 2012-06-07 17:00 ` [PATCH 1/3] uprobes: valid_vma() should reject VM_HUGETLB Oleg Nesterov @ 2012-06-15 6:22 ` Srikar Dronamraju 0 siblings, 0 replies; 19+ messages in thread From: Srikar Dronamraju @ 2012-06-15 6:22 UTC (permalink / raw) To: Oleg Nesterov Cc: Hugh Dickins, Ingo Molnar, Peter Zijlstra, Ananth N Mavinakayanahalli, Anton Arapov, Linus Torvalds, Masami Hiramatsu, linux-kernel * Oleg Nesterov <oleg@redhat.com> [2012-06-07 19:00:02]: > __replace_page() obviously can't work with the hugetlbfs mappings, > uprobe_register() will likely crash the kernel. Change valid_vma() > to check VM_HUGETLB as well. > > As for PageTransHuge() no need to worry, vma->vm_file != NULL. > > Signed-off-by: Oleg Nesterov <oleg@redhat.com> > --- > kernel/events/uprobes.c | 3 ++- > 1 files changed, 2 insertions(+), 1 deletions(-) > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > index 6e3b181..48d53af 100644 > --- a/kernel/events/uprobes.c > +++ b/kernel/events/uprobes.c > @@ -105,7 +105,8 @@ static bool valid_vma(struct vm_area_struct *vma, bool is_register) > if (!is_register) > return true; > > - if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) == (VM_READ|VM_EXEC)) > + if ((vma->vm_flags & (VM_HUGETLB|VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) > + == (VM_READ|VM_EXEC)) > return true; > > return false; Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH 2/3] uprobes: __copy_insn() should ensure a_ops->readpage != NULL 2012-06-07 16:59 [PATCH 0/3] uprobes fixes for 3.5 Oleg Nesterov 2012-06-07 17:00 ` [PATCH 1/3] uprobes: valid_vma() should reject VM_HUGETLB Oleg Nesterov @ 2012-06-07 17:00 ` Oleg Nesterov 2012-06-15 6:25 ` Srikar Dronamraju 2012-06-07 17:00 ` [PATCH 3/3] uprobes: write_opcode()->__replace_page() can race with try_to_unmap() Oleg Nesterov 2012-06-08 14:03 ` oom-killer is crazy? (Was: [PATCH 0/3] uprobes fixes for 3.5) Oleg Nesterov 3 siblings, 1 reply; 19+ messages in thread From: Oleg Nesterov @ 2012-06-07 17:00 UTC (permalink / raw) To: Hugh Dickins, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju Cc: Ananth N Mavinakayanahalli, Anton Arapov, Linus Torvalds, Masami Hiramatsu, linux-kernel __copy_insn() blindly calls read_mapping_page(), this will crash the kernel if ->readpage == NULL, add the necessary check. For example, hugetlbfs_aops->readpage is NULL. Perhaps we should change read_mapping_page() instead. Signed-off-by: Oleg Nesterov <oleg@redhat.com> --- kernel/events/uprobes.c | 2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 48d53af..9c53bc2 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -616,6 +616,8 @@ __copy_insn(struct address_space *mapping, struct vm_area_struct *vma, char *ins if (!filp) return -EINVAL; + if (!mapping->a_ops->readpage) + return -EIO; idx = (unsigned long)(offset >> PAGE_CACHE_SHIFT); off1 = offset &= ~PAGE_MASK; -- 1.5.5.1 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH 2/3] uprobes: __copy_insn() should ensure a_ops->readpage != NULL 2012-06-07 17:00 ` [PATCH 2/3] uprobes: __copy_insn() should ensure a_ops->readpage != NULL Oleg Nesterov @ 2012-06-15 6:25 ` Srikar Dronamraju 2012-06-15 12:10 ` Ingo Molnar 0 siblings, 1 reply; 19+ messages in thread From: Srikar Dronamraju @ 2012-06-15 6:25 UTC (permalink / raw) To: Oleg Nesterov Cc: Hugh Dickins, Ingo Molnar, Peter Zijlstra, Ananth N Mavinakayanahalli, Anton Arapov, Linus Torvalds, Masami Hiramatsu, linux-kernel * Oleg Nesterov <oleg@redhat.com> [2012-06-07 19:00:18]: > __copy_insn() blindly calls read_mapping_page(), this will crash > the kernel if ->readpage == NULL, add the necessary check. For > example, hugetlbfs_aops->readpage is NULL. Perhaps we should change > read_mapping_page() instead. > > Signed-off-by: Oleg Nesterov <oleg@redhat.com> > --- > kernel/events/uprobes.c | 2 ++ > 1 files changed, 2 insertions(+), 0 deletions(-) > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > index 48d53af..9c53bc2 100644 > --- a/kernel/events/uprobes.c > +++ b/kernel/events/uprobes.c > @@ -616,6 +616,8 @@ __copy_insn(struct address_space *mapping, struct vm_area_struct *vma, char *ins > > if (!filp) > return -EINVAL; > + if (!mapping->a_ops->readpage) > + return -EIO; Nit: Should there be a blank line before the if. Ingo had insisted on these coding style changes. > idx = (unsigned long)(offset >> PAGE_CACHE_SHIFT); > off1 = offset &= ~PAGE_MASK; Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> -- Thanks and Regards Srikar ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/3] uprobes: __copy_insn() should ensure a_ops->readpage != NULL 2012-06-15 6:25 ` Srikar Dronamraju @ 2012-06-15 12:10 ` Ingo Molnar 0 siblings, 0 replies; 19+ messages in thread From: Ingo Molnar @ 2012-06-15 12:10 UTC (permalink / raw) To: Srikar Dronamraju Cc: Oleg Nesterov, Hugh Dickins, Ingo Molnar, Peter Zijlstra, Ananth N Mavinakayanahalli, Anton Arapov, Linus Torvalds, Masami Hiramatsu, linux-kernel * Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote: > * Oleg Nesterov <oleg@redhat.com> [2012-06-07 19:00:18]: > > > __copy_insn() blindly calls read_mapping_page(), this will crash > > the kernel if ->readpage == NULL, add the necessary check. For > > example, hugetlbfs_aops->readpage is NULL. Perhaps we should change > > read_mapping_page() instead. > > > > Signed-off-by: Oleg Nesterov <oleg@redhat.com> > > --- > > kernel/events/uprobes.c | 2 ++ > > 1 files changed, 2 insertions(+), 0 deletions(-) > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > > index 48d53af..9c53bc2 100644 > > --- a/kernel/events/uprobes.c > > +++ b/kernel/events/uprobes.c > > @@ -616,6 +616,8 @@ __copy_insn(struct address_space *mapping, struct vm_area_struct *vma, char *ins > > > > if (!filp) > > return -EINVAL; > > + if (!mapping->a_ops->readpage) > > + return -EIO; > > Nit: Should there be a blank line before the if. Ingo had insisted on > these coding style changes. Well, within sensible limits: if()'s can certainly be in a block if they do related things (such as boundary checks, etc.), so the above isn't ugly IMO. Nor are separating newlines strictly necessary - it's also a function of the complexity of the code in question - if it's complex code then we want all visual and cleanliness help that we can think of. Thanks, Ingo ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH 3/3] uprobes: write_opcode()->__replace_page() can race with try_to_unmap() 2012-06-07 16:59 [PATCH 0/3] uprobes fixes for 3.5 Oleg Nesterov 2012-06-07 17:00 ` [PATCH 1/3] uprobes: valid_vma() should reject VM_HUGETLB Oleg Nesterov 2012-06-07 17:00 ` [PATCH 2/3] uprobes: __copy_insn() should ensure a_ops->readpage != NULL Oleg Nesterov @ 2012-06-07 17:00 ` Oleg Nesterov 2012-06-08 8:47 ` Peter Zijlstra 2012-06-08 16:55 ` Oleg Nesterov 2012-06-08 14:03 ` oom-killer is crazy? (Was: [PATCH 0/3] uprobes fixes for 3.5) Oleg Nesterov 3 siblings, 2 replies; 19+ messages in thread From: Oleg Nesterov @ 2012-06-07 17:00 UTC (permalink / raw) To: Hugh Dickins, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju Cc: Ananth N Mavinakayanahalli, Anton Arapov, Linus Torvalds, Masami Hiramatsu, linux-kernel write_opcode() gets old_page via get_user_pages() and then calls __replace_page() which assumes that this old_page is still mapped after pte_offset_map_lock(). This is not true if this old_page was already try_to_unmap()'ed, and in this case everything __replace_page() does with old_page is wrong. Just for example, put_page() is not balanced. I think it is possible to teach __replace_page() to handle this unlikely case correctly, but this patch simply changes it to use page_check_address() and return -EAGAIN if it fails. The caller should notice this error code and retry. Note: write_opcode() asks for the cleanups, I'll try to do this in a separate patch. Signed-off-by: Oleg Nesterov <oleg@redhat.com> --- kernel/events/uprobes.c | 34 +++++++++------------------------- 1 files changed, 9 insertions(+), 25 deletions(-) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 9c53bc2..b3e8d5b 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -135,33 +135,17 @@ static loff_t vma_address(struct vm_area_struct *vma, loff_t offset) static int __replace_page(struct vm_area_struct *vma, struct page *page, struct page *kpage) { struct mm_struct *mm = vma->vm_mm; - pgd_t *pgd; - pud_t *pud; - pmd_t *pmd; - pte_t *ptep; - spinlock_t *ptl; unsigned long addr; - int err = -EFAULT; + spinlock_t *ptl; + pte_t *ptep; addr = page_address_in_vma(page, vma); if (addr == -EFAULT) - goto out; - - pgd = pgd_offset(mm, addr); - if (!pgd_present(*pgd)) - goto out; - - pud = pud_offset(pgd, addr); - if (!pud_present(*pud)) - goto out; - - pmd = pmd_offset(pud, addr); - if (!pmd_present(*pmd)) - goto out; + return -EFAULT; - ptep = pte_offset_map_lock(mm, pmd, addr, &ptl); + ptep = page_check_address(page, mm, addr, &ptl, 0); if (!ptep) - goto out; + return -EAGAIN; get_page(kpage); page_add_new_anon_rmap(kpage, vma, addr); @@ -180,10 +164,8 @@ static int __replace_page(struct vm_area_struct *vma, struct page *page, struct try_to_free_swap(page); put_page(page); pte_unmap_unlock(ptep, ptl); - err = 0; -out: - return err; + return 0; } /** @@ -230,7 +212,7 @@ static int write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, struct uprobe *uprobe; loff_t addr; int ret; - +retry: /* Read the page with vaddr into memory */ ret = get_user_pages(NULL, mm, vaddr, 1, 0, 0, &old_page, &vma); if (ret <= 0) @@ -297,6 +279,8 @@ unlock_out: put_out: put_page(old_page); + if (unlikely(ret == -EAGAIN)) + goto retry; return ret; } -- 1.5.5.1 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH 3/3] uprobes: write_opcode()->__replace_page() can race with try_to_unmap() 2012-06-07 17:00 ` [PATCH 3/3] uprobes: write_opcode()->__replace_page() can race with try_to_unmap() Oleg Nesterov @ 2012-06-08 8:47 ` Peter Zijlstra 2012-06-08 10:03 ` Oleg Nesterov 2012-06-08 16:55 ` Oleg Nesterov 1 sibling, 1 reply; 19+ messages in thread From: Peter Zijlstra @ 2012-06-08 8:47 UTC (permalink / raw) To: Oleg Nesterov Cc: Hugh Dickins, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, Linus Torvalds, Masami Hiramatsu, linux-kernel On Thu, 2012-06-07 at 19:00 +0200, Oleg Nesterov wrote: > write_opcode() gets old_page via get_user_pages() and then calls > __replace_page() which assumes that this old_page is still mapped > after pte_offset_map_lock(). > > This is not true if this old_page was already try_to_unmap()'ed, > and in this case everything __replace_page() does with old_page > is wrong. Just for example, put_page() is not balanced. > > I think it is possible to teach __replace_page() to handle this > unlikely case correctly, but this patch simply changes it to use > page_check_address() and return -EAGAIN if it fails. The caller > should notice this error code and retry. Note that replace_page() was nicked from ksm, does that suffer a similar problem? ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 3/3] uprobes: write_opcode()->__replace_page() can race with try_to_unmap() 2012-06-08 8:47 ` Peter Zijlstra @ 2012-06-08 10:03 ` Oleg Nesterov 0 siblings, 0 replies; 19+ messages in thread From: Oleg Nesterov @ 2012-06-08 10:03 UTC (permalink / raw) To: Peter Zijlstra Cc: Hugh Dickins, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, Linus Torvalds, Masami Hiramatsu, linux-kernel On 06/08, Peter Zijlstra wrote: > > On Thu, 2012-06-07 at 19:00 +0200, Oleg Nesterov wrote: > > write_opcode() gets old_page via get_user_pages() and then calls > > __replace_page() which assumes that this old_page is still mapped > > after pte_offset_map_lock(). > > > > This is not true if this old_page was already try_to_unmap()'ed, > > and in this case everything __replace_page() does with old_page > > is wrong. Just for example, put_page() is not balanced. > > > > I think it is possible to teach __replace_page() to handle this > > unlikely case correctly, but this patch simply changes it to use > > page_check_address() and return -EAGAIN if it fails. The caller > > should notice this error code and retry. > > Note that replace_page() was nicked from ksm, does that suffer a similar > problem? Yes, I looked at replace_page() too. Afaics it looks fine, it does the additional pte_same(pte, orig_pte) check. Oleg. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 3/3] uprobes: write_opcode()->__replace_page() can race with try_to_unmap() 2012-06-07 17:00 ` [PATCH 3/3] uprobes: write_opcode()->__replace_page() can race with try_to_unmap() Oleg Nesterov 2012-06-08 8:47 ` Peter Zijlstra @ 2012-06-08 16:55 ` Oleg Nesterov 2012-06-15 6:12 ` Srikar Dronamraju 1 sibling, 1 reply; 19+ messages in thread From: Oleg Nesterov @ 2012-06-08 16:55 UTC (permalink / raw) To: Hugh Dickins, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju Cc: Ananth N Mavinakayanahalli, Anton Arapov, Linus Torvalds, Masami Hiramatsu, linux-kernel On 06/07, Oleg Nesterov wrote: > > The caller > should notice this error code and retry. Damn. But I didn't notice that the caller changes "vaddr" in the middle. > Note: write_opcode() asks for the cleanups, I'll try to do this > in a separate patch. Yes. Please find v2 below. The only difference is that write_opcode() uses the temporary variable instead of "vaddr &= ~PAGE_MASK". --- Subject: [PATCH] uprobes: write_opcode()->__replace_page() can race with try_to_unmap() write_opcode() gets old_page via get_user_pages() and then calls __replace_page() which assumes that this old_page is still mapped after pte_offset_map_lock(). This is not true if this old_page was already try_to_unmap()'ed, and in this case everything __replace_page() does with old_page is wrong. Just for example, put_page() is not balanced. I think it is possible to teach __replace_page() to handle this unlikely case correctly, but this patch simply changes it to use page_check_address() and return -EAGAIN if it fails. The caller should notice this error code and retry. Note: write_opcode() asks for the cleanups, I'll try to do this in a separate patch. Signed-off-by: Oleg Nesterov <oleg@redhat.com> --- kernel/events/uprobes.c | 41 +++++++++++++---------------------------- 1 files changed, 13 insertions(+), 28 deletions(-) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 9c53bc2..54c8780 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -135,33 +135,17 @@ static loff_t vma_address(struct vm_area_struct *vma, loff_t offset) static int __replace_page(struct vm_area_struct *vma, struct page *page, struct page *kpage) { struct mm_struct *mm = vma->vm_mm; - pgd_t *pgd; - pud_t *pud; - pmd_t *pmd; - pte_t *ptep; - spinlock_t *ptl; unsigned long addr; - int err = -EFAULT; + spinlock_t *ptl; + pte_t *ptep; addr = page_address_in_vma(page, vma); if (addr == -EFAULT) - goto out; - - pgd = pgd_offset(mm, addr); - if (!pgd_present(*pgd)) - goto out; - - pud = pud_offset(pgd, addr); - if (!pud_present(*pud)) - goto out; + return -EFAULT; - pmd = pmd_offset(pud, addr); - if (!pmd_present(*pmd)) - goto out; - - ptep = pte_offset_map_lock(mm, pmd, addr, &ptl); + ptep = page_check_address(page, mm, addr, &ptl, 0); if (!ptep) - goto out; + return -EAGAIN; get_page(kpage); page_add_new_anon_rmap(kpage, vma, addr); @@ -180,10 +164,8 @@ static int __replace_page(struct vm_area_struct *vma, struct page *page, struct try_to_free_swap(page); put_page(page); pte_unmap_unlock(ptep, ptl); - err = 0; -out: - return err; + return 0; } /** @@ -228,9 +210,10 @@ static int write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, void *vaddr_old, *vaddr_new; struct vm_area_struct *vma; struct uprobe *uprobe; + unsigned long pgoff; loff_t addr; int ret; - +retry: /* Read the page with vaddr into memory */ ret = get_user_pages(NULL, mm, vaddr, 1, 0, 0, &old_page, &vma); if (ret <= 0) @@ -275,9 +258,9 @@ static int write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, memcpy(vaddr_new, vaddr_old, PAGE_SIZE); /* poke the new insn in, ASSUMES we don't cross page boundary */ - vaddr &= ~PAGE_MASK; - BUG_ON(vaddr + UPROBE_SWBP_INSN_SIZE > PAGE_SIZE); - memcpy(vaddr_new + vaddr, &opcode, UPROBE_SWBP_INSN_SIZE); + pgoff = (vaddr & ~PAGE_MASK); + BUG_ON(pgoff + UPROBE_SWBP_INSN_SIZE > PAGE_SIZE); + memcpy(vaddr_new + pgoff, &opcode, UPROBE_SWBP_INSN_SIZE); kunmap_atomic(vaddr_new); kunmap_atomic(vaddr_old); @@ -297,6 +280,8 @@ unlock_out: put_out: put_page(old_page); + if (unlikely(ret == -EAGAIN)) + goto retry; return ret; } -- 1.5.5.1 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH 3/3] uprobes: write_opcode()->__replace_page() can race with try_to_unmap() 2012-06-08 16:55 ` Oleg Nesterov @ 2012-06-15 6:12 ` Srikar Dronamraju 2012-06-15 12:11 ` Ingo Molnar 0 siblings, 1 reply; 19+ messages in thread From: Srikar Dronamraju @ 2012-06-15 6:12 UTC (permalink / raw) To: Oleg Nesterov Cc: Hugh Dickins, Ingo Molnar, Peter Zijlstra, Ananth N Mavinakayanahalli, Anton Arapov, Linus Torvalds, Masami Hiramatsu, linux-kernel > --- > Subject: [PATCH] uprobes: write_opcode()->__replace_page() can race with try_to_unmap() > > write_opcode() gets old_page via get_user_pages() and then calls > __replace_page() which assumes that this old_page is still mapped > after pte_offset_map_lock(). > > This is not true if this old_page was already try_to_unmap()'ed, > and in this case everything __replace_page() does with old_page > is wrong. Just for example, put_page() is not balanced. > > I think it is possible to teach __replace_page() to handle this > unlikely case correctly, but this patch simply changes it to use > page_check_address() and return -EAGAIN if it fails. The caller > should notice this error code and retry. > > Note: write_opcode() asks for the cleanups, I'll try to do this > in a separate patch. > > Signed-off-by: Oleg Nesterov <oleg@redhat.com> > --- > kernel/events/uprobes.c | 41 +++++++++++++---------------------------- > 1 files changed, 13 insertions(+), 28 deletions(-) > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > index 9c53bc2..54c8780 100644 > --- a/kernel/events/uprobes.c > +++ b/kernel/events/uprobes.c > @@ -135,33 +135,17 @@ static loff_t vma_address(struct vm_area_struct *vma, loff_t offset) > static int __replace_page(struct vm_area_struct *vma, struct page *page, struct page *kpage) > { > struct mm_struct *mm = vma->vm_mm; > - pgd_t *pgd; > - pud_t *pud; > - pmd_t *pmd; > - pte_t *ptep; > - spinlock_t *ptl; > unsigned long addr; > - int err = -EFAULT; > + spinlock_t *ptl; > + pte_t *ptep; > > addr = page_address_in_vma(page, vma); > if (addr == -EFAULT) > - goto out; > - > - pgd = pgd_offset(mm, addr); > - if (!pgd_present(*pgd)) > - goto out; > - > - pud = pud_offset(pgd, addr); > - if (!pud_present(*pud)) > - goto out; > + return -EFAULT; > > - pmd = pmd_offset(pud, addr); > - if (!pmd_present(*pmd)) > - goto out; > - > - ptep = pte_offset_map_lock(mm, pmd, addr, &ptl); > + ptep = page_check_address(page, mm, addr, &ptl, 0); > if (!ptep) > - goto out; > + return -EAGAIN; > > get_page(kpage); > page_add_new_anon_rmap(kpage, vma, addr); > @@ -180,10 +164,8 @@ static int __replace_page(struct vm_area_struct *vma, struct page *page, struct > try_to_free_swap(page); > put_page(page); > pte_unmap_unlock(ptep, ptl); > - err = 0; > > -out: > - return err; > + return 0; > } > > /** > @@ -228,9 +210,10 @@ static int write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, > void *vaddr_old, *vaddr_new; > struct vm_area_struct *vma; > struct uprobe *uprobe; > + unsigned long pgoff; > loff_t addr; > int ret; > - > +retry: Just a check on coding style: Shouldnt we have a preceeding blank line before the goto label. > /* Read the page with vaddr into memory */ > ret = get_user_pages(NULL, mm, vaddr, 1, 0, 0, &old_page, &vma); > if (ret <= 0) > @@ -275,9 +258,9 @@ static int write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, > memcpy(vaddr_new, vaddr_old, PAGE_SIZE); > > /* poke the new insn in, ASSUMES we don't cross page boundary */ > - vaddr &= ~PAGE_MASK; > - BUG_ON(vaddr + UPROBE_SWBP_INSN_SIZE > PAGE_SIZE); > - memcpy(vaddr_new + vaddr, &opcode, UPROBE_SWBP_INSN_SIZE); > + pgoff = (vaddr & ~PAGE_MASK); > + BUG_ON(pgoff + UPROBE_SWBP_INSN_SIZE > PAGE_SIZE); > + memcpy(vaddr_new + pgoff, &opcode, UPROBE_SWBP_INSN_SIZE); > > kunmap_atomic(vaddr_new); > kunmap_atomic(vaddr_old); > @@ -297,6 +280,8 @@ unlock_out: > put_out: > put_page(old_page); > > + if (unlikely(ret == -EAGAIN)) > + goto retry; > return ret; > } > Looks good. Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> -- Thanks and Regards Srikar ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 3/3] uprobes: write_opcode()->__replace_page() can race with try_to_unmap() 2012-06-15 6:12 ` Srikar Dronamraju @ 2012-06-15 12:11 ` Ingo Molnar 2012-06-15 15:48 ` Oleg Nesterov 0 siblings, 1 reply; 19+ messages in thread From: Ingo Molnar @ 2012-06-15 12:11 UTC (permalink / raw) To: Srikar Dronamraju Cc: Oleg Nesterov, Hugh Dickins, Ingo Molnar, Peter Zijlstra, Ananth N Mavinakayanahalli, Anton Arapov, Linus Torvalds, Masami Hiramatsu, linux-kernel * Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote: > > --- > > Subject: [PATCH] uprobes: write_opcode()->__replace_page() can race with try_to_unmap() > > > > write_opcode() gets old_page via get_user_pages() and then calls > > __replace_page() which assumes that this old_page is still mapped > > after pte_offset_map_lock(). > > > > This is not true if this old_page was already try_to_unmap()'ed, > > and in this case everything __replace_page() does with old_page > > is wrong. Just for example, put_page() is not balanced. > > > > I think it is possible to teach __replace_page() to handle this > > unlikely case correctly, but this patch simply changes it to use > > page_check_address() and return -EAGAIN if it fails. The caller > > should notice this error code and retry. > > > > Note: write_opcode() asks for the cleanups, I'll try to do this > > in a separate patch. > > > > Signed-off-by: Oleg Nesterov <oleg@redhat.com> > > --- > > kernel/events/uprobes.c | 41 +++++++++++++---------------------------- > > 1 files changed, 13 insertions(+), 28 deletions(-) > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > > index 9c53bc2..54c8780 100644 > > --- a/kernel/events/uprobes.c > > +++ b/kernel/events/uprobes.c > > @@ -135,33 +135,17 @@ static loff_t vma_address(struct vm_area_struct *vma, loff_t offset) > > static int __replace_page(struct vm_area_struct *vma, struct page *page, struct page *kpage) > > { > > struct mm_struct *mm = vma->vm_mm; > > - pgd_t *pgd; > > - pud_t *pud; > > - pmd_t *pmd; > > - pte_t *ptep; > > - spinlock_t *ptl; > > unsigned long addr; > > - int err = -EFAULT; > > + spinlock_t *ptl; > > + pte_t *ptep; > > > > addr = page_address_in_vma(page, vma); > > if (addr == -EFAULT) > > - goto out; > > - > > - pgd = pgd_offset(mm, addr); > > - if (!pgd_present(*pgd)) > > - goto out; > > - > > - pud = pud_offset(pgd, addr); > > - if (!pud_present(*pud)) > > - goto out; > > + return -EFAULT; > > > > - pmd = pmd_offset(pud, addr); > > - if (!pmd_present(*pmd)) > > - goto out; > > - > > - ptep = pte_offset_map_lock(mm, pmd, addr, &ptl); > > + ptep = page_check_address(page, mm, addr, &ptl, 0); > > if (!ptep) > > - goto out; > > + return -EAGAIN; > > > > get_page(kpage); > > page_add_new_anon_rmap(kpage, vma, addr); > > @@ -180,10 +164,8 @@ static int __replace_page(struct vm_area_struct *vma, struct page *page, struct > > try_to_free_swap(page); > > put_page(page); > > pte_unmap_unlock(ptep, ptl); > > - err = 0; > > > > -out: > > - return err; > > + return 0; > > } > > > > /** > > @@ -228,9 +210,10 @@ static int write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, > > void *vaddr_old, *vaddr_new; > > struct vm_area_struct *vma; > > struct uprobe *uprobe; > > + unsigned long pgoff; > > loff_t addr; > > int ret; > > - > > +retry: > > Just a check on coding style: Shouldnt we have a preceeding blank > line before the goto label. Yeah, that's most likely helpful to readability. Thanks, Ingo ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 3/3] uprobes: write_opcode()->__replace_page() can race with try_to_unmap() 2012-06-15 12:11 ` Ingo Molnar @ 2012-06-15 15:48 ` Oleg Nesterov 2012-06-16 7:11 ` Ingo Molnar 0 siblings, 1 reply; 19+ messages in thread From: Oleg Nesterov @ 2012-06-15 15:48 UTC (permalink / raw) To: Ingo Molnar Cc: Srikar Dronamraju, Hugh Dickins, Ingo Molnar, Peter Zijlstra, Ananth N Mavinakayanahalli, Anton Arapov, Linus Torvalds, Masami Hiramatsu, linux-kernel On 06/15, Ingo Molnar wrote: > > * Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote: > > > > @@ -228,9 +210,10 @@ static int write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, > > > void *vaddr_old, *vaddr_new; > > > struct vm_area_struct *vma; > > > struct uprobe *uprobe; > > > + unsigned long pgoff; > > > loff_t addr; > > > int ret; > > > - > > > +retry: > > > > Just a check on coding style: Shouldnt we have a preceeding blank > > line before the goto label. > > Yeah, that's most likely helpful to readability. Aaah. Srikar, sorry, I didn't notice this comment and I already sent 1-15. But I added the blank line in 2/15 ;) Ingo, please let me know if I need to re-diff and resend. Otherwise I'll add the blank line later, write_opcode() needs more changes anyway. Oleg. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 3/3] uprobes: write_opcode()->__replace_page() can race with try_to_unmap() 2012-06-15 15:48 ` Oleg Nesterov @ 2012-06-16 7:11 ` Ingo Molnar 0 siblings, 0 replies; 19+ messages in thread From: Ingo Molnar @ 2012-06-16 7:11 UTC (permalink / raw) To: Oleg Nesterov Cc: Srikar Dronamraju, Hugh Dickins, Ingo Molnar, Peter Zijlstra, Ananth N Mavinakayanahalli, Anton Arapov, Linus Torvalds, Masami Hiramatsu, linux-kernel * Oleg Nesterov <oleg@redhat.com> wrote: > On 06/15, Ingo Molnar wrote: > > > > * Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote: > > > > > > @@ -228,9 +210,10 @@ static int write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, > > > > void *vaddr_old, *vaddr_new; > > > > struct vm_area_struct *vma; > > > > struct uprobe *uprobe; > > > > + unsigned long pgoff; > > > > loff_t addr; > > > > int ret; > > > > - > > > > +retry: > > > > > > Just a check on coding style: Shouldnt we have a preceeding blank > > > line before the goto label. > > > > Yeah, that's most likely helpful to readability. > > Aaah. Srikar, sorry, I didn't notice this comment and I already > sent 1-15. But I added the blank line in 2/15 ;) > > Ingo, please let me know if I need to re-diff and resend. > Otherwise I'll add the blank line later, write_opcode() needs > more changes anyway. No need, I've added it before I committed the patch. Thanks, Ingo ^ permalink raw reply [flat|nested] 19+ messages in thread
* oom-killer is crazy? (Was: [PATCH 0/3] uprobes fixes for 3.5) 2012-06-07 16:59 [PATCH 0/3] uprobes fixes for 3.5 Oleg Nesterov ` (2 preceding siblings ...) 2012-06-07 17:00 ` [PATCH 3/3] uprobes: write_opcode()->__replace_page() can race with try_to_unmap() Oleg Nesterov @ 2012-06-08 14:03 ` Oleg Nesterov 2012-06-08 14:26 ` Dave Jones 3 siblings, 1 reply; 19+ messages in thread From: Oleg Nesterov @ 2012-06-08 14:03 UTC (permalink / raw) To: Hugh Dickins, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju, KOSAKI Motohiro, Dave Jones, David Rientjes Cc: Ananth N Mavinakayanahalli, Anton Arapov, Linus Torvalds, Masami Hiramatsu, linux-kernel On 06/07, Oleg Nesterov wrote: > > This doesn't depend on other uprobes patches I sent, and I think > this is 3.5 material. And during the testing I found another thing which should be fixed in 3.5 imho. I noticed that oom-killer goes crazy. In the simplest case, when there is the single and "obvious" memory hog it kills sshd daemon. Hmm. oom_badness() does if (has_capability_noaudit(p, CAP_SYS_ADMIN)) points -= 30 * totalpages / 1000; very nice, but what if this underflows? points is unsigned long. points += p->signal->oom_score_adj... looks suspicious too. Looks like we should remove "unsigned" from oom_badness() and its callers? Probably not, it does "return points ? points : 1". Confused. Oleg. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: oom-killer is crazy? (Was: [PATCH 0/3] uprobes fixes for 3.5) 2012-06-08 14:03 ` oom-killer is crazy? (Was: [PATCH 0/3] uprobes fixes for 3.5) Oleg Nesterov @ 2012-06-08 14:26 ` Dave Jones 2012-06-08 15:04 ` Oleg Nesterov 0 siblings, 1 reply; 19+ messages in thread From: Dave Jones @ 2012-06-08 14:26 UTC (permalink / raw) To: Oleg Nesterov Cc: Hugh Dickins, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju, KOSAKI Motohiro, David Rientjes, Ananth N Mavinakayanahalli, Anton Arapov, Linus Torvalds, Masami Hiramatsu, linux-kernel On Fri, Jun 08, 2012 at 04:03:28PM +0200, Oleg Nesterov wrote: > On 06/07, Oleg Nesterov wrote: > > > > This doesn't depend on other uprobes patches I sent, and I think > > this is 3.5 material. > > And during the testing I found another thing which should be fixed > in 3.5 imho. I noticed that oom-killer goes crazy. In the simplest > case, when there is the single and "obvious" memory hog it kills > sshd daemon. > > Hmm. oom_badness() does > > if (has_capability_noaudit(p, CAP_SYS_ADMIN)) > points -= 30 * totalpages / 1000; > > very nice, but what if this underflows? points is unsigned long. > points += p->signal->oom_score_adj... looks suspicious too. > > Looks like we should remove "unsigned" from oom_badness() and > its callers? Probably not, it does "return points ? points : 1". I've been running this from David for a week, but it still isn't right.. Dave diff --git a/mm/oom_kill.c b/mm/oom_kill.c index ed0e196..416637f 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -183,7 +183,7 @@ static bool oom_unkillable_task(struct task_struct *p, unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg, const nodemask_t *nodemask, unsigned long totalpages) { - unsigned long points; + long points; if (oom_unkillable_task(p, memcg, nodemask)) return 0; @@ -223,7 +223,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg, * Never return 0 for an eligible task regardless of the root bonus and * oom_score_adj (oom_score_adj can't be OOM_SCORE_ADJ_MIN here). */ - return points ? points : 1; + return points > 0 ? points : 1; } /* ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: oom-killer is crazy? (Was: [PATCH 0/3] uprobes fixes for 3.5) 2012-06-08 14:26 ` Dave Jones @ 2012-06-08 15:04 ` Oleg Nesterov 2012-06-08 20:21 ` [patch for-3.5-rc1] mm, oom: fix badness score underflow David Rientjes 0 siblings, 1 reply; 19+ messages in thread From: Oleg Nesterov @ 2012-06-08 15:04 UTC (permalink / raw) To: Dave Jones, Hugh Dickins, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju, KOSAKI Motohiro, David Rientjes, Ananth N Mavinakayanahalli, Anton Arapov, Linus Torvalds, Masami Hiramatsu, linux-kernel On 06/08, Dave Jones wrote: > > On Fri, Jun 08, 2012 at 04:03:28PM +0200, Oleg Nesterov wrote: > > > > Hmm. oom_badness() does > > > > if (has_capability_noaudit(p, CAP_SYS_ADMIN)) > > points -= 30 * totalpages / 1000; > > > > very nice, but what if this underflows? points is unsigned long. > > points += p->signal->oom_score_adj... looks suspicious too. > > > > Looks like we should remove "unsigned" from oom_badness() and > > its callers? Probably not, it does "return points ? points : 1". > > I've been running this from David for a week, but it still isn't right.. > > Dave > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index ed0e196..416637f 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -183,7 +183,7 @@ static bool oom_unkillable_task(struct task_struct *p, > unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg, > const nodemask_t *nodemask, unsigned long totalpages) > { > - unsigned long points; > + long points; > > if (oom_unkillable_task(p, memcg, nodemask)) > return 0; > @@ -223,7 +223,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg, > * Never return 0 for an eligible task regardless of the root bonus and > * oom_score_adj (oom_score_adj can't be OOM_SCORE_ADJ_MIN here). > */ > - return points ? points : 1; > + return points > 0 ? points : 1; > } I did the same to avoid the problem. Even if it still isn't right, I think it is much better ;) Currently oom_badness() is obviously and seriously broken. Oleg. ^ permalink raw reply [flat|nested] 19+ messages in thread
* [patch for-3.5-rc1] mm, oom: fix badness score underflow 2012-06-08 15:04 ` Oleg Nesterov @ 2012-06-08 20:21 ` David Rientjes 2012-06-09 22:25 ` KOSAKI Motohiro 0 siblings, 1 reply; 19+ messages in thread From: David Rientjes @ 2012-06-08 20:21 UTC (permalink / raw) To: Linus Torvalds, Andrew Morton Cc: Oleg Nesterov, Dave Jones, Hugh Dickins, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju, KOSAKI Motohiro, Ananth N Mavinakayanahalli, Anton Arapov, Masami Hiramatsu, linux-kernel If the privileges given to root threads (3% of allowable memory) or a negative value of /proc/pid/oom_score_adj happen to exceed the amount of rss of a thread, its badness score overflows as a result of a7f638f999ff ("mm, oom: normalize oom scores to oom_score_adj scale only for userspace"). Fix this by making the type signed and return 1, meaning the thread is still eligible for kill, if the value is negative. Reported-by: Dave Jones <davej@redhat.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> --- mm/oom_kill.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -183,7 +183,7 @@ static bool oom_unkillable_task(struct task_struct *p, unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg, const nodemask_t *nodemask, unsigned long totalpages) { - unsigned long points; + long points; if (oom_unkillable_task(p, memcg, nodemask)) return 0; @@ -223,7 +223,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg, * Never return 0 for an eligible task regardless of the root bonus and * oom_score_adj (oom_score_adj can't be OOM_SCORE_ADJ_MIN here). */ - return points ? points : 1; + return points > 0 ? points : 1; } /* ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [patch for-3.5-rc1] mm, oom: fix badness score underflow 2012-06-08 20:21 ` [patch for-3.5-rc1] mm, oom: fix badness score underflow David Rientjes @ 2012-06-09 22:25 ` KOSAKI Motohiro 0 siblings, 0 replies; 19+ messages in thread From: KOSAKI Motohiro @ 2012-06-09 22:25 UTC (permalink / raw) To: David Rientjes Cc: Linus Torvalds, Andrew Morton, Oleg Nesterov, Dave Jones, Hugh Dickins, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju, KOSAKI Motohiro, Ananth N Mavinakayanahalli, Anton Arapov, Masami Hiramatsu, linux-kernel, kosaki.motohiro (6/8/12 4:21 PM), David Rientjes wrote: > If the privileges given to root threads (3% of allowable memory) or a > negative value of /proc/pid/oom_score_adj happen to exceed the amount of > rss of a thread, its badness score overflows as a result of a7f638f999ff > ("mm, oom: normalize oom scores to oom_score_adj scale only for > userspace"). > > Fix this by making the type signed and return 1, meaning the thread is > still eligible for kill, if the value is negative. > > Reported-by: Dave Jones<davej@redhat.com> > Acked-by: Oleg Nesterov<oleg@redhat.com> > Signed-off-by: David Rientjes<rientjes@google.com> > --- > mm/oom_kill.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -183,7 +183,7 @@ static bool oom_unkillable_task(struct task_struct *p, > unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg, > const nodemask_t *nodemask, unsigned long totalpages) > { > - unsigned long points; > + long points; > > if (oom_unkillable_task(p, memcg, nodemask)) > return 0; > @@ -223,7 +223,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg, > * Never return 0 for an eligible task regardless of the root bonus and > * oom_score_adj (oom_score_adj can't be OOM_SCORE_ADJ_MIN here). > */ > - return points ? points : 1; > + return points> 0 ? points : 1; > } Use long long. following line is dangerous. points += p->signal->oom_score_adj * totalpages / 1000; maximum oom_score_adj is 1000. then if system has >8G memory on 32bit (i.e. LONG_MAX [pages] * 4096 [pagesize] / 1000), it might get an overflow. Or, don't use normalized oom_score_adj. i.e, oom_score_adj_write() convert oom_score_adj into rss based modifier. This is oom-killer code. A micro optimization don't bring us a performance benefit. ^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2012-06-16 7:11 UTC | newest] Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2012-06-07 16:59 [PATCH 0/3] uprobes fixes for 3.5 Oleg Nesterov 2012-06-07 17:00 ` [PATCH 1/3] uprobes: valid_vma() should reject VM_HUGETLB Oleg Nesterov 2012-06-15 6:22 ` Srikar Dronamraju 2012-06-07 17:00 ` [PATCH 2/3] uprobes: __copy_insn() should ensure a_ops->readpage != NULL Oleg Nesterov 2012-06-15 6:25 ` Srikar Dronamraju 2012-06-15 12:10 ` Ingo Molnar 2012-06-07 17:00 ` [PATCH 3/3] uprobes: write_opcode()->__replace_page() can race with try_to_unmap() Oleg Nesterov 2012-06-08 8:47 ` Peter Zijlstra 2012-06-08 10:03 ` Oleg Nesterov 2012-06-08 16:55 ` Oleg Nesterov 2012-06-15 6:12 ` Srikar Dronamraju 2012-06-15 12:11 ` Ingo Molnar 2012-06-15 15:48 ` Oleg Nesterov 2012-06-16 7:11 ` Ingo Molnar 2012-06-08 14:03 ` oom-killer is crazy? (Was: [PATCH 0/3] uprobes fixes for 3.5) Oleg Nesterov 2012-06-08 14:26 ` Dave Jones 2012-06-08 15:04 ` Oleg Nesterov 2012-06-08 20:21 ` [patch for-3.5-rc1] mm, oom: fix badness score underflow David Rientjes 2012-06-09 22:25 ` KOSAKI Motohiro
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).