From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>,
	zhongjiang <zhongjiang@huawei.com>,
	akpm@linux-foundation.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH] mm: fix account pmd page to the process
Date: Fri, 17 Jun 2016 15:25:06 +0300
Message-ID: <20160617122506.GC6534@node.shutemov.name>
In-Reply-To: <bf76cc6c-a0da-98f9-4a89-0bb6161f5adf@oracle.com>

On Thu, Jun 16, 2016 at 09:47:46AM -0700, Mike Kravetz wrote:
> On 06/16/2016 09:31 AM, Michal Hocko wrote:
> > On Thu 16-06-16 09:05:23, Mike Kravetz wrote:
> >> On 06/16/2016 08:43 AM, Michal Hocko wrote:
> >>> [It seems that this patch has been sent several times and this
> >>> particular copy didn't add Kirill who has added this code CC him now]
> >>>
> >>> On Thu 16-06-16 17:42:14, Michal Hocko wrote:
> >>>> On Thu 16-06-16 19:36:11, zhongjiang wrote:
> >>>>> From: zhong jiang <zhongjiang@huawei.com>
> >>>>>
> >>>>> when a process acquire a pmd table shared by other process, we
> >>>>> increase the account to current process. otherwise, a race result
> >>>>> in other tasks have set the pud entry. so it no need to increase it.
> >>>>>
> >>>>> Signed-off-by: zhong jiang <zhongjiang@huawei.com>
> >>>>> ---
> >>>>>  mm/hugetlb.c | 5 ++---
> >>>>>  1 file changed, 2 insertions(+), 3 deletions(-)
> >>>>>
> >>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >>>>> index 19d0d08..3b025c5 100644
> >>>>> --- a/mm/hugetlb.c
> >>>>> +++ b/mm/hugetlb.c
> >>>>> @@ -4189,10 +4189,9 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> >>>>>  	if (pud_none(*pud)) {
> >>>>>  		pud_populate(mm, pud,
> >>>>>  				(pmd_t *)((unsigned long)spte & PAGE_MASK));
> >>>>> -	} else {
> >>>>> +	} else 
> >>>>>  		put_page(virt_to_page(spte));
> >>>>> -		mm_inc_nr_pmds(mm);
> >>>>> -	}
> >>>>
> >>>> The code is quite puzzling but is this correct? Shouldn't we rather do
> >>>> mm_dec_nr_pmds(mm) in that path to undo the previous inc?
> >>
> >> I agree that the code is quite puzzling. :(
> >>
> >> However, if this were an issue I would have expected to see some reports.
> >> Oracle DB makes use of this feature (shared page tables) and if the pmd
> >> count is wrong we would catch it in check_mm() at exit time.
> >>
> >> Upon closer examination, I believe the code in question is never executed.
> >> Note the callers of huge_pmd_share.  The calling code looks like:
> >>
> >>                         if (want_pmd_share() && pud_none(*pud))
> >>                                 pte = huge_pmd_share(mm, addr, pud);
> >>                         else
> >>                                 pte = (pte_t *)pmd_alloc(mm, pud, addr);
> >>
> >> Therefore, we do not call huge_pmd_share unless pud_none(*pud).  The
> >> code in question is only executed when !pud_none(*pud).
> > 
> > My understanding is that the check is needed after we retake page lock
> > because we might have raced with other thread. But it's been quite some
> > time since I've looked at hugetlb locking and page table sharing code.
> 
> That is correct, we could have raced. Duh!
> 
> In the case of a race, the other thread would have incremented the
> PMD count already.  Your suggestion of decrementing pmd count in
> this case seems to be the correct approach.  But, I need to think
> about this some more.

Yes, I made a mistake by increasing nr_pmds rather than decreasing it here.

Testcase:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>

#define HPGSZ 2097152UL	/* 2MB huge page size */

int main(void)
{
	char *addr;
	pid_t pid;

	/* Reserve 1024 huge pages (2GB in total); needs root */
	if (system("echo 1024 > /proc/sys/vm/nr_hugepages"))
		fprintf(stderr, "Failed to set nr_hugepages\n");

	addr = mmap(NULL, 1024*HPGSZ, PROT_WRITE | PROT_READ,
			MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB | MAP_POPULATE, -1, 0);
	if (addr == MAP_FAILED) {
		fprintf(stderr, "Failed to alloc hugepage\n");
		return -1;
	}

	addr[0] = 1;

	/* Parent and child both map the same shared hugetlb region */
	pid = fork();
	if (pid < 0) {
		perror("fork");
		return -1;
	}
	printf("addr[0]: %d\n", addr[0]);

	/* Both processes exit here; the nr_pmds check runs on mm teardown */
	sleep(1);
	return 0;
}

You can simulate the race by replacing 'if (pud_none(*pud))' with 'if (0)'.
With that change, the test case produces
"BUG: non-zero nr_pmds on freeing mm: 2".

Fix:

From fd22922e7b4664e83653a84331f0a95b985bff0c Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Date: Fri, 17 Jun 2016 15:07:03 +0300
Subject: [PATCH] hugetlb: fix nr_pmds accounting with shared page tables

We account HugeTLB's shared page table to all processes that share it.
The accounting happens during huge_pmd_share().

If somebody populates the pud entry under us, we should decrease the page
table's refcount and decrease nr_pmds of the process.

By mistake, I increased nr_pmds again in this case. :-/
This leads to "BUG: non-zero nr_pmds on freeing mm: 2" on process exit.

Let's fix this by increasing nr_pmds only when we're sure that the page
table will be used.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: zhongjiang <zhongjiang@huawei.com>
Fixes: dc6c9a35b66b ("mm: account pmd page tables to the process")
Cc: <stable@vger.kernel.org>        [4.0+]
---
 mm/hugetlb.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e197cd7080e6..ed6a537f0878 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4216,7 +4216,6 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 		if (saddr) {
 			spte = huge_pte_offset(svma->vm_mm, saddr);
 			if (spte) {
-				mm_inc_nr_pmds(mm);
 				get_page(virt_to_page(spte));
 				break;
 			}
@@ -4231,9 +4230,9 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	if (pud_none(*pud)) {
 		pud_populate(mm, pud,
 				(pmd_t *)((unsigned long)spte & PAGE_MASK));
+		mm_inc_nr_pmds(mm);
 	} else {
 		put_page(virt_to_page(spte));
-		mm_inc_nr_pmds(mm);
 	}
 	spin_unlock(ptl);
 out:
-- 
 Kirill A. Shutemov
