* [RFC PATCH 0/2] x86: fix Xen PV regression caused by NUMA page migration
@ 2014-03-21 18:18 David Vrabel
  2014-03-21 18:18 ` [PATCH 1/2] Revert "xen: properly account for _PAGE_NUMA during xen pte translations" David Vrabel
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: David Vrabel @ 2014-03-21 18:18 UTC (permalink / raw)
  To: xen-devel
  Cc: Steven Noonan, Mel Gorman, Elena Ufimtseva, Boris Ostrovsky,
	David Vrabel

This series should properly fix the Xen PV guest regression introduced
by 1667918b6483 (mm: numa: clear numa hinting information on
mprotect).  The previous fix a9c8e4beeeb6 (xen: properly account for
_PAGE_NUMA during xen pte translations) breaks save/restore
(migration) and needs to be reverted.

I've only given this series a minimal amount of testing and would
appreciate testing by someone who experienced/reproduced the original
regression.

David

* [PATCH 1/2] Revert "xen: properly account for _PAGE_NUMA during xen pte translations"
  2014-03-21 18:18 [RFC PATCH 0/2] x86: fix Xen PV regression caused by NUMA page migration David Vrabel
@ 2014-03-21 18:18 ` David Vrabel
  2014-03-21 18:18 ` [PATCH 2/2] x86: use pv-ops in {pte, pmd}_{set, clear}_flags() David Vrabel
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 16+ messages in thread
From: David Vrabel @ 2014-03-21 18:18 UTC (permalink / raw)
  To: xen-devel
  Cc: Steven Noonan, Mel Gorman, Elena Ufimtseva, Boris Ostrovsky,
	David Vrabel

This reverts commit a9c8e4beeeb64c22b84c803747487857fe424b68.

PTEs in Xen PV guests must contain machine addresses if _PAGE_PRESENT
is set and pseudo-physical addresses if _PAGE_PRESENT is clear.

This is because during a domain save/restore (migration) the page
table entries are "canonicalised" and "uncanonicalised", i.e. MFNs are
converted to PFNs during domain save so that on a restore the page
table entries may be rewritten with the new MFNs on the destination.
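As an illustrative sketch of the save-side conversion (a hedged userspace model, not the actual toolstack code; the toy M2P table and mask values are invented for the example):

```c
#include <assert.h>

typedef unsigned long pteval_t;

#define _PAGE_PRESENT 0x001UL
#define PAGE_SHIFT    12
#define PTE_PFN_MASK  (~0xfffUL)  /* toy mask: all bits above the flags */

/* Toy machine-to-physical table standing in for the real M2P. */
static const unsigned long m2p[] = { 7, 3, 9, 1 };

static unsigned long mfn_to_pfn(unsigned long mfn)
{
        return m2p[mfn];
}

/* On save, a present PTE holds an MFN; rewrite it to hold the PFN. */
static pteval_t canonicalise(pteval_t val)
{
        if (val & _PAGE_PRESENT) {
                unsigned long mfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;

                val = (val & ~PTE_PFN_MASK) |
                      ((pteval_t)mfn_to_pfn(mfn) << PAGE_SHIFT);
        }
        /*
         * A non-present PTE is copied as-is on the assumption that it
         * already holds a PFN.  An MFN left in a !_PAGE_PRESENT entry
         * (e.g. one with only _PAGE_NUMA set) is therefore migrated
         * unchanged and is wrong on the destination.
         */
        return val;
}
```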

The reverted commit resulted in writing PTEs with MFNs if _PAGE_NUMA was set
but _PAGE_PRESENT was clear.  These PTEs would be migrated as-is which
would result in unexpected behaviour in the destination domain.
Either a) the MFN would be translated to the wrong PFN/page; b)
setting the _PAGE_PRESENT bit would clear the PTE because the MFN is
no longer owned by the domain; or c) the present bit would not get
set.

Symptoms include "Bad page" reports when unmapping after migrating a
domain.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Steven Noonan <steven@uplinklabs.net>
Cc: Elena Ufimtseva <ufimtseva@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: <stable@vger.kernel.org>        [3.12+]
---
 arch/x86/include/asm/pgtable.h |   14 ++------------
 arch/x86/xen/mmu.c             |    4 ++--
 2 files changed, 4 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 5ad38ad..bbc8b12 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -445,20 +445,10 @@ static inline int pte_same(pte_t a, pte_t b)
 	return a.pte == b.pte;
 }
 
-static inline int pteval_present(pteval_t pteval)
-{
-	/*
-	 * Yes Linus, _PAGE_PROTNONE == _PAGE_NUMA. Expressing it this
-	 * way clearly states that the intent is that protnone and numa
-	 * hinting ptes are considered present for the purposes of
-	 * pagetable operations like zapping, protection changes, gup etc.
-	 */
-	return pteval & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_NUMA);
-}
-
 static inline int pte_present(pte_t a)
 {
-	return pteval_present(pte_flags(a));
+	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
+			       _PAGE_NUMA);
 }
 
 #define pte_accessible pte_accessible
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 256282e..2423ef0 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -365,7 +365,7 @@ void xen_ptep_modify_prot_commit(struct mm_struct *mm, unsigned long addr,
 /* Assume pteval_t is equivalent to all the other *val_t types. */
 static pteval_t pte_mfn_to_pfn(pteval_t val)
 {
-	if (pteval_present(val)) {
+	if (val & _PAGE_PRESENT) {
 		unsigned long mfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
 		unsigned long pfn = mfn_to_pfn(mfn);
 
@@ -381,7 +381,7 @@ static pteval_t pte_mfn_to_pfn(pteval_t val)
 
 static pteval_t pte_pfn_to_mfn(pteval_t val)
 {
-	if (pteval_present(val)) {
+	if (val & _PAGE_PRESENT) {
 		unsigned long pfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
 		pteval_t flags = val & PTE_FLAGS_MASK;
 		unsigned long mfn;
-- 
1.7.2.5

* [PATCH 2/2] x86: use pv-ops in {pte, pmd}_{set, clear}_flags()
  2014-03-21 18:18 [RFC PATCH 0/2] x86: fix Xen PV regression caused by NUMA page migration David Vrabel
  2014-03-21 18:18 ` [PATCH 1/2] Revert "xen: properly account for _PAGE_NUMA during xen pte translations" David Vrabel
@ 2014-03-21 18:18 ` David Vrabel
  2014-03-24 11:28   ` David Vrabel
  2014-03-26 19:10 ` [RFC PATCH 0/2] x86: fix Xen PV regression caused by NUMA page migration David Vrabel
  2014-04-15  8:24 ` David Sutton
  3 siblings, 1 reply; 16+ messages in thread
From: David Vrabel @ 2014-03-21 18:18 UTC (permalink / raw)
  To: xen-devel
  Cc: Steven Noonan, Mel Gorman, Elena Ufimtseva, Boris Ostrovsky,
	David Vrabel

Instead of using native functions to operate on the PTEs in
pte_set_flags(), pte_clear_flags(), pmd_set_flags(), pmd_clear_flags()
use the PV aware ones.

This fixes a regression in Xen PV guests introduced by 1667918b6483
(mm: numa: clear numa hinting information on mprotect).

Xen PV guest page tables require that their entries use machine
addresses if the present bit (_PAGE_PRESENT) is set, and (for
successful migration) non-present PTEs must use pseudo-physical
addresses.  This is because on migration MFNs in present PTEs are
translated to PFNs (canonicalised) so they may be translated back to
the new MFN in the destination domain (uncanonicalised).

pte_mknonnuma(), pmd_mknonnuma(), pte_mknuma() and pmd_mknuma() set
and clear the _PAGE_PRESENT bit using pte_set_flags(),
pte_clear_flags(), etc.

In a Xen PV guest, these functions must translate MFNs to PFNs when
clearing _PAGE_PRESENT and translate PFNs to MFNs when setting
_PAGE_PRESENT.
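A minimal userspace model of why routing these helpers through the pv-aware accessors fixes the problem (the translation tables, masks and pteval layout here are invented for illustration; in the kernel, pte_val()/__pte() are the paravirt hooks that reach pte_mfn_to_pfn()/pte_pfn_to_mfn() on Xen):

```c
#include <assert.h>

typedef unsigned long pteval_t;
typedef struct { pteval_t pte; } pte_t;

#define _PAGE_PRESENT 0x001UL
#define PAGE_SHIFT    12
#define PTE_PFN_MASK  (~0xfffUL)  /* toy mask */

/* Toy P2M/M2P tables standing in for the real translation arrays. */
static const unsigned long p2m[] = { 5, 6, 7, 8 };
static const unsigned long m2p[] = { 0, 0, 0, 0, 0, 0, 1, 2, 3 };

static pteval_t set_frame(pteval_t val, unsigned long fn)
{
        return (val & ~PTE_PFN_MASK) | (fn << PAGE_SHIFT);
}

/* Modeled pv-ops: translate only when _PAGE_PRESENT is set. */
static pteval_t pte_val(pte_t p)        /* MFN -> PFN */
{
        pteval_t v = p.pte;

        if (v & _PAGE_PRESENT)
                v = set_frame(v, m2p[(v & PTE_PFN_MASK) >> PAGE_SHIFT]);
        return v;
}

static pte_t __pte(pteval_t v)          /* PFN -> MFN */
{
        if (v & _PAGE_PRESENT)
                v = set_frame(v, p2m[(v & PTE_PFN_MASK) >> PAGE_SHIFT]);
        return (pte_t){ v };
}

/*
 * The patched helper: it reads via pte_val() while the entry is still
 * present (MFN -> PFN) and writes via __pte() after _PAGE_PRESENT has
 * been cleared (no translation), so the stored frame number ends up as
 * a PFN, exactly as a non-present Xen PTE requires.
 */
static pte_t pte_clear_flags(pte_t pte, pteval_t clear)
{
        pteval_t v = pte_val(pte);

        return __pte(v & ~clear);
}
```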

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Cc: Steven Noonan <steven@uplinklabs.net>
Cc: Elena Ufimtseva <ufimtseva@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: <stable@vger.kernel.org>        [3.12+]
---
 arch/x86/include/asm/pgtable.h |   12 ++++++------
 1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index bbc8b12..323e5e2 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -174,16 +174,16 @@ static inline int has_transparent_hugepage(void)
 
 static inline pte_t pte_set_flags(pte_t pte, pteval_t set)
 {
-	pteval_t v = native_pte_val(pte);
+	pteval_t v = pte_val(pte);
 
-	return native_make_pte(v | set);
+	return __pte(v | set);
 }
 
 static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear)
 {
-	pteval_t v = native_pte_val(pte);
+	pteval_t v = pte_val(pte);
 
-	return native_make_pte(v & ~clear);
+	return __pte(v & ~clear);
 }
 
 static inline pte_t pte_mkclean(pte_t pte)
@@ -248,14 +248,14 @@ static inline pte_t pte_mkspecial(pte_t pte)
 
 static inline pmd_t pmd_set_flags(pmd_t pmd, pmdval_t set)
 {
-	pmdval_t v = native_pmd_val(pmd);
+	pmdval_t v = pmd_val(pmd);
 
 	return __pmd(v | set);
 }
 
 static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
 {
-	pmdval_t v = native_pmd_val(pmd);
+	pmdval_t v = pmd_val(pmd);
 
 	return __pmd(v & ~clear);
 }
-- 
1.7.2.5

* Re: [PATCH 2/2] x86: use pv-ops in {pte, pmd}_{set, clear}_flags()
  2014-03-21 18:18 ` [PATCH 2/2] x86: use pv-ops in {pte, pmd}_{set, clear}_flags() David Vrabel
@ 2014-03-24 11:28   ` David Vrabel
       [not found]     ` <CAKbGBLiVqaHEOZx6y4MW4xDTUdKRhVLZXTTGiqYT7vuH2Wgeww@mail.gmail.com>
  0 siblings, 1 reply; 16+ messages in thread
From: David Vrabel @ 2014-03-24 11:28 UTC (permalink / raw)
  To: David Vrabel
  Cc: Steven Noonan, Mel Gorman, Elena Ufimtseva, xen-devel, Boris Ostrovsky

On 21/03/14 18:18, David Vrabel wrote:
> Instead of using native functions to operate on the PTEs in
> pte_set_flags(), pte_clear_flags(), pmd_set_flags(), pmd_clear_flags()
> use the PV aware ones.

Looking at the history of these changes, the native functions were
introduced here to avoid PV ops for performance reasons.

I believe this can be fixed by making the numa-related PTE modifier
functions (pte_mknuma(), pte_mknonnuma()) use pte_modify() which is the
correct function to use when adjusting page protection.

I really do not understand how you're supposed to distinguish between a
PTE for a PROT_NONE page and one with _PAGE_NUMA -- they're identical.
i.e., pte_numa() will return true for a PROT_NONE protected page which
just seems wrong to me.
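The aliasing being described looks roughly like this (modeled on the x86 definitions of that era, where _PAGE_NUMA and _PAGE_PROTNONE share _PAGE_BIT_GLOBAL; a sketch, not the verbatim headers):

```c
#include <assert.h>

/* Both names map to the same bit on x86 at this point in time. */
#define _PAGE_BIT_GLOBAL   8
#define _PAGE_BIT_PROTNONE _PAGE_BIT_GLOBAL
#define _PAGE_BIT_NUMA     _PAGE_BIT_GLOBAL

#define _PAGE_PRESENT  (1UL << 0)
#define _PAGE_PROTNONE (1UL << _PAGE_BIT_PROTNONE)
#define _PAGE_NUMA     (1UL << _PAGE_BIT_NUMA)

/*
 * Modeled on the kernel's pte_numa(): true when _PAGE_NUMA is set and
 * _PAGE_PRESENT is clear -- which is exactly the bit pattern of a
 * PROT_NONE pte, so the two are indistinguishable from the pte alone.
 */
static int pte_numa_model(unsigned long pteval)
{
        return (pteval & (_PAGE_NUMA | _PAGE_PRESENT)) == _PAGE_NUMA;
}
```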

David

* Re: [PATCH 2/2] x86: use pv-ops in {pte,pmd}_{set,clear}_flags()
       [not found]     ` <CAKbGBLiVqaHEOZx6y4MW4xDTUdKRhVLZXTTGiqYT7vuH2Wgeww@mail.gmail.com>
@ 2014-03-25 20:16       ` Linus Torvalds
  2014-03-31 12:26         ` Mel Gorman
  0 siblings, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2014-03-25 20:16 UTC (permalink / raw)
  To: Steven Noonan, Mel Gorman, Rik van Riel
  Cc: David Vrabel, Andrew Morton, linux-mm

On Mon, Mar 24, 2014 at 8:31 AM, Steven Noonan <steven@uplinklabs.net> wrote:
> Vrabel's comments make me think that revisiting the elimination of the
> _PAGE_NUMA bit implementation would be a good idea... should I CC you
> on this discussion (not sure if you're subscribed to xen-devel, or if
> LKML is a better place for that discussion)?

I detest the PAGE_NUMA games myself, but:

From: David Vrabel <david.vrabel@citrix.com>:
>
> I really do not understand how you're supposed to distinguish between a
> PTE for a PROT_NONE page and one with _PAGE_NUMA -- they're identical.
> i.e., pte_numa() will return true for a PROT_NONE protected page which
> just seems wrong to me.

The way to distinguish PAGE_NUMA from PROTNONE is *supposed* to be by
looking at the vma, and PROTNONE goes together with a vma with
PROT_NONE. That's what the comments in pgtable_types.h say.

However, as far as I can tell, that is pure and utter bullshit.  It's
true that generally handle_mm_fault() shouldn't be called for
PROT_NONE pages, since it will fail the protection checks. However, we
have FOLL_FORCE that overrides those protection checks for things like
ptrace etc. So people have tried to convince me that _PAGE_NUMA works,
but I'm really not at all sure they are right.

I fundamentally think that it was a horrible horrible disaster to make
_PAGE_NUMA alias onto _PAGE_PROTNONE.

But I'm cc'ing the people who tried to convince me otherwise last time
around, to see if they can articulate this mess better now.

The argument *seems* to be that if things are truly PROT_NONE, then
the page will never be touched by page faulting code (and as
mentioned, I think that argument is fundamentally broken), and if it's
PROT_NUMA then the page faulting code will magically do the right
thing.

FURTHERMORE, the argument was that we can't just call things PROT_NONE
and just say that "those are the semantics", because on other
architectures PROT_NONE might mean/do something else.  Which really
makes no sense either, because if the argument was that PROT_NONE
causes faults that can either be handled as faults (for PROT_NONE) or
as NUMA issues (for NUMA), then dammit, that argument should be
completely architecture-independent.

But I gave up arguing with people. I will state (again) that I think
this is a f*cking mess, and saying that PROTNONE and NUMA are somehow
the exact same thing on x86 but not in general is bogus crap. And
saying that you can determine which it is from the vma is very
debatable too.

Let the people responsible for the crap try to explain why it works
and has to be that mess. Again. Rik, Mel?

             Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

* Re: [RFC PATCH 0/2] x86: fix Xen PV regression caused by NUMA page migration
  2014-03-21 18:18 [RFC PATCH 0/2] x86: fix Xen PV regression caused by NUMA page migration David Vrabel
  2014-03-21 18:18 ` [PATCH 1/2] Revert "xen: properly account for _PAGE_NUMA during xen pte translations" David Vrabel
  2014-03-21 18:18 ` [PATCH 2/2] x86: use pv-ops in {pte, pmd}_{set, clear}_flags() David Vrabel
@ 2014-03-26 19:10 ` David Vrabel
  2014-04-15  8:24 ` David Sutton
  3 siblings, 0 replies; 16+ messages in thread
From: David Vrabel @ 2014-03-26 19:10 UTC (permalink / raw)
  To: David Vrabel
  Cc: Boris Ostrovsky, xen-devel, Steven Noonan, Mel Gorman, Elena Ufimtseva

[-- Attachment #1: Type: text/plain, Size: 815 bytes --]

On 21/03/14 18:18, David Vrabel wrote:
> This series should properly fix the Xen PV guest regression introduced
> by 1667918b6483 (mm: numa: clear numa hinting information on
> mprotect).  The previous fix a9c8e4beeeb6 (xen: properly account for
> _PAGE_NUMA during xen pte translations) breaks save/restore
> (migration) and needs to be reverted.
> 
> I've only given this series a minimal amount of testing and would
> appreciate testing by someone who experienced/reproduced the original
> regression.

Attached is a simple test program that triggers the bug.

Patch #2 fixes the regression and I think that it is the correct fix but
I need to confirm it doesn't cause any measurable performance problems
on native x86 (I think it should not).

I don't have any more time to work on this until next week.

David

[-- Attachment #2: mprotect.c --]
[-- Type: text/x-csrc, Size: 1407 bytes --]

#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096
#define MAGIC 0x79128347ul

int main(void)
{
    void *p;
    volatile unsigned long *m;
    int ret;

    /* Map a frame for the test. */
    m = p = mmap(NULL, PAGE_SIZE, PROT_WRITE|PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        exit(1);
    }

    /* Write to the frame. */
    *m = MAGIC;

    /* Set PROT_NONE on the mapping. */
    ret = mprotect(p, PAGE_SIZE, PROT_NONE);
    if (ret < 0) {
        perror("mprotect");
        exit(1);
    }

    /*
     * Set PROT_READ (or any other non-PROT_NONE protection) on the
     * mapping.
     *
     * On a buggy system pte_mknonnuma() will clear _PAGE_PROTNONE (==
     * _PAGE_NUMA) and set _PAGE_PRESENT /without/ doing a PFN to MFN
     * conversion.
     *
     * The resulting mapping will be to the wrong MFN.
     */
    ret = mprotect(p, PAGE_SIZE, PROT_READ);
    if (ret < 0) {
        perror("mprotect");
        exit(1);
    }

    /* Check we're still mapping the original frame. */
    if (*m != MAGIC) {
        printf("FAIL (0x%lx != 0x%lx)\n", *m, MAGIC);
        ret = 1;
    }

    /*
     * On a buggy system, the munmap() will usually trigger a "Bad
     * page" error since the mapping isn't for the original page.
     */
    munmap(p, PAGE_SIZE);

    return ret;
}

[-- Attachment #3: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

* Re: [PATCH 2/2] x86: use pv-ops in {pte,pmd}_{set,clear}_flags()
  2014-03-25 20:16       ` [PATCH 2/2] x86: use pv-ops in {pte,pmd}_{set,clear}_flags() Linus Torvalds
@ 2014-03-31 12:26         ` Mel Gorman
  2014-03-31 15:41           ` Linus Torvalds
  0 siblings, 1 reply; 16+ messages in thread
From: Mel Gorman @ 2014-03-31 12:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Noonan, Rik van Riel, David Vrabel, Andrew Morton,
	Ingo Molnar, Peter Zijlstra, linux-mm

On Tue, Mar 25, 2014 at 01:16:02PM -0700, Linus Torvalds wrote:
> On Mon, Mar 24, 2014 at 8:31 AM, Steven Noonan <steven@uplinklabs.net> wrote:
> > Vrabel's comments make me think that revisiting the elimination of the
> > _PAGE_NUMA bit implementation would be a good idea... should I CC you
> > on this discussion (not sure if you're subscribed to xen-devel, or if
> > LKML is a better place for that discussion)?
> 
> I detest the PAGE_NUMA games myself, but:
> 

First of all, sorry for the slow response even by my standards. I was at
LSF/MM and Collaboration all last week and it took up all my attention. Today
is my first day back properly online and trawling through the inbox mess.

> From: David Vrabel <david.vrabel@citrix.com>:
> >
> > I really do not understand how you're supposed to distinguish between a
> > PTE for a PROT_NONE page and one with _PAGE_NUMA -- they're identical.
> > i.e., pte_numa() will return true for a PROT_NONE protected page which
> > just seems wrong to me.
> 
> The way to distinguish PAGE_NUMA from PROTNONE is *supposed* to be by
> looking at the vma, and PROTNONE goes together with a vma with
> PROT_NONE. That's what the comments in pgtable_types.h say.
> 

This is the expectation. We did not want to even attempt tracking NUMA
hints on a per-VMA basis because the fault handler would go to hell with
the need to fixup vmas.

> However, as far as I can tell, that is pure and utter bullshit.  It's
> true that generally handle_mm_fault() shouldn't be called for
> PROT_NONE pages, since it will fail the protection checks. However, we
> have FOLL_FORCE that overrides those protection checks for things like
> ptrace etc. So people have tried to convince me that _PAGE_NUMA works,
> but I'm really not at all sure they are right.
> 

For FOLL_FORCE, we do not set FOLL_NUMA in this chunk here

        /*
         * If FOLL_FORCE and FOLL_NUMA are both set, handle_mm_fault
         * would be called on PROT_NONE ranges. We must never invoke
         * handle_mm_fault on PROT_NONE ranges or the NUMA hinting
         * page faults would unprotect the PROT_NONE ranges if
         * _PAGE_NUMA and _PAGE_PROTNONE are sharing the same pte/pmd
         * bitflag. So to avoid that, don't set FOLL_NUMA if
         * FOLL_FORCE is set.
         */
        if (!(gup_flags & FOLL_FORCE))
                gup_flags |= FOLL_NUMA;

Without FOLL_NUMA, we do not do "pmd_numa" checks because they cannot
distinguish between a prot_none and pmd_numa as they use identical bits
on x86. This is in follow_page_mask

        if ((flags & FOLL_NUMA) && pmd_numa(*pmd))
                goto no_page_table;

Without the checks FOLL_FORCE would screw up when it encountered a page
protected for NUMA hinting faults. I recognise that it further muddies
the waters on what _PAGE_NUMA actually means.

A potential alternative would have been to have two pte bits -- _PAGE_NONE
and an unused PTE bit (if there is one) that we'd call _PAGE_NUMA, where
pmd_mknuma sets both. The _PAGE_NONE is what would cause a hinting fault
but we'd use the second bit to distinguish between PROT_NONE and a NUMA
hinting fault. I doubt the end result would be much cleaner though and
it would be a mess.
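The two-bit scheme could look something like this (bit values are hypothetical; _PAGE_NONE would trigger the fault while the spare bit disambiguates a hinting pte from a real PROT_NONE one):

```c
#include <assert.h>

/* Hypothetical bit assignments for the sketch. */
#define _PAGE_NONE      0x100UL  /* causes the fault, like _PAGE_PROTNONE */
#define _PAGE_NUMA_HINT 0x200UL  /* hypothetical spare software bit */

/* pte_mknuma() would set both bits. */
static unsigned long pte_mknuma_2bit(unsigned long pteval)
{
        return pteval | _PAGE_NONE | _PAGE_NUMA_HINT;
}

/* A NUMA hinting pte has both bits; PROT_NONE has only _PAGE_NONE. */
static int pte_numa_2bit(unsigned long pteval)
{
        return (pteval & (_PAGE_NONE | _PAGE_NUMA_HINT)) ==
               (_PAGE_NONE | _PAGE_NUMA_HINT);
}

static int pte_protnone_2bit(unsigned long pteval)
{
        return (pteval & (_PAGE_NONE | _PAGE_NUMA_HINT)) == _PAGE_NONE;
}
```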

Another alternative is to simply not allow NUMA_BALANCING on Xen. It's not
even clear what it means as the Xen NUMA topology may or may not correspond
to underlying physical nodes. It's even less clear what happens if both
guest and host use automatic balancing.

> I fundamentally think that it was a horrible horrible disaster to make
> _PAGE_NUMA alias onto _PAGE_PROTNONE.
> 

We did not have much of a choice. We needed something that would trap a
fault and _PAGE_PROTNONE is not available on all architectures. ppc64
reused _PAGE_COHERENT for example.

> But I'm cc'ing the people who tried to convince me otherwise last time
> around, to see if they can articulate this mess better now.
> 
> The argument *seems* to be that if things are truly PROT_NONE, then
> the page will never be touched by page faulting code (and as
> mentioned, I think that argument is fundamentally broken), and if it's
> PROT_NUMA then the page faulting code will magically do the right
> thing.
> 

This is essentially the argument with the addendum that follow_page is
meant to avoid trying pmd_numa checks on FOLL_FORCE.

> FURTHERMORE, the argument was that we can't just call things PROT_NONE
> and just say that "those are the semantics", because on other
> architectures PROT_NONE might mean/do something else.

Or that the equivalent of _PAGE_PROTNONE did not exist and was
implemented by some other means.

> Which really
> makes no sense either, because if the argument was that PROT_NONE
> causes faults that can either be handled as faults (for PROT_NONE) or
> as NUMA issues (for NUMA), then dammit, that argument should be
> completely architecture-independent.
> 
> But I gave up arguing with people. I will state (again) that I think
> this is a f*cking mess, and saying that PROTNONE and NUMA are somehow
> the exact same thing on x86 but not in general is bogus crap. And
> saying that you can determine which it is from the vma is very
> debatable too.
> 

Ok, so how do you suggest that _PAGE_NUMA could have been implemented
that did *not* use _PAGE_PROTNONE on x86, trapped a fault and was not
expensive as hell to handle?

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 2/2] x86: use pv-ops in {pte,pmd}_{set,clear}_flags()
  2014-03-31 12:26         ` Mel Gorman
@ 2014-03-31 15:41           ` Linus Torvalds
  2014-03-31 16:10             ` Linus Torvalds
  2014-04-01 18:18             ` David Vrabel
  0 siblings, 2 replies; 16+ messages in thread
From: Linus Torvalds @ 2014-03-31 15:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Steven Noonan, Rik van Riel, David Vrabel, Andrew Morton,
	Ingo Molnar, Peter Zijlstra, linux-mm

On Mon, Mar 31, 2014 at 5:26 AM, Mel Gorman <mgorman@suse.de> wrote:
>
> Ok, so how do you suggest that _PAGE_NUMA could have been implemented
> that did *not* use _PAGE_PROTNONE on x86, trapped a fault and was not
> expensive as hell to handle?

So on x86, the obvious model is to use another bit. We've got several.
The _PAGE_NUMA case only matters for when _PAGE_PRESENT is clear, and
when that bit is clear the hardware doesn't care about any of the
other bits. Currently we use:

  #define _PAGE_BIT_PROTNONE      _PAGE_BIT_GLOBAL
  #define _PAGE_BIT_FILE          _PAGE_BIT_DIRTY

which are bits 8 and 6 respectively, afaik.

and the only rule is that (a) we should *not* use a bit we already use
when the page is not present (since that is ambiguous!) and (b) we
should *not* use a bit that is used by the swap index cases. I think
bit 7 should work, but maybe I missed something.

Can somebody tell me why _PAGE_NUMA is *not* that bit seven? Make
"pte_present()" on x86 just check all of the present/numa/protnone
bits, and if any of them is set, it's a "present" page.

Now, unlike x86, some other architectures do *not* have free bits, so
there may be problems elsewhere.
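A sketch of what that suggestion amounts to (bit 7 for _PAGE_NUMA is the proposal under discussion, not merged code; the other bit numbers match the definitions quoted above):

```c
#include <assert.h>

#define _PAGE_BIT_PRESENT  0
#define _PAGE_BIT_DIRTY    6
#define _PAGE_BIT_NUMA     7   /* proposed: unused when P=0, not a swap bit */
#define _PAGE_BIT_GLOBAL   8

#define _PAGE_PRESENT   (1UL << _PAGE_BIT_PRESENT)
#define _PAGE_NUMA      (1UL << _PAGE_BIT_NUMA)
#define _PAGE_PROTNONE  (1UL << _PAGE_BIT_GLOBAL)

/* "present" if any of present/numa/protnone is set. */
static int pte_present_sketch(unsigned long pteval)
{
        return !!(pteval & (_PAGE_PRESENT | _PAGE_NUMA | _PAGE_PROTNONE));
}

/* With a dedicated bit, numa and protnone no longer alias. */
static int pte_numa_sketch(unsigned long pteval)
{
        return (pteval & (_PAGE_NUMA | _PAGE_PRESENT)) == _PAGE_NUMA;
}
```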

            Linus


* Re: [PATCH 2/2] x86: use pv-ops in {pte,pmd}_{set,clear}_flags()
  2014-03-31 15:41           ` Linus Torvalds
@ 2014-03-31 16:10             ` Linus Torvalds
  2014-03-31 16:27               ` Cyrill Gorcunov
  2014-04-01 18:18             ` David Vrabel
  1 sibling, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2014-03-31 16:10 UTC (permalink / raw)
  To: Mel Gorman, Peter Anvin, Ingo Molnar, the arch/x86 maintainers
  Cc: Steven Noonan, Rik van Riel, David Vrabel, Andrew Morton,
	Peter Zijlstra, linux-mm, Cyrill Gorcunov

[ Adding x86 maintainers - Ingo was involved earlier, make it more explicit ]

On Mon, Mar 31, 2014 at 8:41 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So on x86, the obvious model is to use another bit. We've got several.
> The _PAGE_NUMA case only matters for when _PAGE_PRESENT is clear, and
> when that bit is clear the hardware doesn't care about any of the
> other bits. Currently we use:
>
>   #define _PAGE_BIT_PROTNONE      _PAGE_BIT_GLOBAL
>   #define _PAGE_BIT_FILE          _PAGE_BIT_DIRTY
>
> which are bits 8 and 6 respectively, afaik.

Side note to the x86 guys: I think it was a mistake (long long long
ago) to define these "valid when not present" bits in terms of the
"valid when present" bits.

It causes insane situations like this:

  #if _PAGE_BIT_FILE < _PAGE_BIT_PROTNONE

which makes no sense *except* if you think that those bits can have
random odd hardware-defined values. But they really can't. They are
just random bit numbers we chose.

It has *also* caused horrible pain with the whole "soft dirty" thing,
and we have absolutely ridiculous macros in pgtable-2level.h for the
insane soft-dirty case, trying to keep the swap bits spread out "just
right" to make soft-dirty (_PAGE_BIT_HIDDEN aka bit 11) not alias with
the bits we use for swap offsets etc.

So how about we just say:

 - define the bits we use when PAGE_PRESENT==0 separately and explicitly

 - clean up the crazy soft-dirty crap, preferably by just making it
depend on a 64-bit pte (so you have to have X86_PAE enabled or be on
x86-64)

that would sound like a good cleanup, because right now it's a
complete nightmare to think about which bits are used how when P is 0.
The above insane #if being the prime example of that confusion.

             Linus


* Re: [PATCH 2/2] x86: use pv-ops in {pte,pmd}_{set,clear}_flags()
  2014-03-31 16:10             ` Linus Torvalds
@ 2014-03-31 16:27               ` Cyrill Gorcunov
  0 siblings, 0 replies; 16+ messages in thread
From: Cyrill Gorcunov @ 2014-03-31 16:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mel Gorman, Peter Anvin, Ingo Molnar, the arch/x86 maintainers,
	Steven Noonan, Rik van Riel, David Vrabel, Andrew Morton,
	Peter Zijlstra, linux-mm

On Mon, Mar 31, 2014 at 09:10:07AM -0700, Linus Torvalds wrote:
> 
> It causes insane situations like this:
> 
>   #if _PAGE_BIT_FILE < _PAGE_BIT_PROTNONE
> 
> which makes no sense *except* if you think that those bits can have
> random odd hardware-defined values. But they really can't. They are
> just random bit numbers we chose.

I never understood this ifdef (I asked once but got no reply).

> It has *also* caused horrible pain with the whole "soft dirty" thing,
> and we have absolutely ridiculous macros in pgtable-2level.h for the
> insane soft-dirty case, trying to keep the swap bits spread out "just
> right" to make soft-dirty (_PAGE_BIT_HIDDEN aka bit 11) not alias with
> the bits we use for swap offsets etc.
> 
> So how about we just say:
> 
>  - define the bits we use when PAGE_PRESENT==0 separately and explicitly
> 
>  - clean up the crazy soft-dirty crap, preferably by just making it
> depend on a 64-bit pte (so you have to have X86_PAE enabled or be on
> x86-64)

Sounds good to me, I'll try my best (if no one objects).

> 
> that would sound like a good cleanup, because right now it's a
> complete nightmare to think about which bits are used how when P is 0.
> The above insane #if being the prime example of that confusion.

	Cyrill


* Re: [PATCH 2/2] x86: use pv-ops in {pte,pmd}_{set,clear}_flags()
  2014-03-31 15:41           ` Linus Torvalds
  2014-03-31 16:10             ` Linus Torvalds
@ 2014-04-01 18:18             ` David Vrabel
  2014-04-01 18:43               ` Linus Torvalds
  1 sibling, 1 reply; 16+ messages in thread
From: David Vrabel @ 2014-04-01 18:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mel Gorman, Steven Noonan, Rik van Riel, Andrew Morton,
	Ingo Molnar, Peter Zijlstra, linux-mm

On 31/03/14 16:41, Linus Torvalds wrote:
> On Mon, Mar 31, 2014 at 5:26 AM, Mel Gorman <mgorman@suse.de> wrote:
>>
>> Ok, so how do you suggest that _PAGE_NUMA could have been implemented
>> that did *not* use _PAGE_PROTNONE on x86, trapped a fault and was not
>> expensive as hell to handle?
> 
> So on x86, the obvious model is to use another bit. We've got several.
> The _PAGE_NUMA case only matters for when _PAGE_PRESENT is clear, and
> when that bit is clear the hardware doesn't care about any of the
> other bits. Currently we use:
> 
>   #define _PAGE_BIT_PROTNONE      _PAGE_BIT_GLOBAL
>   #define _PAGE_BIT_FILE          _PAGE_BIT_DIRTY
> 
> which are bits 8 and 6 respectively, afaik.
> 
> and the only rule is that (a) we should *not* use a bit we already use
> when the page is not present (since that is ambiguous!) and (b) we
> should *not* use a bit that is used by the swap index cases. I think
> bit 7 should work, but maybe I missed something.

I don't think it's sufficient to avoid collisions with bits used only
with P=0.  The original value of this bit must be retained when the
_PAGE_NUMA bit is set/cleared.

Bit 7 is PAT[2] and whilst Linux currently sets up the PAT such that
PAT[2] is a 'don't care', there has been talk of adjusting the PAT to
include more types. So I'm not sure it's a good idea to use bit 7.

What's wrong with using e.g., bit 62? And not supporting this NUMA
rebalancing feature on 32-bit non-PAE builds?

David

> Can somebody tell me why _PAGE_NUMA is *not* that bit seven? Make
> "pte_present()" on x86 just check all of the present/numa/protnone
> bits, and if any of them is set, it's a "present" page.
> 
> Now, unlike x86, some other architectures do *not* have free bits, so
> there may be problems elsewhere.
> 
>             Linus


* Re: [PATCH 2/2] x86: use pv-ops in {pte,pmd}_{set,clear}_flags()
  2014-04-01 18:18             ` David Vrabel
@ 2014-04-01 18:43               ` Linus Torvalds
  2014-04-01 19:03                 ` Cyrill Gorcunov
  0 siblings, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2014-04-01 18:43 UTC (permalink / raw)
  To: David Vrabel
  Cc: Mel Gorman, Steven Noonan, Rik van Riel, Andrew Morton,
	Ingo Molnar, Peter Zijlstra, linux-mm

On Tue, Apr 1, 2014 at 11:18 AM, David Vrabel <david.vrabel@citrix.com> wrote:
>
> I don't think it's sufficient to avoid collisions with bits used only
> with P=0.  The original value of this bit must be retained when the
> _PAGE_NUMA bit is set/cleared.
>
> Bit 7 is PAT[2] and whilst Linux currently sets up the PAT such that
> PAT[2] is a 'don't care', there has been talk of adjusting the PAT to
> include more types. So I'm not sure it's a good idea to use bit 7.
>
> What's wrong with using e.g., bit 62? And not supporting this NUMA
> rebalancing feature on 32-bit non-PAE builds?

Sounds good to me, but it's not available in 32-bit PAE. The high bits
are all reserved, afaik.

But you'd have to be insane to care about NUMA balancing on 32-bit,
even with PAE. So restricting it to x86-64 and using the high bits (I
think bits 52-62 are all available to SW) sounds fine to me.

Same goes for soft-dirty. I think it's fine if we say that you won't
have soft-dirty with a 32-bit kernel. Even with PAE.

                Linus


* Re: [PATCH 2/2] x86: use pv-ops in {pte,pmd}_{set,clear}_flags()
  2014-04-01 18:43               ` Linus Torvalds
@ 2014-04-01 19:03                 ` Cyrill Gorcunov
  2014-04-02 11:33                   ` Pavel Emelyanov
  0 siblings, 1 reply; 16+ messages in thread
From: Cyrill Gorcunov @ 2014-04-01 19:03 UTC (permalink / raw)
  To: Linus Torvalds, Pavel Emelyanov
  Cc: David Vrabel, Mel Gorman, Steven Noonan, Rik van Riel,
	Andrew Morton, Ingo Molnar, Peter Zijlstra, linux-mm

On Tue, Apr 01, 2014 at 11:43:11AM -0700, Linus Torvalds wrote:
> On Tue, Apr 1, 2014 at 11:18 AM, David Vrabel <david.vrabel@citrix.com> wrote:
> >
> > I don't think it's sufficient to avoid collisions with bits used only
> > with P=0.  The original value of this bit must be retained when the
> > _PAGE_NUMA bit is set/cleared.
> >
> > Bit 7 is PAT[2] and whilst Linux currently sets up the PAT such that
> > PAT[2] is a 'don't care', there has been talk of adjusting the PAT to
> > include more types. So I'm not sure it's a good idea to use bit 7.
> >
> > What's wrong with using e.g., bit 62? And not supporting this NUMA
> > rebalancing feature on 32-bit non-PAE builds?
> 
> Sounds good to me, but it's not available in 32-bit PAE. The high bits
> are all reserved, afaik.
> 
> But you'd have to be insane to care about NUMA balancing on 32-bit,
> even with PAE. So restricting it to x86-64 and using the high bits (I
> think bits 52-62 are all available to SW) sounds fine to me.
> 
> Same goes for soft-dirty. I think it's fine if we say that you won't
> have soft-dirty with a 32-bit kernel. Even with PAE.

Well, at the moment we use soft-dirty only on x86-64 in criu, but there
were plans to implement complete 32-bit support as well. While personally
I don't mind dropping soft-dirty for the non-x86-64 case, I would like
to hear Pavel's opinion. Pavel?

(N.B., I'm still working on cleaning up the _PAGE bits; it has turned
 out to be harder than I expected.)


* Re: [PATCH 2/2] x86: use pv-ops in {pte,pmd}_{set,clear}_flags()
  2014-04-01 19:03                 ` Cyrill Gorcunov
@ 2014-04-02 11:33                   ` Pavel Emelyanov
  2014-04-02 13:29                     ` Cyrill Gorcunov
  0 siblings, 1 reply; 16+ messages in thread
From: Pavel Emelyanov @ 2014-04-02 11:33 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Linus Torvalds, David Vrabel, Mel Gorman, Steven Noonan,
	Rik van Riel, Andrew Morton, Ingo Molnar, Peter Zijlstra,
	linux-mm

On 04/01/2014 11:03 PM, Cyrill Gorcunov wrote:
> On Tue, Apr 01, 2014 at 11:43:11AM -0700, Linus Torvalds wrote:
>> On Tue, Apr 1, 2014 at 11:18 AM, David Vrabel <david.vrabel@citrix.com> wrote:
>>>
>>> I don't think it's sufficient to avoid collisions with bits used only
>>> with P=0.  The original value of this bit must be retained when the
>>> _PAGE_NUMA bit is set/cleared.
>>>
>>> Bit 7 is PAT[2] and whilst Linux currently sets up the PAT such that
>>> PAT[2] is a 'don't care', there has been talk of adjusting the PAT to
>>> include more types. So I'm not sure it's a good idea to use bit 7.
>>>
>>> What's wrong with using e.g., bit 62? And not supporting this NUMA
>>> rebalancing feature on 32-bit non-PAE builds?
>>
>> Sounds good to me, but it's not available in 32-bit PAE. The high bits
>> are all reserved, afaik.
>>
>> But you'd have to be insane to care about NUMA balancing on 32-bit,
>> even with PAE. So restricting it to x86-64 and using the high bits (I
>> think bits 52-62 are all available to SW) sounds fine to me.
>>
>> Same goes for soft-dirty. I think it's fine if we say that you won't
>> have soft-dirty with a 32-bit kernel. Even with PAE.
> 
> Well, at the moment we use soft-dirty only on x86-64 in criu, but there
> were plans to implement complete 32-bit support as well. While personally
> I don't mind dropping soft-dirty for the non-x86-64 case, I would like
> to hear Pavel's opinion. Pavel?

We (Parallels) don't have plans for C/R on 32-bit kernels, but I speak only
for Parallels. However, the people I know who need 32-bit C/R use ARM :)

> (N.B., I'm still working on cleaning up the _PAGE bits; it has turned
>  out to be harder than I expected.)
> .

Thanks,
Pavel


* Re: [PATCH 2/2] x86: use pv-ops in {pte,pmd}_{set,clear}_flags()
  2014-04-02 11:33                   ` Pavel Emelyanov
@ 2014-04-02 13:29                     ` Cyrill Gorcunov
  0 siblings, 0 replies; 16+ messages in thread
From: Cyrill Gorcunov @ 2014-04-02 13:29 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Linus Torvalds, David Vrabel, Mel Gorman, Steven Noonan,
	Rik van Riel, Andrew Morton, Ingo Molnar, Peter Zijlstra,
	linux-mm

On Wed, Apr 02, 2014 at 03:33:48PM +0400, Pavel Emelyanov wrote:
...
> >>
> >> But you'd have to be insane to care about NUMA balancing on 32-bit,
> >> even with PAE. So restricting it to x86-64 and using the high bits (I
> >> think bits 52-62 are all available to SW) sounds fine to me.
> >>
> >> Same goes for soft-dirty. I think it's fine if we say that you won't
> >> have soft-dirty with a 32-bit kernel. Even with PAE.
> > 
> > Well, at the moment we use soft-dirty only on x86-64 in criu, but there
> > were plans to implement complete 32-bit support as well. While personally
> > I don't mind dropping soft-dirty for the non-x86-64 case, I would like
> > to hear Pavel's opinion. Pavel?
> 
> We (Parallels) don't have plans for C/R on 32-bit kernels, but I speak only
> for Parallels. However, the people I know who need 32-bit C/R use ARM :)

OK, since it's x86-specific I can prepare a patch dropping soft-dirty on
x86-32 (this will relieve the ugly macros in the file-mapping code a bit,
though not that significantly).

Guys, while looking into how to redefine the _PAGE bits for the case where
the present bit is clear, I thought about a form like

#define _PAGE_BIT_FILE		(_PAGE_BIT_PRESENT + 1)	/* _PAGE_BIT_RW */
#define _PAGE_BIT_NUMA		(_PAGE_BIT_PRESENT + 2)	/* _PAGE_BIT_USER */
#define _PAGE_BIT_PROTNONE	(_PAGE_BIT_PRESENT + 3)	/* _PAGE_BIT_PWT */

and while the _PAGE_BIT_FILE case should work (as well as swap pages), I'm not
that sure about the numa and protnone cases. I fear there are some code paths
which depend on the former bit positions -- i.e., when

	_PAGE_BIT_PROTNONE = _PAGE_BIT_NUMA = _PAGE_BIT_GLOBAL.

One of the _PAGE_BIT_GLOBAL users is the page attributes code. It seems to
always check _PAGE_BIT_PRESENT together with _PAGE_BIT_GLOBAL, so if
_PAGE_BIT_PROTNONE gets redefined to a new value it should not fail. Thus the
main concern is the protnone + numa code, which I must admit I don't know well
enough yet.


* Re: [RFC PATCH 0/2] x86: fix Xen PV regression caused by NUMA page migration
  2014-03-21 18:18 [RFC PATCH 0/2] x86: fix Xen PV regression caused by NUMA page migration David Vrabel
                   ` (2 preceding siblings ...)
  2014-03-26 19:10 ` [RFC PATCH 0/2] x86: fix Xen PV regression caused by NUMA page migration David Vrabel
@ 2014-04-15  8:24 ` David Sutton
  3 siblings, 0 replies; 16+ messages in thread
From: David Sutton @ 2014-04-15  8:24 UTC (permalink / raw)
  To: xen-devel

David,

David Vrabel <david.vrabel <at> citrix.com> writes:

> 
> This series should properly fix the Xen PV guest regression introduced
> by 1667918b6483 (mm: numa: clear numa hinting information on
> mprotect).  The previous fix a9c8e4beeeb6 (xen: properly account for
> _PAGE_NUMA during xen pte translations) breaks save/restore
> (migration) and needs to be reverted.
> 
> I've only given this series a minimal amount of testing and would
> appreciate testing by someone who experienced/reproduced the original
> regression.
>
I just upgraded to the Arch Linux linux 3.14.1-1 packages and they started to
show the same symptoms as before the original patch was applied (the key one
I noticed being major instability running Firefox under dom0). I then applied
the 2nd patch in this set and recompiled; initial testing (doing things
which would almost certainly cause the issue to trigger, such as viewing a
YouTube video through Firefox) indicates that the patch is working, as I
haven't been able to trigger the bug yet.
> 
> David
> 
Regards,

  David


end of thread, other threads:[~2014-04-15  8:24 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-03-21 18:18 [RFC PATCH 0/2] x86: fix Xen PV regression caused by NUMA page migration David Vrabel
2014-03-21 18:18 ` [PATCH 1/2] Revert "xen: properly account for _PAGE_NUMA during xen pte translations" David Vrabel
2014-03-21 18:18 ` [PATCH 2/2] x86: use pv-ops in {pte, pmd}_{set, clear}_flags() David Vrabel
2014-03-24 11:28   ` David Vrabel
     [not found]     ` <CAKbGBLiVqaHEOZx6y4MW4xDTUdKRhVLZXTTGiqYT7vuH2Wgeww@mail.gmail.com>
2014-03-25 20:16       ` [PATCH 2/2] x86: use pv-ops in {pte,pmd}_{set,clear}_flags() Linus Torvalds
2014-03-31 12:26         ` Mel Gorman
2014-03-31 15:41           ` Linus Torvalds
2014-03-31 16:10             ` Linus Torvalds
2014-03-31 16:27               ` Cyrill Gorcunov
2014-04-01 18:18             ` David Vrabel
2014-04-01 18:43               ` Linus Torvalds
2014-04-01 19:03                 ` Cyrill Gorcunov
2014-04-02 11:33                   ` Pavel Emelyanov
2014-04-02 13:29                     ` Cyrill Gorcunov
2014-03-26 19:10 ` [RFC PATCH 0/2] x86: fix Xen PV regression caused by NUMA page migration David Vrabel
2014-04-15  8:24 ` David Sutton
