All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ingo Molnar <mingo@kernel.org>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: mceier@gmail.com, Davidlohr Bueso <dave@stgolabs.net>,
	kernel test robot <rong.a.chen@intel.com>,
	Davidlohr Bueso <dbueso@suse.de>,
	Thomas Gleixner <tglx@linutronix.de>,
	Peter Zijlstra <peterz@infradead.org>,
	Borislav Petkov <bp@alien8.de>,
	LKML <linux-kernel@vger.kernel.org>,
	lkp@lists.01.org, "Kenneth R. Crudup" <kenny@panix.com>
Subject: [PATCH] x86/pat: Fix off-by-one bugs in interval tree search
Date: Sun, 1 Dec 2019 15:49:47 +0100	[thread overview]
Message-ID: <20191201144947.GA4167@gmail.com> (raw)
In-Reply-To: <20191201104624.GA51279@gmail.com>


* Ingo Molnar <mingo@kernel.org> wrote:

> * Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > But the final difference is a real difference where it used to be WC,
> > and is now UC-:
> > 
> >     -write-combining @ 0x2000000000-0x2100000000
> >     -write-combining @ 0x2000000000-0x2100000000
> >     +uncached-minus @ 0x2000000000-0x2100000000
> >     +uncached-minus @ 0x2000000000-0x2100000000
> > 
> > which certainly could easily explain the huge performance degradation.

> It's not an unconditional regression, as both Boris and me tried to 
> reproduce it on different systems that do ioremap_wc() as well and didn't 
> measure a slowdown, but something about the memory layout probably 
> triggers the tree management bug.

Ok, I think I found at least one bug in the new PAT code, the conversion 
of memtype_check_conflict() is buggy I think:

   8d04a5f97a5f: ("x86/mm/pat: Convert the PAT tree to a generic interval tree")

        dprintk("Overlap at 0x%Lx-0x%Lx\n", match->start, match->end);
        found_type = match->type;
 
-       node = rb_next(&match->rb);
-       while (node) {
-               match = rb_entry(node, struct memtype, rb);
-
-               if (match->start >= end) /* Checked all possible matches */
-                       goto success;
-
-               if (is_node_overlap(match, start, end) &&
-                   match->type != found_type) {
+       match = memtype_interval_iter_next(match, start, end);
+       while (match) {
+               if (match->type != found_type)
                        goto failure;
-               }
 
-               node = rb_next(&match->rb);
+               match = memtype_interval_iter_next(match, start, end);
        }


Note how the '>= end' condition to end the interval check, got converted 
into:

+       match = memtype_interval_iter_next(match, start, end);

This is subtly off by one, because the interval trees interfaces require 
closed interval parameters:

  include/linux/interval_tree_generic.h

 /*                                                                            \
  * Iterate over intervals intersecting [start;last]                           \
  *                                                                            \
  * Note that a node's interval intersects [start;last] iff:                   \
  *   Cond1: ITSTART(node) <= last                                             \
  * and                                                                        \
  *   Cond2: start <= ITLAST(node)                                             \
  */                                                                           \

  ...

                if (ITSTART(node) <= last) {            /* Cond1 */           \
                        if (start <= ITLAST(node))      /* Cond2 */           \
                                return node;    /* node is leftmost match */  \

[start;last] is a closed interval (note that '<= last' check) - while the 
PAT 'end' parameter is 1 byte beyond the end of the range, because 
ioremap() and the other mapping APIs usually use the [start,end) 
half-open interval, derived from 'size'.

This is what ioremap() does for example:

        /*
         * Mappings have to be page-aligned
         */
        offset = phys_addr & ~PAGE_MASK;
        phys_addr &= PHYSICAL_PAGE_MASK;
        size = PAGE_ALIGN(last_addr+1) - phys_addr;

        retval = reserve_memtype(phys_addr, (u64)phys_addr + size,
                                                pcm, &new_pcm);


phys_addr+size will be on a page boundary, after the last byte of the 
mapped interval.

So the correct parameter to use in the interval tree searches is not 
'end' but 'end-1'.

This could have relevance if conflicting PAT ranges are exactly adjacent, 
for example a future WC region is followed immediately by an already 
mapped UC- region - in this case memtype_check_conflict() would 
incorrectly deny the WC memtype region and downgrade the memtype to UC-.

BTW., rather annoyingly this downgrading is done silently in 
memtype_check_insert():

int memtype_check_insert(struct memtype *new,
                         enum page_cache_mode *ret_type)
{
        int err = 0;

        err = memtype_check_conflict(new->start, new->end, new->type, ret_type);
        if (err)
                return err;

        if (ret_type)
                new->type = *ret_type;

        memtype_interval_insert(new, &memtype_rbroot);
        return 0;
}


So on such a conflict we'd just silently get UC- in *ret_type, and write 
it into the new region, never the wiser ...

So assuming that the patch below fixes the primary bug the diagnostics 
side of ioremap() cache attribute downgrades would be another thing to 
fix.

Anyway, I checked all the interval-tree iterations, and most of them are 
off by one - but I think the one related to memtype_check_conflict() is 
the one causing this particular performance regression.

The only correct interval-tree searches were these two:

arch/x86/mm/pat_interval.c:     match = memtype_interval_iter_first(&memtype_rbroot, 0, ULONG_MAX);
arch/x86/mm/pat_interval.c:             match = memtype_interval_iter_next(match, 0, ULONG_MAX);

The ULONG_MAX was hiding the off-by-one in plain sight. :-)

So it would be nice if everyone who is seeing this bug could test the 
patch below against Linus's latest tree - does it fix the regression?

If not then please send the before/after dump of 
/sys/kernel/debug/x86/pat_memtype_list - and even if it works please send 
the dumps so we can double check it all.

Note that the bug was benign in the sense of implementing a too strict 
cache attribute conflict policy and downgrading cache attributes - so 
AFAICS the worst outcome of this bug would be a performance regression.

Patch is only lightly tested, so take care. (Patch is emphatically not 
signed off yet, because I spent most of the day on this and I don't yet 
trust my fix - all of the affected sites need to be reviewed more 
carefully.)

Thanks,

	Ingo


====================>
From: Ingo Molnar <mingo@kernel.org>
Date: Sun, 1 Dec 2019 15:25:50 +0100
Subject: [PATCH] x86/pat: Fix off-by-one bugs in interval tree search

NOT-Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/pat_interval.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/mm/pat_interval.c b/arch/x86/mm/pat_interval.c
index 47a1bf30748f..6855362eaf21 100644
--- a/arch/x86/mm/pat_interval.c
+++ b/arch/x86/mm/pat_interval.c
@@ -56,7 +56,7 @@ static struct memtype *memtype_match(u64 start, u64 end, int match_type)
 {
 	struct memtype *match;
 
-	match = memtype_interval_iter_first(&memtype_rbroot, start, end);
+	match = memtype_interval_iter_first(&memtype_rbroot, start, end-1);
 	while (match != NULL && match->start < end) {
 		if ((match_type == MEMTYPE_EXACT_MATCH) &&
 		    (match->start == start) && (match->end == end))
@@ -66,7 +66,7 @@ static struct memtype *memtype_match(u64 start, u64 end, int match_type)
 		    (match->start < start) && (match->end == end))
 			return match;
 
-		match = memtype_interval_iter_next(match, start, end);
+		match = memtype_interval_iter_next(match, start, end-1);
 	}
 
 	return NULL; /* Returns NULL if there is no match */
@@ -79,7 +79,7 @@ static int memtype_check_conflict(u64 start, u64 end,
 	struct memtype *match;
 	enum page_cache_mode found_type = reqtype;
 
-	match = memtype_interval_iter_first(&memtype_rbroot, start, end);
+	match = memtype_interval_iter_first(&memtype_rbroot, start, end-1);
 	if (match == NULL)
 		goto success;
 
@@ -89,12 +89,12 @@ static int memtype_check_conflict(u64 start, u64 end,
 	dprintk("Overlap at 0x%Lx-0x%Lx\n", match->start, match->end);
 	found_type = match->type;
 
-	match = memtype_interval_iter_next(match, start, end);
+	match = memtype_interval_iter_next(match, start, end-1);
 	while (match) {
 		if (match->type != found_type)
 			goto failure;
 
-		match = memtype_interval_iter_next(match, start, end);
+		match = memtype_interval_iter_next(match, start, end-1);
 	}
 success:
 	if (newtype)
@@ -160,7 +160,7 @@ struct memtype *memtype_erase(u64 start, u64 end)
 struct memtype *memtype_lookup(u64 addr)
 {
 	return memtype_interval_iter_first(&memtype_rbroot, addr,
-					   addr + PAGE_SIZE);
+					   addr + PAGE_SIZE-1);
 }
 
 #if defined(CONFIG_DEBUG_FS)

WARNING: multiple messages have this Message-ID (diff)
From: Ingo Molnar <mingo@kernel.org>
To: lkp@lists.01.org
Subject: [PATCH] x86/pat: Fix off-by-one bugs in interval tree search
Date: Sun, 01 Dec 2019 15:49:47 +0100	[thread overview]
Message-ID: <20191201144947.GA4167@gmail.com> (raw)
In-Reply-To: <20191201104624.GA51279@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 8594 bytes --]


* Ingo Molnar <mingo@kernel.org> wrote:

> * Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > But the final difference is a real difference where it used to be WC,
> > and is now UC-:
> > 
> >     -write-combining @ 0x2000000000-0x2100000000
> >     -write-combining @ 0x2000000000-0x2100000000
> >     +uncached-minus @ 0x2000000000-0x2100000000
> >     +uncached-minus @ 0x2000000000-0x2100000000
> > 
> > which certainly could easily explain the huge performance degradation.

> It's not an unconditional regression, as both Boris and me tried to 
> reproduce it on different systems that do ioremap_wc() as well and didn't 
> measure a slowdown, but something about the memory layout probably 
> triggers the tree management bug.

Ok, I think I found at least one bug in the new PAT code, the conversion 
of memtype_check_conflict() is buggy I think:

   8d04a5f97a5f: ("x86/mm/pat: Convert the PAT tree to a generic interval tree")

        dprintk("Overlap at 0x%Lx-0x%Lx\n", match->start, match->end);
        found_type = match->type;
 
-       node = rb_next(&match->rb);
-       while (node) {
-               match = rb_entry(node, struct memtype, rb);
-
-               if (match->start >= end) /* Checked all possible matches */
-                       goto success;
-
-               if (is_node_overlap(match, start, end) &&
-                   match->type != found_type) {
+       match = memtype_interval_iter_next(match, start, end);
+       while (match) {
+               if (match->type != found_type)
                        goto failure;
-               }
 
-               node = rb_next(&match->rb);
+               match = memtype_interval_iter_next(match, start, end);
        }


Note how the '>= end' condition to end the interval check, got converted 
into:

+       match = memtype_interval_iter_next(match, start, end);

This is subtly off by one, because the interval trees interfaces require 
closed interval parameters:

  include/linux/interval_tree_generic.h

 /*                                                                            \
  * Iterate over intervals intersecting [start;last]                           \
  *                                                                            \
  * Note that a node's interval intersects [start;last] iff:                   \
  *   Cond1: ITSTART(node) <= last                                             \
  * and                                                                        \
  *   Cond2: start <= ITLAST(node)                                             \
  */                                                                           \

  ...

                if (ITSTART(node) <= last) {            /* Cond1 */           \
                        if (start <= ITLAST(node))      /* Cond2 */           \
                                return node;    /* node is leftmost match */  \

[start;last] is a closed interval (note that '<= last' check) - while the 
PAT 'end' parameter is 1 byte beyond the end of the range, because 
ioremap() and the other mapping APIs usually use the [start,end) 
half-open interval, derived from 'size'.

This is what ioremap() does for example:

        /*
         * Mappings have to be page-aligned
         */
        offset = phys_addr & ~PAGE_MASK;
        phys_addr &= PHYSICAL_PAGE_MASK;
        size = PAGE_ALIGN(last_addr+1) - phys_addr;

        retval = reserve_memtype(phys_addr, (u64)phys_addr + size,
                                                pcm, &new_pcm);


phys_addr+size will be on a page boundary, after the last byte of the 
mapped interval.

So the correct parameter to use in the interval tree searches is not 
'end' but 'end-1'.

This could have relevance if conflicting PAT ranges are exactly adjacent, 
for example a future WC region is followed immediately by an already 
mapped UC- region - in this case memtype_check_conflict() would 
incorrectly deny the WC memtype region and downgrade the memtype to UC-.

BTW., rather annoyingly this downgrading is done silently in 
memtype_check_insert():

int memtype_check_insert(struct memtype *new,
                         enum page_cache_mode *ret_type)
{
        int err = 0;

        err = memtype_check_conflict(new->start, new->end, new->type, ret_type);
        if (err)
                return err;

        if (ret_type)
                new->type = *ret_type;

        memtype_interval_insert(new, &memtype_rbroot);
        return 0;
}


So on such a conflict we'd just silently get UC- in *ret_type, and write 
it into the new region, never the wiser ...

So assuming that the patch below fixes the primary bug the diagnostics 
side of ioremap() cache attribute downgrades would be another thing to 
fix.

Anyway, I checked all the interval-tree iterations, and most of them are 
off by one - but I think the one related to memtype_check_conflict() is 
the one causing this particular performance regression.

The only correct interval-tree searches were these two:

arch/x86/mm/pat_interval.c:     match = memtype_interval_iter_first(&memtype_rbroot, 0, ULONG_MAX);
arch/x86/mm/pat_interval.c:             match = memtype_interval_iter_next(match, 0, ULONG_MAX);

The ULONG_MAX was hiding the off-by-one in plain sight. :-)

So it would be nice if everyone who is seeing this bug could test the 
patch below against Linus's latest tree - does it fix the regression?

If not then please send the before/after dump of 
/sys/kernel/debug/x86/pat_memtype_list - and even if it works please send 
the dumps so we can double check it all.

Note that the bug was benign in the sense of implementing a too strict 
cache attribute conflict policy and downgrading cache attributes - so 
AFAICS the worst outcome of this bug would be a performance regression.

Patch is only lightly tested, so take care. (Patch is emphatically not 
signed off yet, because I spent most of the day on this and I don't yet 
trust my fix - all of the affected sites need to be reviewed more 
carefully.)

Thanks,

	Ingo


====================>
From: Ingo Molnar <mingo@kernel.org>
Date: Sun, 1 Dec 2019 15:25:50 +0100
Subject: [PATCH] x86/pat: Fix off-by-one bugs in interval tree search

NOT-Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/pat_interval.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/mm/pat_interval.c b/arch/x86/mm/pat_interval.c
index 47a1bf30748f..6855362eaf21 100644
--- a/arch/x86/mm/pat_interval.c
+++ b/arch/x86/mm/pat_interval.c
@@ -56,7 +56,7 @@ static struct memtype *memtype_match(u64 start, u64 end, int match_type)
 {
 	struct memtype *match;
 
-	match = memtype_interval_iter_first(&memtype_rbroot, start, end);
+	match = memtype_interval_iter_first(&memtype_rbroot, start, end-1);
 	while (match != NULL && match->start < end) {
 		if ((match_type == MEMTYPE_EXACT_MATCH) &&
 		    (match->start == start) && (match->end == end))
@@ -66,7 +66,7 @@ static struct memtype *memtype_match(u64 start, u64 end, int match_type)
 		    (match->start < start) && (match->end == end))
 			return match;
 
-		match = memtype_interval_iter_next(match, start, end);
+		match = memtype_interval_iter_next(match, start, end-1);
 	}
 
 	return NULL; /* Returns NULL if there is no match */
@@ -79,7 +79,7 @@ static int memtype_check_conflict(u64 start, u64 end,
 	struct memtype *match;
 	enum page_cache_mode found_type = reqtype;
 
-	match = memtype_interval_iter_first(&memtype_rbroot, start, end);
+	match = memtype_interval_iter_first(&memtype_rbroot, start, end-1);
 	if (match == NULL)
 		goto success;
 
@@ -89,12 +89,12 @@ static int memtype_check_conflict(u64 start, u64 end,
 	dprintk("Overlap at 0x%Lx-0x%Lx\n", match->start, match->end);
 	found_type = match->type;
 
-	match = memtype_interval_iter_next(match, start, end);
+	match = memtype_interval_iter_next(match, start, end-1);
 	while (match) {
 		if (match->type != found_type)
 			goto failure;
 
-		match = memtype_interval_iter_next(match, start, end);
+		match = memtype_interval_iter_next(match, start, end-1);
 	}
 success:
 	if (newtype)
@@ -160,7 +160,7 @@ struct memtype *memtype_erase(u64 start, u64 end)
 struct memtype *memtype_lookup(u64 addr)
 {
 	return memtype_interval_iter_first(&memtype_rbroot, addr,
-					   addr + PAGE_SIZE);
+					   addr + PAGE_SIZE-1);
 }
 
 #if defined(CONFIG_DEBUG_FS)

  reply	other threads:[~2019-12-01 14:49 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-11-27  0:53 [x86/mm/pat] 8d04a5f97a: phoronix-test-suite.glmark2.0.score -23.7% regression kernel test robot
2019-11-27  0:53 ` kernel test robot
2019-11-30 20:23 ` Mariusz Ceier
2019-11-30 21:27   ` Davidlohr Bueso
2019-11-30 21:27     ` Davidlohr Bueso
2019-11-30 22:08     ` Mariusz Ceier
2019-11-30 22:35       ` Linus Torvalds
2019-11-30 22:35         ` Linus Torvalds
2019-12-01 10:46         ` Ingo Molnar
2019-12-01 10:46           ` Ingo Molnar
2019-12-01 14:49           ` Ingo Molnar [this message]
2019-12-01 14:49             ` [PATCH] x86/pat: Fix off-by-one bugs in interval tree search Ingo Molnar
2019-12-01 16:09             ` Mariusz Ceier
2019-12-01 19:53               ` Ingo Molnar
2019-12-01 19:53                 ` Ingo Molnar
2019-12-01 16:42             ` Kenneth R. Crudup
2019-12-01 17:01             ` Davidlohr Bueso
2019-12-01 17:01               ` Davidlohr Bueso
2019-12-01 17:08             ` Kenneth R. Crudup
2019-12-01 19:55               ` Ingo Molnar
2019-12-01 19:55                 ` Ingo Molnar
2019-12-01 20:09                 ` Kenneth R. Crudup
2019-12-01 20:30                   ` Ingo Molnar
2019-12-01 20:30                     ` Ingo Molnar
2019-12-01 20:04             ` [tip: x86/urgent] x86/mm/pat: " tip-bot2 for Ingo Molnar
2019-12-02  8:31             ` [PATCH] x86/pat: " Rong Chen
2019-12-02  8:31               ` Rong Chen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191201144947.GA4167@gmail.com \
    --to=mingo@kernel.org \
    --cc=bp@alien8.de \
    --cc=dave@stgolabs.net \
    --cc=dbueso@suse.de \
    --cc=kenny@panix.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lkp@lists.01.org \
    --cc=mceier@gmail.com \
    --cc=peterz@infradead.org \
    --cc=rong.a.chen@intel.com \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.