All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] [PATCH] arm64: Boot failure on m400 with new cont PTEs
@ 2015-11-18 15:03 Jeremy Linton
  2015-11-18 15:20 ` Mark Rutland
  0 siblings, 1 reply; 25+ messages in thread
From: Jeremy Linton @ 2015-11-18 15:03 UTC (permalink / raw)
  To: linux-arm-kernel

The HP m400 fails to boot the linux 4.4rc1 kernel. It usually
hangs or sometimes takes an unhanded exception around the DMA
zone messages. This was bisected to the new CONT PTE changes.

Adding an extra flush_tlb_all() in the code path which is
changing the kernel permissions allows the machine to boot
consistently.

Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
---
 arch/arm64/mm/mmu.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index e3f563c..e92fe77 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -121,6 +121,7 @@ static void __populate_init_pte(pte_t *pte, unsigned long addr,
 		pfn++;
 		addr += PAGE_SIZE;
 	} while (addr != end);
+	flush_tlb_all();
 }
 
 static void alloc_init_pte(pmd_t *pmd, unsigned long addr,
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH] [PATCH] arm64: Boot failure on m400 with new cont PTEs
  2015-11-18 15:03 [PATCH] [PATCH] arm64: Boot failure on m400 with new cont PTEs Jeremy Linton
@ 2015-11-18 15:20 ` Mark Rutland
  2015-11-18 16:08   ` Jeremy Linton
  0 siblings, 1 reply; 25+ messages in thread
From: Mark Rutland @ 2015-11-18 15:20 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Jeremy,

On Wed, Nov 18, 2015 at 09:03:19AM -0600, Jeremy Linton wrote:
> The HP m400 fails to boot the linux 4.4rc1 kernel. 

Are you using defconfig? If not, can you share your config?

> It usually hangs or sometimes takes an unhanded exception around the
> DMA zone messages. This was bisected to the new CONT PTE changes.

Do you have any examples of the unhandled exception cases? Are they a
mixed bag, or a consistent exception class?

> Adding an extra flush_tlb_all() in the code path which is
> changing the kernel permissions allows the machine to boot
> consistently.

As you mention changing permissions, I take it you're using
CONFIG_DEBUG_RODATA?

Thanks,
Mark.

> Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
> ---
>  arch/arm64/mm/mmu.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index e3f563c..e92fe77 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -121,6 +121,7 @@ static void __populate_init_pte(pte_t *pte, unsigned long addr,
>  		pfn++;
>  		addr += PAGE_SIZE;
>  	} while (addr != end);
> +	flush_tlb_all();
>  }
>  
>  static void alloc_init_pte(pmd_t *pmd, unsigned long addr,
> -- 
> 2.4.3
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH] [PATCH] arm64: Boot failure on m400 with new cont PTEs
  2015-11-18 15:20 ` Mark Rutland
@ 2015-11-18 16:08   ` Jeremy Linton
  2015-11-18 16:29     ` Mark Rutland
  0 siblings, 1 reply; 25+ messages in thread
From: Jeremy Linton @ 2015-11-18 16:08 UTC (permalink / raw)
  To: linux-arm-kernel

On 11/18/2015 09:20 AM, Mark Rutland wrote:
> Hi Jeremy,
>
> On Wed, Nov 18, 2015 at 09:03:19AM -0600, Jeremy Linton wrote:
>> The HP m400 fails to boot the linux 4.4rc1 kernel.
>
> Are you using defconfig? If not, can you share your config?
	No, its not defconfig, its roughly the RHELSA config tossed into a 
mainline 4.4 tree and all the default options selected. AFAIK RHELSA is 
still limited access.

>
>> It usually hangs or sometimes takes an unhanded exception around the
>> DMA zone messages. This was bisected to the new CONT PTE changes.
>
> Do you have any examples of the unhandled exception cases? Are they a
> mixed bag, or a consistent exception class?

I'm guessing about 90% of the time its a dead hang, the remaining are 
the faults of which there is one that happens more frequently than the 
others. Here is one i found in my notes..

[    0.000000] On node 0 totalpages: 1048512
[    0.000000]   DMA zone: 64 pages used for memmap
[    0.000000]   DMA zone: 0 pages reserved
[    0.000000]   DMA zone: 65472 pages, LIFO batch:1
[    0.000000] Unhandled fault: unknown 48 (0x96000070) at 
0xfffffe0000d60588

>> Adding an extra flush_tlb_all() in the code path which is
>> changing the kernel permissions allows the machine to boot
>> consistently.
>
> As you mention changing permissions, I take it you're using
> CONFIG_DEBUG_RODATA?

The failing configuration doesn't have DEBUG_RODATA set, I might have 
been pretty loose with my terminology.

Frankly, I wondered originally how config RODATA was working reliably 
because the flushes were only around the directories getting split, 
fixup_init() (and basically anything calling create_mapping_late()) 
looked like there were paths that could avoid flushing. When I added the 
CONT changes I didn't add flushes to paths that didn't previously have 
them (except in the split cont range case, which matched the spit p[mu]d 
case). I made the mistake of assuming someone knew about some edge case 
that avoided the need for the flush.

Once I find/fix the console issue on that machine with 4.4rc1 (there are 
a small handful of issues that keep mainline from working on it, 
including the sata patch that was posted, and rejected), I will focus on 
hoisting the tlb flush into create_mapping_late() and removing the 
splattering of flushes in those code paths. That is unless there is a 
reason to be preforming them as soon as the directories are split.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH] [PATCH] arm64: Boot failure on m400 with new cont PTEs
  2015-11-18 16:08   ` Jeremy Linton
@ 2015-11-18 16:29     ` Mark Rutland
  2015-11-18 17:14       ` Jeremy Linton
  2015-11-23 15:51       ` Catalin Marinas
  0 siblings, 2 replies; 25+ messages in thread
From: Mark Rutland @ 2015-11-18 16:29 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Nov 18, 2015 at 10:08:58AM -0600, Jeremy Linton wrote:
> On 11/18/2015 09:20 AM, Mark Rutland wrote:
> >Hi Jeremy,
> >
> >On Wed, Nov 18, 2015 at 09:03:19AM -0600, Jeremy Linton wrote:
> >>The HP m400 fails to boot the linux 4.4rc1 kernel.
> >
> >Are you using defconfig? If not, can you share your config?
> 	No, its not defconfig, its roughly the RHELSA config tossed into a
> mainline 4.4 tree and all the default options selected. AFAIK RHELSA
> is still limited access.

That renders this extremely difficult for anyone else to reproduce...

> >>It usually hangs or sometimes takes an unhanded exception around the
> >>DMA zone messages. This was bisected to the new CONT PTE changes.
> >
> >Do you have any examples of the unhandled exception cases? Are they a
> >mixed bag, or a consistent exception class?
> 
> I'm guessing about 90% of the time its a dead hang, the remaining
> are the faults of which there is one that happens more frequently
> than the others. Here is one i found in my notes..

Ok. In future please provide a sample with any bug report.

> [    0.000000] On node 0 totalpages: 1048512
> [    0.000000]   DMA zone: 64 pages used for memmap
> [    0.000000]   DMA zone: 0 pages reserved
> [    0.000000]   DMA zone: 65472 pages, LIFO batch:1
> [    0.000000] Unhandled fault: unknown 48 (0x96000070) at
> 0xfffffe0000d60588

>From a quick grep that's from do_mem_abort, where the "unknown 48" is
the DFSC, the bit in brackets is the ESR, and the address is the
faulting address from FAR_EL1.

That 48 / 0b110000 for the DFSC decodes as "TLB conflict abort" per the
ARM ARM. Other than that, the WnR bit is set in the ISS.

So this is probably a break-before-make issue.

Can you figure out where 0xfffffe0000d60588 pointed to, and where in the
kernel the access was performed? It would be nice to know if this is
consistently happening at some edge of the kernel address space.

FWIW, Will had a patch [1] for detecting PTE level break-before-make
violations. I gave this a go on Juno with v4.4-rc1, and saw an issue in
the EFI virtmap code that I'm currently investigating.

> >>Adding an extra flush_tlb_all() in the code path which is
> >>changing the kernel permissions allows the machine to boot
> >>consistently.
> >
> >As you mention changing permissions, I take it you're using
> >CONFIG_DEBUG_RODATA?
> 
> The failing configuration doesn't have DEBUG_RODATA set, I might
> have been pretty loose with my terminology.

Ok, good to know.

> Frankly, I wondered originally how config RODATA was working
> reliably because the flushes were only around the directories
> getting split, fixup_init() (and basically anything calling
> create_mapping_late()) looked like there were paths that could avoid
> flushing. When I added the CONT changes I didn't add flushes to
> paths that didn't previously have them (except in the split cont
> range case, which matched the spit p[mu]d case). I made the mistake
> of assuming someone knew about some edge case that avoided the need
> for the flush.

I'll need to page the code back into my head, but I recall I had
concerns about break-before-make, so there's some auditing to be done.

> Once I find/fix the console issue on that machine with 4.4rc1 (there
> are a small handful of issues that keep mainline from working on it,
> including the sata patch that was posted, and rejected), I will
> focus on hoisting the tlb flush into create_mapping_late() and
> removing the splattering of flushes in those code paths. That is
> unless there is a reason to be preforming them as soon as the
> directories are split.

We need to figure out exactly what maintenance we actually need.

Hoisting the TLB flush isn't necessarily possible if we need to perform
break-before-make at the PTE level, and even that may not be possible
for the kernel page tables; we might need to do something more
drastic like using ASIDs and double-buffering them...

We also need to figure out what's happening with the code as it is.

Thanks,
Mark.

[1] https://git.kernel.org/cgit/linux/kernel/git/will/linux.git/commit/?h=aarch64/devel&id=372f39220ad35fa39a75419f2221ffeb6ffd78d3

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH] [PATCH] arm64: Boot failure on m400 with new cont PTEs
  2015-11-18 16:29     ` Mark Rutland
@ 2015-11-18 17:14       ` Jeremy Linton
  2015-11-18 18:04         ` Mark Rutland
  2015-11-23 15:51       ` Catalin Marinas
  1 sibling, 1 reply; 25+ messages in thread
From: Jeremy Linton @ 2015-11-18 17:14 UTC (permalink / raw)
  To: linux-arm-kernel

On 11/18/2015 10:29 AM, Mark Rutland wrote:
> On Wed, Nov 18, 2015 at 10:08:58AM -0600, Jeremy Linton wrote:
>> 	No, its not defconfig, its roughly the RHELSA config tossed into a
>> mainline 4.4 tree and all the default options selected. AFAIK RHELSA
>> is still limited access.
>
> That renders this extremely difficult for anyone else to reproduce...

Well the kernel in question boots fine on a Juno. I haven't tried any 
other APM based machines. And given whats happening I doubt its config 
related.

> That 48 / 0b110000 for the DFSC decodes as "TLB conflict abort" per the
> ARM ARM. Other than that, the WnR bit is set in the ISS.
>
> So this is probably a break-before-make issue.
>
> Can you figure out where 0xfffffe0000d60588 pointed to, and where in the
> kernel the access was performed? It would be nice to know if this is
> consistently happening at some edge of the kernel address space.

I decoded everything when I initially saw it, but it didn't make a lick 
of sense related to what I was attempting to accomplish so I didn't keep 
any of it. Only later when I found out it wasn't related to the patches 
I was applying did I start trying to track down the regression. Even so, 
given some other patches that went in, it wasn't blindingly obvious 
where the problem was until I was sure that it was related to the linear 
mapping changes. AKA I didn't think anyone would be able to debug the 
failure with that little information, maybe i'm wrong on that point... 
Anyway, the kernel that produced that failure is long gone, I can in the 
near future attempt to reproduce the message.


>> Once I find/fix the console issue on that machine with 4.4rc1 (there
>> are a small handful of issues that keep mainline from working on it,
>> including the sata patch that was posted, and rejected), I will
>> focus on hoisting the tlb flush into create_mapping_late() and
>> removing the splattering of flushes in those code paths. That is
>> unless there is a reason to be preforming them as soon as the
>> directories are split.
>
> We need to figure out exactly what maintenance we actually need.
>
> Hoisting the TLB flush isn't necessarily possible if we need to perform
> break-before-make at the PTE level, and even that may not be possible
> for the kernel page tables; we might need to do something more
> drastic like using ASIDs and double-buffering them...
>
> We also need to figure out what's happening with the code as it is.

Well, I'm suspect what is happening is that there are conflicting TLB's 
hanging around, one for a cont range that is overlapping a stale non 
cont one. This sort of implies that this has been happening all along, 
AKA RO regions were being "lazy" activated if you will. Its only on a 
core that aborts when it detects that (which i assume requires differing 
size entries for this core) does it cause problems. The 
break-before-make issue, seems like it won't cause a big problem here as 
long as there is some way to assure valid TLBs before the update, and 
then assure they are cleared following it. Hence the overly aggressive 
change works because it flushes following every cont block update. Which 
would bother me more if the code were run more than once per boot (or in 
the future per module load/unload if someone gets around to updating the 
no execute reliably).

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH] [PATCH] arm64: Boot failure on m400 with new cont PTEs
  2015-11-18 17:14       ` Jeremy Linton
@ 2015-11-18 18:04         ` Mark Rutland
  2015-11-18 19:31           ` Jeremy Linton
  0 siblings, 1 reply; 25+ messages in thread
From: Mark Rutland @ 2015-11-18 18:04 UTC (permalink / raw)
  To: linux-arm-kernel

> >We also need to figure out what's happening with the code as it is.
> 
> Well, I'm suspect what is happening is that there are conflicting
> TLB's hanging around, one for a cont range that is overlapping a stale
> non cont one. This sort of implies that this has been happening all
> along, AKA RO regions were being "lazy" activated if you will. Its
> only on a core that aborts when it detects that (which i assume
> requires differing size entries for this core) does it cause problems.

I suspect that we may have believed that the TLB maintenance at the end
of paging_init was sufficient, as we evidently  id not consider TLB
conflicts.

> The break-before-make issue, seems like it won't cause a big problem
> here as long as there is some way to assure valid TLBs before the
> update, and then assure they are cleared following it.

If at any instant in time you have a valid TLB entry for an address, and
the page tables hold a value that would give rise to a conflicting TLB
entry, you can encounter a TLB conflict abort.

It's a race against the hardware.

> Hence the overly aggressive change works because it
> flushes following every cont block update. Which would bother me
> more if the code were run more than once per boot (or in the future
> per module load/unload if someone gets around to updating the no
> execute reliably).

You're racing against other parts of the CPU (the page table walker(s),
I-caches, etc). The flushing only minimises the window for a race, and
does not prevent the race from being possible.

Given that the envelope is constantly pushing forward w.r.t. how
aggressive CPUs may be in this area, we need to fix the issue by
reasoning against what the architecture guarantees us.

We're almost certainly going to have to revisit this code in future, and
sprinkling TLB maintenance over it will only make it harder to reason
about in future.

Mark.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH] [PATCH] arm64: Boot failure on m400 with new cont PTEs
  2015-11-18 18:04         ` Mark Rutland
@ 2015-11-18 19:31           ` Jeremy Linton
  2015-11-19 11:31             ` Mark Rutland
  0 siblings, 1 reply; 25+ messages in thread
From: Jeremy Linton @ 2015-11-18 19:31 UTC (permalink / raw)
  To: linux-arm-kernel

On 11/18/2015 12:04 PM, Mark Rutland wrote:

> You're racing against other parts of the CPU (the page table walker(s),
> I-caches, etc). The flushing only minimises the window for a race, and
> does not prevent the race from being possible.
>
> Given that the envelope is constantly pushing forward w.r.t. how
> aggressive CPUs may be in this area, we need to fix the issue by
> reasoning against what the architecture guarantees us.
	Its also not suppose to fault on speculative access, and to me that 
means page table walks/etc that are the result of speculative access. 
Which AFAIK, closes the window significantly. I would only really worry 
about interrupt activity, and updates to the memory containing the PTE's 
themselves. Either way the simple change (rather than rewriting the 
whole code path) is probably to flag the fault handler to simply resume 
from these kinds of faults during create_mapping_late().

	But that isn't what is happening here AFAIK, the faults are long after 
the PTE's have been updated, and are the result of failure to flush the 
TLB..

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH] [PATCH] arm64: Boot failure on m400 with new cont PTEs
  2015-11-18 19:31           ` Jeremy Linton
@ 2015-11-19 11:31             ` Mark Rutland
  2015-11-20 19:52               ` Mark Rutland
  2015-11-20 20:15               ` Mark Rutland
  0 siblings, 2 replies; 25+ messages in thread
From: Mark Rutland @ 2015-11-19 11:31 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Nov 18, 2015 at 01:31:18PM -0600, Jeremy Linton wrote:
> On 11/18/2015 12:04 PM, Mark Rutland wrote:
> 
> >You're racing against other parts of the CPU (the page table walker(s),
> >I-caches, etc). The flushing only minimises the window for a race, and
> >does not prevent the race from being possible.
> >
> >Given that the envelope is constantly pushing forward w.r.t. how
> >aggressive CPUs may be in this area, we need to fix the issue by
> >reasoning against what the architecture guarantees us.
> 	Its also not suppose to fault on speculative access, and to me that
> means page table walks/etc that are the result of speculative
> access.

I was under the impression that TLB conflict abort could be delivered
for asynchronous events (e.g. speculative I-cache fetches rather than
for speculative execution of already fetched instructions).

Having looked at the ARM ARM, I appear to have been mistaken. As you
say, it appears that TLB conflict aborts are always delivered
synchronously.

> Which AFAIK, closes the window significantly. I would only
> really worry about interrupt activity, and updates to the memory
> containing the PTE's themselves. Either way the simple change
> (rather than rewriting the whole code path) is probably to flag the
> fault handler to simply resume from these kinds of faults during
> create_mapping_late().

Unfortunately that may not be sufficient. The conflicting address range
might cover the current stack or the text of the exception handler, and
in those cases trying to handle the exception would result in taking
another TLB conflict abort recursively.

> 	But that isn't what is happening here AFAIK, the faults are long
> after the PTE's have been updated, and are the result of failure to
> flush the TLB..

It's not quite the same case, certainly.

It's also possible that the faults you are seeing see are also possible
earlier, but simply less likely, and that we get away without seeing the
other potential issues because of things that may change (i.e. the way
the compiler lays out the text).

I think that if we need to do something more drastic to account for the
other issues above (e.g. by ensuring that we can never allocate
conflicting TLB entries in the first place), and that said strategy
would also fix this problem, that would be preferable, given that we're
going to have to do that eventually anyway.

Mark.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH] [PATCH] arm64: Boot failure on m400 with new cont PTEs
  2015-11-19 11:31             ` Mark Rutland
@ 2015-11-20 19:52               ` Mark Rutland
  2015-11-23 12:15                 ` Catalin Marinas
  2015-11-20 20:15               ` Mark Rutland
  1 sibling, 1 reply; 25+ messages in thread
From: Mark Rutland @ 2015-11-20 19:52 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Nov 19, 2015 at 11:31:34AM +0000, Mark Rutland wrote:
> On Wed, Nov 18, 2015 at 01:31:18PM -0600, Jeremy Linton wrote:
> > On 11/18/2015 12:04 PM, Mark Rutland wrote:
> > 
> > >You're racing against other parts of the CPU (the page table walker(s),
> > >I-caches, etc). The flushing only minimises the window for a race, and
> > >does not prevent the race from being possible.
> > >
> > >Given that the envelope is constantly pushing forward w.r.t. how
> > >aggressive CPUs may be in this area, we need to fix the issue by
> > >reasoning against what the architecture guarantees us.
> > 	Its also not suppose to fault on speculative access, and to me that
> > means page table walks/etc that are the result of speculative
> > access.
> 
> I was under the impression that TLB conflict abort could be delivered
> for asynchronous events (e.g. speculative I-cache fetches rather than
> for speculative execution of already fetched instructions).
> 
> Having looked at the ARM ARM, I appear to have been mistaken. As you
> say, it appears that TLB conflict aborts are always delivered
> synchronously.
> 
> > Which AFAIK, closes the window significantly. I would only
> > really worry about interrupt activity, and updates to the memory
> > containing the PTE's themselves. Either way the simple change
> > (rather than rewriting the whole code path) is probably to flag the
> > fault handler to simply resume from these kinds of faults during
> > create_mapping_late().

> > 	But that isn't what is happening here AFAIK, the faults are long
> > after the PTE's have been updated, and are the result of failure to
> > flush the TLB..

> I think that if we need to do something more drastic to account for the
> other issues above (e.g. by ensuring that we can never allocate
> conflicting TLB entries in the first place), and that said strategy
> would also fix this problem, that would be preferable, given that we're
> going to have to do that eventually anyway.

Having looked into this further, we also have the same issue with the
kasan init code.

I believe that the issue is restricted to one-off init code, as I don't
think that we do anything at runtime which would be problematic. If
anyone knows of a counter-example, please let me know!

Given that, we can restrict the problem to an early UP environment, and
it won't matter if therre's some large(ish) fixed cost associated with
updating the kernel page tables. I think that we can avoid the issue
entirely by modifying a copy of the kernel page tables, which we can
later install via some idmap code (going via a reserved table to clear
the TLBs).

I'm working on patches to implement the above, which I'll try to get
somewhere with next week.

Thanks,
Mark

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH] [PATCH] arm64: Boot failure on m400 with new cont PTEs
  2015-11-19 11:31             ` Mark Rutland
  2015-11-20 19:52               ` Mark Rutland
@ 2015-11-20 20:15               ` Mark Rutland
  1 sibling, 0 replies; 25+ messages in thread
From: Mark Rutland @ 2015-11-20 20:15 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Nov 19, 2015 at 11:31:34AM +0000, Mark Rutland wrote:
> On Wed, Nov 18, 2015 at 01:31:18PM -0600, Jeremy Linton wrote:
> > On 11/18/2015 12:04 PM, Mark Rutland wrote:
> > 
> > >You're racing against other parts of the CPU (the page table walker(s),
> > >I-caches, etc). The flushing only minimises the window for a race, and
> > >does not prevent the race from being possible.
> > >
> > >Given that the envelope is constantly pushing forward w.r.t. how
> > >aggressive CPUs may be in this area, we need to fix the issue by
> > >reasoning against what the architecture guarantees us.
> > 	Its also not suppose to fault on speculative access, and to me that
> > means page table walks/etc that are the result of speculative
> > access.
> 
> I was under the impression that TLB conflict abort could be delivered
> for asynchronous events (e.g. speculative I-cache fetches rather than
> for speculative execution of already fetched instructions).
> 
> Having looked at the ARM ARM, I appear to have been mistaken. As you
> say, it appears that TLB conflict aborts are always delivered
> synchronously.

Having invesitgated further, while we may not encounter (synchronous)
TLB conflict aborts, we may still encounter (asynchronous) issues from
conflicting TLB entries.

Per the ARM ARM, if the TLB contains multiple entries for the same
address, the result of a translation may be some amalgamation of said
entries (where the amalgamation could be an arbitrary function of all of
said matching entries).

Thus page table walks and *-cache fetches may use completely erroneous
addresses and/or attributes, asynchronous to the instruction stream, and
as a result of this may change the state of MMIO peripherals, trigger
SError, etc.

This is a much scarier proposition than the TLB conflict aborts.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH] [PATCH] arm64: Boot failure on m400 with new cont PTEs
  2015-11-20 19:52               ` Mark Rutland
@ 2015-11-23 12:15                 ` Catalin Marinas
  2015-11-23 13:49                   ` Mark Rutland
  2015-11-23 14:31                   ` Jeremy Linton
  0 siblings, 2 replies; 25+ messages in thread
From: Catalin Marinas @ 2015-11-23 12:15 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Nov 20, 2015 at 07:52:44PM +0000, Mark Rutland wrote:
> On Thu, Nov 19, 2015 at 11:31:34AM +0000, Mark Rutland wrote:
> > I think that if we need to do something more drastic to account for the
> > other issues above (e.g. by ensuring that we can never allocate
> > conflicting TLB entries in the first place), and that said strategy
> > would also fix this problem, that would be preferable, given that we're
> > going to have to do that eventually anyway.
> 
> Having looked into this further, we also have the same issue with the
> kasan init code.

I don't think the kasan_init() problem is that bad. We are preserving
the same size mappings (PAGE_SIZE) and just changing the physical
address they point at without a break-before-make (just a TTBR1 switch).
I don't know how clear the ARM ARM is around this but at least so far we
haven't hit any problems.

The problem with the contiguous bit is that we switch from e.g. a 4KB
mapping to a 64KB one and it's very likely that we would get a TLB
conflict.

With CONFIG_DEBUG_RODATA, we go from bigger block to a smaller one, so
less chance of a TLB conflict but still present. I need to read the ARM
ARM some more in this area (and maybe ask for clarification).

> I believe that the issue is restricted to one-off init code, as I don't
> think that we do anything at runtime which would be problematic. If
> anyone knows of a counter-example, please let me know!
> 
> Given that, we can restrict the problem to an early UP environment, and
> it won't matter if therre's some large(ish) fixed cost associated with
> updating the kernel page tables. I think that we can avoid the issue
> entirely by modifying a copy of the kernel page tables, which we can
> later install via some idmap code (going via a reserved table to clear
> the TLBs).
> 
> I'm working on patches to implement the above, which I'll try to get
> somewhere with next week.

That's a complete fix indeed but it would require some more testing and
I don't think it's feasible for 4.4-rc. In the meantime, I propose that
we revert the contiguous PTE patches and push them again once we fix the
TLB conflict problems.

-- 
Catalin

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH] [PATCH] arm64: Boot failure on m400 with new cont PTEs
  2015-11-23 12:15                 ` Catalin Marinas
@ 2015-11-23 13:49                   ` Mark Rutland
  2015-11-23 14:48                     ` Jeremy Linton
  2015-11-23 14:31                   ` Jeremy Linton
  1 sibling, 1 reply; 25+ messages in thread
From: Mark Rutland @ 2015-11-23 13:49 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Nov 23, 2015 at 12:15:15PM +0000, Catalin Marinas wrote:
> On Fri, Nov 20, 2015 at 07:52:44PM +0000, Mark Rutland wrote:
> > On Thu, Nov 19, 2015 at 11:31:34AM +0000, Mark Rutland wrote:
> > > I think that if we need to do something more drastic to account for the
> > > other issues above (e.g. by ensuring that we can never allocate
> > > conflicting TLB entries in the first place), and that said strategy
> > > would also fix this problem, that would be preferable, given that we're
> > > going to have to do that eventually anyway.
> > 
> > Having looked into this further, we also have the same issue with the
> > kasan init code.
> 
> I don't think the kasan_init() problem is that bad. We are preserving
> the same size mappings (PAGE_SIZE) and just changing the physical
> address they point at without a break-before-make (just a TTBR1 switch).

Per the ARM ARM, "CONSTRAINED UNPREDICTABLE behaviors due to caching of
control or data values", the result of a translation could be "an
amalgamation" of the values. I believe that we have to read
"amalgamation" as "arbitrary function of" here.

I don't think that we're safe because we only changed the output
addresses of entries.

> I don't know how clear the ARM ARM is around this but at least so far we
> haven't hit any problems.

I assume you're talking generally here, rather than specifically about
kasan. I agree that we haven't spotted any issues so far.

Given that kasan itself is new and requires a relatively new compiler,
it may not yet have been tested on a platform where it would fail on.

Jeremy, for reference, have you tried kasan on m400? Or DEBUG_RODATA?

> The problem with the contiguous bit is that we switch from e.g. a 4KB
> mapping to a 64KB one and it's very likely that we would get a TLB
> conflict.
> 
> With CONFIG_DEBUG_RODATA, we go from bigger block to a smaller one, so
> less chance of a TLB conflict but still present. I need to read the ARM
> ARM some more in this area (and maybe ask for clarification).

We should certainly try to get clarification here.

> > I believe that the issue is restricted to one-off init code, as I don't
> > think that we do anything at runtime which would be problematic. If
> > anyone knows of a counter-example, please let me know!
> > 
> > Given that, we can restrict the problem to an early UP environment, and
> > it won't matter if therre's some large(ish) fixed cost associated with
> > updating the kernel page tables. I think that we can avoid the issue
> > entirely by modifying a copy of the kernel page tables, which we can
> > later install via some idmap code (going via a reserved table to clear
> > the TLBs).
> > 
> > I'm working on patches to implement the above, which I'll try to get
> > somewhere with next week.
> 
> That's a complete fix indeed but it would require some more testing and
> I don't think it's feasible for 4.4-rc. In the meantime, I propose that
> we revert the contiguous PTE patches and push them again once we fix the
> TLB conflict problems.

I agree that this would be too late for v4.4-rc*.

In the meantime, I guess that reverting the patches is the best thing to
do given we're already at rc2.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH] [PATCH] arm64: Boot failure on m400 with new cont PTEs
  2015-11-23 12:15                 ` Catalin Marinas
  2015-11-23 13:49                   ` Mark Rutland
@ 2015-11-23 14:31                   ` Jeremy Linton
  1 sibling, 0 replies; 25+ messages in thread
From: Jeremy Linton @ 2015-11-23 14:31 UTC (permalink / raw)
  To: linux-arm-kernel

On 11/23/2015 06:15 AM, Catalin Marinas wrote:

> That's a complete fix indeed but it would require some more testing and
> I don't think it's feasible for 4.4-rc. In the meantime, I propose that
> we revert the contiguous PTE patches and push them again once we fix the
> TLB conflict problems.


The problem as you say exists with the block mappings as well, and the 
explicit TLB flush there is apparently sufficient to keep this problem 
from occurring. The 1 line explicit flush is sufficient to keep it from 
happening in the late mapping case as well, and actually fixes a problem 
that was in the kernel from before these patches (the complete lack of 
flush in paths calling create_mapping_late()). So, I fail to see how 
reverting the patch fixes anything, it only returns to the previously 
(now known) to be bad code path.

BTW: The M400, has other issues that keep it from booting with ACPI 
firmware (the 4.3 EOI changes irritate a 'bug' in the MADT table).

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH] [PATCH] arm64: Boot failure on m400 with new cont PTEs
  2015-11-23 13:49                   ` Mark Rutland
@ 2015-11-23 14:48                     ` Jeremy Linton
  2015-11-23 15:41                       ` Will Deacon
  0 siblings, 1 reply; 25+ messages in thread
From: Jeremy Linton @ 2015-11-23 14:48 UTC (permalink / raw)
  To: linux-arm-kernel

On 11/23/2015 07:49 AM, Mark Rutland wrote:
> Jeremy, for reference, have you tried kasan on m400? Or DEBUG_RODATA?


No, because the machine has a list of issues that keep it from booting a 
mainline kernel in a functional state. Once those are cleared up I will 
revisit this patch.

The goal was to create a conceptually safe fix for the problem, which 
isn't all this hypothetical stuff being discussed, but the fact that the 
TLBs are not being flushed properly (with or without the CONT bit stuff) 
resulting in a tlb conflict fault long after this code path has finished 
executing.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH] [PATCH] arm64: Boot failure on m400 with new cont PTEs
  2015-11-23 14:48                     ` Jeremy Linton
@ 2015-11-23 15:41                       ` Will Deacon
  2015-11-23 15:46                         ` Jeremy Linton
  0 siblings, 1 reply; 25+ messages in thread
From: Will Deacon @ 2015-11-23 15:41 UTC (permalink / raw)
  To: linux-arm-kernel

Jeremy,

On Mon, Nov 23, 2015 at 08:48:40AM -0600, Jeremy Linton wrote:
> On 11/23/2015 07:49 AM, Mark Rutland wrote:
> >Jeremy, for reference, have you tried kasan on m400? Or DEBUG_RODATA?
> 
> No, because the machine has a list of issues that keep it from booting a
> mainline kernel in a functional state. Once those are cleared up I will
> revisit this patch.
> 
> The goal was to create a conceptually safe fix for the problem, which isn't
> all this hypothetical stuff being discussed, but the fact that the TLBs are
> not being flushed properly (with or without the CONT bit stuff) resulting in
> a tlb conflict fault long after this code path has finished executing.

I appreciate that you just want to get something working, and we've
established that a TLBIALL probably does solve the problem, however there's
more to it than that.

If we start adding TLBI/DSB/ISB/NOP/cache flush/read from config space
type operations in arbitrary places, we run a very real risk of masking
future bugs. Then, when the original problem that prompted the temporary
bodge is fixed properly, we uncover a whole raft of problems that we didn't
even realise were there. Consequently, we get stuck with the bodge forever
and, crucially, we lose the ability to reason about the state of the CPU
under Linux. That reasoning is incredibly useful when developing new
architectural features, debugging, profiling or assessing whether or
not Linux is susceptible to hardware errata and is precisely why the
"hypothetical stuff" matters.

For these reasons, we have no viable option other than reverting the
offending series until the underlying problem is fixed properly. With
any luck, that's the 4.5 timeframe so we really only lose a single
release (providing that you have time to rebase and repost).

Sorry about this; I hope it doesn't dissuade you from reporting bugs in
the future.

Will

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH] [PATCH] arm64: Boot failure on m400 with new cont PTEs
  2015-11-23 15:41                       ` Will Deacon
@ 2015-11-23 15:46                         ` Jeremy Linton
  0 siblings, 0 replies; 25+ messages in thread
From: Jeremy Linton @ 2015-11-23 15:46 UTC (permalink / raw)
  To: linux-arm-kernel

On 11/23/2015 09:41 AM, Will Deacon wrote:

> For these reasons, we have no viable option other than reverting the
> offending series until the underlying problem is fixed properly. With
> any luck, that's the 4.5 timeframe so we really only lose a single
> release (providing that you have time to rebase and repost).


That is fine but, I think you need to take this fix anyway (or something 
similar), because there _IS_ a TLB flush bug in that code path. I saw it 
when I wrote the original path, but discounted it assuming that the 
original authors had more insight on the hardware than I did.

Whether the fix takes the suggest form or another, the cont bit changes 
are only irritating an existing bug, not causing it.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH] [PATCH] arm64: Boot failure on m400 with new cont PTEs
  2015-11-18 16:29     ` Mark Rutland
  2015-11-18 17:14       ` Jeremy Linton
@ 2015-11-23 15:51       ` Catalin Marinas
  2015-11-23 16:02         ` Jeremy Linton
  1 sibling, 1 reply; 25+ messages in thread
From: Catalin Marinas @ 2015-11-23 15:51 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Nov 18, 2015 at 04:29:32PM +0000, Mark Rutland wrote:
> FWIW, Will had a patch [1] for detecting PTE level break-before-make
> violations. I gave this a go on Juno with v4.4-rc1, and saw an issue in
> the EFI virtmap code that I'm currently investigating.
[...]
> [1] https://git.kernel.org/cgit/linux/kernel/git/will/linux.git/commit/?h=aarch64/devel&id=372f39220ad35fa39a75419f2221ffeb6ffd78d3

I thought about adding a check for when we change from contiguous to
non-contiguous or vice-versa, on top of Will's patch:

--------------8<--------------------
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index defbfde43a43..70e02e3b1a78 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -240,6 +240,10 @@ static inline void set_pte(pte_t *ptep, pte_t pte)
 	if (WARN_ON((old & PTE_ATTRINDX_MASK) != (new & PTE_ATTRINDX_MASK)))
 		goto pte_bad;
 
+	/* Changing contiguity may lead to a TLB conflict */
+	if (WARN_ON((old & PTE_CONT) != (new & PTE_CONT)))
+		goto pte_bad;
+
 	/* Change of OA is only an issue if one mapping is writable */
 	if (!(old & new & PTE_RDONLY) &&
 	    WARN_ON(pte_pfn(*ptep) != pte_pfn(pte)))
--------------8<--------------------

But it doesn't look nice afterwards. It's the fixup_init() trying to
re-write the entries and we start changing the PTE_CONT bit:

Call trace:
[<ffffffc0000952b8>] __create_mapping.isra.5+0x360/0x530
[<ffffffc0000954ec>] fixup_init+0x64/0x80
[<ffffffc0000945a4>] free_initmem+0xc/0x38
[<ffffffc0005eb9f8>] kernel_init+0x20/0xe0
[<ffffffc000085c50>] ret_from_fork+0x10/0x40

What I don't get is why we have fixup_init() even when
!CONFIG_DEBUG_RODATA. It probably breaks the initial mapping just to get
a non-executable init section. However, the other sections are left
executable when this config option is disabled. The patch below fixes
the warnings above:

--------------8<--------------------
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index abb66f84d4ac..e0f82e1a1c09 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -482,9 +482,11 @@ void mark_rodata_ro(void)
 
 void fixup_init(void)
 {
+#ifdef CONFIG_DEBUG_RODATA
 	create_mapping_late(__pa(__init_begin), (unsigned long)__init_begin,
 			(unsigned long)__init_end - (unsigned long)__init_begin,
 			PAGE_KERNEL);
+#endif
 }
 
 /*
--------------8<--------------------

Jeremy, can you give this fixup_init() patch a try, see whether it makes
any difference (it's just a temporary hack which may prevent us from
reverting the PTE_CONT patches until we have a better solution).

Anyway, I think we should merge Will's patch since it's a handy debug
tool.

-- 
Catalin

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH] [PATCH] arm64: Boot failure on m400 with new cont PTEs
  2015-11-23 15:51       ` Catalin Marinas
@ 2015-11-23 16:02         ` Jeremy Linton
  2015-11-23 16:37           ` Laura Abbott
  2015-11-23 16:52           ` Catalin Marinas
  0 siblings, 2 replies; 25+ messages in thread
From: Jeremy Linton @ 2015-11-23 16:02 UTC (permalink / raw)
  To: linux-arm-kernel

On 11/23/2015 09:51 AM, Catalin Marinas wrote:
> Call trace:
> [<ffffffc0000952b8>] __create_mapping.isra.5+0x360/0x530
> [<ffffffc0000954ec>] fixup_init+0x64/0x80
> [<ffffffc0000945a4>] free_initmem+0xc/0x38
> [<ffffffc0005eb9f8>] kernel_init+0x20/0xe0
> [<ffffffc000085c50>] ret_from_fork+0x10/0x40
>
> What I don't get is why we have fixup_init() even when
> !CONFIG_DEBUG_RODATA. It probably breaks the initial mapping just to get
> a non-executable init section. However, the other sections are left
> executable when this config option is disabled. The patch below fixes
> the warnings above:

	Well the kernel permissions are a bit of a mess, and not at all 
"secure" in their current state. But I guess the thought must have been 
that turning off execute on the init sections was a good way to find 
functions incorrectly marked _init(). Which is different from RO. 
Frankly, I expect someone will push to cleanup the kernel permissions at 
some point (I've got it on my "spare time todo, list"), but this will of 
course make the create_mapping_late a lot more popular as I see it being 
called during module load/unload.
	Anyway, I've been saying the problem is create_mapping_late() all 
along, as you notice there isn't any TLB flushes in fixup_init() and 
that is the core of the problem, not all this other discussion.

> Jeremy, can you give this fixup_init() patch a try, see whether it makes
> any difference (it's just a temporary hack which may prevent us from
> reverting the PTE_CONT patches until we have a better solution).

I will try it, but i'm pretty sure that fixes it too.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH] [PATCH] arm64: Boot failure on m400 with new cont PTEs
  2015-11-23 16:02         ` Jeremy Linton
@ 2015-11-23 16:37           ` Laura Abbott
  2015-11-23 16:42             ` Jeremy Linton
  2015-11-23 16:52           ` Catalin Marinas
  1 sibling, 1 reply; 25+ messages in thread
From: Laura Abbott @ 2015-11-23 16:37 UTC (permalink / raw)
  To: linux-arm-kernel

On 11/23/15 8:02 AM, Jeremy Linton wrote:
> On 11/23/2015 09:51 AM, Catalin Marinas wrote:
>> Call trace:
>> [<ffffffc0000952b8>] __create_mapping.isra.5+0x360/0x530
>> [<ffffffc0000954ec>] fixup_init+0x64/0x80
>> [<ffffffc0000945a4>] free_initmem+0xc/0x38
>> [<ffffffc0005eb9f8>] kernel_init+0x20/0xe0
>> [<ffffffc000085c50>] ret_from_fork+0x10/0x40
>>
>> What I don't get is why we have fixup_init() even when
>> !CONFIG_DEBUG_RODATA. It probably breaks the initial mapping just to get
>> a non-executable init section. However, the other sections are left
>> executable when this config option is disabled. The patch below fixes
>> the warnings above:
>
>      Well the kernel permissions are a bit of a mess, and not at all
> "secure" in their current state. But I guess the thought must have been
> that turning off execute on the init sections was a good way to find
> functions incorrectly marked _init(). Which is different from RO.
> Frankly, I expect someone will push to cleanup the kernel permissions at
> some point (I've got it on my "spare time todo, list"), but this will of
> course make the create_mapping_late a lot more popular as I see it being
> called during module load/unload.
>      Anyway, I've been saying the problem is create_mapping_late() all
> along, as you notice there isn't any TLB flushes in fixup_init() and
> that is the core of the problem, not all this other discussion.
>

fixup_init was deliberately designed to change the sections even if
DEBUG_RODATA was not enabled. This was partially designed to match the
behavior of arm(32) which also drops the nx bit but also good practice
in general for security.

Which permissions still need to be cleaned up from your perspective?

Thanks,
Laura

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH] [PATCH] arm64: Boot failure on m400 with new cont PTEs
  2015-11-23 16:37           ` Laura Abbott
@ 2015-11-23 16:42             ` Jeremy Linton
  2015-11-23 17:52               ` Laura Abbott
  2015-11-24  8:04               ` Ard Biesheuvel
  0 siblings, 2 replies; 25+ messages in thread
From: Jeremy Linton @ 2015-11-23 16:42 UTC (permalink / raw)
  To: linux-arm-kernel

On 11/23/2015 10:37 AM, Laura Abbott wrote:
> Which permissions still need to be cleaned up from your perspective?

IMHO, the vast majority of the linear map should not be 
executable/read/write, which is what happens today.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH] [PATCH] arm64: Boot failure on m400 with new cont PTEs
  2015-11-23 16:02         ` Jeremy Linton
  2015-11-23 16:37           ` Laura Abbott
@ 2015-11-23 16:52           ` Catalin Marinas
  2015-11-23 17:24             ` Catalin Marinas
  1 sibling, 1 reply; 25+ messages in thread
From: Catalin Marinas @ 2015-11-23 16:52 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Nov 23, 2015 at 10:02:02AM -0600, Jeremy Linton wrote:
> On 11/23/2015 09:51 AM, Catalin Marinas wrote:
> >Call trace:
> >[<ffffffc0000952b8>] __create_mapping.isra.5+0x360/0x530
> >[<ffffffc0000954ec>] fixup_init+0x64/0x80
> >[<ffffffc0000945a4>] free_initmem+0xc/0x38
> >[<ffffffc0005eb9f8>] kernel_init+0x20/0xe0
> >[<ffffffc000085c50>] ret_from_fork+0x10/0x40
> >
> >What I don't get is why we have fixup_init() even when
> >!CONFIG_DEBUG_RODATA. It probably breaks the initial mapping just to get
> >a non-executable init section. However, the other sections are left
> >executable when this config option is disabled. The patch below fixes
> >the warnings above:
> 
> 	Well the kernel permissions are a bit of a mess, and not at all "secure" in
> their current state. But I guess the thought must have been that turning off
> execute on the init sections was a good way to find functions incorrectly
> marked _init(). Which is different from RO.

fixup_executable() is non-empty only with DEBUG_RODATA, even though it
is done early. I was assuming we should have done the same with
fixup_init() but I've seen Laura's reply that it was deliberate.

> Frankly, I expect someone will push to cleanup the kernel permissions
> at some point (I've got it on my "spare time todo, list"), but this
> will of course make the create_mapping_late a lot more popular as I
> see it being called during module load/unload.
> 	Anyway, I've been saying the problem is create_mapping_late()
> all along, as you notice there isn't any TLB flushes in fixup_init()
> and that is the core of the problem, not all this other discussion.

The problem here is not changing permissions, we do this all the time
with fork() and a single TLB at the end. While the __create_mapping()
path seems to have TLB invalidation when changing a non-empty entry on
higher levels (though without break-before-make), it assumes that the
PTE was always none and no further TLB done. Your patch in this thread
indeed fixes this part (though without a check for pte_none(old_pte) but
that's an optimisation).

However, the problem you are seeing is not some permission change being
ignored (like a stale entry) but a TLB conflict which means that the
same VA has two entries in the TLB. We need better clarification in the
ARM ARM (a ticket about to be raised) but what makes this very likely is
going from a small block mapping to a bigger one. In this case, a
non-contiguous PTE overlapping with a contiguous one loaded via a
different address in the same contiguous range. Adding the TLB flush in
__populate_init_pte() hides this by reducing the window.

We have other cases where we go for smaller to larger block like the 1GB
section. I think until MarkR finishes his code to go via a temporary
TTBR1 + idmap, we should prevent all those. We can hope that going the
other direction (from bigger to smaller block mapping) is fine but we
don't have a clear answer yet.

Your original patch with a better description of the use cases and
potentially a check for pte_none() is still useful, though not a fix for
the TLB conflict. Something like:

----------------8<--------------------
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index abb66f84d4ac..d110313e58e9 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -113,14 +113,20 @@ static void __populate_init_pte(pte_t *pte, unsigned long addr,
 				pgprot_t prot)
 {
 	unsigned long pfn = __phys_to_pfn(phys);
+	bool need_flush = false;
 
 	do {
+		if (!pte_none(*pte))
+			need_flush = true;
 		/* clear all the bits except the pfn, then apply the prot */
 		set_pte(pte, pfn_pte(pfn, prot));
 		pte++;
 		pfn++;
 		addr += PAGE_SIZE;
 	} while (addr != end);
+
+	if (need_flush)
+		flush_tlb_all();
 }
 
 static void alloc_init_pte(pmd_t *pmd, unsigned long addr,
----------------8<--------------------

-- 
Catalin

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH] [PATCH] arm64: Boot failure on m400 with new cont PTEs
  2015-11-23 16:52           ` Catalin Marinas
@ 2015-11-23 17:24             ` Catalin Marinas
  0 siblings, 0 replies; 25+ messages in thread
From: Catalin Marinas @ 2015-11-23 17:24 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Nov 23, 2015 at 04:52:15PM +0000, Catalin Marinas wrote:
> We have other cases where we go for smaller to larger block like the 1GB
> section. I think until MarkR finishes his code to go via a temporary
> TTBR1 + idmap, we should prevent all those. We can hope that going the
> other direction (from bigger to smaller block mapping) is fine but we
> don't have a clear answer yet.

This patch (just briefly tested) prevents going from a smaller block to a
bigger one) and the set_pte() sanity check no longer triggers. We still
get some contiguous entries, though I haven't checked whether they've
been reduced.

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index abb66f84d4ac..b3f3f3e3d827 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -89,6 +89,21 @@ static void split_pmd(pmd_t *pmd, pte_t *pte)
 	} while (pte++, i++, i < PTRS_PER_PTE);
 }
 
+static bool __pte_range_none_or_cont(pte_t *pte)
+{
+	int i;
+
+	for (i = 0; i < CONT_PTES; i++) {
+		if (!pte_none(*pte))
+			return false;
+		if (!pte_cont(*pte))
+			return false;
+		pte++;
+	}
+
+	return true;
+}
+
 /*
  * Given a PTE with the CONT bit set, determine where the CONT range
  * starts, and clear the entire range of PTE CONT bits.
@@ -143,7 +158,8 @@ static void alloc_init_pte(pmd_t *pmd, unsigned long addr,
 	pte = pte_offset_kernel(pmd, addr);
 	do {
 		next = min(end, (addr + CONT_SIZE) & CONT_MASK);
-		if (((addr | next | phys) & ~CONT_MASK) == 0) {
+		if (((addr | next | phys) & ~CONT_MASK) == 0 &&
+		    __pte_range_none_or_cont(pte)) {
 			/* a block of CONT_PTES  */
 			__populate_init_pte(pte, addr, next, phys,
 					    __pgprot(pgprot_val(prot) | PTE_CONT));
@@ -206,25 +222,12 @@ static void alloc_init_pmd(struct mm_struct *mm, pud_t *pud,
 	do {
 		next = pmd_addr_end(addr, end);
 		/* try section mapping first */
-		if (((addr | next | phys) & ~SECTION_MASK) == 0) {
-			pmd_t old_pmd =*pmd;
+		if (((addr | next | phys) & ~SECTION_MASK) == 0 &&
+		    (pmd_none(*pmd) || pmd_sect(*pmd)))
 			set_pmd(pmd, __pmd(phys |
 					   pgprot_val(mk_sect_prot(prot))));
-			/*
-			 * Check for previous table entries created during
-			 * boot (__create_page_tables) and flush them.
-			 */
-			if (!pmd_none(old_pmd)) {
-				flush_tlb_all();
-				if (pmd_table(old_pmd)) {
-					phys_addr_t table = __pa(pte_offset_map(&old_pmd, 0));
-					if (!WARN_ON_ONCE(slab_is_available()))
-						memblock_free(table, PAGE_SIZE);
-				}
-			}
-		} else {
+		else
 			alloc_init_pte(pmd, addr, next, phys, prot, alloc);
-		}
 		phys += next - addr;
 	} while (pmd++, addr = next, addr != end);
 }
@@ -262,29 +265,12 @@ static void alloc_init_pud(struct mm_struct *mm, pgd_t *pgd,
 		/*
 		 * For 4K granule only, attempt to put down a 1GB block
 		 */
-		if (use_1G_block(addr, next, phys)) {
-			pud_t old_pud = *pud;
+		if (use_1G_block(addr, next, phys) &&
+		    (pud_none(*pud) || pud_sect(*pud)))
 			set_pud(pud, __pud(phys |
 					   pgprot_val(mk_sect_prot(prot))));
-
-			/*
-			 * If we have an old value for a pud, it will
-			 * be pointing to a pmd table that we no longer
-			 * need (from swapper_pg_dir).
-			 *
-			 * Look up the old pmd table and free it.
-			 */
-			if (!pud_none(old_pud)) {
-				flush_tlb_all();
-				if (pud_table(old_pud)) {
-					phys_addr_t table = __pa(pmd_offset(&old_pud, 0));
-					if (!WARN_ON_ONCE(slab_is_available()))
-						memblock_free(table, PAGE_SIZE);
-				}
-			}
-		} else {
+		else
 			alloc_init_pmd(mm, pud, addr, next, phys, prot, alloc);
-		}
 		phys += next - addr;
 	} while (pud++, addr = next, addr != end);
 }

-- 
Catalin

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH] [PATCH] arm64: Boot failure on m400 with new cont PTEs
  2015-11-23 16:42             ` Jeremy Linton
@ 2015-11-23 17:52               ` Laura Abbott
  2015-11-23 18:46                 ` Jeremy Linton
  2015-11-24  8:04               ` Ard Biesheuvel
  1 sibling, 1 reply; 25+ messages in thread
From: Laura Abbott @ 2015-11-23 17:52 UTC (permalink / raw)
  To: linux-arm-kernel

On 11/23/15 8:42 AM, Jeremy Linton wrote:
> On 11/23/2015 10:37 AM, Laura Abbott wrote:
>> Which permissions still need to be cleaned up from your perspective?
>
> IMHO, the vast majority of the linear map should not be executable/read/write, which is what happens today.
>

With CONFIG_DEBUG_RODATA on I'm not seeing any rwx regions

# cat /sys/kernel/debug/kernel_page_tables
---[ vmalloc() Area ]---
0xffffff8000000000-0xffffff8000010000          64K     RW NX SHD AF            UXN DEVICE/nGnRE
0xffffff8000012000-0xffffff8000013000           4K     RW NX SHD AF            UXN DEVICE/nGnRE
0xffffff8000018000-0xffffff8000019000           4K     RW NX SHD AF            UXN DEVICE/nGnRE
0xffffff8000020000-0xffffff8000030000          64K     RW NX SHD AF            UXN DEVICE/nGnRE
0xffffff8000031000-0xffffff8000071000         256K     RW NX SHD AF            UXN MEM/NORMAL-NC
0xffffff8000080000-0xffffff8000180000           1M     RW NX SHD AF            UXN DEVICE/nGnRE
0xffffff80001ae000-0xffffff80001b1000          12K     RW NX SHD AF            UXN MEM/NORMAL
0xffffff8000200000-0xffffff8000300000           1M     RW NX SHD AF            UXN DEVICE/nGnRE
0xffffff8000380000-0xffffff8000480000           1M     RW NX SHD AF            UXN DEVICE/nGnRE
0xffffff8000500000-0xffffff8000600000           1M     RW NX SHD AF            UXN DEVICE/nGnRE
0xffffff8000680000-0xffffff8000780000           1M     RW NX SHD AF            UXN DEVICE/nGnRE
0xffffff8000800000-0xffffff8000900000           1M     RW NX SHD AF            UXN DEVICE/nGnRE
0xffffff8000980000-0xffffff8000a80000           1M     RW NX SHD AF            UXN DEVICE/nGnRE
0xffffff8000b00000-0xffffff8000c00000           1M     RW NX SHD AF            UXN DEVICE/nGnRE
0xffffff8000c80000-0xffffff8000d80000           1M     RW NX SHD AF            UXN DEVICE/nGnRE
0xffffff8000e00000-0xffffff8000f00000           1M     RW NX SHD AF            UXN DEVICE/nGnRE
0xffffff8000f80000-0xffffff8001080000           1M     RW NX SHD AF            UXN DEVICE/nGnRE
0xffffff8001100000-0xffffff8001200000           1M     RW NX SHD AF            UXN DEVICE/nGnRE
0xffffff8001280000-0xffffff8001380000           1M     RW NX SHD AF            UXN DEVICE/nGnRE
0xffffff8001400000-0xffffff8001500000           1M     RW NX SHD AF            UXN DEVICE/nGnRE
0xffffff8001580000-0xffffff8001680000           1M     RW NX SHD AF            UXN DEVICE/nGnRE
0xffffff8001700000-0xffffff8001800000           1M     RW NX SHD AF            UXN DEVICE/nGnRE
---[ vmalloc() End ]---
---[ vmemmap start ]---
0xffffffbdc1000000-0xffffffbdc3000000          32M     RW NX SHD AF        BLK UXN MEM/NORMAL
---[ vmemmap end ]---
---[ Fixmap start ]---
0xffffffbffa800000-0xffffffbffaa00000           2M     ro NX SHD AF        BLK UXN MEM/NORMAL
---[ Fixmap end ]---
---[ PCI I/O start ]---
0xffffffbffae00000-0xffffffbffae10000          64K     RW NX SHD AF            UXN DEVICE/nGnRE
---[ PCI I/O end ]---
---[ Modules start ]---
---[ Modules end ]---
---[ Kernel Mapping ]---
0xffffffc000000000-0xffffffc000080000         512K     RW NX SHD AF    CON     UXN MEM/NORMAL
0xffffffc000080000-0xffffffc000082000           8K     RW NX SHD AF            UXN MEM/NORMAL
0xffffffc000082000-0xffffffc000090000          56K     ro x  SHD AF            UXN MEM/NORMAL
0xffffffc000090000-0xffffffc000200000        1472K     ro x  SHD AF    CON     UXN MEM/NORMAL
0xffffffc000200000-0xffffffc000800000           6M     ro x  SHD AF        BLK UXN MEM/NORMAL
0xffffffc000800000-0xffffffc000980000        1536K     ro x  SHD AF    CON     UXN MEM/NORMAL
0xffffffc000980000-0xffffffc000987000          28K     ro x  SHD AF            UXN MEM/NORMAL
0xffffffc000987000-0xffffffc000990000          36K     RW NX SHD AF            UXN MEM/NORMAL
0xffffffc000990000-0xffffffc000a00000         448K     RW NX SHD AF    CON     UXN MEM/NORMAL
0xffffffc000a00000-0xffffffc000a10000          64K     RW NX SHD AF            UXN MEM/NORMAL
0xffffffc000a10000-0xffffffc000c00000        1984K     RW NX SHD AF    CON     UXN MEM/NORMAL
0xffffffc000c00000-0xffffffc040000000        1012M     RW NX SHD AF        BLK UXN MEM/NORMAL
0xffffffc040000000-0xffffffc080000000           1G     RW NX SHD AF        BLK UXN MEM/NORMAL

are there other parts of the map you're seeing that aren't on my system?

Thanks,
Laura

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH] [PATCH] arm64: Boot failure on m400 with new cont PTEs
  2015-11-23 17:52               ` Laura Abbott
@ 2015-11-23 18:46                 ` Jeremy Linton
  0 siblings, 0 replies; 25+ messages in thread
From: Jeremy Linton @ 2015-11-23 18:46 UTC (permalink / raw)
  To: linux-arm-kernel

On 11/23/2015 11:52 AM, Laura Abbott wrote:
> On 11/23/15 8:42 AM, Jeremy Linton wrote:
>> On 11/23/2015 10:37 AM, Laura Abbott wrote:
>>> Which permissions still need to be cleaned up from your perspective?
>>
>> IMHO, the vast majority of the linear map should not be
>> executable/read/write, which is what happens today.
>>
>
> With CONFIG_DEBUG_RODATA on I'm not seeing any rwx regions

> 0xffffffc000000000-0xffffffc000080000         512K     RW NX SHD AF
> CON     UXN MEM/NORMAL
> 0xffffffc000080000-0xffffffc000082000           8K     RW NX SHD
> AF            UXN MEM/NORMAL
> 0xffffffc000082000-0xffffffc000090000          56K     ro x  SHD
> AF            UXN MEM/NORMAL
> 0xffffffc000090000-0xffffffc000200000        1472K     ro x  SHD AF
> CON     UXN MEM/NORMAL
> 0xffffffc000200000-0xffffffc000800000           6M     ro x  SHD
> AF        BLK UXN MEM/NORMAL
> 0xffffffc000800000-0xffffffc000980000        1536K     ro x  SHD AF
> CON     UXN MEM/NORMAL
> 0xffffffc000980000-0xffffffc000987000          28K     ro x  SHD
> AF            UXN MEM/NORMAL
> 0xffffffc000987000-0xffffffc000990000          36K     RW NX SHD
> AF            UXN MEM/NORMAL
> 0xffffffc000990000-0xffffffc000a00000         448K     RW NX SHD AF
> CON     UXN MEM/NORMAL
> 0xffffffc000a00000-0xffffffc000a10000          64K     RW NX SHD
> AF            UXN MEM/NORMAL
> 0xffffffc000a10000-0xffffffc000c00000        1984K     RW NX SHD AF
> CON     UXN MEM/NORMAL
> 0xffffffc000c00000-0xffffffc040000000        1012M     RW NX SHD
> AF        BLK UXN MEM/NORMAL
> 0xffffffc040000000-0xffffffc080000000           1G     RW NX SHD
> AF        BLK UXN MEM/NORMAL
>
> are there other parts of the map you're seeing that aren't on my system?


Thats my mistake, I didn't remember seeing the NX set for the block 
mappings with the RO case, the machine I just checked looks good too. 
So, I you can ignore that little rant.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH] [PATCH] arm64: Boot failure on m400 with new cont PTEs
  2015-11-23 16:42             ` Jeremy Linton
  2015-11-23 17:52               ` Laura Abbott
@ 2015-11-24  8:04               ` Ard Biesheuvel
  1 sibling, 0 replies; 25+ messages in thread
From: Ard Biesheuvel @ 2015-11-24  8:04 UTC (permalink / raw)
  To: linux-arm-kernel

On 23 November 2015 at 17:42, Jeremy Linton <jlinton@redhat.com> wrote:
> On 11/23/2015 10:37 AM, Laura Abbott wrote:
>>
>> Which permissions still need to be cleaned up from your perspective?
>
>
> IMHO, the vast majority of the linear map should not be
> executable/read/write, which is what happens today.
>

IMHO, this should not be hidden under DEBUG_RODATA (which some people
may mistake for a debug option, while others may assume it affects
rodata only) since the kernel stacks are executable without it.

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2015-11-24  8:04 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-11-18 15:03 [PATCH] [PATCH] arm64: Boot failure on m400 with new cont PTEs Jeremy Linton
2015-11-18 15:20 ` Mark Rutland
2015-11-18 16:08   ` Jeremy Linton
2015-11-18 16:29     ` Mark Rutland
2015-11-18 17:14       ` Jeremy Linton
2015-11-18 18:04         ` Mark Rutland
2015-11-18 19:31           ` Jeremy Linton
2015-11-19 11:31             ` Mark Rutland
2015-11-20 19:52               ` Mark Rutland
2015-11-23 12:15                 ` Catalin Marinas
2015-11-23 13:49                   ` Mark Rutland
2015-11-23 14:48                     ` Jeremy Linton
2015-11-23 15:41                       ` Will Deacon
2015-11-23 15:46                         ` Jeremy Linton
2015-11-23 14:31                   ` Jeremy Linton
2015-11-20 20:15               ` Mark Rutland
2015-11-23 15:51       ` Catalin Marinas
2015-11-23 16:02         ` Jeremy Linton
2015-11-23 16:37           ` Laura Abbott
2015-11-23 16:42             ` Jeremy Linton
2015-11-23 17:52               ` Laura Abbott
2015-11-23 18:46                 ` Jeremy Linton
2015-11-24  8:04               ` Ard Biesheuvel
2015-11-23 16:52           ` Catalin Marinas
2015-11-23 17:24             ` Catalin Marinas

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.