* crashes in 4.10 because of "parisc: Enable KASLR"
@ 2017-02-01 17:37 Mikulas Patocka
  2017-02-01 19:54 ` Helge Deller
  0 siblings, 1 reply; 9+ messages in thread
From: Mikulas Patocka @ 2017-02-01 17:37 UTC (permalink / raw)
  To: Helge Deller; +Cc: linux-parisc, John David Anglin

Hi

I had some crashes on parisc with the kernel 4.10 and I tracked them to 
the patch 18d98a79382cbe5a7569788d5b7b18e7015506f2 "parisc: Enable KASLR".

The crashes can be reproduced by compiling the kernel in a loop - a crash 
happens within a few hours. I have a C8000 workstation with two dual-core 
PA8900 CPUs.

The kernel 4.9 is stable, 4.9 with the patch 18d98a79 isn't.

Mikulas


* Re: crashes in 4.10 because of "parisc: Enable KASLR"
  2017-02-01 17:37 crashes in 4.10 because of "parisc: Enable KASLR" Mikulas Patocka
@ 2017-02-01 19:54 ` Helge Deller
  2017-02-01 20:10   ` Mikulas Patocka
  2017-02-01 22:29   ` Aaro Koskinen
  0 siblings, 2 replies; 9+ messages in thread
From: Helge Deller @ 2017-02-01 19:54 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: linux-parisc, John David Anglin

Hi Mikulas,

On 01.02.2017 18:37, Mikulas Patocka wrote:
> I had some crashes on parisc with the kernel 4.10 and I tracked them to 
> the patch 18d98a79382cbe5a7569788d5b7b18e7015506f2 "parisc: Enable KASLR".

If the patch is really guilty, then it's probably best to revert it
before 4.10 gets released, but...

> The crashes can be reproduced by compiling the kernel in a loop - a crash 
> happens within a few hours. I have a C8000 workstation with two dual-core 
> PA8900 CPUs.
> 
> The kernel 4.9 is stable, 4.9 with the patch 18d98a79 isn't.

I'm not 100% convinced that 4.9 is fully stable and that the patch
is the reason for the crashes you see.
What kind of crashes do you see? Userspace or kernel ?
I know we still face random userspace crashes which happen with plain kernel 4.9/stable too.
Dave is still trying various patches to resolve those crashes.

Helge


* Re: crashes in 4.10 because of "parisc: Enable KASLR"
  2017-02-01 19:54 ` Helge Deller
@ 2017-02-01 20:10   ` Mikulas Patocka
  2017-02-01 21:12     ` John David Anglin
  2017-02-01 22:29   ` Aaro Koskinen
  1 sibling, 1 reply; 9+ messages in thread
From: Mikulas Patocka @ 2017-02-01 20:10 UTC (permalink / raw)
  To: Helge Deller; +Cc: linux-parisc, John David Anglin



On Wed, 1 Feb 2017, Helge Deller wrote:

> Hi Mikulas,
> 
> On 01.02.2017 18:37, Mikulas Patocka wrote:
> > I had some crashes on parisc with the kernel 4.10 and I tracked them to 
> > the patch 18d98a79382cbe5a7569788d5b7b18e7015506f2 "parisc: Enable KASLR".
> 
> If the patch is really guilty, then it's probably best to revert it
> before 4.10 gets released, but...
> 
> > The crashes can be reproduced by compiling the kernel in a loop - a crash 
> > happens within a few hours. I have a C8000 workstation with two dual-core 
> > PA8900 CPUs.
> > 
> > The kernel 4.9 is stable, 4.9 with the patch 18d98a79 isn't.
> 
> I'm not 100% convinced that 4.9 is fully stable and that the patch
> is the reason for the crashes you see.
> What kind of crashes do you see? Userspace or kernel ?

Userspace crashes. Random crashes or internal errors in gcc when compiling 
the kernel. I once had "aptitude" crash.

> I know we still face random userspace crashes which happen with plain kernel 4.9/stable too.
> Dave is still trying various patches to resolve those crashes.

Is 4.8 stable for you?

> Helge

Mikulas


* Re: crashes in 4.10 because of "parisc: Enable KASLR"
  2017-02-01 20:10   ` Mikulas Patocka
@ 2017-02-01 21:12     ` John David Anglin
  2017-12-08 11:22       ` Mikulas Patocka
  0 siblings, 1 reply; 9+ messages in thread
From: John David Anglin @ 2017-02-01 21:12 UTC (permalink / raw)
  To: Mikulas Patocka, Helge Deller; +Cc: linux-parisc

On 2017-02-01 3:10 PM, Mikulas Patocka wrote:
>> I'm not 100% convinced that 4.9 is fully stable and that the patch
>> is the reason for the crashes you see.
>> What kind of crashes do you see? Userspace or kernel ?
> Userspace crashes. Random crashes or internal errors in gcc when compiling
> the kernel. I once had "aptitude" crash.
The userspace crashes are present in 4.8 and 4.9 as well.  For example, 
this build failed due to an OS problem:
https://buildd.debian.org/status/fetch.php?pkg=kdenlive&arch=hppa&ver=16.12.1-2&stamp=1485956026&raw=0

Probably, 10% or more large packages fail to build because of this. Note
that this only occurs on machines (e.g., c8000) that only support
equivalent aliases.  We don't see this on the parisc buildd which has two
PA8600 CPUs.

My current theory is the following functions are buggy:

/* vmap range flushes and invalidates.  Architecturally, we don't need
  * the invalidate, because the CPU should refuse to speculate once an
  * area has been flushed, so invalidate is left empty */
static inline void flush_kernel_vmap_range(void *vaddr, int size)
{
         unsigned long start = (unsigned long)vaddr;

         flush_kernel_dcache_range_asm(start, start + size);
}
static inline void invalidate_kernel_vmap_range(void *vaddr, int size)
{
         unsigned long start = (unsigned long)vaddr;
         void *cursor = vaddr;

         for ( ; cursor < vaddr + size; cursor += PAGE_SIZE) {
                 struct page *page = vmalloc_to_page(cursor);

                 if (test_and_clear_bit(PG_dcache_dirty, &page->flags))
                         flush_kernel_dcache_page(page);
         }
         flush_kernel_dcache_range_asm(start, start + size);
}

The kernel sets up a vmap range for I/O and we have non equivalent
aliases to the offset map pages.  I know the PG_dcache_dirty is never set
when these routines are called, so the for loop does nothing.  Nuking the
whole data cache appears to fix the application errors but my test was cut
short by a second problem.  No one else seems to do anything with offset
map, so we might have a parisc specific driver problem.
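
As a side illustration of the page walk in invalidate_kernel_vmap_range()
above, the following user-space sketch counts how many iterations the quoted
for loop performs and how many pages the byte range touches.  It assumes
4 KiB pages; MODEL_PAGE_SIZE, loop_steps() and pages_spanned() are
hypothetical names, not kernel symbols, and nothing in this thread
establishes whether callers ever pass a range that is not page-aligned.

```c
#include <assert.h>

#define MODEL_PAGE_SHIFT 12                        /* assumption: 4 KiB pages */
#define MODEL_PAGE_SIZE  (1UL << MODEL_PAGE_SHIFT)

/* Number of iterations of a loop shaped like the quoted one:
 * start at vaddr, step by one page, stop at vaddr + size. */
static unsigned long loop_steps(unsigned long vaddr, unsigned long size)
{
    unsigned long steps = 0;

    assert(vaddr + size >= vaddr);                 /* no address wrap-around */
    for (unsigned long cursor = vaddr; cursor < vaddr + size;
         cursor += MODEL_PAGE_SIZE)
        steps++;
    return steps;
}

/* Number of distinct pages touched by the bytes [vaddr, vaddr + size). */
static unsigned long pages_spanned(unsigned long vaddr, unsigned long size)
{
    if (size == 0)
        return 0;
    return ((vaddr + size - 1) >> MODEL_PAGE_SHIFT)
           - (vaddr >> MODEL_PAGE_SHIFT) + 1;
}
```

For a page-aligned range such as loop_steps(0x10000, 0x3000) the two counts
agree.  They can differ only when the start address is not page-aligned,
because the loop steps from vaddr itself rather than from the page base.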

We also have a down_read/up_read problem where applications stall
forever and are not killable (D state in top).  Some seemed related to
signal processing but they have occurred in other situations as well.
They seem more prevalent.  For example, I can't remember this happening
with 3.18 branch.  This problem seems to be triggered by application
tests involving multiple threads (glibc, gcc go and libgomp, and mariadb).

Dave

-- 
John David Anglin  dave.anglin@bell.net



* Re: crashes in 4.10 because of "parisc: Enable KASLR"
  2017-02-01 19:54 ` Helge Deller
  2017-02-01 20:10   ` Mikulas Patocka
@ 2017-02-01 22:29   ` Aaro Koskinen
  2017-02-02  0:25     ` John David Anglin
  1 sibling, 1 reply; 9+ messages in thread
From: Aaro Koskinen @ 2017-02-01 22:29 UTC (permalink / raw)
  To: Helge Deller; +Cc: Mikulas Patocka, linux-parisc, John David Anglin

Hi,

On Wed, Feb 01, 2017 at 08:54:57PM +0100, Helge Deller wrote:
> I know we still face random userspace crashes which happen with plain kernel 4.9/stable too.
> Dave is still trying various patches to resolve those crashes.

FWIW, I was able to run full GCC 6 bootstrap (> 20 hours) on C3700/PA8700
without any issues using plain 4.9 kernel.

A.


* Re: crashes in 4.10 because of "parisc: Enable KASLR"
  2017-02-01 22:29   ` Aaro Koskinen
@ 2017-02-02  0:25     ` John David Anglin
  0 siblings, 0 replies; 9+ messages in thread
From: John David Anglin @ 2017-02-02  0:25 UTC (permalink / raw)
  To: Aaro Koskinen; +Cc: Helge Deller, Mikulas Patocka, linux-parisc

On 2017-02-01, at 5:29 PM, Aaro Koskinen wrote:

> On Wed, Feb 01, 2017 at 08:54:57PM +0100, Helge Deller wrote:
>> I know we still face random userspace crashes which happen with plain kernel 4.9/stable too.
>> Dave is still trying various patches to resolve those crashes.
> 
> FWIW, I was able to run full GCC 6 bootstrap (> 20 hours) on C3700/PA8700
> without any issues using plain 4.9 kernel.

As noted in my previous message, the problem appears to only occur on machines that only support
equivalent aliases (PA8800/PA8900).  The C3700 doesn't have the large L2 cache.

Dave
--
John David Anglin	dave.anglin@bell.net





* Re: crashes in 4.10 because of "parisc: Enable KASLR"
  2017-02-01 21:12     ` John David Anglin
@ 2017-12-08 11:22       ` Mikulas Patocka
  2017-12-10 21:42         ` John David Anglin
  0 siblings, 1 reply; 9+ messages in thread
From: Mikulas Patocka @ 2017-12-08 11:22 UTC (permalink / raw)
  To: John David Anglin; +Cc: Helge Deller, linux-parisc



On Wed, 1 Feb 2017, John David Anglin wrote:

> On 2017-02-01 3:10 PM, Mikulas Patocka wrote:
> > > I'm not 100% convinced that 4.9 is fully stable and that the patch
> > > is the reason for the crashes you see.
> > > What kind of crashes do you see? Userspace or kernel ?
> > Userspace crashes. Random crashes or internal errors in gcc when compiling
> > the kernel. I once had "aptitude" crash.
> The userspace crashes are present in 4.8 and 4.9 as well.  For example, this
> build failed due to an OS problem:
> https://buildd.debian.org/status/fetch.php?pkg=kdenlive&arch=hppa&ver=16.12.1-2&stamp=1485956026&raw=0
> 
> Probably, 10% or more large packages fail to build because of this. Note that
> this only occurs on machines
> (e.g., c8000) that only support equivalent aliases.  We don't see this on the
> parisc buildd which has two PA8600 CPUs.
> 
> My current theory is the following functions are buggy:
> 
> /* vmap range flushes and invalidates.  Architecturally, we don't need
>  * the invalidate, because the CPU should refuse to speculate once an
>  * area has been flushed, so invalidate is left empty */
> static inline void flush_kernel_vmap_range(void *vaddr, int size)
> {
>         unsigned long start = (unsigned long)vaddr;
> 
>         flush_kernel_dcache_range_asm(start, start + size);
> }
> static inline void invalidate_kernel_vmap_range(void *vaddr, int size)
> {
>         unsigned long start = (unsigned long)vaddr;
>         void *cursor = vaddr;
> 
>         for ( ; cursor < vaddr + size; cursor += PAGE_SIZE) {
>                 struct page *page = vmalloc_to_page(cursor);
> 
>                 if (test_and_clear_bit(PG_dcache_dirty, &page->flags))
>                         flush_kernel_dcache_page(page);
>         }
>         flush_kernel_dcache_range_asm(start, start + size);
> }

BTW, if you flush a cache line, then - according to the PA-RISC 
specification - the page stays in the TLB and the CPU can speculatively 
fetch anything that is mapped in the TLB. So such a flush may effectively 
have no effect.

The kernel should first flush TLB for the affected range and then flush 
the data using the tmpalias mapping.
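
The ordering argument can be illustrated with a toy user-space model (not
kernel code).  It assumes a single page with one TLB entry and one cache
line, and that speculative move-in is possible exactly while the page's
translation is present.  struct page_model and every function name below
are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: one page, one translation, one data cache line. */
struct page_model {
    bool tlb_present;   /* the page's own translation is in the TLB */
    bool line_cached;   /* a line of the page sits in the data cache */
};

/* Model assumption: the CPU may move a line in only while a
 * translation for the page exists. */
static void speculate(struct page_model *p)
{
    assert(p);
    if (p->tlb_present)
        p->line_cached = true;
}

/* Flush through a temporary alias: drops the line without creating
 * or needing the page's own translation. */
static void flush_via_alias(struct page_model *p)
{
    assert(p);
    p->line_cached = false;
}

static void purge_tlb(struct page_model *p)
{
    assert(p);
    p->tlb_present = false;
}

/* Flush only: the translation survives, so speculation can refill
 * the line.  Returns whether the line is cached afterwards. */
static bool flush_only(void)
{
    struct page_model p = { true, true };

    flush_via_alias(&p);
    speculate(&p);
    return p.line_cached;
}

/* Purge the translation first, then flush through the alias. */
static bool purge_then_alias_flush(void)
{
    struct page_model p = { true, true };

    purge_tlb(&p);
    flush_via_alias(&p);
    speculate(&p);
    return p.line_cached;
}
```

In this model flush_only() leaves the line cached again after speculation,
while purge_then_alias_flush() does not.  Real hardware has many more
states; the sketch only captures the dependency between a live translation
and move-in.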

Mikulas

> The kernel sets up a vmap range for I/O and we have non equivalent aliases to
> the offset map
> pages.  I know the PG_dcache_dirty is never set when these routines are
> called, so the for loop
> does nothing.  Nuking the whole data cache appears to fix the application
> errors but my test
> was cut short by a second problem.  No one else seems to do anything with
> offset map, so
> we might have a parisc specific driver problem.
> 
> We also have a down_read/up_read problem where applications stall forever and
> are not killable
> (D state in top).  Some seemed related to signal processing but they have
> occurred in other
> situations as well.  They seem more prevalent.  For example, I can't remember
> this happening
> with 3.18 branch.  This problem seems to be triggered by application tests
> involving multiple
> threads (glibc, gcc go and libgomp, and mariadb).
> 
> Dave
> 
> -- 
> John David Anglin  dave.anglin@bell.net
> 


* Re: crashes in 4.10 because of "parisc: Enable KASLR"
  2017-12-08 11:22       ` Mikulas Patocka
@ 2017-12-10 21:42         ` John David Anglin
  2017-12-11 15:12           ` John David Anglin
  0 siblings, 1 reply; 9+ messages in thread
From: John David Anglin @ 2017-12-10 21:42 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Helge Deller, linux-parisc

On 2017-12-08, at 6:22 AM, Mikulas Patocka wrote:

> 
> On Wed, 1 Feb 2017, John David Anglin wrote:
> 
>> On 2017-02-01 3:10 PM, Mikulas Patocka wrote:
>>>> I'm not 100% convinced that 4.9 is fully stable and that the patch
>>>> is the reason for the crashes you see.
>>>> What kind of crashes do you see? Userspace or kernel ?
>>> Userspace crashes. Random crashes or internal errors in gcc when compiling
>>> the kernel. I once had "aptitude" crash.
>> The userspace crashes are present in 4.8 and 4.9 as well.  For example, this
>> build failed due to an OS problem:
>> https://buildd.debian.org/status/fetch.php?pkg=kdenlive&arch=hppa&ver=16.12.1-2&stamp=1485956026&raw=0
>> 
>> Probably, 10% or more large packages fail to build because of this. Note that
>> this only occurs on machines
>> (e.g., c8000) that only support equivalent aliases.  We don't see this on the
>> parisc buildd which has two PA8600 CPUs.
>> 
>> My current theory is the following functions are buggy:
>> 
>> /* vmap range flushes and invalidates.  Architecturally, we don't need
>> * the invalidate, because the CPU should refuse to speculate once an
>> * area has been flushed, so invalidate is left empty */
>> static inline void flush_kernel_vmap_range(void *vaddr, int size)
>> {
>>        unsigned long start = (unsigned long)vaddr;
>> 
>>        flush_kernel_dcache_range_asm(start, start + size);
>> }
>> static inline void invalidate_kernel_vmap_range(void *vaddr, int size)
>> {
>>        unsigned long start = (unsigned long)vaddr;
>>        void *cursor = vaddr;
>> 
>>        for ( ; cursor < vaddr + size; cursor += PAGE_SIZE) {
>>                struct page *page = vmalloc_to_page(cursor);
>> 
>>                if (test_and_clear_bit(PG_dcache_dirty, &page->flags))
>>                        flush_kernel_dcache_page(page);
>>        }
>>        flush_kernel_dcache_range_asm(start, start + size);
>> }
> 
> BTW, if you flush a cache line, then - according to the PA-RISC 
> specification - the page stays in the TLB and the CPU can speculatively 
> fetch anything that is mapped in the TLB. So such a flush may effectively 
> have no effect.
> 
> The kernel should first flush TLB for the affected range and then flush 
> the data using the tmpalias mapping.

I agree.  Flushing using the tmpalias mapping handles cache move-in correctly, but at the moment
we only have routines to flush whole pages.  I think the big problem is that we don't create translations
for non-access TLB misses correctly.  See top of page F-11.  We should set access rights to 0 or 1
to prevent I-cache move-in, and the T bit to 1 to prevent D-cache move-in.  As things stand, we set up
the TLB entry for non-access exceptions the same way as we do for normal access exceptions.  As a result,
cache flushes may themselves cause a problem.

I had always wondered why this code is backwards:

void flush_kernel_dcache_page_addr(void *addr)
{
        unsigned long flags;

        flush_kernel_dcache_page_asm(addr);
        purge_tlb_start(flags);
        pdtlb_kernel(addr);
        purge_tlb_end(flags);
}

I did try reversing the order yesterday and it seemed to increase the number of random segmentation faults.
As it stands, there is a bit of a race between the cache flush and the TLB purge.

Dave
--
John David Anglin	dave.anglin@bell.net





* Re: crashes in 4.10 because of "parisc: Enable KASLR"
  2017-12-10 21:42         ` John David Anglin
@ 2017-12-11 15:12           ` John David Anglin
  0 siblings, 0 replies; 9+ messages in thread
From: John David Anglin @ 2017-12-11 15:12 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Helge Deller, linux-parisc

On 2017-12-10 4:42 PM, John David Anglin wrote:
> I had always wondered why this code is backwards:
>
> void flush_kernel_dcache_page_addr(void *addr)
> {
>          unsigned long flags;
>
>          flush_kernel_dcache_page_asm(addr);
>          purge_tlb_start(flags);
>          pdtlb_kernel(addr);
>          purge_tlb_end(flags);
> }
>
> I did try reversing the order yesterday and it seemed to increase the number of random segmentation faults.
> As it stands, there is a bit of a race between the cache flush and the TLB purge.
Actually, the order is understandable given that flush_kernel_dcache_page_asm
uses the same translation as regular accesses.  So, purging the translation
before the cache flush does nothing and speculation can corrupt the cache.

I think there are a few places where we have the order wrong.

We might be able to use the tmpalias flush_dcache_page_asm routine for 
the above.
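
The effect of the two orderings can be sketched with a toy user-space model
(not kernel code).  It assumes that a flush issued through the page's
regular kernel translation reinstalls that translation on a TLB miss, as
described earlier in this thread for non-access misses.  struct kpage_model
and all function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: one kernel page with its regular translation. */
struct kpage_model {
    bool tlb_present;   /* regular kernel translation is in the TLB */
    bool line_cached;   /* a line of the page sits in the data cache */
};

/* Flush using the same translation as regular accesses.  Model
 * assumption: the flush needs the translation, so a miss installs it. */
static void flush_via_kernel_mapping(struct kpage_model *p)
{
    assert(p);
    p->tlb_present = true;
    p->line_cached = false;
}

static void purge_kernel_tlb(struct kpage_model *p)
{
    assert(p);
    p->tlb_present = false;
}

/* Purge first, flush second: returns whether a translation is left
 * behind once both steps have run. */
static bool translation_left_after_purge_then_flush(void)
{
    struct kpage_model p = { true, true };

    purge_kernel_tlb(&p);
    flush_via_kernel_mapping(&p);
    return p.tlb_present;
}

/* Flush first, purge second (the order in the quoted function). */
static bool translation_left_after_flush_then_purge(void)
{
    struct kpage_model p = { true, true };

    flush_via_kernel_mapping(&p);
    purge_kernel_tlb(&p);
    return p.tlb_present;
}
```

In this model purge-then-flush ends with the translation present again, so
later speculation remains possible, whereas flush-then-purge ends with it
gone.  The model does not capture the window between the flush and the
purge in the second ordering, which is the race mentioned above.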

Dave

-- 
John David Anglin  dave.anglin@bell.net


