* Odd SIGSEGV issue introduced by commit 6b31d5955cb29 ("mm, oom: fix potential data corruption when oom_reaper races with writer")
@ 2018-08-20 15:23 Christophe LEROY
  2018-08-20 16:01 ` Michal Hocko
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Christophe LEROY @ 2018-08-20 15:23 UTC (permalink / raw)
  To: Michal Hocko, Michael Ellerman, Ram Pai, Andrew Morton
  Cc: linuxppc-dev, linux-mm

Hello,

I have an odd issue on my powerpc 8xx board.

I am running the latest 4.14 and get the following SIGSEGVs, which appear 
more or less randomly.

[    9.190354] touch[91]: unhandled signal 11 at 67807b58 nip 777cf114 lr 777cf100 code 30001
[   24.634810] ifconfig[160]: unhandled signal 11 at 67ae7b58 nip 77aaf114 lr 77aaf100 code 30001
[   30.383737] default.deconfi[231]: unhandled signal 11 at 67c8bb58 nip 77c53114 lr 77c53100 code 30001
[   37.655588] S15syslogd[251]: unhandled signal 11 at 6784fb58 nip 77817114 lr 77817100 code 30001
[   40.974649] snmpd[315]: unhandled signal 11 at 67e0bb58 nip 77dd3114 lr 77dd3100 code 30001
[   43.220964] exe[338]: unhandled signal 11 at 67cd3b58 nip 77c9b114 lr 77c9b100 code 30001
[   44.191494] exe[348]: unhandled signal 11 at 67c1fb58 nip 77be7114 lr 77be7100 code 30001
[   59.175022] sleep[655]: unhandled signal 11 at 67ca3b58 nip 77c6b114 lr 77c6b100 code 30001
[   61.853406] smcroute[705]: unhandled signal 11 at 6789bb58 nip 77863114 lr 77863100 code 30001
[   64.662431] smcroute[778]: unhandled signal 11 at 67e03b58 nip 77dcb114 lr 77dcb100 code 30001
[   65.623103] smcroute[795]: unhandled signal 11 at 67bdbb58 nip 77ba3114 lr 77ba3100 code 30001
[   66.579416] exe[825]: unhandled signal 11 at 67edbb58 nip 77ea3114 lr 77ea3100 code 30001
[   68.382941] exe[864]: unhandled signal 11 at 6789bb58 nip 77863114 lr 77863100 code 30001
[   95.187346] exe[1147]: unhandled signal 11 at 67e83b58 nip 77e4b114 lr 77e4b100 code 30001
[  105.238218] exe[1158]: unhandled signal 11 at 67ca3b58 nip 77c6b114 lr 77c6b100 code 30001
[  127.556731] exe[1181]: unhandled signal 11 at 67cc3b58 nip 77c8b114 lr 77c8b100 code 30001
[  135.558982] exe[1195]: unhandled signal 11 at 678d7b58 nip 7789f114 lr 7789f100 code 30001
[  147.579142] exe[1216]: unhandled signal 11 at 67c6bb58 nip 77c33114 lr 77c33100 code 30001
[  175.538747] exe[1262]: unhandled signal 11 at 67e2fb58 nip 77df7114 lr 77df7100 code 30001
[  186.552670] exe[1275]: unhandled signal 11 at 6781fb58 nip 777e7114 lr 777e7100 code 30001
[  230.629786] exe[1344]: unhandled signal 11 at 67cb3b58 nip 77c7b114 lr 77c7b100 code 30001
[  249.640396] repair-service.[1369]: unhandled signal 11 at 67e5fb58 nip 77e27114 lr 77e27100 code 30001
[  378.003410] exe[1593]: unhandled signal 11 at 678d7b58 nip 7789f114 lr 7789f100 code 30001
[  414.060661] exe[1656]: unhandled signal 11 at 67cc7b58 nip 77c8f114 lr 77c8f100 code 30001

The problem is present in 3.13, 3.14 and 3.15.

I bisected its appearance with commit 6b31d5955cb29 ("mm, oom: fix 
potential data corruption when oom_reaper races with writer")

And I bisected its disappearance with commit 99cd1302327a2 ("powerpc: 
Deliver SEGV signal on pkey violation")

Looking at those two commits, especially the one which makes it 
disappear, I'm quite sceptical. Any idea on what could be the cause and/or 
how to investigate further?

Thanks
Christophe


* Re: Odd SIGSEGV issue introduced by commit 6b31d5955cb29 ("mm, oom: fix potential data corruption when oom_reaper races with writer")
  2018-08-20 15:23 Odd SIGSEGV issue introduced by commit 6b31d5955cb29 ("mm, oom: fix potential data corruption when oom_reaper races with writer") Christophe LEROY
@ 2018-08-20 16:01 ` Michal Hocko
  2018-08-20 16:04     ` Christophe LEROY
  2018-08-21  6:40   ` Michael Ellerman
  2018-08-23  1:25   ` Michael Ellerman
  2 siblings, 1 reply; 13+ messages in thread
From: Michal Hocko @ 2018-08-20 16:01 UTC (permalink / raw)
  To: Christophe LEROY
  Cc: Michael Ellerman, Ram Pai, Andrew Morton, linuxppc-dev, linux-mm

On Mon 20-08-18 17:23:58, Christophe LEROY wrote:
> Hello,
> 
> I have an odd issue on my powerpc 8xx board.
> 
> I am running latest 4.14 and get the following SIGSEGV which appears more or
> less randomly.
> 
> [    9.190354] touch[91]: unhandled signal 11 at 67807b58 nip 777cf114 lr
> 777cf100 code 30001
> [   24.634810] ifconfig[160]: unhandled signal 11 at 67ae7b58 nip 77aaf114
> lr 77aaf100 code 30001
> [   30.383737] default.deconfi[231]: unhandled signal 11 at 67c8bb58 nip
> 77c53114 lr 77c53100 code 30001
> [   37.655588] S15syslogd[251]: unhandled signal 11 at 6784fb58 nip 77817114
> lr 77817100 code 30001
> [   40.974649] snmpd[315]: unhandled signal 11 at 67e0bb58 nip 77dd3114 lr
> 77dd3100 code 30001
> [   43.220964] exe[338]: unhandled signal 11 at 67cd3b58 nip 77c9b114 lr
> 77c9b100 code 30001
> [   44.191494] exe[348]: unhandled signal 11 at 67c1fb58 nip 77be7114 lr
> 77be7100 code 30001
> [   59.175022] sleep[655]: unhandled signal 11 at 67ca3b58 nip 77c6b114 lr
> 77c6b100 code 30001
> [   61.853406] smcroute[705]: unhandled signal 11 at 6789bb58 nip 77863114
> lr 77863100 code 30001
> [   64.662431] smcroute[778]: unhandled signal 11 at 67e03b58 nip 77dcb114
> lr 77dcb100 code 30001
> [   65.623103] smcroute[795]: unhandled signal 11 at 67bdbb58 nip 77ba3114
> lr 77ba3100 code 30001
> [   66.579416] exe[825]: unhandled signal 11 at 67edbb58 nip 77ea3114 lr
> 77ea3100 code 30001
> [   68.382941] exe[864]: unhandled signal 11 at 6789bb58 nip 77863114 lr
> 77863100 code 30001
> [   95.187346] exe[1147]: unhandled signal 11 at 67e83b58 nip 77e4b114 lr
> 77e4b100 code 30001
> [  105.238218] exe[1158]: unhandled signal 11 at 67ca3b58 nip 77c6b114 lr
> 77c6b100 code 30001
> [  127.556731] exe[1181]: unhandled signal 11 at 67cc3b58 nip 77c8b114 lr
> 77c8b100 code 30001
> [  135.558982] exe[1195]: unhandled signal 11 at 678d7b58 nip 7789f114 lr
> 7789f100 code 30001
> [  147.579142] exe[1216]: unhandled signal 11 at 67c6bb58 nip 77c33114 lr
> 77c33100 code 30001
> [  175.538747] exe[1262]: unhandled signal 11 at 67e2fb58 nip 77df7114 lr
> 77df7100 code 30001
> [  186.552670] exe[1275]: unhandled signal 11 at 6781fb58 nip 777e7114 lr
> 777e7100 code 30001
> [  230.629786] exe[1344]: unhandled signal 11 at 67cb3b58 nip 77c7b114 lr
> 77c7b100 code 30001
> [  249.640396] repair-service.[1369]: unhandled signal 11 at 67e5fb58 nip
> 77e27114 lr 77e27100 code 30001
> [  378.003410] exe[1593]: unhandled signal 11 at 678d7b58 nip 7789f114 lr
> 7789f100 code 30001
> [  414.060661] exe[1656]: unhandled signal 11 at 67cc7b58 nip 77c8f114 lr
> 77c8f100 code 30001
> 
> The problem is present in 3.13, 3.14 and 3.15.
> 
> I bisected its appearance with commit 6b31d5955cb29 ("mm, oom: fix potential
> data corruption when oom_reaper races with writer")

Do you see any oom killer invocations preceding the SEGV? Some of those
killed tasks simply do not look like sensible oom victims (e.g.
touch)...

> And I bisected its disappearance with commit 99cd1302327a2 ("powerpc:
> Deliver SEGV signal on pkey violation")

Those two seem completely unrelated.

-- 
Michal Hocko
SUSE Labs


* Re: Odd SIGSEGV issue introduced by commit 6b31d5955cb29 ("mm, oom: fix potential data corruption when oom_reaper races with writer")
  2018-08-20 16:01 ` Michal Hocko
@ 2018-08-20 16:04     ` Christophe LEROY
  0 siblings, 0 replies; 13+ messages in thread
From: Christophe LEROY @ 2018-08-20 16:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Michael Ellerman, Ram Pai, Andrew Morton, linuxppc-dev, linux-mm



On 20/08/2018 at 18:01, Michal Hocko wrote:
> On Mon 20-08-18 17:23:58, Christophe LEROY wrote:
>> Hello,
>>
>> I have an odd issue on my powerpc 8xx board.
>>
>> I am running latest 4.14 and get the following SIGSEGV which appears more or
>> less randomly.
>>
>> [    9.190354] touch[91]: unhandled signal 11 at 67807b58 nip 777cf114 lr
>> 777cf100 code 30001
>> [   24.634810] ifconfig[160]: unhandled signal 11 at 67ae7b58 nip 77aaf114
>> lr 77aaf100 code 30001
>> [   30.383737] default.deconfi[231]: unhandled signal 11 at 67c8bb58 nip
>> 77c53114 lr 77c53100 code 30001
>> [   37.655588] S15syslogd[251]: unhandled signal 11 at 6784fb58 nip 77817114
>> lr 77817100 code 30001
>> [   40.974649] snmpd[315]: unhandled signal 11 at 67e0bb58 nip 77dd3114 lr
>> 77dd3100 code 30001
>> [   43.220964] exe[338]: unhandled signal 11 at 67cd3b58 nip 77c9b114 lr
>> 77c9b100 code 30001
>> [   44.191494] exe[348]: unhandled signal 11 at 67c1fb58 nip 77be7114 lr
>> 77be7100 code 30001
>> [   59.175022] sleep[655]: unhandled signal 11 at 67ca3b58 nip 77c6b114 lr
>> 77c6b100 code 30001
>> [   61.853406] smcroute[705]: unhandled signal 11 at 6789bb58 nip 77863114
>> lr 77863100 code 30001
>> [   64.662431] smcroute[778]: unhandled signal 11 at 67e03b58 nip 77dcb114
>> lr 77dcb100 code 30001
>> [   65.623103] smcroute[795]: unhandled signal 11 at 67bdbb58 nip 77ba3114
>> lr 77ba3100 code 30001
>> [   66.579416] exe[825]: unhandled signal 11 at 67edbb58 nip 77ea3114 lr
>> 77ea3100 code 30001
>> [   68.382941] exe[864]: unhandled signal 11 at 6789bb58 nip 77863114 lr
>> 77863100 code 30001
>> [   95.187346] exe[1147]: unhandled signal 11 at 67e83b58 nip 77e4b114 lr
>> 77e4b100 code 30001
>> [  105.238218] exe[1158]: unhandled signal 11 at 67ca3b58 nip 77c6b114 lr
>> 77c6b100 code 30001
>> [  127.556731] exe[1181]: unhandled signal 11 at 67cc3b58 nip 77c8b114 lr
>> 77c8b100 code 30001
>> [  135.558982] exe[1195]: unhandled signal 11 at 678d7b58 nip 7789f114 lr
>> 7789f100 code 30001
>> [  147.579142] exe[1216]: unhandled signal 11 at 67c6bb58 nip 77c33114 lr
>> 77c33100 code 30001
>> [  175.538747] exe[1262]: unhandled signal 11 at 67e2fb58 nip 77df7114 lr
>> 77df7100 code 30001
>> [  186.552670] exe[1275]: unhandled signal 11 at 6781fb58 nip 777e7114 lr
>> 777e7100 code 30001
>> [  230.629786] exe[1344]: unhandled signal 11 at 67cb3b58 nip 77c7b114 lr
>> 77c7b100 code 30001
>> [  249.640396] repair-service.[1369]: unhandled signal 11 at 67e5fb58 nip
>> 77e27114 lr 77e27100 code 30001
>> [  378.003410] exe[1593]: unhandled signal 11 at 678d7b58 nip 7789f114 lr
>> 7789f100 code 30001
>> [  414.060661] exe[1656]: unhandled signal 11 at 67cc7b58 nip 77c8f114 lr
>> 77c8f100 code 30001
>>
>> The problem is present in 3.13, 3.14 and 3.15.
>>
>> I bisected its appearance with commit 6b31d5955cb29 ("mm, oom: fix potential
>> data corruption when oom_reaper races with writer")
> 
> Do you see any oom killer invocations preceding the SEGV? Some of those
> killed tasks simply do not look like sensible oom victims (e.g.
> touch)...

No, I don't see any.

> 
>> And I bisected its disappearance with commit 99cd1302327a2 ("powerpc:
>> Deliver SEGV signal on pkey violation")
> 
> Those two seem completely unrelated.
> 

That's my feeling too, hence my incredulity.


* Re: Odd SIGSEGV issue introduced by commit 6b31d5955cb29 ("mm, oom: fix potential data corruption when oom_reaper races with writer")
  2018-08-20 15:23 Odd SIGSEGV issue introduced by commit 6b31d5955cb29 ("mm, oom: fix potential data corruption when oom_reaper races with writer") Christophe LEROY
@ 2018-08-21  6:40   ` Michael Ellerman
  2018-08-21  6:40   ` Michael Ellerman
  2018-08-23  1:25   ` Michael Ellerman
  2 siblings, 0 replies; 13+ messages in thread
From: Michael Ellerman @ 2018-08-21  6:40 UTC (permalink / raw)
  To: Christophe LEROY, Michal Hocko, Ram Pai, Andrew Morton
  Cc: linuxppc-dev, linux-mm

Christophe LEROY <christophe.leroy@c-s.fr> writes:
...
>
> And I bisected its disappearance with commit 99cd1302327a2 ("powerpc: 
> Deliver SEGV signal on pkey violation")

Whoa that's weird.

> Looking at those two commits, especially the one which makes it 
> disappear, I'm quite sceptical. Any idea on what could be the cause and/or 
> how to investigate further?

Are you sure it's not some corruption that just happens to be masked by
that commit? I can't see anything in that commit that could explain that
change in behaviour.

The only real change is if you're hitting DSISR_KEYFAULT isn't it?

What happens if you take 087003e9ef7c and apply the various hunks from
99cd1302327a2 gradually (or those that you can anyway)?

cheers


* Re: Odd SIGSEGV issue introduced by commit 6b31d5955cb29 ("mm, oom: fix potential data corruption when oom_reaper races with writer")
  2018-08-21  6:40   ` Michael Ellerman
  (?)
@ 2018-08-21 17:50   ` Ram Pai
  2018-08-22  8:19       ` Christophe LEROY
  -1 siblings, 1 reply; 13+ messages in thread
From: Ram Pai @ 2018-08-21 17:50 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Christophe LEROY, Michal Hocko, Andrew Morton, linuxppc-dev, linux-mm

On Tue, Aug 21, 2018 at 04:40:15PM +1000, Michael Ellerman wrote:
> Christophe LEROY <christophe.leroy@c-s.fr> writes:
> ...
> >
> > And I bisected its disappearance with commit 99cd1302327a2 ("powerpc: 
> > Deliver SEGV signal on pkey violation")
> 
> Whoa that's weird.
> 
> > Looking at those two commits, especially the one which makes it 
> > disappear, I'm quite sceptical. Any idea on what could be the cause and/or 
> > how to investigate further?
> 
> Are you sure it's not some corruption that just happens to be masked by
> that commit? I can't see anything in that commit that could explain that
> change in behaviour.
> 
> The only real change is if you're hitting DSISR_KEYFAULT isn't it?

Even with commit 99cd1302327a2, a SEGV signal should get generated,
which should kill the process, unless the process handles SEGV signals 
with SEGV_PKUERR differently.

The other surprising thing is, why is DSISR_KEYFAULT getting generated
in the first place?  Are keys somehow getting programmed into the HPTE?

Feels like some random corruption.

Is this behavior seen with power8 or power9?

RP


* Re: Odd SIGSEGV issue introduced by commit 6b31d5955cb29 ("mm, oom: fix potential data corruption when oom_reaper races with writer")
  2018-08-21 17:50   ` Ram Pai
@ 2018-08-22  8:19       ` Christophe LEROY
  0 siblings, 0 replies; 13+ messages in thread
From: Christophe LEROY @ 2018-08-22  8:19 UTC (permalink / raw)
  To: Ram Pai, Michael Ellerman
  Cc: Michal Hocko, Andrew Morton, linuxppc-dev, linux-mm



On 21/08/2018 at 19:50, Ram Pai wrote:
> On Tue, Aug 21, 2018 at 04:40:15PM +1000, Michael Ellerman wrote:
>> Christophe LEROY <christophe.leroy@c-s.fr> writes:
>> ...
>>>
>>> And I bisected its disappearance with commit 99cd1302327a2 ("powerpc:
>>> Deliver SEGV signal on pkey violation")
>>
>> Whoa that's weird.
>>
>>> Looking at those two commits, especially the one which makes it
>>> disappear, I'm quite sceptical. Any idea on what could be the cause and/or
>>> how to investigate further?
>>
>> Are you sure it's not some corruption that just happens to be masked by
>> that commit? I can't see anything in that commit that could explain that
>> change in behaviour.
>>
>> The only real change is if you're hitting DSISR_KEYFAULT isn't it?
> 
> even with the 'commit 99cd1302327a2', a SEGV signal should get generated;
> which should kill the process. Unless the process handles SEGV signals
> with SEGV_PKUERR differently.

No, the SIGSEGVs are not handled differently, and the trace shows that it is 
SEGV_MAPERR which is generated.
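
For readers decoding the log lines, here is a minimal Python sketch of how the
trailing "code 30001" field maps to SEGV_MAPERR. It assumes that kernels of
this era tag fault-class si_code values with __SI_FAULT (3 << 16) in the upper
16 bits; that constant and the helper name decode_segv_code are assumptions
made for illustration and are not taken from this thread.

```python
# Decode the "code" field of an "unhandled signal 11" log line.
# ASSUMPTION: the in-kernel si_code carries a class tag in its upper 16 bits,
# with __SI_FAULT == 3 << 16 on pre-4.16 kernels. Verify against your tree.
SI_FAULT_CLASS = 3

# Standard SIGSEGV si_code values.
SEGV_CODES = {
    1: "SEGV_MAPERR",  # address not mapped to an object
    2: "SEGV_ACCERR",  # invalid permissions for the mapped object
    3: "SEGV_BNDERR",  # failed address bound checks
    4: "SEGV_PKUERR",  # access denied by a protection key
}


def decode_segv_code(raw_hex: str) -> str:
    """Split the logged hex value into class tag and si_code, then name it."""
    value = int(raw_hex, 16)
    si_class, si_code = value >> 16, value & 0xFFFF
    if si_class != SI_FAULT_CLASS:
        return f"not a fault-class code (class {si_class})"
    return SEGV_CODES.get(si_code, f"unknown SEGV code {si_code}")


print(decode_segv_code("30001"))
```

Under that assumption the low half of 0x30001 is 1, which is SEGV_MAPERR; a
pkey violation would have shown up with a low half of 4 instead.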

> 
> The other surprising thing is, why is DSISR_KEYFAULT getting generated
> in the first place?  Are keys somehow getting programmed into the HPTE?

Can't be that, because DSISR_KEYFAULT is filtered out when applying the 
DSISR_SRR1_MATCH_32S mask.
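
The bit arithmetic behind that statement can be sketched in Python as follows.
The numeric values are quoted from memory of arch/powerpc/include/asm/reg.h
and should be treated as assumptions to check against the tree under test; the
helper names are hypothetical, and the sketch shows only the masking, not the
real exception entry path.

```python
# ASSUMED DSISR bit values (recalled from arch/powerpc/include/asm/reg.h).
DSISR_NOHPTE = 0x40000000       # no translation found
DSISR_NOEXEC_OR_G = 0x10000000  # no-execute or guarded storage
DSISR_PROTFAULT = 0x08000000    # protection fault
DSISR_KEYFAULT = 0x00200000     # storage key fault

# ASSUMED composition of the 32-bit mask: the key fault bit is not part of it.
DSISR_SRR1_MATCH_32S = DSISR_NOHPTE | DSISR_NOEXEC_OR_G | DSISR_PROTFAULT


def apply_match_32s(status: int) -> int:
    """Keep only the bits selected by the 32-bit match mask."""
    return status & DSISR_SRR1_MATCH_32S


def key_fault_survives(status: int) -> bool:
    """Report whether a key fault bit is still visible after masking."""
    return bool(apply_match_32s(status) & DSISR_KEYFAULT)


raw = DSISR_NOHPTE | DSISR_KEYFAULT
print(hex(apply_match_32s(raw)), key_fault_survives(raw))
```

With these values the key fault bit can never reach the fault handler on a
32-bit build, whatever the hardware reports.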

> 
> Feels like some random corruption.

In a way yes, except that it is always at the same instruction (in 
ld.so) and always because the accessed address is 0x67xxxxxx instead of 
0x77xxxxxx.
I also tested with TASK_SIZE set to 0xa0000000 instead of 0x80000000, 
and I get the same failure, with the bad address being 0x87xxxxxx instead of 
0x97xxxxxx.
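
To illustrate that pattern, here is a short Python sketch run over three of
the log lines from the original report. In both cases above (0x67 vs 0x77 and
0x87 vs 0x97) the bad address differs from the expected one by the single bit
0x10000000, so the sketch sets that bit back and compares the result with nip.
The names SUSPECT_BIT and analyse are made up for this example, and treating
"fault address with the bit set" as the intended address is an assumption, not
a diagnosis.

```python
import re

# Three lines copied from the report (timestamps dropped).
LOG = """\
touch[91]: unhandled signal 11 at 67807b58 nip 777cf114 lr 777cf100 code 30001
ifconfig[160]: unhandled signal 11 at 67ae7b58 nip 77aaf114 lr 77aaf100 code 30001
snmpd[315]: unhandled signal 11 at 67e0bb58 nip 77dd3114 lr 77dd3100 code 30001
"""

PATTERN = re.compile(r"at ([0-9a-f]{8}) nip ([0-9a-f]{8})")
SUSPECT_BIT = 0x10000000  # the bit separating 0x67xxxxxx from 0x77xxxxxx


def analyse(log: str):
    """Return (fault, nip, offset) where offset = (fault | SUSPECT_BIT) - nip."""
    rows = []
    for fault_hex, nip_hex in PATTERN.findall(log):
        fault, nip = int(fault_hex, 16), int(nip_hex, 16)
        # ASSUMPTION: the intended address is the fault address with the bit set.
        rows.append((fault, nip, (fault | SUSPECT_BIT) - nip))
    return rows


for fault, nip, offset in analyse(LOG):
    print(f"fault {fault:08x}  nip {nip:08x}  repaired-nip offset {offset:#x}")
```

For these three lines the offset comes out as 0x38a44 every time, which fits
the picture of one instruction computing an address in the same mapping and
ending up with one bit cleared.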

Christophe

> 
> Is this behavior seen with power8 or power9?
> 
> RP
> 


* Re: Odd SIGSEGV issue introduced by commit 6b31d5955cb29 ("mm, oom: fix potential data corruption when oom_reaper races with writer")
  2018-08-22  8:19       ` Christophe LEROY
@ 2018-08-22 22:55         ` Ram Pai
  -1 siblings, 0 replies; 13+ messages in thread
From: Ram Pai @ 2018-08-22 22:55 UTC (permalink / raw)
  To: Christophe LEROY
  Cc: Michael Ellerman, Michal Hocko, Andrew Morton, linuxppc-dev, linux-mm

On Wed, Aug 22, 2018 at 10:19:02AM +0200, Christophe LEROY wrote:
> 
> 
> On 21/08/2018 at 19:50, Ram Pai wrote:
> >On Tue, Aug 21, 2018 at 04:40:15PM +1000, Michael Ellerman wrote:
> >>Christophe LEROY <christophe.leroy@c-s.fr> writes:
> >>...
> >>>
> >>>And I bisected its disappearance with commit 99cd1302327a2 ("powerpc:
> >>>Deliver SEGV signal on pkey violation")
> >>
> >>Whoa that's weird.
> >>
> >>>Looking at those two commits, especially the one which makes it
> >>>disappear, I'm quite sceptical. Any idea on what could be the cause and/or
> >>>how to investigate further?
> >>
> >>Are you sure it's not some corruption that just happens to be masked by
> >>that commit? I can't see anything in that commit that could explain that
> >>change in behaviour.
> >>
> >>The only real change is if you're hitting DSISR_KEYFAULT isn't it?
> >
> >even with the 'commit 99cd1302327a2', a SEGV signal should get generated;
> >which should kill the process. Unless the process handles SEGV signals
> >with SEGV_PKUERR differently.
> 
> No, the SIGSEGVs are not handled differently, and the trace shows that it
> is SEGV_MAPERR which is generated.
> 
> >
> >The other surprising thing is, why is DSISR_KEYFAULT getting generated
> >in the first place?  Are keys somehow getting programmed into the HPTE?
> 
> Can't be that, because DSISR_KEYFAULT is filtered out when applying
> DSISR_SRR1_MATCH_32S mask.

Ah.. in that case, 99cd1302327a2 does nothing to fix the problem.

Are you sure it is this patch that fixes the problem?


RP


* Re: Odd SIGSEGV issue introduced by commit 6b31d5955cb29 ("mm, oom: fix potential data corruption when oom_reaper races with writer")
  2018-08-20 15:23 Odd SIGSEGV issue introduced by commit 6b31d5955cb29 ("mm, oom: fix potential data corruption when oom_reaper races with writer") Christophe LEROY
@ 2018-08-23  1:25   ` Michael Ellerman
  2018-08-21  6:40   ` Michael Ellerman
  2018-08-23  1:25   ` Michael Ellerman
  2 siblings, 0 replies; 13+ messages in thread
From: Michael Ellerman @ 2018-08-23  1:25 UTC (permalink / raw)
  To: Christophe LEROY, Michal Hocko, Ram Pai, Andrew Morton
  Cc: linuxppc-dev, linux-mm

Christophe LEROY <christophe.leroy@c-s.fr> writes:
> Hello,
>
> I have an odd issue on my powerpc 8xx board.
>
> I am running latest 4.14 and get the following SIGSEGV which appears 
> more or less randomly.
>
> [    9.190354] touch[91]: unhandled signal 11 at 67807b58 nip 777cf114 
> lr 777cf100 code 30001
> [   24.634810] ifconfig[160]: unhandled signal 11 at 67ae7b58 nip 
> 77aaf114 lr 77aaf100 code 30001


It would be interesting to see the code dump here and which registers
are being used.

Can you backport the show unhandled signal changes and see what that
shows us?

cheers

