linuxppc-dev.lists.ozlabs.org archive mirror
* Problems with swapping in v4.5-rc on POWER
@ 2016-02-25  2:10 Hugh Dickins
  2016-02-25  4:12 ` Michael Ellerman
  2016-02-25  4:52 ` Aneesh Kumar K.V
  0 siblings, 2 replies; 12+ messages in thread
From: Hugh Dickins @ 2016-02-25  2:10 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Paul Mackerras, linuxppc-dev, linux-mm

I've plagiarized the subject from Paulus's "Problems with THP" mail
last weekend; but my similar problems are on PowerMac G5 baremetal,
with 4kB pages, not capable of THP and no THP configured in.

Under heavily swapping load, running kernel builds on tmpfs in limited
memory, I've been seeing random segfaults too, internal compiler errors
etc.  Not easily reproduced: sometimes happens in minutes, sometimes
not for several hours.

I tried and failed to construct a reproducer for you: my lack of a good
recipe has deterred me from reporting it, and seeing Paulus's mail on
THP gave me hope that the answer would come up in that thread; but no,
that was quickly resolved as a THP issue, since fixed.

(Mine had appeared to be fixed in v4.5-rc4 anyway; but I guess I
just didn't try hard enough, it resurfaced on -rc5 immediately.)

I've seen no sign of such problems on x86.  And I saw no sign of such
problems on v4.4-rc8-mm1, when I included the fixes to the _PAGE_PTE
and _PAGE_SWP_SOFT_DIRTY swapoff issues we discussed back then (in
33 hours of load, should be good enough; but did see such problems
a couple of times before including those fixes - I took them to be
a side-effect of the page flags issue, but now rather doubt that).

The minutes or hours thing: I wonder if that indicates a missing
initialization somewhere: that can easily show up soon after booting,
but then the machine settles into a steady state of reusing the same
structures, now initialized; until much later something disturbs the
state and it has to allocate more.  Sheer speculation, but I wonder.

Hugh

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Problems with swapping in v4.5-rc on POWER
  2016-02-25  2:10 Problems with swapping in v4.5-rc on POWER Hugh Dickins
@ 2016-02-25  4:12 ` Michael Ellerman
  2016-02-25  5:36   ` Hugh Dickins
  2016-02-25  4:52 ` Aneesh Kumar K.V
  1 sibling, 1 reply; 12+ messages in thread
From: Michael Ellerman @ 2016-02-25  4:12 UTC (permalink / raw)
  To: Hugh Dickins, Aneesh Kumar K.V; +Cc: linux-mm, linuxppc-dev

On Wed, 2016-02-24 at 18:10 -0800, Hugh Dickins via Linuxppc-dev wrote:

> I've plagiarized the subject from Paulus's "Problems with THP" mail
> last weekend; but my similar problems are on PowerMac G5 baremetal,
> with 4kB pages, not capable of THP and no THP configured in.
> 
> Under heavily swapping load, running kernel builds on tmpfs in limited
> memory, I've been seeing random segfaults too, internal compiler errors
> etc.  Not easily reproduced: sometimes happens in minutes, sometimes
> not for several hours.
> 
> I tried and failed to construct a reproducer for you: my lack of a good
> recipe has deterred me from reporting it, and seeing Paulus's mail on
> THP gave me hope that the answer would come up in that thread; but no,
> that was quickly resolved as a THP issue, since fixed.
> 
> (Mine had appeared to be fixed in v4.5-rc4 anyway; but I guess I
> just didn't try hard enough, it resurfaced on -rc5 immediately.)
> 
> I've seen no sign of such problems on x86.  And I saw no sign of such
> problems on v4.4-rc8-mm1, when I included the fixes to the _PAGE_PTE
> and _PAGE_SWP_SOFT_DIRTY swapoff issues we discussed back then (in
> 33 hours of load, should be good enough; but did see such problems
> a couple of times before including those fixes - I took them to be
> a side-effect of the page flags issue, but now rather doubt that).
> 
> The minutes or hours thing: I wonder if that indicates a missing
> initialization somewhere: that can easily show up soon after booting,
> but then the machine settles into a steady state of reusing the same
> structures, now initialized; until much later something disturbs the
> state and it has to allocate more.  Sheer speculation, but I wonder.

Thanks Hugh.

I do run tests on G5, but obviously not rigorously enough. I kicked off a few
kernel builds on mine and it survived, though once it hits swap it's almost
unusably slow. I'll leave it running overnight and see if I hit anything.

cheers

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Problems with swapping in v4.5-rc on POWER
  2016-02-25  2:10 Problems with swapping in v4.5-rc on POWER Hugh Dickins
  2016-02-25  4:12 ` Michael Ellerman
@ 2016-02-25  4:52 ` Aneesh Kumar K.V
  2016-02-25  5:43   ` Hugh Dickins
  1 sibling, 1 reply; 12+ messages in thread
From: Aneesh Kumar K.V @ 2016-02-25  4:52 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Paul Mackerras, linuxppc-dev, linux-mm

Hugh Dickins <hughd@google.com> writes:

> I've plagiarized the subject from Paulus's "Problems with THP" mail
> last weekend; but my similar problems are on PowerMac G5 baremetal,
> with 4kB pages, not capable of THP and no THP configured in.
>
> Under heavily swapping load, running kernel builds on tmpfs in limited
> memory, I've been seeing random segfaults too, internal compiler errors
> etc.  Not easily reproduced: sometimes happens in minutes, sometimes
> not for several hours.
>
> I tried and failed to construct a reproducer for you: my lack of a good
> recipe has deterred me from reporting it, and seeing Paulus's mail on
> THP gave me hope that the answer would come up in that thread; but no,
> that was quickly resolved as a THP issue, since fixed.
>
> (Mine had appeared to be fixed in v4.5-rc4 anyway; but I guess I
> just didn't try hard enough, it resurfaced on -rc5 immediately.)
>
> I've seen no sign of such problems on x86.  And I saw no sign of such
> problems on v4.4-rc8-mm1, when I included the fixes to the _PAGE_PTE
> and _PAGE_SWP_SOFT_DIRTY swapoff issues we discussed back then (in
> 33 hours of load, should be good enough; but did see such problems
> a couple of times before including those fixes - I took them to be
> a side-effect of the page flags issue, but now rather doubt that).
>

Can you test the impact of the merge listed below? (I.e., revert the merge and see if
we can reproduce, and also verify with the merge applied.) This will give us a
set of commits to look at more closely. We had quite a lot of page table
related changes going in during this merge window.

f689b742f217b2ffe7 ("Pull powerpc updates from Michael Ellerman:")

That is the merge commit that added _PAGE_PTE. 


> The minutes or hours thing: I wonder if that indicates a missing
> initialization somewhere: that can easily show up soon after booting,
> but then the machine settles into a steady state of reusing the same
> structures, now initialized; until much later something disturbs the
> state and it has to allocate more.  Sheer speculation, but I wonder.
>


-aneesh

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Problems with swapping in v4.5-rc on POWER
  2016-02-25  4:12 ` Michael Ellerman
@ 2016-02-25  5:36   ` Hugh Dickins
  0 siblings, 0 replies; 12+ messages in thread
From: Hugh Dickins @ 2016-02-25  5:36 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: Hugh Dickins, Aneesh Kumar K.V, linux-mm, linuxppc-dev

On Thu, 25 Feb 2016, Michael Ellerman wrote:
> 
> I do run tests on G5, but obviously not rigorously enough. I kicked off a few
> kernel builds on mine and it survived, though once it hits swap it's almost
> unusably slow. I'll leave it running overnight and see if I hit anything.

Oh yes, I'd forgotten how unusably slow: I tend to forget that I slipped an
SSD in there some while back, just for the swapping: slow, but not unusable.

Thanks, I'm hoping you will be able to reproduce it yourself.

Hugh

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Problems with swapping in v4.5-rc on POWER
  2016-02-25  4:52 ` Aneesh Kumar K.V
@ 2016-02-25  5:43   ` Hugh Dickins
  2016-02-25 21:35     ` Hugh Dickins
  0 siblings, 1 reply; 12+ messages in thread
From: Hugh Dickins @ 2016-02-25  5:43 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Hugh Dickins, Paul Mackerras, linuxppc-dev, linux-mm

On Thu, 25 Feb 2016, Aneesh Kumar K.V wrote:
> 
> Can you test the impact of the merge listed below? (I.e., revert the merge and see if
> we can reproduce, and also verify with the merge applied.) This will give us a
> set of commits to look at more closely. We had quite a lot of page table
> related changes going in during this merge window.
> 
> f689b742f217b2ffe7 ("Pull powerpc updates from Michael Ellerman:")
> 
> That is the merge commit that added _PAGE_PTE. 

Another experiment running on it at the moment, I'd like to give that
a few more hours, and then will try the revert you suggest.  But does
that merge revert cleanly, did you try?  I'm afraid of interactions,
whether obvious or subtle, with the THP refcounting rework.  Oh, since
I don't have THP configured on, maybe I can ignore any issues from that.

Hugh

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Problems with swapping in v4.5-rc on POWER
  2016-02-25  5:43   ` Hugh Dickins
@ 2016-02-25 21:35     ` Hugh Dickins
  2016-02-26 10:04       ` Hugh Dickins
  0 siblings, 1 reply; 12+ messages in thread
From: Hugh Dickins @ 2016-02-25 21:35 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Michael Ellerman, Paul Mackerras, linuxppc-dev, linux-mm

On Wed, 24 Feb 2016, Hugh Dickins wrote:
> On Thu, 25 Feb 2016, Aneesh Kumar K.V wrote:
> > 
> > Can you test the impact of the merge listed below? (I.e., revert the merge and see if
> > we can reproduce, and also verify with the merge applied.) This will give us a
> > set of commits to look at more closely. We had quite a lot of page table
> > related changes going in during this merge window.
> > 
> > f689b742f217b2ffe7 ("Pull powerpc updates from Michael Ellerman:")
> > 
> > That is the merge commit that added _PAGE_PTE. 
> 
> Another experiment running on it at the moment, I'd like to give that
> a few more hours, and then will try the revert you suggest.  But does
> that merge revert cleanly, did you try?  I'm afraid of interactions,
> whether obvious or subtle, with the THP refcounting rework.  Oh, since
> I don't have THP configured on, maybe I can ignore any issues from that.

That revert worked painlessly, only a very few and simple conflicts,
I ran that under load for 12 hours, no problem seen.

I've now checked out an f689b742 tree and started on that, just to
confirm that it fails fairly quickly I hope; and will then proceed
to git bisect, giving that as bad and 37cea93b as good.

Given the uncertainty of whether 12 hours is really long enough to be
sure, and perhaps difficulties along the way, I don't rate my chances
of a reliable bisection higher than 60%, but we'll see.

Hugh
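
The doubt above about whether 12 hours is long enough can be put in rough
numbers. The following is only a sketch under an assumption that is not from
the thread: that a kernel containing the bug fails after an exponentially
distributed time with some mean. The function names and the example figures
are illustrative, not measurements.

```python
import math


def p_false_good(run_hours: float, mean_hours_to_failure: float) -> float:
    """Chance that a kernel containing the bug survives run_hours.

    Assumes (hypothetically) an exponential time-to-failure with the given mean.
    """
    return math.exp(-run_hours / mean_hours_to_failure)


def p_all_bad_steps_caught(bad_steps: int, run_hours: float,
                           mean_hours_to_failure: float) -> float:
    """Chance that every buggy kernel tested during a bisection fails in time.

    bad_steps is the number of tested kernels that really contain the bug.
    A truly good kernel never fails, so only these steps can be mislabelled.
    """
    caught = 1.0 - p_false_good(run_hours, mean_hours_to_failure)
    return caught ** bad_steps
```

Under this model, a single wrongly marked "good" step is enough to send the
bisection to an unrelated commit, so the per-step survival chance compounds
quickly when the mean time to failure is a sizeable fraction of the run time.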

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Problems with swapping in v4.5-rc on POWER
  2016-02-25 21:35     ` Hugh Dickins
@ 2016-02-26 10:04       ` Hugh Dickins
  2016-03-02 20:49         ` Hugh Dickins
  0 siblings, 1 reply; 12+ messages in thread
From: Hugh Dickins @ 2016-02-26 10:04 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Michael Ellerman, Paul Mackerras, linuxppc-dev, linux-mm

On Thu, 25 Feb 2016, Hugh Dickins wrote:
> On Wed, 24 Feb 2016, Hugh Dickins wrote:
> > On Thu, 25 Feb 2016, Aneesh Kumar K.V wrote:
> > > 
> > > Can you test the impact of the merge listed below? (I.e., revert the merge and see if
> > > we can reproduce, and also verify with the merge applied.) This will give us a
> > > set of commits to look at more closely. We had quite a lot of page table
> > > related changes going in during this merge window.
> > > 
> > > f689b742f217b2ffe7 ("Pull powerpc updates from Michael Ellerman:")
> > > 
> > > That is the merge commit that added _PAGE_PTE. 
> > 
> > Another experiment running on it at the moment, I'd like to give that
> > a few more hours, and then will try the revert you suggest.  But does
> > that merge revert cleanly, did you try?  I'm afraid of interactions,
> > whether obvious or subtle, with the THP refcounting rework.  Oh, since
> > I don't have THP configured on, maybe I can ignore any issues from that.
> 
> That revert worked painlessly, only a very few and simple conflicts,
> I ran that under load for 12 hours, no problem seen.
> 
> I've now checked out an f689b742 tree and started on that, just to
> confirm that it fails fairly quickly I hope; and will then proceed
> to git bisect, giving that as bad and 37cea93b as good.
> 
> Given the uncertainty of whether 12 hours is really long enough to be
> sure, and perhaps difficulties along the way, I don't rate my chances
> of a reliable bisection higher than 60%, but we'll see.

I'm sure you won't want a breathless report from me on each bisection
step, but I ought to report that: contrary to our expectations, the
f689b742 survived without error for 12 hours, so appears to be good.
I'll bisect between there and v4.5-rc1.

Hugh

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Problems with swapping in v4.5-rc on POWER
  2016-02-26 10:04       ` Hugh Dickins
@ 2016-03-02 20:49         ` Hugh Dickins
  2016-03-03  5:51           ` Michael Ellerman
  0 siblings, 1 reply; 12+ messages in thread
From: Hugh Dickins @ 2016-03-02 20:49 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Hugh Dickins, Michael Ellerman, Paul Mackerras, linuxppc-dev, linux-mm

On Fri, 26 Feb 2016, Hugh Dickins wrote:
> On Thu, 25 Feb 2016, Hugh Dickins wrote:
> > On Wed, 24 Feb 2016, Hugh Dickins wrote:
> > > On Thu, 25 Feb 2016, Aneesh Kumar K.V wrote:
> > > > 
> > > > Can you test the impact of the merge listed below? (I.e., revert the merge and see if
> > > > we can reproduce, and also verify with the merge applied.) This will give us a
> > > > set of commits to look at more closely. We had quite a lot of page table
> > > > related changes going in during this merge window.
> > > > 
> > > > f689b742f217b2ffe7 ("Pull powerpc updates from Michael Ellerman:")
> > > > 
> > > > That is the merge commit that added _PAGE_PTE. 
> > > 
> > > Another experiment running on it at the moment, I'd like to give that
> > > a few more hours, and then will try the revert you suggest.  But does
> > > that merge revert cleanly, did you try?  I'm afraid of interactions,
> > > whether obvious or subtle, with the THP refcounting rework.  Oh, since
> > > I don't have THP configured on, maybe I can ignore any issues from that.
> > 
> > That revert worked painlessly, only a very few and simple conflicts,
> > I ran that under load for 12 hours, no problem seen.
> > 
> > I've now checked out an f689b742 tree and started on that, just to
> > confirm that it fails fairly quickly I hope; and will then proceed
> > to git bisect, giving that as bad and 37cea93b as good.
> > 
> > Given the uncertainty of whether 12 hours is really long enough to be
> > sure, and perhaps difficulties along the way, I don't rate my chances
> > of a reliable bisection higher than 60%, but we'll see.
> 
> I'm sure you won't want a breathless report from me on each bisection
> step, but I ought to report that: contrary to our expectations, the
> f689b742 survived without error for 12 hours, so appears to be good.
> I'll bisect between there and v4.5-rc1.

The bisection completed this morning (log appended below):
not a satisfactory conclusion, it's pointing to a davem/net merge.

I was uncomfortable when I marked that point bad in the first place:
it ran for 9 hours before hitting a compiler error, which was nearly
twice as long as the longest I'd seen before (5 hours), and
uncomfortably close to the 12 hours I've been taking as good.

My current thinking is that the powerpc merge that you indicated,
that I found to be "good", is the one that contains the bad commit;
but that the bug is very rare to manifest in that kernel, and my test
of the davem/net merge happened to be unusually unlucky to hit it.

Then some other later change makes it significantly easier to hit;
and identifying that change may make it much easier to pin down
what the original bug is.

So I've replayed the bisection up to that point, marked the davem/net
merge as good this time, and set off again in the hope that it will
lead somewhere more enlightening.  But prepared for disappointment.

Hugh

git bisect start
# good: [f689b742f217b2ffe7925f8a6521b208ee995309] Merge tag 'powerpc-4.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
git bisect good f689b742f217b2ffe7925f8a6521b208ee995309
# bad: [92e963f50fc74041b5e9e744c330dca48e04f08d] Linux 4.5-rc1
git bisect bad 92e963f50fc74041b5e9e744c330dca48e04f08d
# bad: [7f36f1b2a8c4f55f8226ed6c8bb4ed6de11c4015] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/ide
git bisect bad 7f36f1b2a8c4f55f8226ed6c8bb4ed6de11c4015
# bad: [6606b342febfd470b4a33acb73e360eeaca1d9bb] Merge git://www.linux-watchdog.org/linux-watchdog
git bisect bad 6606b342febfd470b4a33acb73e360eeaca1d9bb
# good: [d0021d3bdfe9d551859bca1f58da0e6be8e26043] Merge remote-tracking branch 'asoc/topic/wm8960' into asoc-next
git bisect good d0021d3bdfe9d551859bca1f58da0e6be8e26043
# good: [e3315b439c30c208582ac64e58f0c0d36b83181e] ALSA: oxfw: allocate own address region for SCS.1 series
git bisect good e3315b439c30c208582ac64e58f0c0d36b83181e
# good: [3da834e3e5a4a5d26882955298b55a9ed37a00bc] clk: remove duplicated COMMON_CLK_NXP record from clk/Kconfig
git bisect good 3da834e3e5a4a5d26882955298b55a9ed37a00bc
# bad: [e535d74bc50df2357d3253f8f3ca48c66d0d892a] Merge tag 'docs-4.5' of git://git.lwn.net/linux
git bisect bad e535d74bc50df2357d3253f8f3ca48c66d0d892a
# bad: [4e5448a31d73d0e944b7adb9049438a09bc332cb] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
git bisect bad 4e5448a31d73d0e944b7adb9049438a09bc332cb
# good: [b70ce2ab41cb67ab3d661eda078f7c4029bbca95] dts: hisi: fixes no syscon fault when init mdio
git bisect good b70ce2ab41cb67ab3d661eda078f7c4029bbca95
# good: [4a658527271bce43afb1cf4feec89afe6716ca59] xen-netback: delete NAPI instance when queue fails to initialize
git bisect good 4a658527271bce43afb1cf4feec89afe6716ca59
# good: [c6894dec8ea9ae05747124dce98b3b5c2e69b168] bridge: fix lockdep addr_list_lock false positive splat
git bisect good c6894dec8ea9ae05747124dce98b3b5c2e69b168
# good: [36beca6571c941b28b0798667608239731f9bc3a] sparc64: Fix numa node distance initialization
git bisect good 36beca6571c941b28b0798667608239731f9bc3a
# good: [750afbf8ee9c6a1c74a1fe5fc9852146b1d72687] bgmac: Fix reversed test of build_skb() return value.
git bisect good 750afbf8ee9c6a1c74a1fe5fc9852146b1d72687
# good: [5a18d263f8d27418c98b8e8551dadfe975c054e3] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc
git bisect good 5a18d263f8d27418c98b8e8551dadfe975c054e3
# first bad commit: [4e5448a31d73d0e944b7adb9049438a09bc332cb] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
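
Replaying a bisection up to a suspect verdict and flipping it, as described
in the message above, can be done by editing the log and feeding it to
"git bisect replay". The thread does not say how it was actually done; the
helper below is a hypothetical sketch that truncates a log at the verdict for
a given commit and records that commit as good instead.

```python
def replay_with_flipped_verdict(log_text: str, sha_prefix: str) -> str:
    """Return a bisect log cut off at the verdict for sha_prefix,
    with that verdict rewritten to 'good'.

    Comment lines ('# good: ...', '# bad: ...') are dropped so that no stale
    annotation remains; only the 'git bisect' command lines are kept.
    """
    kept = []
    for line in log_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        parts = line.split()
        is_verdict = (
            len(parts) == 4
            and parts[:2] == ["git", "bisect"]
            and parts[2] in ("good", "bad")
        )
        if is_verdict and parts[3].startswith(sha_prefix):
            # Flip the suspect verdict and discard everything after it.
            kept.append(f"git bisect good {parts[3]}")
            return "\n".join(kept) + "\n"
        kept.append(line)
    raise ValueError(f"no verdict found for commit {sha_prefix!r}")
```

The returned text would be written to a file and passed to
"git bisect replay <file>" after a "git bisect reset"; bisection then
continues from the flipped point.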

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Problems with swapping in v4.5-rc on POWER
  2016-03-02 20:49         ` Hugh Dickins
@ 2016-03-03  5:51           ` Michael Ellerman
  2016-03-04 17:58             ` Hugh Dickins
  0 siblings, 1 reply; 12+ messages in thread
From: Michael Ellerman @ 2016-03-03  5:51 UTC (permalink / raw)
  To: Hugh Dickins, Aneesh Kumar K.V; +Cc: Paul Mackerras, linuxppc-dev, linux-mm

On Wed, 2016-03-02 at 12:49 -0800, Hugh Dickins wrote:
> On Fri, 26 Feb 2016, Hugh Dickins wrote:
> > On Thu, 25 Feb 2016, Hugh Dickins wrote:
> > > On Wed, 24 Feb 2016, Hugh Dickins wrote:
> > > > On Thu, 25 Feb 2016, Aneesh Kumar K.V wrote:
> > > > > 
> > > > > Can you test the impact of the merge listed below? (I.e., revert the merge and see if
> > > > > we can reproduce, and also verify with the merge applied.) This will give us a
> > > > > set of commits to look at more closely. We had quite a lot of page table
> > > > > related changes going in during this merge window.
> > > > > 
> > > > > f689b742f217b2ffe7 ("Pull powerpc updates from Michael Ellerman:")
> > > > > 
> > > > > That is the merge commit that added _PAGE_PTE. 
> > > > 
> > > > Another experiment running on it at the moment, I'd like to give that
> > > > a few more hours, and then will try the revert you suggest.  But does
> > > > that merge revert cleanly, did you try?  I'm afraid of interactions,
> > > > whether obvious or subtle, with the THP refcounting rework.  Oh, since
> > > > I don't have THP configured on, maybe I can ignore any issues from that.
> > > 
> > > That revert worked painlessly, only a very few and simple conflicts,
> > > I ran that under load for 12 hours, no problem seen.
> > > 
> > > I've now checked out an f689b742 tree and started on that, just to
> > > confirm that it fails fairly quickly I hope; and will then proceed
> > > to git bisect, giving that as bad and 37cea93b as good.
> > > 
> > > Given the uncertainty of whether 12 hours is really long enough to be
> > > sure, and perhaps difficulties along the way, I don't rate my chances
> > > of a reliable bisection higher than 60%, but we'll see.
> > 
> > I'm sure you won't want a breathless report from me on each bisection
> > step, but I ought to report that: contrary to our expectations, the
> > f689b742 survived without error for 12 hours, so appears to be good.
> > I'll bisect between there and v4.5-rc1.
> 
> The bisection completed this morning (log appended below):
> not a satisfactory conclusion, it's pointing to a davem/net merge.
> 
> I was uncomfortable when I marked that point bad in the first place:
> it ran for 9 hours before hitting a compiler error, which was nearly
> twice as long as the longest I'd seen before (5 hours), and
> uncomfortably close to the 12 hours I've been taking as good.
> 
> My current thinking is that the powerpc merge that you indicated,
> that I found to be "good", is the one that contains the bad commit;
> but that the bug is very rare to manifest in that kernel, and my test
> of the davem/net merge happened to be unusually unlucky to hit it.
> 
> Then some other later change makes it significantly easier to hit;
> and identifying that change may make it much easier to pin down
> what the original bug is.
> 
> So I've replayed the bisection up to that point, marked the davem/net
> merge as good this time, and set off again in the hope that it will
> lead somewhere more enlightening.  But prepared for disappointment.

Thanks Hugh. That logic sounds reasonable, I doubt we can blame davem :)

I've set up another box here to try and reproduce it. It's running with 4k
pages, no THP, and it's going well into swap. Hopefully I can hit the same bug,
but we'll see in 12 hours I guess.

cheers

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Problems with swapping in v4.5-rc on POWER
  2016-03-03  5:51           ` Michael Ellerman
@ 2016-03-04 17:58             ` Hugh Dickins
  2016-03-07  3:00               ` Michael Ellerman
  0 siblings, 1 reply; 12+ messages in thread
From: Hugh Dickins @ 2016-03-04 17:58 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Hugh Dickins, Aneesh Kumar K.V, Paul Mackerras, linuxppc-dev, linux-mm

On Thu, 3 Mar 2016, Michael Ellerman wrote:
> On Wed, 2016-03-02 at 12:49 -0800, Hugh Dickins wrote:
> > On Fri, 26 Feb 2016, Hugh Dickins wrote:
> > > On Thu, 25 Feb 2016, Hugh Dickins wrote:
> > > > On Wed, 24 Feb 2016, Hugh Dickins wrote:
> > > > > On Thu, 25 Feb 2016, Aneesh Kumar K.V wrote:
> > > > > > 
> > > > > > Can you test the impact of the merge listed below? (I.e., revert the merge and see if
> > > > > > we can reproduce, and also verify with the merge applied.) This will give us a
> > > > > > set of commits to look at more closely. We had quite a lot of page table
> > > > > > related changes going in during this merge window.
> > > > > > 
> > > > > > f689b742f217b2ffe7 ("Pull powerpc updates from Michael Ellerman:")
> > > > > > 
> > > > > > That is the merge commit that added _PAGE_PTE. 
> > > > > 
> > > > > Another experiment running on it at the moment, I'd like to give that
> > > > > a few more hours, and then will try the revert you suggest.  But does
> > > > > that merge revert cleanly, did you try?  I'm afraid of interactions,
> > > > > whether obvious or subtle, with the THP refcounting rework.  Oh, since
> > > > > I don't have THP configured on, maybe I can ignore any issues from that.
> > > > 
> > > > That revert worked painlessly, only a very few and simple conflicts,
> > > > I ran that under load for 12 hours, no problem seen.
> > > > 
> > > > I've now checked out an f689b742 tree and started on that, just to
> > > > confirm that it fails fairly quickly I hope; and will then proceed
> > > > to git bisect, giving that as bad and 37cea93b as good.
> > > > 
> > > > Given the uncertainty of whether 12 hours is really long enough to be
> > > > sure, and perhaps difficulties along the way, I don't rate my chances
> > > > of a reliable bisection higher than 60%, but we'll see.
> > > 
> > > I'm sure you won't want a breathless report from me on each bisection
> > > step, but I ought to report that: contrary to our expectations, the
> > > f689b742 survived without error for 12 hours, so appears to be good.
> > > I'll bisect between there and v4.5-rc1.
> > 
> > The bisection completed this morning (log appended below):
> > not a satisfactory conclusion, it's pointing to a davem/net merge.
> > 
> > I was uncomfortable when I marked that point bad in the first place:
> > it ran for 9 hours before hitting a compiler error, which was nearly
> > twice as long as the longest I'd seen before (5 hours), and
> > uncomfortably close to the 12 hours I've been taking as good.
> > 
> > My current thinking is that the powerpc merge that you indicated,
> > that I found to be "good", is the one that contains the bad commit;
> > but that the bug is very rare to manifest in that kernel, and my test
> > of the davem/net merge happened to be unusually unlucky to hit it.
> > 
> > Then some other later change makes it significantly easier to hit;
> > and identifying that change may make it much easier to pin down
> > what the original bug is.
> > 
> > So I've replayed the bisection up to that point, marked the davem/net
> > merge as good this time, and set off again in the hope that it will
> > lead somewhere more enlightening.  But prepared for disappointment.
> 
> Thanks Hugh. That logic sounds reasonable, I doubt we can blame davem :)
> 
> I've set up another box here to try and reproduce it. It's running with 4k
> pages, no THP, and it's going well into swap. Hopefully I can hit the same bug,
> but we'll see in 12 hours I guess.

The alternative bisection was as unsatisfactory as the first:
again it fingered an irrelevant merge (rather than any commit
pulled in by that merge) as the bad commit.

It seems this issue is too intermittent for bisection to be useful,
on my load anyway.

The best I can do now is try v4.4 for a couple of days, to verify that
still comes out good (rather than the machine going bad coincident with
v4.5-rc), then try v4.5-rc7 to verify that that still comes out bad.

I'll report back on those; but beyond that, I'll have to leave it to you.

Hugh

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Problems with swapping in v4.5-rc on POWER
  2016-03-04 17:58             ` Hugh Dickins
@ 2016-03-07  3:00               ` Michael Ellerman
  2016-03-08 11:49                 ` Hugh Dickins
  0 siblings, 1 reply; 12+ messages in thread
From: Michael Ellerman @ 2016-03-07  3:00 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Aneesh Kumar K.V, Paul Mackerras, linuxppc-dev, linux-mm

On Fri, 2016-03-04 at 09:58 -0800, Hugh Dickins wrote:
> 
> The alternative bisection was as unsatisfactory as the first:
> again it fingered an irrelevant merge (rather than any commit
> pulled in by that merge) as the bad commit.
> 
> It seems this issue is too intermittent for bisection to be useful,
> on my load anyway.

Darn. Thanks for trying.

> The best I can do now is try v4.4 for a couple of days, to verify that
> still comes out good (rather than the machine going bad coincident with
> v4.5-rc), then try v4.5-rc7 to verify that that still comes out bad.

Thanks, that would still be helpful.

> I'll report back on those; but beyond that, I'll have to leave it to you.

I haven't had any luck here :/

Can you give us a more verbose description of your test setup?

 - G5, which exact model?
 - 4k pages, no THP.
 - how much ram & swap?
 - building linus' tree, make -j ?
 - source and output on tmpfs? (how big?)
 - what device is the swap device? (you said SSD I think?)
 - anything else I've forgotten?

Oh and can you send us your bisect logs, we can at least trust the bad results
I think.

cheers

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Problems with swapping in v4.5-rc on POWER
  2016-03-07  3:00               ` Michael Ellerman
@ 2016-03-08 11:49                 ` Hugh Dickins
  0 siblings, 0 replies; 12+ messages in thread
From: Hugh Dickins @ 2016-03-08 11:49 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Hugh Dickins, Aneesh Kumar K.V, Paul Mackerras, linuxppc-dev, linux-mm

On Mon, 7 Mar 2016, Michael Ellerman wrote:
> On Fri, 2016-03-04 at 09:58 -0800, Hugh Dickins wrote:
> > 
> > The alternative bisection was as unsatisfactory as the first:
> > again it fingered an irrelevant merge (rather than any commit
> > pulled in by that merge) as the bad commit.
> > 
> > It seems this issue is too intermittent for bisection to be useful,
> > on my load anyway.
> 
> Darn. Thanks for trying.
> 
> > The best I can do now is try v4.4 for a couple of days, to verify that
> > still comes out good (rather than the machine going bad coincident with
> > v4.5-rc), then try v4.5-rc7 to verify that that still comes out bad.
> 
> Thanks, that would still be helpful.

v4.4 ran under load for 56 hours without any trouble, before I stopped
it to switch kernels.  v4.5-rc7 ran for 19.5 hours, then hit the problem
(sigsegv in "as" on this occasion).

> 
> > I'll report back on those; but beyond that, I'll have to leave it to you.
> 
> I haven't had any luck here :/
> 
> Can you give us a more verbose description of your test setup?

I'll be a lot more terse than you'd like, not much time to spare.
If I had a good reproducer, then of course I should specify it exactly
to you; but no, 19.5 hours or 5 hours or a few minutes, that does not
amount to a good reproducer.

> 
>  - G5, which exact model?

/proc/cpuinfo says:

processor	: 0
cpu		: PPC970MP, altivec supported
clock		: 2500.000000MHz
revision	: 1.1 (pvr 0044 0101)

processor	: 1
cpu		: PPC970MP, altivec supported
clock		: 2500.000000MHz
revision	: 1.1 (pvr 0044 0101)

processor	: 2
cpu		: PPC970MP, altivec supported
clock		: 2500.000000MHz
revision	: 1.1 (pvr 0044 0101)

processor	: 3
cpu		: PPC970MP, altivec supported
clock		: 2500.000000MHz
revision	: 1.1 (pvr 0044 0101)

timebase	: 33333333
platform	: PowerMac
model		: PowerMac11,2
machine		: PowerMac11,2
motherboard	: PowerMac11,2 MacRISC4 Power Macintosh 
detected as	: 337 (PowerMac G5 Dual Core)
pmac flags	: 00000000
L2 cache	: 1024K unified
pmac-generation	: NewWorld

>  - 4k pages, no THP.

Yes.

>  - how much ram & swap?

I boot with mem=700M, and use 1.5G swap.

>  - building linus' tree, make -j ?

Building an old 2.6.24 tree (which had a higher source to built ratio
than nowadays; with patches to get it to build with more recent toolchain,
from openSUSE 13.1); building some config I used to run on that machine.

Building two of them, each make -j20, concurrently: one in tmpfs,
one in 4kB-blocksize ext4 on loop on tmpfs file.  But I doubt that
complication is relevant here: sometimes it's the build in tmpfs
that collapses, sometimes the build in ext4, it's fairly even which.

(Do not bother to attempt such a load on linux-next, only on v4.5:
the OOM rework in mmotm has an unsolved problem with order=2 allocations,
which means that such a load will be OOM-killed very quickly.)

>  - source and output on tmpfs? (how big?)

One source and output in ext4 on loop on a file filling a 470M tmpfs.
Other source and output in tmpfs on /tmp which I happen to size at 1300M
(but could be half that).  Sizes of course fitted to that source tree
and config I happen to be building.

>  - what device is the swap device? (you said SSD I think?)

Old 75G Intel SSD:
ata2.00: ATA-7: INTEL SSDSA2M080G2GN, 2CV102HD, max UDMA/133

>  - anything else I've forgotten?

I happen to run with /proc/sys/vm/swappiness 100,
merely because it's swapping that I'm trying to exercise.

I doubt that any of the details above are important: plenty of
swapping is probably the only message (and doing everything in
tmpfs in limited memory is a good way to force plenty of swapping).
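
As a rough sketch only, the setup described above might be scripted along
these lines.  The mount points, the image file name, the swap device and
the tree locations are all hypothetical, and the 460M image size is just
an approximation of "filling" the 470M tmpfs; mem=700M has to be given on
the kernel command line and is not set here.  By default the commands are
only printed.

```shell
#!/bin/sh
# Sketch under assumptions: every path and device name below is hypothetical.
# The machine is assumed to have been booted with mem=700M.

# Print each command by default; set DRY_RUN=0 (as root) to really run it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

setup_swap_load() {
    run sysctl -w vm.swappiness=100
    run swapon /dev/sdXN                        # hypothetical 1.5G swap partition on the SSD
    run mount -t tmpfs -o size=1300M tmpfs /tmp
    run mkdir -p /mnt/small /mnt/ext4
    run mount -t tmpfs -o size=470M tmpfs /mnt/small
    run truncate -s 460M /mnt/small/ext4.img    # approximate size, an assumption
    run mkfs.ext4 -q -F -b 4096 /mnt/small/ext4.img
    run mount -o loop /mnt/small/ext4.img /mnt/ext4
}

start_builds() {
    # Both trees are assumed to be unpacked and configured already.
    run make -C /tmp/linux-2.6.24 -j20 &
    run make -C /mnt/ext4/linux-2.6.24 -j20 &
    wait                                        # the two builds run concurrently
}
```

Calling setup_swap_load and then start_builds with DRY_RUN unset just lists
the commands, which is a cheap way to review them before running anything
as root.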

> 
> Oh and can you send us your bisect logs, we can at least trust the bad results
> I think.

Remember that both of these bisections started from 4.5-rc1 as bad,
and f689b742f217, the powerpc merge, as good - since I didn't see a
problem at that commit in 12 hours.  But we all suspect that in fact
something in that powerpc merge was actually the culprit.

git bisect start
# good: [f689b742f217b2ffe7925f8a6521b208ee995309] Merge tag 'powerpc-4.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
git bisect good f689b742f217b2ffe7925f8a6521b208ee995309
# bad: [92e963f50fc74041b5e9e744c330dca48e04f08d] Linux 4.5-rc1
git bisect bad 92e963f50fc74041b5e9e744c330dca48e04f08d
# bad: [7f36f1b2a8c4f55f8226ed6c8bb4ed6de11c4015] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/ide
git bisect bad 7f36f1b2a8c4f55f8226ed6c8bb4ed6de11c4015
# bad: [6606b342febfd470b4a33acb73e360eeaca1d9bb] Merge git://www.linux-watchdog.org/linux-watchdog
git bisect bad 6606b342febfd470b4a33acb73e360eeaca1d9bb
# good: [d0021d3bdfe9d551859bca1f58da0e6be8e26043] Merge remote-tracking branch 'asoc/topic/wm8960' into asoc-next
git bisect good d0021d3bdfe9d551859bca1f58da0e6be8e26043
# good: [e3315b439c30c208582ac64e58f0c0d36b83181e] ALSA: oxfw: allocate own address region for SCS.1 series
git bisect good e3315b439c30c208582ac64e58f0c0d36b83181e
# good: [3da834e3e5a4a5d26882955298b55a9ed37a00bc] clk: remove duplicated COMMON_CLK_NXP record from clk/Kconfig
git bisect good 3da834e3e5a4a5d26882955298b55a9ed37a00bc
# bad: [e535d74bc50df2357d3253f8f3ca48c66d0d892a] Merge tag 'docs-4.5' of git://git.lwn.net/linux
git bisect bad e535d74bc50df2357d3253f8f3ca48c66d0d892a
# bad: [4e5448a31d73d0e944b7adb9049438a09bc332cb] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
git bisect bad 4e5448a31d73d0e944b7adb9049438a09bc332cb
# good: [b70ce2ab41cb67ab3d661eda078f7c4029bbca95] dts: hisi: fixes no syscon fault when init mdio
git bisect good b70ce2ab41cb67ab3d661eda078f7c4029bbca95
# good: [4a658527271bce43afb1cf4feec89afe6716ca59] xen-netback: delete NAPI instance when queue fails to initialize
git bisect good 4a658527271bce43afb1cf4feec89afe6716ca59
# good: [c6894dec8ea9ae05747124dce98b3b5c2e69b168] bridge: fix lockdep addr_list_lock false positive splat
git bisect good c6894dec8ea9ae05747124dce98b3b5c2e69b168
# good: [36beca6571c941b28b0798667608239731f9bc3a] sparc64: Fix numa node distance initialization
git bisect good 36beca6571c941b28b0798667608239731f9bc3a
# good: [750afbf8ee9c6a1c74a1fe5fc9852146b1d72687] bgmac: Fix reversed test of build_skb() return value.
git bisect good 750afbf8ee9c6a1c74a1fe5fc9852146b1d72687
# good: [5a18d263f8d27418c98b8e8551dadfe975c054e3] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc
git bisect good 5a18d263f8d27418c98b8e8551dadfe975c054e3
# first bad commit: [4e5448a31d73d0e944b7adb9049438a09bc332cb] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

And then I replayed, taking the davem/net merge as good instead,
on the basis that it had taken longer than usual to hit the issue:

git bisect start
# good: [f689b742f217b2ffe7925f8a6521b208ee995309] Merge tag 'powerpc-4.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
git bisect good f689b742f217b2ffe7925f8a6521b208ee995309
# bad: [92e963f50fc74041b5e9e744c330dca48e04f08d] Linux 4.5-rc1
git bisect bad 92e963f50fc74041b5e9e744c330dca48e04f08d
# bad: [7f36f1b2a8c4f55f8226ed6c8bb4ed6de11c4015] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/ide
git bisect bad 7f36f1b2a8c4f55f8226ed6c8bb4ed6de11c4015
# bad: [6606b342febfd470b4a33acb73e360eeaca1d9bb] Merge git://www.linux-watchdog.org/linux-watchdog
git bisect bad 6606b342febfd470b4a33acb73e360eeaca1d9bb
# good: [d0021d3bdfe9d551859bca1f58da0e6be8e26043] Merge remote-tracking branch 'asoc/topic/wm8960' into asoc-next
git bisect good d0021d3bdfe9d551859bca1f58da0e6be8e26043
# good: [e3315b439c30c208582ac64e58f0c0d36b83181e] ALSA: oxfw: allocate own address region for SCS.1 series
git bisect good e3315b439c30c208582ac64e58f0c0d36b83181e
# good: [3da834e3e5a4a5d26882955298b55a9ed37a00bc] clk: remove duplicated COMMON_CLK_NXP record from clk/Kconfig
git bisect good 3da834e3e5a4a5d26882955298b55a9ed37a00bc
# bad: [e535d74bc50df2357d3253f8f3ca48c66d0d892a] Merge tag 'docs-4.5' of git://git.lwn.net/linux
git bisect bad e535d74bc50df2357d3253f8f3ca48c66d0d892a
# good: [4e5448a31d73d0e944b7adb9049438a09bc332cb] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
git bisect good 4e5448a31d73d0e944b7adb9049438a09bc332cb
# good: [aa13a960fc1bd28cfd8b3aef43e523ade1817a2c] Documentation: cpu-hotplug: Fix sysfs mount instructions
git bisect good aa13a960fc1bd28cfd8b3aef43e523ade1817a2c
# good: [afd8c08446d6503adc1ccd2726a8e27f35d95b79] Documentation: Explain pci=conf1,conf2 more verbosely
git bisect good afd8c08446d6503adc1ccd2726a8e27f35d95b79
# good: [e5b6c1518878e157df4121c1caf70d9c470a6d31] firmware: dmi_scan: Save SMBIOS Type 9 System Slots
git bisect good e5b6c1518878e157df4121c1caf70d9c470a6d31
# good: [ec3fc58b1e7a32cc9f552b306f8dbb4454e83798] thermal: add description for integral_cutoff unit
git bisect good ec3fc58b1e7a32cc9f552b306f8dbb4454e83798
# bad: [ece6267878aed4eadff766112f1079984315d8c8] Merge tag 'clk-for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
git bisect bad ece6267878aed4eadff766112f1079984315d8c8
# bad: [d45187aaf0e256d23da2f7694a7826524499aa31] Merge branch 'dmi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging
git bisect bad d45187aaf0e256d23da2f7694a7826524499aa31
# first bad commit: [d45187aaf0e256d23da2f7694a7826524499aa31] Merge branch 'dmi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging
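
For anyone wanting to repeat that kind of replay, one possible approach
(a sketch, not necessarily how it was done here) is to cut the saved log
off just before the entry whose verdict is doubted, append the opposite
verdict, and feed the result to "git bisect replay".  The helper name
flip_bisect_verdict and the log file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print a bisect log truncated just before the entry
# for SHA, then append the new verdict ("good" or "bad") for that SHA.
flip_bisect_verdict() {
    log=$1 sha=$2 verdict=$3
    awk -v sha="$sha" '
        # stop at the comment line that introduces the doubted commit
        $1 == "#" && index($0, "[" sha "]") { exit }
        # or at its "git bisect good|bad SHA" line, if no comment precedes it
        $1 == "git" && $2 == "bisect" && $4 == sha { exit }
        { print }
    ' "$log"
    printf 'git bisect %s %s\n' "$verdict" "$sha"
}
```

With the first log saved by "git bisect log > bisect-1.log", something like
"flip_bisect_verdict bisect-1.log 4e5448a31d73d0e944b7adb9049438a09bc332cb good > bisect-2.log"
followed by "git bisect reset" and "git bisect replay bisect-2.log" should
leave git ready to continue from the changed verdict.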

Hugh


end of thread, other threads:[~2016-03-08 11:50 UTC | newest]

Thread overview: 12+ messages
2016-02-25  2:10 Problems with swapping in v4.5-rc on POWER Hugh Dickins
2016-02-25  4:12 ` Michael Ellerman
2016-02-25  5:36   ` Hugh Dickins
2016-02-25  4:52 ` Aneesh Kumar K.V
2016-02-25  5:43   ` Hugh Dickins
2016-02-25 21:35     ` Hugh Dickins
2016-02-26 10:04       ` Hugh Dickins
2016-03-02 20:49         ` Hugh Dickins
2016-03-03  5:51           ` Michael Ellerman
2016-03-04 17:58             ` Hugh Dickins
2016-03-07  3:00               ` Michael Ellerman
2016-03-08 11:49                 ` Hugh Dickins
