linux-mips.vger.kernel.org archive mirror
* NFS/TCP crashes on MIPS/RBTX4927 in v4.20-rcX (bisected)
@ 2018-12-04 13:53 Geert Uytterhoeven
  2018-12-05 13:11 ` Atsushi Nemoto
  0 siblings, 1 reply; 12+ messages in thread
From: Geert Uytterhoeven @ 2018-12-04 13:53 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Atsushi Nemoto, open list:NFS, SUNRPC, AND...,
	linux-mips, Linux Kernel Mailing List

        Hi Trond,

Recently, I've upgraded my NFS server to Ubuntu 18.04LTS.  Apparently
the NFS server in that release dropped support for NFS over UDP, hence I
appended ",tcp,v3" to all my nfsroot kernel command line parameters.
This works fine on my arm/arm64 development boards, but causes a crash
on RBTX4927:

    VFS: Mounted root (nfs filesystem) on device 0:13.
    devtmpfs: mounted
    Freeing prom memory: 1020k freed
    Freeing unused kernel memory: 208K
    This architecture does not have kernel memory protection.
    Run /sbin/init as init process
    do_page_fault(): sending SIGSEGV to init for invalid read access from 57e7e414
    epc = 77f9e188 in ld-2.19.so[77f9c000+22000]
    ra  = 77f9d91c in ld-2.19.so[77f9c000+22000]
    Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b

I found similar crashes in a report from 2006, but of course the code
has changed too much to apply the solution proposed there
(https://www.linux-mips.org/archives/linux-mips/2006-09/msg00169.html).

Userland is Debian 8 (the last release supporting "old" MIPS).
My kernel is based on v4.20.0-rc5, but the issue happens with v4.20-rc1,
too.

However, I noticed it works in v4.19! Hence I've bisected this, to commit
277e4ab7d530bf28 ("SUNRPC: Simplify TCP receive code by switching to using
iterators").

Dropping the ",tcp" part from the nfsroot parameter also fixes the issue.

Given RBTX4927 is little endian, just like my arm/arm64 boards, it's probably
not an endianness issue.  Sparse didn't show anything suspicious before/after
the guilty commit.

Do you have a clue?
Thanks!

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds


* Re: NFS/TCP crashes on MIPS/RBTX4927 in v4.20-rcX (bisected)
  2018-12-04 13:53 NFS/TCP crashes on MIPS/RBTX4927 in v4.20-rcX (bisected) Geert Uytterhoeven
@ 2018-12-05 13:11 ` Atsushi Nemoto
  2018-12-05 13:41   ` Geert Uytterhoeven
  0 siblings, 1 reply; 12+ messages in thread
From: Atsushi Nemoto @ 2018-12-05 13:11 UTC (permalink / raw)
  To: geert; +Cc: trond.myklebust, linux-nfs, linux-mips, linux-kernel

Hi Geert,

On Tue, 4 Dec 2018 14:53:07 +0100, Geert Uytterhoeven <geert@linux-m68k.org> wrote:
> I found similar crashes in a report from 2006, but of course the code
> has changed too much to apply the solution proposed there
> (https://www.linux-mips.org/archives/linux-mips/2006-09/msg00169.html).
> 
> Userland is Debian 8 (the last release supporting "old" MIPS).
> My kernel is based on v4.20.0-rc5, but the issue happens with v4.20-rc1,
> too.
> 
> However, I noticed it works in v4.19! Hence I've bisected this, to commit
> 277e4ab7d530bf28 ("SUNRPC: Simplify TCP receive code by switching to using
> iterators").
> 
> Dropping the ",tcp" part from the nfsroot parameter also fixes the issue.
> 
> Given RBTX4927 is little endian, just like my arm/arm64 boards, it's probably
> not an endianness issue.  Sparse didn't show anything suspicious before/after
> the guilty commit.
> 
> Do you have a clue?

If this is a cache issue, disabling the i-cache or d-cache completely
might help in understanding the problem.  I added TXx9-specific
"icdisable" and "dcdisable" kernel options for debugging long ago.

I hope these options still work correctly with recent kernels, but I am
not sure.

Also, disabling i-cache makes your board VERY slow, of course.

---
Atsushi Nemoto


* Re: NFS/TCP crashes on MIPS/RBTX4927 in v4.20-rcX (bisected)
  2018-12-05 13:11 ` Atsushi Nemoto
@ 2018-12-05 13:41   ` Geert Uytterhoeven
  2018-12-05 13:45     ` Trond Myklebust
  2018-12-07 14:51     ` Atsushi Nemoto
  0 siblings, 2 replies; 12+ messages in thread
From: Geert Uytterhoeven @ 2018-12-05 13:41 UTC (permalink / raw)
  To: Atsushi Nemoto
  Cc: Trond Myklebust, open list:NFS, SUNRPC, AND...,
	linux-mips, Linux Kernel Mailing List

Hi Nemoto-san,

On Wed, Dec 5, 2018 at 2:11 PM Atsushi Nemoto <anemo@mba.ocn.ne.jp> wrote:
> If it was a cache issue, disabling i-cache or d-cache completely might
> help understanding the problem.  I added TXx9 specific "icdisable" and
> "dcdisable" kernel options for debugging long ago.
>
> I hope these options still work correctly with recent kernels, but I am
> not sure.
>
> Also, disabling i-cache makes your board VERY slow, of course.

Thanks!

When using these options, I do see a slowdown in early boot, but the issue
is still there.

My next guess is an unaligned access that bypasses {get,put}_unaligned().
Such accesses don't seem to work on TX4927, but don't raise an exception
either.
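
To illustrate what {get,put}_unaligned() guard against, here is a minimal
userspace sketch. It assumes a hosted C environment; the helper names
load_u32_unaligned() and store_u32_unaligned() are hypothetical stand-ins,
not the kernel implementations. The idea is to copy the bytes instead of
dereferencing a possibly misaligned pointer:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical userspace stand-in for the kernel's get_unaligned():
 * copy the bytes out instead of dereferencing a misaligned pointer.
 */
static uint32_t load_u32_unaligned(const void *p)
{
	uint32_t v;

	assert(p != NULL);
	memcpy(&v, p, sizeof(v));	/* byte-wise copy, no alignment requirement */
	return v;
}

/* Matching store, standing in for put_unaligned(). */
static void store_u32_unaligned(void *p, uint32_t v)
{
	assert(p != NULL);
	memcpy(p, &v, sizeof(v));	/* writes exactly sizeof(v) bytes at p */
}
```

Storing a value at an odd offset into a byte buffer and loading it back
through these helpers should round-trip on any CPU, whereas a plain
*(uint32_t *)p on such an address may trap or return garbage, depending
on the hardware.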

Gr{oetje,eeting}s,

                        Geert



* Re: NFS/TCP crashes on MIPS/RBTX4927 in v4.20-rcX (bisected)
  2018-12-05 13:41   ` Geert Uytterhoeven
@ 2018-12-05 13:45     ` Trond Myklebust
  2018-12-05 14:47       ` Geert Uytterhoeven
  2018-12-07 14:51     ` Atsushi Nemoto
  1 sibling, 1 reply; 12+ messages in thread
From: Trond Myklebust @ 2018-12-05 13:45 UTC (permalink / raw)
  To: geert, anemo; +Cc: linux-kernel, linux-nfs, linux-mips

On Wed, 2018-12-05 at 14:41 +0100, Geert Uytterhoeven wrote:
> Thanks!
> 
> When using these options, I do see a slowdown in early boot, but the
> issue is still there.
> 
> My next guess is an unaligned access not using {get,put}_unaligned(),
> which doesn't seem to work on tx4927, but doesn't cause an exception
> either.

Can you try my linux-next branch on git.linux-nfs.org? It contains a
fix for a hang that results from the above commit.

git pull git://git.linux-nfs.org/projects/trondmy/linux-nfs.git linux-next

Cheers
  Trond

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com




* Re: NFS/TCP crashes on MIPS/RBTX4927 in v4.20-rcX (bisected)
  2018-12-05 13:45     ` Trond Myklebust
@ 2018-12-05 14:47       ` Geert Uytterhoeven
  2018-12-17 14:03         ` Geert Uytterhoeven
  0 siblings, 1 reply; 12+ messages in thread
From: Geert Uytterhoeven @ 2018-12-05 14:47 UTC (permalink / raw)
  To: trondmy
  Cc: Atsushi Nemoto, Linux Kernel Mailing List, open list:NFS, SUNRPC,
	AND...,
	linux-mips

Hi Trond,

On Wed, Dec 5, 2018 at 2:45 PM Trond Myklebust <trondmy@hammerspace.com> wrote:
> Can you try my linux-next branch on git.linux-nfs.org? It contains a
> fix for a hang that results from the above commit.
>
> git pull git://git.linux-nfs.org/projects/trondmy/linux-nfs.git linux-next

Thanks for the suggestion, but unfortunately it doesn't help.

Gr{oetje,eeting}s,

                        Geert



* Re: NFS/TCP crashes on MIPS/RBTX4927 in v4.20-rcX (bisected)
  2018-12-05 13:41   ` Geert Uytterhoeven
  2018-12-05 13:45     ` Trond Myklebust
@ 2018-12-07 14:51     ` Atsushi Nemoto
  2018-12-07 16:19       ` Geert Uytterhoeven
  1 sibling, 1 reply; 12+ messages in thread
From: Atsushi Nemoto @ 2018-12-07 14:51 UTC (permalink / raw)
  To: geert; +Cc: trond.myklebust, linux-nfs, linux-mips, linux-kernel

On Wed, 5 Dec 2018 14:41:30 +0100, Geert Uytterhoeven <geert@linux-m68k.org> wrote:
> When using these options, I do see a slowdown in early boot, but the issue
> is still there.

Hmm, the NIC on the board is an NE2000 variant, so DMA coherency will
not be an issue anyway.  So strange ...

The board has a PCI slot.  If you have a legacy PCI NIC, trying
with it might help in finding the bug.

> My next guess is an unaligned access not using {get,put}_unaligned(), which
> doesn't seem to work on tx4927, but doesn't cause an exception either.

IIRC, TX49 can raise an exception on unaligned access.

---
Atsushi Nemoto


* Re: NFS/TCP crashes on MIPS/RBTX4927 in v4.20-rcX (bisected)
  2018-12-07 14:51     ` Atsushi Nemoto
@ 2018-12-07 16:19       ` Geert Uytterhoeven
  0 siblings, 0 replies; 12+ messages in thread
From: Geert Uytterhoeven @ 2018-12-07 16:19 UTC (permalink / raw)
  To: Atsushi Nemoto
  Cc: Trond Myklebust, open list:NFS, SUNRPC, AND...,
	linux-mips, Linux Kernel Mailing List

Hi Nemoto-san,

On Fri, Dec 7, 2018 at 3:51 PM Atsushi Nemoto <anemo@mba.ocn.ne.jp> wrote:
> On Wed, 5 Dec 2018 14:41:30 +0100, Geert Uytterhoeven <geert@linux-m68k.org> wrote:
> > My next guess is an unaligned access not using {get,put}_unaligned(), which
> > doesn't seem to work on tx4927, but doesn't cause an exception either.
>
> IIRC, TX49 can raise an exception on unaligned access.

I thought so, too, but I verified that reading from an unaligned address
didn't raise an exception; it returned a corrupt value instead.

Gr{oetje,eeting}s,

                        Geert



* Re: NFS/TCP crashes on MIPS/RBTX4927 in v4.20-rcX (bisected)
  2018-12-05 14:47       ` Geert Uytterhoeven
@ 2018-12-17 14:03         ` Geert Uytterhoeven
  2018-12-17 14:51           ` Trond Myklebust
  0 siblings, 1 reply; 12+ messages in thread
From: Geert Uytterhoeven @ 2018-12-17 14:03 UTC (permalink / raw)
  To: trondmy
  Cc: Atsushi Nemoto, Linux Kernel Mailing List, open list:NFS, SUNRPC,
	AND...,
	linux-mips

Hi Trond,

On Wed, Dec 5, 2018 at 3:47 PM Geert Uytterhoeven <geert@linux-m68k.org> wrote:
> Thanks for the suggestion, but unfortunately it doesn't help.

In the meantime, I tried your newer linux-next: no change.
I tried several other things:
  - remove the packed attribute (why did you add that?),
  - verify (at runtime) that all accesses to fraghdr, xid, and calldir
    are aligned,
  - enable RPC_DEBUG_DATA, nothing fishy seen at first sight.
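
For reference, the runtime alignment check can be sketched in plain C as
follows. This is a userspace approximation, not the debug code I actually
used; is_aligned_u32() and load_u32_checked() are hypothetical names:

```c
#include <assert.h>
#include <stdalign.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if p is suitably aligned for a 32-bit access. */
static bool is_aligned_u32(const void *p)
{
	return ((uintptr_t)p % alignof(uint32_t)) == 0;
}

/*
 * Hypothetical checked load: abort loudly on a misaligned pointer
 * instead of silently reading a possibly corrupt value.
 */
static uint32_t load_u32_checked(const uint32_t *p)
{
	assert(is_aligned_u32(p));
	return *p;
}
```

In the kernel, the equivalent would presumably be a WARN_ON() around an
IS_ALIGNED() test at each access site.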

Is anyone else seeing this on MIPS, or any other platform?
Does mounting NFS with -o nfsvers=3,tcp work on other MIPS platforms?

Thanks!


Gr{oetje,eeting}s,

                        Geert



* Re: NFS/TCP crashes on MIPS/RBTX4927 in v4.20-rcX (bisected)
  2018-12-17 14:03         ` Geert Uytterhoeven
@ 2018-12-17 14:51           ` Trond Myklebust
  2018-12-17 18:55             ` Geert Uytterhoeven
  0 siblings, 1 reply; 12+ messages in thread
From: Trond Myklebust @ 2018-12-17 14:51 UTC (permalink / raw)
  To: geert; +Cc: linux-kernel, linux-nfs, linux-mips, anemo

Hi Geert,

On Mon, 2018-12-17 at 15:03 +0100, Geert Uytterhoeven wrote:
> In the meantime, I tried your newer linux-next, no change.
> I tried several other things:
>   - remove the packed attribute (why did you add that?),

The packed attribute allows us to avoid a series of copy operations
when decoding the first three elements of an RPC-over-TCP header (which
is why they are all declared as big endian). The alternative would be
to have a 12 byte buffer there for temporary storage, and then a
duplicate set of 3 32-bit words into which we copy the buffer contents
after extracting them from the (non-blocking) socket.
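
As a rough userspace illustration of that layout (a sketch only, not the
SUNRPC code: struct rpc_tcp_hdr, hdr_fill() and hdr_fraglen() are
hypothetical names, and the fields are plain uint32_t holding big-endian
wire values), partial reads can land directly in the packed struct and be
decoded in place:

```c
#include <arpa/inet.h>	/* ntohl() */
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of the three leading words of an RPC-over-TCP record. */
struct rpc_tcp_hdr {
	uint32_t fraghdr;	/* record marker: last-fragment bit + length */
	uint32_t xid;		/* transaction id */
	uint32_t calldir;	/* call (0) or reply (1) */
} __attribute__((packed));

static_assert(sizeof(struct rpc_tcp_hdr) == 12, "header must be 12 bytes");

/*
 * Copy up to 'len' received bytes into the header, starting at fill
 * level 'have' (as after a short read from a non-blocking socket).
 * Returns the new fill level; the header is complete at 12.
 */
static size_t hdr_fill(struct rpc_tcp_hdr *hdr, size_t have,
		       const void *data, size_t len)
{
	size_t want;

	assert(have <= sizeof(*hdr));
	want = sizeof(*hdr) - have;
	if (len > want)
		len = want;
	memcpy((unsigned char *)hdr + have, data, len);
	return have + len;
}

/* Fragment length: low 31 bits of the big-endian record marker. */
static uint32_t hdr_fraglen(const struct rpc_tcp_hdr *hdr)
{
	return ntohl(hdr->fraghdr) & 0x7fffffffu;
}
```

No separate 12-byte staging buffer is needed here: the bytes are received
into the struct itself and the fields are converted from big endian only
when they are read.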

>   - verify (at runtime) that all accesses to fraghdr, xid, and calldir
>     are aligned,
>   - enable RPC_DEBUG_DATA, nothing fishy seen at first sight.
> 
> Is anyone else seeing this on MIPS, or any other platform?
> Does mounting NFS with -o nfsvers=3,tcp work on other MIPS platforms?

I have no access to any MIPS hardware for the purposes of testing so
that would be a question for the community.

One thing that I have noticed is that, unlike the old code, the bvec
'generic' code appears not to call flush_dcache_page(). Could that be
causing the problem here? If so, why would that not be a problem in the
context of regular block I/O?

Cheers
  Trond
-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com




* Re: NFS/TCP crashes on MIPS/RBTX4927 in v4.20-rcX (bisected)
  2018-12-17 14:51           ` Trond Myklebust
@ 2018-12-17 18:55             ` Geert Uytterhoeven
  2018-12-17 19:01               ` Trond Myklebust
  0 siblings, 1 reply; 12+ messages in thread
From: Geert Uytterhoeven @ 2018-12-17 18:55 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Alexander Viro, Atsushi Nemoto, Ralf Baechle, Paul Burton,
	James Hogan, linux-nfs, linux-mips, linux-kernel,
	Geert Uytterhoeven

	Hi Trond,

(For the newly added CCs, first message was
https://lore.kernel.org/lkml/CAMuHMdVJr0PwvJg3FeTCy7vxuyY1=S1tPLHO7hPsoZX4wZ+-cQ@mail.gmail.com/)

> On Mon, Dec 17, 2018 at 3:51 PM Trond Myklebust <trondmy@hammerspace.com> wrote:
> > On Mon, 2018-12-17 at 15:03 +0100, Geert Uytterhoeven wrote:
> > > In the meantime, I tried your newer linux-next, no change.
> > > I tried several other things:
> > >   - remove the packed attribute (why did you add that?),
> >
> > The packed attribute allows us to avoid a series of copy operations
> > when decoding the first three elements of a RPC over TCP header (which
> > is why they are all declared as big endian). The alternative would be
> > to have a 12 byte buffer there for temporary storage, and then a
> > duplicate set of 3 32-bit words into which we copy the buffer contents
> > after extracting them from the (non-blocking) socket.
> >
> > >   - verify (at runtime) that all accesses to fraghdr, xid, and calldir
> > >     are aligned,
> > >   - enable RPC_DEBUG_DATA, nothing fishy seen at first sight.
> > >
> > > Is anyone else seeing this on MIPS, or any other platform?
> > > Does mounting NFS with -o nfsvers=3,tcp work on other MIPS platforms?
> >
> > I have no access to any MIPS hardware for the purposes of testing so
> > that would be a question for the community.
> >
> > One thing that I have noticed is that unlike the old code, the bvec
> > 'generic' code does appear to fail to call flush_dcache_page(). Could
> > that be causing the problem here? If so, why would that not be a
> > problem in the context of regular block I/O?

Thanks for the hint!

It wasn't clear to me where exactly the old code called
flush_dcache_page().  However, rpcrdma_inline_fixup() calls it between
copying to a page and unmapping the page, so I added a call to
flush_dcache_page() to all functions in lib/iov_iter.c that map a page
and copy to it, cf. the patch below.

And suddenly NFS root over TCP is working again!

Note that I have no idea if it affects regular block I/O, as my RBTX4927
does not have block devices.

Also note that this platform does not use highmem.

So, where's the proper place to fix this?
Thanks in advance!

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 54c248526b55fc49..5be62db33414d3f9 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -277,6 +277,7 @@ static size_t copy_page_from_iter_iovec(struct page *page, size_t offset, size_t
 			to += copy;
 			bytes -= copy;
 		}
+		flush_dcache_page(page);
 		if (likely(!bytes)) {
 			kunmap_atomic(kaddr);
 			goto done;
@@ -463,6 +464,7 @@ static void memcpy_to_page(struct page *page, size_t offset, const char *from, s
 {
 	char *to = kmap_atomic(page);
 	memcpy(to + offset, from, len);
+	flush_dcache_page(page);
 	kunmap_atomic(to);
 }
 
@@ -470,6 +472,7 @@ static void memzero_page(struct page *page, size_t offset, size_t len)
 {
 	char *addr = kmap_atomic(page);
 	memset(addr + offset, 0, len);
+	flush_dcache_page(page);
 	kunmap_atomic(addr);
 }
 
@@ -580,6 +583,7 @@ static size_t csum_and_copy_to_pipe_iter(const void *addr, size_t bytes,
 		char *p = kmap_atomic(pipe->bufs[idx].page);
 		next = csum_partial_copy_nocheck(addr, p + r, chunk, 0);
 		sum = csum_block_add(sum, next, off);
+		flush_dcache_page(pipe->bufs[idx].page);
 		kunmap_atomic(p);
 		i->idx = idx;
 		i->iov_offset = r + chunk;
@@ -628,6 +632,7 @@ static unsigned long memcpy_mcsafe_to_page(struct page *page, size_t offset,
 
 	to = kmap_atomic(page);
 	ret = memcpy_mcsafe(to + offset, from, len);
+	flush_dcache_page(page);
 	kunmap_atomic(to);
 
 	return ret;
@@ -894,6 +899,7 @@ size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
 	if (i->type & (ITER_BVEC|ITER_KVEC)) {
 		void *kaddr = kmap_atomic(page);
 		size_t wanted = _copy_from_iter(kaddr + offset, bytes, i);
+		flush_dcache_page(page);
 		kunmap_atomic(kaddr);
 		return wanted;
 	} else
@@ -958,6 +964,7 @@ size_t iov_iter_copy_from_user_atomic(struct page *page,
 				 v.bv_offset, v.bv_len),
 		memcpy((p += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
 	)
+	flush_dcache_page(page);
 	kunmap_atomic(kaddr);
 	return bytes;
 }
@@ -1494,6 +1501,7 @@ size_t csum_and_copy_to_iter(const void *addr, size_t bytes, __wsum *csum,
 		next = csum_partial_copy_nocheck((from += v.bv_len) - v.bv_len,
 						 p + v.bv_offset,
 						 v.bv_len, 0);
+		flush_dcache_page(v.bv_page);
 		kunmap_atomic(p);
 		sum = csum_block_add(sum, next, off);
 		off += v.bv_len;

Gr{oetje,eeting}s,

						Geert



* Re: NFS/TCP crashes on MIPS/RBTX4927 in v4.20-rcX (bisected)
  2018-12-17 18:55             ` Geert Uytterhoeven
@ 2018-12-17 19:01               ` Trond Myklebust
  2018-12-19  9:56                 ` Geert Uytterhoeven
  0 siblings, 1 reply; 12+ messages in thread
From: Trond Myklebust @ 2018-12-17 19:01 UTC (permalink / raw)
  To: geert
  Cc: linux-kernel, ralf, linux-mips, linux-nfs, viro, anemo,
	paul.burton, jhogan

On Mon, 2018-12-17 at 19:55 +0100, Geert Uytterhoeven wrote:
> 	Hi Trond,
> 
> (For the newly added CCs, first message was
> https://lore.kernel.org/lkml/CAMuHMdVJr0PwvJg3FeTCy7vxuyY1=S1tPLHO7hPsoZX4wZ+-cQ@mail.gmail.com/)
> 
> > > On Mon, Dec 17, 2018 at 3:51 PM Trond Myklebust <trondmy@hammerspace.com> wrote:
> > > > On Mon, 2018-12-17 at 15:03 +0100, Geert Uytterhoeven wrote:
> > > > > On Wed, Dec 5, 2018 at 3:47 PM Geert Uytterhoeven <geert@linux-m68k.org> wrote:
> > > > > > On Wed, Dec 5, 2018 at 2:45 PM Trond Myklebust <trondmy@hammerspace.com> wrote:
> > > > > > > On Wed, 2018-12-05 at 14:41 +0100, Geert Uytterhoeven wrote:
> > > > > > > > On Wed, Dec 5, 2018 at 2:11 PM Atsushi Nemoto <anemo@mba.ocn.ne.jp> wrote:
> > > > > > > > > On Tue, 4 Dec 2018 14:53:07 +0100, Geert Uytterhoeven <geert@linux-m68k.org> wrote:
> > > > > > > > > > I found similar crashes in a report from 2006, but of course the code
> > > > > > > > > > has changed too much to apply the solution proposed there
> > > > > > > > > > (https://www.linux-mips.org/archives/linux-mips/2006-09/msg00169.html).
> > > > > > > > > > 
> > > > > > > > > > Userland is Debian 8 (the last release supporting "old" MIPS).
> > > > > > > > > > My kernel is based on v4.20.0-rc5, but the issue happens with v4.20-rc1,
> > > > > > > > > > too.
> > > > > > > > > > 
> > > > > > > > > > However, I noticed it works in v4.19! Hence I've bisected this, to commit
> > > > > > > > > > 277e4ab7d530bf28 ("SUNRPC: Simplify TCP receive code by switching to using
> > > > > > > > > > iterators").
> > > > > > > > > > 
> > > > > > > > > > Dropping the ",tcp" part from the nfsroot parameter also fixes the issue.
> > > > > > > > > > 
> > > > > > > > > > Given RBTX4927 is little endian, just like my arm/arm64 boards, it's probably
> > > > > > > > > > not an endianness issue.  Sparse didn't show anything suspicious before/after
> > > > > > > > > > the guilty commit.
> > > > > > > > > > 
> > > > > > > > > > Do you have a clue?
> > > > > > > > 
> > > > > > > > > If it was a cache issue, disabling i-cache or d-cache completely might
> > > > > > > > > help understanding the problem.  I added TXx9 specific "icdisable" and
> > > > > > > > > "dcdisable" kernel options for debugging long ago.
> > > > > > > > > 
> > > > > > > > > I hope these options still work correctly with recent kernels, but I am
> > > > > > > > > not sure.
> > > > > > > > > 
> > > > > > > > > Also, disabling i-cache makes your board VERY slow, of course.
> > > > > > > 
> > > > > > > > Thanks!
> > > > > > > > 
> > > > > > > > When using these options, I do see a slowdown in early boot, but the issue
> > > > > > > > is still there.
> > > > > > > > 
> > > > > > > > My next guess is an unaligned access not using {get,put}_unaligned(), which
> > > > > > > > doesn't seem to work on tx4927, but doesn't cause an exception either.
> > > > > > 
> > > > > > > Can you try my linux-next branch on git.linux-nfs.org? It contains a
> > > > > > > fix for a hang that results from the above commit.
> > > > > > > 
> > > > > > > git pull git://git.linux-nfs.org/projects/trondmy/linux-nfs.git linux-next
> > > > > 
> > > > > Thanks for the suggestion, but unfortunately it doesn't help.
> > > > 
> > > > > In the meantime, I tried your newer linux-next, no change.
> > > > I tried several other things:
> > > >   - remove the packed attribute (why did you add that?),
> > > 
> > > > The packed attribute allows us to avoid a series of copy operations
> > > > when decoding the first three elements of an RPC over TCP header (which
> > > > is why they are all declared as big endian). The alternative would be
> > > > to have a 12 byte buffer there for temporary storage, and then a
> > > > duplicate set of 3 32-bit words into which we copy the buffer contents
> > > > after extracting them from the (non-blocking) socket.
> > > 
> > > > >   - verify (at runtime) that all accesses to fraghdr, xid, and calldir
> > > > >     are aligned,
> > > >   - enable RPC_DEBUG_DATA, nothing fishy seen at first sight.
> > > > 
> > > > Is anyone else seeing this on MIPS, or any other platform?
> > > > > Does mounting NFS with -o nfsvers=3,tcp work on other MIPS platforms?
> > > 
> > > > I have no access to any MIPS hardware for the purposes of testing so
> > > > that would be a question for the community.
> > > > 
> > > > One thing that I have noticed is that unlike the old code, the bvec
> > > > 'generic' code does appear to fail to call flush_dcache_page(). Could
> > > > that be causing the problem here? If so, why would that not be a
> > > > problem in the context of regular block I/O?
> 
> Thanks for the hint!
> 
> It wasn't clear to me where exactly the old code called
> flush_dcache_page(), but as rpcrdma_inline_fixup() calls it in between
> copying to a page, and unmapping the page, I added a call to
> flush_dcache_page() to all functions in lib/iov_iter.c that map a page
> and copy to it, cfr. the patch below.
> 
> And suddenly NFS root over TCP is working again!

Hah! ☺

> 
> Note that I have no idea if it affects regular block I/O, as my RBTX4927
> does not have block devices.
> 
> Also note that this platform does not use highmem.
> 
> So, where's the proper place to fix this?
> Thanks in advance!

Given that one of the main use cases for iov_iter is the page cache, I
think that your patch below is the correct one. However perhaps Al can
comment?

> 
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 54c248526b55fc49..5be62db33414d3f9 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -277,6 +277,7 @@ static size_t copy_page_from_iter_iovec(struct page *page, size_t offset, size_t
>  			to += copy;
>  			bytes -= copy;
>  		}
> +		flush_dcache_page(page);
>  		if (likely(!bytes)) {
>  			kunmap_atomic(kaddr);
>  			goto done;
> @@ -463,6 +464,7 @@ static void memcpy_to_page(struct page *page, size_t offset, const char *from, s
>  {
>  	char *to = kmap_atomic(page);
>  	memcpy(to + offset, from, len);
> +	flush_dcache_page(page);
>  	kunmap_atomic(to);
>  }
>  
> @@ -470,6 +472,7 @@ static void memzero_page(struct page *page, size_t offset, size_t len)
>  {
>  	char *addr = kmap_atomic(page);
>  	memset(addr + offset, 0, len);
> +	flush_dcache_page(page);
>  	kunmap_atomic(addr);
>  }
>  
> @@ -580,6 +583,7 @@ static size_t csum_and_copy_to_pipe_iter(const void *addr, size_t bytes,
>  		char *p = kmap_atomic(pipe->bufs[idx].page);
>  		next = csum_partial_copy_nocheck(addr, p + r, chunk, 0);
>  		sum = csum_block_add(sum, next, off);
> +		flush_dcache_page(pipe->bufs[idx].page);
>  		kunmap_atomic(p);
>  		i->idx = idx;
>  		i->iov_offset = r + chunk;
> @@ -628,6 +632,7 @@ static unsigned long memcpy_mcsafe_to_page(struct page *page, size_t offset,
>  
>  	to = kmap_atomic(page);
>  	ret = memcpy_mcsafe(to + offset, from, len);
> +	flush_dcache_page(page);
>  	kunmap_atomic(to);
>  
>  	return ret;
> @@ -894,6 +899,7 @@ size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
>  	if (i->type & (ITER_BVEC|ITER_KVEC)) {
>  		void *kaddr = kmap_atomic(page);
>  		size_t wanted = _copy_from_iter(kaddr + offset, bytes, i);
> +		flush_dcache_page(page);
>  		kunmap_atomic(kaddr);
>  		return wanted;
>  	} else
> @@ -958,6 +964,7 @@ size_t iov_iter_copy_from_user_atomic(struct page *page,
>  				 v.bv_offset, v.bv_len),
>  		memcpy((p += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
>  	)
> +	flush_dcache_page(page);
>  	kunmap_atomic(kaddr);
>  	return bytes;
>  }
> @@ -1494,6 +1501,7 @@ size_t csum_and_copy_to_iter(const void *addr, size_t bytes, __wsum *csum,
>  		next = csum_partial_copy_nocheck((from += v.bv_len) - v.bv_len,
>  						 p + v.bv_offset,
>  						 v.bv_len, 0);
> +		flush_dcache_page(v.bv_page);
>  		kunmap_atomic(p);
>  		sum = csum_block_add(sum, next, off);
>  		off += v.bv_len;
> 
> Gr{oetje,eeting}s,
> 
> 						Geert
> 
> --
> Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
> 


-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com




* Re: NFS/TCP crashes on MIPS/RBTX4927 in v4.20-rcX (bisected)
  2018-12-17 19:01               ` Trond Myklebust
@ 2018-12-19  9:56                 ` Geert Uytterhoeven
  0 siblings, 0 replies; 12+ messages in thread
From: Geert Uytterhoeven @ 2018-12-19  9:56 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: linux-kernel, ralf, linux-mips, linux-nfs, viro, anemo,
	paul.burton, jhogan, Andrew Morton

Any comments from the iovec experts?
This is a regression in v4.20-rc1.

Thanks!

Gr{oetje,eeting}s,

                        Geert


Thread overview: 12+ messages
2018-12-04 13:53 NFS/TCP crashes on MIPS/RBTX4927 in v4.20-rcX (bisected) Geert Uytterhoeven
2018-12-05 13:11 ` Atsushi Nemoto
2018-12-05 13:41   ` Geert Uytterhoeven
2018-12-05 13:45     ` Trond Myklebust
2018-12-05 14:47       ` Geert Uytterhoeven
2018-12-17 14:03         ` Geert Uytterhoeven
2018-12-17 14:51           ` Trond Myklebust
2018-12-17 18:55             ` Geert Uytterhoeven
2018-12-17 19:01               ` Trond Myklebust
2018-12-19  9:56                 ` Geert Uytterhoeven
2018-12-07 14:51     ` Atsushi Nemoto
2018-12-07 16:19       ` Geert Uytterhoeven
