Re: Bug 216582 - BUG: kernel NULL pointer dereference - nlmclnt_setlockargs

From: Daire Byrne <daire@dneg.com>
To: Thorsten Leemhuis <regressions@leemhuis.info>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>,
	Anna Schumaker <anna@kernel.org>,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	"regressions@lists.linux.dev" <regressions@lists.linux.dev>
Subject: Re: Bug 216582 - BUG: kernel NULL pointer dereference - nlmclnt_setlockargs
Date: Sun, 16 Oct 2022 12:56:43 +0100
Message-ID: <CAPt2mGPiYVYnK4dpZmQ4+R-=7bh-irhcY_XkYWB5hbDMyhbB9w@mail.gmail.com>
In-Reply-To: <8942d26d-1085-27f3-d15b-782d368e53b1@leemhuis.info>

Thorsten,

Thanks, but I should just say that I'm not certain this is a
regression yet - it could just be a change in our workload that is
triggering something I haven't seen before.

I am slowly working back through kernel versions to verify that - but
it's really hard to trigger and does not happen often so it is slow
going. Also my workload and configuration are quite unusual (NFS
re-exporting), so I may be the only one seeing this...

Cheers,

Daire

On Sun, 16 Oct 2022 at 12:21, Thorsten Leemhuis
<regressions@leemhuis.info> wrote:
>
> Hi, this is your Linux kernel regression tracker speaking.
>
> I noticed a regression report in bugzilla.kernel.org. As many (most?)
> kernel developers don't keep an eye on it, I decided to forward it by
> mail. Quoting from https://bugzilla.kernel.org/show_bug.cgi?id=216582 :
>
> >  Daire Byrne 2022-10-13 22:04:19 UTC
> >
> > Hi,
> >
> > I've started seeing this crash at least once or twice a week with our
> > NFS re-export workloads (re-exporting a Linux NFSv3 server as
> > NFSv3).
> >
> > We have been stepping through kernel versions a bit on the server
> > recently so it feels like something new introduced somewhere around
> > v5.17 but I also can't rule out that our clients are doing something
> > "different" with their workloads to stress this code in some new way.
> > It still occurs in v6.0 too.
> >
> > [106412.314663] BUG: kernel NULL pointer dereference, address: 0000000000000020
> > [106412.321879] #PF: supervisor read access in kernel mode
> > [106412.327237] #PF: error_code(0x0000) - not-present page
> > [106412.332599] PGD 0 P4D 0
> > [106412.335353] Oops: 0000 [#1] PREEMPT SMP NOPTI
> > [106412.339935] CPU: 34 PID: 2382 Comm: lockd Tainted: G            E     5.18.10-1.dneg.x86_64 #1
> > [106412.348773] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022
> > [106412.358223] RIP: 0010:nlmclnt_setlockargs+0x4a/0x100 [lockd]
> > [106412.364116] Code: 00 00 49 81 c0 88 00 00 00 f0 0f c1 05 bf 06 01 00 83 c0 01 c7 47 30 04 00 00 00 48 8d 4f 44 48 8d 7f 4c 89 47 c4 48 8b 46 78 <48> 8b 40 20 48 8b 90 60 fe ff ff 48 8d b0 60 fe ff ff 48 89 57 f8
> > [106412.383117] RSP: 0018:ffffb3db50cdfa80 EFLAGS: 00010202
> > [106412.388569] RAX: 0000000000000000 RBX: ffff8a36749c9400 RCX: ffff8a36749c9444
> > [106412.395924] RDX: ffff8a37f8696300 RSI: ffffb3db50cdfbd8 RDI: ffff8a36749c944c
> > [106412.403277] RBP: ffffb3db50cdfa90 R08: ffff8a750b49bc88 R09: ffff8a37f8696300
> > [106412.410634] R10: 0000000000000230 R11: ffffffffffffffff R12: ffffb3db50cdfbd8
> > [106412.417984] R13: ffff8a7508beac00 R14: ffffb3db50cdfca0 R15: ffffb3db50cdfbd8
> > [106412.425338] FS:  0000000000000000(0000) GS:ffff8a73ffa80000(0000) knlGS:0000000000000000
> > [106412.433649] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [106412.439611] CR2: 0000000000000020 CR3: 00000001118e6006 CR4: 00000000003706e0
> > [106412.446984] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [106412.454346] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [106412.461696] Call Trace:
> > [106412.464361]  <TASK>
> > [106412.466689]  nlmclnt_proc+0x1c6/0x5b0 [lockd]
> > [106412.471272]  nfs3_proc_lock+0x33/0xb0 [nfsv3]
> > [106412.475848]  ? nfs_put_lock_context+0x86/0x90 [nfs]
> > [106412.481008]  do_unlk+0x8f/0xd0 [nfs]
> > [106412.484837]  nfs_lock+0xcd/0x180 [nfs]
> > [106412.488815]  ? nlmsvc_mark_host+0x30/0x30 [lockd]
> > [106412.493752]  vfs_lock_file+0x1e/0x40
> > [106412.497547]  nlm_unlock_files.isra.0+0x6d/0xc0 [lockd]
> > [106412.502905]  nlm_traverse_files+0x163/0x2a0 [lockd]
> > [106412.508020]  nlmsvc_free_host_resources+0x2b/0x40 [lockd]
> > [106412.513648]  nlm_host_rebooted+0x2c/0x90 [lockd]
> > [106412.518483]  nlmsvc_proc_sm_notify+0xc0/0x130 [lockd]
> > [106412.523759]  ? nlmsvc_decode_reboot+0x7d/0xa0 [lockd]
> > [106412.529027]  nlmsvc_dispatch+0x8e/0x1a0 [lockd]
> > [106412.534312]  svc_process_common+0x484/0x620 [sunrpc]
> > [106412.539521]  ? lockd+0x1d0/0x1d0 [lockd]
> > [106412.543661]  ? set_grace_period+0xa0/0xa0 [lockd]
> > [106412.548582]  svc_process+0xbc/0xf0 [sunrpc]
> > [106412.553008]  lockd+0xd2/0x1d0 [lockd]
> > [106412.556906]  ? set_grace_period+0xa0/0xa0 [lockd]
> > [106412.561849]  kthread+0xee/0x120
> > [106412.565228]  ? kthread_complete_and_exit+0x20/0x20
> > [106412.570239]  ret_from_fork+0x1f/0x30
> > [106412.574033]  </TASK>
> > [106412.576436] Modules linked in: tcp_diag(E) inet_diag(E) nfsv3(E) nfs(E) cachefiles(E) fscache(E) netfs(E) ext4(E) mbcache(E) jbd2(E) intel_uncore_frequency_common(E) isst_if_common(E) sg(E) nfit(E) virtio_rng(E) rapl(E) i2c_piix4(E) input_leds(E) nfsd(E) sch_fq(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) tcp_bbr(E) binfmt_misc(E) ip_tables(E) xfs(E) libcrc32c(E) sd_mod(E) t10_pi(E) crc64_rocksoft_generic(E) crc64_rocksoft(E) crc64(E) crct10dif_pclmul(E) crc32_pclmul(E) virtio_scsi(E) crc32c_intel(E) ghash_clmulni_intel(E) 8021q(E) garp(E) mrp(E) virtio_pci(E) scsi_transport_iscsi(E) virtio_pci_legacy_dev(E) aesni_intel(E) virtio_pci_modern_dev(E) crypto_simd(E) virtio_ring(E) cryptd(E) gve(E) serio_raw(E) virtio(E) sunrpc(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E) fuse(E)
> > [106412.646242] CR2: 0000000000000020
> > [106412.649780] ---[ end trace 0000000000000000 ]---
> > [106412.654617] RIP: 0010:nlmclnt_setlockargs+0x4a/0x100 [lockd]
> > [106412.660495] Code: 00 00 49 81 c0 88 00 00 00 f0 0f c1 05 bf 06 01 00 83 c0 01 c7 47 30 04 00 00 00 48 8d 4f 44 48 8d 7f 4c 89 47 c4 48 8b 46 78 <48> 8b 40 20 48 8b 90 60 fe ff ff 48 8d b0 60 fe ff ff 48 89 57 f8
> > [106412.679481] RSP: 0018:ffffb3db50cdfa80 EFLAGS: 00010202
> > [106412.684922] RAX: 0000000000000000 RBX: ffff8a36749c9400 RCX: ffff8a36749c9444
> > [106412.692269] RDX: ffff8a37f8696300 RSI: ffffb3db50cdfbd8 RDI: ffff8a36749c944c
> > [106412.699617] RBP: ffffb3db50cdfa90 R08: ffff8a750b49bc88 R09: ffff8a37f8696300
> > [106412.706969] R10: 0000000000000230 R11: ffffffffffffffff R12: ffffb3db50cdfbd8
> > [106412.714329] R13: ffff8a7508beac00 R14: ffffb3db50cdfca0 R15: ffffb3db50cdfbd8
> > [106412.721676] FS:  0000000000000000(0000) GS:ffff8a73ffa80000(0000) knlGS:0000000000000000
> > [106412.729981] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [106412.736472] CR2: 0000000000000020 CR3: 00000001118e6006 CR4: 00000000003706e0
> > [106412.743821] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [106412.751171] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [106412.758520] Kernel panic - not syncing: Fatal exception
> > [106412.764850] Kernel Offset: 0x30000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> > [106412.775850] ---[ end Kernel panic - not syncing: Fatal exception ]---
> >
> >
> > All I know is that I didn't notice this crash from v5.12 to v5.16 but
> > I have not been able to test this qualitatively yet. The crash is
> > rare enough that it makes A/B testing quite tricky.
> >
> > It's somewhat similar to
> > https://bugzilla.kernel.org/show_bug.cgi?id=213273 but that was for a
> > NFSv4.2 re-export of NFSv3 and this is for a NFSv3 re-export of NFSv3
> > (for WAN caching).
> >
> > We are using nfs-utils-2.5.4.
> >
> > Daire
>
> See the ticket for more details.
>
> BTW, let me use this mail to also add the report to the list of tracked
> regressions to ensure it doesn't fall through the cracks:
>
> #regzbot introduced: v5.17..v5.18
> #regzbot ignore-activity
>
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>
> P.S.: As the Linux kernel's regression tracker I deal with a lot of
> reports and sometimes miss something important when writing mails like
> this. If that's the case here, don't hesitate to tell me in a public
> reply, it's in everyone's interest to set the public record straight.