From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\)) Subject: Re: [PATCH] svcauth_gss: Revert 64c59a3726f2 ("Remove unnecessary allocation") From: Chuck Lever In-Reply-To: <20160906204238.GA30260@fieldses.org> Date: Tue, 6 Sep 2016 16:49:33 -0400 Cc: Linux NFS Mailing List , bcodding@redhat.com Message-Id: References: <20160901144839.6035.63068.stgit@klimt.1015granger.net> <20160906204238.GA30260@fieldses.org> To: "J. Bruce Fields" List-ID: > On Sep 6, 2016, at 4:42 PM, J. Bruce Fields = wrote: >=20 > On Thu, Sep 01, 2016 at 10:50:38AM -0400, Chuck Lever wrote: >> On occasion I see this on my server: >>=20 >> kernel: kernel BUG at /home/cel/src/linux/linux-2.6/mm/slub.c:3851! >> kernel: invalid opcode: 0000 [#1] SMP >> kernel: Modules linked in: cts rpcsec_gss_krb5 sb_edac edac_core = x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass = crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw = gf128mul glue_helper ablk_helper cryptd btrfs xor iTCO_wdt = iTCO_vendor_support raid6_pq pcspkr i2c_i801 i2c_smbus lpc_ich mfd_core = mei_me sg mei shpchp wmi ioatdma ipmi_si ipmi_msghandler acpi_pad = acpi_power_meter rpcrdma ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad = rdma_cm ib_cm iw_cm nfsd nfs_acl lockd grace auth_rpcgss sunrpc = ip_tables xfs libcrc32c mlx4_ib mlx4_en ib_core sr_mod cdrom sd_mod ast = drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm = crc32c_intel igb mlx4_core ahci libahci libata ptp pps_core dca = i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod >> kernel: CPU: 7 PID: 145 Comm: kworker/7:2 Not tainted = 4.8.0-rc4-00006-g9d06b0b #15 >> kernel: Hardware name: Supermicro Super Server/X10SRL-F, BIOS 1.0c = 09/09/2015 >> kernel: Workqueue: events do_cache_clean [sunrpc] >> kernel: task: ffff8808541d8000 task.stack: ffff880854344000 >> kernel: RIP: 0010:[] [] = kfree+0x155/0x180 >> kernel: RSP: 0018:ffff880854347d70 EFLAGS: 00010246 >> kernel: RAX: ffffea0020fe7660 RBX: ffff88083f9db064 RCX: = 146ff0f9d5ec5600 >> kernel: RDX: 000077ff80000000 RSI: ffff880853f01500 RDI: = ffff88083f9db064 >> kernel: RBP: ffff880854347d88 R08: ffff8808594ee000 R09: = ffff88087fdd8780 >> kernel: R10: 0000000000000000 R11: ffffea0020fe76c0 R12: = ffff880853f01500 >> kernel: R13: ffffffffa013cf76 R14: ffffffffa013cff0 R15: = ffffffffa04253a0 >> kernel: FS: 0000000000000000(0000) GS:ffff88087fdc0000(0000) = knlGS:0000000000000000 >> kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> kernel: CR2: 00007fed60b020c3 CR3: 0000000001c06000 CR4: = 00000000001406e0 >> kernel: Stack: >> kernel: ffff8808589f2f00 ffff880853f01500 0000000000000001 = ffff880854347da0 >> kernel: ffffffffa013cf76 ffff8808589f2f00 ffff880854347db8 = ffffffffa013d006 >> kernel: ffff8808589f2f20 ffff880854347e00 ffffffffa0406f60 = 0000000057c7044f >> kernel: Call Trace: >> kernel: [] rsc_free+0x16/0x90 [auth_rpcgss] >> kernel: [] rsc_put+0x16/0x30 [auth_rpcgss] >> kernel: [] cache_clean+0x2e0/0x300 [sunrpc] >> kernel: [] do_cache_clean+0xe/0x70 [sunrpc] >> kernel: [] process_one_work+0x1ff/0x3b0 >> kernel: [] worker_thread+0x2bc/0x4a0 >> kernel: [] ? rescuer_thread+0x3a0/0x3a0 >> kernel: [] kthread+0xe4/0xf0 >> kernel: [] ret_from_fork+0x1f/0x40 >> kernel: [] ? kthread_stop+0x110/0x110 >> kernel: Code: f7 ff ff eb 3b 65 8b 05 da 30 e2 7e 89 c0 48 0f a3 05 = a0 38 b8 00 0f 92 c0 84 c0 0f 85 d1 fe ff ff 0f 1f 44 00 00 e9 f5 fe ff = ff <0f> 0b 49 8b 03 31 f6 f6 c4 40 0f 85 62 ff ff ff e9 61 ff ff ff >> kernel: RIP [] kfree+0x155/0x180 >> kernel: RSP >> kernel: ---[ end trace 3fdec044969def26 ]--- >>=20 >> It seems to be most common after a server reboot where a client has = been >> using a Kerberos mount, and reconnects to continue its workload. >>=20 >> Signed-off-by: Chuck Lever >> --- >> Hi Bruce- >>=20 >> I've hit this off-and-on oops since 64c59a3726f2 was merged. >> Reverting it seems to make the problem go away. I haven't dug >> into why. >>=20 >> Thoughts? >=20 > Apologies, I wasn't thinking when I wrote that patch. The problem is > probably that rsc_lookup steals the passed-in memory to avoid doing an > allocation of its own, so we can't just pass in a pointer to memory = that > someone else is using.... >=20 > If we really want to avoid allocation there then maybe we should > preallocate somwhere, or reference count these handles. >=20 > For now reverting sounds like the right thing to do. NP, thanks for confirming! > Ben, did you ever confirm whether this helped with the problem you = were > seeing? (If I remember correctly, unnpredictable delays here could > cause the request to be dropped if later requests push the rpcsec_gss > sequence window too far.) If so then we could look into reference > counting. Well that's interesting. When a request is dropped, would the server disconnect? Because if it doesn't, the client will wait forever. > --b. >=20 >>=20 >>=20 >> net/sunrpc/auth_gss/svcauth_gss.c | 5 +++-- >> 1 file changed, 3 insertions(+), 2 deletions(-) >>=20 >> diff --git a/net/sunrpc/auth_gss/svcauth_gss.c = b/net/sunrpc/auth_gss/svcauth_gss.c >> index 1d28181..d858202 100644 >> --- a/net/sunrpc/auth_gss/svcauth_gss.c >> +++ b/net/sunrpc/auth_gss/svcauth_gss.c >> @@ -569,9 +569,10 @@ gss_svc_searchbyctx(struct cache_detail *cd, = struct xdr_netobj *handle) >> struct rsc *found; >>=20 >> memset(&rsci, 0, sizeof(rsci)); >> - rsci.handle.data =3D handle->data; >> - rsci.handle.len =3D handle->len; >> + if (dup_to_netobj(&rsci.handle, handle->data, handle->len)) >> + return NULL; >> found =3D rsc_lookup(cd, &rsci); >> + rsc_free(&rsci); >> if (!found) >> return NULL; >> if (cache_check(cd, &found->h, NULL)) -- Chuck Lever