From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-lf0-f41.google.com ([209.85.215.41]:33383 "EHLO mail-lf0-f41.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750744AbcBDXa2 (ORCPT ); Thu, 4 Feb 2016 18:30:28 -0500
Received: by mail-lf0-f41.google.com with SMTP id m1so47514612lfg.0 for ; Thu, 04 Feb 2016 15:30:27 -0800 (PST)
MIME-Version: 1.0
In-Reply-To: <20160130182731.GF17997@ZenIV.linux.org.uk>
References: <20160123012808.GK17997@ZenIV.linux.org.uk> <20160123191055.GN17997@ZenIV.linux.org.uk> <20160123214006.GO17997@ZenIV.linux.org.uk> <20160124001615.GT17997@ZenIV.linux.org.uk> <20160124040529.GX17997@ZenIV.linux.org.uk> <20160130173413.GE17997@ZenIV.linux.org.uk> <20160130182731.GF17997@ZenIV.linux.org.uk>
Date: Thu, 4 Feb 2016 18:30:26 -0500
Message-ID:
Subject: Re: Orangefs ABI documentation
From: Mike Marshall
To: Al Viro
Cc: Linus Torvalds , linux-fsdevel , Stephen Rothwell
Content-Type: text/plain; charset=UTF-8
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

Hi Al...

Sorry I didn't get to your WARN_ONs right away; I've been trying to get a better handle on the crashing-the-kernel-when-the-client-core-restarts problem. I've gotten the kdump stuff working, and my feeble crash-foo has improved a little...

I did some testing with the out-of-tree version of the kernel module, and it is actually able to recover pretty well when the client-core is stopped. I looked closer at the client-core code and talked with some of the others who've been around longer than me. The idea behind the restart is rooted in "keeping things going" in production if the client-core were to crash for some reason. On closer inspection I see that the signal handling code in the client-core that simulates a crash on SIGSEGV and SIGABRT is indeed written with sigaction, as the signal(2) man page says it should be.
As for the WARN_ONs: the waitqueue one is easy to hit when the client-core stops and restarts. In the log below you can see precopy_buffers start whining about the client-core, then the client-core restart (the debug mask gets sent back over), and then the WARN_ON in waitqueue being hit:

[ 1239.198976] precopy_buffers: Failed to copy-in buffers. Please make sure that the pvfs2-client is running. -14
[ 1239.198979] precopy_buffers: Failed to copy-in buffers. Please make sure that the pvfs2-client is running. -14
[ 1239.198983] orangefs_file_write_iter: do_readv_writev failed, rc:-14:.
[ 1239.199175] precopy_buffers: Failed to copy-in buffers. Please make sure that the pvfs2-client is running. -14
[ 1239.199177] precopy_buffers: Failed to copy-in buffers. Please make sure that the pvfs2-client is running. -14
[ 1239.199180] orangefs_file_write_iter: do_readv_writev failed, rc:-14:.
[ 1239.199601] precopy_buffers: Failed to copy-in buffers. Please make sure that the pvfs2-client is running. -14
[ 1239.199602] precopy_buffers: Failed to copy-in buffers. Please make sure that the pvfs2-client is running. -14
[ 1239.199604] orangefs_file_write_iter: do_readv_writev failed, rc:-14:.
[ 1239.248239] dispatch_ioctl_command: client debug mask has been been received :0: :0:
[ 1239.248257] dispatch_ioctl_command: client debug array string has been received.
[ 1239.307842] ------------[ cut here ]------------
[ 1239.307847] WARNING: CPU: 0 PID: 1347 at fs/orangefs/waitqueue.c:208 service_operation+0x59f/0x9b0()
[ 1239.307848] Modules linked in: bnep bluetooth ip6t_rpfilter rfkill ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_mangle iptable_security iptable_raw ppdev parport_pc virtio_balloon virtio_console parport 8139too serio_raw pvpanic i2c_piix4 uinput qxl drm_kms_helper ttm drm 8139cp i2c_core virtio_pci ata_generic virtio virtio_ring mii pata_acpi
[ 1239.307870] CPU: 0 PID: 1347 Comm: dbench Not tainted 4.4.0-161988-g237f828-dirty #49
[ 1239.307871] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 1239.307872] 0000000000000000 0000000011eac412 ffff88003bf27cd0 ffffffff8139c84d
[ 1239.307874] ffff88003bf27d08 ffffffff8108e510 ffff880010968000 ffff88001096c1d0
[ 1239.307876] ffff88001096c188 00000000fffffff5 0000000000000000 ffff88003bf27d18
[ 1239.307877] Call Trace:
[ 1239.307881] [] dump_stack+0x19/0x1c
[ 1239.307884] [] warn_slowpath_common+0x80/0xc0
[ 1239.307886] [] warn_slowpath_null+0x1a/0x20
[ 1239.307887] [] service_operation+0x59f/0x9b0
[ 1239.307889] [] ? prepare_to_wait_event+0x100/0x100
[ 1239.307891] [] ? prepare_to_wait_event+0x100/0x100
[ 1239.307893] [] orangefs_readdir+0x172/0xd50
[ 1239.307895] [] ? trace_hardirqs_on+0xd/0x10
[ 1239.307898] [] iterate_dir+0x9f/0x130
[ 1239.307899] [] SyS_getdents+0xa0/0x140
[ 1239.307900] [] ? iterate_dir+0x130/0x130
[ 1239.307903] [] entry_SYSCALL_64_fastpath+0x12/0x76
[ 1239.307904] ---[ end trace 66a9a15ad78b3dea ]---

On the next restart, the kernel crashed. The client-core is restarting
and orangefs_readdir is dealing with the wreckage, trying to give up the slot it was using:

[ 1255.683226] dispatch_ioctl_command: client debug mask has been been received :0: :0:
[ 1255.683245] dispatch_ioctl_command: client debug array string has been received.
[ 1255.711036] BUG: unable to handle kernel paging request at ffffffff810b0d68
[ 1255.711911] IP: [] __lock_acquire+0x1a4/0x1e50
[ 1255.712605] PGD 1c13067 PUD 1c14063 PMD 10001e1
[ 1255.713147] Oops: 0003 [#1]
[ 1255.713522] Modules linked in: bnep bluetooth ip6t_rpfilter rfkill ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_mangle iptable_security iptable_raw ppdev parport_pc virtio_balloon virtio_console parport 8139too serio_raw pvpanic i2c_piix4 uinput qxl drm_kms_helper ttm drm 8139cp i2c_core virtio_pci ata_generic virtio virtio_ring mii pata_acpi
[ 1255.720073] CPU: 0 PID: 1347 Comm: dbench Tainted: G W 4.4.0-161988-g237f828-dirty #49
[ 1255.721160] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 1255.721845] task: ffff880036894d00 ti: ffff88003bf24000 task.ti: ffff88003bf24000
[ 1255.722674] RIP: 0010:[] [] __lock_acquire+0x1a4/0x1e50
[ 1255.723669] RSP: 0018:ffff88003bf27c28 EFLAGS: 00010082
[ 1255.724289] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff810b0bd0
[ 1255.725143] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff880003067d58
[ 1255.726002] RBP: ffff88003bf27ce8 R08: 0000000000000001 R09: 0000000000000000
[ 1255.726852] R10: ffff880036894d00 R11: 0000000000000000 R12: ffff880003067d58
[ 1255.727731] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1255.728585] FS: 00007fbe5f7d7740(0000) GS:ffffffff81c28000(0000) knlGS:0000000000000000
[ 1255.729593] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1255.730288] CR2:
ffffffff810b0d68 CR3: 00000000160da000 CR4: 00000000000006f0
[ 1255.731145] Stack:
[ 1255.731393] 0000000000000296 0000000011eac412 ffff880010968000 ffff88003bf27c70
[ 1255.732414] 00000001000ecc8c ffff88003bf27d18 ffffffff81781b3a ffff88003bf27c90
[ 1255.733340] 0000000000000282 dead000000000200 0000000000000000 00000001000ecc8c
[ 1255.734267] Call Trace:
[ 1255.734567] [] ? schedule_timeout+0x18a/0x2a0
[ 1255.735296] [] ? trace_event_raw_event_tick_stop+0x100/0x100
[ 1255.736176] [] lock_acquire+0xc2/0x150
[ 1255.736814] [] ? put_back_slot+0x1b/0x70
[ 1255.737467] [] _raw_spin_lock+0x31/0x40
[ 1255.738159] [] ? put_back_slot+0x1b/0x70
[ 1255.738838] [] put_back_slot+0x1b/0x70
[ 1255.739494] [] orangefs_readdir_index_put+0x4b/0x70
[ 1255.740287] [] orangefs_readdir+0x12d/0xd50
[ 1255.740998] [] ? trace_hardirqs_on+0xd/0x10
[ 1255.741688] [] iterate_dir+0x9f/0x130
[ 1255.742339] [] SyS_getdents+0xa0/0x140
[ 1255.743017] [] ? iterate_dir+0x130/0x130
[ 1255.743678] [] entry_SYSCALL_64_fastpath+0x12/0x76
[ 1255.744417] Code: c3 49 81 3c 24 c0 6b ef 81 b8 00 00 00 00 44 0f 44 c0 83 fb 01 0f 87 f0 fe ff ff 89 d8 49 8b 4c c4 08 48 85 c9 0f 84 e0 fe ff ff 81 98 01 00 00 44 8b 1d 97 42 ac 01 41 8b 9a 40 06 00 00 45
[ 1255.747699] RIP [] __lock_acquire+0x1a4/0x1e50
[ 1255.748450] RSP
[ 1255.748898] CR2: ffffffff810b0d68
[ 1255.749293] ---[ end trace 66a9a15ad78b3deb ]---

There's a lot of timing that has to be gotten right across one of these restarts, and all the changes we have made to the upstream kernel module vs. the out-of-tree version have, I guess, gotten some of that timing out of whack. I'm glad you're helping ...

One of the tests I made in the last few days was to basically give up on every op that showed up in service_operation while is_daemon_in_service() was failing. That kept the kernel happy, but was obviously more brutal to the processes (dbench in my test case, people doing work in real life) than managing to wait on as many of the ops as possible.
Anyhow... there are several more patches from both me and Martin in get-next as of today. There's still a problem with his permissions patch, but I guess it is good that we have that call-out now, and we'll get it working properly. And still no patch to adapt to the now-official follow_link change - I hope I'm not giving Stephen Rothwell heartburn over that. He said I could just merge in your fceef393a538 commit, then change or get rid of our follow_link code, and that would be OK...

-Mike

On Sat, Jan 30, 2016 at 1:27 PM, Al Viro wrote:
> On Sat, Jan 30, 2016 at 05:34:13PM +0000, Al Viro wrote:
>
>> And again, how could
>>         /* op uses shared memory */
>>         if (orangefs_get_bufmap_init() == 0) {
>> in service_operation() possibly be true, when we have
>>   * op->uses_shared_memory just checked to be set
>>   * all callers that set it (orangefs_readdir() and wait_for_direct_io())
>> having allocated a slot before calling service_operation() and not releasing
>> it until service_operation() returns
>>   * __orangefs_bufmap not becoming NULL until all slots are freed and
>>   * orangefs_get_bufmap_init() returning 1 unless __orangefs_bufmap is
>> NULL?
>>
>> AFAICS, that code (waiting for daemon to be restarted) is provably never
>> executed.  What am I missing?
>
> While we are at it, what happens to the original code (without refcount changes,
> etc. from my pile - the problem remains with those, but let's look at the
> code at merge from v4.4) if something does read() and gets a signal
> just as the daemon gets to
>         n = copy_from_iter(&op->downcall, downcall_size, iter);
> in ->write_iter(), reporting that it has finished that read?  We were in
> wait_for_matching_downcall(), interruptibly sleeping.  We got woken up by
> signal delivery.  Checked op->op_state; it's still not marked as serviced.
> We check signal_pending() and find it true.  We hit
> orangefs_clean_up_interrupted_operation(), which doesn't do anything to
> op->op_state, and we return -EINTR to service_operation().
> Which returns
> without waiting for anything to wait_for_direct_io(), and we go into
> sending a cancel.  Now the daemon regains the timeslice.  Marks op as
> serviced.  And proceeds to wait on op->io_completion_waitq.
>
> Who's going to wake it up?  orangefs_cancel_op_in_progress() sure as hell
> won't - it has no way to find op, never mind doing wakeups.  The rest of
> wait_for_direct_io() also doesn't wake the daemon up.  How is that supposed
> to work?  Moreover, we issue a cancel; when is it supposed to be processed,
> and how do we tell if it's already been processed?
>
> Who should be waiting for what in case of a cancel being issued just as the
> daemon gets around to reporting success of the original operation?  On the
> protocol level, that is.