From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-lf0-f41.google.com ([209.85.215.41]:33383 "EHLO mail-lf0-f41.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750744AbcBDXa2 (ORCPT ); Thu, 4 Feb 2016 18:30:28 -0500
Received: by mail-lf0-f41.google.com with SMTP id m1so47514612lfg.0 for ; Thu, 04 Feb 2016 15:30:27 -0800 (PST)
MIME-Version: 1.0
In-Reply-To: <20160130182731.GF17997@ZenIV.linux.org.uk>
References: <20160123012808.GK17997@ZenIV.linux.org.uk> <20160123191055.GN17997@ZenIV.linux.org.uk> <20160123214006.GO17997@ZenIV.linux.org.uk> <20160124001615.GT17997@ZenIV.linux.org.uk> <20160124040529.GX17997@ZenIV.linux.org.uk> <20160130173413.GE17997@ZenIV.linux.org.uk> <20160130182731.GF17997@ZenIV.linux.org.uk>
Date: Thu, 4 Feb 2016 18:30:26 -0500
Message-ID:
Subject: Re: Orangefs ABI documentation
From: Mike Marshall
To: Al Viro
Cc: Linus Torvalds , linux-fsdevel , Stephen Rothwell
Content-Type: text/plain; charset=UTF-8
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

Hi Al...

Sorry I didn't get to your WARN_ONs right away; I've been trying to get a better handle on the crashing-the-kernel-when-the-client-core-restarts problem. I've gotten the kdump stuff working, and my feeble crash-foo has improved a little...

I did some testing with the out-of-tree version of the kernel module, and it is actually able to recover pretty well when the client-core is stopped. I looked closer at the client-core code and talked with some of the others who've been around longer than me. The idea behind the restart is rooted in "keeping things going" in production if the client-core were to crash for some reason. On closer inspection I see that the signal handling code in the client-core that simulates a crash on SIGSEGV and SIGABRT is indeed written with sigaction, as the signal(2) man page says it should be.
As for the WARN_ONs: the waitqueue one is easy to hit when the client-core stops and restarts. In the log below you can see precopy_buffers start whining about the client-core, then the client-core restart (the debug mask gets sent back over), and then the WARN_ON in waitqueue being hit:

[ 1239.198976] precopy_buffers: Failed to copy-in buffers. Please make sure that the pvfs2-client is running. -14
[ 1239.198979] precopy_buffers: Failed to copy-in buffers. Please make sure that the pvfs2-client is running. -14
[ 1239.198983] orangefs_file_write_iter: do_readv_writev failed, rc:-14:.
[ 1239.199175] precopy_buffers: Failed to copy-in buffers. Please make sure that the pvfs2-client is running. -14
[ 1239.199177] precopy_buffers: Failed to copy-in buffers. Please make sure that the pvfs2-client is running. -14
[ 1239.199180] orangefs_file_write_iter: do_readv_writev failed, rc:-14:.
[ 1239.199601] precopy_buffers: Failed to copy-in buffers. Please make sure that the pvfs2-client is running. -14
[ 1239.199602] precopy_buffers: Failed to copy-in buffers. Please make sure that the pvfs2-client is running. -14
[ 1239.199604] orangefs_file_write_iter: do_readv_writev failed, rc:-14:.
[ 1239.248239] dispatch_ioctl_command: client debug mask has been been received :0: :0:
[ 1239.248257] dispatch_ioctl_command: client debug array string has been received.
[ 1239.307842] ------------[ cut here ]------------
[ 1239.307847] WARNING: CPU: 0 PID: 1347 at fs/orangefs/waitqueue.c:208 service_operation+0x59f/0x9b0()
[ 1239.307848] Modules linked in: bnep bluetooth ip6t_rpfilter rfkill ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_mangle iptable_security iptable_raw ppdev parport_pc virtio_balloon virtio_console parport 8139too serio_raw pvpanic i2c_piix4 uinput qxl drm_kms_helper ttm drm 8139cp i2c_core virtio_pci ata_generic virtio virtio_ring mii pata_acpi
[ 1239.307870] CPU: 0 PID: 1347 Comm: dbench Not tainted 4.4.0-161988-g237f828-dirty #49
[ 1239.307871] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 1239.307872] 0000000000000000 0000000011eac412 ffff88003bf27cd0 ffffffff8139c84d
[ 1239.307874] ffff88003bf27d08 ffffffff8108e510 ffff880010968000 ffff88001096c1d0
[ 1239.307876] ffff88001096c188 00000000fffffff5 0000000000000000 ffff88003bf27d18
[ 1239.307877] Call Trace:
[ 1239.307881] [] dump_stack+0x19/0x1c
[ 1239.307884] [] warn_slowpath_common+0x80/0xc0
[ 1239.307886] [] warn_slowpath_null+0x1a/0x20
[ 1239.307887] [] service_operation+0x59f/0x9b0
[ 1239.307889] [] ? prepare_to_wait_event+0x100/0x100
[ 1239.307891] [] ? prepare_to_wait_event+0x100/0x100
[ 1239.307893] [] orangefs_readdir+0x172/0xd50
[ 1239.307895] [] ? trace_hardirqs_on+0xd/0x10
[ 1239.307898] [] iterate_dir+0x9f/0x130
[ 1239.307899] [] SyS_getdents+0xa0/0x140
[ 1239.307900] [] ? iterate_dir+0x130/0x130
[ 1239.307903] [] entry_SYSCALL_64_fastpath+0x12/0x76
[ 1239.307904] ---[ end trace 66a9a15ad78b3dea ]---

On the next restart, the kernel crashed. The client-core is restarting
and orangefs_readdir is dealing with the wreckage, trying to give up the slot it was using:

[ 1255.683226] dispatch_ioctl_command: client debug mask has been been received :0: :0:
[ 1255.683245] dispatch_ioctl_command: client debug array string has been received.
[ 1255.711036] BUG: unable to handle kernel paging request at ffffffff810b0d68
[ 1255.711911] IP: [] __lock_acquire+0x1a4/0x1e50
[ 1255.712605] PGD 1c13067 PUD 1c14063 PMD 10001e1
[ 1255.713147] Oops: 0003 [#1]
[ 1255.713522] Modules linked in: bnep bluetooth ip6t_rpfilter rfkill ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_mangle iptable_security iptable_raw ppdev parport_pc virtio_balloon virtio_console parport 8139too serio_raw pvpanic i2c_piix4 uinput qxl drm_kms_helper ttm drm 8139cp i2c_core virtio_pci ata_generic virtio virtio_ring mii pata_acpi
[ 1255.720073] CPU: 0 PID: 1347 Comm: dbench Tainted: G W 4.4.0-161988-g237f828-dirty #49
[ 1255.721160] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 1255.721845] task: ffff880036894d00 ti: ffff88003bf24000 task.ti: ffff88003bf24000
[ 1255.722674] RIP: 0010:[] [] __lock_acquire+0x1a4/0x1e50
[ 1255.723669] RSP: 0018:ffff88003bf27c28 EFLAGS: 00010082
[ 1255.724289] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff810b0bd0
[ 1255.725143] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff880003067d58
[ 1255.726002] RBP: ffff88003bf27ce8 R08: 0000000000000001 R09: 0000000000000000
[ 1255.726852] R10: ffff880036894d00 R11: 0000000000000000 R12: ffff880003067d58
[ 1255.727731] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1255.728585] FS: 00007fbe5f7d7740(0000) GS:ffffffff81c28000(0000) knlGS:0000000000000000
[ 1255.729593] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1255.730288] CR2:
ffffffff810b0d68 CR3: 00000000160da000 CR4: 00000000000006f0
[ 1255.731145] Stack:
[ 1255.731393] 0000000000000296 0000000011eac412 ffff880010968000 ffff88003bf27c70
[ 1255.732414] 00000001000ecc8c ffff88003bf27d18 ffffffff81781b3a ffff88003bf27c90
[ 1255.733340] 0000000000000282 dead000000000200 0000000000000000 00000001000ecc8c
[ 1255.734267] Call Trace:
[ 1255.734567] [] ? schedule_timeout+0x18a/0x2a0
[ 1255.735296] [] ? trace_event_raw_event_tick_stop+0x100/0x100
[ 1255.736176] [] lock_acquire+0xc2/0x150
[ 1255.736814] [] ? put_back_slot+0x1b/0x70
[ 1255.737467] [] _raw_spin_lock+0x31/0x40
[ 1255.738159] [] ? put_back_slot+0x1b/0x70
[ 1255.738838] [] put_back_slot+0x1b/0x70
[ 1255.739494] [] orangefs_readdir_index_put+0x4b/0x70
[ 1255.740287] [] orangefs_readdir+0x12d/0xd50
[ 1255.740998] [] ? trace_hardirqs_on+0xd/0x10
[ 1255.741688] [] iterate_dir+0x9f/0x130
[ 1255.742339] [] SyS_getdents+0xa0/0x140
[ 1255.743017] [] ? iterate_dir+0x130/0x130
[ 1255.743678] [] entry_SYSCALL_64_fastpath+0x12/0x76
[ 1255.744417] Code: c3 49 81 3c 24 c0 6b ef 81 b8 00 00 00 00 44 0f 44 c0 83 fb 01 0f 87 f0 fe ff ff 89 d8 49 8b 4c c4 08 48 85 c9 0f 84 e0 fe ff ff 81 98 01 00 00 44 8b 1d 97 42 ac 01 41 8b 9a 40 06 00 00 45
[ 1255.747699] RIP [] __lock_acquire+0x1a4/0x1e50
[ 1255.748450] RSP
[ 1255.748898] CR2: ffffffff810b0d68
[ 1255.749293] ---[ end trace 66a9a15ad78b3deb ]---

There's a lot of timing that has to be gotten right across one of these restarts, and all the changes we have made to the upstream kernel module vs. the out-of-tree version have, I guess, gotten some of that timing out of whack. I'm glad you're helping ...

One of the tests I made in the last few days was to basically give up on every op that showed up in service_operation while is_daemon_in_service() was failing. That kept the kernel happy, but was obviously more brutal to the processes (dbench in my test case, people doing work in real life) than managing to wait on as many of the ops as possible.
Anyhow... there are several more patches from both me and Martin in get-next as of today. There's still a problem with his permissions patch, but I guess it is good that we have that call-out now, and we'll get it working properly. And still no patch to adapt to the now-official follow_link change - I hope I'm not giving Stephen Rothwell heartburn over that. He said I could just merge in your fceef393a538 commit, then change or get rid of our follow_link code, and that would be OK...

-Mike

On Sat, Jan 30, 2016 at 1:27 PM, Al Viro wrote:
> On Sat, Jan 30, 2016 at 05:34:13PM +0000, Al Viro wrote:
>
>> And again, how could
>>         /* op uses shared memory */
>>         if (orangefs_get_bufmap_init() == 0) {
>> in service_operation() possibly be true, when we have
>>   * op->uses_shared_memory just checked to be set
>>   * all callers that set it (orangefs_readdir() and wait_for_direct_io())
>> having allocated a slot before calling service_operation() and not releasing
>> it until service_operation() returns
>>   * __orangefs_bufmap not becoming NULL until all slots are freed and
>>   * orangefs_get_bufmap_init() returning 1 unless __orangefs_bufmap is
>> NULL?
>>
>> AFAICS, that code (waiting for daemon to be restarted) is provably never
>> executed.  What am I missing?
>
> While we are at it, what happens to the original code (without refcount changes,
> etc. from my pile - the problem remains with those, but let's look at the
> code at merge from v4.4) if something does read() and gets a signal
> just as the daemon gets to
>         n = copy_from_iter(&op->downcall, downcall_size, iter);
> in ->write_iter(), reporting that it has finished that read?  We were in
> wait_for_matching_downcall(), interruptibly sleeping.  We got woken up by
> signal delivery.  Checked op->op_state; it's still not marked as serviced.
> We check signal_pending() and find it true.  We hit
> orangefs_clean_up_interrupted_operation(), which doesn't do anything to
> op->op_state, and we return -EINTR to service_operation().
> Which returns
> without waiting for anything to wait_for_direct_io(), and we go into
> sending a cancel.  Now the daemon regains the timeslice.  Marks op as
> serviced.  And proceeds to wait on op->io_completion_waitq.
>
> Who's going to wake it up?  orangefs_cancel_op_in_progress() sure as hell
> won't - it has no way to find op, never mind doing wakeups.  The rest of
> wait_for_direct_io() also doesn't wake the daemon up.  How is that supposed
> to work?  Moreover, we issue a cancel; when is it supposed to be processed,
> and how do we tell if it's already been processed?
>
> Who should be waiting for what in case of a cancel being issued just as the
> daemon gets around to reporting success of the original operation?  On the
> protocol level, that is.