Date: Sat, 6 Feb 2016 19:42:10 +0000
From: Al Viro
To: Mike Marshall
Cc: Linus Torvalds, linux-fsdevel, Stephen Rothwell
Subject: Re: Orangefs ABI documentation

On Thu, Feb 04, 2016 at 06:30:26PM -0500, Mike Marshall wrote:

> As for the WARN_ONs, the waitqueue one is easy to hit when the
> client-core stops and restarts, you can see here where precopy_buffers
> started whining about the client-core, you can see that the client
> core restarted when the debug mask got sent back over, and then
> the WARN_ON in waitqueue gets hit:
>
> [ 1239.198976] precopy_buffers: Failed to copy-in buffers. Please make
> sure that the pvfs2-client is running. -14
> [ 1239.198979] precopy_buffers: Failed to copy-in buffers. Please make
> sure that the pvfs2-client is running. -14
> [ 1239.198983] orangefs_file_write_iter: do_readv_writev failed, rc:-14:.
> [ 1239.199175] precopy_buffers: Failed to copy-in buffers. Please make
> sure that the pvfs2-client is running. -14
> [ 1239.199177] precopy_buffers: Failed to copy-in buffers. Please make
> sure that the pvfs2-client is running. -14
> [ 1239.199180] orangefs_file_write_iter: do_readv_writev failed, rc:-14:.
> [ 1239.199601] precopy_buffers: Failed to copy-in buffers. Please make
> sure that the pvfs2-client is running. -14
> [ 1239.199602] precopy_buffers: Failed to copy-in buffers. Please make
> sure that the pvfs2-client is running. -14
> [ 1239.199604] orangefs_file_write_iter: do_readv_writev failed, rc:-14:.
> [ 1239.248239] dispatch_ioctl_command: client debug mask has been been
> received :0: :0:
> [ 1239.248257] dispatch_ioctl_command: client debug array string has
> been received.
> [ 1239.307842] ------------[ cut here ]------------
> [ 1239.307847] WARNING: CPU: 0 PID: 1347 at
> fs/orangefs/waitqueue.c:208 service_operation+0x59f/0x9b0()
> [ 1239.307848] Modules linked in: bnep bluetooth ip6t_rpfilter rfkill
> ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6
> nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack
> ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables
> ip6table_mangle ip6table_security ip6table_raw ip6table_filter
> ip6_tables iptable_mangle iptable_security iptable_raw ppdev parport_pc
> virtio_balloon virtio_console parport 8139too serio_raw pvpanic
> i2c_piix4 uinput qxl drm_kms_helper ttm drm 8139cp i2c_core
> virtio_pci ata_generic virtio virtio_ring mii pata_acpi
> [ 1239.307870] CPU: 0 PID: 1347 Comm: dbench Not tainted
> 4.4.0-161988-g237f828-dirty #49
> [ 1239.307871] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
> [ 1239.307872] 0000000000000000 0000000011eac412 ffff88003bf27cd0
> ffffffff8139c84d
> [ 1239.307874] ffff88003bf27d08 ffffffff8108e510 ffff880010968000
> ffff88001096c1d0
> [ 1239.307876] ffff88001096c188 00000000fffffff5 0000000000000000
> ffff88003bf27d18
> [ 1239.307877] Call Trace:
> [ 1239.307881] [] dump_stack+0x19/0x1c
> [ 1239.307884] [] warn_slowpath_common+0x80/0xc0
> [ 1239.307886] [] warn_slowpath_null+0x1a/0x20
> [ 1239.307887] [] service_operation+0x59f/0x9b0
> [ 1239.307889] [] ? prepare_to_wait_event+0x100/0x100
> [ 1239.307891] [] ? prepare_to_wait_event+0x100/0x100
> [ 1239.307893] [] orangefs_readdir+0x172/0xd50
> [ 1239.307895] [] ? trace_hardirqs_on+0xd/0x10
> [ 1239.307898] [] iterate_dir+0x9f/0x130
> [ 1239.307899] [] SyS_getdents+0xa0/0x140
> [ 1239.307900] [] ? iterate_dir+0x130/0x130
> [ 1239.307903] [] entry_SYSCALL_64_fastpath+0x12/0x76
> [ 1239.307904] ---[ end trace 66a9a15ad78b3dea ]---

Bloody hell.  OK, something's seriously fishy here.  You have
service_operation() called from orangefs_readdir().  It had been
immediately preceded by
	ret = orangefs_readdir_index_get(&bufmap, &buffer_index);
returning a success.  Moreover, *all* paths after return from
service_operation() pass through some instance of
	orangefs_readdir_index_put(bufmap, buffer_index);
(possibly hidden behind readdir_handle_dtor(bufmap, &rhandle)).
We'd bloody better _not_ have had bufmap freed by that point, right?

OTOH, this service_operation() has just seen orangefs_get_bufmap_init()
returning 0.  Since

int orangefs_get_bufmap_init(void)
{
	return __orangefs_bufmap ? 1 : 0;
}

we have observed NULL __orangefs_bufmap at that point.

__orangefs_bufmap is static, its address is never taken and the only
assignments to it are

static void
orangefs_bufmap_unref(struct orangefs_bufmap *bufmap)
{
	if (atomic_dec_and_lock(&bufmap->refcnt, &orangefs_bufmap_lock)) {
		__orangefs_bufmap = NULL;
		spin_unlock(&orangefs_bufmap_lock);

		orangefs_bufmap_unmap(bufmap);
		orangefs_bufmap_free(bufmap);
	}
}

and

	bufmap = orangefs_bufmap_alloc(user_desc);
	if (!bufmap)
		goto out;

	ret = orangefs_bufmap_map(bufmap, user_desc);
	if (ret)
		goto out_free_bufmap;

	spin_lock(&orangefs_bufmap_lock);
	if (__orangefs_bufmap) {
		spin_unlock(&orangefs_bufmap_lock);
		gossip_err("orangefs: error: bufmap already initialized.\n");
		ret = -EALREADY;
		goto out_unmap_bufmap;
	}
	__orangefs_bufmap = bufmap;
	spin_unlock(&orangefs_bufmap_lock);

The latter is definitely *not* making it NULL - that's daemon asking to
install a new one.  And the former is unconditionally followed by
kfree(bufmap) (via orangefs_bufmap_free()).
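(The ref side of that pair isn't quoted above; the body below is a
sketch of what it has to amount to, not a verbatim copy from
orangefs-bufmap.c, so the exact details may differ:

static struct orangefs_bufmap *orangefs_bufmap_ref(void)
{
	struct orangefs_bufmap *bufmap = NULL;

	/* grab the currently installed bufmap, if any, and bump its
	 * refcount while still under orangefs_bufmap_lock */
	spin_lock(&orangefs_bufmap_lock);
	if (__orangefs_bufmap) {
		bufmap = __orangefs_bufmap;
		atomic_inc(&bufmap->refcnt);
	}
	spin_unlock(&orangefs_bufmap_lock);
	return bufmap;
}

A non-NULL return means the caller holds a reference that is supposed
to keep the thing alive until the matching orangefs_bufmap_unref().)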
IOW, if this WARN_ON() is triggered, you've already gotten to freeing
the damn thing.  And subsequent crash appears to match that theory.
The really weird thing is that orangefs_readdir_index_get() increments
the refcount, so having hit that
	__orangefs_bufmap = NULL;
while we sat in service_operation() means that refcounting of bufmap
got broken somehow...

orangefs_bufmap_alloc() creates it with refcount 1 and is very shortly
followed either by orangefs_bufmap_free() or by __orangefs_bufmap
changing from NULL to the pointer to created instance.

orangefs_bufmap_ref() either returns NULL or non-NULL value equal to
__orangefs_bufmap, in the latter case having bumped its refcount.

orangefs_bufmap_unref() drops refcount on its argument and if it has
reached 0, clears __orangefs_bufmap and frees the argument.  That's a
bit of a red flag - either we are always called with argument equal to
__orangefs_bufmap, or we might end up with a leak...

Nothing else changes their refcounts.  In
orangefs_bufmap_{size,shift}_query() a successful orangefs_bufmap_ref()
is immediately followed by orangefs_bufmap_unref() (incidentally, both
might as well have just grabbed orangefs_bufmap_lock, checked
__orangefs_bufmap and picked its ->desc_{size,shift} before dropping
the lock - no need to mess with refcounts in those two; see the sketch
further down).

Remaining callers of orangefs_bufmap_ref() are orangefs_bufmap_get()
and orangefs_readdir_index_get().  Those begin with grabbing a
reference (and failing if __orangefs_bufmap was NULL) and eventually
either returning a zero and storing the acquired pointer in *mapp, or
dropping the reference and returning non-zero.  More serious red flag,
BTW - *mapp is still set in the latter case, which might end up with
confused callers...

Let's see... bufmap_get is called in wait_for_direct_io() and returning
a negative is followed by buggering off to

	if (buffer_index >= 0) {
		orangefs_bufmap_put(bufmap, buffer_index);
		gossip_debug(GOSSIP_FILE_DEBUG,
			     "%s(%pU): PUT buffer_index %d\n",
			     __func__, handle, buffer_index);
		buffer_index = -1;
	}

Uh-oh...
  1) can that thing return a positive value?
  2) can it end up storing something in buffer_index and still failing?
  3) can we get there with non-negative buffer_index?

OK, (1) and (2) translate into the same question for wait_for_a_slot(),
which returns 0, -EINTR or -ETIMEDOUT and only overwrites buffer_index
in case when it returns 0.  And (3)...  Fuck.  (3) is possible.  Look:

	if (ret == -EAGAIN && op_state_purged(new_op)) {
		orangefs_bufmap_put(bufmap, buffer_index);
		gossip_debug(GOSSIP_FILE_DEBUG,
			     "%s:going to repopulate_shared_memory.\n",
			     __func__);
		goto populate_shared_memory;
	}

in wait_for_direct_io() sends us to populate_shared_memory with
buffer_index unchanged despite the fact that we'd done
orangefs_bufmap_put() on it.  And there we hit

	/* get a shared buffer index */
	ret = orangefs_bufmap_get(&bufmap, &buffer_index);
	if (ret < 0) {
		gossip_debug(GOSSIP_FILE_DEBUG,
			     "%s: orangefs_bufmap_get failure (%ld)\n",
			     __func__, (long)ret);
		goto out;

with failure followed by _another_ orangefs_bufmap_put() around out:

OK, there's your buggered refcounting.  The minimal fix is to slap
	buffer_index = -1;
right after the orangefs_bufmap_put() in there.  The
orangefs_readdir_index_get() caller simply buggers off on failure - no
problem there, and longer term that's what I'd suggest doing in
wait_for_direct_io() as well.
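FWIW, the lock-based variant of the {size,shift} queries suggested in
the parenthesis above would be something along these lines (untested
sketch; it assumes the current int-returning signature and the
->desc_size field, with the shift variant being the obvious analogue):

int orangefs_bufmap_size_query(void)
{
	struct orangefs_bufmap *bufmap;
	int size = 0;

	/* no need to take and drop a reference just to read one field -
	 * peek at the installed bufmap under orangefs_bufmap_lock */
	spin_lock(&orangefs_bufmap_lock);
	bufmap = __orangefs_bufmap;
	if (bufmap)
		size = bufmap->desc_size;
	spin_unlock(&orangefs_bufmap_lock);
	return size;
}

That takes the refcount out of the picture for those two callers
entirely.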
Anyway, could you try this on top of your for-next and see if you can
reproduce either WARN_ON?

diff --git a/fs/orangefs/file.c b/fs/orangefs/file.c
index d865b58..40b3805 100644
--- a/fs/orangefs/file.c
+++ b/fs/orangefs/file.c
@@ -210,6 +210,7 @@ populate_shared_memory:
 	 */
 	if (ret == -EAGAIN && op_state_purged(new_op)) {
 		orangefs_bufmap_put(bufmap, buffer_index);
+		buffer_index = -1;
 		gossip_debug(GOSSIP_FILE_DEBUG,
 			     "%s:going to repopulate_shared_memory.\n",
 			     __func__);