From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753318AbbJaLeI (ORCPT <rfc822;w@1wt.eu>);
	Sat, 31 Oct 2015 07:34:08 -0400
Received: from mail-qg0-f41.google.com ([209.85.192.41]:35784 "EHLO
	mail-qg0-f41.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1752934AbbJaLeF (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Sat, 31 Oct 2015 07:34:05 -0400
Date: Sat, 31 Oct 2015 07:34:00 -0400
From: Jeff Layton <jlayton@poochiereds.net>
To: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org, bfields@fieldses.org,
        Michael Skralivetsky <michael.skralivetsky@primarydata.com>,
        Chris Worley <chris.worley@primarydata.com>,
        Trond Myklebust <trond.myklebust@primarydata.com>,
        Lai Jiangshan <laijs@cn.fujitsu.com>
Subject: Re: timer code oops when calling mod_delayed_work
Message-ID: <20151031073400.2cf05d77@tlielax.poochiereds.net>
In-Reply-To: <20151031020012.GH3582@mtj.duckdns.org>
References: <20151029103113.2f893924@tlielax.poochiereds.net>
	<20151029135836.02ad9000@synchrony.poochiereds.net>
	<20151031020012.GH3582@mtj.duckdns.org>
X-Mailer: Claws Mail 3.12.0 (GTK+ 2.24.28; x86_64-redhat-linux-gnu)
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Sat, 31 Oct 2015 11:00:12 +0900
Tejun Heo <tj@kernel.org> wrote:

> (cc'ing Lai)
> 
> Hello, Jeff.
> 
> On Thu, Oct 29, 2015 at 01:58:36PM -0400, Jeff Layton wrote:
> > crash> p cache_cleaner
> > cache_cleaner = $12 = {
> >   work = {
> >     data = {
> >       counter = 0xfffffffe1
> 
> If I'm not mistaken, PENDING, flush color 14, OFFQ and POOL_NONE.
> 
> >     }, 
> >     entry = {
> >       next = 0xffffffffa03623c8 <cache_cleaner+8>, 
> >       prev = 0xffffffffa03623c8 <cache_cleaner+8>
> 
> Empty entry.
> 
> >     }, 
> >     func = 0xffffffffa03333c0 <cache_cleaner_func>
> >   }, 
> >   timer = {
> >     entry = {
> >       next = 0x0, 
> >       pprev = 0xffff88085fd0eaf8
> >     }, 
> >     expires = 0x100021e99, 
> >     function = 0xffffffff810b66a0 <delayed_work_timer_fn>, 
> >     data = 0xffffffffa03623c0, 
> >     flags = 0x200014, 
> >     slack = 0xffffffff, 
> >     start_pid = 0x0, 
> >     start_site = 0x0, 
> >     start_comm = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
> >   }, 
> >   wq = 0xffff88085f48fc00, 
> >   cpu = 0x14
> > }
> > 
> > So the PENDING bit is set (lowest bit in data.counter), and timer->entry.pprev
> > pprev pointer is not NULL (so timer_pending is true). I also see that
> > there are several nfsd threads running the shrinker at the same time.
> > 
> > There is one potential problem that I see, but I'd appreciate someone
> > sanity checking me on this. Here is mod_delayed_work_on:
> ...
> > ...and here is the beginning of try_to_grab_pending:
> > 
> > ------------------[snip]------------------------
> >         /* try to steal the timer if it exists */
> >         if (is_dwork) {
> >                 struct delayed_work *dwork = to_delayed_work(work);
> > 
> >                 /*
> >                  * dwork->timer is irqsafe.  If del_timer() fails, it's
> >                  * guaranteed that the timer is not queued anywhere and not
> >                  * running on the local CPU.
> >                  */
> >                 if (likely(del_timer(&dwork->timer)))
> >                         return 1;
> >         }
> > 
> >         /* try to claim PENDING the normal way */
> >         if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)))
> >                 return 0;
> > ------------------[snip]------------------------
> > 
> > 
> > ...so if del_timer returns true, we'll return 1 from
> > try_to_grab_pending without actually setting the
> > WORK_STRUCT_PENDING_BIT, and then will end up calling
> > __queue_delayed_work.
> > 
> > That seems wrong to me -- shouldn't we be ensuring that that bit is set
> > when returning 1 from try_to_grab_pending to guard against concurrent
> > queue_delayed_work_on calls?
> 
> But if try_to_grab_pending() succeeded at stealing dwork->timer, it's
> known that the PENDING bit must already be set.  IOW, the bit is
> stolen together with the timer.
> 
> Heh, this one is tricky.  Yeah, try_to_grab_pending() missing PENDING
> would explain the failure but I can't see how it'd leak at the moment.
> 

Thanks Tejun. Yeah, I realized that after sending the response above.

If you successfully delete the timer the timer then the PENDING bit
should already be set. Might be worth throwing in something like this,
just before the return 1:

    WARN_ON(!test_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)))

...but I doubt it would fire. I think it's likely that the bug is
elsewhere.

The other thing is that we've had this code in place for a couple of
years now, and this is the first time I've seen an oops like this. I
suspect that this may be a recent regression, but I don't know that for
sure.

I have asked Chris and Michael to see if they can bisect it down, but
it may be a bit before they can get that done. Any insight you might
have in the meantime would helpful.
-- 
Jeff Layton <jlayton@poochiereds.net>