From: Cong Wang <xiyou.wangcong@gmail.com>
To: "Paul E . McKenney" <paulmck@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>,
	syzbot <syzbot+46f513c3033d592409d2@syzkaller.appspotmail.com>,
	David Miller <davem@davemloft.net>,
	Jamal Hadi Salim <jhs@mojatatu.com>,
	Jiri Pirko <jiri@resnulli.us>, Jakub Kicinski <kuba@kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Linux Kernel Network Developers <netdev@vger.kernel.org>,
	syzkaller-bugs <syzkaller-bugs@googlegroups.com>
Subject: Re: WARNING: ODEBUG bug in tcindex_destroy_work (3)
Date: Sat, 28 Mar 2020 12:53:43 -0700
Message-ID: <CAM_iQpU+1as_RAE64wfq+rWcCb16_amFP3V4rZVFRr29SfwD4Q@mail.gmail.com>
In-Reply-To: <20200325185815.GW19865@paulmck-ThinkPad-P72>

On Wed, Mar 25, 2020 at 11:58 AM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Wed, Mar 25, 2020 at 11:36:16AM -0700, Cong Wang wrote:
> > On Mon, Mar 23, 2020 at 6:01 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> > >
> > > Cong Wang <xiyou.wangcong@gmail.com> writes:
> > > > On Mon, Mar 23, 2020 at 2:14 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> > > >> > We use an ordered workqueue for tc filters, so these two
> > > >> > work items are executed in the same order as they are queued.
> > > >>
> > > >> The workqueue is ordered, but look how the work is queued on the work
> > > >> queue:
> > > >>
> > > >> tcf_queue_work()
> > > >>   queue_rcu_work()
> > > >>     call_rcu(&rwork->rcu, rcu_work_rcufn);
> > > >>
> > > >> So after the grace period elapses rcu_work_rcufn() queues it in the
> > > >> actual work queue.
> > > >>
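(The path Thomas refers to looks roughly like this in kernel/workqueue.c;
paraphrased with comments added, not the exact upstream code:)

        static void rcu_work_rcufn(struct rcu_head *rcu)
        {
                struct rcu_work *rwork = container_of(rcu, struct rcu_work, rcu);

                /* Runs only after the grace period: only now does the work
                 * item actually reach the (ordered) workqueue. */
                local_irq_disable();
                __queue_work(WORK_CPU_UNBOUND, rwork->wq, &rwork->work);
                local_irq_enable();
        }

        bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork)
        {
                struct work_struct *work = &rwork->work;

                if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
                        rwork->wq = wq;
                        /* The ordered workqueue never sees the work here; it
                         * is handed to RCU, which makes no ordering promise
                         * across callbacks queued from different CPUs. */
                        call_rcu(&rwork->rcu, rcu_work_rcufn);
                        return true;
                }

                return false;
        }
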
> > > >> Now tcindex_destroy() is invoked via tcf_proto_destroy() which can be
> > > >> invoked from preemptible context. Now assume the following:
> > > >>
> > > >> CPU0
> > > >>   tcf_queue_work()
> > > >>     tcf_queue_work(&r->rwork, tcindex_destroy_rexts_work);
> > > >>
> > > >> -> Migration
> > > >>
> > > >> CPU1
> > > >>    tcf_queue_work(&p->rwork, tcindex_destroy_work);
> > > >>
> > > >> So your RCU callbacks can be placed on different CPUs which obviously
> > > >> has no ordering guarantee at all. See also:
> > > >
> > > > Good catch!
> > > >
> > > > I thought about this when I added this ordered workqueue, but it
> > > > seems I misinterpreted max_active; despite max_active==1, more than
> > > > one work item can still be queued here, from different CPUs.
> > >
> > > The workqueue is not the problem; it works perfectly fine. The way
> > > the work gets queued is the issue.
> >
> > Well, an RCU work item is also a work item, so from a user's perspective
> > the ordered workqueue should apply to RCU work items too. Users should
> > not need to learn that queue_rcu_work() is actually a call_rcu(), which
> > does not preserve the ordering even for an ordered workqueue.
>
> And the workqueues might well guarantee the ordering in cases where the
> pair of RCU callbacks is invoked in a known order.  But that workqueue
> ordering guarantee does not extend upstream to RCU, nor do I know of a
> reasonable way to make this happen within the confines of RCU.
>
> If you have ideas, please do not keep them a secret, but please also
> understand that call_rcu() must meet some pretty severe performance and
> scalability constraints.
>
> I suppose that queue_rcu_work() could track outstanding call_rcu()
> invocations, and (one way or another) defer the second queue_rcu_work()
> if a first one is still pending from the current task, but that might not
> make the common-case user of queue_rcu_work() all that happy.  But perhaps
> there is a way to restrict these semantics to ordered workqueues.  In that
> case, one could imagine the second and subsequent too-quick calls to
> queue_rcu_work() using the rcu_head structure's ->next field to queue these
> too-quick callbacks, and then having rcu_work_rcufn() check for queued
> too-quick callbacks, queuing the first one.
>
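If I read this right, the idea is something like the following (very rough
pseudocode; none of these fields or helpers exist today):

        /* Hypothetical sketch of the "defer too-quick callers" idea above. */
        bool queue_rcu_work_ordered(struct workqueue_struct *wq, struct rcu_work *rwork)
        {
                struct rcu_work *pending = current_pending_rcu_work();

                if (pending) {
                        /* A call_rcu() from this task is still in flight:
                         * chain this request via the rcu_head's ->next field
                         * instead of starting a second, unordered grace
                         * period.  rcu_work_rcufn() would later pick up the
                         * first chained request and issue its call_rcu(),
                         * preserving submission order at the cost of one
                         * grace period per request. */
                        chain_rcu_work(pending, rwork);
                        return true;
                }
                return queue_rcu_work(wq, rwork);
        }
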
> But I must defer to Tejun on this one.
>
> And one additional caution...  This would meter out ordered
> queue_rcu_work() requests at a rate of no faster than one per RCU
> grace period.  The queue might build up, resulting in long delays.
> Are you sure that your use case can live with this?

I don't know. I guess we might be able to add a call_rcu() variant that
takes a CPU as a parameter, so that all of these call_rcu() callbacks are
queued on the same CPU, which would guarantee the ordering. But of course
we would need to figure out which CPU to use. :)
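
Something along these lines, i.e. a hypothetical queue_rcu_work_on() built
on an equally hypothetical call_rcu_on(); neither exists upstream, and this
is sketched as if it lived next to queue_rcu_work() in kernel/workqueue.c,
only to illustrate the idea:

        /* Hypothetical: pin the RCU callback, and therefore the eventual
         * queueing of the work item, to one fixed CPU. */
        bool queue_rcu_work_on(int cpu, struct workqueue_struct *wq,
                               struct rcu_work *rwork)
        {
                struct work_struct *work = &rwork->work;

                if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
                        rwork->wq = wq;
                        /* Callbacks queued on the same CPU are invoked in
                         * order, so the work items would reach the ordered
                         * workqueue in submission order as well. */
                        call_rcu_on(cpu, &rwork->rcu, rcu_work_rcufn);
                        return true;
                }
                return false;
        }
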

Just my two cents.


>
> > > > I don't know how to fix this properly, I think essentially RCU work
> > > > should be guaranteed the same ordering as regular work. But this
> > > > seems impossible unless RCU offers some API to achieve that.
> > >
> > > I don't think that's possible w/o putting constraints on the flexibility
> > > of RCU (Paul of course might disagree).
> > >
> > > I assume that the filters which hang off tcindex_data::perfect and
> > > tcindex_data::p must be freed before tcindex_data, right?
> > >
> > > Refcounting of tcindex_data should do the trick. I.e. any element which
> > > you add to a tcindex_data instance takes a reference, and when that
> > > element is destroyed, the rcu/work callback drops the reference; once it
> > > reaches 0, tcindex_data is freed.
> >
> > Yeah, but the problem is bigger than just the tcindex filter; we have
> > many places that make the same assumption about ordering.
>
> But don't you also have a situation where there might be a large group
> of queue_rcu_work() invocations whose order doesn't matter, followed by a
> single queue_rcu_work() invocation that must be ordered after the earlier
> group?  If so, ordering -all- of these invocations might be overkill.
>
> Or did I misread your code?

You are right. Previously I thought all non-trivial tc filters would need
to address this ordering bug, but it turns out that probably only tcindex
needs it, because most of the others actually use linked lists. As long as
we remove the entry from the list before tcf_queue_work(), it is fine to
free the list head before the entries that were on it (see the sketch
below).
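
The common pattern in those list-based filters is roughly the following
(simplified sketch; the struct and function names are illustrative, not
taken from any one filter):

        static void example_filter_delete(struct tcf_proto *tp,
                                          struct example_filter *f)
        {
                /* Unlink first, so f can no longer be reached via the list. */
                list_del_rcu(&f->link);
                tcf_unbind_filter(tp, &f->res);
                /*
                 * Defer the actual free until after a grace period.  Once f
                 * is off the list, it no longer matters whether the list
                 * head (freed by the filter's destroy work) or f itself is
                 * freed first.
                 */
                tcf_queue_work(&f->rwork, example_delete_work);
        }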

I just sent out a minimal fix using the refcnt.
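
For reference, the rough shape of that idea is below (illustrative sketch
only, with assumed field and helper names; the actual patch on the list is
authoritative):

        struct tcindex_data {
                struct tcindex_filter_result *perfect;
                struct tcindex_filter __rcu **h;
                refcount_t refcnt;      /* pins the struct until all filter
                                         * results have been freed */
                /* ... */
        };

        static void tcindex_data_get(struct tcindex_data *p)
        {
                refcount_inc(&p->refcnt);
        }

        static void tcindex_data_put(struct tcindex_data *p)
        {
                /* Whichever work callback runs last drops the final reference
                 * and frees the struct, so the relative order of the two work
                 * items no longer matters. */
                if (refcount_dec_and_test(&p->refcnt))
                        kfree(p);
        }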

Thanks!

