From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: 
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1751052AbdAQVBO (ORCPT ); Tue, 17 Jan 2017 16:01:14 -0500
Received: from mx0a-001b2d01.pphosted.com ([148.163.156.1]:35004 "EHLO
	mx0a-001b2d01.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1750997AbdAQVBL (ORCPT ); Tue, 17 Jan 2017 16:01:11 -0500
Date: Tue, 17 Jan 2017 12:34:36 -0800
From: "Paul E. McKenney"
To: Paolo Bonzini
Cc: Dmitry Vyukov, Steve Rutherford, syzkaller, Radim Krčmář, KVM list, LKML
Subject: Re: kvm: use-after-free in process_srcu
Reply-To: paulmck@linux.vnet.ibm.com
References: <754246063.9562871.1484603305281.JavaMail.zimbra@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: 
User-Agent: Mutt/1.5.21 (2010-09-15)
X-TM-AS-GCONF: 00
X-Content-Scanned: Fidelis XPS MAILER
x-cbid: 17011720-0004-0000-0000-000011512B62
X-IBM-SpamModules-Scores: 
X-IBM-SpamModules-Versions: BY=3.00006451; HX=3.00000240; KW=3.00000007;
	PH=3.00000004; SC=3.00000199; SDB=6.00808993; UDB=6.00394058; IPR=6.00586351;
	BA=6.00005066; NDR=6.00000001; ZLA=6.00000005; ZF=6.00000009; ZB=6.00000000;
	ZP=6.00000000; ZH=6.00000000; ZU=6.00000002; MB=3.00013953; XFM=3.00000011;
	UTC=2017-01-17 20:34:39
X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused
x-cbparentid: 17011720-0005-0000-0000-00007C409A35
Message-Id: <20170117203436.GC5238@linux.vnet.ibm.com>
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:,,
	definitions=2017-01-17_13:,, signatures=0
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0
	spamscore=0 suspectscore=7 malwarescore=0 phishscore=0 adultscore=0
	bulkscore=0 classifier=spam adjust=0 reason=mlx scancount=1
	engine=8.0.1-1612050000 definitions=main-1701170272
Sender: linux-kernel-owner@vger.kernel.org
List-ID: 
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Jan 17, 2017 at 01:03:28PM +0100, Paolo Bonzini wrote:
> 
> 
> On 17/01/2017 12:13, Dmitry Vyukov wrote:
> > On Tue, Jan 17, 2017 at 12:08 PM, Paolo Bonzini wrote:
> >>
> >>
> >> On 17/01/2017 10:56, Dmitry Vyukov wrote:
> >>>> I am seeing use-after-frees in process_srcu as struct srcu_struct is
> >>>> already freed.  Before freeing struct srcu_struct, the code does
> >>>> cleanup_srcu_struct(&kvm->irq_srcu).  We also tried to do:
> >>>>
> >>>> +	srcu_barrier(&kvm->irq_srcu);
> >>>> 	cleanup_srcu_struct(&kvm->irq_srcu);
> >>>>
> >>>> It reduced the rate of use-after-frees, but did not eliminate them
> >>>> completely.  The full thread is here:
> >>>> https://groups.google.com/forum/#!msg/syzkaller/i48YZ8mwePY/0PQ8GkQTBwAJ
> >>>>
> >>>> Does Paolo's fix above make sense to you?  Namely adding
> >>>> flush_delayed_work(&sp->work) to cleanup_srcu_struct()?

Yes, we do need a flush_delayed_work(), good catch!  But doing multiple
of them should not be necessary, because there shouldn't be any callbacks
at all once the srcu_barrier() returns, and the only time SRCU queues
more work is if there is at least one callback pending.

The code is making sure that no new call_srcu() invocations happen
before it does the srcu_barrier(), right?

So if you are seeing failures even with the single flush_delayed_work(),
it would be interesting to set a flag in the srcu_struct at
cleanup_srcu_struct() time, and then splat if srcu_reschedule() does its
queue_delayed_work() when that flag is set.
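For concreteness, an untested sketch of that debugging instrumentation is
below.  It is written against the old single-file kernel/rcu/srcu.c, and the
field name "cleanup_started" is invented here purely for illustration;
srcu_struct has no such field:

	/*
	 * Untested debugging sketch only; "cleanup_started" is an invented
	 * field name, not something that exists in struct srcu_struct.
	 */
	struct srcu_struct {
		/* ... existing fields ... */
		bool cleanup_started;		/* set once cleanup begins */
	};

	void cleanup_srcu_struct(struct srcu_struct *sp)
	{
		if (WARN_ON(srcu_readers_active(sp)))
			return;	/* Leakage unless caller handles error. */
		sp->cleanup_started = true;	/* no further work should be queued */
		flush_delayed_work(&sp->work);
		free_percpu(sp->per_cpu_ref);
		sp->per_cpu_ref = NULL;
	}

	/* ...and in srcu_reschedule(), just before its queue_delayed_work(): */
	WARN_ON(sp->cleanup_started);	/* splat if work is requeued after cleanup began */

The idea is simply to record that cleanup has begun and then warn if the
work item is ever requeued afterwards, which would point at either an SRCU
bug or a late call_srcu().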
> >>> I am not sure about the interaction of flush_delayed_work and
> >>> srcu_reschedule... flush_delayed_work probably assumes that no work is
> >>> queued concurrently, but what if srcu_reschedule queues another work
> >>> item concurrently... can't it happen that flush_delayed_work will miss
> >>> that newly scheduled work?
> >>
> >> Newly scheduled callbacks would be a bug in SRCU usage, but my patch is
> > 
> > I mean not SRCU callbacks, but the sp->work being rescheduled.
> > Consider that callbacks are already scheduled.  We call
> > flush_delayed_work, it waits for completion of process_srcu.  But that
> > process_srcu schedules sp->work again in srcu_reschedule.

It only does this if there are callbacks still on the srcu_struct, so if
you are seeing this, we either have a bug in SRCU that finds callbacks
when none are present or we have a usage bug that is creating new
callbacks after srcu_barrier() starts.

Do any of your callback functions invoke call_srcu()?  (Hey, I have to ask!)

> >> indeed insufficient.  Because of SRCU's two-phase algorithm, it's possible
> >> that the first flush_delayed_work doesn't invoke all callbacks.  Instead I
> >> would propose this (still untested, but this time with a commit message):
> >>
> >> ---------------- 8< --------------
> >> From: Paolo Bonzini
> >> Subject: [PATCH] srcu: wait for all callbacks before deeming SRCU "cleaned up"
> >>
> >> Even though there are no concurrent readers, it is possible that the
> >> work item is queued for delayed processing when cleanup_srcu_struct is
> >> called.  The work item needs to be flushed before returning, or a
> >> use-after-free can ensue.
> >>
> >> Furthermore, because of SRCU's two-phase algorithm it may take up to
> >> two executions of srcu_advance_batches before all callbacks are invoked.
> >> This can happen if the first flush_delayed_work happens as follows:
> >>
> >>     srcu_read_lock
> >>                                 process_srcu
> >>                                   srcu_advance_batches
> >>                                     ...
> >>                                     if (!try_check_zero(sp, idx^1, trycount))
> >>                                       // there is a reader
> >>                                       return;
> >>                                   srcu_invoke_callbacks
> >>                                     ...
> >>     srcu_read_unlock
> >>     cleanup_srcu_struct
> >>       flush_delayed_work
> >>                                   srcu_reschedule
> >>                                     queue_delayed_work
> >>
> >> Now flush_delayed_work returns, but srcu_reschedule will *not* have
> >> cleared sp->running to false.

But srcu_reschedule() sets sp->running to false if there are no callbacks.
And at that point, there had better be no callbacks.

> >> Not-tested-by: Paolo Bonzini
> >> Signed-off-by: Paolo Bonzini
> >>
> >> diff --git a/kernel/rcu/srcu.c b/kernel/rcu/srcu.c
> >> index 9b9cdd549caa..9470f1ba2ef2 100644
> >> --- a/kernel/rcu/srcu.c
> >> +++ b/kernel/rcu/srcu.c
> >> @@ -283,6 +283,14 @@ void cleanup_srcu_struct(struct srcu_struct *sp)
> >>  {
> >>  	if (WARN_ON(srcu_readers_active(sp)))
> >>  		return; /* Leakage unless caller handles error. */
> >> +
> >> +	/*
> >> +	 * No readers active, so any pending callbacks will rush through the two
> >> +	 * batches before sp->running becomes false.  No risk of busy-waiting.
> >> +	 */
> >> +	while (sp->running)
> >> +		flush_delayed_work(&sp->work);
> >
> > Unsynchronized accesses to shared state make me nervous.  running is
> > meant to be protected by sp->queue_lock.
> 
> I think it could just be
> 
>     while (flush_delayed_work(&sp->work));
> 
> but let's wait for Paul.

If it needs to be more than just a single flush_delayed_work(), we have
some other bug somewhere.  ;-)
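As an aside on the two loop shapes above, here is an untested, purely
illustrative comparison.  The relevant detail is that flush_delayed_work()
returns false once the work item was already idle, so the second form
terminates without ever reading sp->running outside sp->queue_lock:

	/* Variant from the patch above: re-check the (unlocked) running flag. */
	while (sp->running)
		flush_delayed_work(&sp->work);

	/*
	 * Alternative suggested above: loop until flush_delayed_work() reports
	 * that the work item was already idle, i.e. nothing was pending and
	 * nothing was requeued; sp->running is not read at all here.
	 */
	while (flush_delayed_work(&sp->work))
		;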
							Thanx, Paul

> Paolo
> 
> > At least we will get back to you with a KTSAN report.
> > 
> >>  	free_percpu(sp->per_cpu_ref);
> >>  	sp->per_cpu_ref = NULL;
> >>  }
> >>
> >>
> >> Thanks,
> >>
> >> Paolo
> >>
> >> --
> >> You received this message because you are subscribed to the Google Groups "syzkaller" group.
> >> To unsubscribe from this group and stop receiving emails from it, send an email to syzkaller+unsubscribe@googlegroups.com.
> >> For more options, visit https://groups.google.com/d/optout.
> > 
> 