From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.0 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_PATCH,MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_PASS,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 872CBC43381 for ; Fri, 8 Mar 2019 07:10:19 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 4119220811 for ; Fri, 8 Mar 2019 07:10:18 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726483AbfCHHKQ (ORCPT ); Fri, 8 Mar 2019 02:10:16 -0500 Received: from szxga06-in.huawei.com ([45.249.212.32]:45704 "EHLO huawei.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1726261AbfCHHKQ (ORCPT ); Fri, 8 Mar 2019 02:10:16 -0500 Received: from DGGEMS403-HUB.china.huawei.com (unknown [172.30.72.58]) by Forcepoint Email with ESMTP id 691174F64368DBA4DBF9; Fri, 8 Mar 2019 15:10:14 +0800 (CST) Received: from [127.0.0.1] (10.177.29.68) by DGGEMS403-HUB.china.huawei.com (10.3.19.203) with Microsoft SMTP Server id 14.3.408.0; Fri, 8 Mar 2019 15:10:10 +0800 Message-ID: <5C821550.50506@huawei.com> Date: Fri, 8 Mar 2019 15:10:08 +0800 From: zhong jiang User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20120428 Thunderbird/12.0.1 MIME-Version: 1.0 To: Andrea Arcangeli , Mike Rapoport , Peter Xu , Andrew Morton CC: Dmitry Vyukov , syzbot , Michal Hocko , , Johannes Weiner , LKML , Linux-MM , syzkaller-bugs , Vladimir Davydov , David Rientjes , Hugh Dickins , Matthew Wilcox , Mel Gorman , Vlastimil Babka Subject: Re: KASAN: use-after-free Read in get_mem_cgroup_from_mm References: <00000000000006457e057c341ff8@google.com> <5C7BFE94.6070500@huawei.com> <5C7D2F82.40907@huawei.com> <5C7D4500.3070607@huawei.com> <5C7E1A38.2060906@huawei.com> <20190306020540.GA23850@redhat.com> In-Reply-To: <20190306020540.GA23850@redhat.com> Content-Type: text/plain; charset="ISO-8859-1" Content-Transfer-Encoding: 7bit X-Originating-IP: [10.177.29.68] X-CFilter-Loop: Reflected Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 2019/3/6 10:05, Andrea Arcangeli wrote: > Hello everyone, > > [ CC'ed Mike and Peter ] > > On Tue, Mar 05, 2019 at 02:42:00PM +0800, zhong jiang wrote: >> On 2019/3/5 14:26, Dmitry Vyukov wrote: >>> On Mon, Mar 4, 2019 at 4:32 PM zhong jiang wrote: >>>> On 2019/3/4 22:11, Dmitry Vyukov wrote: >>>>> On Mon, Mar 4, 2019 at 3:00 PM zhong jiang wrote: >>>>>> On 2019/3/4 15:40, Dmitry Vyukov wrote: >>>>>>> On Sun, Mar 3, 2019 at 5:19 PM zhong jiang wrote: >>>>>>>> Hi, guys >>>>>>>> >>>>>>>> I also hit the following issue. but it fails to reproduce the issue by the log. >>>>>>>> >>>>>>>> it seems to the case that we access the mm->owner and deference it will result in the UAF. >>>>>>>> But it should not be possible that we specify the incomplete process to be the mm->owner. >>>>>>>> >>>>>>>> Any thoughts? >>>>>>> FWIW syzbot was able to reproduce this with this reproducer. >>>>>>> This looks like a very subtle race (threaded reproducer that runs >>>>>>> repeatedly in multiple processes), so most likely we are looking for >>>>>>> something like few instructions inconsistency window. >>>>>>> >>>>>> I has a little doubtful about the instrustions inconsistency window. >>>>>> >>>>>> I guess that you mean some smb barriers should be taken into account.:-) >>>>>> >>>>>> Because IMO, It should not be the lock case to result in the issue. >>>>> Since the crash was triggered on x86 _most likley_ this is not a >>>>> missed barrier. What I meant is that one thread needs to executed some >>>>> code, while another thread is stopped within few instructions. >>>>> >>>>> >>>> It is weird and I can not find any relationship you had said with the issue.:-( >>>> >>>> Because It is the cause that mm->owner has been freed, whereas we still deference it. >>>> >>>> From the lastest freed task call trace, It fails to create process. >>>> >>>> Am I miss something or I misunderstand your meaning. Please correct me. >>> Your analysis looks correct. I am just saying that the root cause of >>> this use-after-free seems to be a race condition. >>> >>> >>> >> Yep, Indeed, I can not figure out how the race works. I will dig up further. > Yes it's a race condition. > > We were aware about the non-cooperative fork userfaultfd feature > creating userfaultfd file descriptor that gets reported to the parent > uffd, despite they belong to mm created by failed forks. > > https://www.spinics.net/lists/linux-mm/msg136357.html > > The fork failure in my testcase happened because of signal pending > that interrupted fork after the failed-fork uffd context, was already > pushed to the userfaultfd reader/monitor. CRIU then takes care of > filtering the failed fork cases so we didn't want to make the fork > code more complicated just for userfaultfd. > > In reality if MEMCG is enabled at build time, mm->owner maintainance > code now creates a race condition in the above case, with any fork > failure. > > I pinged Mike yesterday to ask if my theory could be true for this bug > and one solution he suggested is to do the userfaultfd_dup at a point > where fork cannot fail anymore. That's precisely what we were > wondering to do back then to avoid the failed fork reports to the > non cooperative uffd monitor. > > That will solve the false positive deliveries that CRIU manager > currently filters out too. From a theoretical standpoint it's also > quite strange to even allow any uffd ioctl to run on a otherwise long > gone mm created for a process that in the end wasn't even created (the > mm got temporarily fully created, but no child task really ever used > such mm). However that mm is on its way to exit_mmap as soon as the > ioclt returns and this only ever happens during race conditions, so > the way CRIU monitor works there wasn't anything fundamentally > concerning about this detail, despite it's remarkably "strange". Our > priority was to keep the fork code as simple as possible and keep > userfaultfd as non intrusive as possible. > > One alternative solution I'm wondering about for this memcg issue is > to free the task struct with RCU also when fork has failed and to add > the mm_update_next_owner before mmput. That will still report failed > forks to the uffd monitor, so it's not the ideal fix, but since it's > probably simpler I'm posting it below. Also I couldn't reproduce the > problem with the testcase here yet. > > >From 6cbf9d377b705476e5226704422357176f79e32c Mon Sep 17 00:00:00 2001 > From: Andrea Arcangeli > Date: Tue, 5 Mar 2019 19:21:37 -0500 > Subject: [PATCH 1/1] userfaultfd: use RCU to free the task struct when fork > fails if MEMCG > > MEMCG depends on the task structure not to be freed under > rcu_read_lock() in get_mem_cgroup_from_mm() after it dereferences > mm->owner. > > A better fix would be to avoid registering forked vmas in userfaultfd > contexts reported to the monitor, if case fork ends up failing. Hi, Andrea I can reproduce the issue in arm64 qemu machine. The issue will leave after applying the patch. Tested-by: zhong jiang Meanwhile, I just has a little doubt whether it is necessary to use RCU to free the task struct or not. I think that mm->owner alway be NULL after failing to create to process. Because we call mm_clear_owner. Thanks, zhong jiang > Signed-off-by: Andrea Arcangeli > --- > kernel/fork.c | 34 ++++++++++++++++++++++++++++++++-- > 1 file changed, 32 insertions(+), 2 deletions(-) > > diff --git a/kernel/fork.c b/kernel/fork.c > index eb9953c82104..3bcbb361ffbc 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -953,6 +953,15 @@ static void mm_init_aio(struct mm_struct *mm) > #endif > } > > +static __always_inline void mm_clear_owner(struct mm_struct *mm, > + struct task_struct *p) > +{ > +#ifdef CONFIG_MEMCG > + if (mm->owner == p) > + mm->owner = NULL; > +#endif > +} > + > static void mm_init_owner(struct mm_struct *mm, struct task_struct *p) > { > #ifdef CONFIG_MEMCG > @@ -1345,6 +1354,7 @@ static struct mm_struct *dup_mm(struct task_struct *tsk) > free_pt: > /* don't put binfmt in mmput, we haven't got module yet */ > mm->binfmt = NULL; > + mm_init_owner(mm, NULL); > mmput(mm); > > fail_nomem: > @@ -1676,6 +1686,24 @@ static inline void rcu_copy_process(struct task_struct *p) > #endif /* #ifdef CONFIG_TASKS_RCU */ > } > > +#ifdef CONFIG_MEMCG > +static void __delayed_free_task(struct rcu_head *rhp) > +{ > + struct task_struct *tsk = container_of(rhp, struct task_struct, rcu); > + > + free_task(tsk); > +} > +#endif /* CONFIG_MEMCG */ > + > +static __always_inline void delayed_free_task(struct task_struct *tsk) > +{ > +#ifdef CONFIG_MEMCG > + call_rcu(&tsk->rcu, __delayed_free_task); > +#else /* CONFIG_MEMCG */ > + free_task(tsk); > +#endif /* CONFIG_MEMCG */ > +} > + > /* > * This creates a new process as a copy of the old one, > * but does not actually start it yet. > @@ -2137,8 +2165,10 @@ static __latent_entropy struct task_struct *copy_process( > bad_fork_cleanup_namespaces: > exit_task_namespaces(p); > bad_fork_cleanup_mm: > - if (p->mm) > + if (p->mm) { > + mm_clear_owner(p->mm, p); > mmput(p->mm); > + } > bad_fork_cleanup_signal: > if (!(clone_flags & CLONE_THREAD)) > free_signal_struct(p->signal); > @@ -2169,7 +2199,7 @@ static __latent_entropy struct task_struct *copy_process( > bad_fork_free: > p->state = TASK_DEAD; > put_task_stack(p); > - free_task(p); > + delayed_free_task(p); > fork_out: > spin_lock_irq(¤t->sighand->siglock); > hlist_del_init(&delayed.node); > > . > From mboxrd@z Thu Jan 1 00:00:00 1970 From: zhong jiang Subject: Re: KASAN: use-after-free Read in get_mem_cgroup_from_mm Date: Fri, 8 Mar 2019 15:10:08 +0800 Message-ID: <5C821550.50506@huawei.com> References: <00000000000006457e057c341ff8@google.com> <5C7BFE94.6070500@huawei.com> <5C7D2F82.40907@huawei.com> <5C7D4500.3070607@huawei.com> <5C7E1A38.2060906@huawei.com> <20190306020540.GA23850@redhat.com> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20190306020540.GA23850@redhat.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: Content-Type: text/plain; charset="us-ascii" To: Andrea Arcangeli , Mike Rapoport , Peter Xu , Andrew Morton Cc: Dmitry Vyukov , syzbot , Michal Hocko , cgroups@vger.kernel.org, Johannes Weiner , LKML , Linux-MM , syzkaller-bugs , Vladimir Davydov , David Rientjes , Hugh Dickins , Matthew Wilcox , Mel Gorman , Vlastimil Babka On 2019/3/6 10:05, Andrea Arcangeli wrote: > Hello everyone, > > [ CC'ed Mike and Peter ] > > On Tue, Mar 05, 2019 at 02:42:00PM +0800, zhong jiang wrote: >> On 2019/3/5 14:26, Dmitry Vyukov wrote: >>> On Mon, Mar 4, 2019 at 4:32 PM zhong jiang wrote: >>>> On 2019/3/4 22:11, Dmitry Vyukov wrote: >>>>> On Mon, Mar 4, 2019 at 3:00 PM zhong jiang wrote: >>>>>> On 2019/3/4 15:40, Dmitry Vyukov wrote: >>>>>>> On Sun, Mar 3, 2019 at 5:19 PM zhong jiang wrote: >>>>>>>> Hi, guys >>>>>>>> >>>>>>>> I also hit the following issue. but it fails to reproduce the issue by the log. >>>>>>>> >>>>>>>> it seems to the case that we access the mm->owner and deference it will result in the UAF. >>>>>>>> But it should not be possible that we specify the incomplete process to be the mm->owner. >>>>>>>> >>>>>>>> Any thoughts? >>>>>>> FWIW syzbot was able to reproduce this with this reproducer. >>>>>>> This looks like a very subtle race (threaded reproducer that runs >>>>>>> repeatedly in multiple processes), so most likely we are looking for >>>>>>> something like few instructions inconsistency window. >>>>>>> >>>>>> I has a little doubtful about the instrustions inconsistency window. >>>>>> >>>>>> I guess that you mean some smb barriers should be taken into account.:-) >>>>>> >>>>>> Because IMO, It should not be the lock case to result in the issue. >>>>> Since the crash was triggered on x86 _most likley_ this is not a >>>>> missed barrier. What I meant is that one thread needs to executed some >>>>> code, while another thread is stopped within few instructions. >>>>> >>>>> >>>> It is weird and I can not find any relationship you had said with the issue.:-( >>>> >>>> Because It is the cause that mm->owner has been freed, whereas we still deference it. >>>> >>>> From the lastest freed task call trace, It fails to create process. >>>> >>>> Am I miss something or I misunderstand your meaning. Please correct me. >>> Your analysis looks correct. I am just saying that the root cause of >>> this use-after-free seems to be a race condition. >>> >>> >>> >> Yep, Indeed, I can not figure out how the race works. I will dig up further. > Yes it's a race condition. > > We were aware about the non-cooperative fork userfaultfd feature > creating userfaultfd file descriptor that gets reported to the parent > uffd, despite they belong to mm created by failed forks. > > https://www.spinics.net/lists/linux-mm/msg136357.html > > The fork failure in my testcase happened because of signal pending > that interrupted fork after the failed-fork uffd context, was already > pushed to the userfaultfd reader/monitor. CRIU then takes care of > filtering the failed fork cases so we didn't want to make the fork > code more complicated just for userfaultfd. > > In reality if MEMCG is enabled at build time, mm->owner maintainance > code now creates a race condition in the above case, with any fork > failure. > > I pinged Mike yesterday to ask if my theory could be true for this bug > and one solution he suggested is to do the userfaultfd_dup at a point > where fork cannot fail anymore. That's precisely what we were > wondering to do back then to avoid the failed fork reports to the > non cooperative uffd monitor. > > That will solve the false positive deliveries that CRIU manager > currently filters out too. From a theoretical standpoint it's also > quite strange to even allow any uffd ioctl to run on a otherwise long > gone mm created for a process that in the end wasn't even created (the > mm got temporarily fully created, but no child task really ever used > such mm). However that mm is on its way to exit_mmap as soon as the > ioclt returns and this only ever happens during race conditions, so > the way CRIU monitor works there wasn't anything fundamentally > concerning about this detail, despite it's remarkably "strange". Our > priority was to keep the fork code as simple as possible and keep > userfaultfd as non intrusive as possible. > > One alternative solution I'm wondering about for this memcg issue is > to free the task struct with RCU also when fork has failed and to add > the mm_update_next_owner before mmput. That will still report failed > forks to the uffd monitor, so it's not the ideal fix, but since it's > probably simpler I'm posting it below. Also I couldn't reproduce the > problem with the testcase here yet. > > >From 6cbf9d377b705476e5226704422357176f79e32c Mon Sep 17 00:00:00 2001 > From: Andrea Arcangeli > Date: Tue, 5 Mar 2019 19:21:37 -0500 > Subject: [PATCH 1/1] userfaultfd: use RCU to free the task struct when fork > fails if MEMCG > > MEMCG depends on the task structure not to be freed under > rcu_read_lock() in get_mem_cgroup_from_mm() after it dereferences > mm->owner. > > A better fix would be to avoid registering forked vmas in userfaultfd > contexts reported to the monitor, if case fork ends up failing. Hi, Andrea I can reproduce the issue in arm64 qemu machine. The issue will leave after applying the patch. Tested-by: zhong jiang Meanwhile, I just has a little doubt whether it is necessary to use RCU to free the task struct or not. I think that mm->owner alway be NULL after failing to create to process. Because we call mm_clear_owner. Thanks, zhong jiang > Signed-off-by: Andrea Arcangeli > --- > kernel/fork.c | 34 ++++++++++++++++++++++++++++++++-- > 1 file changed, 32 insertions(+), 2 deletions(-) > > diff --git a/kernel/fork.c b/kernel/fork.c > index eb9953c82104..3bcbb361ffbc 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -953,6 +953,15 @@ static void mm_init_aio(struct mm_struct *mm) > #endif > } > > +static __always_inline void mm_clear_owner(struct mm_struct *mm, > + struct task_struct *p) > +{ > +#ifdef CONFIG_MEMCG > + if (mm->owner == p) > + mm->owner = NULL; > +#endif > +} > + > static void mm_init_owner(struct mm_struct *mm, struct task_struct *p) > { > #ifdef CONFIG_MEMCG > @@ -1345,6 +1354,7 @@ static struct mm_struct *dup_mm(struct task_struct *tsk) > free_pt: > /* don't put binfmt in mmput, we haven't got module yet */ > mm->binfmt = NULL; > + mm_init_owner(mm, NULL); > mmput(mm); > > fail_nomem: > @@ -1676,6 +1686,24 @@ static inline void rcu_copy_process(struct task_struct *p) > #endif /* #ifdef CONFIG_TASKS_RCU */ > } > > +#ifdef CONFIG_MEMCG > +static void __delayed_free_task(struct rcu_head *rhp) > +{ > + struct task_struct *tsk = container_of(rhp, struct task_struct, rcu); > + > + free_task(tsk); > +} > +#endif /* CONFIG_MEMCG */ > + > +static __always_inline void delayed_free_task(struct task_struct *tsk) > +{ > +#ifdef CONFIG_MEMCG > + call_rcu(&tsk->rcu, __delayed_free_task); > +#else /* CONFIG_MEMCG */ > + free_task(tsk); > +#endif /* CONFIG_MEMCG */ > +} > + > /* > * This creates a new process as a copy of the old one, > * but does not actually start it yet. > @@ -2137,8 +2165,10 @@ static __latent_entropy struct task_struct *copy_process( > bad_fork_cleanup_namespaces: > exit_task_namespaces(p); > bad_fork_cleanup_mm: > - if (p->mm) > + if (p->mm) { > + mm_clear_owner(p->mm, p); > mmput(p->mm); > + } > bad_fork_cleanup_signal: > if (!(clone_flags & CLONE_THREAD)) > free_signal_struct(p->signal); > @@ -2169,7 +2199,7 @@ static __latent_entropy struct task_struct *copy_process( > bad_fork_free: > p->state = TASK_DEAD; > put_task_stack(p); > - free_task(p); > + delayed_free_task(p); > fork_out: > spin_lock_irq(¤t->sighand->siglock); > hlist_del_init(&delayed.node); > > . >