From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-10.3 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,MENTIONS_GIT_HOSTING, NICE_REPLY_A,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_SANE_1 autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id E19E9C433E6 for ; Fri, 12 Mar 2021 16:35:09 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id B0A8C65026 for ; Fri, 12 Mar 2021 16:35:09 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232262AbhCLQei (ORCPT ); Fri, 12 Mar 2021 11:34:38 -0500 Received: from imap2.colo.codethink.co.uk ([78.40.148.184]:44450 "EHLO imap2.colo.codethink.co.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231789AbhCLQed (ORCPT ); Fri, 12 Mar 2021 11:34:33 -0500 Received: from cpc79921-stkp12-2-0-cust288.10-2.cable.virginm.net ([86.16.139.33] helo=[192.168.0.18]) by imap2.colo.codethink.co.uk with esmtpsa (Exim 4.92 #3 (Debian)) id 1lKkk5-0002Vs-5D; Fri, 12 Mar 2021 16:34:25 +0000 Subject: Re: [syzbot] BUG: unable to handle kernel access to user memory in schedule_tail From: Ben Dooks To: Dmitry Vyukov Cc: syzbot , Paul Walmsley , Palmer Dabbelt , Albert Ou , linux-riscv , Daniel Bristot de Oliveira , Benjamin Segall , dietmar.eggemann@arm.com, Juri Lelli , LKML , Mel Gorman , Ingo Molnar , Peter Zijlstra , Steven Rostedt , syzkaller-bugs , Vincent Guittot References: <000000000000b74f1b05bd316729@google.com> <84b0471d-42c1-175f-ae1d-a18c310c7f77@codethink.co.uk> Organization: Codethink Limited. Message-ID: <816870e9-9354-ffbd-936b-40e38e4276a4@codethink.co.uk> Date: Fri, 12 Mar 2021 16:34:23 +0000 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.8.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-GB Content-Transfer-Encoding: 8bit Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 12/03/2021 16:30, Ben Dooks wrote: > On 12/03/2021 15:12, Dmitry Vyukov wrote: >> On Fri, Mar 12, 2021 at 2:50 PM Ben Dooks >> wrote: >>> >>> On 10/03/2021 17:16, Dmitry Vyukov wrote: >>>> On Wed, Mar 10, 2021 at 5:46 PM syzbot >>>> wrote: >>>>> >>>>> Hello, >>>>> >>>>> syzbot found the following issue on: >>>>> >>>>> HEAD commit:    0d7588ab riscv: process: Fix no prototype for >>>>> arch_dup_tas.. >>>>> git tree: >>>>> git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux.git fixes >>>>> console output: >>>>> https://syzkaller.appspot.com/x/log.txt?x=1212c6e6d00000 >>>>> kernel config: >>>>> https://syzkaller.appspot.com/x/.config?x=e3c595255fb2d136 >>>>> dashboard link: >>>>> https://syzkaller.appspot.com/bug?extid=e74b94fe601ab9552d69 >>>>> userspace arch: riscv64 >>>>> >>>>> Unfortunately, I don't have any reproducer for this issue yet. >>>>> >>>>> IMPORTANT: if you fix the issue, please add the following tag to >>>>> the commit: >>>>> Reported-by: syzbot+e74b94fe601ab9552d69@syzkaller.appspotmail.com >>>> >>>> +riscv maintainers >>>> >>>> This is riscv64-specific. >>>> I've seen similar crashes in put_user in other places. It looks like >>>> put_user crashes in the user address is not mapped/protected (?). >>> >>> I've been having a look, and this seems to be down to access of the >>> tsk->set_child_tid variable. I assume the fuzzing here is to pass a >>> bad address to clone? >>> >>>   From looking at the code, the put_user() code should have set the >>> relevant SR_SUM bit (the value for this, which is 1<<18 is in the >>> s2 register in the crash report) and from looking at the compiler >>> output from my gcc-10, the code looks to be dong the relevant csrs >>> and then csrc around the put_user >>> >>> So currently I do not understand how the above could have happened >>> over than something re-tried the code seqeunce and ended up retrying >>> the faulting instruction without the SR_SUM bit set. >> >> I would maybe blame qemu for randomly resetting SR_SUM, but it's >> strange that 99% of these crashes are in schedule_tail. If it would be >> qemu, then they would be more evenly distributed... >> >> Another observation: looking at a dozen of crash logs, in none of >> these cases fuzzer was actually trying to fuzz clone with some insane >> arguments. So it looks like completely normal clone's (e..g coming >> from pthread_create) result in this crash. >> >> I also wonder why there is ret_from_exception, is it normal? I see >> handle_exception disables SR_SUM: >> https://elixir.bootlin.com/linux/v5.12-rc2/source/arch/riscv/kernel/entry.S#L73 >> > > So I think if SR_SUM is set, then it faults the access to user memory > which the _user() routines clear to allow them access. > > I'm thinking there is at least one issue here: > > - the test in fault is the wrong way around for die kernel > - the handler only catches this if the page has yet to be mapped. > > So I think the test should be: > >         if (!user_mode(regs) && addr < TASK_SIZE && >                         unlikely(regs->status & SR_SUM) > > This then should continue on and allow the rest of the handler to > complete mapping the page if it is not there. > > I have been trying to create a very simple clone test, but so far it > has yet to actually trigger anything. I should have added there doesn't seem to be a good way to use mmap() to allocate memory but not insert a vm-mapping post the mmap(). -- Ben Dooks http://www.codethink.co.uk/ Senior Engineer Codethink - Providing Genius https://www.codethink.co.uk/privacy.html