From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-5.4 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,NICE_REPLY_A,SPF_HELO_NONE, SPF_PASS,USER_AGENT_SANE_1 autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 404F6C388F2 for ; Thu, 22 Oct 2020 09:27:16 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id EB0A22417D for ; Thu, 22 Oct 2020 09:27:15 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2895902AbgJVJ1P (ORCPT ); Thu, 22 Oct 2020 05:27:15 -0400 Received: from foss.arm.com ([217.140.110.172]:52266 "EHLO foss.arm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2507371AbgJVJ1O (ORCPT ); Thu, 22 Oct 2020 05:27:14 -0400 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id C4C11D6E; Thu, 22 Oct 2020 02:27:13 -0700 (PDT) Received: from [10.57.13.45] (unknown [10.57.13.45]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 817E23F66E; Thu, 22 Oct 2020 02:27:11 -0700 (PDT) Subject: Re: [PATCH 1/2] coresight: tmc-etf: Fix NULL ptr dereference in tmc_enable_etf_sink_perf() To: Sai Prakash Ranjan Cc: Mathieu Poirier , mike.leach@linaro.org, coresight@lists.linaro.org, swboyd@chromium.org, linux-arm-msm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org, denik@google.com, leo.yan@linaro.org, peterz@infradead.org References: <5bbb2d35-3e56-56d7-4722-bf34c5efa2fb@arm.com> <9fa4fcc25dac17b343d151a9d089b48c@codeaurora.org> <707b7860-0daa-d3e3-1f0f-17e1b05feae2@arm.com> <5ad6acdc69c1c2e1e17f5c701a09b7e1@codeaurora.org> <8affc09d4045812e2f5a065695b375de@codeaurora.org> <0ee3566e50143bac5b662b2edf551b89@codeaurora.org> <6db16b0547122ab8a53d56bdfbfb391e@codeaurora.org> From: Suzuki Poulose Message-ID: Date: Thu, 22 Oct 2020 10:27:10 +0100 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.3.2 MIME-Version: 1.0 In-Reply-To: <6db16b0547122ab8a53d56bdfbfb391e@codeaurora.org> Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-GB Content-Transfer-Encoding: 8bit Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 10/22/20 9:02 AM, Sai Prakash Ranjan wrote: > On 2020-10-21 15:38, Suzuki Poulose wrote: >> On 10/21/20 8:29 AM, Sai Prakash Ranjan wrote: >>> On 2020-10-20 21:40, Sai Prakash Ranjan wrote: >>>> On 2020-10-14 21:29, Sai Prakash Ranjan wrote: >>>>> On 2020-10-14 18:46, Suzuki K Poulose wrote: >>>>>> On 10/14/2020 10:36 AM, Sai Prakash Ranjan wrote: >>>>>>> On 2020-10-13 22:05, Suzuki K Poulose wrote: >>>>>>>> On 10/07/2020 02:00 PM, Sai Prakash Ranjan wrote: >>>>>>>>> There was a report of NULL pointer dereference in ETF enable >>>>>>>>> path for perf CS mode with PID monitoring. It is almost 100% >>>>>>>>> reproducible when the process to monitor is something very >>>>>>>>> active such as chrome and with ETF as the sink and not ETR. >>>>>>>>> Currently in a bid to find the pid, the owner is dereferenced >>>>>>>>> via task_pid_nr() call in tmc_enable_etf_sink_perf() and with >>>>>>>>> owner being NULL, we get a NULL pointer dereference. >>>>>>>>> >>>>>>>>> Looking at the ETR and other places in the kernel, ETF and the >>>>>>>>> ETB are the only places trying to dereference the task(owner) >>>>>>>>> in tmc_enable_etf_sink_perf() which is also called from the >>>>>>>>> sched_in path as in the call trace. Owner(task) is NULL even >>>>>>>>> in the case of ETR in tmc_enable_etr_sink_perf(), but since we >>>>>>>>> cache the PID in alloc_buffer() callback and it is done as part >>>>>>>>> of etm_setup_aux() when allocating buffer for ETR sink, we never >>>>>>>>> dereference this NULL pointer and we are safe. So lets do the >>>>>>>> >>>>>>>> The patch is necessary to fix some of the issues. But I feel it is >>>>>>>> not complete. Why is it safe earlier and not later ? I believe >>>>>>>> we are >>>>>>>> simply reducing the chances of hitting the issue, by doing this >>>>>>>> earlier than >>>>>>>> later. I would say we better fix all instances to make sure that >>>>>>>> the >>>>>>>> event->owner is valid. (e.g, I can see that the for kernel events >>>>>>>> event->owner == -1 ?) >>>>>>>> >>>>>>>> struct task_struct *tsk = READ_ONCE(event->owner); >>>>>>>> >>>>>>>> if (!tsk || is_kernel_event(event)) >>>>>>>>    /* skip ? */ >>>>>>>> >>>>>>> >>>>>>> Looking at it some more, is_kernel_event() is not exposed >>>>>>> outside events core and probably for good reason. Why do >>>>>>> we need to check for this and not just tsk? >>>>>> >>>>>> Because the event->owner could be : >>>>>> >>>>>>  = NULL >>>>>>  = -1UL  // kernel event >>>>>>  = valid. >>>>>> >>>>> >>>>> Yes I understood that part, but here we were trying to >>>>> fix the NULL pointer dereference right and hence the >>>>> question as to why we need to check for kernel events? >>>>> I am no expert in perf but I don't see anywhere in the >>>>> kernel checking for is_kernel_event(), so I am a bit >>>>> skeptical if exporting that is actually right or not. >>>>> >>>> >>>> I have stress tested with the original patch many times >>>> now, i.e., without a check for event->owner and is_kernel_event() >>>> and didn't observe any crash. Plus on ETR where this was already >>>> done, no crashes were reported till date and with ETF, the issue >>>> was quickly reproducible, so I am fairly confident that this >>>> doesn't just delay the original issue but actually fixes >>>> it. I will run an overnight test again to confirm this. >>>> >>> >>> I ran the overnight test which collected aroung 4G data(see below), >>> with the following small change to see if the two cases >>> (event->owner=NULL and is_kernel_event()) are triggered >>> with suggested changes and it didn't trigger at all. >>> Do we still need those additional checks? >>> >> >> Yes. Please see perf_event_create_kernel_event(), which is >> an exported function allowing any kernel code (including modules) >> to use the PMU (just like the userspace perf tool would do). >> Just because your use case doesn't trigger this (because >> you don't run something that can trigger this) doesn't mean >> this can't be triggered. >> > > Thanks for that pointer, I will add them in the next version. > And instead of redefining TASK_TOMBSTONE in the driver, you may simply use IS_ERR_OR_NULL(tsk) to cover both NULL case and kernel event. Cheers Suzuki