From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-9.3 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, NICE_REPLY_A,SPF_HELO_NONE,SPF_PASS,USER_AGENT_SANE_1 autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3EF45C432BE for ; Thu, 26 Aug 2021 15:54:40 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 1B5AF61027 for ; Thu, 26 Aug 2021 15:54:40 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S242940AbhHZPz0 (ORCPT ); Thu, 26 Aug 2021 11:55:26 -0400 Received: from alexa-out.qualcomm.com ([129.46.98.28]:60562 "EHLO alexa-out.qualcomm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230203AbhHZPz0 (ORCPT ); Thu, 26 Aug 2021 11:55:26 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=quicinc.com; i=@quicinc.com; q=dns/txt; s=qcdkim; t=1629993279; x=1661529279; h=subject:to:cc:references:from:message-id:date: mime-version:in-reply-to:content-transfer-encoding; bh=lYVeHISZQAyQvD77bpxgjkMpC8zuBoVQZ8+/CjlrXWc=; b=A9vPa49/Tw0EdQMwFdNGQ9aHi/ezzbAWvRKJZJR3c1av0zHhb+0j3okl HGacb70HA2TZoYBcaXTYvdC4NVQjCG0Cpaub760sIBZUdF+JwLOacxY8U ryF+2a638wEOzrH1INI8tnYnfEbKQvjoCGiwLq4a40IdGnZeuOJiwS4o0 c=; Received: from ironmsg07-lv.qualcomm.com ([10.47.202.151]) by alexa-out.qualcomm.com with ESMTP; 26 Aug 2021 08:54:39 -0700 X-QCInternal: smtphost Received: from nalasex01a.na.qualcomm.com ([10.47.209.196]) by ironmsg07-lv.qualcomm.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Aug 2021 08:54:37 -0700 Received: from [10.226.59.216] (10.80.80.8) by nalasex01a.na.qualcomm.com (10.47.209.196) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384) id 15.2.922.7; Thu, 26 Aug 2021 08:54:37 -0700 Subject: Re: bus: mhi: parse_xfer_event running transfer completion callbacks more than once for a given buffer To: Paul Davey , "bbhatt@codeaurora.org" , "loic.poulain@linaro.org" , "hemantk@codeaurora.org" , "manivannan.sadhasivam@linaro.org" CC: "linux-arm-msm@vger.kernel.org" References: <9a6a00acc60c676f39f89a8ce2989416bed1b24d.camel@alliedtelesis.co.nz> <544b3db2-b135-d870-8dd8-ec4450576cb7@codeaurora.org> <2b1cbecf50a57a229e30d1bff060d0e241e2841a.camel@alliedtelesis.co.nz> From: Jeffrey Hugo Message-ID: <05e4ff6f-5118-6afb-a000-81d07e5b1078@quicinc.com> Date: Thu, 26 Aug 2021 09:54:36 -0600 User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Thunderbird/78.11.0 MIME-Version: 1.0 In-Reply-To: <2b1cbecf50a57a229e30d1bff060d0e241e2841a.camel@alliedtelesis.co.nz> Content-Type: text/plain; charset="utf-8"; format=flowed Content-Language: en-US Content-Transfer-Encoding: 7bit X-Originating-IP: [10.80.80.8] X-ClientProxiedBy: nasanexm03g.na.qualcomm.com (10.85.0.49) To nalasex01a.na.qualcomm.com (10.47.209.196) Precedence: bulk List-ID: X-Mailing-List: linux-arm-msm@vger.kernel.org On 8/23/2021 12:47 AM, Paul Davey wrote: > Hi Hemant, Jeffery > > I have some more information after some testing. > >>> Do you have a log which prints the TRE being processed? Basically i >>> am >>> trying understand this : by the time you get double free issue, is >>> there >>> any pattern with respect to the TRE that is being processed. For >>> example >>> when host processed the given TRE for the first time with RP1, >>> stale >>> TRE >>> was posted by Event RP2 right after RP1 >>> >>> ->RP1 [TRE1] >>> ->RP2 [TRE1] >>> >>> or occurrence of stale TRE event is random? >> > I have now collected some information by adding buffers which record > some of the information desired and searching or printing this > information only when the issue is detected in order to avoid constant > verbose debug information and potential slowdowns. > > From this information I can report that when this issue happens two > consecutive transfer completion events occur with the same TRE pointer > in them, I did not record events which are not transfer completion > events or the event ring RP during processing. > > So the event is as follows: > > mhi mhi0: (IP_HW0_MBIM-Up) Completion Event code: 2 length: 5e2 ptr: > 77c94780 > mhi mhi0: (IP_HW0_MBIM-Up) Completion Event code: 2 length: 5e2 ptr: > 77c94780 This isn't good. I would suspect that the device is glitching then, which should be fixed on the device side, but that doesn't help you here and now. I'm thinking your change is probably a good idea based on this, but I have additional questions. Can you check the address of the completion events in the shared memory (basically the event ring) when you see this? I want to rule out the possibility that host is double processing the same event, and this is truly a case of the device duplicating an event. I hope that makes sense to you. > Where the ptr value in the event being parsed is this value. > > I have also seen a case where a completion event would attempt to run > possibly more than once but is not caught by the check I posted in my > initial question. > > I added code to zero out some buf_info fields when processing > completion events and produce logs if the cb_buf that would be passed > to the completion callback is NULL. > > I am investigating this further and will collect similar information as > provided above for this case. > >> >>> If you can log all the events you are processing, so that we can >>> check >>> when second event arrives for already processed TRE, is the >>> transfer >>> length same as originally processed TRE or it is different. In case >>> it >>> is different length, is the length matching to the TRE which was >>> queue >>> but not processed yet. You can print the mhi_queue_skb TRE content >>> while >>> queuing skb. How easy to reproduce this issue ? Is this showing up >>> in >>> high throughput use case or it is random? any specific step to >>> reproduce >>> this issue? > > This TRE address matches one in the buffer of TREs added in gen_tre: > > mhi mhi0: (IP_HW0_MBIM-Up) Event TRE addr: 8000000077c94780, ev_len: > 5e2, tre_len: 5e2 > > And here we see the length's match between these. > >> I would wonder, what is the codebase being testing? Are the latest >> MHI >> patches included? When we saw something similar on AIC100, it was >> addressed by the sanity check changes I upstreamed. > > This is a bit complicated to answer, the codebase being tested is our > kernel, based off 5.7.19 with the mhi drivers being updated to the > upstream mainline kernel state as of around late June this year by > cherry-picking all commits to relevant paths. Additionally the change > I submitted to this list for making the driver function on big endian > systems and changes to the PCI driver to customize the channels to the > EM9191 module and some changes to the MBIM net driver MRU for > performance reasons. Additionally I have added one change to increase > the time that is waited at poweron for the modem to enter ready state > without increasing the timeout as this interfered with latency at > shutdown but I suspect using the upstream patch to resolve the shutdown > latency will remove the need for this and we can just increase the > timeout limit for the modem. > > I can confirm that the tested codebase has the following commit > included: > ec32332df764 bus: mhi: core: Sanity check values from remote device > before use Ok. Sounds like it wouldn't catch the issue you describe above anyways. Thanks for confirming though, so we can eliminate what I saw as a possible cause.