From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 5D22AC433FE for ; Fri, 14 Oct 2022 12:24:25 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender: Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post: List-Archive:List-Unsubscribe:List-Id:In-Reply-To:MIME-Version:References: Message-ID:Subject:Cc:To:From:Date:Reply-To:Content-ID:Content-Description: Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID: List-Owner; bh=+ioCOMsblM7ECgcsbw3yQfH/ixEsSFP7JYe1bJzCKgA=; b=YinmjmeCKAr/xT TZztnFfRFu+Y6Q7YN+aus9G+Lk3ehhC5HyELCRl227B2SbXhbTZDmUgC1uiKlloIh7MhQO7MO7zEg dF9vVMF5cezUacR+TVewoZ2lkBmC/d9VGRiSJcwWPu12ZyYufquTRzBFhRWZL9rqfT5QKfqEi/6ZL /bi8DmIerXHcpvLWC6F4IPoGXAye9EaorBosQ0BzCdrw3e3/RNCLmrqow8PRkDNhN519I/Z2pTu+a KOR10kM6YXJ7ZzEFcatKBtuMQmUUkETJfio6uOsp8PI0CJUTozUZTzdcZmU9T8OVCPaaQ9X3M9WDz UktBKt0dECin8obRirjg==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.94.2 #2 (Red Hat Linux)) id 1ojJif-00Edlg-IW; Fri, 14 Oct 2022 12:23:17 +0000 Received: from foss.arm.com ([217.140.110.172]) by bombadil.infradead.org with esmtp (Exim 4.94.2 #2 (Red Hat Linux)) id 1ojJic-00Edk8-0A for linux-arm-kernel@lists.infradead.org; Fri, 14 Oct 2022 12:23:16 +0000 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 2C1F913D5; Fri, 14 Oct 2022 05:23:17 -0700 (PDT) Received: from e120937-lin (unknown [172.31.20.19]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id C64C13F792; Fri, 14 Oct 2022 05:23:09 -0700 (PDT) Date: Fri, 14 Oct 2022 13:23:03 +0100 From: Cristian Marussi To: Sudeep Holla Cc: YaxiongTian , iambestgod@qq.com, james.quinlan@broadcom.com, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org, tianyaxiong@kylinos.cn Subject: Re: [PATCH -next 1/1] firmware: arm_scmi: Fix possible deadlock in shmem_tx_prepare() Message-ID: References: <20221014115639.waexbqi4vxbu6rxv@bogus> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20221014115639.waexbqi4vxbu6rxv@bogus> X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20221014_052314_180214_DDB9F677 X-CRM114-Status: GOOD ( 49.42 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org On Fri, Oct 14, 2022 at 12:56:39PM +0100, Sudeep Holla wrote: > On Thu, Oct 13, 2022 at 05:02:15PM +0100, Cristian Marussi wrote: > > On Thu, Oct 13, 2022 at 03:05:43PM +0800, YaxiongTian wrote: > > > Hi Cristian > > > = > > > =EF=BF=BD=EF=BF=BD There may be a problem with my qq email client, = =EF=BF=BD I don't see my mail in > > > the > > > = > > > communityI had to switch outlook email.Forgive me if you've received > > > multiple emails. > > > = > > No worries. > > = > > > >Problem is anyway, as you said, you'll have to pick this timeout fro= m the > > > >related transport scmi_desc (even if as of now the max_rx_timeout for > > > >all existent shared mem transport is the same..) and this means anyw= ay > > > >adding more complexity to the chain of calls to just to print a warn= of > > > >some kind in a rare error-situation from which you cannot recover an= yway. > > > = > > > =EF=BF=BD Yes,it has add more complexity about Monitorring this time.= For system > > > stability,the safest thing to do is to abort the transmission.But thi= s will > > > lose performance due to more complexity in such unusual situation. > > > = > > > >Due to other unrelated discussions, I was starting to think about > > > >exposing some debug-only (Kconfig dependent) SCMI stats like timeout= s, > > > errors, > > > >unpexpected/OoO/late_replies in order to ease the debug and monitori= ng > > > >of the health of a running SCMI stack: maybe this could be a place w= here > > > >to flag this FW issues without changing the spinloop above (or > > > >to add the kind of timeout you mentioned but only when some sort of > > > >CONFIG_SCMI_DEBUG is enabled...)...still to fully think it through, = though. > > > = > > > =EF=BF=BD I think it should active report warn or err rather than use= r queries the > > > information manually.(i.e fs_debug way).Becasue in system startup\S1\= S3\S4, > > > user can not queries this flag in Fw,they need get stuck message > > > immediately. > > > = > > = > > Looking more closely at this, I experimented a bit with an SCMI stack b= ased on > > mailbox transport in which I had forcefully set the spin_until_cond() to > > spin forever. > > = > > Even though on a normal SCMI system when the SCMI stack fails at boot > > the system is supposed to boot anyway (maybe slower), this particular > > failure in TX path led indeed to a system that does not boot at all and > > spits out an infinite sequence of: > > = > > [ 2924.499486] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: > > [ 2924.505596] rcu: 2-...0: (0 ticks this GP) idle=3D1be4/1/0x40000= 00000000000 softirq=3D50/50 fqs=3D364757 > > [ 2924.514672] (detected by 4, t=3D730678 jiffies, g=3D-1119, q=3D134 = ncpus=3D6) > > [ 2924.521215] Task dump for CPU 2: > > [ 2924.524445] task:kworker/u12:0 state:R running task stack: = 0 pid: 9 ppid: 2 flags:0x0000000a > > [ 2924.534391] Workqueue: events_unbound deferred_probe_work_func > > [ 2924.540244] Call trace: > > [ 2924.542691] __switch_to+0xe4/0x1b8 > > [ 2924.546189] deferred_probe_work_func+0xa4/0xf8 > > [ 2924.550731] process_one_work+0x208/0x480 > > [ 2924.554754] worker_thread+0x230/0x428 > > [ 2924.558514] kthread+0x114/0x120 > > [ 2924.561752] ret_from_fork+0x10/0x20 > > = > > I imagine this is the annoying thing you want to avoid. > > = > > So experimenting a bit with a patch similar to yours (ignoring the time= out > > config issues and using the static cnt to temporarily stuck and revive = the SCMI > > transport) > > = > > ------>8----- > > diff --git a/drivers/firmware/arm_scmi/shmem.c b/drivers/firmware/arm_s= cmi/shmem.c > > index 0e3eaea5d852..6dde669abd03 100644 > > --- a/drivers/firmware/arm_scmi/shmem.c > > +++ b/drivers/firmware/arm_scmi/shmem.c > > @@ -8,6 +8,7 @@ > > #include > > #include > > #include > > = > > #include "common.h" > > = > > @@ -29,17 +30,28 @@ struct scmi_shared_mem { > > u8 msg_payload[]; > > }; > > = > > +static int cnt =3D 50; > > void shmem_tx_prepare(struct scmi_shared_mem __iomem *shmem, > > struct scmi_xfer *xfer) > > { > > + ktime_t stop; > > + > > /* > > * Ideally channel must be free by now unless OS timeout last > > * request and platform continued to process the same, wait > > * until it releases the shared memory, otherwise we may endup > > * overwriting its response with new message payload or vice-ve= rsa > > */ > > - spin_until_cond(ioread32(&shmem->channel_status) & > > - SCMI_SHMEM_CHAN_STAT_CHANNEL_FREE); > > + stop =3D ktime_add_ms(ktime_get(), 35); > > + spin_until_cond(((--cnt > 0) && ioread32(&shmem->channel_status= ) & > > + SCMI_SHMEM_CHAN_STAT_CHANNEL_FREE) || > > + ktime_after(ktime_get(), stop)); > > + if (ktime_after(ktime_get(), stop)) { > > + pr_warn_once("TX Timeout !\n"); > > + cnt =3D 10; > > + return; > > + } > > + > > /* Mark channel busy + clear error */ > > iowrite32(0x0, &shmem->channel_status); > > iowrite32(xfer->hdr.poll_completion ? 0 : SCMI_SHMEM_FLAG_INTR_= ENABLED, > > ----8<------------- > > = > > With the above I had in fact a system that could boot even with a > > failing/stuck SCMI transport, but, as expected the SCMI stack > > functionality was totally compromised after the first timeout with no > > possibility to recover. > > = > > Moreover I was thinking at what could happen if later on after boot the > > SCMI server should end in some funny/hogged condition so that it is, > > only temporarily, a bit slower to answer and release the channel: with > > the current implemenation the Kernel agent will spin just a little bit > > more waiting for the channel to be freed and then everything carries > > without much hassle, while with this possible new timing-out solution > > we could end up dropping that transmission and compromising the whole > > transport fucntionality for all the subsequent transmissions. > > = > > So, again, I'm not sure it is worth making such a change even for debug > > purposes, given that in the worst scenario above you end up with a > > system stuck at boot but for which the SCMI stack is anyway compromised > > and where the only solution is fixing the server FW really. > > = > > I'll ask Sudeep is thoughts about the possible hang. > > > = > I am fine with the patch as it provides more info on what is going wrong > in the system. Please post the patch separately with all the info/backgro= und. > = Ok, I'll cleanup and post adding Reported/Suggested-by: YaxiongTian I'm inclined to set the timeout comfortably more than the transport RX timeout. (2xrx_timeout ?) to account for overhead and avoiding to bail out on some transient delays. Thanks, Cristian _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel