From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.lore.kernel.org (Postfix) with ESMTPS id 5D22AC433FE
	for <linux-arm-kernel@archiver.kernel.org>; Fri, 14 Oct 2022 12:24:25 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
	d=lists.infradead.org; s=bombadil.20210309; h=Sender:
	Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post:
	List-Archive:List-Unsubscribe:List-Id:In-Reply-To:MIME-Version:References:
	Message-ID:Subject:Cc:To:From:Date:Reply-To:Content-ID:Content-Description:
	Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:
	List-Owner; bh=+ioCOMsblM7ECgcsbw3yQfH/ixEsSFP7JYe1bJzCKgA=; b=YinmjmeCKAr/xT
	TZztnFfRFu+Y6Q7YN+aus9G+Lk3ehhC5HyELCRl227B2SbXhbTZDmUgC1uiKlloIh7MhQO7MO7zEg
	dF9vVMF5cezUacR+TVewoZ2lkBmC/d9VGRiSJcwWPu12ZyYufquTRzBFhRWZL9rqfT5QKfqEi/6ZL
	/bi8DmIerXHcpvLWC6F4IPoGXAye9EaorBosQ0BzCdrw3e3/RNCLmrqow8PRkDNhN519I/Z2pTu+a
	KOR10kM6YXJ7ZzEFcatKBtuMQmUUkETJfio6uOsp8PI0CJUTozUZTzdcZmU9T8OVCPaaQ9X3M9WDz
	UktBKt0dECin8obRirjg==;
Received: from localhost ([::1] helo=bombadil.infradead.org)
	by bombadil.infradead.org with esmtp (Exim 4.94.2 #2 (Red Hat Linux))
	id 1ojJif-00Edlg-IW; Fri, 14 Oct 2022 12:23:17 +0000
Received: from foss.arm.com ([217.140.110.172])
	by bombadil.infradead.org with esmtp (Exim 4.94.2 #2 (Red Hat Linux))
	id 1ojJic-00Edk8-0A
	for linux-arm-kernel@lists.infradead.org; Fri, 14 Oct 2022 12:23:16 +0000
Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14])
	by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 2C1F913D5;
	Fri, 14 Oct 2022 05:23:17 -0700 (PDT)
Received: from e120937-lin (unknown [172.31.20.19])
	by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id C64C13F792;
	Fri, 14 Oct 2022 05:23:09 -0700 (PDT)
Date: Fri, 14 Oct 2022 13:23:03 +0100
From: Cristian Marussi <cristian.marussi@arm.com>
To: Sudeep Holla <sudeep.holla@arm.com>
Cc: YaxiongTian <iambestgod@outlook.com>, iambestgod@qq.com,
	james.quinlan@broadcom.com, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, stable@vger.kernel.org,
	tianyaxiong@kylinos.cn
Subject: Re: [PATCH -next 1/1] firmware: arm_scmi: Fix possible deadlock in
 shmem_tx_prepare()
Message-ID: <Y0lUpzQwYXANp4Fj@e120937-lin>
References: <Y0V6Q7ZJ3GjBwWub@e120937-lin>
 <KL1PR01MB3510AD021B2258CB789466E3D5259@KL1PR01MB3510.apcprd01.prod.exchangelabs.com>
 <Y0g2d/pw6yCoA3Nc@e120937-lin>
 <20221014115639.waexbqi4vxbu6rxv@bogus>
MIME-Version: 1.0
Content-Disposition: inline
In-Reply-To: <20221014115639.waexbqi4vxbu6rxv@bogus>
X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 
X-CRM114-CacheID: sfid-20221014_052314_180214_DDB9F677 
X-CRM114-Status: GOOD (  49.42  )
X-BeenThere: linux-arm-kernel@lists.infradead.org
X-Mailman-Version: 2.1.34
Precedence: list
List-Id: <linux-arm-kernel.lists.infradead.org>
List-Unsubscribe: <http://lists.infradead.org/mailman/options/linux-arm-kernel>,
 <mailto:linux-arm-kernel-request@lists.infradead.org?subject=unsubscribe>
List-Archive: <http://lists.infradead.org/pipermail/linux-arm-kernel/>
List-Post: <mailto:linux-arm-kernel@lists.infradead.org>
List-Help: <mailto:linux-arm-kernel-request@lists.infradead.org?subject=help>
List-Subscribe: <http://lists.infradead.org/mailman/listinfo/linux-arm-kernel>,
 <mailto:linux-arm-kernel-request@lists.infradead.org?subject=subscribe>
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org>
Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org

On Fri, Oct 14, 2022 at 12:56:39PM +0100, Sudeep Holla wrote:
> On Thu, Oct 13, 2022 at 05:02:15PM +0100, Cristian Marussi wrote:
> > On Thu, Oct 13, 2022 at 03:05:43PM +0800, YaxiongTian wrote:
> > > Hi Cristian
> > > =

> > > =EF=BF=BD=EF=BF=BD There may be a problem with my qq email client, =
=EF=BF=BD I don't see my mail in
> > > the
> > > =

> > > communityI had to switch outlook email.Forgive me if you've received
> > > multiple emails.
> > > =

> > No worries.
> > =

> > > >Problem is anyway, as you said, you'll have to pick this timeout fro=
m the
> > > >related transport scmi_desc (even if as of now the max_rx_timeout for
> > > >all existent shared mem transport is the same..) and this means anyw=
ay
> > > >adding more complexity to the chain of calls to just to print a warn=
 of
> > > >some kind in a rare error-situation from which you cannot recover an=
yway.
> > > =

> > > =EF=BF=BD Yes,it has add more complexity about Monitorring this time.=
For system
> > > stability,the safest thing to do is to abort the transmission.But thi=
s will
> > > lose performance due to more complexity in such unusual situation.
> > > =

> > > >Due to other unrelated discussions, I was starting to think about
> > > >exposing some debug-only (Kconfig dependent) SCMI stats like timeout=
s,
> > > errors,
> > > >unpexpected/OoO/late_replies in order to ease the debug and monitori=
ng
> > > >of the health of a running SCMI stack: maybe this could be a place w=
here
> > > >to flag this FW issues without changing the spinloop above (or
> > > >to add the kind of timeout you mentioned but only when some sort of
> > > >CONFIG_SCMI_DEBUG is enabled...)...still to fully think it through, =
though.
> > > =

> > > =EF=BF=BD I think it should active report warn or err rather than use=
r queries the
> > > information manually.(i.e fs_debug way).Becasue in system startup\S1\=
S3\S4,
> > > user can not queries this flag in Fw,they need get stuck message
> > > immediately.
> > > =

> > =

> > Looking more closely at this, I experimented a bit with an SCMI stack b=
ased on
> > mailbox transport in which I had forcefully set the spin_until_cond() to
> > spin forever.
> > =

> > Even though on a normal SCMI system when the SCMI stack fails at boot
> > the system is supposed to boot anyway (maybe slower), this particular
> > failure in TX path led indeed to a system that does not boot at all and
> > spits out an infinite sequence of:
> > =

> > [ 2924.499486] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > [ 2924.505596] rcu:     2-...0: (0 ticks this GP) idle=3D1be4/1/0x40000=
00000000000 softirq=3D50/50 fqs=3D364757
> > [ 2924.514672]  (detected by 4, t=3D730678 jiffies, g=3D-1119, q=3D134 =
ncpus=3D6)
> > [ 2924.521215] Task dump for CPU 2:
> > [ 2924.524445] task:kworker/u12:0   state:R  running task     stack:   =
 0 pid:    9 ppid:     2 flags:0x0000000a
> > [ 2924.534391] Workqueue: events_unbound deferred_probe_work_func
> > [ 2924.540244] Call trace:
> > [ 2924.542691]  __switch_to+0xe4/0x1b8
> > [ 2924.546189]  deferred_probe_work_func+0xa4/0xf8
> > [ 2924.550731]  process_one_work+0x208/0x480
> > [ 2924.554754]  worker_thread+0x230/0x428
> > [ 2924.558514]  kthread+0x114/0x120
> > [ 2924.561752]  ret_from_fork+0x10/0x20
> > =

> > I imagine this is the annoying thing you want to avoid.
> > =

> > So experimenting a bit with a patch similar to yours (ignoring the time=
out
> > config issues and using the static cnt to temporarily stuck and revive =
the SCMI
> > transport)
> > =

> > ------>8-----
> > diff --git a/drivers/firmware/arm_scmi/shmem.c b/drivers/firmware/arm_s=
cmi/shmem.c
> > index 0e3eaea5d852..6dde669abd03 100644
> > --- a/drivers/firmware/arm_scmi/shmem.c
> > +++ b/drivers/firmware/arm_scmi/shmem.c
> > @@ -8,6 +8,7 @@
> >  #include <linux/io.h>
> >  #include <linux/processor.h>
> >  #include <linux/types.h>
> >  =

> >  #include "common.h"
> >  =

> > @@ -29,17 +30,28 @@ struct scmi_shared_mem {
> >         u8 msg_payload[];
> >  };
> >  =

> > +static int cnt =3D 50;
> >  void shmem_tx_prepare(struct scmi_shared_mem __iomem *shmem,
> >                       struct scmi_xfer *xfer)
> >  {
> > +       ktime_t stop;
> > +
> >         /*
> >          * Ideally channel must be free by now unless OS timeout last
> >          * request and platform continued to process the same, wait
> >          * until it releases the shared memory, otherwise we may endup
> >          * overwriting its response with new message payload or vice-ve=
rsa
> >          */
> > -       spin_until_cond(ioread32(&shmem->channel_status) &
> > -                       SCMI_SHMEM_CHAN_STAT_CHANNEL_FREE);
> > +       stop =3D ktime_add_ms(ktime_get(), 35);
> > +       spin_until_cond(((--cnt > 0) && ioread32(&shmem->channel_status=
) &
> > +                       SCMI_SHMEM_CHAN_STAT_CHANNEL_FREE) ||
> > +                       ktime_after(ktime_get(), stop));
> > +       if (ktime_after(ktime_get(), stop)) {
> > +               pr_warn_once("TX Timeout !\n");
> > +               cnt =3D 10;
> > +               return;
> > +       }
> > +
> >         /* Mark channel busy + clear error */
> >         iowrite32(0x0, &shmem->channel_status);
> >         iowrite32(xfer->hdr.poll_completion ? 0 : SCMI_SHMEM_FLAG_INTR_=
ENABLED,
> > ----8<-------------
> > =

> > With the above I had in fact a system that could boot even with a
> > failing/stuck SCMI transport, but, as expected the SCMI stack
> > functionality was totally compromised after the first timeout with no
> > possibility to recover.
> > =

> > Moreover I was thinking at what could happen if later on after boot the
> > SCMI server should end in some funny/hogged condition so that it is,
> > only temporarily, a bit slower to answer and release the channel: with
> > the current implemenation the Kernel agent will spin just a little bit
> > more waiting for the channel to be freed and then everything carries
> > without much hassle, while with this possible new timing-out solution
> > we could end up dropping that transmission and compromising the whole
> > transport fucntionality for all the subsequent transmissions.
> > =

> > So, again, I'm not sure it is worth making such a change even for debug
> > purposes, given that in the worst scenario above you end up with a
> > system stuck at boot but for which the SCMI stack is anyway compromised
> > and where the only solution is fixing the server FW really.
> > =

> > I'll ask Sudeep is thoughts about the possible hang.
> >
> =

> I am fine with the patch as it provides more info on what is going wrong
> in the system. Please post the patch separately with all the info/backgro=
und.
> =


Ok, I'll cleanup and post adding Reported/Suggested-by: YaxiongTian

I'm inclined to set the timeout comfortably more than the transport RX
timeout. (2xrx_timeout ?) to account for overhead and avoiding to bail
out on some transient delays.

Thanks,
Cristian

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel