Linux-SCSI Archive on lore.kernel.org
From: Stanley Chu <stanley.chu@mediatek.com>
To: Can Guo <cang@codeaurora.org>
Cc: <linux-scsi@vger.kernel.org>, <martin.petersen@oracle.com>,
	<avri.altman@wdc.com>, <alim.akhtar@samsung.com>,
	<jejb@linux.ibm.com>, <bvanassche@acm.org>, <beanhuo@micron.com>,
	<asutoshd@codeaurora.org>, <matthias.bgg@gmail.com>,
	<linux-mediatek@lists.infradead.org>,
	<linux-arm-kernel@lists.infradead.org>,
	<linux-kernel@vger.kernel.org>, <kuohong.wang@mediatek.com>,
	<peter.wang@mediatek.com>, <chun-hung.wu@mediatek.com>,
	<andy.teng@mediatek.com>, <chaotian.jing@mediatek.com>,
	<cc.chou@mediatek.com>
Subject: Re: [PATCH v2] scsi: ufs: Fix possible infinite loop in ufshcd_hold
Date: Thu, 30 Jul 2020 15:59:21 +0800
Message-ID: <1596095961.17247.51.camel@mtkswgap22> (raw)
In-Reply-To: <4cb7403fae7226b70a133d4a7ecee755@codeaurora.org>

Hi Can,

On Wed, 2020-07-29 at 18:53 +0800, Can Guo wrote:
> Hi Stanley,
> 
> On 2020-07-29 18:26, Stanley Chu wrote:
> > Hi Can,
> > 
> > On Wed, 2020-07-29 at 16:43 +0800, Can Guo wrote:
> >> Hi Stanley,
> >> 
> >> On 2020-07-29 10:40, Stanley Chu wrote:
> >> > In ufshcd_suspend(), after clk-gating is suspended and link is set
> >> > as Hibern8 state, ufshcd_hold() is still possibly invoked before
> >> > ufshcd_suspend() returns. For example, MediaTek's suspend vops may
> >> > issue UIC commands which would call ufshcd_hold() during the command
> >> > issuing flow.
> >> >
> >> > Now if UFSHCD_CAP_HIBERN8_WITH_CLK_GATING capability is enabled,
> >> > then ufshcd_hold() may enter infinite loops because there is no
> >> > clk-ungating work scheduled or pending. In this case, ufshcd_hold()
> >> > shall just bypass, and keep the link as Hibern8 state.
> >> >
> >> 
> >> The infinite loop is expected as ufshcd_hold is called again after
> >> link is put to hibern8 state, so in QCOM's code, we never do this.
> > 
> > Sadly, MediaTek has to do this to make our UniPro enter low-power
> > mode.
> > 
> >> The cap UFSHCD_CAP_HIBERN8_WITH_CLK_GATING means UIC link state
> >> must not be HIBERN8 after ufshcd_hold(async=false) returns.
> > 
> > If the driver is not in a PM scenario, e.g., suspend, the above
> > statement shall always be followed. But two obvious violations exist:
> > 
> > 1. In ufshcd_suspend(), the link is set to HIBERN8 after ufshcd_hold()
> > 2. In ufshcd_resume(), the link is set back to Active before
> > ufshcd_release() is invoked
> > 
> > So in my understanding, special conditions are allowed in PM scenarios,
> > and this is why "hba->clk_gating.is_suspended" was introduced. Following
> > this thought, I used "hba->clk_gating.is_suspended" in this patch as the
> > mandatory condition to allow ufshcd_hold() usage in vendor suspend and
> > resume callbacks.
> > 
> > 
> >> Instead of bailing out from that loop, which makes the logic of
> >> ufshcd_hold and clk gating even more complex, how about removing
> >> ufshcd_hold/release from ufshcd_send_uic_cmd()? I think they are
> >> redundant and we should never send DME cmds if clocks/powers are
> >> not ready. I mean callers should make sure they are ready to send
> >> DME cmds (and only callers know when), but not leave that job to
> >> ufshcd_send_uic_cmd(). It is convenient to remove ufshcd_hold/
> >> release from ufshcd_send_uic_cmd() as there are not many places
> >> sending DME cmds without holding the clocks, ufs_bsg.c is one.
> >> And I have tested my idea on my setup, it worked well for me.
> >> Another benefit is that it also allows us to use DME cmds
> >> in clk gating/ungating contexts if we need to in the future.
> >> 
> > 
> > Brilliant idea! But this may not solve problems if vendor callbacks
> > need more than UIC commands in the future.
> > 
> > This simple patch could make all vendor operations on UFSHCI in PM
> > callbacks possible with UFSHCD_CAP_HIBERN8_WITH_CLK_GATING enabled, and
> > again, it allows those operations in PM scenarios only.
> > 
> 
> Other than UIC cmds, I can only think of device management cmds (like
> query). If device management cmds get in the way in the future, we fix
> that as well. I mean that is the right thing to do in my opinion - just
> like we don't call pm_runtime_get_sync() in ufshcd_send_uic_cmd().
> 
> I can understand that you want a simple/quick fix to get it to work for
> you once and for all, but from my point of view, debugging clk
> gating/ungating sometimes takes huge effort (I've spent a lot of time on
> it). Some flash vendors also use it widely in their own drivers, which
> makes some failure scenes even harder to understand/debug. So the first
> thing that comes to my mind is that we should avoid making it more
> complex or giving it more exceptions.
> 
> From a functionality point of view, it looks ok to me. It is just that I
> cannot predict it won't cause new problems, since the clk gating/ungating
> sequences are like magic in some use cases sometimes.

Thanks for the functionality review.

I totally understand what you mentioned above about clk-gating
debugging, because we have also spent lots of time on issue analysis.

I just finished some fault-injection tests for this patch on our
platform, and the results are fine.

The active window of this patch is limited: it starts at
ufshcd_link_state_transition() in ufshcd_suspend() and ends at
ufshcd_vops_resume() in ufshcd_resume(), because the link is back in the
LINKUP state in the MediaTek resume callback. So I focused on injecting
errors into our callbacks within this window, and most of the injected
failures triggered the host and device reset flow.

For example,
Suspend: UniPro PowerDownControl timeout
Resume: hba_enable timeout
Resume: UniPro PowerDownControl timeout
Resume: HIBERN8 Exit timeout
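
For reference, the condition this patch adds can be modeled in user space
as below. This is only a rough sketch under my own assumptions, not driver
code: struct gating_model, model_flush_work() and model_hold() are
hypothetical names made up for illustration, and max_iter merely stands in
for "loops forever".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the relevant hba state. */
struct gating_model {
	bool cap_hibern8_with_gating; /* UFSHCD_CAP_HIBERN8_WITH_CLK_GATING set */
	bool link_hibern8;            /* link is currently in Hibern8 */
	bool is_suspended;            /* mirrors hba->clk_gating.is_suspended */
	bool ungate_work_pending;     /* an ungate work item is queued */
};

/*
 * Models flush_work(): returns true if it waited for pending work,
 * false if the work was already idle. In this model, the ungate work
 * is what brings the link out of Hibern8.
 */
static bool model_flush_work(struct gating_model *g)
{
	if (!g->ungate_work_pending)
		return false;
	g->ungate_work_pending = false;
	g->link_hibern8 = false;
	return true;
}

/*
 * Models the synchronous retry loop in ufshcd_hold().
 * Returns the number of iterations used, or -1 if max_iter is
 * exhausted (our stand-in for the infinite loop).
 */
static int model_hold(struct gating_model *g, bool with_fix, int max_iter)
{
	assert(max_iter > 0);

	for (int i = 1; i <= max_iter; i++) {
		/* Nothing to wait for: the hold succeeds. */
		if (!(g->cap_hibern8_with_gating && g->link_hibern8))
			return i;

		bool flushed = model_flush_work(g);

		/* The added bypass: PM scenario and no ungate work ran. */
		if (with_fix && g->is_suspended && !flushed)
			return i;
	}
	return -1;
}
```

In this model, a suspended host with the link in Hibern8 and no ungate
work queued never leaves the loop without the check, and returns on the
first pass with it, leaving the link in Hibern8. Outside of PM scenarios
(is_suspended == false) the behavior is unchanged.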

Hope these tests can ease your concerns.

Thanks,
Stanley Chu

> 
> Thanks,
> 
> Can Guo.
> 
> >> Please let me know your idea, thanks.
> >> 
> >> Can Guo.
> > 
> > Thanks,
> > Stanley Chu
> > 
> >> 
> >> > Signed-off-by: Stanley Chu <stanley.chu@mediatek.com>
> >> > Signed-off-by: Andy Teng <andy.teng@mediatek.com>
> >> >
> >> > ---
> >> >
> >> > Changes since v1:
> >> > - Fix return value: Use unique bool variable to get the result of
> >> > flush_work(). This can prevent incorrect returned value, i.e., rc, if
> >> > flush_work() returns true
> >> > - Fix commit message
> >> >
> >> > ---
> >> >  drivers/scsi/ufs/ufshcd.c | 5 ++++-
> >> >  1 file changed, 4 insertions(+), 1 deletion(-)
> >> >
> >> > diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
> >> > index 577cc0d7487f..acba2271c5d3 100644
> >> > --- a/drivers/scsi/ufs/ufshcd.c
> >> > +++ b/drivers/scsi/ufs/ufshcd.c
> >> > @@ -1561,6 +1561,7 @@ static void ufshcd_ungate_work(struct work_struct *work)
> >> >  int ufshcd_hold(struct ufs_hba *hba, bool async)
> >> >  {
> >> >  	int rc = 0;
> >> > +	bool flush_result;
> >> >  	unsigned long flags;
> >> >
> >> >  	if (!ufshcd_is_clkgating_allowed(hba))
> >> > @@ -1592,7 +1593,9 @@ int ufshcd_hold(struct ufs_hba *hba, bool async)
> >> >  				break;
> >> >  			}
> >> >  			spin_unlock_irqrestore(hba->host->host_lock, flags);
> >> > -			flush_work(&hba->clk_gating.ungate_work);
> >> > +			flush_result = flush_work(&hba->clk_gating.ungate_work);
> >> > +			if (hba->clk_gating.is_suspended && !flush_result)
> >> > +				goto out;
> >> >  			spin_lock_irqsave(hba->host->host_lock, flags);
> >> >  			goto start;
> >> >  		}


Thread overview: 6+ messages
2020-07-29  2:40 Stanley Chu
2020-07-29  8:43 ` Can Guo
2020-07-29 10:26   ` Stanley Chu
2020-07-29 10:53     ` Can Guo
2020-07-30  7:59       ` Stanley Chu [this message]
     [not found]         ` <1596518850.27829.5.camel@mtkswgap22>
     [not found]           ` <BYAPR04MB4629D947CAC7E07FF94D55F2FC4A0@BYAPR04MB4629.namprd04.prod.outlook.com>
2020-08-07  0:39             ` Stanley Chu
