RE: [EXT] Re: [PATCH] qedi: Fix cmd_cleanup_cmpl counter mismatch issue.

From: Manish Rangankar <mrangankar@marvell.com>
To: Mike Christie <michael.christie@oracle.com>,
	"martin.petersen@oracle.com" <martin.petersen@oracle.com>,
	"lduncan@suse.com" <lduncan@suse.com>,
	"cleech@redhat.com" <cleech@redhat.com>
Cc: "linux-scsi@vger.kernel.org" <linux-scsi@vger.kernel.org>,
	GR-QLogic-Storage-Upstream
	<GR-QLogic-Storage-Upstream@marvell.com>
Subject: RE: [EXT] Re: [PATCH] qedi: Fix cmd_cleanup_cmpl counter mismatch issue.
Date: Wed, 24 Nov 2021 06:05:28 +0000
Message-ID: <PH0PR18MB4425F4F08057B89453C2222ED8619@PH0PR18MB4425.namprd18.prod.outlook.com>
In-Reply-To: <9c21c019-d6ff-a908-80e5-51b9c765d118@oracle.com>

> >
> >  check_cleanup_reqs:
> >  	if (qedi_conn->cmd_cleanup_req > 0) {
> > -		QEDI_INFO(&qedi->dbg_ctx, QEDI_LOG_TID,
> > -			  "Freeing tid=0x%x for cid=0x%x\n",
> > -			  cqe->itid, qedi_conn->iscsi_conn_id);
> > -		qedi_conn->cmd_cleanup_cmpl++;
> > +		++qedi_conn->cmd_cleanup_cmpl;
> > +		QEDI_INFO(&qedi->dbg_ctx, QEDI_LOG_SCSI_TM,
> > +			  "Freeing tid=0x%x for cid=0x%x cleanup count=%d\n",
> > +			  cqe->itid, qedi_conn->iscsi_conn_id,
> > +			  qedi_conn->cmd_cleanup_cmpl);
> 
> Is the issue that cmd_cleanup_cmpl's increment is not seen by
> qedi_cleanup_all_io's wait_event_interruptible_timeout call when it wakes up,
> and your patch fixes this by doing a pre increment?
> 

Yes, cmd_cleanup_cmpl's increment is not seen by qedi_cleanup_all_io's
wait_event_interruptible_timeout call when it wakes up, even after the firmware
has posted all the ISCSI_CQE_TYPE_TASK_CLEANUP events for the requested cmd_cleanup_req.
And yes, the pre-increment did address this issue. Do you feel otherwise?

> Does doing a pre increment give you barrier like behavior and is that why this
> works? I thought if wake_up ends up waking up the other thread it does a barrier
> already, so it's not clear to me how changing to a pre-increment helps.
> 
> Is doing a pre-increment a common way to handle this? It looks like we do a
> post increment and wake_up* in other places. However, like in the scsi layer we
> do wake_up_process and memory-barriers.txt says that always does a general
> barrier, so is that why we can do a post increment there?
> 
> Does pre-increment give you barrier like behavior, and is the wake_up call not
> waking up the process so we didn't get a barrier from that, and so that's why this
> works?
> 

The issue happens before wake_up is called. When we get a surge of
ISCSI_CQE_TYPE_TASK_CLEANUP events on multiple Rx threads, cmd_cleanup_cmpl tends to
miss increments. The scenario looks like multiple threads accessing cmd_cleanup_cmpl and
racing during the postfix increment, possibly because two threads read the same value
at the same time.

Now that I am explaining it, I feel that instead of pre-incrementing cmd_cleanup_cmpl,
it should be an atomic variable. Do you see any issue with that?
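
To illustrate the race described above, here is a minimal userspace sketch. It is
not qedi code: it uses pthreads and C11 atomics, and the names cleanup_ctx,
rx_thread, count_cleanups and MAX_THREADS are hypothetical. The non-atomic
increment is modelled as a separate load and store, which is the window in which
two threads can read the same value; the atomic variant uses a single
read-modify-write, roughly what an atomic_t with atomic_inc_return() would give
in the driver.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define MAX_THREADS 16	/* hypothetical upper bound for this sketch */

/* Hypothetical stand-in for the per-connection cleanup state. */
struct cleanup_ctx {
	atomic_int cmpl;	/* plays the role of cmd_cleanup_cmpl */
	int iters;		/* completions handled by each thread */
	int use_rmw;		/* 1: atomic read-modify-write, 0: load then store */
};

/* Each thread plays the role of one Rx thread handling cleanup completions. */
static void *rx_thread(void *arg)
{
	struct cleanup_ctx *ctx = arg;

	for (int i = 0; i < ctx->iters; i++) {
		if (ctx->use_rmw) {
			/* One indivisible increment: no update can be lost. */
			atomic_fetch_add(&ctx->cmpl, 1);
		} else {
			/*
			 * Models a plain cmpl++ or ++cmpl: the load and the store
			 * are separate steps, so two threads may load the same
			 * value and one of the two increments is then lost.
			 */
			int v = atomic_load_explicit(&ctx->cmpl,
						     memory_order_relaxed);
			atomic_store_explicit(&ctx->cmpl, v + 1,
					      memory_order_relaxed);
		}
	}
	return NULL;
}

/* Run nthreads threads and return the final value of the counter. */
int count_cleanups(int nthreads, int iters, int use_rmw)
{
	pthread_t tid[MAX_THREADS];
	struct cleanup_ctx ctx = { .iters = iters, .use_rmw = use_rmw };
	int started = 0;

	assert(nthreads > 0 && nthreads <= MAX_THREADS);
	atomic_store(&ctx.cmpl, 0);

	for (int i = 0; i < nthreads; i++) {
		if (pthread_create(&tid[started], NULL, rx_thread, &ctx) != 0)
			break;	/* stop starting threads on failure */
		started++;
	}
	for (int i = 0; i < started; i++)
		pthread_join(tid[i], NULL);

	return atomic_load(&ctx.cmpl);
}
```

With use_rmw set, count_cleanups(4, 100000, 1) always returns 400000. With
use_rmw cleared, the result can come out lower on a multi-core machine, which
would match the req=99 vs cmpl=97 mismatch in the logs. Note that in this model
it makes no difference whether the increment is written in prefix or postfix
form.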

From the logs:
-------------------------------------------------------
[root@rhel82-leo RHEL90_LOGS]# grep -inr "qedi_iscsi_cleanup_task:2160" conn_err.log | wc -l
99

[root@rhel82-leo RHEL90_LOGS]# grep -inr "qedi_cleanup_all_io:1215" conn_err.log | wc -l
99

[root@rhel82-leo RHEL90_LOGS]# grep -inr "qedi_fp_process_cqes:925" conn_err.log | wc -l
99

[root@rhel82-leo RHEL90_LOGS]# grep -inr "qedi_fp_process_cqes:922" conn_err.log | wc -l
99

[Thu Oct 21 22:03:32 2021] [0000:a5:00.5]:[qedi_cleanup_all_io:1246]:18: i/o cmd_cleanup_req=99, not equal to cmd_cleanup_cmpl=97, cid=0x0   <<<
[Thu Oct 21 22:03:38 2021] [0000:a5:00.5]:[qedi_clearsq:1299]:18: fatal error, need hard reset, cid=0x0
-----------------------------------------------------