linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Niklas Cassel <Niklas.Cassel@wdc.com>
To: John Garry <john.garry@huawei.com>
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>,
	"jejb@linux.ibm.com" <jejb@linux.ibm.com>,
	"martin.petersen@oracle.com" <martin.petersen@oracle.com>,
	"jinpu.wang@cloud.ionos.com" <jinpu.wang@cloud.ionos.com>,
	"linux-scsi@vger.kernel.org" <linux-scsi@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Linuxarm <linuxarm@huawei.com>,
	yangxingui <yangxingui@huawei.com>,
	yanaijie <yanaijie@huawei.com>
Subject: Re: [PATCH v5 0/7] libsas and drivers: NCQ error handling
Date: Thu, 6 Oct 2022 14:45:37 +0000	[thread overview]
Message-ID: <Yz7qD+gpmI1bdw16@x1-carbon> (raw)
In-Reply-To: <f190f19e-34b2-611c-1cf4-f8f34d12fe74@huawei.com>

On Thu, Oct 06, 2022 at 09:33:23AM +0100, John Garry wrote:
> On 05/10/2022 23:42, Damien Le Moal wrote:
> > > Hello Damien,
> > > 
> > > John explained that he got a timeout from EH when reading the log:
> > > [  350.281581] ata1: failed to read log page 10h (errno=-5)
> > > [  350.577181] ata1.00: exception Emask 0x1 SAct 0xffffffff SErr 0x0 action 0x6 frozen
> > > 
> > > ata_eh_read_log_10h() uses ata_read_log_page(), which will first try to read
> > > the log using READ LOG DMA EXT. If that fails, it will retry using READ LOG EXT.
> > > 
> > > Therefore, to see if this is a driver specific bug, I suggested to try to read
> > > the NCQ Command Error log using ATA16 passthrough commands:
> > > 
> > > $ sudo sg_sat_read_gplog -d --log=0x10 /dev/sdc
> > > will read the log using READ LOG DMA EXT.
> > > 
> > > $ sudo sg_sat_read_gplog --log=0x10 /dev/sdc
> > > will read the log using READ LOG EXT.
> 
> Note that I can't get a distro to boot on this system from the HDD for the
> same timeout problem (so no tools easily available).
> 
> > > 
> > > Neither of these two suggested commands are NCQ commands.
> > > (Neither command is encapsulated in a RECEIVE FPDMA QUEUED,
> > > so I'm not sure what you mean.)
> > > 
> > > 
> > > Garry, I now see that:
> > > [  350.577181] ata1.00: exception Emask 0x1 SAct 0xffffffff SErr 0x0 action 0x6 frozen
> > > Your port is frozen.
> > > 
> > > ata_read_log_page() calls ata_exec_internal() which calls ata_exec_internal_sg(),
> > > which will simply return an error without sending down the command to the drive,
> > > if the port is frozen.
> > > 
> > > Not sure why your port is frozen, mine is obviously not.
> 
> I think that it gets frozen when the internal command for read log ext times
> out. More below about that timeout.

ata_read_log_page() will first try to read using READ LOG DMA EXT.
If that fails it will retry with READ LOG EXT.

Your log has this:
[  350.257870] ata1.00: qc timeout (cmd 0x47)

So it is definitely ATA_CMD_READ_LOG_DMA_EXT that times out.

On timeout, ata_exec_internal_sg() will freeze the port:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/ata/libata-core.c?h=v6.0#n1577

When ata_read_log_page() retries with the port frozen,
READ LOG EXT will obviously fail (since the port is frozen).

Not sure why READ LOG DMA EXT would timeout for you...
Perhaps your drive does not implement this command,
and incorrectly reports supporting this command via
ata_id_has_read_log_dma_ext().

Perhaps you could try boot your kernel with libata.force=nodmalog
on the kernel command line, so that ata_read_log_page() will use
READ LOG EXT on the first try.


Damien, it seems that there is no use in retrying if the port
is frozen/we got a timeout, so perhaps:

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index e74ab6c0f1a0..1aa628332c8e 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -2035,7 +2035,8 @@ unsigned int ata_read_log_page(struct ata_device *dev, u8 log,
        if (err_mask) {
                if (dma) {
                        dev->horkage |= ATA_HORKAGE_NO_DMA_LOG;
-                       goto retry;
+                       if (err_mask != AC_ERR_TIMEOUT)
+                               goto retry;
                }

or:

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index e74ab6c0f1a0..2fa03b7573ac 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -2035,7 +2035,8 @@ unsigned int ata_read_log_page(struct ata_device *dev, u8 log,
        if (err_mask) {
                if (dma) {
                        dev->horkage |= ATA_HORKAGE_NO_DMA_LOG;
-                       goto retry;
+                       if (!(dev->link->ap->pflags & ATA_PFLAG_FROZEN))
+                               goto retry;
                }

would be in order, so that we actually print the real error, instead of a bogus
AC_ERR_SYSTEM (returned by ata_exec_internal_sg()) when the port is frozen.

> 
> > > 
> > > ata_do_link_abort() calls ata_eh_set_pending() without activating fast drain:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/ata/libata-eh.c?h=v6.0#n989
> > > 
> > > So I'm not sure why your port is frozen.
> > > (The fast drain timer does freeze the port, but it shouldn't be enabled.)
> > > It might be worthwhile to see who freezes the port in your case.
> > Might come from the command timeout. John has had many problems with the
> > pm80xx HBA in his Arm machine from a while back. Likely not a driver issue
> > but a hw one... No-one seems to be able to recreate the same problem.
> > 
> > We need to try the HBA on our Arm board to see what happens.
> > 
> 
> Yeah, it just looks to be the longstanding issue of using this card on my
> arm64 machine - that is that I get IO timeouts quite regularly. I should
> have mentioned that yesterday. This just seems to be a driver issue.

Out of curiosity, which arm64 SoC is this?

While it is very unlikely that this is your problem, but I've encountered
an issue on an ARM board before, where the PCIe controller was incorrectly
configured in device tree, causing the controller to miss interrrupts,
which presented itself to the user as timeouts in the WiFi driver:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=97131f85c08e024df49480ed499aae8fb754067f


Kind regards,
Niklas

  reply	other threads:[~2022-10-06 14:46 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-09-27  7:04 [PATCH v5 0/7] libsas and drivers: NCQ error handling John Garry
2022-09-27  7:04 ` [PATCH v5 1/7] scsi: libsas: Add sas_ata_device_link_abort() John Garry
2022-09-27  7:04 ` [PATCH v5 2/7] scsi: hisi_sas: Move slot variable definition in hisi_sas_abort_task() John Garry
2022-09-27  7:04 ` [PATCH v5 3/7] scsi: hisi_sas: Add SATA_DISK_ERR bit handling for v3 hw John Garry
2022-09-27  7:04 ` [PATCH v5 4/7] scsi: hisi_sas: Modify v3 HW SATA disk error state completion processing John Garry
2022-09-27  7:04 ` [PATCH v5 5/7] scsi: pm8001: Modify task abort handling for SATA task John Garry
2022-09-27  7:04 ` [PATCH v5 6/7] scsi: pm8001: Use sas_ata_device_link_abort() to handle NCQ errors John Garry
2022-09-27  7:04 ` [PATCH v5 7/7] scsi: libsas: Make sas_{alloc, alloc_slow, free}_task() private John Garry
2022-10-04 13:05 ` [PATCH v5 0/7] libsas and drivers: NCQ error handling Niklas Cassel
2022-10-04 14:04   ` John Garry
2022-10-05  8:53     ` John Garry
2022-10-05 21:28       ` Niklas Cassel
2022-10-05 21:36         ` Damien Le Moal
2022-10-05 22:11           ` Niklas Cassel
2022-10-05 22:42             ` Damien Le Moal
2022-10-06  8:33               ` John Garry
2022-10-06 14:45                 ` Niklas Cassel [this message]
2022-10-06 16:41                   ` John Garry
2022-10-24 12:24                     ` Niklas Cassel
2022-10-24 12:44                       ` John Garry
2022-10-24 13:10                         ` Niklas Cassel
2022-10-24 16:20                           ` John Garry
2022-10-06 22:57                   ` Damien Le Moal
2022-10-06  8:37         ` John Garry

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Yz7qD+gpmI1bdw16@x1-carbon \
    --to=niklas.cassel@wdc.com \
    --cc=damien.lemoal@opensource.wdc.com \
    --cc=jejb@linux.ibm.com \
    --cc=jinpu.wang@cloud.ionos.com \
    --cc=john.garry@huawei.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-scsi@vger.kernel.org \
    --cc=linuxarm@huawei.com \
    --cc=martin.petersen@oracle.com \
    --cc=yanaijie@huawei.com \
    --cc=yangxingui@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).