Re: [RFC 4/4] ahci media error reporting

From: John Snow <jsnow@redhat.com>
To: tasleson@redhat.com
Cc: Kevin Wolf <kwolf@redhat.com>, qemu-devel <qemu-devel@nongnu.org>,
	Qemu-block <qemu-block@nongnu.org>
Subject: Re: [RFC 4/4] ahci media error reporting
Date: Fri, 20 Sep 2019 13:22:48 -0400	[thread overview]
Message-ID: <8b8d198d-91a2-77a9-08a3-c54613789a92@redhat.com> (raw)
In-Reply-To: <afac6895-9a42-1eaa-3068-3e3dfdd1bd23@redhat.com>

On 9/19/19 5:49 PM, Tony Asleson wrote:
> On 9/19/19 3:43 PM, John Snow wrote:
>>
>>
>> On 9/19/19 3:48 PM, Tony Asleson wrote:
>>> Initial attempt at returning a media error for ahci.  This is certainly
>>> wrong and needs serious improvement.
>>>
>>
>> Hi; I have the unfortunate distinction of being the AHCI maintainer.
>> Please CC me on future revisions and discussion if you are interacting
>> with my problem child.
> 
> Will do and thank you for taking a look at this!
> 
>> Also remember to CC qemu-block.
>>
>>> Signed-off-by: Tony Asleson <tasleson@redhat.com>
>>> ---
>>>  hw/ide/ahci.c | 27 +++++++++++++++++++++++++++
>>>  1 file changed, 27 insertions(+)
>>>
>>> diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
>>> index d45393c019..f487764106 100644
>>> --- a/hw/ide/ahci.c
>>> +++ b/hw/ide/ahci.c
>>> @@ -36,6 +36,7 @@
>>>  #include "hw/ide/internal.h"
>>>  #include "hw/ide/pci.h"
>>>  #include "ahci_internal.h"
>>> +#include "block/error_inject.h"
>>>  
>>>  #include "trace.h"
>>>  
>>> @@ -999,6 +1000,22 @@ static void ncq_err(NCQTransferState *ncq_tfs)
>>>      ncq_tfs->used = 0;
>>>  }
>>>  
>>> +/*
>>> + * Figure out correct way to report media error, this is at best a guess
>>> + * and based on the output of linux kernel, not even remotely close.
>>> + */
>>
>> Depends on what kind of error, exactly, you're trying to report, and at
>> what level. (AHCI, NCQ, SATA, ATA?)
> 
> I was trying to return a media error, like a 3/1101 for SCSI device.  I
> think that is at the ATA level?
> 
> 
>> Keep in mind that you've inserted an error path for SATA drives using
>> NCQ, but seemingly haven't touched SATA drives not using NCQ, or ATAPI
>> devices using either PATA/SATA, or ATA drives on PATA.
> 
> Correct, but for trying out a simple read on a SATA drive for Linux I
> end up here first :-)  Well, until the kernel retries a number of times
> and finally gives up and returns an error while also disabling NCQ for
> the device.
> 
> 
>>> +static void ncq_media_err(NCQTransferState *ncq_tfs, uint64_t err_sector)
>>> +{
>>> +    IDEState *ide_state = &ncq_tfs->drive->port.ifs[0];
>>> +
>>> +    ide_state->error = ECC_ERR;
>>> +    ide_state->status = READY_STAT | ERR_STAT;
>>> +    ncq_tfs->drive->port_regs.scr_err |= (1 << ncq_tfs->tag);
>>> +    ncq_tfs->lba = err_sector;
>>> +    qemu_sglist_destroy(&ncq_tfs->sglist);
>>> +    ncq_tfs->used = 0;
>>> +}
>>> +
>>
>> If you are definitely very sure you only want an ide_state error
>> difference, you could just as well call ncq_err and then patch
>> ide_state->error.
>>
>> (I am not sure that's what you want, but... maybe it is?)
> 
> As I mentioned above, return an unrecoverable media error.
> 
> SCSI sense data can report the first sector in error and I thought I
> could do the same for SATA too?
> 
>> I'd have to check -- because I can't say the AHCI emulator was designed
>> so much as congealed -- but you might need calls to ncq_finish.
>>
>> usually, ncq_cb handles the return from any NCQ command and will call
>> ncq_err and ncq_finish as appropriate to tidy up the command.
>>
>> It might be a mistake that execute_ncq_command issues ncq_err in the
>> `default` arm of the switch statement without a call to finish.
>>
>> If we do call ncq_finish from this context I'm not sure if we want
>> block_acct_done here unconditionally. We may not have started a block
>> accounting operation if we never started a backend operation. Everything
>> else looks about right to me.
>>
>>
>>>  static void ncq_finish(NCQTransferState *ncq_tfs)
>>>  {
>>>      /* If we didn't error out, set our finished bit. Errored commands
>>> @@ -1065,6 +1082,8 @@ static void execute_ncq_command(NCQTransferState *ncq_tfs)
>>>  {
>>>      AHCIDevice *ad = ncq_tfs->drive;
>>>      IDEState *ide_state = &ad->port.ifs[0];
>>> +    uint64_t error_sector = 0;
>>> +    char device_id[32];
>>>      int port = ad->port_no;
>>>  
>>>      g_assert(is_ncq(ncq_tfs->cmd));
>>> @@ -1072,6 +1091,14 @@ static void execute_ncq_command(NCQTransferState *ncq_tfs)
>>>  
>>>      switch (ncq_tfs->cmd) {
>>>      case READ_FPDMA_QUEUED:
>>> +        sprintf(device_id, "%lu", ide_state->wwn);
>>
>> This seems suspicious for your design in general, but I'd like to not
>> run sprintf to a buffer in the hotpath for NCQ.
> 
> I totally agree.
> 
> I started out using integers in the call for error_in_read, as that is
> what SCSI uses internally for wwn.  Then I did NVMe and it's using a
> string that doesn't apparently need to be an integer for the wwn? so I
> changed it to being a string to accommodate.
> 
> I also find it interesting that when a SATA device wwid is dumped out
> within the guest it doesn't appear to have any correlation with the wwn
> that was passed in on the command line, eg.
> 
> -device ide-drive,drive=satadisk,bus=ahci.0,wwn=8675309
> 
> $cat /sys/block/sda/device/wwid
> t10.ATA     QEMU HARDDISK                           QM00005
> 
> 
> This does correlate for SCSI
> 
> -device scsi-hd,drive=hd,wwn=12345678
> 
> $ cat /sys/block/sdc/device/wwid
> naa.0000000000bc614e
> 
> 
> as 0xbc614e = 12345678
> 
> 
> For NVMe, the wwn is in the wwid, but it's not immediately obvious.
> 
> Being able to correlate between the command line and what you find in
> the guest would be good.
> 
> 
>> If you need this, I'd do it when wwn is set and just grab it from the
>> state structure.
>>
>>> +
>>> +        if (error_in_read(device_id, ncq_tfs->lba,
>>> +                ncq_tfs->sector_count, &error_sector)) {
>>> +            ncq_media_err(ncq_tfs, error_sector);
>>> +            return;
>>> +        }
>>> +
>>
>> One of the downsides to trying to trigger read error injections
>> per-device instead of per-node is that now you have to goof around with
>> device-specific code everywhere>
>> I suppose from your cover letter you *WANT* device-specific error
>> exercising, which would necessitate a different design from blkdebug,
>> but it means you have to add support for the framework per-device and it
>> might be very tricky to get right.
> 
> Yes, goal was to be able to selectively pick one or more specific block
> devices and then create one or more block errors on each device with
> potentially different error behavior for each block in error.
> 

Yeah -- I imagine you want to see how the kernel will respond to various
error situations posed by certain block devices themselves.

Well... In general, we support rerror=stop and werror=stop in all of our
block emulators that we care about, and you can find the choke points
that implement this by looking for e.g. 'BLOCK_ERROR_ACTION_STOP' --
these error-handling points should already be kind to the various
emulator state machines; the emulators know how to manage the error
actions correctly from this point.

I think whatever solution you end up on should look to expand these
choke points instead of trying to add new ones.

... maybe. maybe it's still too hard to weave together in that way, I'm
not sure. I only care about IDE/ATA devices, so it's just a
recommendation to try to use the existing error handling points as a
basis instead of trying to add new ones.

I think adding new ones will be hard to verify for correctness and hard
to test.

--js

> 
>>>          trace_execute_ncq_command_read(ad->hba, port, ncq_tfs->tag,
>>>                                         ncq_tfs->sector_count, ncq_tfs->lba);
>>>          dma_acct_start(ide_state->blk, &ncq_tfs->acct,
>>>
>