From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:51958)
	by lists.gnu.org with esmtp (Exim 4.71) (envelope-from )
	id 1fVRL8-0002yC-GV for qemu-devel@nongnu.org;
	Tue, 19 Jun 2018 20:51:17 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from ) id 1fVRL5-0001DP-DI for qemu-devel@nongnu.org;
	Tue, 19 Jun 2018 20:51:14 -0400
Date: Wed, 20 Jun 2018 06:23:19 +0530
From: Amol Surati
Message-ID: <20180620005317.GA30916@arch>
References: <20180617183515.3982-1-suratiamol@gmail.com>
	<20180617183515.3982-2-suratiamol@gmail.com>
	<20180618180254.GA2441@arch>
	<20180619040153.GA8868@arch>
	<20180619085329.GA4292@dhcp-200-186.str.redhat.com>
	<834a13be-0b5c-1249-ad0a-1a1accb3e058@redhat.com>
	<20180619143401.GA27130@arch>
	<20180619212558.GA29939@arch>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
Subject: Re: [Qemu-devel] [Qemu-block] [RFC 1/1] ide: bug #1777315:
	io_buffer_size and sg.size can represent partial sector sizes
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
To: John Snow
Cc: Amol Surati , Kevin Wolf , qemu-devel@nongnu.org, "open list:IDE"

On Tue, Jun 19, 2018 at 05:43:52PM -0400, John Snow wrote:
> 
> 
> On 06/19/2018 05:26 PM, Amol Surati wrote:
> > On Tue, Jun 19, 2018 at 08:04:03PM +0530, Amol Surati wrote:
> >> On Tue, Jun 19, 2018 at 09:45:15AM -0400, John Snow wrote:
> >>>
> >>>
> >>> On 06/19/2018 04:53 AM, Kevin Wolf wrote:
> >>>> On 19.06.2018 at 06:01, Amol Surati wrote:
> >>>>> On Mon, Jun 18, 2018 at 08:14:10PM -0400, John Snow wrote:
> >>>>>>
> >>>>>>
> >>>>>> On 06/18/2018 02:02 PM, Amol Surati wrote:
> >>>>>>> On Mon, Jun 18, 2018 at 12:05:15AM +0530, Amol Surati wrote:
> >>>>>>>> This patch fixes the assumption that io_buffer_size is always a perfect
> >>>>>>>> multiple of the sector size. The assumption is the cause of the firing
> >>>>>>>> of 'assert(n * 512 == s->sg.size);'.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Amol Surati
> >>>>>>>> ---
> >>>>>>>
> >>>>>>> The repository https://github.com/asurati/1777315 contains a module for
> >>>>>>> QEMU's 8086:7010 ATA controller, which exercises the code path
> >>>>>>> described in [RFC 0/1] of this series.
> >>>>>>>
> >>>>>>
> >>>>>> Thanks, this made it easier to see what was happening. I was able to
> >>>>>> write an ide-test test case using this source as a guide, and reproduce
> >>>>>> the error.
> >>>>>>
> >>>>>> static void test_bmdma_partial_sector_short_prdt(void)
> >>>>>> {
> >>>>>>     QPCIDevice *dev;
> >>>>>>     QPCIBar bmdma_bar, ide_bar;
> >>>>>>     uint8_t status;
> >>>>>>
> >>>>>>     /* Read 2 sectors but only give 1 sector in PRDT */
> >>>>>>     PrdtEntry prdt[] = {
> >>>>>>         {
> >>>>>>             .addr = 0,
> >>>>>>             .size = cpu_to_le32(0x200),
> >>>>>>         },
> >>>>>>         {
> >>>>>>             .addr = 512,
> >>>>>>             .size = cpu_to_le32(0x44 | PRDT_EOT),
> >>>>>>         }
> >>>>>>     };
> >>>>>>
> >>>>>>     dev = get_pci_device(&bmdma_bar, &ide_bar);
> >>>>>>     status = send_dma_request(CMD_READ_DMA, 0, 2,
> >>>>>>                               prdt, ARRAY_SIZE(prdt), NULL);
> >>>>>>     g_assert_cmphex(status, ==, 0);
> >>>>>>     assert_bit_clear(qpci_io_readb(dev, ide_bar, reg_status), DF | ERR);
> >>>>>>     free_pci_device(dev);
> >>>>>> }
> >>>>>>
> >>>>>>> Loading the module reproduces the bug. Tested on the latest master
> >>>>>>> branch.
> >>>>>>>
> >>>>>>> Steps:
> >>>>>>> - Install a Linux distribution as a guest, ensuring that the boot disk
> >>>>>>>   resides on non-IDE controllers (such as virtio)
> >>>>>>> - Attach another disk as a master device on the primary
> >>>>>>>   IDE controller (i.e. attach at -hda.)
> >>>>>>> - Blacklist ata_piix, pata_acpi and ata_generic modules, and reboot.
> >>>>>>> - Copy the source files into the guest and build the module.
> >>>>>>> - Load the module. QEMU process should die with the message:
> >>>>>>>   qemu-system-x86_64: hw/ide/core.c:871: ide_dma_cb:
> >>>>>>>   Assertion `n * 512 == s->sg.size' failed.
> >>>>>>>
> >>>>>>>
> >>>>>>> -Amol
> >>>>>>>
> >>>>>>
> >>>>>> I'm less sure of the fix -- certainly the assert is wrong, but just
> >>>>>> incrementing 'n' is wrong too -- we didn't copy (n+1) sectors, we copied
> >>>>>> (n) and a few extra bytes.
> >>>>>
> >>>>> That is true.
> >>>>>
> >>>>> There are (at least) two fields that represent the total size of a DMA
> >>>>> transfer -
> >>>>> (1) The size, as requested through the NSECTOR field.
> >>>>> (2) The size, as calculated through the length fields of the PRD entries.
> >>>>>
> >>>>> It makes sense to consider the most restrictive of the sizes, as the factor
> >>>>> which determines both the end of a successful DMA transfer and the
> >>>>> condition to assert.
> >>>>>
> >>>>>>
> >>>>>> The sector-based math here would need to be adjusted to be able to cope
> >>>>>> with partial sector reads... or we ought to avoid doing any partial
> >>>>>> sector transfers.
> >>>>>>
> >>>>>>
> >>>>>> I'm not sure which is more correct tonight, it depends:
> >>>>>>
> >>>>>> - If it's OK to transfer partial sectors before reporting overflow,
> >>>>>>   adjusting the command loop to work with partial sectors is OK.
> >>>>>>
> >>>>>> - If it's NOT OK to do partial sector transfer, the sglist preparation
> >>>>>>   phase needs to produce a truncated SGList that's some multiple of 512
> >>>>>>   bytes that leaves the excess bytes in a second sglist that we don't
> >>>>>>   throw away and can use as a basis for building the next sglist. (Or the
> >>>>>>   DMA helpers need to take a max_bytes parameter and return an sglist
> >>>>>>   representing unused buffer space if the command underflowed.)
> >>>>>
> >>>>> Support for partial sector transfers is built into the DMA interface's PRD
> >>>>> mechanism itself, because an entry is allowed to transfer in units of an
> >>>>> even number of bytes.
> >>>>>
> >>>>> I think the controller's IO process runs in two parts (probably loops over
> >>>>> for a single transfer):
> >>>>>
> >>>>> (1) The controller's disk interface transfers between its internal buffer
> >>>>>     and the disk storage. The transfers are likely to be in
> >>>>>     multiples of a sector.
> >>>>> (2) The controller's DMA interface transfers between its internal buffer
> >>>>>     and the system memory. The transfers can be sub-sector in size (and
> >>>>>     preserve the areas of the internal buffer not subject to a
> >>>>>     write.)
> >>>>
> >>>> The spec isn't clear about this (or at least I can't find anything where
> >>>> the exact behaviour is specified), but I agree that that's my mental
> >>>> model as well. So I would make IDE send a byte-granularity request with
> >>>> the final partial sector to the block layer, so that the data is
> >>>> actually transferred up to that point.
> >>>>
> >>>> In practice it probably doesn't matter much because a too short PRDT
> >>>> means that the request doesn't complete successfully (the condition is
> >>>> indicated by clear Interrupt, Active and Error flags in the BMDMA
> >>>> controller) and I suppose the guest won't actually look at the data
> >>>> then.
> >>>>
> >>>> Providing the data anyway (consistent with our assumption of how real
> >>>> hardware works) is erring on the safe side because it doesn't hurt a
> >>>> reasonable guest that did not expect the data to be transferred in this
> >>>> condition.
> >>>>
> >>>> Kevin
> >>>>
> >>>
> >>> Partial transfers it is, since I didn't see anything in AHCI or BMDMA
> >>> that suggested we shouldn't, and it seems like the actual easiest fix
> >>> because it avoids having to modify the sglists or the sglist preparation
> >>> functions.
> >>>
> >>> Amol, would you like to author a fix, or would you prefer that I do it?
> >>
> >> Yes, I would like to author it.
> >> I assume that the simplification of the
> >> calls to prepare_buf is better kept as a change that is separate from
> >> this fix.
> >>
> >>>
> >>> If you do, please copy my ide-test case above and check it in as patch
> >>> 2/2 to your series as a regression test (tests/ide-test.c). It may need
> >>> some further editing after the command submission to pass; I only made
> >>> sure it crashes QEMU.
> >>
> >> Will do.
> >>
> >> Thanks,
> >> -amol
> >>
> >>>
> >>> Thanks,
> >>> --js
> >
> > After ide_dma_cb checks the return value from prepare_buf, it comments
> > thus:
> >
> > "The PRDs were too short. Reset the Active bit, but don't raise an
> > interrupt."
> >
> > So, QEMU already has a policy in place for short PRDs. I apologize
> > that I did not notice it earlier.
> >
> 
> Any sane human would be forgiven for not being able to read this function. :D
> 
> > It is enforced by the condition:
> > "(s->bus->dma->ops->prepare_buf(s->bus->dma, s->io_buffer_size) < 512)".
> >
> > The definition of short PRDs it uses is: those which result in a
> > "< 512" bytes transfer.
> >
> > Is there a reason to not define them, as those which result in a
> > "< n * 512" bytes transfer, to begin with (or now)?
> > (n = s->nsector).
> 
> ide_dma_cb is written to possibly execute more than once per command. It
> tries to gather up what it can from prepare_buf, but as long as it gets
> ENOUGH sglist to transfer "at least one sector" it goes ahead with the
> read, under the idea that maybe when we return to ide_dma_cb, the next
> call to prepare_buf will give it more space to transfer the rest of the
> data to.
> 
> This is why it checks for < 512 instead of < (s->nsectors * 512).
> "As long as I can transfer a sector's worth, I can make progress on the
> command."
> 
> As we now know, this mechanism is faulty because we don't always build
> sglists that are multiples of 512.
> A way to fix this would be to use
> s->sg.size as an offset when building the next transfer instead of (n *
> 512).

Understood. For legacy or other reasons, QEMU supports piece-wise r/w
commands, where each piece is assumed to be a multiple of a sector. That
requires maintaining state (such as bm->cur_addr) across multiple calls
to ide_dma_cb (and to prepare_buf) both inside QEMU (that is already in
place for those reasons) and inside the guest. That also requires
providing interrupts to signal state transitions. I am assuming such
transitions exist, since the design itself implies them in some form.

> 
> However, I am pretty sure that for both pci bmdma and ahci, we are
> guaranteed to have the entire sglist from the guest before reaching
> ide_dma_cb, so we might be able to adjust this loop to understand that
> prepare_buf will never yield further results.
> 
> (...I think. I'm actually not confident, since I never wrote the
> original DMA loop. It looks right, though.)
> 
> In this case, you could just change the overflow condition to check for
> the entire transfer size and declare the overflow sooner instead of later.

To consider "< n * 512" as a fix, it is not necessary to look for the
guarantee of receiving the entire sglist in one go on the first call to
ide_dma_cb. The legacy mechanism is expected to behave consistently,
whether the guest provides one PRD per state transition (or per call to
prepare_buf), or 2 PRDs per state transition, or all PRDs per state
transition. (I know there might not be such a guest, but QEMU does
assume its existence in its design.) A guest does not expect differences
in outward behaviour regardless of the way it provides the PRDs.

Consider the example of a command that transfers 5 consecutive sectors.
Each line below represents a call to prepare_buf and its return value,
also indicating the number of PRD entries the call consumed and their
sizes.

E1:
---
512
512
512
128    <--- No interrupt, be inactive, 3 wasted transfers.
512
384

Next, E2:
---
512+512=1024
512+128=640    <--- no 'short PRD', crash, but the same command as above.
512+384=896

Next, E3:
---
512+512+512+128+512+384    <--- success, but the same command as above.

The behaviour is indeed guest-visibly inconsistent, but manageable.

Fixing by using s->sg.size as the offset for the next transfer brings
consistency to all 3 cases E1, E2 and E3, but it also means managing
granularity at the 2-byte level, and introduces extensive code changes.

Fixing through "< n * 512" instead,
- all instances of E2 (of which there has been only one reported) are
  included in the set E1, while
- E3 continues to succeed as before (since "< n * 512" is false for E3), and
- E1 continues to fail as before (since anything "< 512" is necessarily
  "< n * 512", assuming n >= 1 and no integer overflows.)

There were already two types of outcomes before the fix (E1 and E3);
after it, there are still only two.

For me, this line of reasoning is sufficient to go with the "< n * 512"
overflow fix. Let me know if there are objections - I am okay writing
either or both of the fixes.

Apologies for this turnabout - I might have spoken too soon in my
earlier emails.

> 
> > 
> > If the prepare_buf interface implementation and use, that comment, and the
> > behaviour (drop and goto eot) which that comment points towards, are all
> > consistent with one another, then the condition "< 512" looks like a bug
> > in itself, solving of which also solves this crash (hopefully, without
> > making any other change.)
> >
> > Thanks,
> > -amol
> >