From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
To: Eric Blake <eblake@redhat.com>, Kevin Wolf <kwolf@redhat.com>,
	qemu-block@nongnu.org
Cc: qemu-devel@nongnu.org, Max Reitz <mreitz@redhat.com>
Subject: Re: [Qemu-devel] [PULL 21/35] block: fix QEMU crash with scsi-hd and drive_del
Date: Wed, 8 Aug 2018 17:32:23 +0300
Message-ID: <20070601-f5e0-6433-e266-823341d09634@virtuozzo.com>
In-Reply-To: <bfd7fd9e-11ad-3aa3-68b8-b9b46df36f1a@virtuozzo.com>

08.08.2018 12:33, Vladimir Sementsov-Ogievskiy wrote:
> 07.08.2018 22:57, Eric Blake wrote:
>> On 08/06/2018 05:04 PM, Eric Blake wrote:
>>> On 06/18/2018 11:44 AM, Kevin Wolf wrote:
>>>> From: Greg Kurz <groug@kaod.org>
>>>>
>>>> Removing a drive with drive_del while it is being used to run an I/O
>>>> intensive workload can cause QEMU to crash.
>>>>
>>> ...
>>
>>>
>>> Test 83 sets up a client that intentionally disconnects at critical 
>>> points in the NBD protocol exchange, to ensure that the server 
>>> reacts sanely.
>>
>> Rather, nbd-fault-injector.py is a server that disconnects at 
>> critical points, and the test is of client reaction.
>>
>>>   I suspect that somewhere in the NBD code, the server detects the 
>>> disconnect and was somehow calling into blk_remove_bs() (although I 
>>> could not quickly find the backtrace); and that prior to this patch, 
>>> the 'Connection closed' message resulted from other NBD coroutines 
>>> getting a shot at the (now-closed) connection, while after this 
>>> patch, the additional blk_drain() somehow tweaks things in a way 
>>> that prevents the other NBD coroutines from printing a message.  If 
>>> so, then the change in 83 reference output is probably intentional, 
>>> and we should update it.
>>
>> It seems like this condition is racy, and that the race is more 
>> likely to be lost prior to this patch than after. It's a question of 
>> whether the client has time to start a request to the server prior to 
>> the server hanging up, as the message is generated during 
>> nbd_co_do_receive_one_chunk.  Here's a demonstration of the fact that 
>> things are racy:
>>
>> $ git revert f45280cbf
>> $ make
>> $ cd tests/qemu-iotests
>> $ cat fault.txt
>> [inject-error "a"]
>> event=neg2
>> when=after
>> $ python nbd-fault-injector.py localhost:10809 ./fault.txt &
>> Listening on 127.0.0.1:10809
>> $ ../../qemu-io -f raw nbd://localhost:10809 -c 'r 0 512'
>> Closing connection on rule match inject-error "a"
>> Connection closed
>> read failed: Input/output error
>> $ python nbd-fault-injector.py localhost:10809 ./fault.txt &
>> Listening on 127.0.0.1:10809
>> $ ../../qemu-io -f raw nbd://localhost:10809
>> Closing connection on rule match inject-error "a"
>> qemu-io> r 0 512
>> read failed: Input/output error
>> qemu-io> q
>>
>> So whether the read command is kicked off quickly (via 
>> -c) or slowly (via typing into qemu-io) determines whether the 
>> message appears.
>>
>> What's more, in commit f140e300, we specifically called out in the 
>> commit message that maybe it was better to trace when we detect 
>> connection closed rather than log it to stdout, and in all cases in 
>> that commit, the additional 'Connection closed' messages do not add 
>> any information to the error message already displayed by the rest of 
>> the code.
>>
>> I don't know how much the proposed NBD reconnect code will change 
>> things in 3.1.  Meanwhile, we've missed any chance for 3.0 to fix 
>> test 83.
>>
>>>
>>> But I'm having a hard time convincing myself that this is the case, 
>>> particularly since I'm not even sure how to easily debug the 
>>> assumptions I made above.
>>>
>>> I'm very weak on the whole notion of what blk_drain() vs. 
>>> blk_remove_bs() is really supposed to be doing, and could easily be 
>>> persuaded that the change in output is a regression instead of a fix.
>>
>> At this point, I don't think we have a regression, just merely a bad 
>> iotests reference output. The extra blk_drain() merely adds more time 
>> before the NBD code can send out its first request, and thus makes it 
>> more likely that the fault injector has closed the connection before 
>> the read request is issued rather than after (the message only 
>> appears when read beats the race), but the NBD code shouldn't be 
>> printing the error message in the first place, and 083 needs to be 
>> tweaked to remove the noisy lines added in f140e300 (not just the 
>> three lines that are reliably different due to this patch, but all 
>> other such lines due to strategic server drops at other points in the 
>> NBD protocol).
>>
>
> Ok, agree, I'll do it in reconnect series.
>


Hmm, do what?

I was going to change these error messages to traces, but now I'm not 
sure that's a good idea. We have a generic errp returned from the 
function, so why drop it from the logs? Fixing an iotest is not a good 
reason; it's better to adjust the iotest itself a bit (just commit the 
changed output) and forget about it. Is the iotest itself racy? Did you 
see different output when running iotest 83 itself, rather than testing 
by hand?
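
To make sure I understand the race described in the quoted text, here is a 
toy model of it. This is only a sketch, not QEMU code: the function names 
and the timestamps are invented for illustration. The only assumption 
taken from the demonstration above is that the extra 'Connection closed' 
line appears when the read is issued before the server hangs up. The 
second function shows one possible way a test could normalize its output 
if the iotest does turn out to be racy.

```python
# Toy model only: not QEMU code; names and timings are hypothetical.

def client_output(read_issued_at, server_closed_at):
    """Return the lines the client prints in this simplified model.

    The extra 'Connection closed' line is emitted only when the read
    was already issued before the server hung up.
    """
    lines = []
    if read_issued_at < server_closed_at:
        lines.append("Connection closed")
    lines.append("read failed: Input/output error")
    return lines


def filter_connection_closed(lines):
    """Drop the racy line so that both orderings compare equal."""
    return [line for line in lines if line != "Connection closed"]


# Read kicked off quickly, as with: qemu-io ... -c 'r 0 512'
fast = client_output(read_issued_at=1, server_closed_at=2)
# Read kicked off slowly, as when typing at the qemu-io prompt
slow = client_output(read_issued_at=3, server_closed_at=2)

print(fast)
print(slow)
print(filter_connection_closed(fast) == filter_connection_closed(slow))
```

With such a filter the reference output would not depend on who wins the 
race, while the errp message itself stays in the logs.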

-- 
Best regards,
Vladimir
