Re: [PATCH] tests/qemu_iotests: Minimize usage of used ports

From: Eric Blake <eblake@redhat.com>
To: "Max Reitz" <mreitz@redhat.com>,
	"Lukáš Doktor" <ldoktor@redhat.com>,
	qemu-block@nongnu.org
Cc: kwolf@redhat.com, qemu-devel@nongnu.org,
	Stefan Hajnoczi <stefanha@redhat.com>,
	Aarushi Mehta <mehta.aaru20@gmail.com>
Subject: Re: [PATCH] tests/qemu_iotests: Minimize usage of used ports
Date: Thu, 6 Feb 2020 10:48:26 -0600	[thread overview]
Message-ID: <797578d5-bfab-5eb7-8921-0fcf1f3ee40e@redhat.com> (raw)
In-Reply-To: <726ca911-be83-c2d5-ff3f-efa32bc2233e@redhat.com>

On 2/6/20 10:37 AM, Max Reitz wrote:

>>
>> thank you and I am sorry for not digging deep enough. This week my CI failed with:
>>
>> 01:24:06 DEBUG| [stdout] +ERROR: test_inet (__main__.QemuNBD)
>> 01:24:06 DEBUG| [stdout] +----------------------------------------------------------------------
>> 01:24:06 DEBUG| [stdout] +Traceback (most recent call last):
>> 01:24:06 DEBUG| [stdout] +  File "147", line 85, in setUp
>> 01:24:06 DEBUG| [stdout] +    self.vm.launch()
>> 01:24:06 DEBUG| [stdout] +  File "/home/jenkins/ppc64le/qemu-master/tests/qemu-iotests/../../python/qemu/machine.py", line 302, in launch
>> 01:24:06 DEBUG| [stdout] +    self._launch()
>> 01:24:06 DEBUG| [stdout] +  File "/home/jenkins/ppc64le/qemu-master/tests/qemu-iotests/../../python/qemu/machine.py", line 319, in _launch
>> 01:24:06 DEBUG| [stdout] +    self._pre_launch()
>> 01:24:06 DEBUG| [stdout] +  File "/home/jenkins/ppc64le/qemu-master/tests/qemu-iotests/../../python/qemu/qtest.py", line 106, in _pre_launch
>> 01:24:06 DEBUG| [stdout] +    super(QEMUQtestMachine, self)._pre_launch()
>> 01:24:06 DEBUG| [stdout] +  File "/home/jenkins/ppc64le/qemu-master/tests/qemu-iotests/../../python/qemu/machine.py", line 270, in _pre_launch
>> 01:24:06 DEBUG| [stdout] +    self._qmp = qmp.QEMUMonitorProtocol(self._vm_monitor, server=True)
>> 01:24:06 DEBUG| [stdout] +  File "/home/jenkins/ppc64le/qemu-master/tests/qemu-iotests/../../python/qemu/qmp.py", line 60, in __init__
>> 01:24:06 DEBUG| [stdout] +    self.__sock.bind(self.__address)
>> 01:24:06 DEBUG| [stdout] +OSError: [Errno 98] Address already in use

Was this test 147?  If so, see:
https://lists.gnu.org/archive/html/qemu-devel/2020-02/msg01469.html

because that failure matches what I was seeing.

>>
>> I made the mistake of reproducing this on my home system using the qemu revision that I had and assuming it's caused by a used port. So I limited the port range and used nc to occupy the port. It sort-of reproduced but instead of Address already in use it hanged until I kill the nc process. Then it failed with:
>>
>> +Traceback (most recent call last):
>> +  File "147", line 124, in test_inet
>> +    flatten_sock_addr(address))
>> +  File "147", line 59, in client_test
>> +    self.assert_qmp(result, 'return', {})
>> +  File "/home/medic/Work/Projekty/qemu/tests/qemu-iotests/iotests.py", line 821, in assert_qmp
>> +    result = self.dictpath(d, path)
>> +  File "/home/medic/Work/Projekty/qemu/tests/qemu-iotests/iotests.py", line 797, in dictpath
>> +    self.fail('failed path traversal for "%s" in "%s"' % (path, str(d)))
>> +AssertionError: failed path traversal for "return" in "{'error': {'class': 'GenericError', 'desc': 'Failed to read initial magic: Unexpected end-of-file before all bytes were read'}}"
>>

That's a secondary failure, I assume if the initial bug is fixed we are 
less likely to hit the secondary one; but the secondary one may still be 
worth fixing.

>> After a brief study I thought qemu is not doing the job well enough and wanted to add a protection. Anyway after a more thorough overview I came to a different conclusion and that is that we are facing the same issue as with incoming migration about a year ago. What happened is that I started "nc -l localhost 32789" which results in:
>>
>> COMMAND   PID  USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
>> nc      26758 medic    3u  IPv6 9579487      0t0  TCP localhost:32789 (LISTEN)
>>
>> Then we start the server by "_try_server_up" where qemu-nbd detects the port is occupied on IPv6 but available on IPv4, so it claims it:
>> COMMAND   PID  USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
>> nc        26758 medic    3u  IPv6 9579487      0t0  TCP localhost:32789 (LISTEN)
>> qemu-nbd  26927 medic    4u  IPv4 9591857      0t0  TCP localhost:32789 (LISTEN)
>>
>> and reports success. Then we try to connect but the hotplugged VM first attempts to connect on the IPv6 address and hangs for infinity.
>>
>> Now is this an expected behavior? If so then we need the find_free_address (but preferably directly in _try_server_up just before starting the qemu-nbd) to leave as little time-frame for collision as possible. Otherwise the test is alright and qemu-nbd needs a fix to bail out in case some address is already used (IIRC this is what incoming migration does).
> 
> Ah, OK.
> 
> Well, expected behavior...  It’s a shame, that’s what it is.

In libnbd, we recently improved the testsuite by switching over to 
systemd-style fd passing: instead of asking qemu-nbd to open a random 
port (and hoping it is available), we instead pre-open the port (where 
failure is under our control) and then pass in that fd with environment 
variables to qemu-nbd, which in turn guarantees that qemu-nbd won't hit 
failures in trying to use the port.  Maybe we should utilize that more 
in qemu's own testsuite.

Also, I need to revisit my proposed patches for letting qemu-nbd support 
TLS over Unix sockets, as that's another way to avoid TCP contention 
(right now, qemu has an anachronistic prohibition against the 
combination of TLS and Unix sockets).

> 
>> My second mistake was testing this on the old code-base and rebasing it only before sending the patch (without testing after the rebase). If I were to test it first, I would have found out that the real reproducer is simply running the test as the commit 8dff69b9415b4287e900358744b732195e1ab2e2 broke it.
>>
>>
>> So basically there are 2 actions:
>>
>> 1. fix the test as on my system it fails in 100% of cases, bisect says the first bad commit is 8dff69b9415b4287e900358744b732195e1ab2e2. Would anyone have time in digging into this? I already spent way too much time on this and don't really know what that commit is trying to do.
> 
> Yep, I’ve sent a patch:
> 
> https://lists.nongnu.org/archive/html/qemu-block/2020-02/msg00294.html

Ah, so we did notice the same problem.

> 
>> 2. decide on the behavior when IPv4/6 is already in use (bail-out or start).
>> 2a. In case it should bail-out than the test is correct and there is no need for my patch. On the other hand qemu-nbd has to be fixed
> 
> I don’t think it makes much sense to let qemu’s NBD server ensure that
> it claims both IPv4 and IPv6 in case the user specifies a
> non-descriptive hostname.
> 
>> 2b. Otherwise I can send a v2 that will check the port in the _try_server_up just before starting qemu-nbd to minimize the risk of using a utilized port (or should you decide it's not worth checking, I can simply forget about this)
> 
> Hm.  It wouldn’t be fully reliable, but, well...  The risk would be minimal.
> 
> OTOH, would it work if we just did a %s/localhost/127.0.0.1/ in the
> test?  We have specific cases for IPv6, so I think it makes sense to
> force IPv4 in all other cases.

Except then it will fail on machines configured for IPv6-only.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org