Re: [PATCH] tests/qemu_iotests: Minimize usage of used ports

From: "Lukáš Doktor" <ldoktor@redhat.com>
To: Max Reitz <mreitz@redhat.com>, qemu-block@nongnu.org
Cc: kwolf@redhat.com, Stefano Garzarella <sgarzare@redhat.com>,
	qemu-devel@nongnu.org, Stefan Hajnoczi <stefanha@redhat.com>,
	Aarushi Mehta <mehta.aaru20@gmail.com>
Subject: Re: [PATCH] tests/qemu_iotests: Minimize usage of used ports
Date: Thu, 6 Feb 2020 17:27:48 +0100	[thread overview]
Message-ID: <595ffedf-4213-f000-96e1-3216e272b1c1@redhat.com> (raw)
In-Reply-To: <fe6a7eb5-352a-3cb5-4e75-bd99baf6d54e@redhat.com>

[-- Attachment #1.1: Type: text/plain, Size: 6681 bytes --]

Dne 06. 02. 20 v 16:03 Max Reitz napsal(a):
> On 03.02.20 08:59, Lukáš Doktor wrote:
>> Using a range of ports from 32768 to 65538 is dangerous as some
>> application might already be listening there and interfere with the
>> testing. There is no way to reserve ports, but let's decrease the chance
>> of interactions by only using ports that were free at the time of
>> importing this module.
>>
>> Without this patch CI occasionally fails with used ports. Additionally I
>> tried listening on the first port to be tried via "nc -l localhost
>> $port" and no matter how many other ports were available it always
>> hanged for infinity.
> 
> I’m afraid I don’t quite understand.  The new functions check whether
> the ports are available for use by creating a server on them (i.e.,
> binding a socket there).  The current code just lets qemu create a
> server there, and see if that works or not.
> 
> So the only difference I can see is that instead of trying out random
> ports during the test and see whether they’re free to use we do this
> check only once when the test is started.
> 
> And the only problem I can imagine from your description is that there
> is some other tool on the system that tries to set up a server but
> cannot because we already have an NBD server there (by accident).
> 
> But I don’t see how checking for free ports once at startup solves that
> problem reliably.
> 
> If what I guessed above is right, the only reliable solution I can
> imagine would be to allow users to specify the port range through
> environment variables, and then you’d have to specify a range that you
> know is free for use.
> 
> Max
> 

Hello Max,

thank you and I am sorry for not digging deep enough. This week my CI failed with:

01:24:06 DEBUG| [stdout] +ERROR: test_inet (__main__.QemuNBD)
01:24:06 DEBUG| [stdout] +----------------------------------------------------------------------
01:24:06 DEBUG| [stdout] +Traceback (most recent call last):
01:24:06 DEBUG| [stdout] +  File "147", line 85, in setUp
01:24:06 DEBUG| [stdout] +    self.vm.launch()
01:24:06 DEBUG| [stdout] +  File "/home/jenkins/ppc64le/qemu-master/tests/qemu-iotests/../../python/qemu/machine.py", line 302, in launch
01:24:06 DEBUG| [stdout] +    self._launch()
01:24:06 DEBUG| [stdout] +  File "/home/jenkins/ppc64le/qemu-master/tests/qemu-iotests/../../python/qemu/machine.py", line 319, in _launch
01:24:06 DEBUG| [stdout] +    self._pre_launch()
01:24:06 DEBUG| [stdout] +  File "/home/jenkins/ppc64le/qemu-master/tests/qemu-iotests/../../python/qemu/qtest.py", line 106, in _pre_launch
01:24:06 DEBUG| [stdout] +    super(QEMUQtestMachine, self)._pre_launch()
01:24:06 DEBUG| [stdout] +  File "/home/jenkins/ppc64le/qemu-master/tests/qemu-iotests/../../python/qemu/machine.py", line 270, in _pre_launch
01:24:06 DEBUG| [stdout] +    self._qmp = qmp.QEMUMonitorProtocol(self._vm_monitor, server=True)
01:24:06 DEBUG| [stdout] +  File "/home/jenkins/ppc64le/qemu-master/tests/qemu-iotests/../../python/qemu/qmp.py", line 60, in __init__
01:24:06 DEBUG| [stdout] +    self.__sock.bind(self.__address)
01:24:06 DEBUG| [stdout] +OSError: [Errno 98] Address already in use

I made the mistake of reproducing this on my home system using the qemu revision that I had and assuming it's caused by a used port. So I limited the port range and used nc to occupy the port. It sort-of reproduced but instead of Address already in use it hanged until I kill the nc process. Then it failed with:

+Traceback (most recent call last):
+  File "147", line 124, in test_inet
+    flatten_sock_addr(address))
+  File "147", line 59, in client_test
+    self.assert_qmp(result, 'return', {})
+  File "/home/medic/Work/Projekty/qemu/tests/qemu-iotests/iotests.py", line 821, in assert_qmp
+    result = self.dictpath(d, path)
+  File "/home/medic/Work/Projekty/qemu/tests/qemu-iotests/iotests.py", line 797, in dictpath
+    self.fail('failed path traversal for "%s" in "%s"' % (path, str(d)))
+AssertionError: failed path traversal for "return" in "{'error': {'class': 'GenericError', 'desc': 'Failed to read initial magic: Unexpected end-of-file before all bytes were read'}}"

After a brief study I thought qemu is not doing the job well enough and wanted to add a protection. Anyway after a more thorough overview I came to a different conclusion and that is that we are facing the same issue as with incoming migration about a year ago. What happened is that I started "nc -l localhost 32789" which results in:

COMMAND   PID  USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
nc      26758 medic    3u  IPv6 9579487      0t0  TCP localhost:32789 (LISTEN)

Then we start the server by "_try_server_up" where qemu-nbd detects the port is occupied on IPv6 but available on IPv4, so it claims it:
COMMAND   PID  USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
nc        26758 medic    3u  IPv6 9579487      0t0  TCP localhost:32789 (LISTEN)
qemu-nbd  26927 medic    4u  IPv4 9591857      0t0  TCP localhost:32789 (LISTEN)

and reports success. Then we try to connect but the hotplugged VM first attempts to connect on the IPv6 address and hangs for infinity.

Now is this an expected behavior? If so then we need the find_free_address (but preferably directly in _try_server_up just before starting the qemu-nbd) to leave as little time-frame for collision as possible. Otherwise the test is alright and qemu-nbd needs a fix to bail out in case some address is already used (IIRC this is what incoming migration does).

My second mistake was testing this on the old code-base and rebasing it only before sending the patch (without testing after the rebase). If I were to test it first, I would have found out that the real reproducer is simply running the test as the commit 8dff69b9415b4287e900358744b732195e1ab2e2 broke it.

So basically there are 2 actions:

1. fix the test as on my system it fails in 100% of cases, bisect says the first bad commit is 8dff69b9415b4287e900358744b732195e1ab2e2. Would anyone have time in digging into this? I already spent way too much time on this and don't really know what that commit is trying to do.
2. decide on the behavior when IPv4/6 is already in use (bail-out or start).
2a. In case it should bail-out than the test is correct and there is no need for my patch. On the other hand qemu-nbd has to be fixed
2b. Otherwise I can send a v2 that will check the port in the _try_server_up just before starting qemu-nbd to minimize the risk of using a utilized port (or should you decide it's not worth checking, I can simply forget about this)

Regards,
Lukáš

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]