* Current QA failure challenges (perf, QA http-server, asyncio prserv, bitbake runCommand timeout, unfsd test issue)
@ 2023-01-29 15:41 Richard Purdie
  2023-01-30 20:23 ` [OE-core] " Alexander Kanavin
                   ` (5 more replies)
  0 siblings, 6 replies; 11+ messages in thread
From: Richard Purdie @ 2023-01-29 15:41 UTC (permalink / raw)
  To: openembedded-core; +Cc: Bruce Ashfield, Joshua Watt, Alexandre Belloni

Hi All,

I'm going to be distracted this next week with other project-related
things and the autobuilder is showing a number of problems. I've spent
part of my weekend trying to stabilise things so they wouldn't be a
distraction, and whilst I've had partial success, things aren't
resolved. I'm sending this out in the hope others may be able to
help. The issues:

Kernel 6.1 reproducibility issue in perf
========================================

We've had two fixes which do seem to have fixed the kernel-devsrc issue
but not the perf one. The diffoscope:

http://autobuilder.yocto.io/pub/repro-fail/oe-reproducible-20230129-8bp_1b_z/packages/diff-html/

The autobuilder failure:

https://autobuilder.yoctoproject.org/typhoon/#/builders/117/builds/2314/steps/13/logs/stdio

Part of the issue is this seems to be host contamination, so you do
need to build on a different host. The first time the test runs, it
does so on the same host and the output is unchanged which is why we've
had green builds but then failures.

I did try removing the sstate for perf, the issue came back, so it
isn't a cache contamination issue afaik.


bitbake timeout on sending commands
===================================

Bitbake's runCommand appears to be able to hang in the send since
there is no timeout set there. This can cause things to block. We need
to put a worst-case timeout/exit in there.
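
Roughly the kind of guard I mean, sketched with illustrative names and an
illustrative timeout value (this is not the actual runCommand code):

import socket

COMMAND_SEND_TIMEOUT = 60  # illustrative worst-case value

def send_with_timeout(sock, payload):
    # Apply a temporary timeout so a stuck peer becomes an error we can
    # report (or exit on) rather than an indefinite hang in send.
    old_timeout = sock.gettimeout()
    sock.settimeout(COMMAND_SEND_TIMEOUT)
    try:
        sock.sendall(payload)
    except socket.timeout:
        raise RuntimeError("timed out sending command to the bitbake server")
    finally:
        sock.settimeout(old_timeout)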


unfs tests leave nfs server running
===================================

After the unfs NFS test runs, we seem to leave an unfs server hanging
around after the builds are long gone. These build up on the
autobuilder workers.


http-server in dnf tests breaking
=================================

We've started seeing a lot of short packets with the http-server used
by the dnf package manager tests. dnf retries and succeeds leaving a
failure in the logs which parse_logs picks up.

https://autobuilder.yoctoproject.org/typhoon/#/builders/53/builds/6592/steps/14/logs/stdio
Central error: 2023-01-29T11:43:32+0000 INFO Error during transfer: Curl error (18): Transferred a partial file for http://192.168.7.1:36183/cortexa15t2hf_neon/libc6-2.36-r0.cortexa15t2hf_neon.rpm [transfer closed with 15758 bytes remaining to read]
https://autobuilder.yoctoproject.org/typhoon/#/builders/59/builds/6592/steps/14/logs/stdio
Central error: 2023-01-29T11:07:31+0000 INFO Error during transfer: Curl error (18): Transferred a partial file for http://192.168.7.3:42417/core2_32/busybox-1.35.0-r0.core2_32.rpm [transfer closed with 14627 bytes remaining to read]

https://autobuilder.yoctoproject.org/typhoon/#/builders/45/builds/6598/steps/13/logs/stdio
https://autobuilder.yoctoproject.org/typhoon/#/builders/59/builds/6532

and a load more. Why the sudden increase in frequency, not sure. Could
be 6.1 kernel related.


asyncio issues with PR Serv
===========================

I've found at least one bug with what the code was doing and fixing
that should help some of the test failures we've seen. When looking
into that I found more problems though.

If you run "oe-selftest -r prservice -j 3 -K" and then look at the
bitbake-cookerdaemon.log, it shows a load of python asyncio issues. The
first one is:

/usr/lib/python3.10/asyncio/base_events.py:685: ResourceWarning:
unclosed event loop <_UnixSelectorEventLoop running=False closed=False
debug=False>
  _warn(f"unclosed event loop {self!r}", ResourceWarning, source=self)

showing connections to the PR service aren't being cleaned up.
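
For reference, the kind of client-side cleanup I'd expect to see is roughly
this (a sketch with illustrative names, not the actual prserv/asyncrpc code):

import asyncio

async def query_pr_server(host, port, request):
    # One request/response round trip, tearing the connection down fully.
    reader, writer = await asyncio.open_connection(host, port)
    try:
        writer.write(request)
        await writer.drain()
        return await reader.readline()
    finally:
        # Without close()/wait_closed() the transport, and eventually the
        # event loop that owns it, is left to the garbage collector, which
        # is what the ResourceWarning above is complaining about.
        writer.close()
        await writer.wait_closed()

def query_pr_server_sync(host, port, request):
    # asyncio.run() creates a fresh loop and closes it when done, so no
    # unclosed _UnixSelectorEventLoop is left behind.
    return asyncio.run(query_pr_server(host, port, request))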

Also:

/home/pokybuild/yocto-worker/oe-selftest-
debian/build/bitbake/lib/bb/codeparser.py:472: ResourceWarning:
unclosed <socket.socket fd=15, family=AddressFamily.AF_INET,
type=SocketKind.SOCK_STREAM, proto=6, laddr=('127.0.0.1', 47910),
raddr=('127.0.0.1', 40751)>
  self.process_tokens(more_tokens)

but the codeparser reference isn't accurate, it is just from the
garbage collection point.
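
Enabling tracemalloc is the quickest way to see the real allocation site for
these warnings rather than the GC point, e.g.:

# run the selftest with allocation tracking enabled (assuming the variable
# survives into the cooker process):
#   PYTHONTRACEMALLOC=25 oe-selftest -r prservice -j 3 -K
# or, equivalently, early in the process:
import tracemalloc
tracemalloc.start(25)   # keep 25 frames per allocation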

The second issue is that in running the selftests, a PR Service process
often seems to be left behind without bitbake running. It suggests some
kind of lifecycle issue somewhere. It might only be happening in
failure cases, not sure.

The third issue seems to be around event loop shutdown. On debian11
with py 3.9, when bitbake is shutting down we're seeing:

412934 07:12:26.081334 Exiting (socket: True)
/usr/lib/python3.9/signal.py:30: ResourceWarning: unclosed
<socket.socket fd=22, family=AddressFamily.AF_INET,
type=SocketKind.SOCK_STREAM, proto=6, laddr=('127.0.0.1', 36287), r
addr=('127.0.0.1', 36422)>
  return enum_klass(value)
ResourceWarning: Enable tracemalloc to get the object allocation
traceback
/usr/lib/python3.9/asyncio/selector_events.py:704: ResourceWarning:
unclosed transport <_SelectorSocketTransport fd=22>
  _warn(f"unclosed transport {self!r}", ResourceWarning, source=self)
ResourceWarning: Enable tracemalloc to get the object allocation
traceback
Traceback (most recent call last):
  File "/home/pokybuild/yocto-worker/oe-selftest-
debian/build/bitbake/lib/bb/asyncrpc/serv.py", line 71, in
process_requests
    d = await self.read_message()
GeneratorExit

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/pokybuild/yocto-worker/oe-selftest-
debian/build/bitbake/lib/bb/asyncrpc/serv.py", line 201, in
handle_client
    await client.process_requests()
  File "/home/pokybuild/yocto-worker/oe-selftest-
debian/build/bitbake/lib/bb/asyncrpc/serv.py", line 79, in
process_requests
    self.writer.close()
  File "/usr/lib/python3.9/asyncio/streams.py", line 353, in close
    return self._transport.close()
  File "/usr/lib/python3.9/asyncio/selector_events.py", line 700, in
close
    self._loop.call_soon(self._call_connection_lost, None)
  File "/usr/lib/python3.9/asyncio/base_events.py", line 746, in
call_soon
    self._check_closed()
  File "/usr/lib/python3.9/asyncio/base_events.py", line 510, in
_check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
Task was destroyed but it is pending!
task: <Task pending name='Task-2540' coro=<AsyncServer.h
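
That last traceback suggests we're trying to close a client writer after its
event loop has already been shut down. Ordering the teardown roughly like
this would avoid it (a sketch only, not the current asyncrpc code):

import asyncio

async def close_client(writer):
    # Close one client connection, tolerating a loop that is already gone.
    try:
        writer.close()
        await writer.wait_closed()
    except RuntimeError:
        # "Event loop is closed" - nothing left to clean up.
        pass

async def shutdown(server, client_tasks):
    # Stop accepting new clients, then finish or cancel the in-flight
    # handlers *before* the loop itself is closed, so writer.close()
    # never runs against a dead loop.
    server.close()
    await server.wait_closed()
    for task in client_tasks:
        task.cancel()
    await asyncio.gather(*client_tasks, return_exceptions=True)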


If anyone can help with any of these please reply to the email. We
probably need to transfer some into bugzilla, I just wanted to at least
get something written down.

Cheers,

Richard






* Re: [OE-core] Current QA failure challenges (perf, QA http-server, asyncio prserv, bitbake runCommand timeout, unfsd test issue)
  2023-01-29 15:41 Current QA failure challenges (perf, QA http-server, asyncio prserv, bitbake runCommand timeout, unfsd test issue) Richard Purdie
@ 2023-01-30 20:23 ` Alexander Kanavin
  2023-01-30 20:23   ` Alexander Kanavin
  2023-01-30 20:30 ` Alexander Kanavin
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 11+ messages in thread
From: Alexander Kanavin @ 2023-01-30 20:23 UTC (permalink / raw)
  To: Richard Purdie
  Cc: openembedded-core, Bruce Ashfield, Joshua Watt, Alexandre Belloni

On Sun, 29 Jan 2023 at 16:41, Richard Purdie
<richard.purdie@linuxfoundation.org> wrote:
> unfs tests leave nfs server running
> ===================================
>
> After the unfs NFS test runs, we seem to leave an unfs server hanging
> around after the builds are long gone. These build up on the
> autobuilder workers.

I'm going slightly mad. The trouble occurs in the stop() method in
meta/lib/oeqa/utils/qemurunner.py. The method sends SIGTERM to the runqemu
process, then runs Popen.poll() in a loop until it is not None [1]
(which internally uses waitpid()).

Runqemu needs to do significant processing in its SIGTERM handler
(bringing down the nfs server cleanly), so it takes a few seconds. On
the other hand, poll() returns -15 earlier than the SIGTERM handler
completes (!) and the SIGTERM handler itself seems
to be interrupted in the middle of it, leaving dangling nfsd processes behind.

The 'slightly mad' part is that Popen.poll() is implemented using waitpid().

[1] https://docs.python.org/3/library/subprocess.html#subprocess.Popen.poll



* Re: [OE-core] Current QA failure challenges (perf, QA http-server, asyncio prserv, bitbake runCommand timeout, unfsd test issue)
  2023-01-30 20:23 ` [OE-core] " Alexander Kanavin
@ 2023-01-30 20:23   ` Alexander Kanavin
  0 siblings, 0 replies; 11+ messages in thread
From: Alexander Kanavin @ 2023-01-30 20:23 UTC (permalink / raw)
  To: Richard Purdie
  Cc: openembedded-core, Bruce Ashfield, Joshua Watt, Alexandre Belloni

Grr, incomplete email! Let me resend.

On Mon, 30 Jan 2023 at 21:23, Alexander Kanavin <alex.kanavin@gmail.com> wrote:
>
> On Sun, 29 Jan 2023 at 16:41, Richard Purdie
> <richard.purdie@linuxfoundation.org> wrote:
> > unfs tests leave nfs server running
> > ===================================
> >
> > After the unfs NFS test runs, we seem to leave an unfs server hanging
> > around after the builds are long gone. These build up on the
> > autobuilder workers.
>
> I'm going slightly mad. The trouble occurs in the stop() method in
> meta/lib/oeqa/utils/qemurunner.py. The method sends SIGTERM to the runqemu
> process, then runs Popen.poll() in a loop until it is not None [1]
> (which internally uses waitpid()).
>
> Runqemu needs to do significant processing in its SIGTERM handler
> (bringing down the nfs server cleanly), so it takes a few seconds. On
> the other hand, poll() returns -15 earlier than the SIGTERM handler
> completes (!) and the SIGTERM handler itself seems
> to be interrupted in the middle of it, leaving dangling nfsd processes behind.
>
> The 'slightly mad' part is that Popen.poll() is implemented using waitpid().
>
> [1] https://docs.python.org/3/library/subprocess.html#subprocess.Popen.poll



* Re: [OE-core] Current QA failure challenges (perf, QA http-server, asyncio prserv, bitbake runCommand timeout, unfsd test issue)
  2023-01-29 15:41 Current QA failure challenges (perf, QA http-server, asyncio prserv, bitbake runCommand timeout, unfsd test issue) Richard Purdie
  2023-01-30 20:23 ` [OE-core] " Alexander Kanavin
@ 2023-01-30 20:30 ` Alexander Kanavin
       [not found] ` <173F3040B8ABB5B1.13569@lists.openembedded.org>
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Alexander Kanavin @ 2023-01-30 20:30 UTC (permalink / raw)
  To: Richard Purdie
  Cc: openembedded-core, Bruce Ashfield, Joshua Watt, Alexandre Belloni

On Sun, 29 Jan 2023 at 16:41, Richard Purdie
<richard.purdie@linuxfoundation.org> wrote:
> unfs tests leave nfs server running
> ===================================
>
> After the unfs NFS test runs, we seem to leave an unfs server hanging
> around after the builds are long gone. These build up on the
> autobuilder workers.

I'm going slightly mad. The trouble occurs in the stop() method in
meta/lib/oeqa/utils/qemurunner.py. The method sends SIGTERM to the runqemu
process, then runs Popen.poll() in a loop until it is not None [1]
(internally this poll() seems to be using waitpid()).

Runqemu needs to do significant processing in its SIGTERM handler
(bringing down the nfs server cleanly), so it takes a few seconds. On
the other hand, poll() returns -15 earlier than the SIGTERM handler
completes (!) and then the SIGTERM handler itself seems
to be interrupted in the middle of it, leaving dangling nfsd processes behind.

The 'slightly mad' part is this: assuming Popen.poll()/waitpid() does not
wait for the signal handler to complete, how do we ensure that the child
process has truly exited before proceeding in the parent? Am I
completely wrong about what's happening and how to use these things
correctly, and is sending me to a beginner's Unix class in order? I'd
like to write a simple reproducer showing the incorrect behavior, but not
today.

[1] https://docs.python.org/3/library/subprocess.html#subprocess.Popen.poll
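
Roughly the kind of standalone reproducer I have in mind (untested; the slow
SIGTERM handler and the 'sleep 600' grandchild stand in for runqemu and the
nfs server):

import signal
import subprocess
import sys
import time

CHILD = r"""
import signal, subprocess, sys, time

server = subprocess.Popen(["sleep", "600"])   # stand-in for unfsd

def on_term(signum, frame):
    time.sleep(3)             # stand-in for the slow, clean shutdown work
    server.terminate()
    server.wait()
    sys.exit(0)

signal.signal(signal.SIGTERM, on_term)
while True:
    time.sleep(1)
"""

child = subprocess.Popen([sys.executable, "-c", CHILD])
time.sleep(1)                            # let the child install its handler
child.send_signal(signal.SIGTERM)

# Mirror what qemurunner.stop() does: poll() in a loop until it is not None.
while child.poll() is None:
    time.sleep(0.1)
print("poll() returned", child.returncode)

# Is the 'sleep 600' grandchild still around at this point?
subprocess.run(["pgrep", "-af", "sleep 600"])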

Alex



* Re: [OE-core] Current QA failure challenges (perf, QA http-server, asyncio prserv, bitbake runCommand timeout, unfsd test issue)
       [not found] ` <173F3040B8ABB5B1.13569@lists.openembedded.org>
@ 2023-01-30 22:14   ` Alexander Kanavin
  0 siblings, 0 replies; 11+ messages in thread
From: Alexander Kanavin @ 2023-01-30 22:14 UTC (permalink / raw)
  To: alex.kanavin
  Cc: Richard Purdie, openembedded-core, Bruce Ashfield, Joshua Watt,
	Alexandre Belloni

On Mon, 30 Jan 2023 at 21:30, Alexander Kanavin via
lists.openembedded.org <alex.kanavin=gmail.com@lists.openembedded.org>
wrote:
> The 'slightly mad' part is this: assuming Popen.poll()/waitpid() does not
> wait for the signal handler to complete, how do we ensure that the child
> process has truly exited before proceeding in the parent? Am I
> completely wrong about what's happening and how to use these things
> correctly, and is sending me to a beginner's Unix class in order? I'd
> like to write a simple reproducer showing the incorrect behavior, but not
> today.

I sent a patch which does fix the dangling nfsd processes. I'm still
not sure whether this is indeed the semantics of waitpid() or whether
I'm papering over something completely different, so if Unix gurus can
set the record straight, that would be appreciated.

Alex



* Re: Current QA failure challenges (perf, QA http-server, asyncio prserv, bitbake runCommand timeout, unfsd test issue)
  2023-01-29 15:41 Current QA failure challenges (perf, QA http-server, asyncio prserv, bitbake runCommand timeout, unfsd test issue) Richard Purdie
                   ` (2 preceding siblings ...)
       [not found] ` <173F3040B8ABB5B1.13569@lists.openembedded.org>
@ 2023-01-31 15:06 ` Bruce Ashfield
  2023-01-31 15:14 ` [OE-core] " Alexander Kanavin
  2023-02-01 14:22 ` Joshua Watt
  5 siblings, 0 replies; 11+ messages in thread
From: Bruce Ashfield @ 2023-01-31 15:06 UTC (permalink / raw)
  To: Richard Purdie; +Cc: openembedded-core, Joshua Watt, Alexandre Belloni

On Sun, Jan 29, 2023 at 10:41 AM Richard Purdie
<richard.purdie@linuxfoundation.org> wrote:
>
> Hi All,
>
> I'm going to be distracted this next week with other project-related
> things and the autobuilder is showing a number of problems. I've spent
> part of my weekend trying to stabilise things so they wouldn't be a
> distraction, and whilst I've had partial success, things aren't
> resolved. I'm sending this out in the hope others may be able to
> help. The issues:
>
> Kernel 6.1 reproducibility issue in perf
> ========================================
>
> We've had two fixes which do seem to have fixed the kernel-devsrc issue
> but not the perf one. The diffoscope:
>
> http://autobuilder.yocto.io/pub/repro-fail/oe-reproducible-20230129-8bp_1b_z/packages/diff-html/
>
> The autobuilder failure:
>
> https://autobuilder.yoctoproject.org/typhoon/#/builders/117/builds/2314/steps/13/logs/stdio
>
> Part of the issue is this seems to be host contamination, so you do
> need to build on a different host. The first time the test runs, it
> does so on the same host and the output is unchanged which is why we've
> had green builds but then failures.
>
> I did try removing the sstate for perf, the issue came back, so it
> isn't a cache contamination issue afaik.


I don't have a great local setup for looking at reproducibility
issues, but once I'm through my current kernel upgrade issues, I'll
start looking at this.

Bruce

>
>
> bitbake timeout on sending commands
> ===================================
>
> Bitbake's runCommand appears to be able to hang in the send since
> there is no timeout set there. This can cause things to block. We need
> to put a worst-case timeout/exit in there.
>
>
> unfs tests leave nfs server running
> ===================================
>
> After the unfs NFS test runs, we seem to leave an unfs server hanging
> around after the builds are long gone. These build up on the
> autobuilder workers.
>
>
> http-server in dnf tests breaking
> =================================
>
> We've started seeing a lot of short packets with the http-server used
> by the dnf package manager tests. dnf retries and succeeds leaving a
> failure in the logs which parse_logs picks up.
>
> https://autobuilder.yoctoproject.org/typhoon/#/builders/53/builds/6592/steps/14/logs/stdio
> Central error: 2023-01-29T11:43:32+0000 INFO Error during transfer: Curl error (18): Transferred a partial file for http://192.168.7.1:36183/cortexa15t2hf_neon/libc6-2.36-r0.cortexa15t2hf_neon.rpm [transfer closed with 15758 bytes remaining to read]
> https://autobuilder.yoctoproject.org/typhoon/#/builders/59/builds/6592/steps/14/logs/stdio
> Central error: 2023-01-29T11:07:31+0000 INFO Error during transfer: Curl error (18): Transferred a partial file for http://192.168.7.3:42417/core2_32/busybox-1.35.0-r0.core2_32.rpm [transfer closed with 14627 bytes remaining to read]
>
> https://autobuilder.yoctoproject.org/typhoon/#/builders/45/builds/6598/steps/13/logs/stdio
> https://autobuilder.yoctoproject.org/typhoon/#/builders/59/builds/6532
>
> and a load more. Why the sudden increase in frequency, not sure. Could
> be 6.1 kernel related.
>
>
> asyncio issues with PR Serv
> ===========================
>
> I've found at least one bug with what the code was doing and fixing
> that should help some of the test failures we've seen. When looking
> into that I found more problems though.
>
> If you run "oe-selftest -r prservice -j 3 -K" and then look at the
> bitbake-cookerdaemon.log, it shows a load of python asyncio issues. The
> first one is:
>
> /usr/lib/python3.10/asyncio/base_events.py:685: ResourceWarning:
> unclosed event loop <_UnixSelectorEventLoop running=False closed=False
> debug=False>
>   _warn(f"unclosed event loop {self!r}", ResourceWarning, source=self)
>
> showing connections to the PR service aren't being cleaned up.
>
> Also:
>
> /home/pokybuild/yocto-worker/oe-selftest-
> debian/build/bitbake/lib/bb/codeparser.py:472: ResourceWarning:
> unclosed <socket.socket fd=15, family=AddressFamily.AF_INET,
> type=SocketKind.SOCK_STREAM, proto=6, laddr=('127.0.0.1', 47910),
> raddr=('127.0.0.1', 40751)>
>   self.process_tokens(more_tokens)
>
> but the codeparser reference isn't accurate, it is just from the
> garbage collection point.
>
> The second issue is that in running the selftests, a PR Service process
> often seems to be left behind without bitbake running. It suggests some
> kind of lifecycle issue somewhere. It might only be happening in
> failure cases, not sure.
>
> The third issue seems to be around event loop shutdown. On debian11
> with py 3.9, when bitbake is shutting down we're seeing:
>
> 412934 07:12:26.081334 Exiting (socket: True)
> /usr/lib/python3.9/signal.py:30: ResourceWarning: unclosed
> <socket.socket fd=22, family=AddressFamily.AF_INET,
> type=SocketKind.SOCK_STREAM, proto=6, laddr=('127.0.0.1', 36287), r
> addr=('127.0.0.1', 36422)>
>   return enum_klass(value)
> ResourceWarning: Enable tracemalloc to get the object allocation
> traceback
> /usr/lib/python3.9/asyncio/selector_events.py:704: ResourceWarning:
> unclosed transport <_SelectorSocketTransport fd=22>
>   _warn(f"unclosed transport {self!r}", ResourceWarning, source=self)
> ResourceWarning: Enable tracemalloc to get the object allocation
> traceback
> Traceback (most recent call last):
>   File "/home/pokybuild/yocto-worker/oe-selftest-
> debian/build/bitbake/lib/bb/asyncrpc/serv.py", line 71, in
> process_requests
>     d = await self.read_message()
> GeneratorExit
>
> During handling of the above exception, another exception occurred:
>
> Traceback (most recent call last):
>   File "/home/pokybuild/yocto-worker/oe-selftest-
> debian/build/bitbake/lib/bb/asyncrpc/serv.py", line 201, in
> handle_client
>     await client.process_requests()
>   File "/home/pokybuild/yocto-worker/oe-selftest-
> debian/build/bitbake/lib/bb/asyncrpc/serv.py", line 79, in
> process_requests
>     self.writer.close()
>   File "/usr/lib/python3.9/asyncio/streams.py", line 353, in close
>     return self._transport.close()
>   File "/usr/lib/python3.9/asyncio/selector_events.py", line 700, in
> close
>     self._loop.call_soon(self._call_connection_lost, None)
>   File "/usr/lib/python3.9/asyncio/base_events.py", line 746, in
> call_soon
>     self._check_closed()
>   File "/usr/lib/python3.9/asyncio/base_events.py", line 510, in
> _check_closed
>     raise RuntimeError('Event loop is closed')
> RuntimeError: Event loop is closed
> Task was destroyed but it is pending!
> task: <Task pending name='Task-2540' coro=<AsyncServer.h
>
>
> If anyone can help with any of these please reply to the email. We
> probably need to transfer some into bugzilla, I just wanted to at least
> get something written down.
>
> Cheers,
>
> Richard
>
>
>


-- 
- Thou shalt not follow the NULL pointer, for chaos and madness await
thee at its end
- "Use the force Harry" - Gandalf, Star Trek II



* Re: [OE-core] Current QA failure challenges (perf, QA http-server, asyncio prserv, bitbake runCommand timeout, unfsd test issue)
  2023-01-29 15:41 Current QA failure challenges (perf, QA http-server, asyncio prserv, bitbake runCommand timeout, unfsd test issue) Richard Purdie
                   ` (3 preceding siblings ...)
  2023-01-31 15:06 ` Bruce Ashfield
@ 2023-01-31 15:14 ` Alexander Kanavin
  2023-01-31 16:03   ` Richard Purdie
  2023-02-01 14:22 ` Joshua Watt
  5 siblings, 1 reply; 11+ messages in thread
From: Alexander Kanavin @ 2023-01-31 15:14 UTC (permalink / raw)
  To: Richard Purdie
  Cc: openembedded-core, Bruce Ashfield, Joshua Watt, Alexandre Belloni

On Sun, 29 Jan 2023 at 16:41, Richard Purdie
<richard.purdie@linuxfoundation.org> wrote:
> http-server in dnf tests breaking
> =================================
>
> We've started seeing a lot of short packets with the http-server used
> by the dnf package manager tests. dnf retries and succeeds leaving a
> failure in the logs which parse_logs picks up.
>
> https://autobuilder.yoctoproject.org/typhoon/#/builders/53/builds/6592/steps/14/logs/stdio
> Central error: 2023-01-29T11:43:32+0000 INFO Error during transfer: Curl error (18): Transferred a partial file for http://192.168.7.1:36183/cortexa15t2hf_neon/libc6-2.36-r0.cortexa15t2hf_neon.rpm [transfer closed with 15758 bytes remaining to read]
> https://autobuilder.yoctoproject.org/typhoon/#/builders/59/builds/6592/steps/14/logs/stdio
> Central error: 2023-01-29T11:07:31+0000 INFO Error during transfer: Curl error (18): Transferred a partial file for http://192.168.7.3:42417/core2_32/busybox-1.35.0-r0.core2_32.rpm [transfer closed with 14627 bytes remaining to read]
>
> https://autobuilder.yoctoproject.org/typhoon/#/builders/45/builds/6598/steps/13/logs/stdio
> https://autobuilder.yoctoproject.org/typhoon/#/builders/59/builds/6532
>
> and a load more. Why the sudden increase in frequency, not sure. Could
> be 6.1 kernel related.


I ran this locally three times with master and it wouldn't reproduce.
Probably build host has something to do with it too?

Alex



* Re: [OE-core] Current QA failure challenges (perf, QA http-server, asyncio prserv, bitbake runCommand timeout, unfsd test issue)
  2023-01-31 15:14 ` [OE-core] " Alexander Kanavin
@ 2023-01-31 16:03   ` Richard Purdie
  2023-02-01 13:59     ` Ross Burton
  0 siblings, 1 reply; 11+ messages in thread
From: Richard Purdie @ 2023-01-31 16:03 UTC (permalink / raw)
  To: Alexander Kanavin
  Cc: openembedded-core, Bruce Ashfield, Joshua Watt, Alexandre Belloni

On Tue, 2023-01-31 at 16:14 +0100, Alexander Kanavin wrote:
> On Sun, 29 Jan 2023 at 16:41, Richard Purdie
> <richard.purdie@linuxfoundation.org> wrote:
> > http-server in dnf tests breaking
> > =================================
> > 
> > We've started seeing a lot of short packets with the http-server used
> > by the dnf package manager tests. dnf retries and succeeds leaving a
> > failure in the logs which parse_logs picks up.
> > 
> > https://autobuilder.yoctoproject.org/typhoon/#/builders/53/builds/6592/steps/14/logs/stdio
> > Central error: 2023-01-29T11:43:32+0000 INFO Error during transfer: Curl error (18): Transferred a partial file for http://192.168.7.1:36183/cortexa15t2hf_neon/libc6-2.36-r0.cortexa15t2hf_neon.rpm [transfer closed with 15758 bytes remaining to read]
> > https://autobuilder.yoctoproject.org/typhoon/#/builders/59/builds/6592/steps/14/logs/stdio
> > Central error: 2023-01-29T11:07:31+0000 INFO Error during transfer: Curl error (18): Transferred a partial file for http://192.168.7.3:42417/core2_32/busybox-1.35.0-r0.core2_32.rpm [transfer closed with 14627 bytes remaining to read]
> > 
> > https://autobuilder.yoctoproject.org/typhoon/#/builders/45/builds/6598/steps/13/logs/stdio
> > https://autobuilder.yoctoproject.org/typhoon/#/builders/59/builds/6532
> > 
> > and a load more. Why the sudden increase in frequency, not sure. Could
> > be 6.1 kernel related.
> 
> 
> I ran this locally three times with master and it wouldn't reproduce.
> Probably build host has something to do with it too?

I think you need two different host OSes. Which two trigger it and why
I don't know.

The last run on the autobuilder was fedora37 and then alma8 FWIW.

Cheers,

Richard




* Re: [OE-core] Current QA failure challenges (perf, QA http-server, asyncio prserv, bitbake runCommand timeout, unfsd test issue)
  2023-01-31 16:03   ` Richard Purdie
@ 2023-02-01 13:59     ` Ross Burton
  0 siblings, 0 replies; 11+ messages in thread
From: Ross Burton @ 2023-02-01 13:59 UTC (permalink / raw)
  To: Alexander Kanavin
  Cc: openembedded-core, Richard Purdie, Bruce Ashfield, Joshua Watt,
	Alexandre Belloni

On 31 Jan 2023, at 16:03, Richard Purdie via lists.openembedded.org <richard.purdie=linuxfoundation.org@lists.openembedded.org> wrote:
>> I ran this locally three times with master and it wouldn't reproduce.
>> Probably build host has something to do with it too?
> 
> I think you need two different host OSes. Which two trigger it and why
> I don't know.
> 
> The last run on the autobuilder was fedora37 and then alma8 FWIW.

I’d put money on it being specifically different host GCCs, so you might be able to reproduce by installing multiple GCCs and changing what the gcc link points at.

Ross


* Re: Current QA failure challenges (perf, QA http-server, asyncio prserv, bitbake runCommand timeout, unfsd test issue)
  2023-01-29 15:41 Current QA failure challenges (perf, QA http-server, asyncio prserv, bitbake runCommand timeout, unfsd test issue) Richard Purdie
                   ` (4 preceding siblings ...)
  2023-01-31 15:14 ` [OE-core] " Alexander Kanavin
@ 2023-02-01 14:22 ` Joshua Watt
  5 siblings, 0 replies; 11+ messages in thread
From: Joshua Watt @ 2023-02-01 14:22 UTC (permalink / raw)
  To: Richard Purdie; +Cc: openembedded-core, Bruce Ashfield, Alexandre Belloni

On Sun, Jan 29, 2023 at 9:41 AM Richard Purdie
<richard.purdie@linuxfoundation.org> wrote:
>
> Hi All,
>
> I'm going to be distracted this next week with other project-related
> things and the autobuilder is showing a number of problems. I've spent
> part of my weekend trying to stabilise things so they wouldn't be a
> distraction, and whilst I've had partial success, things aren't
> resolved. I'm sending this out in the hope others may be able to
> help. The issues:
>
> Kernel 6.1 reproducibility issue in perf
> ========================================
>
> We've had two fixes which do seem to have fixed the kernel-devsrc issue
> but not the perf one. The diffoscope:
>
> http://autobuilder.yocto.io/pub/repro-fail/oe-reproducible-20230129-8bp_1b_z/packages/diff-html/
>
> The autobuilder failure:
>
> https://autobuilder.yoctoproject.org/typhoon/#/builders/117/builds/2314/steps/13/logs/stdio
>
> Part of the issue is this seems to be host contamination, so you do
> need to build on a different host. The first time the test runs, it
> does so on the same host and the output is unchanged which is why we've
> had green builds but then failures.
>
> I did try removing the sstate for perf, the issue came back, so it
> isn't a cache contamination issue afaik.
>
>
> bitbake timeout on sending commands
> ===================================
>
> Bitbake's runCommand appears to be able to hang in the send since
> there is no timeout set there. This can cause things to block. We need
> to put a worst-case timeout/exit in there.
>
>
> unfs tests leave nfs server running
> ===================================
>
> After the unfs NFS test runs, we seem to leave an unfs server hanging
> around after the builds are long gone. These build up on the
> autobuilder workers.
>
>
> http-server in dnf tests breaking
> =================================
>
> We've started seeing a lot of short packets with the http-server used
> by the dnf package manager tests. dnf retries and succeeds leaving a
> failure in the logs which parse_logs picks up.
>
> https://autobuilder.yoctoproject.org/typhoon/#/builders/53/builds/6592/steps/14/logs/stdio
> Central error: 2023-01-29T11:43:32+0000 INFO Error during transfer: Curl error (18): Transferred a partial file for http://192.168.7.1:36183/cortexa15t2hf_neon/libc6-2.36-r0.cortexa15t2hf_neon.rpm [transfer closed with 15758 bytes remaining to read]
> https://autobuilder.yoctoproject.org/typhoon/#/builders/59/builds/6592/steps/14/logs/stdio
> Central error: 2023-01-29T11:07:31+0000 INFO Error during transfer: Curl error (18): Transferred a partial file for http://192.168.7.3:42417/core2_32/busybox-1.35.0-r0.core2_32.rpm [transfer closed with 14627 bytes remaining to read]
>
> https://autobuilder.yoctoproject.org/typhoon/#/builders/45/builds/6598/steps/13/logs/stdio
> https://autobuilder.yoctoproject.org/typhoon/#/builders/59/builds/6532
>
> and a load more. Why the sudden increase in frequency, not sure. Could
> be 6.1 kernel related.
>
>
> asyncio issues with PR Serv
> ===========================
>
> I've found at least one bug with what the code was doing and fixing
> that should help some of the test failures we've seen. When looking
> into that I found more problems though.
>
> If you run "oe-selftest -r prservice -j 3 -K" and then look at the
> bitbake-cookerdaemon.log, it shows a load of python asyncio issues. The
> first one is:
>
> /usr/lib/python3.10/asyncio/base_events.py:685: ResourceWarning:
> unclosed event loop <_UnixSelectorEventLoop running=False closed=False
> debug=False>
>   _warn(f"unclosed event loop {self!r}", ResourceWarning, source=self)
>
> showing connections to the PR service aren't being cleaned up.

I can take a look at this if no one else gets there first, but not
until next week.


>
> Also:
>
> /home/pokybuild/yocto-worker/oe-selftest-
> debian/build/bitbake/lib/bb/codeparser.py:472: ResourceWarning:
> unclosed <socket.socket fd=15, family=AddressFamily.AF_INET,
> type=SocketKind.SOCK_STREAM, proto=6, laddr=('127.0.0.1', 47910),
> raddr=('127.0.0.1', 40751)>
>   self.process_tokens(more_tokens)
>
> but the codeparser reference isn't accurate, it is just from the
> garbage collection point.
>
> The second issue is that in running the selftests, a PR Service process
> often seems to be left behind without bitbake running. It suggests some
> kind of lifecycle issue somewhere. It might only be happening in
> failure cases, not sure.
>
> The third issue seems to be around event loop shutdown. On debian11
> with py 3.9, when bitbake is shutting down we're seeing:
>
> 412934 07:12:26.081334 Exiting (socket: True)
> /usr/lib/python3.9/signal.py:30: ResourceWarning: unclosed
> <socket.socket fd=22, family=AddressFamily.AF_INET,
> type=SocketKind.SOCK_STREAM, proto=6, laddr=('127.0.0.1', 36287), r
> addr=('127.0.0.1', 36422)>
>   return enum_klass(value)
> ResourceWarning: Enable tracemalloc to get the object allocation
> traceback
> /usr/lib/python3.9/asyncio/selector_events.py:704: ResourceWarning:
> unclosed transport <_SelectorSocketTransport fd=22>
>   _warn(f"unclosed transport {self!r}", ResourceWarning, source=self)
> ResourceWarning: Enable tracemalloc to get the object allocation
> traceback
> Traceback (most recent call last):
>   File "/home/pokybuild/yocto-worker/oe-selftest-
> debian/build/bitbake/lib/bb/asyncrpc/serv.py", line 71, in
> process_requests
>     d = await self.read_message()
> GeneratorExit
>
> During handling of the above exception, another exception occurred:
>
> Traceback (most recent call last):
>   File "/home/pokybuild/yocto-worker/oe-selftest-
> debian/build/bitbake/lib/bb/asyncrpc/serv.py", line 201, in
> handle_client
>     await client.process_requests()
>   File "/home/pokybuild/yocto-worker/oe-selftest-
> debian/build/bitbake/lib/bb/asyncrpc/serv.py", line 79, in
> process_requests
>     self.writer.close()
>   File "/usr/lib/python3.9/asyncio/streams.py", line 353, in close
>     return self._transport.close()
>   File "/usr/lib/python3.9/asyncio/selector_events.py", line 700, in
> close
>     self._loop.call_soon(self._call_connection_lost, None)
>   File "/usr/lib/python3.9/asyncio/base_events.py", line 746, in
> call_soon
>     self._check_closed()
>   File "/usr/lib/python3.9/asyncio/base_events.py", line 510, in
> _check_closed
>     raise RuntimeError('Event loop is closed')
> RuntimeError: Event loop is closed
> Task was destroyed but it is pending!
> task: <Task pending name='Task-2540' coro=<AsyncServer.h
>
>
> If anyone can help with any of these please reply to the email. We
> probably need to transfer some into bugzilla, I just wanted to at least
> get something written down.
>
> Cheers,
>
> Richard
>
>
>



* Re: [OE-core] Current QA failure challenges (perf, QA http-server, asyncio prserv, bitbake runCommand timeout, unfsd test issue)
       [not found] <173ED1DEED132984.27165@lists.openembedded.org>
@ 2023-02-01 14:11 ` Richard Purdie
  0 siblings, 0 replies; 11+ messages in thread
From: Richard Purdie @ 2023-02-01 14:11 UTC (permalink / raw)
  To: openembedded-core; +Cc: Bruce Ashfield, Joshua Watt, Alexandre Belloni

On Sun, 2023-01-29 at 15:41 +0000, Richard Purdie via
lists.openembedded.org wrote:
> http-server in dnf tests breaking
> =================================
> 
> We've started seeing a lot of short packets with the http-server used
> by the dnf package manager tests. dnf retries and succeeds leaving a
> failure in the logs which parse_logs picks up.
> 
> https://autobuilder.yoctoproject.org/typhoon/#/builders/53/builds/6592/steps/14/logs/stdio
> Central error: 2023-01-29T11:43:32+0000 INFO Error during transfer: Curl error (18): Transferred a partial file for http://192.168.7.1:36183/cortexa15t2hf_neon/libc6-2.36-r0.cortexa15t2hf_neon.rpm [transfer closed with 15758 bytes remaining to read]
> https://autobuilder.yoctoproject.org/typhoon/#/builders/59/builds/6592/steps/14/logs/stdio
> Central error: 2023-01-29T11:07:31+0000 INFO Error during transfer: Curl error (18): Transferred a partial file for http://192.168.7.3:42417/core2_32/busybox-1.35.0-r0.core2_32.rpm [transfer closed with 14627 bytes remaining to read]
> 
> https://autobuilder.yoctoproject.org/typhoon/#/builders/45/builds/6598/steps/13/logs/stdio
> https://autobuilder.yoctoproject.org/typhoon/#/builders/59/builds/6532
> 
> and a load more. Why the sudden increase in frequency, not sure. Could
> be 6.1 kernel related.

We've noticed there are virtio networking fixes in recent 6.1 stable
updates, so I'm hoping we can pull those in and see if they help.

Cheers,

Richard



