All of lore.kernel.org
 help / color / mirror / Atom feed
* [1.46][PATCH 1/4] cooker: Ensure parse_quit thread is closed down
@ 2021-06-03 11:45 Martin Jansa
  2021-06-03 11:45 ` [1.46][PATCH 2/4] cooker: Explictly shut down the sync thread Martin Jansa
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: Martin Jansa @ 2021-06-03 11:45 UTC (permalink / raw)
  To: bitbake-devel; +Cc: steve, Richard Purdie

From: Richard Purdie <richard.purdie@linuxfoundation.org>

Each run through the parser would leak a thread from the queue created to
shut the parser down. Close this down correctly and clean up the code flow
slightly whilst in the area, making sure this thread does shut down correctly
(we don't care if it loses data).

Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
---
 lib/bb/cooker.py | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/lib/bb/cooker.py b/lib/bb/cooker.py
index 11cc2b95..4820d268 100644
--- a/lib/bb/cooker.py
+++ b/lib/bb/cooker.py
@@ -2056,12 +2056,11 @@ class CookerParser(object):
                                             self.total)
 
             bb.event.fire(event, self.cfgdata)
-            for process in self.processes:
-                self.parser_quit.put(None)
-        else:
-            self.parser_quit.cancel_join_thread()
-            for process in self.processes:
-                self.parser_quit.put(None)
+
+        # Allow data left in the cancel queue to be discarded
+        self.parser_quit.cancel_join_thread()
+        for process in self.processes:
+            self.parser_quit.put(None)
 
         # Cleanup the queue before call process.join(), otherwise there might be
         # deadlocks.
@@ -2078,6 +2077,9 @@ class CookerParser(object):
             else:
                 process.join()
 
+        self.parser_quit.close()
+        self.parser_quit.join_thread()
+
         sync = threading.Thread(target=self.bb_cache.sync)
         sync.start()
         multiprocessing.util.Finalize(None, sync.join, exitpriority=-100)
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [1.46][PATCH 2/4] cooker: Explictly shut down the sync thread
  2021-06-03 11:45 [1.46][PATCH 1/4] cooker: Ensure parse_quit thread is closed down Martin Jansa
@ 2021-06-03 11:45 ` Martin Jansa
  2021-06-03 11:45 ` [1.46][PATCH 3/4] cooker: Ensure parser is cleaned up Martin Jansa
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 7+ messages in thread
From: Martin Jansa @ 2021-06-03 11:45 UTC (permalink / raw)
  To: bitbake-devel; +Cc: steve, Richard Purdie

From: Richard Purdie <richard.purdie@linuxfoundation.org>

Hongxu Jia reported a problem where the bb_cache files were not always being
written out correctly. This was due to the sync thread being terminated
prematurely.

Whilst the preceeding changes mean the exit handler for this thread is now
correctly called since we switch to using sys.exit() instead of os._exit(),
this write can happen after we drop the bitbake lock, leading to potential
races. Avoid that headache by adding in explicit thread join() calls before
we drop the lock (which atexit or Finalize can't do).

Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
---
 lib/bb/cooker.py | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/lib/bb/cooker.py b/lib/bb/cooker.py
index 4820d268..87884574 100644
--- a/lib/bb/cooker.py
+++ b/lib/bb/cooker.py
@@ -1650,6 +1650,7 @@ class BBCooker:
 
         if self.parser:
             self.parser.shutdown(clean=not force, force=force)
+            self.parser.final_cleanup()
 
     def finishcommand(self):
         self.state = state.initial
@@ -2015,6 +2016,7 @@ class CookerParser(object):
 
         self.start()
         self.haveshutdown = False
+        self.syncthread = None
 
     def start(self):
         self.results = self.load_cached()
@@ -2081,8 +2083,8 @@ class CookerParser(object):
         self.parser_quit.join_thread()
 
         sync = threading.Thread(target=self.bb_cache.sync)
+        self.syncthread = sync
         sync.start()
-        multiprocessing.util.Finalize(None, sync.join, exitpriority=-100)
         bb.codeparser.parser_cache_savemerge()
         bb.fetch.fetcher_parse_done()
         if self.cooker.configuration.profile:
@@ -2096,6 +2098,10 @@ class CookerParser(object):
             bb.utils.process_profilelog(profiles, pout = pout)
             print("Processed parsing statistics saved to %s" % (pout))
 
+    def final_cleanup(self):
+        if self.syncthread:
+            self.syncthread.join()
+
     def load_cached(self):
         for filename, appends in self.fromcache:
             cached, infos = self.bb_cache.load(filename, appends)
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [1.46][PATCH 3/4] cooker: Ensure parser is cleaned up
  2021-06-03 11:45 [1.46][PATCH 1/4] cooker: Ensure parse_quit thread is closed down Martin Jansa
  2021-06-03 11:45 ` [1.46][PATCH 2/4] cooker: Explictly shut down the sync thread Martin Jansa
@ 2021-06-03 11:45 ` Martin Jansa
  2021-06-03 11:45 ` [1.46][PATCH 4/4] cooker: Avoid parser deadlocks Martin Jansa
  2021-06-03 15:55 ` [1.46][PATCH 1/4] cooker: Ensure parse_quit thread is closed down Steve Sakoman
  3 siblings, 0 replies; 7+ messages in thread
From: Martin Jansa @ 2021-06-03 11:45 UTC (permalink / raw)
  To: bitbake-devel; +Cc: steve, Richard Purdie

From: Richard Purdie <richard.purdie@linuxfoundation.org>

During cooker shutdown, its possible the parser isn't cleaned up. Fix
this (which may partially explain why threads were left hanging around
at exit).

Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
---
 lib/bb/cooker.py | 1 +
 1 file changed, 1 insertion(+)

diff --git a/lib/bb/cooker.py b/lib/bb/cooker.py
index 87884574..4d2c62cd 100644
--- a/lib/bb/cooker.py
+++ b/lib/bb/cooker.py
@@ -1636,6 +1636,7 @@ class BBCooker:
         return
 
     def post_serve(self):
+        self.shutdown(force=True)
         prserv.serv.auto_shutdown()
         if self.hashserv:
             self.hashserv.process.terminate()
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [1.46][PATCH 4/4] cooker: Avoid parser deadlocks
  2021-06-03 11:45 [1.46][PATCH 1/4] cooker: Ensure parse_quit thread is closed down Martin Jansa
  2021-06-03 11:45 ` [1.46][PATCH 2/4] cooker: Explictly shut down the sync thread Martin Jansa
  2021-06-03 11:45 ` [1.46][PATCH 3/4] cooker: Ensure parser is cleaned up Martin Jansa
@ 2021-06-03 11:45 ` Martin Jansa
  2021-06-03 15:55 ` [1.46][PATCH 1/4] cooker: Ensure parse_quit thread is closed down Steve Sakoman
  3 siblings, 0 replies; 7+ messages in thread
From: Martin Jansa @ 2021-06-03 11:45 UTC (permalink / raw)
  To: bitbake-devel; +Cc: steve, Richard Purdie

From: Richard Purdie <richard.purdie@linuxfoundation.org>

If you make parsing fail (e.g. add something like:

X := "${@d.getVar('MCMACHINES').split()[1]}"

to meson.bbclass, then run "while true; do bitbake -g bash; done"
it will eventually hang. It appears the cancel_join_thread() call the
parsing failure triggers, breaks the results_queue badly enough that it
sits in read() indefintely (called from self.result_queue.get(timeout=0.25)).
The timeout only applies to lock aquisition, not the read call.

I've tried various other approaches such as using cancel_join_thread()
in other places but the only way things don't lock up is to avoid
cancel_join_thread() entirely for results_queue.

I do have a concern that this may adversely affect Ctrl+C handling
but equally, its broken now already and this appears to improve
things.

[YOCTO #14034]

Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
---
 lib/bb/cooker.py | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/lib/bb/cooker.py b/lib/bb/cooker.py
index 4d2c62cd..730cdc56 100644
--- a/lib/bb/cooker.py
+++ b/lib/bb/cooker.py
@@ -1933,7 +1933,8 @@ class Parser(multiprocessing.Process):
             except queue.Empty:
                 pass
             else:
-                self.results.cancel_join_thread()
+                self.results.close()
+                self.results.join_thread()
                 break
 
             if pending:
@@ -1942,6 +1943,8 @@ class Parser(multiprocessing.Process):
                 try:
                     job = self.jobs.pop()
                 except IndexError:
+                    self.results.close()
+                    self.results.join_thread()
                     break
                 result = self.parse(*job)
                 # Clear the siggen cache after parsing to control memory usage, its huge
@@ -2060,8 +2063,6 @@ class CookerParser(object):
 
             bb.event.fire(event, self.cfgdata)
 
-        # Allow data left in the cancel queue to be discarded
-        self.parser_quit.cancel_join_thread()
         for process in self.processes:
             self.parser_quit.put(None)
 
@@ -2081,7 +2082,8 @@ class CookerParser(object):
                 process.join()
 
         self.parser_quit.close()
-        self.parser_quit.join_thread()
+        # Allow data left in the cancel queue to be discarded
+        self.parser_quit.cancel_join_thread()
 
         sync = threading.Thread(target=self.bb_cache.sync)
         self.syncthread = sync
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [1.46][PATCH 1/4] cooker: Ensure parse_quit thread is closed down
  2021-06-03 11:45 [1.46][PATCH 1/4] cooker: Ensure parse_quit thread is closed down Martin Jansa
                   ` (2 preceding siblings ...)
  2021-06-03 11:45 ` [1.46][PATCH 4/4] cooker: Avoid parser deadlocks Martin Jansa
@ 2021-06-03 15:55 ` Steve Sakoman
  2021-06-03 16:17   ` Martin Jansa
  3 siblings, 1 reply; 7+ messages in thread
From: Steve Sakoman @ 2021-06-03 15:55 UTC (permalink / raw)
  To: Martin Jansa; +Cc: bitbake-devel, Richard Purdie

Hi Martin,

Are you seeing a particular issue that these patches resolve?

I seem to recall considering these patches last summer and
encountering issues on the autobuilder.  Sadly it has been long enough
that I don't recall the details.

I spoke briefly with Richard this morning, and he also has some
concerns about taking these.  Any details you can provide on why you
find these necessary would be appreciated.

Steve

On Thu, Jun 3, 2021 at 1:45 AM Martin Jansa <martin.jansa@gmail.com> wrote:
>
> From: Richard Purdie <richard.purdie@linuxfoundation.org>
>
> Each run through the parser would leak a thread from the queue created to
> shut the parser down. Close this down correctly and clean up the code flow
> slightly whilst in the area, making sure this thread does shut down correctly
> (we don't care if it loses data).
>
> Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
> ---
>  lib/bb/cooker.py | 14 ++++++++------
>  1 file changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/lib/bb/cooker.py b/lib/bb/cooker.py
> index 11cc2b95..4820d268 100644
> --- a/lib/bb/cooker.py
> +++ b/lib/bb/cooker.py
> @@ -2056,12 +2056,11 @@ class CookerParser(object):
>                                              self.total)
>
>              bb.event.fire(event, self.cfgdata)
> -            for process in self.processes:
> -                self.parser_quit.put(None)
> -        else:
> -            self.parser_quit.cancel_join_thread()
> -            for process in self.processes:
> -                self.parser_quit.put(None)
> +
> +        # Allow data left in the cancel queue to be discarded
> +        self.parser_quit.cancel_join_thread()
> +        for process in self.processes:
> +            self.parser_quit.put(None)
>
>          # Cleanup the queue before call process.join(), otherwise there might be
>          # deadlocks.
> @@ -2078,6 +2077,9 @@ class CookerParser(object):
>              else:
>                  process.join()
>
> +        self.parser_quit.close()
> +        self.parser_quit.join_thread()
> +
>          sync = threading.Thread(target=self.bb_cache.sync)
>          sync.start()
>          multiprocessing.util.Finalize(None, sync.join, exitpriority=-100)
> --
> 2.30.2
>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [1.46][PATCH 1/4] cooker: Ensure parse_quit thread is closed down
  2021-06-03 15:55 ` [1.46][PATCH 1/4] cooker: Ensure parse_quit thread is closed down Steve Sakoman
@ 2021-06-03 16:17   ` Martin Jansa
  2021-06-03 16:41     ` Steve Sakoman
  0 siblings, 1 reply; 7+ messages in thread
From: Martin Jansa @ 2021-06-03 16:17 UTC (permalink / raw)
  To: Steve Sakoman; +Cc: bitbake-devel, Richard Purdie

[-- Attachment #1: Type: text/plain, Size: 7441 bytes --]

Yes, on big build with a lot of recipes and also multilib enabled I'm
seeing bitbake getting stuck quite often after parsing errors (usually when
developers use wrong version in SRCPV fails to expand). Unfortunately this
means that such jenkins jobs are stuck forever until someone notices that
they are running way too long and aborts them manually (which is quite bad
as the regular builds can take between 1-20 hours, so the stuck build
really stands-out only after it was blocking whole builder for a day or 2.

I've also asked RP about these earlier today:
09:32 < JaMa> RP: any bad memories about top 4 commits in
https://git.openembedded.org/bitbake-contrib/log/?h=jansa/1.46-cooker-backports
? It seems to fix parser deadlock
              I'm seeing quite often with dunfell after parsing exceptions
(maybe only the last commit is needed), I'll do more testing and then would
like to suggest them
              to Steve
09:42 < RP> JaMa: I think it may lead to more autobuilder hangs/timeouts
due to the hanging writer thread we wait on :/
09:42 < RP> JaMa: but I can understand why they may be a good idea

and I've shared more details with JPEW few days ago #yocto/2021-05-28:

00:13 < JaMa> JPEW: I have a question about your last bitbake fix
https://lists.openembedded.org/g/bitbake-devel/message/12343 does this
happen always or only rarely? I'm seeing bitbake hanging after early
parsing errors (typically b.data_smart.ExpansionError: Failure expanding
variable SRCPV, expression was ${@bb.fetch2.get_srcrev(d)} which triggered
exception FetchError: Fetcher failure: Unable to resolve 'foo' in
00:13 < JaMa> upstream git repository in git ls-remote output for ...
00:14 < JaMa> but it's quite rare and I haven't found a reliable way to
trigger this yet
00:15 < JPEW> It seems like it's always
00:16 < JPEW> It came up a lot in the YP summit when the students would
mistype something
00:16 < JaMa> rm -rf tmp-glibc cache/ __pycache__/; bitbake
--runall=write_bom_data lib32-some-big-image; in for loop was able to
eventually reproduce it, but when I try to replicate it on something
smaller with less layers it doesn't happen (or I wasn't patient enough)
00:18 < JPEW> JaMa: Interesting. It could be related... Its caused by the
client not sending an message to the server, perhaps there is something
else that could cause it
00:19 < JPEW> Something more likely with a lot of layers
00:19 < JaMa> it left 3 processes behind, strace -p shows:
00:19 < JaMa> select(12, [10 11], [], [], {tv_sec=30, tv_usec=0}) = 0
(Timeout)
00:19 < JaMa> 2nd one:
00:20 < JaMa> futex(0x12e9930,
FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1622142706,
tv_nsec=509127000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
00:20 < JaMa> 3rd:
00:20 < JaMa> read(20,
00:21 < JPEW> JaMa: The defining characteristic was that bitbake-server
didn't exit and you had to kill it...I didn't look for other processes
00:21 < JaMa> ok, possibly different path then, here bitbake-server is
already gone
00:22 < JPEW> Probably then
00:22 < JPEW> Sorry :)
00:22 < JaMa> will continue to debug this, it's also worth noting that I
see it on dunfell, but that's just because most our developers produce
those parsing errors on dunfell based builds :)
00:23 < JaMa> no problem, thanks anyway


In the end I was able to trigger this hang after parsing error (wrong tag
in SRCREV intentionally included in conf/local.conf) in less then 10
iterations of:
for i in `seq -w 1 1000`; do rm -rf tmp-glibc/ bitbake-cookerdaemon.log
 cache/ __pycache__/; echo $i; bitbake --runall=write_bom_data
lib32-big-image >log.parse.$i 2>&1; done

and was surprised that it passed all 1000 iterations with gatesgarth and
newer. Then confirmed that it can also pass 1000 iterations when using 1.48
bitbake with dunfell based build. First I was backporting process.py
related changes as originally I was hoping that already backported
https://git.openembedded.org/bitbake/commit/?h=1.46&id=017a39ed05d065bf28fd38f91bcde8a098300551
will fix these stuck jenkins jobs.

Then I did git bisect on 1.46..1.48 just to discover the same commit in
1.48:
https://git.openembedded.org/bitbake/commit/?h=1.48&id=5d02c98489d3a5836676b9c3fb3bd0157449db2b
to be the "final piece" to fix this, but this change without the cooker
changes didn't fix it in 1.46.

I've also tried to remove some of these commits and backport only the last
4/4 commit, but that only makes the conflicts bigger and diverges from 1.48
behavior, so I've kept all 4. To apply all without conflicts we would need
to backport
https://git.openembedded.org/bitbake/commit/?h=1.48&id=5272f2489586479880ae8d046dfcdbe0963ee5bb
as well, which seems to big change for backport (and also isn't needed to
resolve this parsing error - and I don't use multiconfig at all to properly
test it with dunfell).

Cheers,

On Thu, Jun 3, 2021 at 5:55 PM Steve Sakoman <steve@sakoman.com> wrote:

> Hi Martin,
>
> Are you seeing a particular issue that these patches resolve?
>
> I seem to recall considering these patches last summer and
> encountering issues on the autobuilder.  Sadly it has been long enough
> that I don't recall the details.
>
> I spoke briefly with Richard this morning, and he also has some
> concerns about taking these.  Any details you can provide on why you
> find these necessary would be appreciated.
>
> Steve
>
> On Thu, Jun 3, 2021 at 1:45 AM Martin Jansa <martin.jansa@gmail.com>
> wrote:
> >
> > From: Richard Purdie <richard.purdie@linuxfoundation.org>
> >
> > Each run through the parser would leak a thread from the queue created to
> > shut the parser down. Close this down correctly and clean up the code
> flow
> > slightly whilst in the area, making sure this thread does shut down
> correctly
> > (we don't care if it loses data).
> >
> > Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
> > ---
> >  lib/bb/cooker.py | 14 ++++++++------
> >  1 file changed, 8 insertions(+), 6 deletions(-)
> >
> > diff --git a/lib/bb/cooker.py b/lib/bb/cooker.py
> > index 11cc2b95..4820d268 100644
> > --- a/lib/bb/cooker.py
> > +++ b/lib/bb/cooker.py
> > @@ -2056,12 +2056,11 @@ class CookerParser(object):
> >                                              self.total)
> >
> >              bb.event.fire(event, self.cfgdata)
> > -            for process in self.processes:
> > -                self.parser_quit.put(None)
> > -        else:
> > -            self.parser_quit.cancel_join_thread()
> > -            for process in self.processes:
> > -                self.parser_quit.put(None)
> > +
> > +        # Allow data left in the cancel queue to be discarded
> > +        self.parser_quit.cancel_join_thread()
> > +        for process in self.processes:
> > +            self.parser_quit.put(None)
> >
> >          # Cleanup the queue before call process.join(), otherwise there
> might be
> >          # deadlocks.
> > @@ -2078,6 +2077,9 @@ class CookerParser(object):
> >              else:
> >                  process.join()
> >
> > +        self.parser_quit.close()
> > +        self.parser_quit.join_thread()
> > +
> >          sync = threading.Thread(target=self.bb_cache.sync)
> >          sync.start()
> >          multiprocessing.util.Finalize(None, sync.join,
> exitpriority=-100)
> > --
> > 2.30.2
> >
>

[-- Attachment #2: Type: text/html, Size: 9435 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [1.46][PATCH 1/4] cooker: Ensure parse_quit thread is closed down
  2021-06-03 16:17   ` Martin Jansa
@ 2021-06-03 16:41     ` Steve Sakoman
  0 siblings, 0 replies; 7+ messages in thread
From: Steve Sakoman @ 2021-06-03 16:41 UTC (permalink / raw)
  To: Martin Jansa; +Cc: bitbake-devel, Richard Purdie

On Thu, Jun 3, 2021 at 6:17 AM Martin Jansa <martin.jansa@gmail.com> wrote:
>
> Yes, on big build with a lot of recipes and also multilib enabled I'm seeing bitbake getting stuck quite often after parsing errors (usually when developers use wrong version in SRCPV fails to expand). Unfortunately this means that such jenkins jobs are stuck forever until someone notices that they are running way too long and aborts them manually (which is quite bad as the regular builds can take between 1-20 hours, so the stuck build really stands-out only after it was blocking whole builder for a day or 2.

Thanks for the additional details.  I'll pull these into my test queue
and see if it causes any problems on the autobuilder.

Steve

> I've also asked RP about these earlier today:
> 09:32 < JaMa> RP: any bad memories about top 4 commits in https://git.openembedded.org/bitbake-contrib/log/?h=jansa/1.46-cooker-backports ? It seems to fix parser deadlock
>               I'm seeing quite often with dunfell after parsing exceptions (maybe only the last commit is needed), I'll do more testing and then would like to suggest them
>               to Steve
> 09:42 < RP> JaMa: I think it may lead to more autobuilder hangs/timeouts due to the hanging writer thread we wait on :/
> 09:42 < RP> JaMa: but I can understand why they may be a good idea
>
> and I've shared more details with JPEW few days ago #yocto/2021-05-28:
>
> 00:13 < JaMa> JPEW: I have a question about your last bitbake fix https://lists.openembedded.org/g/bitbake-devel/message/12343 does this happen always or only rarely? I'm seeing bitbake hanging after early parsing errors (typically b.data_smart.ExpansionError: Failure expanding variable SRCPV, expression was ${@bb.fetch2.get_srcrev(d)} which triggered exception FetchError: Fetcher failure: Unable to resolve 'foo' in
> 00:13 < JaMa> upstream git repository in git ls-remote output for ...
> 00:14 < JaMa> but it's quite rare and I haven't found a reliable way to trigger this yet
> 00:15 < JPEW> It seems like it's always
> 00:16 < JPEW> It came up a lot in the YP summit when the students would mistype something
> 00:16 < JaMa> rm -rf tmp-glibc cache/ __pycache__/; bitbake --runall=write_bom_data lib32-some-big-image; in for loop was able to eventually reproduce it, but when I try to replicate it on something smaller with less layers it doesn't happen (or I wasn't patient enough)
> 00:18 < JPEW> JaMa: Interesting. It could be related... Its caused by the client not sending an message to the server, perhaps there is something else that could cause it
> 00:19 < JPEW> Something more likely with a lot of layers
> 00:19 < JaMa> it left 3 processes behind, strace -p shows:
> 00:19 < JaMa> select(12, [10 11], [], [], {tv_sec=30, tv_usec=0}) = 0 (Timeout)
> 00:19 < JaMa> 2nd one:
> 00:20 < JaMa> futex(0x12e9930, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1622142706, tv_nsec=509127000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
> 00:20 < JaMa> 3rd:
> 00:20 < JaMa> read(20,
> 00:21 < JPEW> JaMa: The defining characteristic was that bitbake-server didn't exit and you had to kill it...I didn't look for other processes
> 00:21 < JaMa> ok, possibly different path then, here bitbake-server is already gone
> 00:22 < JPEW> Probably then
> 00:22 < JPEW> Sorry :)
> 00:22 < JaMa> will continue to debug this, it's also worth noting that I see it on dunfell, but that's just because most our developers produce those parsing errors on dunfell based builds :)
> 00:23 < JaMa> no problem, thanks anyway
>
>
> In the end I was able to trigger this hang after parsing error (wrong tag in SRCREV intentionally included in conf/local.conf) in less then 10 iterations of:
> for i in `seq -w 1 1000`; do rm -rf tmp-glibc/ bitbake-cookerdaemon.log  cache/ __pycache__/; echo $i; bitbake --runall=write_bom_data lib32-big-image >log.parse.$i 2>&1; done
>
> and was surprised that it passed all 1000 iterations with gatesgarth and newer. Then confirmed that it can also pass 1000 iterations when using 1.48 bitbake with dunfell based build. First I was backporting process.py related changes as originally I was hoping that already backported
> https://git.openembedded.org/bitbake/commit/?h=1.46&id=017a39ed05d065bf28fd38f91bcde8a098300551
> will fix these stuck jenkins jobs.
>
> Then I did git bisect on 1.46..1.48 just to discover the same commit in 1.48:
> https://git.openembedded.org/bitbake/commit/?h=1.48&id=5d02c98489d3a5836676b9c3fb3bd0157449db2b
> to be the "final piece" to fix this, but this change without the cooker changes didn't fix it in 1.46.
>
> I've also tried to remove some of these commits and backport only the last 4/4 commit, but that only makes the conflicts bigger and diverges from 1.48 behavior, so I've kept all 4. To apply all without conflicts we would need to backport https://git.openembedded.org/bitbake/commit/?h=1.48&id=5272f2489586479880ae8d046dfcdbe0963ee5bb as well, which seems to big change for backport (and also isn't needed to resolve this parsing error - and I don't use multiconfig at all to properly test it with dunfell).
>
> Cheers,
>
> On Thu, Jun 3, 2021 at 5:55 PM Steve Sakoman <steve@sakoman.com> wrote:
>>
>> Hi Martin,
>>
>> Are you seeing a particular issue that these patches resolve?
>>
>> I seem to recall considering these patches last summer and
>> encountering issues on the autobuilder.  Sadly it has been long enough
>> that I don't recall the details.
>>
>> I spoke briefly with Richard this morning, and he also has some
>> concerns about taking these.  Any details you can provide on why you
>> find these necessary would be appreciated.
>>
>> Steve
>>
>> On Thu, Jun 3, 2021 at 1:45 AM Martin Jansa <martin.jansa@gmail.com> wrote:
>> >
>> > From: Richard Purdie <richard.purdie@linuxfoundation.org>
>> >
>> > Each run through the parser would leak a thread from the queue created to
>> > shut the parser down. Close this down correctly and clean up the code flow
>> > slightly whilst in the area, making sure this thread does shut down correctly
>> > (we don't care if it loses data).
>> >
>> > Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
>> > ---
>> >  lib/bb/cooker.py | 14 ++++++++------
>> >  1 file changed, 8 insertions(+), 6 deletions(-)
>> >
>> > diff --git a/lib/bb/cooker.py b/lib/bb/cooker.py
>> > index 11cc2b95..4820d268 100644
>> > --- a/lib/bb/cooker.py
>> > +++ b/lib/bb/cooker.py
>> > @@ -2056,12 +2056,11 @@ class CookerParser(object):
>> >                                              self.total)
>> >
>> >              bb.event.fire(event, self.cfgdata)
>> > -            for process in self.processes:
>> > -                self.parser_quit.put(None)
>> > -        else:
>> > -            self.parser_quit.cancel_join_thread()
>> > -            for process in self.processes:
>> > -                self.parser_quit.put(None)
>> > +
>> > +        # Allow data left in the cancel queue to be discarded
>> > +        self.parser_quit.cancel_join_thread()
>> > +        for process in self.processes:
>> > +            self.parser_quit.put(None)
>> >
>> >          # Cleanup the queue before call process.join(), otherwise there might be
>> >          # deadlocks.
>> > @@ -2078,6 +2077,9 @@ class CookerParser(object):
>> >              else:
>> >                  process.join()
>> >
>> > +        self.parser_quit.close()
>> > +        self.parser_quit.join_thread()
>> > +
>> >          sync = threading.Thread(target=self.bb_cache.sync)
>> >          sync.start()
>> >          multiprocessing.util.Finalize(None, sync.join, exitpriority=-100)
>> > --
>> > 2.30.2
>> >

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2021-06-03 16:41 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-03 11:45 [1.46][PATCH 1/4] cooker: Ensure parse_quit thread is closed down Martin Jansa
2021-06-03 11:45 ` [1.46][PATCH 2/4] cooker: Explictly shut down the sync thread Martin Jansa
2021-06-03 11:45 ` [1.46][PATCH 3/4] cooker: Ensure parser is cleaned up Martin Jansa
2021-06-03 11:45 ` [1.46][PATCH 4/4] cooker: Avoid parser deadlocks Martin Jansa
2021-06-03 15:55 ` [1.46][PATCH 1/4] cooker: Ensure parse_quit thread is closed down Steve Sakoman
2021-06-03 16:17   ` Martin Jansa
2021-06-03 16:41     ` Steve Sakoman

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.