* [PATCH] CI: Add automatic retry for test.py jobs
@ 2023-07-12  2:33 Tom Rini
  2023-07-12 14:00 ` Simon Glass
  2023-07-21  1:30 ` Tom Rini
  0 siblings, 2 replies; 13+ messages in thread
From: Tom Rini @ 2023-07-12  2:33 UTC (permalink / raw)
  To: u-boot

It is not uncommon for some of the QEMU-based jobs to fail not because
of a code issue but rather because of a timing issue or similar problem
that is out of our control. Make use of the keywords that Azure and
GitLab provide so that we will automatically re-run these when they fail
2 times. If they fail that often it is likely we have found a real issue
to investigate.

Signed-off-by: Tom Rini <trini@konsulko.com>
---
 .azure-pipelines.yml | 1 +
 .gitlab-ci.yml       | 1 +
 2 files changed, 2 insertions(+)

diff --git a/.azure-pipelines.yml b/.azure-pipelines.yml
index 06c46b681c3f..76982ec3e52e 100644
--- a/.azure-pipelines.yml
+++ b/.azure-pipelines.yml
@@ -460,6 +460,7 @@ stages:
           fi
           # Some tests using libguestfs-tools need the fuse device to run
           docker run "$@" --device /dev/fuse:/dev/fuse -v $PWD:$(work_dir) $(ci_runner_image) /bin/bash $(work_dir)/test.sh
+        retryCountOnTaskFailure: 2 # QEMU may be too slow, etc.
 
 - stage: world_build
   jobs:
diff --git a/.gitlab-ci.yml b/.gitlab-ci.yml
index cfd58513c377..f7ffb8f5dfdc 100644
--- a/.gitlab-ci.yml
+++ b/.gitlab-ci.yml
@@ -20,6 +20,7 @@ stages:
 
 .buildman_and_testpy_template: &buildman_and_testpy_dfn
   stage: test.py
+  retry: 2 # QEMU may be too slow, etc.
   before_script:
     # Clone uboot-test-hooks
     - git config --global --add safe.directory "${CI_PROJECT_DIR}"
-- 
2.34.1



* Re: [PATCH] CI: Add automatic retry for test.py jobs
  2023-07-12  2:33 [PATCH] CI: Add automatic retry for test.py jobs Tom Rini
@ 2023-07-12 14:00 ` Simon Glass
  2023-07-12 17:08   ` Tom Rini
  2023-07-21  1:30 ` Tom Rini
  1 sibling, 1 reply; 13+ messages in thread
From: Simon Glass @ 2023-07-12 14:00 UTC (permalink / raw)
  To: Tom Rini; +Cc: u-boot

Hi Tom,

On Tue, 11 Jul 2023 at 20:33, Tom Rini <trini@konsulko.com> wrote:
>
> It is not uncommon for some of the QEMU-based jobs to fail not because
> of a code issue but rather because of a timing issue or similar problem
> that is out of our control. Make use of the keywords that Azure and
> GitLab provide so that we will automatically re-run these when they fail
> 2 times. If they fail that often it is likely we have found a real issue
> to investigate.
>
> Signed-off-by: Tom Rini <trini@konsulko.com>
> ---
>  .azure-pipelines.yml | 1 +
>  .gitlab-ci.yml       | 1 +
>  2 files changed, 2 insertions(+)
>

This seems like a slippery slope. Do we know why things fail? I wonder
if we should disable the tests / builders instead, until it can be
corrected?

I'll note that we don't have this problem with sandbox tests.

Regards,
Simon


* Re: [PATCH] CI: Add automatic retry for test.py jobs
  2023-07-12 14:00 ` Simon Glass
@ 2023-07-12 17:08   ` Tom Rini
  2023-07-12 20:32     ` Simon Glass
  0 siblings, 1 reply; 13+ messages in thread
From: Tom Rini @ 2023-07-12 17:08 UTC (permalink / raw)
  To: Simon Glass; +Cc: u-boot

On Wed, Jul 12, 2023 at 08:00:23AM -0600, Simon Glass wrote:
> Hi Tom,
> 
> On Tue, 11 Jul 2023 at 20:33, Tom Rini <trini@konsulko.com> wrote:
> >
> > It is not uncommon for some of the QEMU-based jobs to fail not because
> > of a code issue but rather because of a timing issue or similar problem
> > that is out of our control. Make use of the keywords that Azure and
> > GitLab provide so that we will automatically re-run these when they fail
> > 2 times. If they fail that often it is likely we have found a real issue
> > to investigate.
> >
> > Signed-off-by: Tom Rini <trini@konsulko.com>
> > ---
> >  .azure-pipelines.yml | 1 +
> >  .gitlab-ci.yml       | 1 +
> >  2 files changed, 2 insertions(+)
> 
> This seems like a slippery slope. Do we know why things fail? I wonder
> if we should disable the tests / builders instead, until it can be
> corrected?

It happens in Azure, so it's not just the broken runner problem we have
in GitLab. And the problem is timing, as I said in the commit.
Sometimes we still get the RTC test failing. Other times we don't get
QEMU + U-Boot spawned in time (most often m68k, but sometimes x86).

> I'll note that we don't have this problem with sandbox tests.

OK, but that's not relevant?

-- 
Tom


* Re: [PATCH] CI: Add automatic retry for test.py jobs
  2023-07-12 17:08   ` Tom Rini
@ 2023-07-12 20:32     ` Simon Glass
  2023-07-12 20:38       ` Tom Rini
  0 siblings, 1 reply; 13+ messages in thread
From: Simon Glass @ 2023-07-12 20:32 UTC (permalink / raw)
  To: Tom Rini; +Cc: u-boot

Hi Tom,

On Wed, 12 Jul 2023 at 11:09, Tom Rini <trini@konsulko.com> wrote:
>
> On Wed, Jul 12, 2023 at 08:00:23AM -0600, Simon Glass wrote:
> > Hi Tom,
> >
> > On Tue, 11 Jul 2023 at 20:33, Tom Rini <trini@konsulko.com> wrote:
> > >
> > > It is not uncommon for some of the QEMU-based jobs to fail not because
> > > of a code issue but rather because of a timing issue or similar problem
> > > that is out of our control. Make use of the keywords that Azure and
> > > GitLab provide so that we will automatically re-run these when they fail
> > > 2 times. If they fail that often it is likely we have found a real issue
> > > to investigate.
> > >
> > > Signed-off-by: Tom Rini <trini@konsulko.com>
> > > ---
> > >  .azure-pipelines.yml | 1 +
> > >  .gitlab-ci.yml       | 1 +
> > >  2 files changed, 2 insertions(+)
> >
> > This seems like a slippery slope. Do we know why things fail? I wonder
> > if we should disable the tests / builders instead, until it can be
> > corrected?
>
> It happens in Azure, so it's not just the broken runner problem we have
> in GitLab. And the problem is timing, as I said in the commit.
> Sometimes we still get the RTC test failing. Other times we don't get
> QEMU + U-Boot spawned in time (most often m68k, but sometimes x86).

How do we keep this list from growing?

>
> > I'll note that we don't have this problem with sandbox tests.
>
> OK, but that's not relevant?

It is relevant to the discussion about using QEMU instead of sandbox,
e.g. with the TPM. I recall a discussion with Ilias a while back.

Reviewed-by: Simon Glass <sjg@chromium.org>

Regards,
Simon


* Re: [PATCH] CI: Add automatic retry for test.py jobs
  2023-07-12 20:32     ` Simon Glass
@ 2023-07-12 20:38       ` Tom Rini
  2023-07-13 21:03         ` Simon Glass
  0 siblings, 1 reply; 13+ messages in thread
From: Tom Rini @ 2023-07-12 20:38 UTC (permalink / raw)
  To: Simon Glass; +Cc: u-boot

On Wed, Jul 12, 2023 at 02:32:18PM -0600, Simon Glass wrote:
> Hi Tom,
> 
> On Wed, 12 Jul 2023 at 11:09, Tom Rini <trini@konsulko.com> wrote:
> >
> > On Wed, Jul 12, 2023 at 08:00:23AM -0600, Simon Glass wrote:
> > > Hi Tom,
> > >
> > > On Tue, 11 Jul 2023 at 20:33, Tom Rini <trini@konsulko.com> wrote:
> > > >
> > > > It is not uncommon for some of the QEMU-based jobs to fail not because
> > > > of a code issue but rather because of a timing issue or similar problem
> > > > that is out of our control. Make use of the keywords that Azure and
> > > > GitLab provide so that we will automatically re-run these when they fail
> > > > 2 times. If they fail that often it is likely we have found a real issue
> > > > to investigate.
> > > >
> > > > Signed-off-by: Tom Rini <trini@konsulko.com>
> > > > ---
> > > >  .azure-pipelines.yml | 1 +
> > > >  .gitlab-ci.yml       | 1 +
> > > >  2 files changed, 2 insertions(+)
> > >
> > > This seems like a slippery slope. Do we know why things fail? I wonder
> > > if we should disable the tests / builders instead, until it can be
> > > corrected?
> >
> > It happens in Azure, so it's not just the broken runner problem we have
> > in GitLab. And the problem is timing, as I said in the commit.
> > Sometimes we still get the RTC test failing. Other times we don't get
> > QEMU + U-Boot spawned in time (most often m68k, but sometimes x86).
> 
> How do we keep this list from growing?

Do we need to? The problem, in essence, is that since we rely on free
resources, some heavy lifts sometimes take longer.  That's what this
flag is for.
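
Roughly, what "retry: 2" / "retryCountOnTaskFailure: 2" buys us is up to
two automatic re-runs, i.e. at most three attempts, before the job is
reported as failed. In pseudo-Python (an illustration only, not CI code):

    def run_with_retries(job, max_retries=2):
        """'job' is a hypothetical callable returning True on success."""
        for _ in range(max_retries + 1):
            if job():
                return True   # a transient failure was absorbed by a retry
        return False          # failed every attempt: likely a real issue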

> > > I'll note that we don't have this problem with sandbox tests.
> >
> > OK, but that's not relevant?
> 
> It is relevant to the discussion about using QEMU instead of sandbox,
> e.g. with the TPM. I recall a discussion with Ilias a while back.

I'm sure we could make sandbox take too long to start as well, if enough
other things are going on with the system.  And sandbox has its own set
of super frustrating issues instead, so I don't think this is a great
argument to have right here (I have to run it in docker, to get around
some application version requirements and exclude event_dump, bootmgr,
abootimg and gpt tests, which could otherwise run, but fail for me).

-- 
Tom


* Re: [PATCH] CI: Add automatic retry for test.py jobs
  2023-07-12 20:38       ` Tom Rini
@ 2023-07-13 21:03         ` Simon Glass
  2023-07-13 21:57           ` Tom Rini
  0 siblings, 1 reply; 13+ messages in thread
From: Simon Glass @ 2023-07-13 21:03 UTC (permalink / raw)
  To: Tom Rini; +Cc: u-boot

Hi Tom,

On Wed, 12 Jul 2023 at 14:38, Tom Rini <trini@konsulko.com> wrote:
>
> On Wed, Jul 12, 2023 at 02:32:18PM -0600, Simon Glass wrote:
> > Hi Tom,
> >
> > On Wed, 12 Jul 2023 at 11:09, Tom Rini <trini@konsulko.com> wrote:
> > >
> > > On Wed, Jul 12, 2023 at 08:00:23AM -0600, Simon Glass wrote:
> > > > Hi Tom,
> > > >
> > > > On Tue, 11 Jul 2023 at 20:33, Tom Rini <trini@konsulko.com> wrote:
> > > > >
> > > > > It is not uncommon for some of the QEMU-based jobs to fail not because
> > > > > of a code issue but rather because of a timing issue or similar problem
> > > > > that is out of our control. Make use of the keywords that Azure and
> > > > > GitLab provide so that we will automatically re-run these when they fail
> > > > > 2 times. If they fail that often it is likely we have found a real issue
> > > > > to investigate.
> > > > >
> > > > > Signed-off-by: Tom Rini <trini@konsulko.com>
> > > > > ---
> > > > >  .azure-pipelines.yml | 1 +
> > > > >  .gitlab-ci.yml       | 1 +
> > > > >  2 files changed, 2 insertions(+)
> > > >
> > > > This seems like a slippery slope. Do we know why things fail? I wonder
> > > > if we should disable the tests / builders instead, until it can be
> > > > corrected?
> > >
> > > It happens in Azure, so it's not just the broken runner problem we have
> > > in GitLab. And the problem is timing, as I said in the commit.
> > > Sometimes we still get the RTC test failing. Other times we don't get
> > > QEMU + U-Boot spawned in time (most often m68k, but sometimes x86).
> >
> > How do we keep this list from growing?
>
> Do we need to? The problem is in essence since we rely on free
> resources, sometimes some heavy lifts take longer.  That's what this
> flag is for.

I'm fairly sure the RTC thing could be made deterministic.

The spawning thing...is there a timeout for that? What actually fails?

>
> > > > I'll note that we don't have this problem with sandbox tests.
> > >
> > > OK, but that's not relevant?
> >
> > It is relevant to the discussion about using QEMU instead of sandbox,
> > e.g. with the TPM. I recall a discussion with Ilias a while back.
>
> I'm sure we could make sandbox take too long to start as well, if enough
> other things are going on with the system.  And sandbox has its own set
> of super frustrating issues instead, so I don't think this is a great
> argument to have right here (I have to run it in docker, to get around
> some application version requirements and exclude event_dump, bootmgr,
> abootimg and gpt tests, which could otherwise run, but fail for me).

I haven't heard about this before. Is there anything that could be done?

Regards.

Simon


* Re: [PATCH] CI: Add automatic retry for test.py jobs
  2023-07-13 21:03         ` Simon Glass
@ 2023-07-13 21:57           ` Tom Rini
  2023-07-15 23:40             ` Simon Glass
  0 siblings, 1 reply; 13+ messages in thread
From: Tom Rini @ 2023-07-13 21:57 UTC (permalink / raw)
  To: Simon Glass; +Cc: u-boot

On Thu, Jul 13, 2023 at 03:03:57PM -0600, Simon Glass wrote:
> Hi Tom,
> 
> On Wed, 12 Jul 2023 at 14:38, Tom Rini <trini@konsulko.com> wrote:
> >
> > On Wed, Jul 12, 2023 at 02:32:18PM -0600, Simon Glass wrote:
> > > Hi Tom,
> > >
> > > On Wed, 12 Jul 2023 at 11:09, Tom Rini <trini@konsulko.com> wrote:
> > > >
> > > > On Wed, Jul 12, 2023 at 08:00:23AM -0600, Simon Glass wrote:
> > > > > Hi Tom,
> > > > >
> > > > > On Tue, 11 Jul 2023 at 20:33, Tom Rini <trini@konsulko.com> wrote:
> > > > > >
> > > > > > It is not uncommon for some of the QEMU-based jobs to fail not because
> > > > > > of a code issue but rather because of a timing issue or similar problem
> > > > > > that is out of our control. Make use of the keywords that Azure and
> > > > > > GitLab provide so that we will automatically re-run these when they fail
> > > > > > 2 times. If they fail that often it is likely we have found a real issue
> > > > > > to investigate.
> > > > > >
> > > > > > Signed-off-by: Tom Rini <trini@konsulko.com>
> > > > > > ---
> > > > > >  .azure-pipelines.yml | 1 +
> > > > > >  .gitlab-ci.yml       | 1 +
> > > > > >  2 files changed, 2 insertions(+)
> > > > >
> > > > > This seems like a slippery slope. Do we know why things fail? I wonder
> > > > > if we should disable the tests / builders instead, until it can be
> > > > > corrected?
> > > >
> > > > It happens in Azure, so it's not just the broken runner problem we have
> > > > in GitLab. And the problem is timing, as I said in the commit.
> > > > Sometimes we still get the RTC test failing. Other times we don't get
> > > > QEMU + U-Boot spawned in time (most often m68k, but sometimes x86).
> > >
> > > How do we keep this list from growing?
> >
> > Do we need to? The problem is in essence since we rely on free
> > resources, sometimes some heavy lifts take longer.  That's what this
> > flag is for.
> 
> I'm fairly sure the RTC thing could be made deterministic.

We've already tried that once, and it happens a lot less often. If we
make it even looser we risk making the test itself useless.
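
Concretely, the kind of check we are talking about is roughly this (a
hypothetical sketch, not the actual test code):

    import time

    def check_rtc_advance(read_rtc_seconds, sleep_s=2, slack_s=1):
        """'read_rtc_seconds' is a hypothetical helper returning RTC seconds."""
        start = read_rtc_seconds()
        time.sleep(sleep_s)
        elapsed = read_rtc_seconds() - start
        # Widening slack_s absorbs scheduler jitter on a loaded runner, but
        # make it too wide and the test no longer proves the RTC advances
        # correctly, which is the risk mentioned above.
        assert sleep_s - slack_s <= elapsed <= sleep_s + slack_s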

> The spawning thing...is there a timeout for that? What actually fails?

It doesn't spawn in time for the framework to get to the prompt.  We
could maybe increase the timeout value.  It's always the version test
that fails.
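
To be concrete, the failure is the console framework's first expect timing
out before the prompt ever shows up, along these lines (using pexpect as a
stand-in for our own spawn helper; the QEMU command line is made up):

    import pexpect

    # Hypothetical invocation; the real one comes from the board config.
    child = pexpect.spawn('qemu-system-m68k -nographic -kernel u-boot',
                          timeout=60)   # raising this is the knob in question
    child.expect('=> ')       # first U-Boot prompt; this is what times out
    child.sendline('version')
    child.expect('=> ')       # 'version' runs first, so it gets the blame
    print(child.before.decode())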

> > > > > I'll note that we don't have this problem with sandbox tests.
> > > >
> > > > OK, but that's not relevant?
> > >
> > > It is relevant to the discussion about using QEMU instead of sandbox,
> > > e.g. with the TPM. I recall a discussion with Ilias a while back.
> >
> > I'm sure we could make sandbox take too long to start as well, if enough
> > other things are going on with the system.  And sandbox has its own set
> > of super frustrating issues instead, so I don't think this is a great
> > argument to have right here (I have to run it in docker, to get around
> > some application version requirements and exclude event_dump, bootmgr,
> > abootimg and gpt tests, which could otherwise run, but fail for me).
> 
> I haven't heard about this before. Is there anything that could be done?

I have no idea what could be done about it since I believe all of them
run fine in CI, including on this very host, when gitlab invokes it
rather than when I invoke it. My point here is that sandbox tests are
just a different kind of picky about things and need their own kind of
"just hit retry".

-- 
Tom


* Re: [PATCH] CI: Add automatic retry for test.py jobs
  2023-07-13 21:57           ` Tom Rini
@ 2023-07-15 23:40             ` Simon Glass
  2023-07-16 18:17               ` Tom Rini
  0 siblings, 1 reply; 13+ messages in thread
From: Simon Glass @ 2023-07-15 23:40 UTC (permalink / raw)
  To: Tom Rini; +Cc: u-boot

Hi Tom,

On Thu, 13 Jul 2023 at 15:57, Tom Rini <trini@konsulko.com> wrote:
>
> On Thu, Jul 13, 2023 at 03:03:57PM -0600, Simon Glass wrote:
> > Hi Tom,
> >
> > On Wed, 12 Jul 2023 at 14:38, Tom Rini <trini@konsulko.com> wrote:
> > >
> > > On Wed, Jul 12, 2023 at 02:32:18PM -0600, Simon Glass wrote:
> > > > Hi Tom,
> > > >
> > > > On Wed, 12 Jul 2023 at 11:09, Tom Rini <trini@konsulko.com> wrote:
> > > > >
> > > > > On Wed, Jul 12, 2023 at 08:00:23AM -0600, Simon Glass wrote:
> > > > > > Hi Tom,
> > > > > >
> > > > > > On Tue, 11 Jul 2023 at 20:33, Tom Rini <trini@konsulko.com> wrote:
> > > > > > >
> > > > > > > It is not uncommon for some of the QEMU-based jobs to fail not because
> > > > > > > of a code issue but rather because of a timing issue or similar problem
> > > > > > > that is out of our control. Make use of the keywords that Azure and
> > > > > > > GitLab provide so that we will automatically re-run these when they fail
> > > > > > > 2 times. If they fail that often it is likely we have found a real issue
> > > > > > > to investigate.
> > > > > > >
> > > > > > > Signed-off-by: Tom Rini <trini@konsulko.com>
> > > > > > > ---
> > > > > > >  .azure-pipelines.yml | 1 +
> > > > > > >  .gitlab-ci.yml       | 1 +
> > > > > > >  2 files changed, 2 insertions(+)
> > > > > >
> > > > > > This seems like a slippery slope. Do we know why things fail? I wonder
> > > > > > if we should disable the tests / builders instead, until it can be
> > > > > > corrected?
> > > > >
> > > > > It happens in Azure, so it's not just the broken runner problem we have
> > > > > in GitLab. And the problem is timing, as I said in the commit.
> > > > > Sometimes we still get the RTC test failing. Other times we don't get
> > > > > QEMU + U-Boot spawned in time (most often m68k, but sometimes x86).
> > > >
> > > > How do we keep this list from growing?
> > >
> > > Do we need to? The problem is in essence since we rely on free
> > > resources, sometimes some heavy lifts take longer.  That's what this
> > > flag is for.
> >
> > I'm fairly sure the RTC thing could be made deterministic.
>
> We've already tried that once, and it happens a lot less often. If we
> make it even looser we risk making the test itself useless.

For sleep, yes, but for rtc it should be deterministic now...next time
you get a failure could you send me the trace?

>
> > The spawning thing...is there a timeout for that? What actually fails?
>
> It doesn't spawn in time for the framework to get to the prompt.  We
> could maybe increase the timeout value.  It's always the version test
> that fails.

Ah OK, yes increasing the timeout makes sense.

>
> > > > > > I'll note that we don't have this problem with sandbox tests.
> > > > >
> > > > > OK, but that's not relevant?
> > > >
> > > > It is relevant to the discussion about using QEMU instead of sandbox,
> > > > e.g. with the TPM. I recall a discussion with Ilias a while back.
> > >
> > > I'm sure we could make sandbox take too long to start as well, if enough
> > > other things are going on with the system.  And sandbox has its own set
> > > of super frustrating issues instead, so I don't think this is a great
> > > argument to have right here (I have to run it in docker, to get around
> > > some application version requirements and exclude event_dump, bootmgr,
> > > abootimg and gpt tests, which could otherwise run, but fail for me).
> >
> > I haven't heard about this before. Is there anything that could be done?
>
> I have no idea what could be done about it since I believe all of them
> run fine in CI, including on this very host, when gitlab invokes it
> rather than when I invoke it. My point here is that sandbox tests are
> just a different kind of picky about things and need their own kind of
> "just hit retry".

Perhaps this is Python dependencies? I'm not sure, but if you see it
again, please let me know in case we can actually fix this.

Regards,
Simon


* Re: [PATCH] CI: Add automatic retry for test.py jobs
  2023-07-15 23:40             ` Simon Glass
@ 2023-07-16 18:17               ` Tom Rini
  2023-07-27 19:18                 ` Simon Glass
  0 siblings, 1 reply; 13+ messages in thread
From: Tom Rini @ 2023-07-16 18:17 UTC (permalink / raw)
  To: Simon Glass; +Cc: u-boot

On Sat, Jul 15, 2023 at 05:40:25PM -0600, Simon Glass wrote:
> Hi Tom,
> 
> On Thu, 13 Jul 2023 at 15:57, Tom Rini <trini@konsulko.com> wrote:
> >
> > On Thu, Jul 13, 2023 at 03:03:57PM -0600, Simon Glass wrote:
> > > Hi Tom,
> > >
> > > On Wed, 12 Jul 2023 at 14:38, Tom Rini <trini@konsulko.com> wrote:
> > > >
> > > > On Wed, Jul 12, 2023 at 02:32:18PM -0600, Simon Glass wrote:
> > > > > Hi Tom,
> > > > >
> > > > > On Wed, 12 Jul 2023 at 11:09, Tom Rini <trini@konsulko.com> wrote:
> > > > > >
> > > > > > On Wed, Jul 12, 2023 at 08:00:23AM -0600, Simon Glass wrote:
> > > > > > > Hi Tom,
> > > > > > >
> > > > > > > On Tue, 11 Jul 2023 at 20:33, Tom Rini <trini@konsulko.com> wrote:
> > > > > > > >
> > > > > > > > It is not uncommon for some of the QEMU-based jobs to fail not because
> > > > > > > > of a code issue but rather because of a timing issue or similar problem
> > > > > > > > that is out of our control. Make use of the keywords that Azure and
> > > > > > > > GitLab provide so that we will automatically re-run these when they fail
> > > > > > > > 2 times. If they fail that often it is likely we have found a real issue
> > > > > > > > to investigate.
> > > > > > > >
> > > > > > > > Signed-off-by: Tom Rini <trini@konsulko.com>
> > > > > > > > ---
> > > > > > > >  .azure-pipelines.yml | 1 +
> > > > > > > >  .gitlab-ci.yml       | 1 +
> > > > > > > >  2 files changed, 2 insertions(+)
> > > > > > >
> > > > > > > This seems like a slippery slope. Do we know why things fail? I wonder
> > > > > > > if we should disable the tests / builders instead, until it can be
> > > > > > > corrected?
> > > > > >
> > > > > > It happens in Azure, so it's not just the broken runner problem we have
> > > > > > in GitLab. And the problem is timing, as I said in the commit.
> > > > > > Sometimes we still get the RTC test failing. Other times we don't get
> > > > > > QEMU + U-Boot spawned in time (most often m68k, but sometimes x86).
> > > > >
> > > > > How do we keep this list from growing?
> > > >
> > > > Do we need to? The problem is in essence since we rely on free
> > > > resources, sometimes some heavy lifts take longer.  That's what this
> > > > flag is for.
> > >
> > > I'm fairly sure the RTC thing could be made deterministic.
> >
> > We've already tried that once, and it happens a lot less often. If we
> > make it even looser we risk making the test itself useless.
> 
> For sleep, yes, but for rtc it should be deterministic now...next time
> you get a failure could you send me the trace?

Found one:
https://dev.azure.com/u-boot/u-boot/_build/results?buildId=6592&view=logs&j=b6c47816-145c-5bfe-20a7-c6a2572e6c41&t=0929c28c-6e32-5635-9624-54eaa917d713&l=599

And note that we have a different set of timeout problems that may or may not
be configurable, which is in the upload of the pytest results. I haven't seen
if there's a knob for this one yet, within Azure (or the python project we're
adding for it).

> > > The spawning thing...is there a timeout for that? What actually fails?
> >
> > It doesn't spawn in time for the framework to get to the prompt.  We
> > could maybe increase the timeout value.  It's always the version test
> > that fails.
> 
> Ah OK, yes increasing the timeout makes sense.
> 
> >
> > > > > > > I'll note that we don't have this problem with sandbox tests.
> > > > > >
> > > > > > OK, but that's not relevant?
> > > > >
> > > > > It is relevant to the discussion about using QEMU instead of sandbox,
> > > > > e.g. with the TPM. I recall a discussion with Ilias a while back.
> > > >
> > > > I'm sure we could make sandbox take too long to start as well, if enough
> > > > other things are going on with the system.  And sandbox has its own set
> > > > of super frustrating issues instead, so I don't think this is a great
> > > > argument to have right here (I have to run it in docker, to get around
> > > > some application version requirements and exclude event_dump, bootmgr,
> > > > abootimg and gpt tests, which could otherwise run, but fail for me).
> > >
> > > I haven't heard about this before. Is there anything that could be done?
> >
> > I have no idea what could be done about it since I believe all of them
> > run fine in CI, including on this very host, when gitlab invokes it
> > rather than when I invoke it. My point here is that sandbox tests are
> > just a different kind of picky about things and need their own kind of
> > "just hit retry".
> 
> Perhaps this is Python dependencies? I'm not sure, but if you see it
> again, please let me know in case we can actually fix this.

Alright. On the first pass I took at running the sandbox pytests with as
little hand-holding as possible, I hit the known issue of /boot/vmlinu*
being 0400 in Ubuntu. I fixed that, re-ran, and got:
test/py/tests/test_cleanup_build.py F

========================================== FAILURES ===========================================
_________________________________________ test_clean __________________________________________
test/py/tests/test_cleanup_build.py:94: in test_clean
    assert not leftovers, f"leftovers: {', '.join(map(str, leftovers))}"
E   AssertionError: leftovers: fdt-out.dtb, sha1-pad/sandbox-u-boot.dtb, sha1-pad/sandbox-kernel.dtb, sha1-basic/sandbox-u-boot.dtb, sha1-basic/sandbox-kernel.dtb, sha384-basic/sandbox-u-boot.dtb, sha384-basic/sandbox-kernel.dtb, algo-arg/sandbox-u-boot.dtb, algo-arg/sandbox-kernel.dtb, sha1-pss/sandbox-u-boot.dtb, sha1-pss/sandbox-kernel.dtb, sha256-pad/sandbox-u-boot.dtb, sha256-pad/sandbox-kernel.dtb, sha256-global-sign/sandbox-binman.dtb, sha256-global-sign/sandbox-u-boot.dtb, sha256-global-sign/sandbox-u-boot-global.dtb, sha256-global-sign/sandbox-kernel.dtb, sha256-global-sign-pss/sandbox-binman-pss.dtb, sha256-global-sign-pss/sandbox-u-boot.dtb, sha256-global-sign-pss/sandbox-kernel.dtb, sha256-global-sign-pss/sandbox-u-boot-global-pss.dtb, auto_fit/dt-1.dtb, auto_fit/dt-2.dtb, sha256-pss/sandbox-u-boot.dtb, sha256-pss/sandbox-kernel.dtb, sha256-pss-pad/sandbox-u-boot.dtb, sha256-pss-pad/sandbox-kernel.dtb, hashes/sandbox-kernel.dtb, sha256-basic/sandbox-u-boot.dtb, sha256-basic/sandbox-kernel.dtb, sha1-pss-pad/sandbox-u-boot.dtb, sha1-pss-pad/sandbox-kernel.dtb, sha384-pad/sandbox-u-boot.dtb, sha384-pad/sandbox-kernel.dtb, sha256-pss-pad-required/sandbox-u-boot.dtb, sha256-pss-pad-required/sandbox-kernel.dtb, ecdsa/sandbox-kernel.dtb, sha256-pss-required/sandbox-u-boot.dtb, sha256-pss-required/sandbox-kernel.dtb
E   assert not [PosixPath('fdt-out.dtb'), PosixPath('sha1-pad/sandbox-u-boot.dtb'), PosixPath('sha1-pad/sandbox-kernel.dtb'), PosixPa...ic/sandbox-u-boot.dtb'), PosixPath('sha1-basic/sandbox-kernel.dtb'), PosixPath('sha384-basic/sandbox-u-boot.dtb'), ...]
------------------------------------ Captured stdout call -------------------------------------
+make O=/tmp/pytest-of-trini/pytest-231/test_clean0 clean
make[1]: Entering directory '/tmp/pytest-of-trini/pytest-231/test_clean0'
  CLEAN   cmd
  CLEAN   dts/../arch/sandbox/dts
  CLEAN   dts
  CLEAN   lib
  CLEAN   tools
  CLEAN   tools/generated
  CLEAN   include/bmp_logo.h include/bmp_logo_data.h include/generated/env.in include/generated/env.txt drivers/video/u_boot_logo.S u-boot u-boot-dtb.bin u-boot-initial-env u-boot-nodtb.bin u-boot.bin u-boot.cfg u-boot.dtb u-boot.dtb.gz u-boot.dtb.out u-boot.dts u-boot.lds u-boot.map u-boot.srec u-boot.sym System.map image.map keep-syms-lto.c lib/efi_loader/helloworld_efi.S
make[1]: Leaving directory '/tmp/pytest-of-trini/pytest-231/test_clean0'
=================================== short test summary info ===================================
FAILED test/py/tests/test_cleanup_build.py::test_clean - AssertionError: leftovers: fdt-out....
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
================================= 1 failed, 6 passed in 6.42s =================================

I fixed that manually with an rm -rf of /tmp/pytest-of-trini, and now it's
stuck.  I've rm -rf'd that and run git clean -dfx, and it just repeats that
failure.  I'm hopeful that when I reboot, whatever magic is broken will
be cleaned out.  Moving things into a docker container again, I get:
=========================================== ERRORS ============================================
_______________________________ ERROR at setup of test_gpt_read _______________________________
/home/trini/work/u-boot/u-boot/test/py/tests/test_gpt.py:74: in state_disk_image
    ???
/home/trini/work/u-boot/u-boot/test/py/tests/test_gpt.py:37: in __init__
    ???
test/py/u_boot_utils.py:279: in __enter__
    self.module_filename = module.__file__
E   AttributeError: 'NoneType' object has no attribute '__file__'
=================================== short test summary info ===================================
ERROR test/py/tests/test_gpt.py::test_gpt_read - AttributeError: 'NoneType' object has no at...
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
========================== 41 passed, 45 skipped, 1 error in 19.29s ===========================

And then ignoring that one with "-k not gpt":
test/py/tests/test_android/test_ab.py E

=========================================== ERRORS ============================================
__________________________________ ERROR at setup of test_ab __________________________________
/home/trini/work/u-boot/u-boot/test/py/tests/test_android/test_ab.py:54: in ab_disk_image
    ???
/home/trini/work/u-boot/u-boot/test/py/tests/test_android/test_ab.py:28: in __init__
    ???
test/py/u_boot_utils.py:279: in __enter__
    self.module_filename = module.__file__
E   AttributeError: 'NoneType' object has no attribute '__file__'
=================================== short test summary info ===================================
ERROR test/py/tests/test_android/test_ab.py::test_ab - AttributeError: 'NoneType' object has...
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
============= 908 passed, 75 skipped, 10 deselected, 1 error in 159.17s (0:02:39) =============

Now, funny things. If I git clean -dfx, I can then get that test to
pass.  So I guess something else isn't cleaning up / is writing to a
common area? I intentionally build within the source tree, but in a
subdirectory of that, and indeed a lot of tests write to the source
directory itself.

-- 
Tom


* Re: [PATCH] CI: Add automatic retry for test.py jobs
  2023-07-12  2:33 [PATCH] CI: Add automatic retry for test.py jobs Tom Rini
  2023-07-12 14:00 ` Simon Glass
@ 2023-07-21  1:30 ` Tom Rini
  1 sibling, 0 replies; 13+ messages in thread
From: Tom Rini @ 2023-07-21  1:30 UTC (permalink / raw)
  To: u-boot

On Tue, Jul 11, 2023 at 10:33:03PM -0400, Tom Rini wrote:

> It is not uncommon for some of the QEMU-based jobs to fail not because
> of a code issue but rather because of a timing issue or similar problem
> that is out of our control. Make use of the keywords that Azure and
> GitLab provide so that we will automatically re-run these when they fail
> 2 times. If they fail that often it is likely we have found a real issue
> to investigate.
> 
> Signed-off-by: Tom Rini <trini@konsulko.com>
> Reviewed-by: Simon Glass <sjg@chromium.org>

Applied to u-boot/master, thanks!

-- 
Tom


* Re: [PATCH] CI: Add automatic retry for test.py jobs
  2023-07-16 18:17               ` Tom Rini
@ 2023-07-27 19:18                 ` Simon Glass
  2023-07-27 20:35                   ` Tom Rini
  0 siblings, 1 reply; 13+ messages in thread
From: Simon Glass @ 2023-07-27 19:18 UTC (permalink / raw)
  To: Tom Rini; +Cc: u-boot

Hi Tom,

On Sun, 16 Jul 2023 at 12:18, Tom Rini <trini@konsulko.com> wrote:
>
> On Sat, Jul 15, 2023 at 05:40:25PM -0600, Simon Glass wrote:
> > Hi Tom,
> >
> > On Thu, 13 Jul 2023 at 15:57, Tom Rini <trini@konsulko.com> wrote:
> > >
> > > On Thu, Jul 13, 2023 at 03:03:57PM -0600, Simon Glass wrote:
> > > > Hi Tom,
> > > >
> > > > On Wed, 12 Jul 2023 at 14:38, Tom Rini <trini@konsulko.com> wrote:
> > > > >
> > > > > On Wed, Jul 12, 2023 at 02:32:18PM -0600, Simon Glass wrote:
> > > > > > Hi Tom,
> > > > > >
> > > > > > On Wed, 12 Jul 2023 at 11:09, Tom Rini <trini@konsulko.com> wrote:
> > > > > > >
> > > > > > > On Wed, Jul 12, 2023 at 08:00:23AM -0600, Simon Glass wrote:
> > > > > > > > Hi Tom,
> > > > > > > >
> > > > > > > > On Tue, 11 Jul 2023 at 20:33, Tom Rini <trini@konsulko.com> wrote:
> > > > > > > > >
> > > > > > > > > It is not uncommon for some of the QEMU-based jobs to fail not because
> > > > > > > > > of a code issue but rather because of a timing issue or similar problem
> > > > > > > > > that is out of our control. Make use of the keywords that Azure and
> > > > > > > > > GitLab provide so that we will automatically re-run these when they fail
> > > > > > > > > 2 times. If they fail that often it is likely we have found a real issue
> > > > > > > > > to investigate.
> > > > > > > > >
> > > > > > > > > Signed-off-by: Tom Rini <trini@konsulko.com>
> > > > > > > > > ---
> > > > > > > > >  .azure-pipelines.yml | 1 +
> > > > > > > > >  .gitlab-ci.yml       | 1 +
> > > > > > > > >  2 files changed, 2 insertions(+)
> > > > > > > >
> > > > > > > > This seems like a slippery slope. Do we know why things fail? I wonder
> > > > > > > > if we should disable the tests / builders instead, until it can be
> > > > > > > > corrected?
> > > > > > >
> > > > > > > It happens in Azure, so it's not just the broken runner problem we have
> > > > > > > in GitLab. And the problem is timing, as I said in the commit.
> > > > > > > Sometimes we still get the RTC test failing. Other times we don't get
> > > > > > > QEMU + U-Boot spawned in time (most often m68k, but sometimes x86).
> > > > > >
> > > > > > How do we keep this list from growing?
> > > > >
> > > > > Do we need to? The problem is in essence since we rely on free
> > > > > resources, sometimes some heavy lifts take longer.  That's what this
> > > > > flag is for.
> > > >
> > > > I'm fairly sure the RTC thing could be made deterministic.
> > >
> > > We've already tried that once, and it happens a lot less often. If we
> > > make it even looser we risk making the test itself useless.
> >
> > For sleep, yes, but for rtc it should be deterministic now...next time
> > you get a failure could you send me the trace?
>
> Found one:
> https://dev.azure.com/u-boot/u-boot/_build/results?buildId=6592&view=logs&j=b6c47816-145c-5bfe-20a7-c6a2572e6c41&t=0929c28c-6e32-5635-9624-54eaa917d713&l=599

I don't seem to have access to that...but it is rtc or sleep?

>
> And note that we have a different set of timeout problems that may or may not
> be configurable, which is in the upload of the pytest results. I haven't seen
> if there's a knob for this one yet, within Azure (or the python project we're
> adding for it).

Oh dear.

>
> > > > The spawning thing...is there a timeout for that? What actually fails?
> > >
> > > It doesn't spawn in time for the framework to get to the prompt.  We
> > > could maybe increase the timeout value.  It's always the version test
> > > that fails.
> >
> > Ah OK, yes increasing the timeout makes sense.
> >
> > >
> > > > > > > > I'll note that we don't have this problem with sandbox tests.
> > > > > > >
> > > > > > > OK, but that's not relevant?
> > > > > >
> > > > > > It is relevant to the discussion about using QEMU instead of sandbox,
> > > > > > e.g. with the TPM. I recall a discussion with Ilias a while back.
> > > > >
> > > > > I'm sure we could make sandbox take too long to start as well, if enough
> > > > > other things are going on with the system.  And sandbox has its own set
> > > > > of super frustrating issues instead, so I don't think this is a great
> > > > > argument to have right here (I have to run it in docker, to get around
> > > > > some application version requirements and exclude event_dump, bootmgr,
> > > > > abootimg and gpt tests, which could otherwise run, but fail for me).
> > > >
> > > > I haven't heard about this before. Is there anything that could be done?
> > >
> > > I have no idea what could be done about it since I believe all of them
> > > run fine in CI, including on this very host, when gitlab invokes it
> > > rather than when I invoke it. My point here is that sandbox tests are
> > > just a different kind of picky about things and need their own kind of
> > > "just hit retry".
> >
> > Perhaps this is Python dependencies? I'm not sure, but if you see it
> > again, please let me know in case we can actually fix this.
>
> Alright. So the first pass I took at running sandbox pytest with as
> little hand-holding as possible I hit the known issue of /boot/vmlinu*
> being 0400 in Ubuntu. I fixed that and then re-ran and:
> test/py/tests/test_cleanup_build.py F
>
> ========================================== FAILURES ===========================================
> _________________________________________ test_clean __________________________________________
> test/py/tests/test_cleanup_build.py:94: in test_clean
>     assert not leftovers, f"leftovers: {', '.join(map(str, leftovers))}"
> E   AssertionError: leftovers: fdt-out.dtb, sha1-pad/sandbox-u-boot.dtb, sha1-pad/sandbox-kernel.dtb, sha1-basic/sandbox-u-boot.dtb, sha1-basic/sandbox-kernel.dtb, sha384-basic/sandbox-u-boot.dtb, sha384-basic/sandbox-kernel.dtb, algo-arg/sandbox-u-boot.dtb, algo-arg/sandbox-kernel.dtb, sha1-pss/sandbox-u-boot.dtb, sha1-pss/sandbox-kernel.dtb, sha256-pad/sandbox-u-boot.dtb, sha256-pad/sandbox-kernel.dtb, sha256-global-sign/sandbox-binman.dtb, sha256-global-sign/sandbox-u-boot.dtb, sha256-global-sign/sandbox-u-boot-global.dtb, sha256-global-sign/sandbox-kernel.dtb, sha256-global-sign-pss/sandbox-binman-pss.dtb, sha256-global-sign-pss/sandbox-u-boot.dtb, sha256-global-sign-pss/sandbox-kernel.dtb, sha256-global-sign-pss/sandbox-u-boot-global-pss.dtb, auto_fit/dt-1.dtb, auto_fit/dt-2.dtb, sha256-pss/sandbox-u-boot.dtb, sha256-pss/sandbox-kernel.dtb, sha256-pss-pad/sandbox-u-boot.dtb, sha256-pss-pad/sandbox-kernel.dtb, hashes/sandbox-kernel.dtb, sha256-basic/sandbox-u-boot.dtb, sha256-basic/sandbox-kernel.dtb, sha1-pss-pad/sandbox-u-boot.dtb, sha1-pss-pad/sandbox-kernel.dtb, sha384-pad/sandbox-u-boot.dtb, sha384-pad/sandbox-kernel.dtb, sha256-pss-pad-required/sandbox-u-boot.dtb, sha256-pss-pad-required/sandbox-kernel.dtb, ecdsa/sandbox-kernel.dtb, sha256-pss-required/sandbox-u-boot.dtb, sha256-pss-required/sandbox-kernel.dtb
> E   assert not [PosixPath('fdt-out.dtb'), PosixPath('sha1-pad/sandbox-u-boot.dtb'), PosixPath('sha1-pad/sandbox-kernel.dtb'), PosixPa...ic/sandbox-u-boot.dtb'), PosixPath('sha1-basic/sandbox-kernel.dtb'), PosixPath('sha384-basic/sandbox-u-boot.dtb'), ...]
> ------------------------------------ Captured stdout call -------------------------------------
> +make O=/tmp/pytest-of-trini/pytest-231/test_clean0 clean
> make[1]: Entering directory '/tmp/pytest-of-trini/pytest-231/test_clean0'
>   CLEAN   cmd
>   CLEAN   dts/../arch/sandbox/dts
>   CLEAN   dts
>   CLEAN   lib
>   CLEAN   tools
>   CLEAN   tools/generated
>   CLEAN   include/bmp_logo.h include/bmp_logo_data.h include/generated/env.in include/generated/env.txt drivers/video/u_boot_logo.S u-boot u-boot-dtb.bin u-boot-initial-env u-boot-nodtb.bin u-boot.bin u-boot.cfg u-boot.dtb u-boot.dtb.gz u-boot.dtb.out u-boot.dts u-boot.lds u-boot.map u-boot.srec u-boot.sym System.map image.map keep-syms-lto.c lib/efi_loader/helloworld_efi.S
> make[1]: Leaving directory '/tmp/pytest-of-trini/pytest-231/test_clean0'
> =================================== short test summary info ===================================
> FAILED test/py/tests/test_cleanup_build.py::test_clean - AssertionError: leftovers: fdt-out....
> !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> ================================= 1 failed, 6 passed in 6.42s =================================

That test never passes for me locally, because as you say we add a lot
of files to the build directory and there is no tracking of them such
that 'make clean' could remove them. We could fix that, e.g.:

1. Have binman record all its output filenames in a binman.clean file
2. Have tests always use a 'testfiles' subdir for files they create (see
   the sketch below)
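
For (2), one way to approximate it with stock pytest is the tmp_path
fixture, which gives every test a private directory outside the build and
source trees (a hypothetical test, not existing code):

    def test_makes_images(tmp_path):
        """Hypothetical test writing its generated files to a private dir."""
        out = tmp_path / 'testfiles'
        out.mkdir()
        dtb = out / 'sandbox-kernel.dtb'
        dtb.write_bytes(b'\xd0\x0d\xfe\xed')   # placeholder blob, not a real dtb
        # Nothing lands in the build or source tree, so there is nothing
        # for test_clean to report as a leftover.
        assert dtb.exists()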

>
> Fixing that manually with an rm -rf of /tmp/pytest-of-trini and now it's
> stuck.  I've rm -rf'd that and git clean -dfx and just repeat that
> failure.  I'm hopeful that when I reboot whatever magic is broken will
> be cleaned out.  Moving things in to a docker container again, I get:
> =========================================== ERRORS ============================================
> _______________________________ ERROR at setup of test_gpt_read _______________________________
> /home/trini/work/u-boot/u-boot/test/py/tests/test_gpt.py:74: in state_disk_image
>     ???
> /home/trini/work/u-boot/u-boot/test/py/tests/test_gpt.py:37: in __init__
>     ???
> test/py/u_boot_utils.py:279: in __enter__
>     self.module_filename = module.__file__
> E   AttributeError: 'NoneType' object has no attribute '__file__'
> =================================== short test summary info ===================================
> ERROR test/py/tests/test_gpt.py::test_gpt_read - AttributeError: 'NoneType' object has no at...
> !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> ========================== 41 passed, 45 skipped, 1 error in 19.29s ===========================
>
> And then ignoring that one with "-k not gpt":
> test/py/tests/test_android/test_ab.py E
>
> =========================================== ERRORS ============================================
> __________________________________ ERROR at setup of test_ab __________________________________
> /home/trini/work/u-boot/u-boot/test/py/tests/test_android/test_ab.py:54: in ab_disk_image
>     ???
> /home/trini/work/u-boot/u-boot/test/py/tests/test_android/test_ab.py:28: in __init__
>     ???
> test/py/u_boot_utils.py:279: in __enter__
>     self.module_filename = module.__file__
> E   AttributeError: 'NoneType' object has no attribute '__file__'
> =================================== short test summary info ===================================
> ERROR test/py/tests/test_android/test_ab.py::test_ab - AttributeError: 'NoneType' object has...
> !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> ============= 908 passed, 75 skipped, 10 deselected, 1 error in 159.17s (0:02:39) =============

These two are the same error. It looks like somehow it is unable to
obtain the module with:

        frame = inspect.stack()[1]
        module = inspect.getmodule(frame[0])

i.e. module is None
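
If so, a defensive fallback is possible, since getmodule() can
legitimately return None; something like this (a sketch only, not tested
against our harness):

    import inspect

    def caller_module_filename():
        """Sketch: filename of the caller's module, with a fallback."""
        frame = inspect.stack()[1]
        module = inspect.getmodule(frame[0])
        if module is not None:
            return module.__file__
        # getmodule() returns None when the frame's module cannot be
        # mapped back to an importable name; fall back to the filename
        # recorded in the frame itself.
        return frame[0].f_globals.get('__file__', frame.filename)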

+Stephen Warren who may know

What Python version is this?

>
> Now, funny things. If I git clean -dfx, I can then get that test to
> pass.  So I guess something else isn't cleaning up / is writing to a
> common area? I intentionally build within the source tree, but in a
> subdirectory of that, and indeed a lot of tests write to the source
> directory itself.

Wow that really is strange. The logic in that class is pretty clever.
Do you see a message like 'Waiting for generated file timestamp to
increase' at any point?

BTW these problems don't have anything to do with sandbox, which I
think was your original complaint. The more stuff we bring into tests
(Python included) the harder it gets.

Regards,
Simon


* Re: [PATCH] CI: Add automatic retry for test.py jobs
  2023-07-27 19:18                 ` Simon Glass
@ 2023-07-27 20:35                   ` Tom Rini
  2023-08-17 13:41                     ` Simon Glass
  0 siblings, 1 reply; 13+ messages in thread
From: Tom Rini @ 2023-07-27 20:35 UTC (permalink / raw)
  To: Simon Glass; +Cc: u-boot

On Thu, Jul 27, 2023 at 01:18:12PM -0600, Simon Glass wrote:
> Hi Tom,
> 
> On Sun, 16 Jul 2023 at 12:18, Tom Rini <trini@konsulko.com> wrote:
> >
> > On Sat, Jul 15, 2023 at 05:40:25PM -0600, Simon Glass wrote:
> > > Hi Tom,
> > >
> > > On Thu, 13 Jul 2023 at 15:57, Tom Rini <trini@konsulko.com> wrote:
> > > >
> > > > On Thu, Jul 13, 2023 at 03:03:57PM -0600, Simon Glass wrote:
> > > > > Hi Tom,
> > > > >
> > > > > On Wed, 12 Jul 2023 at 14:38, Tom Rini <trini@konsulko.com> wrote:
> > > > > >
> > > > > > On Wed, Jul 12, 2023 at 02:32:18PM -0600, Simon Glass wrote:
> > > > > > > Hi Tom,
> > > > > > >
> > > > > > > On Wed, 12 Jul 2023 at 11:09, Tom Rini <trini@konsulko.com> wrote:
> > > > > > > >
> > > > > > > > On Wed, Jul 12, 2023 at 08:00:23AM -0600, Simon Glass wrote:
> > > > > > > > > Hi Tom,
> > > > > > > > >
> > > > > > > > > On Tue, 11 Jul 2023 at 20:33, Tom Rini <trini@konsulko.com> wrote:
> > > > > > > > > >
> > > > > > > > > > It is not uncommon for some of the QEMU-based jobs to fail not because
> > > > > > > > > > of a code issue but rather because of a timing issue or similar problem
> > > > > > > > > > that is out of our control. Make use of the keywords that Azure and
> > > > > > > > > > GitLab provide so that we will automatically re-run these when they fail
> > > > > > > > > > 2 times. If they fail that often it is likely we have found a real issue
> > > > > > > > > > to investigate.
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Tom Rini <trini@konsulko.com>
> > > > > > > > > > ---
> > > > > > > > > >  .azure-pipelines.yml | 1 +
> > > > > > > > > >  .gitlab-ci.yml       | 1 +
> > > > > > > > > >  2 files changed, 2 insertions(+)
> > > > > > > > >
> > > > > > > > > This seems like a slippery slope. Do we know why things fail? I wonder
> > > > > > > > > if we should disable the tests / builders instead, until it can be
> > > > > > > > > corrected?
> > > > > > > >
> > > > > > > > It happens in Azure, so it's not just the broken runner problem we have
> > > > > > > > in GitLab. And the problem is timing, as I said in the commit.
> > > > > > > > Sometimes we still get the RTC test failing. Other times we don't get
> > > > > > > > QEMU + U-Boot spawned in time (most often m68k, but sometimes x86).
> > > > > > >
> > > > > > > How do we keep this list from growing?
> > > > > >
> > > > > > Do we need to? The problem is in essence since we rely on free
> > > > > > resources, sometimes some heavy lifts take longer.  That's what this
> > > > > > flag is for.
> > > > >
> > > > > I'm fairly sure the RTC thing could be made deterministic.
> > > >
> > > > We've already tried that once, and it happens a lot less often. If we
> > > > make it even looser we risk making the test itself useless.
> > >
> > > For sleep, yes, but for rtc it should be deterministic now...next time
> > > you get a failure could you send me the trace?
> >
> > Found one:
> > https://dev.azure.com/u-boot/u-boot/_build/results?buildId=6592&view=logs&j=b6c47816-145c-5bfe-20a7-c6a2572e6c41&t=0929c28c-6e32-5635-9624-54eaa917d713&l=599
> 
> I don't seem to have access to that...but it is rtc or sleep?

It was the RTC one, and has since rolled off and been deleted.

> > And note that we have a different set of timeout problems that may or may not
> > be configurable, which is in the upload of the pytest results. I haven't seen
> > if there's a knob for this one yet, within Azure (or the python project we're
> > adding for it).
> 
> Oh dear.
> 
> >
> > > > > The spawning thing...is there a timeout for that? What actually fails?
> > > >
> > > > It doesn't spawn in time for the framework to get to the prompt.  We
> > > > could maybe increase the timeout value.  It's always the version test
> > > > that fails.
> > >
> > > Ah OK, yes increasing the timeout makes sense.
> > >
> > > >
> > > > > > > > > I'll note that we don't have this problem with sandbox tests.
> > > > > > > >
> > > > > > > > OK, but that's not relevant?
> > > > > > >
> > > > > > > It is relevant to the discussion about using QEMU instead of sandbox,
> > > > > > > e.g. with the TPM. I recall a discussion with Ilias a while back.
> > > > > >
> > > > > > I'm sure we could make sandbox take too long to start as well, if enough
> > > > > > other things are going on with the system.  And sandbox has its own set
> > > > > > of super frustrating issues instead, so I don't think this is a great
> > > > > > argument to have right here (I have to run it in docker, to get around
> > > > > > some application version requirements and exclude event_dump, bootmgr,
> > > > > > abootimg and gpt tests, which could otherwise run, but fail for me).
> > > > >
> > > > > I haven't heard about this before. Is there anything that could be done?
> > > >
> > > > I have no idea what could be done about it since I believe all of them
> > > > run fine in CI, including on this very host, when gitlab invokes it
> > > > rather than when I invoke it. My point here is that sandbox tests are
> > > > just a different kind of picky about things and need their own kind of
> > > > "just hit retry".
> > >
> > > Perhaps this is Python dependencies? I'm not sure, but if you see it
> > > again, please let me know in case we can actually fix this.
> >
> > Alright. So the first pass I took at running sandbox pytest with as
> > little hand-holding as possible I hit the known issue of /boot/vmlinu*
> > being 0400 in Ubuntu. I fixed that and then re-ran and:
> > test/py/tests/test_cleanup_build.py F
> >
> > ========================================== FAILURES ===========================================
> > _________________________________________ test_clean __________________________________________
> > test/py/tests/test_cleanup_build.py:94: in test_clean
> >     assert not leftovers, f"leftovers: {', '.join(map(str, leftovers))}"
> > E   AssertionError: leftovers: fdt-out.dtb, sha1-pad/sandbox-u-boot.dtb, sha1-pad/sandbox-kernel.dtb, sha1-basic/sandbox-u-boot.dtb, sha1-basic/sandbox-kernel.dtb, sha384-basic/sandbox-u-boot.dtb, sha384-basic/sandbox-kernel.dtb, algo-arg/sandbox-u-boot.dtb, algo-arg/sandbox-kernel.dtb, sha1-pss/sandbox-u-boot.dtb, sha1-pss/sandbox-kernel.dtb, sha256-pad/sandbox-u-boot.dtb, sha256-pad/sandbox-kernel.dtb, sha256-global-sign/sandbox-binman.dtb, sha256-global-sign/sandbox-u-boot.dtb, sha256-global-sign/sandbox-u-boot-global.dtb, sha256-global-sign/sandbox-kernel.dtb, sha256-global-sign-pss/sandbox-binman-pss.dtb, sha256-global-sign-pss/sandbox-u-boot.dtb, sha256-global-sign-pss/sandbox-kernel.dtb, sha256-global-sign-pss/sandbox-u-boot-global-pss.dtb, auto_fit/dt-1.dtb, auto_fit/dt-2.dtb, sha256-pss/sandbox-u-boot.dtb, sha256-pss/sandbox-kernel.dtb, sha256-pss-pad/sandbox-u-boot.dtb, sha256-pss-pad/sandbox-kernel.dtb, hashes/sandbox-kernel.dtb, sha256-basic/sandbox-u-boot.dtb, sha256-basic/sandbox-kernel.dtb, sha1-pss-pad/sandbox-u-boot.dtb, sha1-pss-pad/sandbox-kernel.dtb, sha384-pad/sandbox-u-boot.dtb, sha384-pad/sandbox-kernel.dtb, sha256-pss-pad-required/sandbox-u-boot.dtb, sha256-pss-pad-required/sandbox-kernel.dtb, ecdsa/sandbox-kernel.dtb, sha256-pss-required/sandbox-u-boot.dtb, sha256-pss-required/sandbox-kernel.dtb
> > E   assert not [PosixPath('fdt-out.dtb'), PosixPath('sha1-pad/sandbox-u-boot.dtb'), PosixPath('sha1-pad/sandbox-kernel.dtb'), PosixPa...ic/sandbox-u-boot.dtb'), PosixPath('sha1-basic/sandbox-kernel.dtb'), PosixPath('sha384-basic/sandbox-u-boot.dtb'), ...]
> > ------------------------------------ Captured stdout call -------------------------------------
> > +make O=/tmp/pytest-of-trini/pytest-231/test_clean0 clean
> > make[1]: Entering directory '/tmp/pytest-of-trini/pytest-231/test_clean0'
> >   CLEAN   cmd
> >   CLEAN   dts/../arch/sandbox/dts
> >   CLEAN   dts
> >   CLEAN   lib
> >   CLEAN   tools
> >   CLEAN   tools/generated
> >   CLEAN   include/bmp_logo.h include/bmp_logo_data.h include/generated/env.in include/generated/env.txt drivers/video/u_boot_logo.S u-boot u-boot-dtb.bin u-boot-initial-env u-boot-nodtb.bin u-boot.bin u-boot.cfg u-boot.dtb u-boot.dtb.gz u-boot.dtb.out u-boot.dts u-boot.lds u-boot.map u-boot.srec u-boot.sym System.map image.map keep-syms-lto.c lib/efi_loader/helloworld_efi.S
> > make[1]: Leaving directory '/tmp/pytest-of-trini/pytest-231/test_clean0'
> > =================================== short test summary info ===================================
> > FAILED test/py/tests/test_cleanup_build.py::test_clean - AssertionError: leftovers: fdt-out....
> > !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> > ================================= 1 failed, 6 passed in 6.42s =================================
> 
> That test never passes for me locally, because as you say we add a lot
> of files to the build directory and there is no tracking of them such
> that 'make clean' could remove them. We could fix that, e.g.:
> 
> 1. Have binman record all its output filenames in a binman.clean file
> 2. Have tests always use a 'testfiles' subdir for files they create

It sounds like this is showing some bugs in how we use binman since
"make clean" should result in a clean tree, and I believe we get a few
patches now and again about removing leftover files.

> > Fixing that manually with an rm -rf of /tmp/pytest-of-trini and now it's
> > stuck.  I've rm -rf'd that and git clean -dfx and just repeat that
> > failure.  I'm hopeful that when I reboot whatever magic is broken will
> > be cleaned out.  Moving things in to a docker container again, I get:
> > =========================================== ERRORS ============================================
> > _______________________________ ERROR at setup of test_gpt_read _______________________________
> > /home/trini/work/u-boot/u-boot/test/py/tests/test_gpt.py:74: in state_disk_image
> >     ???
> > /home/trini/work/u-boot/u-boot/test/py/tests/test_gpt.py:37: in __init__
> >     ???
> > test/py/u_boot_utils.py:279: in __enter__
> >     self.module_filename = module.__file__
> > E   AttributeError: 'NoneType' object has no attribute '__file__'
> > =================================== short test summary info ===================================
> > ERROR test/py/tests/test_gpt.py::test_gpt_read - AttributeError: 'NoneType' object has no at...
> > !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> > ========================== 41 passed, 45 skipped, 1 error in 19.29s ===========================
> >
> > And then ignoring that one with "-k not gpt":
> > test/py/tests/test_android/test_ab.py E
> >
> > =========================================== ERRORS ============================================
> > __________________________________ ERROR at setup of test_ab __________________________________
> > /home/trini/work/u-boot/u-boot/test/py/tests/test_android/test_ab.py:54: in ab_disk_image
> >     ???
> > /home/trini/work/u-boot/u-boot/test/py/tests/test_android/test_ab.py:28: in __init__
> >     ???
> > test/py/u_boot_utils.py:279: in __enter__
> >     self.module_filename = module.__file__
> > E   AttributeError: 'NoneType' object has no attribute '__file__'
> > =================================== short test summary info ===================================
> > ERROR test/py/tests/test_android/test_ab.py::test_ab - AttributeError: 'NoneType' object has...
> > !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> > ============= 908 passed, 75 skipped, 10 deselected, 1 error in 159.17s (0:02:39) =============
> 
> These two are the same error. It looks like somehow it is unable to
> obtain the module with:
> 
>         frame = inspect.stack()[1]
>         module = inspect.getmodule(frame[0])
> 
> i.e. module is None
> 
> +Stephen Warren who may know
> 
> What Python version is this?

It's the docker container we use for CI, where these tests pass every
time they're run normally, whatever is in Ubuntu "Jammy".
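
For what it's worth, a defensive fallback in that helper would at least
avoid the AttributeError when inspect.getmodule() comes back with None.
This is only a sketch built around the two lines quoted above, not the
actual test/py/u_boot_utils.py code:

import inspect

def caller_module_filename():
    """Sketch: best-effort filename of the calling module."""
    frame = inspect.stack()[1]
    module = inspect.getmodule(frame[0])
    if module is not None and getattr(module, '__file__', None):
        return module.__file__
    # inspect.getmodule() can return None in some environments; fall back
    # to the filename recorded in the stack frame itself.
    return frame.filename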

> > Now, funny things. If I git clean -dfx, I can then get that test to
> > pass.  So I guess something else isn't cleaning up / is writing to a
> > common area? I intentionally build within the source tree, but in a
> > subdirectory of that, and indeed a lot of tests write to the source
> > directory itself.
> 
> Wow that really is strange. The logic in that class is pretty clever.
> Do you see a message like 'Waiting for generated file timestamp to
> increase' at any point?
> 
> BTW these problems don't have anything to do with sandbox, which I
> think was your original complaint. The more stuff we bring into tests
> (Python included) the harder it gets.

The original complaint, as I saw it, was that "sandbox pytests don't
randomly fail".  My point is that sandbox pytests randomly fail all the
time.  QEMU isn't any worse.  I can't say it's better, since my local
loop is sandbox for sanity and then on to hardware.

-- 
Tom


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] CI: Add automatic retry for test.py jobs
  2023-07-27 20:35                   ` Tom Rini
@ 2023-08-17 13:41                     ` Simon Glass
  0 siblings, 0 replies; 13+ messages in thread
From: Simon Glass @ 2023-08-17 13:41 UTC (permalink / raw)
  To: Tom Rini; +Cc: u-boot

Hi Tom,

On Thu, 27 Jul 2023 at 14:35, Tom Rini <trini@konsulko.com> wrote:
>
> On Thu, Jul 27, 2023 at 01:18:12PM -0600, Simon Glass wrote:
> > Hi Tom,
> >
> > On Sun, 16 Jul 2023 at 12:18, Tom Rini <trini@konsulko.com> wrote:
> > >
> > > On Sat, Jul 15, 2023 at 05:40:25PM -0600, Simon Glass wrote:
> > > > Hi Tom,
> > > >
> > > > On Thu, 13 Jul 2023 at 15:57, Tom Rini <trini@konsulko.com> wrote:
> > > > >
> > > > > On Thu, Jul 13, 2023 at 03:03:57PM -0600, Simon Glass wrote:
> > > > > > Hi Tom,
> > > > > >
> > > > > > On Wed, 12 Jul 2023 at 14:38, Tom Rini <trini@konsulko.com> wrote:
> > > > > > >
> > > > > > > On Wed, Jul 12, 2023 at 02:32:18PM -0600, Simon Glass wrote:
> > > > > > > > Hi Tom,
> > > > > > > >
> > > > > > > > On Wed, 12 Jul 2023 at 11:09, Tom Rini <trini@konsulko.com> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, Jul 12, 2023 at 08:00:23AM -0600, Simon Glass wrote:
> > > > > > > > > > Hi Tom,
> > > > > > > > > >
> > > > > > > > > > On Tue, 11 Jul 2023 at 20:33, Tom Rini <trini@konsulko.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > It is not uncommon for some of the QEMU-based jobs to fail not because
> > > > > > > > > > > of a code issue but rather because of a timing issue or similar problem
> > > > > > > > > > > that is out of our control. Make use of the keywords that Azure and
> > > > > > > > > > > GitLab provide so that we will automatically re-run these when they fail
> > > > > > > > > > > 2 times. If they fail that often it is likely we have found a real issue
> > > > > > > > > > > to investigate.
> > > > > > > > > > >
> > > > > > > > > > > Signed-off-by: Tom Rini <trini@konsulko.com>
> > > > > > > > > > > ---
> > > > > > > > > > >  .azure-pipelines.yml | 1 +
> > > > > > > > > > >  .gitlab-ci.yml       | 1 +
> > > > > > > > > > >  2 files changed, 2 insertions(+)
> > > > > > > > > >
> > > > > > > > > > This seems like a slippery slope. Do we know why things fail? I wonder
> > > > > > > > > > if we should disable the tests / builders instead, until it can be
> > > > > > > > > > corrected?
> > > > > > > > >
> > > > > > > > > It happens in Azure, so it's not just the broken runner problem we have
> > > > > > > > > in GitLab. And the problem is timing, as I said in the commit.
> > > > > > > > > Sometimes we still get the RTC test failing. Other times we don't get
> > > > > > > > > QEMU + U-Boot spawned in time (most often m68k, but sometimes x86).
> > > > > > > >
> > > > > > > > How do we keep this list from growing?
> > > > > > >
> > > > > > > Do we need to? The problem, in essence, is that since we rely on free
> > > > > > > resources, some heavy lifts sometimes take longer.  That's what this
> > > > > > > flag is for.
> > > > > >
> > > > > > I'm fairly sure the RTC thing could be made deterministic.
> > > > >
> > > > > We've already tried that once, and it happens a lot less often. If we
> > > > > make it even looser we risk making the test itself useless.
> > > >
> > > > For sleep, yes, but for rtc it should be deterministic now...next time
> > > > you get a failure could you send me the trace?
> > >
> > > Found one:
> > > https://dev.azure.com/u-boot/u-boot/_build/results?buildId=6592&view=logs&j=b6c47816-145c-5bfe-20a7-c6a2572e6c41&t=0929c28c-6e32-5635-9624-54eaa917d713&l=599
> >
> > I don't seem to have access to that... but is it rtc or sleep?
>
> It was the RTC one, and has since rolled off and been deleted.
>
> > > And note that we have a different set of timeout problems, which may or may
> > > not be configurable, in the upload of the pytest results. I haven't seen
> > > whether there's a knob for this one yet within Azure (or the Python project
> > > we're adding for it).
> >
> > Oh dear.
> >
> > >
> > > > > > The spawning thing...is there a timeout for that? What actually fails?
> > > > >
> > > > > It doesn't spawn in time for the framework to get to the prompt.  We
> > > > > could maybe increase the timeout value.  It's always the version test
> > > > > that fails.
> > > >
> > > > Ah OK, yes increasing the timeout makes sense.
> > > >
> > > > >
> > > > > > > > > > I'll note that we don't have this problem with sandbox tests.
> > > > > > > > >
> > > > > > > > > OK, but that's not relevant?
> > > > > > > >
> > > > > > > > It is relevant to the discussion about using QEMU instead of sandbox,
> > > > > > > > e.g. with the TPM. I recall a discussion with Ilias a while back.
> > > > > > >
> > > > > > > I'm sure we could make sandbox take too long to start as well, if enough
> > > > > > > other things are going on with the system.  And sandbox has its own set
> > > > > > > of super frustrating issues instead, so I don't think this is a great
> > > > > > > argument to have right here (I have to run it in docker, to get around
> > > > > > > some application version requirements and exclude event_dump, bootmgr,
> > > > > > > abootimg and gpt tests, which could otherwise run, but fail for me).
> > > > > >
> > > > > > I haven't heard about this before. Is there anything that could be done?
> > > > >
> > > > > I have no idea what could be done about it since I believe all of them
> > > > > run fine in CI, including on this very host, when gitlab invokes it
> > > > > rather than when I invoke it. My point here is that sandbox tests are
> > > > > just a different kind of picky about things and need their own kind of
> > > > > "just hit retry".
> > > >
> > > > Perhaps this is Python dependencies? I'm not sure, but if you see it
> > > > again, please let me know in case we can actually fix this.
> > >
> > > Alright. So on the first pass I took at running sandbox pytest with as
> > > little hand-holding as possible, I hit the known issue of /boot/vmlinu*
> > > being 0400 in Ubuntu. I fixed that, re-ran, and got:
> > > test/py/tests/test_cleanup_build.py F
> > >
> > > ========================================== FAILURES ===========================================
> > > _________________________________________ test_clean __________________________________________
> > > test/py/tests/test_cleanup_build.py:94: in test_clean
> > >     assert not leftovers, f"leftovers: {', '.join(map(str, leftovers))}"
> > > E   AssertionError: leftovers: fdt-out.dtb, sha1-pad/sandbox-u-boot.dtb, sha1-pad/sandbox-kernel.dtb, sha1-basic/sandbox-u-boot.dtb, sha1-basic/sandbox-kernel.dtb, sha384-basic/sandbox-u-boot.dtb, sha384-basic/sandbox-kernel.dtb, algo-arg/sandbox-u-boot.dtb, algo-arg/sandbox-kernel.dtb, sha1-pss/sandbox-u-boot.dtb, sha1-pss/sandbox-kernel.dtb, sha256-pad/sandbox-u-boot.dtb, sha256-pad/sandbox-kernel.dtb, sha256-global-sign/sandbox-binman.dtb, sha256-global-sign/sandbox-u-boot.dtb, sha256-global-sign/sandbox-u-boot-global.dtb, sha256-global-sign/sandbox-kernel.dtb, sha256-global-sign-pss/sandbox-binman-pss.dtb, sha256-global-sign-pss/sandbox-u-boot.dtb, sha256-global-sign-pss/sandbox-kernel.dtb, sha256-global-sign-pss/sandbox-u-boot-global-pss.dtb, auto_fit/dt-1.dtb, auto_fit/dt-2.dtb, sha256-pss/sandbox-u-boot.dtb, sha256-pss/sandbox-kernel.dtb, sha256-pss-pad/sandbox-u-boot.dtb, sha256-pss-pad/sandbox-kernel.dtb, hashes/sandbox-kernel.dtb, sha256-basic/sandbox-u-boot.dtb, sha256-basic/sandbox-kernel.dtb, sha1-pss-pad/sandbox-u-boot.dtb, sha1-pss-pad/sandbox-kernel.dtb, sha384-pad/sandbox-u-boot.dtb, sha384-pad/sandbox-kernel.dtb, sha256-pss-pad-required/sandbox-u-boot.dtb, sha256-pss-pad-required/sandbox-kernel.dtb, ecdsa/sandbox-kernel.dtb, sha256-pss-required/sandbox-u-boot.dtb, sha256-pss-required/sandbox-kernel.dtb
> > > E   assert not [PosixPath('fdt-out.dtb'), PosixPath('sha1-pad/sandbox-u-boot.dtb'), PosixPath('sha1-pad/sandbox-kernel.dtb'), PosixPa...ic/sandbox-u-boot.dtb'), PosixPath('sha1-basic/sandbox-kernel.dtb'), PosixPath('sha384-basic/sandbox-u-boot.dtb'), ...]
> > > ------------------------------------ Captured stdout call -------------------------------------
> > > +make O=/tmp/pytest-of-trini/pytest-231/test_clean0 clean
> > > make[1]: Entering directory '/tmp/pytest-of-trini/pytest-231/test_clean0'
> > >   CLEAN   cmd
> > >   CLEAN   dts/../arch/sandbox/dts
> > >   CLEAN   dts
> > >   CLEAN   lib
> > >   CLEAN   tools
> > >   CLEAN   tools/generated
> > >   CLEAN   include/bmp_logo.h include/bmp_logo_data.h include/generated/env.in include/generated/env.txt drivers/video/u_boot_logo.S u-boot u-boot-dtb.bin u-boot-initial-env u-boot-nodtb.bin u-boot.bin u-boot.cfg u-boot.dtb u-boot.dtb.gz u-boot.dtb.out u-boot.dts u-boot.lds u-boot.map u-boot.srec u-boot.sym System.map image.map keep-syms-lto.c lib/efi_loader/helloworld_efi.S
> > > make[1]: Leaving directory '/tmp/pytest-of-trini/pytest-231/test_clean0'
> > > =================================== short test summary info ===================================
> > > FAILED test/py/tests/test_cleanup_build.py::test_clean - AssertionError: leftovers: fdt-out....
> > > !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> > > ================================= 1 failed, 6 passed in 6.42s =================================
> >
> > That test never passes for me locally, because as you say we add a lot
> > of files to the build directory and there is no tracking of them such
> > that 'make clean' could remove them. We could fix that, e.g.:
> >
> > 1. Have binman record all its output filenames in a binman.clean file
> > 2. Have tests always use a 'testfiles' subdir for files they create
>
> It sounds like this is showing some bugs in how we use binman since
> "make clean" should result in a clean tree, and I believe we get a few
> patches now and again about removing leftover files.
>
> > > I fixed that manually with an rm -rf of /tmp/pytest-of-trini, and now it's
> > > stuck.  I've rm -rf'd that and run git clean -dfx, and I just get the same
> > > failure again.  I'm hopeful that when I reboot, whatever magic is broken
> > > will be cleaned out.  Moving things into a docker container again, I get:
> > > =========================================== ERRORS ============================================
> > > _______________________________ ERROR at setup of test_gpt_read _______________________________
> > > /home/trini/work/u-boot/u-boot/test/py/tests/test_gpt.py:74: in state_disk_image
> > >     ???
> > > /home/trini/work/u-boot/u-boot/test/py/tests/test_gpt.py:37: in __init__
> > >     ???
> > > test/py/u_boot_utils.py:279: in __enter__
> > >     self.module_filename = module.__file__
> > > E   AttributeError: 'NoneType' object has no attribute '__file__'
> > > =================================== short test summary info ===================================
> > > ERROR test/py/tests/test_gpt.py::test_gpt_read - AttributeError: 'NoneType' object has no at...
> > > !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> > > ========================== 41 passed, 45 skipped, 1 error in 19.29s ===========================
> > >
> > > And then ignoring that one with "-k not gpt":
> > > test/py/tests/test_android/test_ab.py E
> > >
> > > =========================================== ERRORS ============================================
> > > __________________________________ ERROR at setup of test_ab __________________________________
> > > /home/trini/work/u-boot/u-boot/test/py/tests/test_android/test_ab.py:54: in ab_disk_image
> > >     ???
> > > /home/trini/work/u-boot/u-boot/test/py/tests/test_android/test_ab.py:28: in __init__
> > >     ???
> > > test/py/u_boot_utils.py:279: in __enter__
> > >     self.module_filename = module.__file__
> > > E   AttributeError: 'NoneType' object has no attribute '__file__'
> > > =================================== short test summary info ===================================
> > > ERROR test/py/tests/test_android/test_ab.py::test_ab - AttributeError: 'NoneType' object has...
> > > !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> > > ============= 908 passed, 75 skipped, 10 deselected, 1 error in 159.17s (0:02:39) =============
> >
> > These two are the same error. It looks like somehow it is unable to
> > obtain the module with:
> >
> >         frame = inspect.stack()[1]
> >         module = inspect.getmodule(frame[0])
> >
> > i.e. module is None
> >
> > +Stephen Warren who may know
> >
> > What Python version is this?
>
> It's the docker container we use for CI, where these tests pass every
> time they're run normally, whatever is in Ubuntu "Jammy".
>
> > > Now, funny things. If I git clean -dfx, I can then get that test to
> > > pass.  So I guess something else isn't cleaning up / is writing to a
> > > common area? I intentionally build within the source tree, but in a
> > > subdirectory of that, and indeed a lot of tests write to the source
> > > directory itself.
> >
> > Wow that really is strange. The logic in that class is pretty clever.
> > Do you see a message like 'Waiting for generated file timestamp to
> > increase' at any point?
> >
> > BTW these problems don't have anything to do with sandbox, which I
> > think was your original complaint. The more stuff we bring into tests
> > (Python included) the harder it gets.
>
> The original complaint, as I saw it, was that "sandbox pytests don't
> randomly fail".  My point is that sandbox pytests randomly fail all the
> time.  QEMU isn't any worse.  I can't say it's better, since my local
> loop is sandbox for sanity and then on to hardware.

The way I see it, in terms of flakiness, speed and ease of debugging,
from best to worst, we have:

- sandbox
- Python wrapper around sandbox
- Python wrapper around QEMU

Regards,
Simon

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2023-08-17 13:43 UTC | newest]

Thread overview: 13+ messages
2023-07-12  2:33 [PATCH] CI: Add automatic retry for test.py jobs Tom Rini
2023-07-12 14:00 ` Simon Glass
2023-07-12 17:08   ` Tom Rini
2023-07-12 20:32     ` Simon Glass
2023-07-12 20:38       ` Tom Rini
2023-07-13 21:03         ` Simon Glass
2023-07-13 21:57           ` Tom Rini
2023-07-15 23:40             ` Simon Glass
2023-07-16 18:17               ` Tom Rini
2023-07-27 19:18                 ` Simon Glass
2023-07-27 20:35                   ` Tom Rini
2023-08-17 13:41                     ` Simon Glass
2023-07-21  1:30 ` Tom Rini
