u-boot.lists.denx.de archive mirror
* buildman stops (crashed) on current master
@ 2021-10-07 10:37 Stefano Babic
  2021-10-07 13:43 ` Simon Glass
  0 siblings, 1 reply; 12+ messages in thread
From: Stefano Babic @ 2021-10-07 10:37 UTC (permalink / raw)
  To: U-Boot; +Cc: Simon Glass

Hi all,

CI stops while building aarch64 without notice; for reference:

https://source.denx.de/u-boot/custodians/u-boot-imx/-/jobs/332319

There is no error; the process is just killed. It looks like it stops at
xilinx_zynqmp_virt,

./tools/buildman/buildman -o /tmp -P -E -W aarch64

but the board can be built without issues.

If I build on my host (not in docker, anyway), it generally builds fine
- but it crashes sometimes, too. On the gitlab instance, it crashes.
The issue does not seem to depend on the merged patches, and the newly
introduced boards were already built successfully. Any hints? I also have
no idea what I should look at, as all I see is just

"usr/bin/bash: line 104:    24 Killed 
./tools/buildman/buildman -o /tmp -P -E -W aarch64"

Regards,
Stefano


-- 
=====================================================================
DENX Software Engineering GmbH,      Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: +49-8142-66989-53 Fax: +49-8142-66989-80 Email: sbabic@denx.de
=====================================================================


* Re: buildman stops (crashed) on current master
  2021-10-07 10:37 buildman stops (crashed) on current master Stefano Babic
@ 2021-10-07 13:43 ` Simon Glass
  2021-10-07 14:10   ` Stefano Babic
  2021-10-19 15:39   ` Stefano Babic
  0 siblings, 2 replies; 12+ messages in thread
From: Simon Glass @ 2021-10-07 13:43 UTC (permalink / raw)
  To: Stefano Babic; +Cc: U-Boot

Hi Stefano,

On Thu, 7 Oct 2021 at 04:37, Stefano Babic <sbabic@denx.de> wrote:
>
> Hi all,
>
> CI stops while building aarch64 without notice; for reference:
>
> https://source.denx.de/u-boot/custodians/u-boot-imx/-/jobs/332319
>
> There is no error; the process is just killed. It looks like it stops at
> xilinx_zynqmp_virt,
>
> ./tools/buildman/buildman -o /tmp -P -E -W aarch64
>
> but the board can be built without issues.
>
> If I build on my host (not in docker, anyway), it generally builds fine
> - but it crashes sometimes, too. On the gitlab instance, it crashes.
> The issue does not seem to depend on the merged patches, and the newly
> introduced boards were already built successfully. Any hints? I also have
> no idea what I should look at, as all I see is just
>
> "usr/bin/bash: line 104:    24 Killed
> ./tools/buildman/buildman -o /tmp -P -E -W aarch64"

I cannot see that link. I am not sure what is going on. Does it say
what signal killed it?
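
For reference, bash prints "Killed" when a child dies from SIGKILL, and a shell exit status above 128 encodes the signal number. The following is only a minimal Python sketch of decoding such a status; the `describe_exit` helper and the self-killing child process are illustrative stand-ins, not part of buildman or of the CI scripts.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Return a human-readable description of a subprocess exit status."""
    if returncode < 0:
        # subprocess reports death-by-signal as the negated signal number
        sig = signal.Signals(-returncode)
        return f"killed by {sig.name} (shell exit status {128 + sig.value})"
    return f"exited with status {returncode}"


# Stand-in for the real build command: a child that sends SIGKILL to itself,
# which is the same signal the kernel OOM killer delivers.
child = [sys.executable, "-c",
         "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
result = subprocess.run(child)
print(describe_exit(result.returncode))
```

A CI wrapper could log a description like this next to the failing command, so that a bare "Killed" line is easier to interpret.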

Does it sit there for an hour and time out? If so, then I did see that
myself once recently, when the Kconfig needed stdin, but I could not
quite tie it down. I think buildman would provide it, but sometimes
not, apparently. So it can happen when there is an existing build
there and your new one adds Kconfig options that don't have
defaults, or something like that?

If that is it, you can repeat it by clearing out your .bm-work
directory then building just that board for one commit, then the next
(with the Kconfig change).

Buildman is supposed to handle this, of course. I'm not sure what has changed.

Regards,
Simon


* Re: buildman stops (crashed) on current master
  2021-10-07 13:43 ` Simon Glass
@ 2021-10-07 14:10   ` Stefano Babic
  2021-10-19 15:39   ` Stefano Babic
  1 sibling, 0 replies; 12+ messages in thread
From: Stefano Babic @ 2021-10-07 14:10 UTC (permalink / raw)
  To: Simon Glass, Stefano Babic; +Cc: U-Boot

Hi Simon,

On 07.10.21 15:43, Simon Glass wrote:
> I cannot see that link.

Verified with Wolfgang. The CI results are available only after logging
in to source.denx.de. Without an account, they are not shown and the
server returns "not found" (I was not aware of this).

> I am not sure what is going on. Does it say
> what signal killed it?

Nothing, which makes an investigation difficult. It looks like it is
tied to the runner or docker instance (a resource issue?), because if
I run the same buildman command on my host, it runs to the end and no
errors are reported:

    aarch64:  w+   xilinx_zynqmp_virt 

+===================== WARNING ======================
+This board uses CONFIG_SPL_FIT_GENERATOR. Please migrate
+to binman instead, to avoid the proliferation of
+arch-specific scripts with no tests.
+====================================================
+WARNING: BL31 file bl31.bin NOT found, U-Boot will run in EL3
   125  181    0 /306            xilinx_zynqmp_virt

gitlab stops exactly at the last board, as if all the work was done, but
it seems to stop after a timeout, and the pipeline is set to failed (just
a feeling).

> 
> Does it sit there for an hour and time out? If so, then I did see that
> myself once recently, when the Kconfig needed stdin, but I could not
> quite tie it down. I think buildman would provide it, but sometimes
> not, apparently. So it can happen when there is an existing build
> there and your new one adds Kconfig options that don't have
> defaults, or something like that?

But there is a new docker instance on the gitlab runner every time, so
this could happen only on a local host.

> 
> If that is it, you can repeat it by clearing out your .bm-work
> directory then building just that board for one commit, then the next
> (with the Kconfig change).
> 
> Buildman is supposed to handle this, of course. I'm not sure what has changed.

Ok, let's see.

Regards,
Stefano



* Re: buildman stops (crashed) on current master
  2021-10-07 13:43 ` Simon Glass
  2021-10-07 14:10   ` Stefano Babic
@ 2021-10-19 15:39   ` Stefano Babic
  2021-10-19 15:52     ` Simon Glass
  2021-10-19 22:53     ` Tom Rini
  1 sibling, 2 replies; 12+ messages in thread
From: Stefano Babic @ 2021-10-19 15:39 UTC (permalink / raw)
  To: Simon Glass, Stefano Babic; +Cc: U-Boot

Hi Simon,

On 07.10.21 15:43, Simon Glass wrote:
> I cannot see that link. I am not sure what is going on. Does it say
> what signal killed it?

Pipelines on our server were not public - I have now enabled them for u-boot-imx.

> 
> Does it sit there for an hour and time out? If so, then I did see that
> myself once recently, when the Kconfig needed stdin, but I could not
> quite tie it down. I think buildman would provide it, but sometimes
> not, apparently. So it can happen when there is an existing build
> there and your new one adds Kconfig options that don't have
> defaults, or something like that?
> 

I have investigated further, and I can reproduce it on my host outside
the gitlab server. buildman causes an OOM, but I cannot find the cause.

Strangely enough, this happens with the "aarch64" target, and I cannot
reproduce it with Tom's master. So it seems that -master is ok, and
something on u-boot-imx generates the OOM.

However....

The OOM always happens when -2 (two boards remaining) appears. I can see
with htop that buildman starts to allocate memory until it is exhausted
(64 GB RAM + 8 GB swap). Then the kernel decides that it is enough and
kills buildman - this is what I see in CI.
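
One way to make a runaway allocation fail early, instead of exhausting RAM and swap, is to cap the address space of the process before it starts. This is a rough sketch, assuming Linux and Python's standard `resource` module; the 1 GiB limit and the memory-hungry child are arbitrary placeholders, and nothing here is an existing buildman option.

```python
import resource
import subprocess
import sys

LIMIT_BYTES = 1 << 30  # 1 GiB address-space cap; an arbitrary example value


def limit_address_space():
    # Runs in the child just before exec; the limit is inherited by
    # whatever the child itself spawns.
    resource.setrlimit(resource.RLIMIT_AS, (LIMIT_BYTES, LIMIT_BYTES))


# Stand-in for a runaway process: try to allocate 4 GiB in one go.
hungry_child = [sys.executable, "-c", "data = bytearray(4 << 30)"]
result = subprocess.run(hungry_child, preexec_fn=limit_address_space,
                        capture_output=True, text=True)
print(result.returncode)
print(result.stderr.strip().splitlines()[-1])
```

With a cap like this, the failing process reports an allocation error of its own rather than being killed silently, which can make it easier to see which step is consuming the memory.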

You can see now the pipelines:

https://source.denx.de/u-boot/custodians/u-boot-imx/-/pipelines/9520

I have then split aarch64 and built imx8 separately - same result. The
pipeline stops at a xilinx board, but those boards have nothing to do
with it. In fact, I can build all xilinx boards separately. If I run
buildman -W aarch64 -x xilinx, the OOM shows up at another board.

Strangely enough, I can build every single board with buildman without
issues, with neither errors nor warnings. Only when buildman builds them
all together (aarch64, 308 boards) is the OOM generated.

Bisect does not help: I started bisect, and at the end this commit was 
presented:

commit 53a24dee86fb72ae41e7579607bafe13442616f2
Author: Fabio Estevam <festevam@denx.de>
Date:   Mon Aug 23 21:11:09 2021 -0300

     imx8mm-cl-iot-gate: Split the defconfigs


But it is a false lead: if I revert it, I get the issue again. And the
patch has nothing to do with it.

It looks to me like it is something in binman, maybe triggered by some
changes in the tree, but all boards can be built separately without
issues. I expected to find the cause in the code of the applied patches,
but because each board can be built and bisect is no help, I am quite
puzzled. I am holding off on sending a PR to Tom, since otherwise I guess
the problem goes into -master, but I do not know how to proceed, and I
have a lot of patches to be applied.

What can be done?

> If that is it, you can repeat it by clearing out your .bm-work

On gitlab, the build starts from scratch.

> directory then building just that board for one commit, then the next
> (with the Kconfig change).

I have run buildman for each single board, and all of them were successful.
With aarch64, I get the OOM from buildman.

> 
> Buildman is supposed to handle this, of course. I'm not sure what has changed.
> 

Regards,
Stefano


* Re: buildman stops (crashed) on current master
  2021-10-19 15:39   ` Stefano Babic
@ 2021-10-19 15:52     ` Simon Glass
  2021-10-19 20:10       ` Stefano Babic
  2021-10-19 22:53     ` Tom Rini
  1 sibling, 1 reply; 12+ messages in thread
From: Simon Glass @ 2021-10-19 15:52 UTC (permalink / raw)
  To: Stefano Babic; +Cc: U-Boot

Hi Stefano,

On Tue, 19 Oct 2021 at 09:39, Stefano Babic <sbabic@denx.de> wrote:
> > If that is it, you can repeat it by clearing out your .bm-work
>
> On gitlab, the build starts from scratch.

Can you check that there is definitely nothing around from the previous build?

>
> > directory then building just that board for one commit, then the next
> > (with the Kconfig change).
>
> I have run buildman for each single board, all of them were successful.
> With aarch64, I get OOM from buildman.
>
> >
> > Buildman is supposed to handle this, of course. I'm not sure what has changed.
> >

I still believe this is due to the reason I said, but I'm happy to be
proved wrong.

Regards,
Simon


* Re: buildman stops (crashed) on current master
  2021-10-19 15:52     ` Simon Glass
@ 2021-10-19 20:10       ` Stefano Babic
  0 siblings, 0 replies; 12+ messages in thread
From: Stefano Babic @ 2021-10-19 20:10 UTC (permalink / raw)
  To: Simon Glass, Stefano Babic; +Cc: U-Boot

Hi Simon,

On 19.10.21 17:52, Simon Glass wrote:
>>> If that is it, you can repeat it by clearing out your .bm-work
>>
>> On gitlab, the build starts from scratch.
> 
> Can you check that there is definitely nothing around from the previous build?

On my host, there is definitely something - because I cannot access the
docker image on the server, I installed a local runner on my PC. So I
can take a deeper look and jump into the container.

It runs, and I get the same issue. All boards are built, then it gets
stuck until the OOM happens and the kernel kills it.

The job for aarch64 is:

./tools/buildman/buildman -o /tmp -P -E -W aarch64

and it runs in Tom's image. If I jump into the container (without running
buildman), there is no .bm-work at all (this should be in /tmp, right?)

uboot@e4a810aa6d8a:/$ ls -la tmp/
total 8
drwxrwxrwt 1 root root 4096 Sep 30 15:54 .
drwxr-xr-x 1 root root 4096 Oct 19 20:01 ..

So it is certain: before buildman runs there is no .bm-work. Then of
course, after each job (defined in .gitlab-ci.yml), /tmp/.bm-work is
present. But I guess you do not mean to drop .bm-work after each job,
right? Otherwise this should also be put in .gitlab-ci.yml.

Where the build gets stuck... I do not know, but it looks like all boards
were already built successfully (just as an example,
https://source.denx.de/u-boot/custodians/u-boot-imx/-/jobs/337915).

> 
>>
>>> directory then building just that board for one commit, then the next
>>> (with the Kconfig change).
>>
>> I have run buildman for each single board, all of them were successful.
>> With aarch64, I get OOM from buildman.
>>
>>>
>>> Buildman is supposed to handle this, of course. I'm not sure what has changed.
>>>
> 
> I still believe this is due to the reason I said,

I am sure you're right, but I do not see any .bm-work before running
buildman. Is there something I can turn on to get more info?

> but I'm happy to be
> proved wrong.

Thanks,
Stefano


* Re: buildman stops (crashed) on current master
  2021-10-19 15:39   ` Stefano Babic
  2021-10-19 15:52     ` Simon Glass
@ 2021-10-19 22:53     ` Tom Rini
  2021-10-19 22:59       ` Simon Glass
  1 sibling, 1 reply; 12+ messages in thread
From: Tom Rini @ 2021-10-19 22:53 UTC (permalink / raw)
  To: Stefano Babic, Simon Glass; +Cc: U-Boot

On Tue, Oct 19, 2021 at 05:39:12PM +0200, Stefano Babic wrote:
> Bisect does not help: I started bisect, and at the end this commit was
> presented:
> 
> commit 53a24dee86fb72ae41e7579607bafe13442616f2
> Author: Fabio Estevam <festevam@denx.de>
> Date:   Mon Aug 23 21:11:09 2021 -0300
> 
>     imx8mm-cl-iot-gate: Split the defconfigs

I strongly suspect what's going on here is that these new defconfigs are
out of sync with changes now in Kconfig.  The build itself will just sit
there, waiting for the "oldconfig" prompt to be answered.

I want to say the problem here is that stdin is open, rather than
pointing to something closed, which would lead to the build failing
immediately rather than only once a timeout is hit or the OOM killer
kicks in due to kconfig chewing up all the memory.
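
The difference between an open stdin and one that is already at end-of-file can be illustrated in isolation. This is a minimal sketch, assuming a hypothetical child that prints a prompt and reads one line; it stands in for an interactive config step and is not the real kconfig code. With stdin redirected from the null device the read returns EOF at once, and the timeout bounds the run in any case.

```python
import subprocess
import sys

# Hypothetical stand-in for an interactive configuration step: it prints a
# prompt and waits for one line of input.
PROMPT_CHILD = (
    "import sys\n"
    "sys.stdout.write('New option (NEW) [Y/n] ')\n"
    "sys.stdout.flush()\n"
    "answer = sys.stdin.readline()\n"
    "sys.exit(0 if answer else 1)  # an empty string means EOF\n"
)


def run_noninteractive(timeout_s=10):
    """Run the child with stdin at EOF and a hard upper bound on run time."""
    return subprocess.run(
        [sys.executable, "-c", PROMPT_CHILD],
        stdin=subprocess.DEVNULL,   # reads hit EOF instead of blocking
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        timeout=timeout_s,          # raises TimeoutExpired if exceeded
    )


result = run_noninteractive()
print(result.returncode, repr(result.stdout))
```

Whether the real config tool exits cleanly on EOF is a separate question; the sketch only shows that the caller decides if a prompt can block.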

-- 
Tom


* Re: buildman stops (crashed) on current master
  2021-10-19 22:53     ` Tom Rini
@ 2021-10-19 22:59       ` Simon Glass
  2021-10-19 23:01         ` Tom Rini
  0 siblings, 1 reply; 12+ messages in thread
From: Simon Glass @ 2021-10-19 22:59 UTC (permalink / raw)
  To: Tom Rini; +Cc: Stefano Babic, U-Boot

Hi Tom,

On Tue, 19 Oct 2021 at 16:53, Tom Rini <trini@konsulko.com> wrote:
> I strongly suspect what's going on here is that these new defconfigs are
> out of sync with changes now in Kconfig.  The build itself will just sit
> there, waiting for the "oldconfig" prompt to be answered.
>
> I want to say the problem here is that stdin is open, rather than
> pointing to something closed, which would lead to the build failing
> immediately rather than only once a timeout is hit or the OOM killer
> kicks in due to kconfig chewing up all the memory.

Yes that's exactly what I saw...

In fact, see this commit:

e62a24ce27a buildman: Avoid hanging when the config changes

But that was 3 years ago.

Regards,
Simon


* Re: buildman stops (crashed) on current master
  2021-10-19 22:59       ` Simon Glass
@ 2021-10-19 23:01         ` Tom Rini
  2021-10-20  3:42           ` Simon Glass
  0 siblings, 1 reply; 12+ messages in thread
From: Tom Rini @ 2021-10-19 23:01 UTC (permalink / raw)
  To: Simon Glass; +Cc: Stefano Babic, U-Boot

On Tue, Oct 19, 2021 at 04:59:15PM -0600, Simon Glass wrote:
> Hi Tom,
> 
> On Tue, 19 Oct 2021 at 16:53, Tom Rini <trini@konsulko.com> wrote:
> >
> > On Tue, Oct 19, 2021 at 05:39:12PM +0200, Stefano Babic wrote:
> > > Hi Simon,
> > >
> > > On 07.10.21 15:43, Simon Glass wrote:
> > > > Hi Stefano,
> > > >
> > > > On Thu, 7 Oct 2021 at 04:37, Stefano Babic <sbabic@denx.de> wrote:
> > > > >
> > > > > Hi all,
> > > > >
> > > > > CI stops by building aarch64 without notice, for reference:
> > > > >
> > > > > https://source.denx.de/u-boot/custodians/u-boot-imx/-/jobs/332319
> > > > >
> > > > > There is no error, just process is killed. It looks like it stops at
> > > > > xilinx_zynqmp_virt,
> > > > >
> > > > > > ./tools/buildman/buildman -o /tmp -P -E -W aarch64
> > > > > >
> > > > > > but board can be built without issues.
> > > > >
> > > > > If I build on my host (not in docker, anyway), it generally builds fine
> > > > > - but it crashes sometimes, too. On gitlab instance , it crashes.
> > > > > Issue does not seem that depends on merged patches, and introduces
> > > > > boards were already built successfully. Any hint ? I have also no idea
> > > > > what I should look as what I see is just
> > > > >
> > > > > "usr/bin/bash: line 104:    24 Killed
> > > > > ./tools/buildman/buildman -o /tmp -P -E -W aarch64"
> > > >
> > > > I cannot see that link. I am not sure what is going on. Does it say
> > > > what signal killed it?
> > >
> > > > Pipelines on our server were not public - I have enabled them now for u-boot-imx.
> > > >
> > > > >
> > > > > Does it sit there for an hour and time out? If so, then I did see that
> > > > > myself once recently, when the Kconfig needed stdin, but I could not
> > > > > quite tie it down. I think buildman would provide it, but sometimes
> > > > not, apparently. So it can happen when there is an existing build
> > > > there and your new one which adds Kconfig options that don't have
> > > > defaults, or something like that?
> > > >
> > >
> > > I have investigated further, and I can reproduce it on my host outside the
> > > gitlab server. buildman causes a OOM, but I cannot find the cause.
> > >
> > > Strange enough, this happens with the "aarch64" target, and I cannot
> > > > reproduce it with Tom's master. So it seems that -master is ok, and something
> > > on u-boot-imx generates the OOM.
> > >
> > > However....
> > >
> > > The OOM happens always when -2 (two boards remain) appears. I can see with
> > > htop that buildman starts to allocate memory until it is exhausted (64GB RAM
> > > + 8 GB swap). Then the kernel decides that it is enough and kills buildman -
> > > this is what I see on Ci.
> > >
> > > You can see now the pipelines:
> > >
> > > https://source.denx.de/u-boot/custodians/u-boot-imx/-/pipelines/9520
> > >
> > > I have then split aarch64 and I built imx8 separately - same result. The
> > > pipeline stops with xilinx board, but they have nothing to do. In fact, I
> > > can build all xilinx board separately. If I run buildman -W aarch64 -x
> > > xilinx, OOM is shown by another board.
> > >
> > > Strange enough, I can build each single board with buildman without issues,
> > > > neither errors nor warnings. Just when buildman runs all together (aarch64,
> > > 308 boards), the OOM is generated.
> > >
> > > Bisect does not help: I started bisect, and at the end this commit was
> > > presented:
> > >
> > > commit 53a24dee86fb72ae41e7579607bafe13442616f2
> > > Author: Fabio Estevam <festevam@denx.de>
> > > Date:   Mon Aug 23 21:11:09 2021 -0300
> > >
> > >     imx8mm-cl-iot-gate: Split the defconfigs
> >
> > I strongly suspect what's going on here is that these new defconfigs are
> > out of sync with changes now in Kconfig.  The build itself will just sit
> > there, waiting for the "oldconfig" prompt to be answered.
> >
> > I want to say the problem here is that stdin is open, rather than
> > pointing to something closed and would lead to the build failing
> > immediately, rather than once a timeout is hit, or OOM kicks in due to
> > kconfig chewing up all the memory.
> 
> Yes that's exactly what I saw...
> 
> In fact, see this commit:
> 
> e62a24ce27a buildman: Avoid hanging when the config changes
> 
> But that was 3 years ago.

Looks like something else needs to be changed then, I've bisected down
similar failures here before very recently.

-- 
Tom


* Re: buildman stops (crashed) on current master
  2021-10-19 23:01         ` Tom Rini
@ 2021-10-20  3:42           ` Simon Glass
  2021-10-20  9:54             ` Stefano Babic
  0 siblings, 1 reply; 12+ messages in thread
From: Simon Glass @ 2021-10-20  3:42 UTC (permalink / raw)
  To: Tom Rini; +Cc: Stefano Babic, U-Boot

Hi,

On Tue, 19 Oct 2021 at 17:01, Tom Rini <trini@konsulko.com> wrote:
>
> On Tue, Oct 19, 2021 at 04:59:15PM -0600, Simon Glass wrote:
> > Hi Tom,
> >
> > On Tue, 19 Oct 2021 at 16:53, Tom Rini <trini@konsulko.com> wrote:
> > >
> > > On Tue, Oct 19, 2021 at 05:39:12PM +0200, Stefano Babic wrote:
> > > > Hi Simon,
> > > >
> > > > On 07.10.21 15:43, Simon Glass wrote:
> > > > > Hi Stefano,
> > > > >
> > > > > On Thu, 7 Oct 2021 at 04:37, Stefano Babic <sbabic@denx.de> wrote:
> > > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > CI stops by building aarch64 without notice, for reference:
> > > > > >
> > > > > > https://source.denx.de/u-boot/custodians/u-boot-imx/-/jobs/332319
> > > > > >
> > > > > > There is no error, just process is killed. It looks like it stops at
> > > > > > xilinx_zynqmp_virt,
> > > > > >
> > > > > > > ./tools/buildman/buildman -o /tmp -P -E -W aarch64
> > > > > > >
> > > > > > > but board can be built without issues.
> > > > > >
> > > > > > If I build on my host (not in docker, anyway), it generally builds fine
> > > > > > - but it crashes sometimes, too. On gitlab instance , it crashes.
> > > > > > Issue does not seem that depends on merged patches, and introduces
> > > > > > boards were already built successfully. Any hint ? I have also no idea
> > > > > > what I should look as what I see is just
> > > > > >
> > > > > > "usr/bin/bash: line 104:    24 Killed
> > > > > > ./tools/buildman/buildman -o /tmp -P -E -W aarch64"
> > > > >
> > > > > I cannot see that link. I am not sure what is going on. Does it say
> > > > > what signal killed it?
> > > >
> > > > > Pipelines on our server were not public - I have enabled them now for u-boot-imx.
> > > > >
> > > > > >
> > > > > > Does it sit there for an hour and time out? If so, then I did see that
> > > > > > myself once recently, when the Kconfig needed stdin, but I could not
> > > > > > quite tie it down. I think buildman would provide it, but sometimes
> > > > > not, apparently. So it can happen when there is an existing build
> > > > > there and your new one which adds Kconfig options that don't have
> > > > > defaults, or something like that?
> > > > >
> > > >
> > > > I have investigated further, and I can reproduce it on my host outside the
> > > > gitlab server. buildman causes a OOM, but I cannot find the cause.
> > > >
> > > > Strange enough, this happens with the "aarch64" target, and I cannot
> > > > > reproduce it with Tom's master. So it seems that -master is ok, and something
> > > > on u-boot-imx generates the OOM.
> > > >
> > > > However....
> > > >
> > > > The OOM happens always when -2 (two boards remain) appears. I can see with
> > > > htop that buildman starts to allocate memory until it is exhausted (64GB RAM
> > > > + 8 GB swap). Then the kernel decides that it is enough and kills buildman -
> > > > this is what I see on Ci.
> > > >
> > > > You can see now the pipelines:
> > > >
> > > > https://source.denx.de/u-boot/custodians/u-boot-imx/-/pipelines/9520
> > > >
> > > > I have then split aarch64 and I built imx8 separately - same result. The
> > > > pipeline stops with xilinx board, but they have nothing to do. In fact, I
> > > > can build all xilinx board separately. If I run buildman -W aarch64 -x
> > > > xilinx, OOM is shown by another board.
> > > >
> > > > Strange enough, I can build each single board with buildman without issues,
> > > > > neither errors nor warnings. Just when buildman runs all together (aarch64,
> > > > 308 boards), the OOM is generated.
> > > >
> > > > Bisect does not help: I started bisect, and at the end this commit was
> > > > presented:
> > > >
> > > > commit 53a24dee86fb72ae41e7579607bafe13442616f2
> > > > Author: Fabio Estevam <festevam@denx.de>
> > > > Date:   Mon Aug 23 21:11:09 2021 -0300
> > > >
> > > >     imx8mm-cl-iot-gate: Split the defconfigs
> > >
> > > I strongly suspect what's going on here is that these new defconfigs are
> > > out of sync with changes now in Kconfig.  The build itself will just sit
> > > there, waiting for the "oldconfig" prompt to be answered.
> > >
> > > I want to say the problem here is that stdin is open, rather than
> > > pointing to something closed and would lead to the build failing
> > > immediately, rather than once a timeout is hit, or OOM kicks in due to
> > > kconfig chewing up all the memory.
> >
> > Yes that's exactly what I saw...
> >
> > In fact, see this commit:
> >
> > e62a24ce27a buildman: Avoid hanging when the config changes
> >
> > But that was 3 years ago.
>
> Looks like something else needs to be changed then, I've bisected down
> similar failures here before very recently.

I dug into this a bit and I think buildman can detect this situation.
I'll send a little series.

Regards,
Simon

* Re: buildman stops (crashed) on current master
  2021-10-20  3:42           ` Simon Glass
@ 2021-10-20  9:54             ` Stefano Babic
  2021-10-20 13:39               ` Simon Glass
  0 siblings, 1 reply; 12+ messages in thread
From: Stefano Babic @ 2021-10-20  9:54 UTC (permalink / raw)
  To: Simon Glass, Tom Rini; +Cc: Stefano Babic, U-Boot

On 20.10.21 05:42, Simon Glass wrote:
> Hi,
> 
> On Tue, 19 Oct 2021 at 17:01, Tom Rini <trini@konsulko.com> wrote:
>>
>> On Tue, Oct 19, 2021 at 04:59:15PM -0600, Simon Glass wrote:
>>> Hi Tom,
>>>
>>> On Tue, 19 Oct 2021 at 16:53, Tom Rini <trini@konsulko.com> wrote:
>>>>
>>>> On Tue, Oct 19, 2021 at 05:39:12PM +0200, Stefano Babic wrote:
>>>>> Hi Simon,
>>>>>
>>>>> On 07.10.21 15:43, Simon Glass wrote:
>>>>>> Hi Stefano,
>>>>>>
>>>>>> On Thu, 7 Oct 2021 at 04:37, Stefano Babic <sbabic@denx.de> wrote:
>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> CI stops by building aarch64 without notice, for reference:
>>>>>>>
>>>>>>> https://source.denx.de/u-boot/custodians/u-boot-imx/-/jobs/332319
>>>>>>>
>>>>>>> There is no error, just process is killed. It looks like it stops at
>>>>>>> xilinx_zynqmp_virt,
>>>>>>>
>>>>>>> ./tools/buildman/buildman -o /tmp -P -E -W aarch64
>>>>>>>
>>>>>>> but board can be built without issues.
>>>>>>>
>>>>>>> If I build on my host (not in docker, anyway), it generally builds fine
>>>>>>> - but it crashes sometimes, too. On gitlab instance , it crashes.
>>>>>>> Issue does not seem that depends on merged patches, and introduces
>>>>>>> boards were already built successfully. Any hint ? I have also no idea
>>>>>>> what I should look as what I see is just
>>>>>>>
>>>>>>> "usr/bin/bash: line 104:    24 Killed
>>>>>>> ./tools/buildman/buildman -o /tmp -P -E -W aarch64"
>>>>>>
>>>>>> I cannot see that link. I am not sure what is going on. Does it say
>>>>>> what signal killed it?
>>>>>
>>>>> Pipelines on our server were not public - I have enabled them now for u-boot-imx.
>>>>>
>>>>>>
>>>>>> Does it sit there for an hour and time out? If so, then I did see that
>>>>>> myself once recently, when the Kconfig needed stdin, but I could not
>>>>>> quite tie it down. I think buildman would provide it, but sometimes
>>>>>> not, apparently. So it can happen when there is an existing build
>>>>>> there and your new one which adds Kconfig options that don't have
>>>>>> defaults, or something like that?
>>>>>>
>>>>>
>>>>> I have investigated further, and I can reproduce it on my host outside the
>>>>> gitlab server. buildman causes a OOM, but I cannot find the cause.
>>>>>
>>>>> Strange enough, this happens with the "aarch64" target, and I cannot
>>>>> reproduce it with Tom's master. So it seems that -master is ok, and something
>>>>> on u-boot-imx generates the OOM.
>>>>>
>>>>> However....
>>>>>
>>>>> The OOM happens always when -2 (two boards remain) appears. I can see with
>>>>> htop that buildman starts to allocate memory until it is exhausted (64GB RAM
>>>>> + 8 GB swap). Then the kernel decides that it is enough and kills buildman -
>>>>> this is what I see on Ci.
>>>>>
>>>>> You can see now the pipelines:
>>>>>
>>>>> https://source.denx.de/u-boot/custodians/u-boot-imx/-/pipelines/9520
>>>>>
>>>>> I have then split aarch64 and I built imx8 separately - same result. The
>>>>> pipeline stops with xilinx board, but they have nothing to do. In fact, I
>>>>> can build all xilinx board separately. If I run buildman -W aarch64 -x
>>>>> xilinx, OOM is shown by another board.
>>>>>
>>>>> Strange enough, I can build each single board with buildman without issues,
>>>>> neither errors nor warnings. Just when buildman runs all together (aarch64,
>>>>> 308 boards), the OOM is generated.
>>>>>
>>>>> Bisect does not help: I started bisect, and at the end this commit was
>>>>> presented:
>>>>>
>>>>> commit 53a24dee86fb72ae41e7579607bafe13442616f2
>>>>> Author: Fabio Estevam <festevam@denx.de>
>>>>> Date:   Mon Aug 23 21:11:09 2021 -0300
>>>>>
>>>>>      imx8mm-cl-iot-gate: Split the defconfigs
>>>>
>>>> I strongly suspect what's going on here is that these new defconfigs are
>>>> out of sync with changes now in Kconfig.  The build itself will just sit
>>>> there, waiting for the "oldconfig" prompt to be answered.
>>>>
>>>> I want to say the problem here is that stdin is open, rather than
>>>> pointing to something closed and would lead to the build failing
>>>> immediately, rather than once a timeout is hit, or OOM kicks in due to
>>>> kconfig chewing up all the memory.
>>>
>>> Yes that's exactly what I saw...
>>>
>>> In fact, see this commit:
>>>
>>> e62a24ce27a buildman: Avoid hanging when the config changes
>>>
>>> But that was 3 years ago.
>>
>> Looks like something else needs to be changed then, I've bisected down
>> similar failures here before very recently.
> 
> I dug into this a bit and I think buildman can detect this situation.
> I'll send a little series.
> 

The patches definitely help ;-)

They break the build (on CI when build-tools runs), but I get much more
details when I build single boards locally. For kontron-sl-mx8mm I can
find several errors due to:

- CONFIG_SYS_LOAD_ADDR not defined in configs, but in header
- CONFIG_SYS_EXTRA_OPTIONS instead of CONFIG_IMX_CONFIG
- CONFIG_SYS_MALLOC_LEN not defined in config, but in header

Your patches are a valuable tool (CI drove me crazy); I can now follow
what happens. I will send a patch for kontron and go on with the rest (I
guess kontron is not the only board causing this deadlock). Many thanks!
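
A check of this kind can be sketched in a few lines of Python. This is only
an illustration under stated assumptions: the function name, the sample
defconfig text and the path in it are made up, and the symbol list is simply
taken from the errors listed above. It reports which of those symbols have
no assignment in a defconfig.

```python
import re

# Symbols taken from the error list above; treating them as required in the
# defconfig is an assumption made for this illustration.
REQUIRED = ["CONFIG_SYS_LOAD_ADDR", "CONFIG_SYS_MALLOC_LEN", "CONFIG_IMX_CONFIG"]


def missing_symbols(defconfig_text, required=REQUIRED):
    """Return the required symbols that have no assignment in the defconfig."""
    defined = set(re.findall(r"^(CONFIG_\w+)=", defconfig_text, flags=re.MULTILINE))
    return [sym for sym in required if sym not in defined]


# Made-up defconfig fragment; the path is a placeholder, not a real file.
sample = """\
CONFIG_ARM=y
CONFIG_SYS_MALLOC_LEN=0x2000000
CONFIG_SYS_EXTRA_OPTIONS="IMX_CONFIG=board/example/imximage.cfg"
# CONFIG_SPL_FOO is not set
"""

print(missing_symbols(sample))
```

Lines of the form "# CONFIG_... is not set" are deliberately not counted as
definitions here.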

Tom, I will apply Simon's patches to my tree, I cannot work without them...

Regards,
Stefano

-- 
=====================================================================
DENX Software Engineering GmbH,      Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: +49-8142-66989-53 Fax: +49-8142-66989-80 Email: sbabic@denx.de
=====================================================================

* Re: buildman stops (crashed) on current master
  2021-10-20  9:54             ` Stefano Babic
@ 2021-10-20 13:39               ` Simon Glass
  0 siblings, 0 replies; 12+ messages in thread
From: Simon Glass @ 2021-10-20 13:39 UTC (permalink / raw)
  To: Stefano Babic; +Cc: Tom Rini, U-Boot

Hi Stefano,

On Wed, 20 Oct 2021 at 03:55, Stefano Babic <sbabic@denx.de> wrote:
>
> On 20.10.21 05:42, Simon Glass wrote:
> > Hi,
> >
> > On Tue, 19 Oct 2021 at 17:01, Tom Rini <trini@konsulko.com> wrote:
> >>
> >> On Tue, Oct 19, 2021 at 04:59:15PM -0600, Simon Glass wrote:
> >>> Hi Tom,
> >>>
> >>> On Tue, 19 Oct 2021 at 16:53, Tom Rini <trini@konsulko.com> wrote:
> >>>>
> >>>> On Tue, Oct 19, 2021 at 05:39:12PM +0200, Stefano Babic wrote:
> >>>>> Hi Simon,
> >>>>>
> >>>>> On 07.10.21 15:43, Simon Glass wrote:
> >>>>>> Hi Stefano,
> >>>>>>
> >>>>>> On Thu, 7 Oct 2021 at 04:37, Stefano Babic <sbabic@denx.de> wrote:
> >>>>>>>
> >>>>>>> Hi all,
> >>>>>>>
> >>>>>>> CI stops by building aarch64 without notice, for reference:
> >>>>>>>
> >>>>>>> https://source.denx.de/u-boot/custodians/u-boot-imx/-/jobs/332319
> >>>>>>>
> >>>>>>> There is no error, just process is killed. It looks like it stops at
> >>>>>>> xilinx_zynqmp_virt,
> >>>>>>>
> >>>>>>> ./tools/buildman/buildman -o /tmp -P -E -W aarch64
> >>>>>>>
> >>>>>>> but board can be built without issues.
> >>>>>>>
> >>>>>>> If I build on my host (not in docker, anyway), it generally builds fine
> >>>>>>> - but it crashes sometimes, too. On gitlab instance , it crashes.
> >>>>>>> Issue does not seem that depends on merged patches, and introduces
> >>>>>>> boards were already built successfully. Any hint ? I have also no idea
> >>>>>>> what I should look as what I see is just
> >>>>>>>
> >>>>>>> "usr/bin/bash: line 104:    24 Killed
> >>>>>>> ./tools/buildman/buildman -o /tmp -P -E -W aarch64"
> >>>>>>
> >>>>>> I cannot see that link. I am not sure what is going on. Does it say
> >>>>>> what signal killed it?
> >>>>>
> >>>>> Pipelines on our server were not public - I have enabled them now for u-boot-imx.
> >>>>>
> >>>>>>
> >>>>>> Does it sit there for an hour and time out? If so, then I did see that
> >>>>>> myself once recently, when the Kconfig needed stdin, but I could not
> >>>>>> quite tie it down. I think buildman would provide it, but sometimes
> >>>>>> not, apparently. So it can happen when there is an existing build
> >>>>>> there and your new one which adds Kconfig options that don't have
> >>>>>> defaults, or something like that?
> >>>>>>
> >>>>>
> >>>>> I have investigated further, and I can reproduce it on my host outside the
> >>>>> gitlab server. buildman causes a OOM, but I cannot find the cause.
> >>>>>
> >>>>> Strange enough, this happens with the "aarch64" target, and I cannot
> >>>>> reproduce it with Tom's master. So it seems that -master is ok, and something
> >>>>> on u-boot-imx generates the OOM.
> >>>>>
> >>>>> However....
> >>>>>
> >>>>> The OOM happens always when -2 (two boards remain) appears. I can see with
> >>>>> htop that buildman starts to allocate memory until it is exhausted (64GB RAM
> >>>>> + 8 GB swap). Then the kernel decides that it is enough and kills buildman -
> >>>>> this is what I see on Ci.
> >>>>>
> >>>>> You can see now the pipelines:
> >>>>>
> >>>>> https://source.denx.de/u-boot/custodians/u-boot-imx/-/pipelines/9520
> >>>>>
> >>>>> I have then split aarch64 and I built imx8 separately - same result. The
> >>>>> pipeline stops with xilinx board, but they have nothing to do. In fact, I
> >>>>> can build all xilinx board separately. If I run buildman -W aarch64 -x
> >>>>> xilinx, OOM is shown by another board.
> >>>>>
> >>>>> Strange enough, I can build each single board with buildman without issues,
> >>>>> neither errors nor warnings. Just when buildman runs all together (aarch64,
> >>>>> 308 boards), the OOM is generated.
> >>>>>
> >>>>> Bisect does not help: I started bisect, and at the end this commit was
> >>>>> presented:
> >>>>>
> >>>>> commit 53a24dee86fb72ae41e7579607bafe13442616f2
> >>>>> Author: Fabio Estevam <festevam@denx.de>
> >>>>> Date:   Mon Aug 23 21:11:09 2021 -0300
> >>>>>
> >>>>>      imx8mm-cl-iot-gate: Split the defconfigs
> >>>>
> >>>> I strongly suspect what's going on here is that these new defconfigs are
> >>>> out of sync with changes now in Kconfig.  The build itself will just sit
> >>>> there, waiting for the "oldconfig" prompt to be answered.
> >>>>
> >>>> I want to say the problem here is that stdin is open, rather than
> >>>> pointing to something closed and would lead to the build failing
> >>>> immediately, rather than once a timeout is hit, or OOM kicks in due to
> >>>> kconfig chewing up all the memory.
> >>>
> >>> Yes that's exactly what I saw...
> >>>
> >>> In fact, see this commit:
> >>>
> >>> e62a24ce27a buildman: Avoid hanging when the config changes
> >>>
> >>> But that was 3 years ago.
> >>
> >> Looks like something else needs to be changed then, I've bisected down
> >> similar failures here before very recently.
> >
> > I dug into this a bit and I think buildman can detect this situation.
> > I'll send a little series.
> >
>
> The patches definitely help ;-)
>
> They break the build (on CI when build-tools runs), but I get much more
> details when I build single boards locally. For kontron-sl-mx8mm I can
> find several errors due to:
>
> - CONFIG_SYS_LOAD_ADDR not defined in configs, but in header
> - CONFIG_SYS_EXTRA_OPTIONS instead of CONFIG_IMX_CONFIG
> - CONFIG_SYS_MALLOC_LEN not defined in config, but in header
>
> Your patches are a valuable tool (CI drove me crazy); I can now follow
> what happens. I will send a patch for kontron and go on with the rest (I
> guess kontron is not the only board causing this deadlock). Many thanks!
>
> Tom, I will apply Simon's patches to my tree, I cannot work without them...

OK good. Unfortunately this is likely to throw up problems that are
only fatal 1% of the time.

Regards,
Simon

Thread overview: 12+ messages
2021-10-07 10:37 buildman stops (crashed) on current master Stefano Babic
2021-10-07 13:43 ` Simon Glass
2021-10-07 14:10   ` Stefano Babic
2021-10-19 15:39   ` Stefano Babic
2021-10-19 15:52     ` Simon Glass
2021-10-19 20:10       ` Stefano Babic
2021-10-19 22:53     ` Tom Rini
2021-10-19 22:59       ` Simon Glass
2021-10-19 23:01         ` Tom Rini
2021-10-20  3:42           ` Simon Glass
2021-10-20  9:54             ` Stefano Babic
2021-10-20 13:39               ` Simon Glass