All of lore.kernel.org
 help / color / mirror / Atom feed
* I disabled more compiler warnings..
@ 2020-05-10 19:33 Linus Torvalds
  2020-05-11  7:43 ` David Laight
  2020-05-11  9:03 ` Arnd Bergmann
  0 siblings, 2 replies; 17+ messages in thread
From: Linus Torvalds @ 2020-05-10 19:33 UTC (permalink / raw)
  To: Arnd Bergmann, Masahiro Yamada; +Cc: Linux Kernel Mailing List

Arnd, Masahiro, (anybody else?),
  I don't know how much you care - I thought at least Arnd might - but
I've upgraded my main desktop to gcc-10. I did that partly because a
bug report that initially seemed to blame the compiler, but eventually
turned out to be unrelated.

And as part of that, there were a lot of new warnings in the kernel build.

I let them go for a while, in the belief that I could deal with it,
but then yesterday I did a pull and didn't initially even notice that
the end result didn't compile for me, because the build error was
hidden by the hundreds of lines of warnings (not hundreds of warnings,
but gcc has gotten to be pretty verbose about the warnings, so each
warning was often 5+ lines of "this is where it happened, this is
where it was included from, and this is where the offending thing was
declared")

A lot of them were good warnings where gcc warns about things it
really should warn about - if you have modern source code and actually
use flexible arrays etc. Which we're moving towards, but we're not
there yet, and clearly won't be for 5.7.

So I don't think those parts are at all controversial. We'll get the
zero-sized arrays converted, and then we'll have another slew of
one-sized array declarations that will also have to be taken care of,
but I hope we can re-enable those array limit warnings in the not too
distant future. They should be fairly unambiguously a good thing once
we've sorted out the legacy code, which people are working on anyway.

Same to some degree goes for the 'restrict' warning, which we don't
need right now and triggers for what is a somewhat odd - but not
entirely unusual - case for the kernel. That warning right now has no
upside for us, but it's a potentially interesting warning for the
future. And the special 'free()' warning is gcc doing things the wrong
way, but that one was easy enough to just avoid in the first place,
without making the source code any worse.

So most of the warning removals are fairly innocuous:

   gcc-10: disable 'restrict' warning for now
   gcc-10: disable 'stringop-overflow' warning for now
   gcc-10: disable 'array-bounds' warning for now
   gcc-10: disable 'zero-length-bounds' warning for now

and two were fixes/cleanups to the kernel to avoid new warnings:

   gcc-10: mark more functions __init to avoid section mismatch warnings
   gcc-10: avoid shadowing standard library 'free()' in crypto

but the one commit I wanted to point out - because I think Arnd may
care - is me giving up on the "maybe initialized" warning once again.

This is the usual "gcc optimizations changed, now it can't follow the
flow of code it used to handle fine, and it warns in several new
places that are almost certainly false positives":

   Stop the ad-hoc games with -Wno-maybe-initialized

so now -Wno-maybe-uninitialized is the default behavior, and if you
want to see those (very unreliable) warnings, you need to use "W=2"
which has explicitly enabled it for a longish time.

I simply refuse to be in the situation where I might miss an
_important_ warning (or, like happened yesterday, an actual failure),
because the stupid warning noise is hiding things that matter. Yes, I
caught the build error despite the noise, but that was partly luck (I
did another pull before I pushed out, and caught the error on the
second build). And yes, I've made my workflow now hopefully make sure
that the actual build error will stand out more, but even hiding just
other - more real - warnings is a problem, so I do not want to see the
pointless noise.

But I wish there was a better model for handling those "maybe
uninitialzed" issues. Some of them _have_ been valid, even if a huge
amount are just the compiler not following the use of the variables
well enough.

So I thought I'd at least mention this thing, and see if people have
any sane solutions.

             Linus

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: I disabled more compiler warnings..
  2020-05-10 19:33 I disabled more compiler warnings Linus Torvalds
@ 2020-05-11  7:43 ` David Laight
  2020-05-11 17:41   ` Linus Torvalds
  2020-05-11  9:03 ` Arnd Bergmann
  1 sibling, 1 reply; 17+ messages in thread
From: David Laight @ 2020-05-11  7:43 UTC (permalink / raw)
  To: 'Linus Torvalds', Arnd Bergmann, Masahiro Yamada
  Cc: Linux Kernel Mailing List

From: Linus Torvalds
> Sent: 10 May 2020 20:33
...
> And as part of that, there were a lot of new warnings in the kernel build.
> 
> I let them go for a while, in the belief that I could deal with it,
> but then yesterday I did a pull and didn't initially even notice that
> the end result didn't compile for me, because the build error was
> hidden by the hundreds of lines of warnings ...

One problem is that gmake is very bad at stopping parallel
makes when one command fails.
So the kernel build carries on firing off new compilations
even after one has failed.

I've not looked inside gmake, but I fixed nmake so that it
properly used a single job token pipe for the entire (NetBSD)
build and then flushed and refilled it with 'abort' tokens
when any command failed.
That made the build stop almost immediately.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: I disabled more compiler warnings..
  2020-05-10 19:33 I disabled more compiler warnings Linus Torvalds
  2020-05-11  7:43 ` David Laight
@ 2020-05-11  9:03 ` Arnd Bergmann
  2020-05-11 11:11   ` Arnd Bergmann
  1 sibling, 1 reply; 17+ messages in thread
From: Arnd Bergmann @ 2020-05-11  9:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Masahiro Yamada, Linux Kernel Mailing List, Jason A. Donenfeld

On Sun, May 10, 2020 at 9:33 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So I don't think those parts are at all controversial. We'll get the
> zero-sized arrays converted, and then we'll have another slew of
> one-sized array declarations that will also have to be taken care of,
> but I hope we can re-enable those array limit warnings in the not too
> distant future. They should be fairly unambiguously a good thing once
> we've sorted out the legacy code, which people are working on anyway.

Agreed, I have sent patches for all the zero-length arrays. Some
of them still need more discussion or better ideas when they are not
trivially changed to flexible arrays, but there are not actually so many
of them that cause warnings right now.

> Same to some degree goes for the 'restrict' warning, which we don't
> need right now and triggers for what is a somewhat odd - but not
> entirely unusual - case for the kernel.

I think in this case it's not the warning that is new, but (same as
the free free() case) the compiler now knows that snprintf() comes with
a restrict pointer in standard C even when we don't mention it in our
declaration.

> and two were fixes/cleanups to the kernel to avoid new warnings:
>
>    gcc-10: mark more functions __init to avoid section mismatch warnings
>    gcc-10: avoid shadowing standard library 'free()' in crypto

I think I had sent the same patches and they got merged into linux-next,
but the maintainers did not consider them to be v5.7 material as
this was all old code.

> but the one commit I wanted to point out - because I think Arnd may
> care - is me giving up on the "maybe initialized" warning once again.
>
> This is the usual "gcc optimizations changed, now it can't follow the
> flow of code it used to handle fine, and it warns in several new
> places that are almost certainly false positives":
>
>    Stop the ad-hoc games with -Wno-maybe-initialized
>
> so now -Wno-maybe-uninitialized is the default behavior, and if you
> want to see those (very unreliable) warnings, you need to use "W=2"
> which has explicitly enabled it for a longish time.
>
> I simply refuse to be in the situation where I might miss an
> _important_ warning (or, like happened yesterday, an actual failure),
> because the stupid warning noise is hiding things that matter. Yes, I
> caught the build error despite the noise, but that was partly luck (I
> did another pull before I pushed out, and caught the error on the
> second build). And yes, I've made my workflow now hopefully make sure
> that the actual build error will stand out more, but even hiding just
> other - more real - warnings is a problem, so I do not want to see the
> pointless noise.
>
> But I wish there was a better model for handling those "maybe
> uninitialzed" issues. Some of them _have_ been valid, even if a huge
> amount are just the compiler not following the use of the variables
> well enough.
>
> So I thought I'd at least mention this thing, and see if people have
> any sane solutions.

Thanks for looping me in. I will try to investigate what is going on,
as I did not see a huge increase in these warnings after I moved
to gcc-10, and clearly something is going very wrong. Some more
notes on the topic:

- I absolutely agree that on a release kernel, doing 'make -s
  allmodconfig; make -skj30' should produce zero output,
  and anything else is a problem that needs to be addressed
  in some form, so turning them off for the moment is sensible.

- I also think that the warning about possibly uninitialized variables
  is an important tool in general, as long as it works well enough,
  as Rusty explained many years ago in
  https://rusty.ozlabs.org/?p=232
  It is super-annoying to see false-positivies pop up in old code, but
  when writing new code, that warning has saved me from running
  some stupidly broken code many times.

- Jason Donenfeld noticed that the gcc-10 inliner is much less
  aggressive in -O2 and suggested using -O3 by default to get
  the original behavior back.
  https://lore.kernel.org/lkml/20200507224530.2993316-1-Jason@zx2c4.com/
  I'm not sure we want all the things that -O3 does, but whatever
  changed in the inliner is probably the root cause for the new
  warnings, and for the same reason we previously turned off the
  warning in "-Os".

- There is a big difference between the handling of this warning
  between gcc and clang: while both have false positives and
  false negatives (see: halting problem), the behavior of clang is
  more reliable as the set of warnings it produces does not
  depend on inlining decisions.

- There is a good chance that some trivial difference between
  mainline and linux-next, or between linux-next and my test
  branch (with a couple hundred fixup patches) is causing you
  to see many warnings that I don't get. I'll try to reproduce and
  bisect what I find.

      Arnd

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: I disabled more compiler warnings..
  2020-05-11  9:03 ` Arnd Bergmann
@ 2020-05-11 11:11   ` Arnd Bergmann
  0 siblings, 0 replies; 17+ messages in thread
From: Arnd Bergmann @ 2020-05-11 11:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Masahiro Yamada, Linux Kernel Mailing List, Jason A. Donenfeld

On Mon, May 11, 2020 at 11:03 AM Arnd Bergmann <arnd@arndb.de> wrote:
>
> - There is a good chance that some trivial difference between
>   mainline and linux-next, or between linux-next and my test
>   branch (with a couple hundred fixup patches) is causing you
>   to see many warnings that I don't get. I'll try to reproduce and
>   bisect what I find.

Update: I found a bug in my own patches, one of them tried to disable
the warning when CONFIG_GCOV_PROFILE_ALL is turned on, as that
caused some false positives. Unfortunately I ended up turning it off
unconditionally by testing for CONFIG_ARCH_HAS_GCOV_PROFILE_ALL
instead.

I do see 36 such errors in my tree now with gcc-10 x86 allmodconfig,
and there are probably a couple more in your tree that I had already
fixed locally, e.g. when clang found the same thing. See full (abbreviated)
list below.

Looking at some of those warnings, I find a roughly 50:50 split between
code that gcc should have warned about all along when there is a
path in which a variable is actually used without initialization (which
may be impossible for reasons the compiler cannot know), and code
that looks correct to me, but I can see exactly how gcc gets confused,
especially now that it warns about the first class.
It should be possible to avoid these 36 warnings while making the
code more readable in most cases, or adding fake initializations
for false positives without a better workaround.

For -O3, there are many more warnings (225 total in my tree), with
additional valid bugs but also a higher ratio of false positives
in the samples I looked at. gcc-9 -O3 had additional false positives and
also a few extra somewhat sensible warnings, 273 warnings in total,
though these had always been disabled already.

For the default -O2, gcc-9 prints only the three warnings from
kernel/trace/ftrace.c, all of which are false positives that the compiler
should have not warned about IMO.

        Arnd

8<---
/* x86 allmodconfig gcc-10 -O2 warnings */
arch/x86/kvm/../../../virt/kvm/kvm_main.c:2441:42: error:
'nr_pages_avail' may be used uninitialized
drivers/video/fbdev/i740fb.c:332:9: error: 'wm' may be used uninitialized
lib/zstd/decompress.c:303:6: error: 'fParams.windowSize' may be used
uninitialized
lib/zstd/decompress.c:342:8: error: 'fParams.frameContentSize' may be
used uninitialized
kernel/trace/ftrace.c:7156:8: error: 'seq_ops' may be used uninitialized
kernel/trace/ftrace.c:7234:8: error: 'filtered_pids' may be used uninitialized
kernel/trace/ftrace.c:7251:22: error: 'other_pids' may be used uninitialized
drivers/i3c/device.c:227:41: error: 'devinfo.dcr' may be used uninitialized
include/linux/i3c/device.h:80:49: error: 'devinfo.pid' may be used uninitialized
fs/ext4/ext4_extents.h:226:8: error: 'zero_ex1.ee_start_lo' may be
used uninitialized
fs/ext4/ext4_extents.h:227:12: error: 'zero_ex1.ee_start_hi' may be
used uninitialized
fs/ext4/extents.c:3083:9: error: 'zero_ex1.ee_block' may be used uninitialized
drivers/input/touchscreen/hideep.c:373:14: error: 'unmask_code' may be
used uninitialized
drivers/media/i2c/tvp7002.c:691:11: error: 'val' may be used uninitialized
include/linux/spinlock.h:289:3: error: 'flags' may be used uninitialized
drivers/gpu/drm/amd/amdgpu/df_v3_6.c:571:2: error: 'hi_base_addr' may
be used uninitialized
drivers/gpu/drm/amd/amdgpu/df_v3_6.c:571:2: error: 'lo_base_addr' may
be used uninitialized
drivers/gpu/drm/i915/gt/selftest_lrc.c:2959:28: error: 'rq' may be
used uninitialized
drivers/gpu/drm/i915/gt/selftest_workarounds.c:492:17: error: 'rsvd'
may be used uninitialized
drivers/firmware/arm_scmi/driver.c:502:12: error: 't' may be used uninitialized
drivers/usb/serial/cp210x.c:948:8: error: 'bits' may be used uninitialized
drivers/usb/serial/cp210x.c:1194:8: error: 'bits' may be used uninitialized
drivers/platform/x86/acer-wmi.c:1440:9: error: 'value' may be used uninitialized
drivers/net/wireless/rsi/rsi_91x_sdio.c:1340:7: error: 'data' may be
used uninitialized
drivers/net/wireless/rsi/rsi_91x_sdio.c:238:12: error: 'resp' may be
used uninitialized
drivers/gpu/drm/nouveau/nvkm/subdev/i2c/auxg94.c:160:45: error: 'stat'
may be used uninitialized
drivers/gpu/drm/nouveau/nvkm/subdev/i2c/auxgm200.c:160:45: error:
'stat' may be used uninitialized
drivers/ata/ahci_imx.c:369:30: error: 'dac_ctl_reg' may be used uninitialized
drivers/ata/ahci_imx.c:377:33: error: 'rtune_ctl_reg' may be used uninitialized
drivers/ata/ahci_imx.c:375:42: error: 'mpll_test_reg' may be used uninitialized
include/linux/printk.h:360:2: error: 'abs_ppfid' may be used uninitialized
drivers/media/dvb-frontends/drxk_hard.c:1026:5: error: 'wait_cmd' may
be used uninitialized
drivers/iio/imu/bmi160/bmi160_core.c:577:47: error:
'int_out_ctrl_shift' may be used uninitialized
include/linux/dev_printk.h:104:2: error: 'pin_name' may be used uninitialized
drivers/iio/imu/bmi160/bmi160_core.c:606:8: error: 'int_map_mask' may
be used uninitialized
drivers/iio/imu/bmi160/bmi160_core.c:599:8: error: 'int_latch_mask'
may be used uninitialized

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: I disabled more compiler warnings..
  2020-05-11  7:43 ` David Laight
@ 2020-05-11 17:41   ` Linus Torvalds
  2020-05-11 17:58     ` Paul Smith
  0 siblings, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2020-05-11 17:41 UTC (permalink / raw)
  To: David Laight, Paul Smith
  Cc: Arnd Bergmann, Masahiro Yamada, Linux Kernel Mailing List

On Mon, May 11, 2020 at 12:43 AM David Laight <David.Laight@aculab.com> wrote:
>
> I've not looked inside gmake, but I fixed nmake so that it
> properly used a single job token pipe for the entire (NetBSD)
> build and then flushed and refilled it with 'abort' tokens
> when any command failed.
> That made the build stop almost immediately.

The GNU jobserver doesn't have anything like that, afaik.

I think it always writes a '+' character as a token, so I guess it
could be extended to write something else for the "abort now"
situation (presumably a '-' character).

But at least for external jobserver clients (of which I am not aware
of any, I think we only depend on the internal GNU make behavior), the
documentation states that you should just write back the same token
you read.

I've looked at the low-level jobserver code because I was chasing a
bug there not that long ago, but I've never looked at the interaction
with actually running commands.

Adding Paul Smith to the cc to see if he has any comments..

Paul - the issue is that most of us build the kernel with a "make
-j<bignum>" (in my case "-j32") and if an error happens during the
make, it can take a _looong_ time for make to react. And if there are
warnings in the build, they can hide the actual error fairly easily).

So what David is talking about is to make fatal errors abort much more
quickly by having "jobserver_acquire()" perhaps return an error, and
aborting any jobs when they see that.

I can trivially see how to do it in the jobserver code itself (just
see if the token we get was '-', and if it was, write it back for the
next user and return error), but it's the downstream make code I'm
entirely unfamiliar with.

Any ideas?

                  Linus

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: I disabled more compiler warnings..
  2020-05-11 17:41   ` Linus Torvalds
@ 2020-05-11 17:58     ` Paul Smith
  2020-05-11 19:33       ` Linus Torvalds
  2020-05-11 21:09       ` David Laight
  0 siblings, 2 replies; 17+ messages in thread
From: Paul Smith @ 2020-05-11 17:58 UTC (permalink / raw)
  To: Linus Torvalds, David Laight
  Cc: Arnd Bergmann, Masahiro Yamada, Linux Kernel Mailing List

On Mon, 2020-05-11 at 10:41 -0700, Linus Torvalds wrote:
> On Mon, May 11, 2020 at 12:43 AM David Laight <
> David.Laight@aculab.com> wrote:
> > 
> > I've not looked inside gmake, but I fixed nmake so that it
> > properly used a single job token pipe for the entire (NetBSD)
> > build and then flushed and refilled it with 'abort' tokens
> > when any command failed.
> > That made the build stop almost immediately.
> 
> The GNU jobserver doesn't have anything like that, afaik.
> 
> I think it always writes a '+' character as a token, so I guess it
> could be extended to write something else for the "abort now"
> situation (presumably a '-' character).

That was exactly my plan.

> But at least for external jobserver clients (of which I am not aware
> of any, I think we only depend on the internal GNU make behavior),
> the documentation states that you should just write back the same
> token you read.

I wrote this text precisely because I intended to support using other
tokens to specify other situations, such as failure :).

As a note, the GCC project has a GSoC project approved for this summer,
for GCC and/or binutils to participate in the jobserver protocol when
they do multithreading in the compiler/linker.  I think they are
planning on creating a generic "jobserver library" but I'm not
mentoring (I don't have the bandwidth for GSoC mentoring).  I do hope
to stay abreast of their work and perhaps toss in suggestions however.

> Paul - the issue is that most of us build the kernel with a "make
> -j<bignum>" (in my case "-j32") and if an error happens during the
> make, it can take a _looong_ time for make to react. And if there are
> warnings in the build, they can hide the actual error fairly easily).

Yes, I believe I have an enhancement in Savannah about this.

> I can trivially see how to do it in the jobserver code itself (just
> see if the token we get was '-', and if it was, write it back for the
> next user and return error), but it's the downstream make code I'm
> entirely unfamiliar with.

That's necessary, but my thinking is that more could be done.  What I
was going to do was if we write back an error token then we would also
go into a loop trying to read all the tokens off the jobserver and
write back error tokens, until we read an error token, then we'd exit
(after writing it back obviously).

If you don't do that you'll have to go through an entire set of builds
after the failure before everyone notices the failure.  If every
instance of make does this then you can propogate the error to everyone
more quickly.

My current work on GNU make is fixing the atrocious mess it has with
signal handling: don't even look and if you do, remember I inherited
this code (yeah, yeah, a long time ago but still... :)).  Right now
it's not so hard (especially with large -j) to have make instances
hanging when ^C is used due to race conditions in the signal handling.

As with all single-threaded applications, though, the problem is the
difficulty (in a portable way) of handling both signals and wait*(2)
reliably...

Anyway, this shouldn't be too difficult.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: I disabled more compiler warnings..
  2020-05-11 17:58     ` Paul Smith
@ 2020-05-11 19:33       ` Linus Torvalds
  2020-05-11 20:25         ` Paul Smith
  2020-05-11 21:09       ` David Laight
  1 sibling, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2020-05-11 19:33 UTC (permalink / raw)
  To: Paul Smith
  Cc: David Laight, Arnd Bergmann, Masahiro Yamada, Linux Kernel Mailing List

On Mon, May 11, 2020 at 10:59 AM Paul Smith <psmith@gnu.org> wrote:
>
> As with all single-threaded applications, though, the problem is the
> difficulty (in a portable way) of handling both signals and wait*(2)
> reliably...

I do wonder if GNU make isn't making it worse for itself by blocking SIGCHLD.

I wonder if you could just have three different file descriptors:

 - the "current token file descriptor"

 - a /dev/null file descriptor

 - the jobserver pipe file descriptor. This is left blocking.

and then make the rule be that

 - the SIGCHLD handler just does

        dup2(nullfd, jobserverfd);

   This guarantees that any read() that was interrupted by a SIGCHLD
will now reliably return immediately. But it also guarantees that if
we raced, and the SIGCHLD happened just *before* we did the read(),
the read() will also exit immediately.

 - the waiting code does

        check_child_status();
        for (;;) {
            ret = read(jobserverfd, buffer, 1);
            if (ret == 1) {
                .. we got the token successfully, do whatever ..
                return;
            }
            /* The read might fail because of SIGCHLD - either we got
interrupted, or the fd was replaced with /dev/null */
            /* First, re-instate the pipe binding */
            dup2(pipefd, jobserverfd);
            /* Then do the child status stuff again */
            check_child_status();
            /* Ok, we can restart, there's no races with SIGCHLD */
         }

which fundamentally doesn't have any races. Look, ma, no need for
nonblocking reads, or pselect().

I don't know. Maybe I missed something.

               Linus

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: I disabled more compiler warnings..
  2020-05-11 19:33       ` Linus Torvalds
@ 2020-05-11 20:25         ` Paul Smith
  0 siblings, 0 replies; 17+ messages in thread
From: Paul Smith @ 2020-05-11 20:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Laight, Arnd Bergmann, Masahiro Yamada, Linux Kernel Mailing List

On Mon, 2020-05-11 at 12:33 -0700, Linus Torvalds wrote:
> I wonder if you could just have three different file descriptors:
> 
>  - the "current token file descriptor"
>  - a /dev/null file descriptor
>  - the jobserver pipe file descriptor. This is left blocking.

If I'm understanding your suggestion correctly, this is pretty much how
it worked originally.  Except I didn't use a /dev/null file descriptor:
I dup the jobserver FD, read() the dup, and then in the signal handler
I _closed_ the dup.

So, I can tell what happened by whether the read() returns EINTR vs.
EBADF.

This is still how things work on systems without pselect() support.

There's a blog page I wrote about this many (er... MANY) years ago:

http://make.mad-scientist.net/papers/jobserver-implementation/

You can skip the blather at the top: search down to "Then it gets
ugly..." and check the algorithm there to see if that's what you had in
mind.


The problem is that SA_RESTART works well in Linux but is not always
reliable on other OS's, and it's not always possible to catch EINTR
everywhere (for example, in third party libraries like libintl) so you
can't always automatically restart from a SIGCHLD.

See this bug reported on OS X for example:
  https://savannah.gnu.org/bugs/?func=detailitem&item_id=46261


But, the issue I have now is not related to SIGCHLD, it's related to
SIGINT/SIGTERM etc. handling.  Today, GNU make calls its die() function
directly from the SIGINT signal handler, and die() does A LOT of very
signal-unsafe stuff.  Gross, but it's worked like that for 30 years.

My idea was that since GNU make already uses an EINTRLOOP() macro to
check EINTR returns from system calls and restart, I would have the
SIGINT handler just set a variable, then install into the EINTRLOOP
macro a check of that variable, which would let me transition from
signal handler to user-space without dropping checks all over the code
(at least, not visibly :)).  Maybe I'd check directly in a few places,
where we may go for some time without a system call.

That worked but I do have to contend with the same issue as the
jobserver: any time I need to wait for something I have a race between
the last time I checked for an interrupt and the wait operation.

I have had to put this aside for a bit, so I haven't decided how to
address it but I'm sure it's doable.  The issue with GNU make is it has
to be as portable as possible: make is a foundational tool for any sort
of system bootstrap.  Of course, we could just not support parallel
builds on insufficient systems but...

If people feel this isn't an appropriate topic for lkml, I invite
anyone interested parties to post to bug-make@gnu.org :).


^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: I disabled more compiler warnings..
  2020-05-11 17:58     ` Paul Smith
  2020-05-11 19:33       ` Linus Torvalds
@ 2020-05-11 21:09       ` David Laight
  2020-05-12  7:55         ` David Laight
  1 sibling, 1 reply; 17+ messages in thread
From: David Laight @ 2020-05-11 21:09 UTC (permalink / raw)
  To: 'psmith@gnu.org', Linus Torvalds
  Cc: Arnd Bergmann, Masahiro Yamada, Linux Kernel Mailing List

From: Paul Smith
> Sent: 11 May 2020 18:59
> On Mon, 2020-05-11 at 10:41 -0700, Linus Torvalds wrote:
> > On Mon, May 11, 2020 at 12:43 AM David Laight <
> > David.Laight@aculab.com> wrote:
> > >
> > > I've not looked inside gmake, but I fixed nmake so that it
> > > properly used a single job token pipe for the entire (NetBSD)
> > > build and then flushed and refilled it with 'abort' tokens
> > > when any command failed.
> > > That made the build stop almost immediately.
> >
> > The GNU jobserver doesn't have anything like that, afaik.
> >
> > I think it always writes a '+' character as a token, so I guess it
> > could be extended to write something else for the "abort now"
> > situation (presumably a '-' character).
> 
> That was exactly my plan.

ISTR using '*' :-) Was a long time ago.

One problem is ensuring that all the recursive makes actually
use the same token queue.
The Linux kernel build acts as though the sub-makes have their
own queue - I certainly had to fix that as well.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: I disabled more compiler warnings..
  2020-05-11 21:09       ` David Laight
@ 2020-05-12  7:55         ` David Laight
  2020-05-12 14:35           ` Paul Smith
  0 siblings, 1 reply; 17+ messages in thread
From: David Laight @ 2020-05-12  7:55 UTC (permalink / raw)
  To: 'psmith@gnu.org', 'Linus Torvalds'
  Cc: 'Arnd Bergmann', 'Masahiro Yamada',
	'Linux Kernel Mailing List'

From: David Laight
> Sent: 11 May 2020 22:09
> From: Paul Smith
> > Sent: 11 May 2020 18:59
> > On Mon, 2020-05-11 at 10:41 -0700, Linus Torvalds wrote:
> > > On Mon, May 11, 2020 at 12:43 AM David Laight <
> > > David.Laight@aculab.com> wrote:
> > > >
> > > > I've not looked inside gmake, but I fixed nmake so that it
> > > > properly used a single job token pipe for the entire (NetBSD)
> > > > build and then flushed and refilled it with 'abort' tokens
> > > > when any command failed.
> > > > That made the build stop almost immediately.
> > >
> > > The GNU jobserver doesn't have anything like that, afaik.
> > >
> > > I think it always writes a '+' character as a token, so I guess it
> > > could be extended to write something else for the "abort now"
> > > situation (presumably a '-' character).
> >
> > That was exactly my plan.
> 
> ISTR using '*' :-) Was a long time ago.
> 
> One problem is ensuring that all the recursive makes actually
> use the same token queue.
> The Linux kernel build acts as though the sub-makes have their
> own queue - I certainly had to fix that as well.

I think I've remembered the obvious thing that made it work better.

When a job ends it is important to get a new token from the jobserver
rather than reusing the one to hand.
Otherwise you don't seen the 'abort' marker for ages.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: I disabled more compiler warnings..
  2020-05-12  7:55         ` David Laight
@ 2020-05-12 14:35           ` Paul Smith
  2020-05-12 15:04             ` David Laight
  0 siblings, 1 reply; 17+ messages in thread
From: Paul Smith @ 2020-05-12 14:35 UTC (permalink / raw)
  To: David Laight, 'Linus Torvalds'
  Cc: 'Arnd Bergmann', 'Masahiro Yamada',
	'Linux Kernel Mailing List'

On Tue, 2020-05-12 at 07:55 +0000, David Laight wrote:
> > One problem is ensuring that all the recursive makes actually
> > use the same token queue.
> > The Linux kernel build acts as though the sub-makes have their
> > own queue - I certainly had to fix that as well.

I don't understand this... I guess I'm not familiar enough with the
kernel build system.

> I think I've remembered the obvious thing that made it work better.
> 
> When a job ends it is important to get a new token from the jobserver
> rather than reusing the one to hand.
> Otherwise you don't seen the 'abort' marker for ages.

If GNU make retrieved a token then it will always put that token back
into the jobserver pipe when the job ends, and get another one when the
next job is to start.  To do otherwise would mean that some makes could
hoard tokens.

However, the jobserver is implemented such that make itself is not
considered a job, even a sub-make.  The way it works is that when you
invoke a recursive make the parent make will obtain a jobserver token
for that recursive invocation (like it does for every job), then that
sub-make can "pass on" that token: in other words, the sub-make has a
free token that it can always use without querying the jobserver.

This way every invocation of recursive make can always make progress,
at least serially.

I can see that in the "fast fail" model this could be problematic, but
it should only ever be an issue in situations where a sub-make was
running serially for some reason: either the structure of the
prerequisites means it's naturally serial, or else someone added
.NOTPARALLEL to the makefile or something.  As soon as make wants to
run a second job in parallel it will go to the jobserver and discover
the "failure" token.

Changing this will require thought.  We can't just skip the free token
otherwise you can get into a state where all your tokens are used by
recursive makes and no make can get a new token to run a job.

I can see two possible solutions:

First, when a sub-make starts it could put back one token into the
jobserver, representing the token the parent make obtained for it, then
proceed to always get a token before every job (no free token).  This
means that sometimes a sub-make won't be able to run any jobs at all:
it can get locked out waiting for a token.  Maybe that's not a problem.

The other idea is to keep the free token but make it a last resort
rather than a first resort.  This has the nice properties that (a)
we'll see failures fast and (b) we still have a free token, but the
code is more complex: basically we'd need to perform a non-blocking
read on the jobserver FD and if we didn't get anything back, we'd use
our free token if it's still available: if not we'd do a blocking read
on the jobserver FD to wait for a new token.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: I disabled more compiler warnings..
  2020-05-12 14:35           ` Paul Smith
@ 2020-05-12 15:04             ` David Laight
  2020-05-12 16:55               ` Paul Smith
  0 siblings, 1 reply; 17+ messages in thread
From: David Laight @ 2020-05-12 15:04 UTC (permalink / raw)
  To: 'psmith@gnu.org', 'Linus Torvalds'
  Cc: 'Arnd Bergmann', 'Masahiro Yamada',
	'Linux Kernel Mailing List'

From: Paul Smith
> Sent: 12 May 2020 15:36
> On Tue, 2020-05-12 at 07:55 +0000, David Laight wrote:
> > > One problem is ensuring that all the recursive makes actually
> > > use the same token queue.
> > > The Linux kernel build acts as though the sub-makes have their
> > > own queue - I certainly had to fix that as well.
> 
> I don't understand this... I guess I'm not familiar enough with the
> kernel build system.

Don't worry. I'm rather guessing how gmake and the kernel makefile
interact based on changes I made to NetBSD's make and makefiles
almost 20 years ago.

I think there were some sub-makes that were started with make
instead of $(MAKE) so ended up creating a new job pipe.
(The pipe fds were added to argv[] by $(MAKE))

> > I think I've remembered the obvious thing that made it work better.
> >
> > When a job ends it is important to get a new token from the jobserver
> > rather than reusing the one to hand.
> > Otherwise you don't seen the 'abort' marker for ages.
> 
> If GNU make retrieved a token then it will always put that token back
> into the jobserver pipe when the job ends, and get another one when the
> next job is to start.  To do otherwise would mean that some makes could
> hoard tokens.
> 
> However, the jobserver is implemented such that make itself is not
> considered a job, even a sub-make.  The way it works is that when you
> invoke a recursive make the parent make will obtain a jobserver token
> for that recursive invocation (like it does for every job), then that
> sub-make can "pass on" that token: in other words, the sub-make has a
> free token that it can always use without querying the jobserver.
> 
> This way every invocation of recursive make can always make progress,
> at least serially.
> 
> I can see that in the "fast fail" model this could be problematic, but
> it should only ever be an issue in situations where a sub-make was
> running serially for some reason: either the structure of the
> prerequisites means it's naturally serial, or else someone added
> .NOTPARALLEL to the makefile or something.  As soon as make wants to
> run a second job in parallel it will go to the jobserver and discover
> the "failure" token.
> 
> Changing this will require thought.  We can't just skip the free token
> otherwise you can get into a state where all your tokens are used by
> recursive makes and no make can get a new token to run a job.
> 
> I can see two possible solutions:
> 
> First, when a sub-make starts it could put back one token into the
> jobserver, representing the token the parent make obtained for it, then
> proceed to always get a token before every job (no free token).  This
> means that sometimes a sub-make won't be able to run any jobs at all:
> it can get locked out waiting for a token.  Maybe that's not a problem.
> 
> The other idea is to keep the free token but make it a last resort
> rather than a first resort.  This has the nice properties that (a)
> we'll see failures fast and (b) we still have a free token, but the
> code is more complex: basically we'd need to perform a non-blocking
> read on the jobserver FD and if we didn't get anything back, we'd use
> our free token if it's still available: if not we'd do a blocking read
> on the jobserver FD to wait for a new token.

Doesn't it do blocking reads with SIGCHLD enabled?
(or hopefully ppoll() to avoid the race)

Another option is for the 'parent' make to return (or not acquire)
a job token for $(MAKE) commands.
Then the sub-make have to acquire a token for every command.
make has to know about $(MAKE) because they are special in all sorts
of ways.
But that won't work well if the old and new versions ever interact.

Or, require the sub-make acquire a token in order to exit.
Then it can free the token when every job terminates.

I can't remember what I did to netbsd's make.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: I disabled more compiler warnings..
  2020-05-12 15:04             ` David Laight
@ 2020-05-12 16:55               ` Paul Smith
  2020-05-13  8:21                 ` David Laight
  0 siblings, 1 reply; 17+ messages in thread
From: Paul Smith @ 2020-05-12 16:55 UTC (permalink / raw)
  To: David Laight, 'Linus Torvalds'
  Cc: 'Arnd Bergmann', 'Masahiro Yamada',
	'Linux Kernel Mailing List'

On Tue, 2020-05-12 at 15:04 +0000, David Laight wrote:
> I think there were some sub-makes that were started with make
> instead of $(MAKE) so ended up creating a new job pipe.

Oh, yes, that will do it.

> Doesn't it do blocking reads with SIGCHLD enabled?

No, because it's racy (by itself).

> (or hopefully ppoll() to avoid the race)

GNU make uses pselect(), on systems that support it.  On systems that
don't support pselect() it uses a trick I described in another email:
we dup() the FD, read() on the dup, then in the SIGCHLD handler we
close() the dup.

> Another option is for the 'parent' make to return (or not acquire)
> a job token for $(MAKE) commands.

It just feels cleaner to me to have the parent simply always take the
token, and leave it up to the child to put it back if appropriate,
rather than the parent putting it back.

Having the parent not acquire a token at all won't work; without
limiting sub-makes it means you might have 100's of them running at the
same time, even with -j2 or whatever.

> Or, require the sub-make acquire a token in order to exit.
> Then it can free the token when every job terminates.

I'm not sure I follow that?


^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: I disabled more compiler warnings..
  2020-05-12 16:55               ` Paul Smith
@ 2020-05-13  8:21                 ` David Laight
  2020-05-13 15:32                   ` Paul Smith
  0 siblings, 1 reply; 17+ messages in thread
From: David Laight @ 2020-05-13  8:21 UTC (permalink / raw)
  To: 'psmith@gnu.org', 'Linus Torvalds'
  Cc: 'Arnd Bergmann', 'Masahiro Yamada',
	'Linux Kernel Mailing List'

From: Paul Smith
> Sent: 12 May 2020 17:55
> On Tue, 2020-05-12 at 15:04 +0000, David Laight wrote:
> > I think there were some sub-makes that were started with make
> > instead of $(MAKE) so ended up creating a new job pipe.
> 
> Oh, yes, that will do it.
> 
> > Doesn't it do blocking reads with SIGCHLD enabled?
> 
> No, because it's racy (by itself).
> 
> > (or hopefully ppoll() to avoid the race)
> 
> GNU make uses pselect(), on systems that support it.  On systems that
> don't support pselect() it uses a trick I described in another email:
> we dup() the FD, read() on the dup, then in the SIGCHLD handler we
> close() the dup.

Does that even work - seems like it requires close() to abort poll().
Better is to just have the SIGCHLD handler write a byte into a pipe.

> > Another option is for the 'parent' make to return (or not acquire)
> > a job token for $(MAKE) commands.
> 
> It just feels cleaner to me to have the parent simply always take the
> token, and leave it up to the child to put it back if appropriate,
> rather than the parent putting it back.
> 
> Having the parent not acquire a token at all won't work; without
> limiting sub-makes it means you might have 100's of them running at the
> same time, even with -j2 or whatever.

Hmmm... 
That means the sub-make must always hold one token.
Otherwise the parent-make could use it to create a new sub-make.

Actually the token pipe can be opened NON_BLOCK because poll()
can/will be used to wait for a token.

So you always try to read a token - even when you have one 'in your hand'
(either entry or because a job just finished).
If it isn't the 'abort' one, put it back.
A bit of faffing on the token pipe isn't going to affect the performance
when it is about to do fork+exec.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: I disabled more compiler warnings..
  2020-05-13  8:21                 ` David Laight
@ 2020-05-13 15:32                   ` Paul Smith
  2020-05-13 15:53                     ` David Laight
  0 siblings, 1 reply; 17+ messages in thread
From: Paul Smith @ 2020-05-13 15:32 UTC (permalink / raw)
  To: David Laight, 'Linus Torvalds'
  Cc: 'Arnd Bergmann', 'Masahiro Yamada',
	'Linux Kernel Mailing List'

On Wed, 2020-05-13 at 08:21 +0000, David Laight wrote:
> > GNU make uses pselect(), on systems that support it.  On systems
> > that don't support pselect() it uses a trick I described in another
> > email: we dup() the FD, read() on the dup, then in the SIGCHLD
> > handler we close() the dup.
> 
> Does that even work - seems like it requires close() to abort poll().
> Better is to just have the SIGCHLD handler write a byte into a pipe.

Sorry I guess I wasn't clear.

We use _either_ pselect(), _or_ the close()-the-dup trick.  Not both.

If we have pselect() we use that: no dup'ing or close'ing needed.

If we don't have pselect() we use the close() in the signal handler. 
In that case we're just waiting in the read(), we're not using select()
or poll() or whatever.  It's definitely the case that if we're waiting
in read() and someone closes the FD, we'll wake up! :)

> > Having the parent not acquire a token at all won't work; without
> > limiting sub-makes it means you might have 100's of them running at
> > the same time, even with -j2 or whatever.
> 
> Hmmm... 
> That means the sub-make must always hold one token.
> Otherwise the parent-make could use it to create a new sub-make.

Right, my first idea has this same problem so it won't work.

> Actually the token pipe can be opened NON_BLOCK because poll()
> can/will be used to wait for a token.

Again, that only works on systems where pselect() is available.

> So you always try to read a token - even when you have one 'in your
> hand' (either entry or because a job just finished).  If it isn't the
> 'abort' one, put it back.

Something like that would be needed, yes.

Note this is only needed in a rare situation where you're running with
parallelism enabled BUT you have a makefile which never actually tries
to run two or more jobs at the same time for some reason.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: I disabled more compiler warnings..
  2020-05-13 15:32                   ` Paul Smith
@ 2020-05-13 15:53                     ` David Laight
  2020-05-13 16:06                       ` Paul Smith
  0 siblings, 1 reply; 17+ messages in thread
From: David Laight @ 2020-05-13 15:53 UTC (permalink / raw)
  To: 'psmith@gnu.org', 'Linus Torvalds'
  Cc: 'Arnd Bergmann', 'Masahiro Yamada',
	'Linux Kernel Mailing List'

From: Paul Smith
> Sent: 13 May 2020 16:33
> On Wed, 2020-05-13 at 08:21 +0000, David Laight wrote:
...
> If we don't have pselect() we use the close() in the signal handler.
> In that case we're just waiting in the read(), we're not using select()
> or poll() or whatever.  It's definitely the case that if we're waiting
> in read() and someone closes the FD, we'll wake up! :)

Ugg, that is relying on getting either EINTR or EBADFD.
I can't remember if Posix allows SIGCHLD to be delivered
in a different thread.
Windows definitely likes delivering signals that way :-)

> > > Having the parent not acquire a token at all won't work; without
> > > limiting sub-makes it means you might have 100's of them running at
> > > the same time, even with -j2 or whatever.
> >
> > Hmmm...
> > That means the sub-make must always hold one token.
> > Otherwise the parent-make could use it to create a new sub-make.
> 
> Right, my first idea has this same problem so it won't work.
> 
> > Actually the token pipe can be opened NON_BLOCK because poll()
> > can/will be used to wait for a token.
> 
> Again, that only works on systems where pselect() is available.
> 
> > So you always try to read a token - even when you have one 'in your
> > hand' (either entry or because a job just finished).  If it isn't the
> > 'abort' one, put it back.
> 
> Something like that would be needed, yes.
> 
> Note this is only needed in a rare situation where you're running with
> parallelism enabled BUT you have a makefile which never actually tries
> to run two or more jobs at the same time for some reason.

I did have to recirculate the tokens.
Can't exactly remember why.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: I disabled more compiler warnings..
  2020-05-13 15:53                     ` David Laight
@ 2020-05-13 16:06                       ` Paul Smith
  0 siblings, 0 replies; 17+ messages in thread
From: Paul Smith @ 2020-05-13 16:06 UTC (permalink / raw)
  To: David Laight, 'Linus Torvalds'
  Cc: 'Arnd Bergmann', 'Masahiro Yamada',
	'Linux Kernel Mailing List'

On Wed, 2020-05-13 at 15:53 +0000, David Laight wrote:
> > If we don't have pselect() we use the close() in the signal
> > handler. In that case we're just waiting in the read(), we're not
> > using select() or poll() or whatever.  It's definitely the case
> > that if we're waiting in read() and someone closes the FD, we'll
> > wake up! :)
> 
> Ugg, that is relying on getting either EINTR or EBADFD.

Yes, exactly.

> I can't remember if Posix allows SIGCHLD to be delivered
> in a different thread.

It does: all you have to do is ensure that all threads other than the
one you want block the signal.

However, GNU make is not multithreaded so this is moot.

> Windows definitely likes delivering signals that way :-)

In Windows IIRC GNU make doesn't use this method at all; Windows has
some kind of process shared semaphore that is used instead.  I didn't
write that code so I can't really describe it in detail :)


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2020-05-13 16:06 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-10 19:33 I disabled more compiler warnings Linus Torvalds
2020-05-11  7:43 ` David Laight
2020-05-11 17:41   ` Linus Torvalds
2020-05-11 17:58     ` Paul Smith
2020-05-11 19:33       ` Linus Torvalds
2020-05-11 20:25         ` Paul Smith
2020-05-11 21:09       ` David Laight
2020-05-12  7:55         ` David Laight
2020-05-12 14:35           ` Paul Smith
2020-05-12 15:04             ` David Laight
2020-05-12 16:55               ` Paul Smith
2020-05-13  8:21                 ` David Laight
2020-05-13 15:32                   ` Paul Smith
2020-05-13 15:53                     ` David Laight
2020-05-13 16:06                       ` Paul Smith
2020-05-11  9:03 ` Arnd Bergmann
2020-05-11 11:11   ` Arnd Bergmann

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.