git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3] ci sanitizer cleanups and performance improvements
@ 2023-06-01 18:02 Jeff King
  2023-06-01 18:03 ` [PATCH 1/3] ci: use clang for ASan/UBSan checks Jeff King
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: Jeff King @ 2023-06-01 18:02 UTC (permalink / raw)
  To: git; +Cc: Teng Long, Junio C Hamano

Here are a few patches to tweak our CI sanitizer setup. The first one
hopefully increases coverage. The other two just aim to reduce the
amount of CPU we use.

  [1/3]: ci: use clang for ASan/UBSan checks
  [2/3]: ci: run ASan/UBSan in a single job
  [3/3]: ci: drop linux-clang job

 .github/workflows/main.yml | 10 ++--------
 ci/lib.sh                  |  7 ++-----
 2 files changed, 4 insertions(+), 13 deletions(-)

-Peff

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH 1/3] ci: use clang for ASan/UBSan checks
  2023-06-01 18:02 [PATCH 0/3] ci sanitizer cleanups and performance improvements Jeff King
@ 2023-06-01 18:03 ` Jeff King
  2023-06-01 18:09 ` [PATCH 2/3] ci: run ASan/UBSan in a single job Jeff King
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: Jeff King @ 2023-06-01 18:03 UTC (permalink / raw)
  To: git; +Cc: Teng Long, Junio C Hamano

Both gcc and clang support the "address" and "undefined" sanitizers.
However, they may produce different results. We've seen at least two
real world cases where gcc missed a UBSan problem but clang found it:

  1. Clang's UBSan (using clang 14.0.6) found a string index that was
     subtracted to "-1", causing an out-of-bounds read (curiously this
     didn't trigger ASan, but that may be because the string was in the
     argv memory, not stack or heap). Using gcc (version 12.2.0) didn't
     find the same problem.

     Original thread:
     https://lore.kernel.org/git/20230519005447.GA2955320@coredump.intra.peff.net/

  2. Clang's UBSan (using clang 4.0.1) complained about pointer
     arithmetic with NULL, but gcc at the time did not. This was in
     2017, and modern gcc does seem to find the issue, though.

     Original thread:
     https://lore.kernel.org/git/32a8b949-638a-1784-7fba-948ae32206fc@web.de/

Since we don't otherwise have a particular preference for one compiler
over the other for this test, let's switch to the one that we think may
be more thorough.

Note that it's entirely possible that the two are simply _different_,
and we are trading off problems that gcc would find that clang wouldn't.
However, my subjective and anecdotal experience has been that clang's
sanitizer support is a bit more mature (e.g., I recall other oddities
around leak-checking where clang performed more sensibly).

Obviously running both and cross-checking the results would give us the
best coverage, but that's very expensive to run (and these are already
some of our most expensive CI jobs). So let's use clang as our best
guess, and we can re-evaluate if we get more data points.

Signed-off-by: Jeff King <peff@peff.net>
---
You can see that this catches the recent problem when merged with
'next':

  https://github.com/peff/git/actions/runs/5102807385/jobs/9172605638

 .github/workflows/main.yml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
index 30492eacdd..487ad31e66 100644
--- a/.github/workflows/main.yml
+++ b/.github/workflows/main.yml
@@ -274,10 +274,10 @@ jobs:
             cc: gcc
             pool: ubuntu-latest
           - jobname: linux-asan
-            cc: gcc
+            cc: clang
             pool: ubuntu-latest
           - jobname: linux-ubsan
-            cc: gcc
+            cc: clang
             pool: ubuntu-latest
     env:
       CC: ${{matrix.vector.cc}}
-- 
2.41.0.346.g8d12207a4f


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH 2/3] ci: run ASan/UBSan in a single job
  2023-06-01 18:02 [PATCH 0/3] ci sanitizer cleanups and performance improvements Jeff King
  2023-06-01 18:03 ` [PATCH 1/3] ci: use clang for ASan/UBSan checks Jeff King
@ 2023-06-01 18:09 ` Jeff King
  2023-06-01 18:10 ` [PATCH 3/3] ci: drop linux-clang job Jeff King
  2023-06-03  1:35 ` [PATCH 0/3] ci sanitizer cleanups and performance improvements Junio C Hamano
  3 siblings, 0 replies; 5+ messages in thread
From: Jeff King @ 2023-06-01 18:09 UTC (permalink / raw)
  To: git; +Cc: Teng Long, Junio C Hamano

When we started running sanitizers in CI via 1c0962c0c4 (ci: add address
and undefined sanitizer tasks, 2022-10-20), we ran them as two separate
CI jobs, since as that commit notes, the combination "seems to take
forever".

And indeed, it does with gcc. However, since the previous commit
switched to using clang, the situation is different, and we can save
some CPU by using a single job for both. Comparing before/after CI runs,
this saved about 14 minutes (the single combined job took 54m, versus
44m plus 24m for ASan and UBSan jobs, respectively). That's wall-clock
and not CPU, but since our jobs are mostly CPU-bound, the two should be
closely proportional.

This does increase the end-to-end time of a CI run, though, since before
this patch the two jobs could run in parallel, and the sanitizer job is
our longest single job. It also means that we won't get a separate
result for "this passed with UBSan but not with ASan" or vice versa).
But as 1c0962c0c4 noted, that is not a very useful signal in practice.

Below are some more detailed timings of gcc vs clang that I measured by
running the test suite on my local workstation. Each measurement counts
only the time to run the test suite with each compiler (not the compile
time itself). We'll focus on the wall-clock times for simplicity, though
the CPU times follow roughly similar trends.

Here's a run with CC=gcc as a baseline:

  real	1m12.931s
  user	9m30.566s
  sys	8m9.538s

Running with SANITIZE=address increases the time by a factor of ~4.7x:

  real	5m40.352s
  user	49m37.044s
  sys	36m42.950s

Running with SANITIZE=undefined increases the time by a factor of ~1.7x:

  real	2m5.956s
  user	12m42.847s
  sys	19m27.067s

So let's call that 6.4 time units to run them separately (where a unit
is the time it takes to run the test suite with no sanitizers). As a
simplistic model, we might imagine that running them together would take
5.4 units (we save 1 unit because we are no longer running the test
suite twice, but just paying the sanitizer overhead on top of a single
run).

But that's not what happens. Running with SANITIZE=address,undefined
results in a factor of 9.3x:

  real	11m9.817s
  user	77m31.284s
  sys	96m40.454s

So not only did we not get faster when doing them together, we actually
spent 1.5x as much CPU as doing them separately! And while those
wall-clock numbers might not look too terrible, keep in mind that this
is on an unloaded 8-core machine. In the CI environment, wall-clock
times will be much closer to CPU times. So not only are we wasting CPU,
but we risk hitting timeouts.

Now let's try the same thing with clang. Here's our no-sanitizer
baseline run, which is almost identical to the gcc one (which is quite
convenient, because we can keep using the same "time units" to get an
apples-to-apples comparison):

  real	1m11.844s
  user	9m28.313s
  sys	8m8.240s

And now again with SANITIZE=address, we get a 5x factor (so slightly
worse than gcc's 4.7x, though I wouldn't read too much into it; there is
a fair bit of run-to-run noise):

  real	6m7.662s
  user	49m24.330s
  sys	44m13.846s

And with SANITIZE=undefined, we are at 1.5x, slightly outperforming gcc
(though again, that's probably mostly noise):

  real	1m50.028s
  user	11m0.973s
  sys	16m42.731s

So running them separately, our total cost is 6.5x. But if we combine
them in a single run (SANITIZE=address,undefined), we get:

  real	6m51.804s
  user	52m32.049s
  sys	51m46.711s

which is a factor of 5.7x. That's along the lines we'd hoped for!
Running them together saves us almost a whole time unit. And that's not
counting any time spent outside the test suite itself (starting the job,
setting up the environment, compiling) that we're no longer duplicating
by having two jobs.

So clang behaves like we'd hope: the overhead to run the sanitizers is
additive as you add more sanitizers. Whereas gcc's numbers seem very
close to multiplicative, almost as if the sanitizers were enforcing
their overheads on each other (though that is purely a guess on what is
going on; ultimately what matters to us is the amount of time it takes).

And that roughly matches the CI improvement I saw. A "time unit" there
is more like 12 minutes, and the observed time savings was 14 minutes
(with the extra presumably coming from avoiding duplicated setup, etc).

Signed-off-by: Jeff King <peff@peff.net>
---
I'm not excited about increasing the total run-time. However, if we want
to deal with that, I suspect a better solution is to split the tests up
into several parallel jobs the way the Windows ones do. That's a lot
more work to set up, but it would let us bring the runtime down to match
other tests (which seem to take 15-20 minutes, versus 44 minutes even
before this patch).

 .github/workflows/main.yml | 5 +----
 ci/lib.sh                  | 7 ++-----
 2 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
index 487ad31e66..2114303b7d 100644
--- a/.github/workflows/main.yml
+++ b/.github/workflows/main.yml
@@ -273,10 +273,7 @@ jobs:
           - jobname: linux-leaks
             cc: gcc
             pool: ubuntu-latest
-          - jobname: linux-asan
-            cc: clang
-            pool: ubuntu-latest
-          - jobname: linux-ubsan
+          - jobname: linux-asan-ubsan
             cc: clang
             pool: ubuntu-latest
     env:
diff --git a/ci/lib.sh b/ci/lib.sh
index db7105e8a8..369d462f13 100755
--- a/ci/lib.sh
+++ b/ci/lib.sh
@@ -278,11 +278,8 @@ linux-leaks)
 	export GIT_TEST_PASSING_SANITIZE_LEAK=true
 	export GIT_TEST_SANITIZE_LEAK_LOG=true
 	;;
-linux-asan)
-	export SANITIZE=address
-	;;
-linux-ubsan)
-	export SANITIZE=undefined
+linux-asan-ubsan)
+	export SANITIZE=address,undefined
 	;;
 esac
 
-- 
2.41.0.346.g8d12207a4f


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH 3/3] ci: drop linux-clang job
  2023-06-01 18:02 [PATCH 0/3] ci sanitizer cleanups and performance improvements Jeff King
  2023-06-01 18:03 ` [PATCH 1/3] ci: use clang for ASan/UBSan checks Jeff King
  2023-06-01 18:09 ` [PATCH 2/3] ci: run ASan/UBSan in a single job Jeff King
@ 2023-06-01 18:10 ` Jeff King
  2023-06-03  1:35 ` [PATCH 0/3] ci sanitizer cleanups and performance improvements Junio C Hamano
  3 siblings, 0 replies; 5+ messages in thread
From: Jeff King @ 2023-06-01 18:10 UTC (permalink / raw)
  To: git; +Cc: Teng Long, Junio C Hamano

Since the linux-asan-ubsan job runs using clang under Linux, there is
not much point in running a separate clang job. Any errors that a normal
clang compile-and-test cycle would find are likely to be a subset of
what the sanitizer job will find. Since this job takes ~14 minutes to
run in CI, this shaves off some of our CPU load (though it does not
affect end-to-end runtime, since it's typically run in parallel and is
not the longest job).

Technically this provides us with slightly less signal for a given run,
since you won't immediately know if a failure in the sanitizer job is
from using clang or from the sanitizers themselves. But it's generally
obvious from the logs, and anyway your next step would be to fix the
probvlem and re-run CI, since we expect all of these jobs to pass
normally.

Signed-off-by: Jeff King <peff@peff.net>
---
Obviously not necessary, but just seemed like another easy way to trim
our resource usage. Really, the same logic could apply before my series
to the linux-gcc job, which is similarly redundant.

 .github/workflows/main.yml | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
index 2114303b7d..079645b776 100644
--- a/.github/workflows/main.yml
+++ b/.github/workflows/main.yml
@@ -246,9 +246,6 @@ jobs:
       fail-fast: false
       matrix:
         vector:
-          - jobname: linux-clang
-            cc: clang
-            pool: ubuntu-latest
           - jobname: linux-sha256
             cc: clang
             pool: ubuntu-latest
-- 
2.41.0.346.g8d12207a4f

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH 0/3] ci sanitizer cleanups and performance improvements
  2023-06-01 18:02 [PATCH 0/3] ci sanitizer cleanups and performance improvements Jeff King
                   ` (2 preceding siblings ...)
  2023-06-01 18:10 ` [PATCH 3/3] ci: drop linux-clang job Jeff King
@ 2023-06-03  1:35 ` Junio C Hamano
  3 siblings, 0 replies; 5+ messages in thread
From: Junio C Hamano @ 2023-06-03  1:35 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Teng Long

Jeff King <peff@peff.net> writes:

> Here are a few patches to tweak our CI sanitizer setup. The first one
> hopefully increases coverage. The other two just aim to reduce the
> amount of CPU we use.
>
>   [1/3]: ci: use clang for ASan/UBSan checks
>   [2/3]: ci: run ASan/UBSan in a single job
>   [3/3]: ci: drop linux-clang job
>
>  .github/workflows/main.yml | 10 ++--------
>  ci/lib.sh                  |  7 ++-----
>  2 files changed, 4 insertions(+), 13 deletions(-)

Makes sense.  Will queue.  Thanks.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2023-06-03  1:35 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-01 18:02 [PATCH 0/3] ci sanitizer cleanups and performance improvements Jeff King
2023-06-01 18:03 ` [PATCH 1/3] ci: use clang for ASan/UBSan checks Jeff King
2023-06-01 18:09 ` [PATCH 2/3] ci: run ASan/UBSan in a single job Jeff King
2023-06-01 18:10 ` [PATCH 3/3] ci: drop linux-clang job Jeff King
2023-06-03  1:35 ` [PATCH 0/3] ci sanitizer cleanups and performance improvements Junio C Hamano

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).