All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3] improve audit syscall-exit latency
@ 2022-09-27 22:59 ` Ankur Arora
  0 siblings, 0 replies; 40+ messages in thread
From: Ankur Arora @ 2022-09-27 22:59 UTC (permalink / raw)
  To: linux-audit; +Cc: paul, eparis, linux-kernel, boris.ostrovsky, konrad.wilk

This series attempts to reduce syscall-exit latency in the audit path,
especially for cases where there are a lot of audit exit rules.

From profiling, audit_filter_syscall() takes the largest chunk of time,
specifically in this list-walk while processing the AUDIT_FILTER_EXIT
list:

      list = &audit_filter_list[AUDIT_FILTER_EXIT];

      list_for_each_entry_rcu(e, list, list) {
          if (audit_in_mask(&e->rule, ctx->major) &&
              audit_filter_rules(tsk, &e->rule, ctx, NULL,
                                 &state, false, x)) {
                  rcu_read_unlock();
                  ctx->current_state = state;
                  return state;
          }
      }

(Note that almost identical logic exists in audit_filter_uring(),
audit_filter_inode_name().)

Comparing baseline performance with audit=0/audit=1 with a user-space
getpid() loop (executes 10^7 times.) For audit=1, this uses an
audit-rule set where the audit_filter_syscall() loop iterates over
42 AUDIT_FILTER_EXIT rules which, for getpid(), calls audit_filter_rules()
for 5 of them (we use this set of rules in production.)

Test system:
 Server: ORACLE SERVER X8-2L
 CPU: Intel Skylakex (6:85:6)
 Microcode: 0x400320a

 # v6.0.0-rc5.baseline, audit=0
 Performance counter stats for 'bin/getpid' (3 runs):

            734.10 msec task-clock                       #    0.999 CPUs utilized            ( +-  0.03% )
                 1      context-switches                 #    1.361 /sec                     ( +- 66.67% )
                 0      cpu-migrations                   #    0.000 /sec
                53      page-faults                      #   72.152 /sec                     ( +-  0.63% )
     2,838,869,156      cycles                           #    3.865 GHz                      ( +-  0.01% )  (62.40%)
     4,174,224,305      instructions                     #    1.47  insn per cycle           ( +-  0.01% )  (74.93%)
       890,798,133      branches                         #    1.213 G/sec                    ( +-  0.01% )  (74.93%)
         5,015,118      branch-misses                    #    0.56% of all branches          ( +-  0.05% )  (74.93%)
     1,231,150,558      L1-dcache-loads                  #    1.676 G/sec                    ( +-  0.01% )  (74.94%)
           418,297      L1-dcache-load-misses            #    0.03% of all L1-dcache accesses  ( +-  0.68% )  (75.07%)
             3,937      LLC-loads                        #    5.360 K/sec                    ( +-  3.76% )  (50.13%)
               510      LLC-load-misses                  #   13.39% of all LL-cache accesses  ( +- 79.89% )  (50.00%)

          0.735018 +- 0.000275 seconds time elapsed  ( +-  0.04% )

 # v6.0.0-rc5.baseline, audit=1
 Performance counter stats for 'bin/getpid' (3 runs):

          2,158.81 msec task-clock                       #    0.998 CPUs utilized            ( +-  0.13% )
                 2      context-switches                 #    0.925 /sec                     ( +- 28.87% )
                 0      cpu-migrations                   #    0.000 /sec
                52      page-faults                      #   24.056 /sec                     ( +-  0.64% )
     8,364,119,898      cycles                           #    3.869 GHz                      ( +-  0.11% )  (62.48%)
    19,996,521,678      instructions                     #    2.39  insn per cycle           ( +-  0.01% )  (74.98%)
     4,302,068,252      branches                         #    1.990 G/sec                    ( +-  0.00% )  (74.98%)
        15,135,694      branch-misses                    #    0.35% of all branches          ( +-  0.16% )  (74.99%)
     4,714,694,841      L1-dcache-loads                  #    2.181 G/sec                    ( +-  0.01% )  (74.99%)
        66,407,857      L1-dcache-load-misses            #    1.41% of all L1-dcache accesses  ( +-  1.50% )  (75.01%)
             6,785      LLC-loads                        #    3.139 K/sec                    ( +- 12.49% )  (50.03%)
             3,235      LLC-load-misses                  #   41.29% of all LL-cache accesses  ( +-  6.08% )  (50.01%)

           2.16213 +- 0.00288 seconds time elapsed  ( +-  0.13% )

perf stat numbers for each getpid() iteration:

		  baseline      baseline 
                   audit=0       audit=1 

  cycles               283           836 
  instructions         417          1999 
  IPC                 1.47          2.39 
  branches              89           430 
  branch-misses       0.50          1.51 
  L1-loads             123           471 
  L1-load-misses        ~0          ~6.6*

  * the L1-load-misses are largely stable for runs across a single
    boot, but vary across boots due to vagaries of the SLAB allocator.

baseline audit=1 spends a significant amount of time executing in audit
code and incurs a new branch-miss and a few new L1-load-misses. Also
note that computed audit-only IPC is 2.86 so the baseline is not
ill-performant code.

Patches
==

Patch 1 "audit: cache ctx->major in audit_filter_syscall()", caches
ctx->major in a local variable. This gets rid of a persistent entry
from L1-dcache  (audit_context::major) that had similar alignment
constraints as a potentially busy cache-set (audit_entry::list) and
allows some of the error checking in audit_in_mask() to be hoisted out
of the loop.

Patch 2: "audit: annotate branch direction for audit_in_mask()", so
the compiler can pessimize the call to audit_filter_rules().

Patch 3, "audit: unify audit_filter_{uring(),inode_name(),syscall()}"
centralizes this logic in a common function.

with these changes:

 Performance counter stats for 'bin/getpid' (3 runs):

          1,750.21 msec task-clock                       #    0.994 CPUs utilized            ( +-  0.45% )
                 3      context-switches                 #    1.705 /sec                     ( +- 11.11% )
                 0      cpu-migrations                   #    0.000 /sec
                52      page-faults                      #   29.548 /sec                     ( +-  0.64% )
     6,770,976,590      cycles                           #    3.848 GHz                      ( +-  0.40% )  (27.74%)
    16,588,372,024      instructions                     #    2.44  insn per cycle           ( +-  0.03% )  (33.34%)
     4,322,555,829      branches                         #    2.456 G/sec                    ( +-  0.02% )  (33.40%)
         2,803,286      branch-misses                    #    0.06% of all branches          ( +- 26.45% )  (33.46%)
     4,449,426,336      L1-dcache-loads                  #    2.528 G/sec                    ( +-  0.01% )  (27.71%)
        63,612,108      L1-dcache-load-misses            #    1.43% of all L1-dcache accesses  ( +-  0.50% )  (27.71%)
             6,123      LLC-loads                        #    3.479 K/sec                    ( +-  8.68% )  (27.71%)
             1,890      LLC-load-misses                  #   26.69% of all LL-cache accesses  ( +- 46.99% )  (27.71%)

           1.76033 +- 0.00791 seconds time elapsed  ( +-  0.45% )

And, overall getpid() latency numbers (aggregated over 12 boots for each):
                Min     Mean    Median    Max         pstdev
                (ns)    (ns)      (ns)    (ns)

 baseline     201.30   216.14   216.22  228.46      (+- 1.45%)
 patch1       196.63   207.86   206.60  230.98      (+- 3.92%)
 patch1-2     173.11   182.51   179.65  202.09      (+- 4.34%)
 patch1-3     162.11   175.26   173.71  190.56      (+- 4.33%)

This gives a reasonable speedup. My testing was on Intel Skylake, but I
suspect this should translate to other archs as well (especially on less
wide architectures.)

Please review.

Thanks
Ankur

Ankur Arora (3):
  audit: cache ctx->major in audit_filter_syscall()
  audit: annotate branch direction for audit_in_mask()
  audit: unify audit_filter_{uring(),inode_name(),syscall()}

 kernel/auditsc.c | 86 +++++++++++++++++++++++++-----------------------
 1 file changed, 45 insertions(+), 41 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 0/3] improve audit syscall-exit latency
@ 2022-09-27 22:59 ` Ankur Arora
  0 siblings, 0 replies; 40+ messages in thread
From: Ankur Arora @ 2022-09-27 22:59 UTC (permalink / raw)
  To: linux-audit; +Cc: boris.ostrovsky, konrad.wilk, eparis, linux-kernel

This series attempts to reduce syscall-exit latency in the audit path,
especially for cases where there are a lot of audit exit rules.

From profiling, audit_filter_syscall() takes the largest chunk of time,
specifically in this list-walk while processing the AUDIT_FILTER_EXIT
list:

      list = &audit_filter_list[AUDIT_FILTER_EXIT];

      list_for_each_entry_rcu(e, list, list) {
          if (audit_in_mask(&e->rule, ctx->major) &&
              audit_filter_rules(tsk, &e->rule, ctx, NULL,
                                 &state, false, x)) {
                  rcu_read_unlock();
                  ctx->current_state = state;
                  return state;
          }
      }

(Note that almost identical logic exists in audit_filter_uring(),
audit_filter_inode_name().)

Comparing baseline performance with audit=0/audit=1 with a user-space
getpid() loop (executes 10^7 times.) For audit=1, this uses an
audit-rule set where the audit_filter_syscall() loop iterates over
42 AUDIT_FILTER_EXIT rules which, for getpid(), calls audit_filter_rules()
for 5 of them (we use this set of rules in production.)

Test system:
 Server: ORACLE SERVER X8-2L
 CPU: Intel Skylakex (6:85:6)
 Microcode: 0x400320a

 # v6.0.0-rc5.baseline, audit=0
 Performance counter stats for 'bin/getpid' (3 runs):

            734.10 msec task-clock                       #    0.999 CPUs utilized            ( +-  0.03% )
                 1      context-switches                 #    1.361 /sec                     ( +- 66.67% )
                 0      cpu-migrations                   #    0.000 /sec
                53      page-faults                      #   72.152 /sec                     ( +-  0.63% )
     2,838,869,156      cycles                           #    3.865 GHz                      ( +-  0.01% )  (62.40%)
     4,174,224,305      instructions                     #    1.47  insn per cycle           ( +-  0.01% )  (74.93%)
       890,798,133      branches                         #    1.213 G/sec                    ( +-  0.01% )  (74.93%)
         5,015,118      branch-misses                    #    0.56% of all branches          ( +-  0.05% )  (74.93%)
     1,231,150,558      L1-dcache-loads                  #    1.676 G/sec                    ( +-  0.01% )  (74.94%)
           418,297      L1-dcache-load-misses            #    0.03% of all L1-dcache accesses  ( +-  0.68% )  (75.07%)
             3,937      LLC-loads                        #    5.360 K/sec                    ( +-  3.76% )  (50.13%)
               510      LLC-load-misses                  #   13.39% of all LL-cache accesses  ( +- 79.89% )  (50.00%)

          0.735018 +- 0.000275 seconds time elapsed  ( +-  0.04% )

 # v6.0.0-rc5.baseline, audit=1
 Performance counter stats for 'bin/getpid' (3 runs):

          2,158.81 msec task-clock                       #    0.998 CPUs utilized            ( +-  0.13% )
                 2      context-switches                 #    0.925 /sec                     ( +- 28.87% )
                 0      cpu-migrations                   #    0.000 /sec
                52      page-faults                      #   24.056 /sec                     ( +-  0.64% )
     8,364,119,898      cycles                           #    3.869 GHz                      ( +-  0.11% )  (62.48%)
    19,996,521,678      instructions                     #    2.39  insn per cycle           ( +-  0.01% )  (74.98%)
     4,302,068,252      branches                         #    1.990 G/sec                    ( +-  0.00% )  (74.98%)
        15,135,694      branch-misses                    #    0.35% of all branches          ( +-  0.16% )  (74.99%)
     4,714,694,841      L1-dcache-loads                  #    2.181 G/sec                    ( +-  0.01% )  (74.99%)
        66,407,857      L1-dcache-load-misses            #    1.41% of all L1-dcache accesses  ( +-  1.50% )  (75.01%)
             6,785      LLC-loads                        #    3.139 K/sec                    ( +- 12.49% )  (50.03%)
             3,235      LLC-load-misses                  #   41.29% of all LL-cache accesses  ( +-  6.08% )  (50.01%)

           2.16213 +- 0.00288 seconds time elapsed  ( +-  0.13% )

perf stat numbers for each getpid() iteration:

		  baseline      baseline 
                   audit=0       audit=1 

  cycles               283           836 
  instructions         417          1999 
  IPC                 1.47          2.39 
  branches              89           430 
  branch-misses       0.50          1.51 
  L1-loads             123           471 
  L1-load-misses        ~0          ~6.6*

  * the L1-load-misses are largely stable for runs across a single
    boot, but vary across boots due to vagaries of the SLAB allocator.

baseline audit=1 spends a significant amount of time executing in audit
code and incurs a new branch-miss and a few new L1-load-misses. Also
note that computed audit-only IPC is 2.86 so the baseline is not
ill-performant code.

Patches
==

Patch 1 "audit: cache ctx->major in audit_filter_syscall()", caches
ctx->major in a local variable. This gets rid of a persistent entry
from L1-dcache  (audit_context::major) that had similar alignment
constraints as a potentially busy cache-set (audit_entry::list) and
allows some of the error checking in audit_in_mask() to be hoisted out
of the loop.

Patch 2: "audit: annotate branch direction for audit_in_mask()", so
the compiler can pessimize the call to audit_filter_rules().

Patch 3, "audit: unify audit_filter_{uring(),inode_name(),syscall()}"
centralizes this logic in a common function.

with these changes:

 Performance counter stats for 'bin/getpid' (3 runs):

          1,750.21 msec task-clock                       #    0.994 CPUs utilized            ( +-  0.45% )
                 3      context-switches                 #    1.705 /sec                     ( +- 11.11% )
                 0      cpu-migrations                   #    0.000 /sec
                52      page-faults                      #   29.548 /sec                     ( +-  0.64% )
     6,770,976,590      cycles                           #    3.848 GHz                      ( +-  0.40% )  (27.74%)
    16,588,372,024      instructions                     #    2.44  insn per cycle           ( +-  0.03% )  (33.34%)
     4,322,555,829      branches                         #    2.456 G/sec                    ( +-  0.02% )  (33.40%)
         2,803,286      branch-misses                    #    0.06% of all branches          ( +- 26.45% )  (33.46%)
     4,449,426,336      L1-dcache-loads                  #    2.528 G/sec                    ( +-  0.01% )  (27.71%)
        63,612,108      L1-dcache-load-misses            #    1.43% of all L1-dcache accesses  ( +-  0.50% )  (27.71%)
             6,123      LLC-loads                        #    3.479 K/sec                    ( +-  8.68% )  (27.71%)
             1,890      LLC-load-misses                  #   26.69% of all LL-cache accesses  ( +- 46.99% )  (27.71%)

           1.76033 +- 0.00791 seconds time elapsed  ( +-  0.45% )

And, overall getpid() latency numbers (aggregated over 12 boots for each):
                Min     Mean    Median    Max         pstdev
                (ns)    (ns)      (ns)    (ns)

 baseline     201.30   216.14   216.22  228.46      (+- 1.45%)
 patch1       196.63   207.86   206.60  230.98      (+- 3.92%)
 patch1-2     173.11   182.51   179.65  202.09      (+- 4.34%)
 patch1-3     162.11   175.26   173.71  190.56      (+- 4.33%)

This gives a reasonable speedup. My testing was on Intel Skylake, but I
suspect this should translate to other archs as well (especially on less
wide architectures.)

Please review.

Thanks
Ankur

Ankur Arora (3):
  audit: cache ctx->major in audit_filter_syscall()
  audit: annotate branch direction for audit_in_mask()
  audit: unify audit_filter_{uring(),inode_name(),syscall()}

 kernel/auditsc.c | 86 +++++++++++++++++++++++++-----------------------
 1 file changed, 45 insertions(+), 41 deletions(-)

-- 
2.31.1

--
Linux-audit mailing list
Linux-audit@redhat.com
https://listman.redhat.com/mailman/listinfo/linux-audit


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 1/3] audit: cache ctx->major in audit_filter_syscall()
  2022-09-27 22:59 ` Ankur Arora
@ 2022-09-27 22:59   ` Ankur Arora
  -1 siblings, 0 replies; 40+ messages in thread
From: Ankur Arora @ 2022-09-27 22:59 UTC (permalink / raw)
  To: linux-audit; +Cc: paul, eparis, linux-kernel, boris.ostrovsky, konrad.wilk

ctx->major contains the current syscall number. This is, of course, a
constant for the duration of the syscall. Unfortunately, GCC's alias
analysis cannot prove that it is not modified via a pointer in the
audit_filter_syscall() loop, and so always loads it from memory.

In and of itself the load isn't very expensive (ops dependent on the
ctx->major load are only used to determine the direction of control flow
and have short dependence chains and, in any case the related branches
get predicted perfectly in the fastpath) but still cache ctx->major
in a local for two reasons:

* ctx->major is in the first cacheline of struct audit_context and has
  similar alignment as audit_entry::list audit_entry. For cases
  with a lot of audit rules, doing this reduces one source of contention
  from a potentially busy cache-set.

* audit_in_mask() (called in the hot loop in audit_filter_syscall())
  does cast manipulation and error checking on ctx->major:

     audit_in_mask(const struct audit_krule *rule, unsigned long val):
             if (val > 0xffffffff)
                     return false;

             word = AUDIT_WORD(val);
             if (word >= AUDIT_BITMASK_SIZE)
                     return false;

             bit = AUDIT_BIT(val);

             return rule->mask[word] & bit;

  The clauses related to the rule need to be evaluated in the loop, but
  the rest is unnecessarily re-evaluated for every loop iteration.
  (Note, however, that most of these are cheap ALU ops and the branches
   are perfectly predicted. However, see discussion on cycles
   improvement below for more on why it is still worth hoisting.)

On a Skylakex system change in getpid() latency (aggregated over
12 boot cycles):

             Min     Mean  Median     Max       pstdev
            (ns)     (ns)    (ns)    (ns)

 -        201.30   216.14  216.22  228.46      (+- 1.45%)
 +        196.63   207.86  206.60  230.98      (+- 3.92%)

Performance counter stats for 'bin/getpid' (3 runs) go from:
    cycles               836.89  (  +-   .80% )
    instructions        2000.19  (  +-   .03% )
    IPC                    2.39  (  +-   .83% )
    branches             430.14  (  +-   .03% )
    branch-misses          1.48  (  +-  3.37% )
    L1-dcache-loads      471.11  (  +-   .05% )
    L1-dcache-load-misses  7.62  (  +- 46.98% )

 to:
    cycles               805.58  (  +-  4.11% )
    instructions        1654.11  (  +-   .05% )
    IPC                    2.06  (  +-  3.39% )
    branches             430.02  (  +-   .05% )
    branch-misses          1.55  (  +-  7.09% )
    L1-dcache-loads      440.01  (  +-   .09% )
    L1-dcache-load-misses  9.05  (  +- 74.03% )

(Both aggregated over 12 boot cycles.)

instructions: we reduce around 8 instructions/iteration because some of
the computation is now hoisted out of the loop (branch count does not
change because GCC, for reasons unclear, only hoists the computations
while keeping the basic-blocks.)

cycles: improve by about 5% (in aggregate and looking at individual run
numbers.) This is likely because we now waste fewer pipeline resources
on unnecessary instructions which allows the control flow to
speculatively execute further ahead shortening the execution of the loop
a little. The final gating factor on the performance of this loop
remains the long dependence chain due to the linked-list load.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/auditsc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 79a5da1bc5bb..533b087c3c02 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -843,13 +843,14 @@ static void audit_filter_syscall(struct task_struct *tsk,
 {
 	struct audit_entry *e;
 	enum audit_state state;
+	unsigned long major = ctx->major;
 
 	if (auditd_test_task(tsk))
 		return;
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_EXIT], list) {
-		if (audit_in_mask(&e->rule, ctx->major) &&
+		if (audit_in_mask(&e->rule, major) &&
 		    audit_filter_rules(tsk, &e->rule, ctx, NULL,
 				       &state, false)) {
 			rcu_read_unlock();
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 1/3] audit: cache ctx->major in audit_filter_syscall()
@ 2022-09-27 22:59   ` Ankur Arora
  0 siblings, 0 replies; 40+ messages in thread
From: Ankur Arora @ 2022-09-27 22:59 UTC (permalink / raw)
  To: linux-audit; +Cc: boris.ostrovsky, konrad.wilk, eparis, linux-kernel

ctx->major contains the current syscall number. This is, of course, a
constant for the duration of the syscall. Unfortunately, GCC's alias
analysis cannot prove that it is not modified via a pointer in the
audit_filter_syscall() loop, and so always loads it from memory.

In and of itself the load isn't very expensive (ops dependent on the
ctx->major load are only used to determine the direction of control flow
and have short dependence chains and, in any case the related branches
get predicted perfectly in the fastpath) but still cache ctx->major
in a local for two reasons:

* ctx->major is in the first cacheline of struct audit_context and has
  similar alignment as audit_entry::list audit_entry. For cases
  with a lot of audit rules, doing this reduces one source of contention
  from a potentially busy cache-set.

* audit_in_mask() (called in the hot loop in audit_filter_syscall())
  does cast manipulation and error checking on ctx->major:

     audit_in_mask(const struct audit_krule *rule, unsigned long val):
             if (val > 0xffffffff)
                     return false;

             word = AUDIT_WORD(val);
             if (word >= AUDIT_BITMASK_SIZE)
                     return false;

             bit = AUDIT_BIT(val);

             return rule->mask[word] & bit;

  The clauses related to the rule need to be evaluated in the loop, but
  the rest is unnecessarily re-evaluated for every loop iteration.
  (Note, however, that most of these are cheap ALU ops and the branches
   are perfectly predicted. However, see discussion on cycles
   improvement below for more on why it is still worth hoisting.)

On a Skylakex system change in getpid() latency (aggregated over
12 boot cycles):

             Min     Mean  Median     Max       pstdev
            (ns)     (ns)    (ns)    (ns)

 -        201.30   216.14  216.22  228.46      (+- 1.45%)
 +        196.63   207.86  206.60  230.98      (+- 3.92%)

Performance counter stats for 'bin/getpid' (3 runs) go from:
    cycles               836.89  (  +-   .80% )
    instructions        2000.19  (  +-   .03% )
    IPC                    2.39  (  +-   .83% )
    branches             430.14  (  +-   .03% )
    branch-misses          1.48  (  +-  3.37% )
    L1-dcache-loads      471.11  (  +-   .05% )
    L1-dcache-load-misses  7.62  (  +- 46.98% )

 to:
    cycles               805.58  (  +-  4.11% )
    instructions        1654.11  (  +-   .05% )
    IPC                    2.06  (  +-  3.39% )
    branches             430.02  (  +-   .05% )
    branch-misses          1.55  (  +-  7.09% )
    L1-dcache-loads      440.01  (  +-   .09% )
    L1-dcache-load-misses  9.05  (  +- 74.03% )

(Both aggregated over 12 boot cycles.)

instructions: we reduce around 8 instructions/iteration because some of
the computation is now hoisted out of the loop (branch count does not
change because GCC, for reasons unclear, only hoists the computations
while keeping the basic-blocks.)

cycles: improve by about 5% (in aggregate and looking at individual run
numbers.) This is likely because we now waste fewer pipeline resources
on unnecessary instructions which allows the control flow to
speculatively execute further ahead shortening the execution of the loop
a little. The final gating factor on the performance of this loop
remains the long dependence chain due to the linked-list load.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/auditsc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 79a5da1bc5bb..533b087c3c02 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -843,13 +843,14 @@ static void audit_filter_syscall(struct task_struct *tsk,
 {
 	struct audit_entry *e;
 	enum audit_state state;
+	unsigned long major = ctx->major;
 
 	if (auditd_test_task(tsk))
 		return;
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_EXIT], list) {
-		if (audit_in_mask(&e->rule, ctx->major) &&
+		if (audit_in_mask(&e->rule, major) &&
 		    audit_filter_rules(tsk, &e->rule, ctx, NULL,
 				       &state, false)) {
 			rcu_read_unlock();
-- 
2.31.1

--
Linux-audit mailing list
Linux-audit@redhat.com
https://listman.redhat.com/mailman/listinfo/linux-audit


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 2/3] audit: annotate branch direction for audit_in_mask()
  2022-09-27 22:59 ` Ankur Arora
@ 2022-09-27 22:59   ` Ankur Arora
  -1 siblings, 0 replies; 40+ messages in thread
From: Ankur Arora @ 2022-09-27 22:59 UTC (permalink / raw)
  To: linux-audit; +Cc: paul, eparis, linux-kernel, boris.ostrovsky, konrad.wilk

With sane audit rules, audit logging would only be triggered
infrequently. Keeping this in mind, annotate audit_in_mask() as
unlikely() to allow the compiler to pessimize the call to
audit_filter_rules().

This allows GCC to invert the branch direction for the audit_filter_rules()
basic block in this loop:

 	list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_EXIT], list) {
 		if (audit_in_mask(&e->rule, major) &&
 		    audit_filter_rules(tsk, &e->rule, ctx, NULL,
 				       &state, false)) {
			...

such that it executes the common case in a straight line fashion.

On a Skylakex system change in getpid() latency (all results
aggregated across 12 boot cycles):

         Min     Mean    Median   Max      pstdev
         (ns)    (ns)    (ns)     (ns)

 -    196.63   207.86  206.60  230.98      (+- 3.92%)
 +    173.11   182.51  179.65  202.09      (+- 4.34%)

Performance counter stats for 'bin/getpid' (3 runs) go from:
    cycles               805.58  (  +-  4.11% )
    instructions        1654.11  (  +-   .05% )
    IPC                    2.06  (  +-  3.39% )
    branches             430.02  (  +-   .05% )
    branch-misses          1.55  (  +-  7.09% )
    L1-dcache-loads      440.01  (  +-   .09% )
    L1-dcache-load-misses  9.05  (  +- 74.03% )

 to:
    cycles               706.13  (  +-  4.13% )
    instructions        1654.70  (  +-   .06% )
    IPC                    2.35  (  +-  4.25% )
    branches             430.99  (  +-   .06% )
    branch-misses          0.50  (  +-  2.00% )
    L1-dcache-loads      440.02  (  +-   .07% )
    L1-dcache-load-misses  5.22  (  +- 82.75% )

(Both aggregated over 12 boot cycles.)

cycles: performance improves on average by ~100 cycles/call. IPC
improves commensurately. Two reasons for this improvement:

  * one fewer branch mispred: no obvious reason for this
    branch-miss reduction. There is no significant change in
    basic-block structure (apart from the branch inversion.)

  * the direction of the branch for the call is now inverted, so it
    chooses the not-taken direction more often. The issue-latency
    for not-taken branches is often cheaper.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/auditsc.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 533b087c3c02..bf26f47b5226 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -789,7 +789,7 @@ static enum audit_state audit_filter_task(struct task_struct *tsk, char **key)
 	return AUDIT_STATE_BUILD;
 }
 
-static int audit_in_mask(const struct audit_krule *rule, unsigned long val)
+static bool audit_in_mask(const struct audit_krule *rule, unsigned long val)
 {
 	int word, bit;
 
@@ -850,12 +850,13 @@ static void audit_filter_syscall(struct task_struct *tsk,
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_EXIT], list) {
-		if (audit_in_mask(&e->rule, major) &&
-		    audit_filter_rules(tsk, &e->rule, ctx, NULL,
-				       &state, false)) {
-			rcu_read_unlock();
-			ctx->current_state = state;
-			return;
+		if (unlikely(audit_in_mask(&e->rule, major))) {
+			if (audit_filter_rules(tsk, &e->rule, ctx, NULL,
+					       &state, false)) {
+				rcu_read_unlock();
+				ctx->current_state = state;
+				return;
+			}
 		}
 	}
 	rcu_read_unlock();
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 2/3] audit: annotate branch direction for audit_in_mask()
@ 2022-09-27 22:59   ` Ankur Arora
  0 siblings, 0 replies; 40+ messages in thread
From: Ankur Arora @ 2022-09-27 22:59 UTC (permalink / raw)
  To: linux-audit; +Cc: boris.ostrovsky, konrad.wilk, eparis, linux-kernel

With sane audit rules, audit logging would only be triggered
infrequently. Keeping this in mind, annotate audit_in_mask() as
unlikely() to allow the compiler to pessimize the call to
audit_filter_rules().

This allows GCC to invert the branch direction for the audit_filter_rules()
basic block in this loop:

 	list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_EXIT], list) {
 		if (audit_in_mask(&e->rule, major) &&
 		    audit_filter_rules(tsk, &e->rule, ctx, NULL,
 				       &state, false)) {
			...

such that it executes the common case in a straight line fashion.

On a Skylakex system change in getpid() latency (all results
aggregated across 12 boot cycles):

         Min     Mean    Median   Max      pstdev
         (ns)    (ns)    (ns)     (ns)

 -    196.63   207.86  206.60  230.98      (+- 3.92%)
 +    173.11   182.51  179.65  202.09      (+- 4.34%)

Performance counter stats for 'bin/getpid' (3 runs) go from:
    cycles               805.58  (  +-  4.11% )
    instructions        1654.11  (  +-   .05% )
    IPC                    2.06  (  +-  3.39% )
    branches             430.02  (  +-   .05% )
    branch-misses          1.55  (  +-  7.09% )
    L1-dcache-loads      440.01  (  +-   .09% )
    L1-dcache-load-misses  9.05  (  +- 74.03% )

 to:
    cycles               706.13  (  +-  4.13% )
    instructions        1654.70  (  +-   .06% )
    IPC                    2.35  (  +-  4.25% )
    branches             430.99  (  +-   .06% )
    branch-misses          0.50  (  +-  2.00% )
    L1-dcache-loads      440.02  (  +-   .07% )
    L1-dcache-load-misses  5.22  (  +- 82.75% )

(Both aggregated over 12 boot cycles.)

cycles: performance improves on average by ~100 cycles/call. IPC
improves commensurately. Two reasons for this improvement:

  * one fewer branch mispred: no obvious reason for this
    branch-miss reduction. There is no significant change in
    basic-block structure (apart from the branch inversion.)

  * the direction of the branch for the call is now inverted, so it
    chooses the not-taken direction more often. The issue-latency
    for not-taken branches is often cheaper.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/auditsc.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 533b087c3c02..bf26f47b5226 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -789,7 +789,7 @@ static enum audit_state audit_filter_task(struct task_struct *tsk, char **key)
 	return AUDIT_STATE_BUILD;
 }
 
-static int audit_in_mask(const struct audit_krule *rule, unsigned long val)
+static bool audit_in_mask(const struct audit_krule *rule, unsigned long val)
 {
 	int word, bit;
 
@@ -850,12 +850,13 @@ static void audit_filter_syscall(struct task_struct *tsk,
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_EXIT], list) {
-		if (audit_in_mask(&e->rule, major) &&
-		    audit_filter_rules(tsk, &e->rule, ctx, NULL,
-				       &state, false)) {
-			rcu_read_unlock();
-			ctx->current_state = state;
-			return;
+		if (unlikely(audit_in_mask(&e->rule, major))) {
+			if (audit_filter_rules(tsk, &e->rule, ctx, NULL,
+					       &state, false)) {
+				rcu_read_unlock();
+				ctx->current_state = state;
+				return;
+			}
 		}
 	}
 	rcu_read_unlock();
-- 
2.31.1

--
Linux-audit mailing list
Linux-audit@redhat.com
https://listman.redhat.com/mailman/listinfo/linux-audit


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 3/3] audit: unify audit_filter_{uring(),inode_name(),syscall()}
  2022-09-27 22:59 ` Ankur Arora
@ 2022-09-27 22:59   ` Ankur Arora
  -1 siblings, 0 replies; 40+ messages in thread
From: Ankur Arora @ 2022-09-27 22:59 UTC (permalink / raw)
  To: linux-audit; +Cc: paul, eparis, linux-kernel, boris.ostrovsky, konrad.wilk

audit_filter_uring(), audit_filter_inode_name() are substantially similar
to audit_filter_syscall(). Move the core logic to __audit_filter() which
can be parametrized for all three.

On a Skylakex system, getpid() latency (all results aggregated
across 12 boot cycles):

         Min     Mean    Median   Max      pstdev
         (ns)    (ns)    (ns)     (ns)

 -    173.11   182.51  179.65  202.09     (+- 4.34%)
 +    162.11   175.26  173.71  190.56     (+- 4.33%)

Performance counter stats for 'bin/getpid' (3 runs) go from:
    cycles               706.13  (  +-  4.13% )
    instructions        1654.70  (  +-   .06% )
    IPC                    2.35  (  +-  4.25% )
    branches             430.99  (  +-   .06% )
    branch-misses          0.50  (  +-  2.00% )
    L1-dcache-loads      440.02  (  +-   .07% )
    L1-dcache-load-misses  5.22  (  +- 82.75% )

 to:
    cycles               678.79  (  +-  4.22% )
    instructions        1657.79  (  +-   .05% )
    IPC                    2.45  (  +-  4.08% )
    branches             432.00  (  +-   .05% )
    branch-misses          0.38  (  +- 23.68% )
    L1-dcache-loads      444.96  (  +-   .03% )
    L1-dcache-load-misses  5.13  (  +- 73.09% )

(Both aggregated over 12 boot cycles.)

Unclear if the improvement is just run-to-run variation or because of
a slightly denser loop (the list parameter in the list_for_each_entry_rcu()
exit check now comes from a register rather than a constant as before.)

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/auditsc.c | 86 +++++++++++++++++++++++++-----------------------
 1 file changed, 44 insertions(+), 42 deletions(-)

diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index bf26f47b5226..dd89a52988b0 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -805,6 +805,41 @@ static bool audit_in_mask(const struct audit_krule *rule, unsigned long val)
 	return rule->mask[word] & bit;
 }
 
+/**
+ * __audit_filter - common filter
+ *
+ * @tsk: associated task
+ * @ctx: audit context
+ * @list: audit filter list
+ * @op: current syscall/uring_op
+ * @name: name to be filtered (used by audit_filter_inode_name)
+ *
+ * return: 1 if we hit a filter, 0 if we don't
+ */
+static int __audit_filter(struct task_struct *tsk,
+			   struct audit_context *ctx,
+			   struct list_head *list,
+			   unsigned long op,
+			   struct audit_names *name)
+{
+	struct audit_entry *e;
+	enum audit_state state;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(e, list, list) {
+		if (unlikely(audit_in_mask(&e->rule, op))) {
+			if (audit_filter_rules(tsk, &e->rule, ctx, name,
+					       &state, false)) {
+				rcu_read_unlock();
+				ctx->current_state = state;
+				return 1;
+			}
+		}
+	}
+	rcu_read_unlock();
+	return 0;
+}
+
 /**
  * audit_filter_uring - apply filters to an io_uring operation
  * @tsk: associated task
@@ -813,24 +848,11 @@ static bool audit_in_mask(const struct audit_krule *rule, unsigned long val)
 static void audit_filter_uring(struct task_struct *tsk,
 			       struct audit_context *ctx)
 {
-	struct audit_entry *e;
-	enum audit_state state;
-
 	if (auditd_test_task(tsk))
 		return;
 
-	rcu_read_lock();
-	list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_URING_EXIT],
-				list) {
-		if (audit_in_mask(&e->rule, ctx->uring_op) &&
-		    audit_filter_rules(tsk, &e->rule, ctx, NULL, &state,
-				       false)) {
-			rcu_read_unlock();
-			ctx->current_state = state;
-			return;
-		}
-	}
-	rcu_read_unlock();
+	__audit_filter(tsk, ctx, &audit_filter_list[AUDIT_FILTER_URING_EXIT],
+			ctx->uring_op, NULL);
 }
 
 /* At syscall exit time, this filter is called if the audit_state is
@@ -841,26 +863,11 @@ static void audit_filter_uring(struct task_struct *tsk,
 static void audit_filter_syscall(struct task_struct *tsk,
 				 struct audit_context *ctx)
 {
-	struct audit_entry *e;
-	enum audit_state state;
-	unsigned long major = ctx->major;
-
 	if (auditd_test_task(tsk))
 		return;
 
-	rcu_read_lock();
-	list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_EXIT], list) {
-		if (unlikely(audit_in_mask(&e->rule, major))) {
-			if (audit_filter_rules(tsk, &e->rule, ctx, NULL,
-					       &state, false)) {
-				rcu_read_unlock();
-				ctx->current_state = state;
-				return;
-			}
-		}
-	}
-	rcu_read_unlock();
-	return;
+	__audit_filter(tsk, ctx, &audit_filter_list[AUDIT_FILTER_EXIT],
+			ctx->major, NULL);
 }
 
 /*
@@ -872,17 +879,12 @@ static int audit_filter_inode_name(struct task_struct *tsk,
 				   struct audit_context *ctx) {
 	int h = audit_hash_ino((u32)n->ino);
 	struct list_head *list = &audit_inode_hash[h];
-	struct audit_entry *e;
-	enum audit_state state;
 
-	list_for_each_entry_rcu(e, list, list) {
-		if (audit_in_mask(&e->rule, ctx->major) &&
-		    audit_filter_rules(tsk, &e->rule, ctx, n, &state, false)) {
-			ctx->current_state = state;
-			return 1;
-		}
-	}
-	return 0;
+	/*
+	 * We are called holding an rcu read lock. __audit_filter() will take
+	 * one as well.
+	 */
+	return __audit_filter(tsk, ctx, list, ctx->major, n);
 }
 
 /* At syscall exit time, this filter is called if any audit_names have been
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 3/3] audit: unify audit_filter_{uring(), inode_name(), syscall()}
@ 2022-09-27 22:59   ` Ankur Arora
  0 siblings, 0 replies; 40+ messages in thread
From: Ankur Arora @ 2022-09-27 22:59 UTC (permalink / raw)
  To: linux-audit; +Cc: boris.ostrovsky, konrad.wilk, eparis, linux-kernel

audit_filter_uring(), audit_filter_inode_name() are substantially similar
to audit_filter_syscall(). Move the core logic to __audit_filter() which
can be parametrized for all three.

On a Skylakex system, getpid() latency (all results aggregated
across 12 boot cycles):

         Min     Mean    Median   Max      pstdev
         (ns)    (ns)    (ns)     (ns)

 -    173.11   182.51  179.65  202.09     (+- 4.34%)
 +    162.11   175.26  173.71  190.56     (+- 4.33%)

Performance counter stats for 'bin/getpid' (3 runs) go from:
    cycles               706.13  (  +-  4.13% )
    instructions        1654.70  (  +-   .06% )
    IPC                    2.35  (  +-  4.25% )
    branches             430.99  (  +-   .06% )
    branch-misses          0.50  (  +-  2.00% )
    L1-dcache-loads      440.02  (  +-   .07% )
    L1-dcache-load-misses  5.22  (  +- 82.75% )

 to:
    cycles               678.79  (  +-  4.22% )
    instructions        1657.79  (  +-   .05% )
    IPC                    2.45  (  +-  4.08% )
    branches             432.00  (  +-   .05% )
    branch-misses          0.38  (  +- 23.68% )
    L1-dcache-loads      444.96  (  +-   .03% )
    L1-dcache-load-misses  5.13  (  +- 73.09% )

(Both aggregated over 12 boot cycles.)

Unclear if the improvement is just run-to-run variation or because of
a slightly denser loop (the list parameter in the list_for_each_entry_rcu()
exit check now comes from a register rather than a constant as before.)

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/auditsc.c | 86 +++++++++++++++++++++++++-----------------------
 1 file changed, 44 insertions(+), 42 deletions(-)

diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index bf26f47b5226..dd89a52988b0 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -805,6 +805,41 @@ static bool audit_in_mask(const struct audit_krule *rule, unsigned long val)
 	return rule->mask[word] & bit;
 }
 
+/**
+ * __audit_filter - common filter
+ *
+ * @tsk: associated task
+ * @ctx: audit context
+ * @list: audit filter list
+ * @op: current syscall/uring_op
+ * @name: name to be filtered (used by audit_filter_inode_name)
+ *
+ * return: 1 if we hit a filter, 0 if we don't
+ */
+static int __audit_filter(struct task_struct *tsk,
+			   struct audit_context *ctx,
+			   struct list_head *list,
+			   unsigned long op,
+			   struct audit_names *name)
+{
+	struct audit_entry *e;
+	enum audit_state state;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(e, list, list) {
+		if (unlikely(audit_in_mask(&e->rule, op))) {
+			if (audit_filter_rules(tsk, &e->rule, ctx, name,
+					       &state, false)) {
+				rcu_read_unlock();
+				ctx->current_state = state;
+				return 1;
+			}
+		}
+	}
+	rcu_read_unlock();
+	return 0;
+}
+
 /**
  * audit_filter_uring - apply filters to an io_uring operation
  * @tsk: associated task
@@ -813,24 +848,11 @@ static bool audit_in_mask(const struct audit_krule *rule, unsigned long val)
 static void audit_filter_uring(struct task_struct *tsk,
 			       struct audit_context *ctx)
 {
-	struct audit_entry *e;
-	enum audit_state state;
-
 	if (auditd_test_task(tsk))
 		return;
 
-	rcu_read_lock();
-	list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_URING_EXIT],
-				list) {
-		if (audit_in_mask(&e->rule, ctx->uring_op) &&
-		    audit_filter_rules(tsk, &e->rule, ctx, NULL, &state,
-				       false)) {
-			rcu_read_unlock();
-			ctx->current_state = state;
-			return;
-		}
-	}
-	rcu_read_unlock();
+	__audit_filter(tsk, ctx, &audit_filter_list[AUDIT_FILTER_URING_EXIT],
+			ctx->uring_op, NULL);
 }
 
 /* At syscall exit time, this filter is called if the audit_state is
@@ -841,26 +863,11 @@ static void audit_filter_uring(struct task_struct *tsk,
 static void audit_filter_syscall(struct task_struct *tsk,
 				 struct audit_context *ctx)
 {
-	struct audit_entry *e;
-	enum audit_state state;
-	unsigned long major = ctx->major;
-
 	if (auditd_test_task(tsk))
 		return;
 
-	rcu_read_lock();
-	list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_EXIT], list) {
-		if (unlikely(audit_in_mask(&e->rule, major))) {
-			if (audit_filter_rules(tsk, &e->rule, ctx, NULL,
-					       &state, false)) {
-				rcu_read_unlock();
-				ctx->current_state = state;
-				return;
-			}
-		}
-	}
-	rcu_read_unlock();
-	return;
+	__audit_filter(tsk, ctx, &audit_filter_list[AUDIT_FILTER_EXIT],
+			ctx->major, NULL);
 }
 
 /*
@@ -872,17 +879,12 @@ static int audit_filter_inode_name(struct task_struct *tsk,
 				   struct audit_context *ctx) {
 	int h = audit_hash_ino((u32)n->ino);
 	struct list_head *list = &audit_inode_hash[h];
-	struct audit_entry *e;
-	enum audit_state state;
 
-	list_for_each_entry_rcu(e, list, list) {
-		if (audit_in_mask(&e->rule, ctx->major) &&
-		    audit_filter_rules(tsk, &e->rule, ctx, n, &state, false)) {
-			ctx->current_state = state;
-			return 1;
-		}
-	}
-	return 0;
+	/*
+	 * We are called holding an rcu read lock. __audit_filter() will take
+	 * one as well.
+	 */
+	return __audit_filter(tsk, ctx, list, ctx->major, n);
 }
 
 /* At syscall exit time, this filter is called if any audit_names have been
-- 
2.31.1

--
Linux-audit mailing list
Linux-audit@redhat.com
https://listman.redhat.com/mailman/listinfo/linux-audit


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/3] audit: cache ctx->major in audit_filter_syscall()
  2022-09-27 22:59   ` Ankur Arora
@ 2022-09-28 22:03     ` Paul Moore
  -1 siblings, 0 replies; 40+ messages in thread
From: Paul Moore @ 2022-09-28 22:03 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-audit, eparis, linux-kernel, boris.ostrovsky, konrad.wilk

On Tue, Sep 27, 2022 at 6:59 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> ctx->major contains the current syscall number. This is, of course, a
> constant for the duration of the syscall. Unfortunately, GCC's alias
> analysis cannot prove that it is not modified via a pointer in the
> audit_filter_syscall() loop, and so always loads it from memory.
>
> ...
>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  kernel/auditsc.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)

This looks pretty trivial to me, but it's too late in the current -rc
cycle for this to be merged, I'll queue it up for after the upcoming
merge window closes.  Thanks.

> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> index 79a5da1bc5bb..533b087c3c02 100644
> --- a/kernel/auditsc.c
> +++ b/kernel/auditsc.c
> @@ -843,13 +843,14 @@ static void audit_filter_syscall(struct task_struct *tsk,
>  {
>         struct audit_entry *e;
>         enum audit_state state;
> +       unsigned long major = ctx->major;
>
>         if (auditd_test_task(tsk))
>                 return;
>
>         rcu_read_lock();
>         list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_EXIT], list) {
> -               if (audit_in_mask(&e->rule, ctx->major) &&
> +               if (audit_in_mask(&e->rule, major) &&
>                     audit_filter_rules(tsk, &e->rule, ctx, NULL,
>                                        &state, false)) {
>                         rcu_read_unlock();
> --
> 2.31.1

-- 
paul-moore.com

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/3] audit: cache ctx->major in audit_filter_syscall()
@ 2022-09-28 22:03     ` Paul Moore
  0 siblings, 0 replies; 40+ messages in thread
From: Paul Moore @ 2022-09-28 22:03 UTC (permalink / raw)
  To: Ankur Arora
  Cc: boris.ostrovsky, linux-audit, linux-kernel, eparis, konrad.wilk

On Tue, Sep 27, 2022 at 6:59 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> ctx->major contains the current syscall number. This is, of course, a
> constant for the duration of the syscall. Unfortunately, GCC's alias
> analysis cannot prove that it is not modified via a pointer in the
> audit_filter_syscall() loop, and so always loads it from memory.
>
> ...
>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  kernel/auditsc.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)

This looks pretty trivial to me, but it's too late in the current -rc
cycle for this to be merged, I'll queue it up for after the upcoming
merge window closes.  Thanks.

> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> index 79a5da1bc5bb..533b087c3c02 100644
> --- a/kernel/auditsc.c
> +++ b/kernel/auditsc.c
> @@ -843,13 +843,14 @@ static void audit_filter_syscall(struct task_struct *tsk,
>  {
>         struct audit_entry *e;
>         enum audit_state state;
> +       unsigned long major = ctx->major;
>
>         if (auditd_test_task(tsk))
>                 return;
>
>         rcu_read_lock();
>         list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_EXIT], list) {
> -               if (audit_in_mask(&e->rule, ctx->major) &&
> +               if (audit_in_mask(&e->rule, major) &&
>                     audit_filter_rules(tsk, &e->rule, ctx, NULL,
>                                        &state, false)) {
>                         rcu_read_unlock();
> --
> 2.31.1

-- 
paul-moore.com

--
Linux-audit mailing list
Linux-audit@redhat.com
https://listman.redhat.com/mailman/listinfo/linux-audit


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/3] audit: annotate branch direction for audit_in_mask()
  2022-09-27 22:59   ` Ankur Arora
@ 2022-09-28 22:26     ` Paul Moore
  -1 siblings, 0 replies; 40+ messages in thread
From: Paul Moore @ 2022-09-28 22:26 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-audit, eparis, linux-kernel, boris.ostrovsky, konrad.wilk

On Tue, Sep 27, 2022 at 7:00 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> With sane audit rules, audit logging would only be triggered
> infrequently. Keeping this in mind, annotate audit_in_mask() as
> unlikely() to allow the compiler to pessimize the call to
> audit_filter_rules().
>
> This allows GCC to invert the branch direction for the audit_filter_rules()
> basic block in this loop:
>
>         list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_EXIT], list) {
>                 if (audit_in_mask(&e->rule, major) &&
>                     audit_filter_rules(tsk, &e->rule, ctx, NULL,
>                                        &state, false)) {
>                         ...
>
> such that it executes the common case in a straight line fashion.
>
> On a Skylakex system change in getpid() latency (all results
> aggregated across 12 boot cycles):
>
>          Min     Mean    Median   Max      pstdev
>          (ns)    (ns)    (ns)     (ns)
>
>  -    196.63   207.86  206.60  230.98      (+- 3.92%)
>  +    173.11   182.51  179.65  202.09      (+- 4.34%)
>
> Performance counter stats for 'bin/getpid' (3 runs) go from:
>     cycles               805.58  (  +-  4.11% )
>     instructions        1654.11  (  +-   .05% )
>     IPC                    2.06  (  +-  3.39% )
>     branches             430.02  (  +-   .05% )
>     branch-misses          1.55  (  +-  7.09% )
>     L1-dcache-loads      440.01  (  +-   .09% )
>     L1-dcache-load-misses  9.05  (  +- 74.03% )
>
>  to:
>     cycles               706.13  (  +-  4.13% )
>     instructions        1654.70  (  +-   .06% )
>     IPC                    2.35  (  +-  4.25% )
>     branches             430.99  (  +-   .06% )
>     branch-misses          0.50  (  +-  2.00% )
>     L1-dcache-loads      440.02  (  +-   .07% )
>     L1-dcache-load-misses  5.22  (  +- 82.75% )
>
> (Both aggregated over 12 boot cycles.)
>
> cycles: performance improves on average by ~100 cycles/call. IPC
> improves commensurately. Two reasons for this improvement:
>
>   * one fewer branch mispred: no obvious reason for this
>     branch-miss reduction. There is no significant change in
>     basic-block structure (apart from the branch inversion.)
>
>   * the direction of the branch for the call is now inverted, so it
>     chooses the not-taken direction more often. The issue-latency
>     for not-taken branches is often cheaper.
>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  kernel/auditsc.c | 15 ++++++++-------
>  1 file changed, 8 insertions(+), 7 deletions(-)

I generally dislike merging likely()/unlikely() additions to code
paths that can have varying levels of performance depending on runtime
configuration.  While I appreciate the work you are doing to improve
audit performance, I don't think this is something I want to merge,
I'm sorry.

> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> index 533b087c3c02..bf26f47b5226 100644
> --- a/kernel/auditsc.c
> +++ b/kernel/auditsc.c
> @@ -789,7 +789,7 @@ static enum audit_state audit_filter_task(struct task_struct *tsk, char **key)
>         return AUDIT_STATE_BUILD;
>  }
>
> -static int audit_in_mask(const struct audit_krule *rule, unsigned long val)
> +static bool audit_in_mask(const struct audit_krule *rule, unsigned long val)
>  {
>         int word, bit;
>
> @@ -850,12 +850,13 @@ static void audit_filter_syscall(struct task_struct *tsk,
>
>         rcu_read_lock();
>         list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_EXIT], list) {
> -               if (audit_in_mask(&e->rule, major) &&
> -                   audit_filter_rules(tsk, &e->rule, ctx, NULL,
> -                                      &state, false)) {
> -                       rcu_read_unlock();
> -                       ctx->current_state = state;
> -                       return;
> +               if (unlikely(audit_in_mask(&e->rule, major))) {
> +                       if (audit_filter_rules(tsk, &e->rule, ctx, NULL,
> +                                              &state, false)) {
> +                               rcu_read_unlock();
> +                               ctx->current_state = state;
> +                               return;
> +                       }
>                 }
>         }
>         rcu_read_unlock();
> --
> 2.31.1

-- 
paul-moore.com

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/3] audit: annotate branch direction for audit_in_mask()
@ 2022-09-28 22:26     ` Paul Moore
  0 siblings, 0 replies; 40+ messages in thread
From: Paul Moore @ 2022-09-28 22:26 UTC (permalink / raw)
  To: Ankur Arora
  Cc: boris.ostrovsky, linux-audit, linux-kernel, eparis, konrad.wilk

On Tue, Sep 27, 2022 at 7:00 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> With sane audit rules, audit logging would only be triggered
> infrequently. Keeping this in mind, annotate audit_in_mask() as
> unlikely() to allow the compiler to pessimize the call to
> audit_filter_rules().
>
> This allows GCC to invert the branch direction for the audit_filter_rules()
> basic block in this loop:
>
>         list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_EXIT], list) {
>                 if (audit_in_mask(&e->rule, major) &&
>                     audit_filter_rules(tsk, &e->rule, ctx, NULL,
>                                        &state, false)) {
>                         ...
>
> such that it executes the common case in a straight line fashion.
>
> On a Skylakex system change in getpid() latency (all results
> aggregated across 12 boot cycles):
>
>          Min     Mean    Median   Max      pstdev
>          (ns)    (ns)    (ns)     (ns)
>
>  -    196.63   207.86  206.60  230.98      (+- 3.92%)
>  +    173.11   182.51  179.65  202.09      (+- 4.34%)
>
> Performance counter stats for 'bin/getpid' (3 runs) go from:
>     cycles               805.58  (  +-  4.11% )
>     instructions        1654.11  (  +-   .05% )
>     IPC                    2.06  (  +-  3.39% )
>     branches             430.02  (  +-   .05% )
>     branch-misses          1.55  (  +-  7.09% )
>     L1-dcache-loads      440.01  (  +-   .09% )
>     L1-dcache-load-misses  9.05  (  +- 74.03% )
>
>  to:
>     cycles               706.13  (  +-  4.13% )
>     instructions        1654.70  (  +-   .06% )
>     IPC                    2.35  (  +-  4.25% )
>     branches             430.99  (  +-   .06% )
>     branch-misses          0.50  (  +-  2.00% )
>     L1-dcache-loads      440.02  (  +-   .07% )
>     L1-dcache-load-misses  5.22  (  +- 82.75% )
>
> (Both aggregated over 12 boot cycles.)
>
> cycles: performance improves on average by ~100 cycles/call. IPC
> improves commensurately. Two reasons for this improvement:
>
>   * one fewer branch mispred: no obvious reason for this
>     branch-miss reduction. There is no significant change in
>     basic-block structure (apart from the branch inversion.)
>
>   * the direction of the branch for the call is now inverted, so it
>     chooses the not-taken direction more often. The issue-latency
>     for not-taken branches is often cheaper.
>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  kernel/auditsc.c | 15 ++++++++-------
>  1 file changed, 8 insertions(+), 7 deletions(-)

I generally dislike merging likely()/unlikely() additions to code
paths that can have varying levels of performance depending on runtime
configuration.  While I appreciate the work you are doing to improve
audit performance, I don't think this is something I want to merge,
I'm sorry.

> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> index 533b087c3c02..bf26f47b5226 100644
> --- a/kernel/auditsc.c
> +++ b/kernel/auditsc.c
> @@ -789,7 +789,7 @@ static enum audit_state audit_filter_task(struct task_struct *tsk, char **key)
>         return AUDIT_STATE_BUILD;
>  }
>
> -static int audit_in_mask(const struct audit_krule *rule, unsigned long val)
> +static bool audit_in_mask(const struct audit_krule *rule, unsigned long val)
>  {
>         int word, bit;
>
> @@ -850,12 +850,13 @@ static void audit_filter_syscall(struct task_struct *tsk,
>
>         rcu_read_lock();
>         list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_EXIT], list) {
> -               if (audit_in_mask(&e->rule, major) &&
> -                   audit_filter_rules(tsk, &e->rule, ctx, NULL,
> -                                      &state, false)) {
> -                       rcu_read_unlock();
> -                       ctx->current_state = state;
> -                       return;
> +               if (unlikely(audit_in_mask(&e->rule, major))) {
> +                       if (audit_filter_rules(tsk, &e->rule, ctx, NULL,
> +                                              &state, false)) {
> +                               rcu_read_unlock();
> +                               ctx->current_state = state;
> +                               return;
> +                       }
>                 }
>         }
>         rcu_read_unlock();
> --
> 2.31.1

-- 
paul-moore.com

--
Linux-audit mailing list
Linux-audit@redhat.com
https://listman.redhat.com/mailman/listinfo/linux-audit


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 3/3] audit: unify audit_filter_{uring(),inode_name(),syscall()}
  2022-09-27 22:59   ` [PATCH 3/3] audit: unify audit_filter_{uring(), inode_name(), syscall()} Ankur Arora
@ 2022-09-28 22:54     ` Paul Moore
  -1 siblings, 0 replies; 40+ messages in thread
From: Paul Moore @ 2022-09-28 22:54 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-audit, eparis, linux-kernel, boris.ostrovsky, konrad.wilk

On Tue, Sep 27, 2022 at 7:00 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> audit_filter_uring(), audit_filter_inode_name() are substantially similar
> to audit_filter_syscall(). Move the core logic to __audit_filter() which
> can be parametrized for all three.
>
> On a Skylakex system, getpid() latency (all results aggregated
> across 12 boot cycles):
>
>          Min     Mean    Median   Max      pstdev
>          (ns)    (ns)    (ns)     (ns)
>
>  -    173.11   182.51  179.65  202.09     (+- 4.34%)
>  +    162.11   175.26  173.71  190.56     (+- 4.33%)
>
> Performance counter stats for 'bin/getpid' (3 runs) go from:
>     cycles               706.13  (  +-  4.13% )
>     instructions        1654.70  (  +-   .06% )
>     IPC                    2.35  (  +-  4.25% )
>     branches             430.99  (  +-   .06% )
>     branch-misses          0.50  (  +-  2.00% )
>     L1-dcache-loads      440.02  (  +-   .07% )
>     L1-dcache-load-misses  5.22  (  +- 82.75% )
>
>  to:
>     cycles               678.79  (  +-  4.22% )
>     instructions        1657.79  (  +-   .05% )
>     IPC                    2.45  (  +-  4.08% )
>     branches             432.00  (  +-   .05% )
>     branch-misses          0.38  (  +- 23.68% )
>     L1-dcache-loads      444.96  (  +-   .03% )
>     L1-dcache-load-misses  5.13  (  +- 73.09% )
>
> (Both aggregated over 12 boot cycles.)
>
> Unclear if the improvement is just run-to-run variation or because of
> a slightly denser loop (the list parameter in the list_for_each_entry_rcu()
> exit check now comes from a register rather than a constant as before.)
>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  kernel/auditsc.c | 86 +++++++++++++++++++++++++-----------------------
>  1 file changed, 44 insertions(+), 42 deletions(-)
>
> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> index bf26f47b5226..dd89a52988b0 100644
> --- a/kernel/auditsc.c
> +++ b/kernel/auditsc.c
> @@ -805,6 +805,41 @@ static bool audit_in_mask(const struct audit_krule *rule, unsigned long val)
>         return rule->mask[word] & bit;
>  }
>
> +/**
> + * __audit_filter - common filter
> + *

Please remove the vertical whitespace between the function description
line and the parameter descriptions.

I'd also suggest renaming this function to "__audit_filter_op(...)"
since we have a few audit filtering functions and a generic
"__audit_filter()" function with no qualification in the name seems
just a bit too generic to not be misleading ... at least I think so.

I also might try to be just a bit more descriptive in the text above,
for example:

"__audit_filter_op - filter helper function for operations (syscall/uring/etc.)"

> + * @tsk: associated task
> + * @ctx: audit context
> + * @list: audit filter list
> + * @op: current syscall/uring_op
> + * @name: name to be filtered (used by audit_filter_inode_name)

I would change this to "@name: audit_name to use in filtering (can be
NULL)" and leave it at that.

> + *
> + * return: 1 if we hit a filter, 0 if we don't

The function header block comment description should be in regular
English sentences.  Perhaps something like this:

"Run the audit filters specified in @list against @tsk using @ctx,
@op, and @name as necessary; the caller is responsible for ensuring
that the call is made while the RCU read lock is held.  The @name
parameter can be NULL, but all others must be specified.  Returns
1/true if the filter finds a match, 0/false if none are found."

> + */
> +static int __audit_filter(struct task_struct *tsk,
> +                          struct audit_context *ctx,
> +                          struct list_head *list,
> +                          unsigned long op,
> +                          struct audit_names *name)
> +{
> +       struct audit_entry *e;
> +       enum audit_state state;
> +
> +       rcu_read_lock();

I would move the RCU locking to the callers since not every caller requires it.

> +       list_for_each_entry_rcu(e, list, list) {
> +               if (unlikely(audit_in_mask(&e->rule, op))) {

As discussed in patch 2/3, please remove the unlikely() call.

> +                       if (audit_filter_rules(tsk, &e->rule, ctx, name,
> +                                              &state, false)) {
> +                               rcu_read_unlock();
> +                               ctx->current_state = state;
> +                               return 1;
>  +                       }
> +               }
> +       }
> +       rcu_read_unlock();
> +       return 0;
> +}
> +

-- 
paul-moore.com

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 3/3] audit: unify audit_filter_{uring(), inode_name(), syscall()}
@ 2022-09-28 22:54     ` Paul Moore
  0 siblings, 0 replies; 40+ messages in thread
From: Paul Moore @ 2022-09-28 22:54 UTC (permalink / raw)
  To: Ankur Arora
  Cc: boris.ostrovsky, linux-audit, linux-kernel, eparis, konrad.wilk

On Tue, Sep 27, 2022 at 7:00 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> audit_filter_uring(), audit_filter_inode_name() are substantially similar
> to audit_filter_syscall(). Move the core logic to __audit_filter() which
> can be parametrized for all three.
>
> On a Skylakex system, getpid() latency (all results aggregated
> across 12 boot cycles):
>
>          Min     Mean    Median   Max      pstdev
>          (ns)    (ns)    (ns)     (ns)
>
>  -    173.11   182.51  179.65  202.09     (+- 4.34%)
>  +    162.11   175.26  173.71  190.56     (+- 4.33%)
>
> Performance counter stats for 'bin/getpid' (3 runs) go from:
>     cycles               706.13  (  +-  4.13% )
>     instructions        1654.70  (  +-   .06% )
>     IPC                    2.35  (  +-  4.25% )
>     branches             430.99  (  +-   .06% )
>     branch-misses          0.50  (  +-  2.00% )
>     L1-dcache-loads      440.02  (  +-   .07% )
>     L1-dcache-load-misses  5.22  (  +- 82.75% )
>
>  to:
>     cycles               678.79  (  +-  4.22% )
>     instructions        1657.79  (  +-   .05% )
>     IPC                    2.45  (  +-  4.08% )
>     branches             432.00  (  +-   .05% )
>     branch-misses          0.38  (  +- 23.68% )
>     L1-dcache-loads      444.96  (  +-   .03% )
>     L1-dcache-load-misses  5.13  (  +- 73.09% )
>
> (Both aggregated over 12 boot cycles.)
>
> Unclear if the improvement is just run-to-run variation or because of
> a slightly denser loop (the list parameter in the list_for_each_entry_rcu()
> exit check now comes from a register rather than a constant as before.)
>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  kernel/auditsc.c | 86 +++++++++++++++++++++++++-----------------------
>  1 file changed, 44 insertions(+), 42 deletions(-)
>
> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> index bf26f47b5226..dd89a52988b0 100644
> --- a/kernel/auditsc.c
> +++ b/kernel/auditsc.c
> @@ -805,6 +805,41 @@ static bool audit_in_mask(const struct audit_krule *rule, unsigned long val)
>         return rule->mask[word] & bit;
>  }
>
> +/**
> + * __audit_filter - common filter
> + *

Please remove the vertical whitespace between the function description
line and the parameter descriptions.

I'd also suggest renaming this function to "__audit_filter_op(...)"
since we have a few audit filtering functions and a generic
"__audit_filter()" function with no qualification in the name seems
just a bit too generic to not be misleading ... at least I think so.

I also might try to be just a bit more descriptive in the text above,
for example:

"__audit_filter_op - filter helper function for operations (syscall/uring/etc.)"

> + * @tsk: associated task
> + * @ctx: audit context
> + * @list: audit filter list
> + * @op: current syscall/uring_op
> + * @name: name to be filtered (used by audit_filter_inode_name)

I would change this to "@name: audit_name to use in filtering (can be
NULL)" and leave it at that.

> + *
> + * return: 1 if we hit a filter, 0 if we don't

The function header block comment description should be in regular
English sentences.  Perhaps something like this:

"Run the audit filters specified in @list against @tsk using @ctx,
@op, and @name as necessary; the caller is responsible for ensuring
that the call is made while the RCU read lock is held.  The @name
parameter can be NULL, but all others must be specified.  Returns
1/true if the filter finds a match, 0/false if none are found."

> + */
> +static int __audit_filter(struct task_struct *tsk,
> +                          struct audit_context *ctx,
> +                          struct list_head *list,
> +                          unsigned long op,
> +                          struct audit_names *name)
> +{
> +       struct audit_entry *e;
> +       enum audit_state state;
> +
> +       rcu_read_lock();

I would move the RCU locking to the callers since not every caller requires it.

> +       list_for_each_entry_rcu(e, list, list) {
> +               if (unlikely(audit_in_mask(&e->rule, op))) {

As discussed in patch 2/3, please remove the unlikely() call.

> +                       if (audit_filter_rules(tsk, &e->rule, ctx, name,
> +                                              &state, false)) {
> +                               rcu_read_unlock();
> +                               ctx->current_state = state;
> +                               return 1;
>  +                       }
> +               }
> +       }
> +       rcu_read_unlock();
> +       return 0;
> +}
> +

-- 
paul-moore.com

--
Linux-audit mailing list
Linux-audit@redhat.com
https://listman.redhat.com/mailman/listinfo/linux-audit


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/3] audit: annotate branch direction for audit_in_mask()
  2022-09-28 22:26     ` Paul Moore
@ 2022-09-29 20:19       ` Ankur Arora
  -1 siblings, 0 replies; 40+ messages in thread
From: Ankur Arora @ 2022-09-29 20:19 UTC (permalink / raw)
  To: Paul Moore
  Cc: Ankur Arora, linux-audit, eparis, linux-kernel, boris.ostrovsky,
	konrad.wilk


Paul Moore <paul@paul-moore.com> writes:

> On Tue, Sep 27, 2022 at 7:00 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>> With sane audit rules, audit logging would only be triggered
>> infrequently. Keeping this in mind, annotate audit_in_mask() as
>> unlikely() to allow the compiler to pessimize the call to
>> audit_filter_rules().
>>
>> This allows GCC to invert the branch direction for the audit_filter_rules()
>> basic block in this loop:
>>
>>         list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_EXIT], list) {
>>                 if (audit_in_mask(&e->rule, major) &&
>>                     audit_filter_rules(tsk, &e->rule, ctx, NULL,
>>                                        &state, false)) {
>>                         ...
>>
>> such that it executes the common case in a straight line fashion.
>>
>> On a Skylakex system change in getpid() latency (all results
>> aggregated across 12 boot cycles):
>>
>>          Min     Mean    Median   Max      pstdev
>>          (ns)    (ns)    (ns)     (ns)
>>
>>  -    196.63   207.86  206.60  230.98      (+- 3.92%)
>>  +    173.11   182.51  179.65  202.09      (+- 4.34%)
>>
>> Performance counter stats for 'bin/getpid' (3 runs) go from:
>>     cycles               805.58  (  +-  4.11% )
>>     instructions        1654.11  (  +-   .05% )
>>     IPC                    2.06  (  +-  3.39% )
>>     branches             430.02  (  +-   .05% )
>>     branch-misses          1.55  (  +-  7.09% )
>>     L1-dcache-loads      440.01  (  +-   .09% )
>>     L1-dcache-load-misses  9.05  (  +- 74.03% )
>>
>>  to:
>>     cycles               706.13  (  +-  4.13% )
>>     instructions        1654.70  (  +-   .06% )
>>     IPC                    2.35  (  +-  4.25% )
>>     branches             430.99  (  +-   .06% )
>>     branch-misses          0.50  (  +-  2.00% )
>>     L1-dcache-loads      440.02  (  +-   .07% )
>>     L1-dcache-load-misses  5.22  (  +- 82.75% )
>>
>> (Both aggregated over 12 boot cycles.)
>>
>> cycles: performance improves on average by ~100 cycles/call. IPC
>> improves commensurately. Two reasons for this improvement:
>>
>>   * one fewer branch mispred: no obvious reason for this
>>     branch-miss reduction. There is no significant change in
>>     basic-block structure (apart from the branch inversion.)
>>
>>   * the direction of the branch for the call is now inverted, so it
>>     chooses the not-taken direction more often. The issue-latency
>>     for not-taken branches is often cheaper.
>>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>>  kernel/auditsc.c | 15 ++++++++-------
>>  1 file changed, 8 insertions(+), 7 deletions(-)
>
> I generally dislike merging likely()/unlikely() additions to code
> paths that can have varying levels of performance depending on runtime
> configuration.

I think that's fair, and in this particular case the benchmark is quite
contrived.

But, just to elaborate a bit more on why that unlikely() clause made
sense to me: it seems to me that audit typically would be triggered for
control syscalls and the ratio between control and non-control ones
would be fairly lopsided.

Let me see if I can rewrite the conditional in a different way to get a
similar effect but I suspect that might be even more compiler dependent.

Also, let me run the audit-testsuite this time. Is there a good test
there that you would recommend that might serve as a more representative
workload?


Thanks
Ankur

> While I appreciate the work you are doing to improve
> audit performance, I don't think this is something I want to merge,
> I'm sorry.



>
>> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
>> index 533b087c3c02..bf26f47b5226 100644
>> --- a/kernel/auditsc.c
>> +++ b/kernel/auditsc.c
>> @@ -789,7 +789,7 @@ static enum audit_state audit_filter_task(struct task_struct *tsk, char **key)
>>         return AUDIT_STATE_BUILD;
>>  }
>>
>> -static int audit_in_mask(const struct audit_krule *rule, unsigned long val)
>> +static bool audit_in_mask(const struct audit_krule *rule, unsigned long val)
>>  {
>>         int word, bit;
>>
>> @@ -850,12 +850,13 @@ static void audit_filter_syscall(struct task_struct *tsk,
>>
>>         rcu_read_lock();
>>         list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_EXIT], list) {
>> -               if (audit_in_mask(&e->rule, major) &&
>> -                   audit_filter_rules(tsk, &e->rule, ctx, NULL,
>> -                                      &state, false)) {
>> -                       rcu_read_unlock();
>> -                       ctx->current_state = state;
>> -                       return;
>> +               if (unlikely(audit_in_mask(&e->rule, major))) {
>> +                       if (audit_filter_rules(tsk, &e->rule, ctx, NULL,
>> +                                              &state, false)) {
>> +                               rcu_read_unlock();
>> +                               ctx->current_state = state;
>> +                               return;
>> +                       }
>>                 }
>>         }
>>         rcu_read_unlock();
>> --
>> 2.31.1


--
ankur

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/3] audit: annotate branch direction for audit_in_mask()
@ 2022-09-29 20:19       ` Ankur Arora
  0 siblings, 0 replies; 40+ messages in thread
From: Ankur Arora @ 2022-09-29 20:19 UTC (permalink / raw)
  To: Paul Moore
  Cc: konrad.wilk, linux-kernel, eparis, linux-audit, Ankur Arora,
	boris.ostrovsky


Paul Moore <paul@paul-moore.com> writes:

> On Tue, Sep 27, 2022 at 7:00 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>> With sane audit rules, audit logging would only be triggered
>> infrequently. Keeping this in mind, annotate audit_in_mask() as
>> unlikely() to allow the compiler to pessimize the call to
>> audit_filter_rules().
>>
>> This allows GCC to invert the branch direction for the audit_filter_rules()
>> basic block in this loop:
>>
>>         list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_EXIT], list) {
>>                 if (audit_in_mask(&e->rule, major) &&
>>                     audit_filter_rules(tsk, &e->rule, ctx, NULL,
>>                                        &state, false)) {
>>                         ...
>>
>> such that it executes the common case in a straight line fashion.
>>
>> On a Skylakex system change in getpid() latency (all results
>> aggregated across 12 boot cycles):
>>
>>          Min     Mean    Median   Max      pstdev
>>          (ns)    (ns)    (ns)     (ns)
>>
>>  -    196.63   207.86  206.60  230.98      (+- 3.92%)
>>  +    173.11   182.51  179.65  202.09      (+- 4.34%)
>>
>> Performance counter stats for 'bin/getpid' (3 runs) go from:
>>     cycles               805.58  (  +-  4.11% )
>>     instructions        1654.11  (  +-   .05% )
>>     IPC                    2.06  (  +-  3.39% )
>>     branches             430.02  (  +-   .05% )
>>     branch-misses          1.55  (  +-  7.09% )
>>     L1-dcache-loads      440.01  (  +-   .09% )
>>     L1-dcache-load-misses  9.05  (  +- 74.03% )
>>
>>  to:
>>     cycles               706.13  (  +-  4.13% )
>>     instructions        1654.70  (  +-   .06% )
>>     IPC                    2.35  (  +-  4.25% )
>>     branches             430.99  (  +-   .06% )
>>     branch-misses          0.50  (  +-  2.00% )
>>     L1-dcache-loads      440.02  (  +-   .07% )
>>     L1-dcache-load-misses  5.22  (  +- 82.75% )
>>
>> (Both aggregated over 12 boot cycles.)
>>
>> cycles: performance improves on average by ~100 cycles/call. IPC
>> improves commensurately. Two reasons for this improvement:
>>
>>   * one fewer branch mispred: no obvious reason for this
>>     branch-miss reduction. There is no significant change in
>>     basic-block structure (apart from the branch inversion.)
>>
>>   * the direction of the branch for the call is now inverted, so it
>>     chooses the not-taken direction more often. The issue-latency
>>     for not-taken branches is often cheaper.
>>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>>  kernel/auditsc.c | 15 ++++++++-------
>>  1 file changed, 8 insertions(+), 7 deletions(-)
>
> I generally dislike merging likely()/unlikely() additions to code
> paths that can have varying levels of performance depending on runtime
> configuration.

I think that's fair, and in this particular case the benchmark is quite
contrived.

But, just to elaborate a bit more on why that unlikely() clause made
sense to me: it seems to me that audit typically would be triggered for
control syscalls and the ratio between control and non-control ones
would be fairly lopsided.

Let me see if I can rewrite the conditional in a different way to get a
similar effect but I suspect that might be even more compiler dependent.

Also, let me run the audit-testsuite this time. Is there a good test
there that you would recommend that might serve as a more representative
workload?


Thanks
Ankur

> While I appreciate the work you are doing to improve
> audit performance, I don't think this is something I want to merge,
> I'm sorry.



>
>> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
>> index 533b087c3c02..bf26f47b5226 100644
>> --- a/kernel/auditsc.c
>> +++ b/kernel/auditsc.c
>> @@ -789,7 +789,7 @@ static enum audit_state audit_filter_task(struct task_struct *tsk, char **key)
>>         return AUDIT_STATE_BUILD;
>>  }
>>
>> -static int audit_in_mask(const struct audit_krule *rule, unsigned long val)
>> +static bool audit_in_mask(const struct audit_krule *rule, unsigned long val)
>>  {
>>         int word, bit;
>>
>> @@ -850,12 +850,13 @@ static void audit_filter_syscall(struct task_struct *tsk,
>>
>>         rcu_read_lock();
>>         list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_EXIT], list) {
>> -               if (audit_in_mask(&e->rule, major) &&
>> -                   audit_filter_rules(tsk, &e->rule, ctx, NULL,
>> -                                      &state, false)) {
>> -                       rcu_read_unlock();
>> -                       ctx->current_state = state;
>> -                       return;
>> +               if (unlikely(audit_in_mask(&e->rule, major))) {
>> +                       if (audit_filter_rules(tsk, &e->rule, ctx, NULL,
>> +                                              &state, false)) {
>> +                               rcu_read_unlock();
>> +                               ctx->current_state = state;
>> +                               return;
>> +                       }
>>                 }
>>         }
>>         rcu_read_unlock();
>> --
>> 2.31.1


--
ankur

--
Linux-audit mailing list
Linux-audit@redhat.com
https://listman.redhat.com/mailman/listinfo/linux-audit


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/3] audit: cache ctx->major in audit_filter_syscall()
  2022-09-28 22:03     ` Paul Moore
@ 2022-09-29 20:20       ` Ankur Arora
  -1 siblings, 0 replies; 40+ messages in thread
From: Ankur Arora @ 2022-09-29 20:20 UTC (permalink / raw)
  To: Paul Moore
  Cc: Ankur Arora, linux-audit, eparis, linux-kernel, boris.ostrovsky,
	konrad.wilk


Paul Moore <paul@paul-moore.com> writes:

> On Tue, Sep 27, 2022 at 6:59 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>> ctx->major contains the current syscall number. This is, of course, a
>> constant for the duration of the syscall. Unfortunately, GCC's alias
>> analysis cannot prove that it is not modified via a pointer in the
>> audit_filter_syscall() loop, and so always loads it from memory.
>>
>> ...
>>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>>  kernel/auditsc.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> This looks pretty trivial to me, but it's too late in the current -rc
> cycle for this to be merged, I'll queue it up for after the upcoming
> merge window closes.  Thanks.

Thanks.

Ankur

>
>> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
>> index 79a5da1bc5bb..533b087c3c02 100644
>> --- a/kernel/auditsc.c
>> +++ b/kernel/auditsc.c
>> @@ -843,13 +843,14 @@ static void audit_filter_syscall(struct task_struct *tsk,
>>  {
>>         struct audit_entry *e;
>>         enum audit_state state;
>> +       unsigned long major = ctx->major;
>>
>>         if (auditd_test_task(tsk))
>>                 return;
>>
>>         rcu_read_lock();
>>         list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_EXIT], list) {
>> -               if (audit_in_mask(&e->rule, ctx->major) &&
>> +               if (audit_in_mask(&e->rule, major) &&
>>                     audit_filter_rules(tsk, &e->rule, ctx, NULL,
>>                                        &state, false)) {
>>                         rcu_read_unlock();
>> --
>> 2.31.1


--
ankur

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/3] audit: cache ctx->major in audit_filter_syscall()
@ 2022-09-29 20:20       ` Ankur Arora
  0 siblings, 0 replies; 40+ messages in thread
From: Ankur Arora @ 2022-09-29 20:20 UTC (permalink / raw)
  To: Paul Moore
  Cc: konrad.wilk, linux-kernel, eparis, linux-audit, Ankur Arora,
	boris.ostrovsky


Paul Moore <paul@paul-moore.com> writes:

> On Tue, Sep 27, 2022 at 6:59 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>> ctx->major contains the current syscall number. This is, of course, a
>> constant for the duration of the syscall. Unfortunately, GCC's alias
>> analysis cannot prove that it is not modified via a pointer in the
>> audit_filter_syscall() loop, and so always loads it from memory.
>>
>> ...
>>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>>  kernel/auditsc.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> This looks pretty trivial to me, but it's too late in the current -rc
> cycle for this to be merged, I'll queue it up for after the upcoming
> merge window closes.  Thanks.

Thanks.

Ankur

>
>> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
>> index 79a5da1bc5bb..533b087c3c02 100644
>> --- a/kernel/auditsc.c
>> +++ b/kernel/auditsc.c
>> @@ -843,13 +843,14 @@ static void audit_filter_syscall(struct task_struct *tsk,
>>  {
>>         struct audit_entry *e;
>>         enum audit_state state;
>> +       unsigned long major = ctx->major;
>>
>>         if (auditd_test_task(tsk))
>>                 return;
>>
>>         rcu_read_lock();
>>         list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_EXIT], list) {
>> -               if (audit_in_mask(&e->rule, ctx->major) &&
>> +               if (audit_in_mask(&e->rule, major) &&
>>                     audit_filter_rules(tsk, &e->rule, ctx, NULL,
>>                                        &state, false)) {
>>                         rcu_read_unlock();
>> --
>> 2.31.1


--
ankur

--
Linux-audit mailing list
Linux-audit@redhat.com
https://listman.redhat.com/mailman/listinfo/linux-audit


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 3/3] audit: unify audit_filter_{uring(),inode_name(),syscall()}
  2022-09-28 22:54     ` [PATCH 3/3] audit: unify audit_filter_{uring(), inode_name(), syscall()} Paul Moore
@ 2022-09-29 20:23       ` Ankur Arora
  -1 siblings, 0 replies; 40+ messages in thread
From: Ankur Arora @ 2022-09-29 20:23 UTC (permalink / raw)
  To: Paul Moore
  Cc: Ankur Arora, linux-audit, eparis, linux-kernel, boris.ostrovsky,
	konrad.wilk


Paul Moore <paul@paul-moore.com> writes:

> On Tue, Sep 27, 2022 at 7:00 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>> audit_filter_uring(), audit_filter_inode_name() are substantially similar
>> to audit_filter_syscall(). Move the core logic to __audit_filter() which
>> can be parametrized for all three.
>>
>> On a Skylakex system, getpid() latency (all results aggregated
>> across 12 boot cycles):
>>
>>          Min     Mean    Median   Max      pstdev
>>          (ns)    (ns)    (ns)     (ns)
>>
>>  -    173.11   182.51  179.65  202.09     (+- 4.34%)
>>  +    162.11   175.26  173.71  190.56     (+- 4.33%)
>>
>> Performance counter stats for 'bin/getpid' (3 runs) go from:
>>     cycles               706.13  (  +-  4.13% )
>>     instructions        1654.70  (  +-   .06% )
>>     IPC                    2.35  (  +-  4.25% )
>>     branches             430.99  (  +-   .06% )
>>     branch-misses          0.50  (  +-  2.00% )
>>     L1-dcache-loads      440.02  (  +-   .07% )
>>     L1-dcache-load-misses  5.22  (  +- 82.75% )
>>
>>  to:
>>     cycles               678.79  (  +-  4.22% )
>>     instructions        1657.79  (  +-   .05% )
>>     IPC                    2.45  (  +-  4.08% )
>>     branches             432.00  (  +-   .05% )
>>     branch-misses          0.38  (  +- 23.68% )
>>     L1-dcache-loads      444.96  (  +-   .03% )
>>     L1-dcache-load-misses  5.13  (  +- 73.09% )
>>
>> (Both aggregated over 12 boot cycles.)
>>
>> Unclear if the improvement is just run-to-run variation or because of
>> a slightly denser loop (the list parameter in the list_for_each_entry_rcu()
>> exit check now comes from a register rather than a constant as before.)
>>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>>  kernel/auditsc.c | 86 +++++++++++++++++++++++++-----------------------
>>  1 file changed, 44 insertions(+), 42 deletions(-)
>>
>> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
>> index bf26f47b5226..dd89a52988b0 100644
>> --- a/kernel/auditsc.c
>> +++ b/kernel/auditsc.c
>> @@ -805,6 +805,41 @@ static bool audit_in_mask(const struct audit_krule *rule, unsigned long val)
>>         return rule->mask[word] & bit;
>>  }
>>
>> +/**
>> + * __audit_filter - common filter
>> + *
>
> Please remove the vertical whitespace between the function description
> line and the parameter descriptions.
>
> I'd also suggest renaming this function to "__audit_filter_op(...)"
> since we have a few audit filtering functions and a generic
> "__audit_filter()" function with no qualification in the name seems
> just a bit too generic to not be misleading ... at least I think so.
>
> I also might try to be just a bit more descriptive in the text above,
> for example:
>
> "__audit_filter_op - filter helper function for operations (syscall/uring/etc.)"
>
>> + * @tsk: associated task
>> + * @ctx: audit context
>> + * @list: audit filter list
>> + * @op: current syscall/uring_op
>> + * @name: name to be filtered (used by audit_filter_inode_name)
>
> I would change this to "@name: audit_name to use in filtering (can be
> NULL)" and leave it at that.
>
>> + *
>> + * return: 1 if we hit a filter, 0 if we don't
>
> The function header block comment description should be in regular
> English sentences.  Perhaps something like this:
>
> "Run the audit filters specified in @list against @tsk using @ctx,
> @op, and @name as necessary; the caller is responsible for ensuring
> that the call is made while the RCU read lock is held.  The @name
> parameter can be NULL, but all others must be specified.  Returns
> 1/true if the filter finds a match, 0/false if none are found."
>
>> + */
>> +static int __audit_filter(struct task_struct *tsk,
>> +                          struct audit_context *ctx,
>> +                          struct list_head *list,
>> +                          unsigned long op,
>> +                          struct audit_names *name)
>> +{
>> +       struct audit_entry *e;
>> +       enum audit_state state;
>> +
>> +       rcu_read_lock();
>
> I would move the RCU locking to the callers since not every caller requires it.
>
>> +       list_for_each_entry_rcu(e, list, list) {
>> +               if (unlikely(audit_in_mask(&e->rule, op))) {
>
> As discussed in patch 2/3, please remove the unlikely() call.

I'll pull it out of this patch.

And thanks for the comments. Will address and send out a v2.


Ankur

>
>> +                       if (audit_filter_rules(tsk, &e->rule, ctx, name,
>> +                                              &state, false)) {
>> +                               rcu_read_unlock();
>> +                               ctx->current_state = state;
>> +                               return 1;
>>  +                       }
>> +               }
>> +       }
>> +       rcu_read_unlock();
>> +       return 0;
>> +}
>> +

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 3/3] audit: unify audit_filter_{uring(),inode_name(),syscall()}
@ 2022-09-29 20:23       ` Ankur Arora
  0 siblings, 0 replies; 40+ messages in thread
From: Ankur Arora @ 2022-09-29 20:23 UTC (permalink / raw)
  To: Paul Moore
  Cc: konrad.wilk, linux-kernel, eparis, linux-audit, Ankur Arora,
	boris.ostrovsky


Paul Moore <paul@paul-moore.com> writes:

> On Tue, Sep 27, 2022 at 7:00 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>> audit_filter_uring(), audit_filter_inode_name() are substantially similar
>> to audit_filter_syscall(). Move the core logic to __audit_filter() which
>> can be parametrized for all three.
>>
>> On a Skylakex system, getpid() latency (all results aggregated
>> across 12 boot cycles):
>>
>>          Min     Mean    Median   Max      pstdev
>>          (ns)    (ns)    (ns)     (ns)
>>
>>  -    173.11   182.51  179.65  202.09     (+- 4.34%)
>>  +    162.11   175.26  173.71  190.56     (+- 4.33%)
>>
>> Performance counter stats for 'bin/getpid' (3 runs) go from:
>>     cycles               706.13  (  +-  4.13% )
>>     instructions        1654.70  (  +-   .06% )
>>     IPC                    2.35  (  +-  4.25% )
>>     branches             430.99  (  +-   .06% )
>>     branch-misses          0.50  (  +-  2.00% )
>>     L1-dcache-loads      440.02  (  +-   .07% )
>>     L1-dcache-load-misses  5.22  (  +- 82.75% )
>>
>>  to:
>>     cycles               678.79  (  +-  4.22% )
>>     instructions        1657.79  (  +-   .05% )
>>     IPC                    2.45  (  +-  4.08% )
>>     branches             432.00  (  +-   .05% )
>>     branch-misses          0.38  (  +- 23.68% )
>>     L1-dcache-loads      444.96  (  +-   .03% )
>>     L1-dcache-load-misses  5.13  (  +- 73.09% )
>>
>> (Both aggregated over 12 boot cycles.)
>>
>> Unclear if the improvement is just run-to-run variation or because of
>> a slightly denser loop (the list parameter in the list_for_each_entry_rcu()
>> exit check now comes from a register rather than a constant as before.)
>>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>>  kernel/auditsc.c | 86 +++++++++++++++++++++++++-----------------------
>>  1 file changed, 44 insertions(+), 42 deletions(-)
>>
>> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
>> index bf26f47b5226..dd89a52988b0 100644
>> --- a/kernel/auditsc.c
>> +++ b/kernel/auditsc.c
>> @@ -805,6 +805,41 @@ static bool audit_in_mask(const struct audit_krule *rule, unsigned long val)
>>         return rule->mask[word] & bit;
>>  }
>>
>> +/**
>> + * __audit_filter - common filter
>> + *
>
> Please remove the vertical whitespace between the function description
> line and the parameter descriptions.
>
> I'd also suggest renaming this function to "__audit_filter_op(...)"
> since we have a few audit filtering functions and a generic
> "__audit_filter()" function with no qualification in the name seems
> just a bit too generic to not be misleading ... at least I think so.
>
> I also might try to be just a bit more descriptive in the text above,
> for example:
>
> "__audit_filter_op - filter helper function for operations (syscall/uring/etc.)"
>
>> + * @tsk: associated task
>> + * @ctx: audit context
>> + * @list: audit filter list
>> + * @op: current syscall/uring_op
>> + * @name: name to be filtered (used by audit_filter_inode_name)
>
> I would change this to "@name: audit_name to use in filtering (can be
> NULL)" and leave it at that.
>
>> + *
>> + * return: 1 if we hit a filter, 0 if we don't
>
> The function header block comment description should be in regular
> English sentences.  Perhaps something like this:
>
> "Run the audit filters specified in @list against @tsk using @ctx,
> @op, and @name as necessary; the caller is responsible for ensuring
> that the call is made while the RCU read lock is held.  The @name
> parameter can be NULL, but all others must be specified.  Returns
> 1/true if the filter finds a match, 0/false if none are found."
>
>> + */
>> +static int __audit_filter(struct task_struct *tsk,
>> +                          struct audit_context *ctx,
>> +                          struct list_head *list,
>> +                          unsigned long op,
>> +                          struct audit_names *name)
>> +{
>> +       struct audit_entry *e;
>> +       enum audit_state state;
>> +
>> +       rcu_read_lock();
>
> I would move the RCU locking to the callers since not every caller requires it.
>
>> +       list_for_each_entry_rcu(e, list, list) {
>> +               if (unlikely(audit_in_mask(&e->rule, op))) {
>
> As discussed in patch 2/3, please remove the unlikely() call.

I'll pull it out of this patch.

And thanks for the comments. Will address and send out a v2.


Ankur

>
>> +                       if (audit_filter_rules(tsk, &e->rule, ctx, name,
>> +                                              &state, false)) {
>> +                               rcu_read_unlock();
>> +                               ctx->current_state = state;
>> +                               return 1;
>>  +                       }
>> +               }
>> +       }
>> +       rcu_read_unlock();
>> +       return 0;
>> +}
>> +

--
Linux-audit mailing list
Linux-audit@redhat.com
https://listman.redhat.com/mailman/listinfo/linux-audit


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/3] audit: cache ctx->major in audit_filter_syscall()
  2022-09-27 22:59   ` Ankur Arora
@ 2022-09-30 17:45     ` Steve Grubb
  -1 siblings, 0 replies; 40+ messages in thread
From: Steve Grubb @ 2022-09-30 17:45 UTC (permalink / raw)
  To: linux-audit
  Cc: boris.ostrovsky, konrad.wilk, eparis, linux-kernel, Ankur Arora

Hello,

Thanks for the detailed notes on this investigation. It really is  a lot of 
good information backing this up. However, there will come a day when someone 
sees this "major = ctx->major" and they will send a patch to "fix" this 
unnecessary assignment. If you are sending a V2 of this set, I would suggest 
adding some comment in the code that this is for a performance improvement 
and to see the commit message for additional info.

Thanks,
-Steve

On Tuesday, September 27, 2022 6:59:42 PM EDT Ankur Arora wrote:
> ctx->major contains the current syscall number. This is, of course, a
> constant for the duration of the syscall. Unfortunately, GCC's alias
> analysis cannot prove that it is not modified via a pointer in the
> audit_filter_syscall() loop, and so always loads it from memory.
> 
> In and of itself the load isn't very expensive (ops dependent on the
> ctx->major load are only used to determine the direction of control flow
> and have short dependence chains and, in any case the related branches
> get predicted perfectly in the fastpath) but still cache ctx->major
> in a local for two reasons:
> 
> * ctx->major is in the first cacheline of struct audit_context and has
>   similar alignment as audit_entry::list audit_entry. For cases
>   with a lot of audit rules, doing this reduces one source of contention
>   from a potentially busy cache-set.
> 
> * audit_in_mask() (called in the hot loop in audit_filter_syscall())
>   does cast manipulation and error checking on ctx->major:
> 
>      audit_in_mask(const struct audit_krule *rule, unsigned long val):
>              if (val > 0xffffffff)
>                      return false;
> 
>              word = AUDIT_WORD(val);
>              if (word >= AUDIT_BITMASK_SIZE)
>                      return false;
> 
>              bit = AUDIT_BIT(val);
> 
>              return rule->mask[word] & bit;
> 
>   The clauses related to the rule need to be evaluated in the loop, but
>   the rest is unnecessarily re-evaluated for every loop iteration.
>   (Note, however, that most of these are cheap ALU ops and the branches
>    are perfectly predicted. However, see discussion on cycles
>    improvement below for more on why it is still worth hoisting.)
> 
> On a Skylakex system change in getpid() latency (aggregated over
> 12 boot cycles):
> 
>              Min     Mean  Median     Max       pstdev
>             (ns)     (ns)    (ns)    (ns)
> 
>  -        201.30   216.14  216.22  228.46      (+- 1.45%)
>  +        196.63   207.86  206.60  230.98      (+- 3.92%)
> 
> Performance counter stats for 'bin/getpid' (3 runs) go from:
>     cycles               836.89  (  +-   .80% )
>     instructions        2000.19  (  +-   .03% )
>     IPC                    2.39  (  +-   .83% )
>     branches             430.14  (  +-   .03% )
>     branch-misses          1.48  (  +-  3.37% )
>     L1-dcache-loads      471.11  (  +-   .05% )
>     L1-dcache-load-misses  7.62  (  +- 46.98% )
> 
>  to:
>     cycles               805.58  (  +-  4.11% )
>     instructions        1654.11  (  +-   .05% )
>     IPC                    2.06  (  +-  3.39% )
>     branches             430.02  (  +-   .05% )
>     branch-misses          1.55  (  +-  7.09% )
>     L1-dcache-loads      440.01  (  +-   .09% )
>     L1-dcache-load-misses  9.05  (  +- 74.03% )
> 
> (Both aggregated over 12 boot cycles.)
> 
> instructions: we reduce around 8 instructions/iteration because some of
> the computation is now hoisted out of the loop (branch count does not
> change because GCC, for reasons unclear, only hoists the computations
> while keeping the basic-blocks.)
> 
> cycles: improve by about 5% (in aggregate and looking at individual run
> numbers.) This is likely because we now waste fewer pipeline resources
> on unnecessary instructions which allows the control flow to
> speculatively execute further ahead shortening the execution of the loop
> a little. The final gating factor on the performance of this loop
> remains the long dependence chain due to the linked-list load.
> 
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  kernel/auditsc.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> index 79a5da1bc5bb..533b087c3c02 100644
> --- a/kernel/auditsc.c
> +++ b/kernel/auditsc.c
> @@ -843,13 +843,14 @@ static void audit_filter_syscall(struct task_struct
> *tsk, {
>  	struct audit_entry *e;
>  	enum audit_state state;
> +	unsigned long major = ctx->major;
> 
>  	if (auditd_test_task(tsk))
>  		return;
> 
>  	rcu_read_lock();
>  	list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_EXIT], 
list) {
> -		if (audit_in_mask(&e->rule, ctx->major) &&
> +		if (audit_in_mask(&e->rule, major) &&
>  		    audit_filter_rules(tsk, &e->rule, ctx, NULL,
>  				       &state, false)) {
>  			rcu_read_unlock();





^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/3] audit: cache ctx->major in audit_filter_syscall()
@ 2022-09-30 17:45     ` Steve Grubb
  0 siblings, 0 replies; 40+ messages in thread
From: Steve Grubb @ 2022-09-30 17:45 UTC (permalink / raw)
  To: linux-audit
  Cc: boris.ostrovsky, Ankur Arora, linux-kernel, eparis, konrad.wilk

Hello,

Thanks for the detailed notes on this investigation. It really is  a lot of 
good information backing this up. However, there will come a day when someone 
sees this "major = ctx->major" and they will send a patch to "fix" this 
unnecessary assignment. If you are sending a V2 of this set, I would suggest 
adding some comment in the code that this is for a performance improvement 
and to see the commit message for additional info.

Thanks,
-Steve

On Tuesday, September 27, 2022 6:59:42 PM EDT Ankur Arora wrote:
> ctx->major contains the current syscall number. This is, of course, a
> constant for the duration of the syscall. Unfortunately, GCC's alias
> analysis cannot prove that it is not modified via a pointer in the
> audit_filter_syscall() loop, and so always loads it from memory.
> 
> In and of itself the load isn't very expensive (ops dependent on the
> ctx->major load are only used to determine the direction of control flow
> and have short dependence chains and, in any case the related branches
> get predicted perfectly in the fastpath) but still cache ctx->major
> in a local for two reasons:
> 
> * ctx->major is in the first cacheline of struct audit_context and has
>   similar alignment as audit_entry::list audit_entry. For cases
>   with a lot of audit rules, doing this reduces one source of contention
>   from a potentially busy cache-set.
> 
> * audit_in_mask() (called in the hot loop in audit_filter_syscall())
>   does cast manipulation and error checking on ctx->major:
> 
>      audit_in_mask(const struct audit_krule *rule, unsigned long val):
>              if (val > 0xffffffff)
>                      return false;
> 
>              word = AUDIT_WORD(val);
>              if (word >= AUDIT_BITMASK_SIZE)
>                      return false;
> 
>              bit = AUDIT_BIT(val);
> 
>              return rule->mask[word] & bit;
> 
>   The clauses related to the rule need to be evaluated in the loop, but
>   the rest is unnecessarily re-evaluated for every loop iteration.
>   (Note, however, that most of these are cheap ALU ops and the branches
>    are perfectly predicted. However, see discussion on cycles
>    improvement below for more on why it is still worth hoisting.)
> 
> On a Skylakex system change in getpid() latency (aggregated over
> 12 boot cycles):
> 
>              Min     Mean  Median     Max       pstdev
>             (ns)     (ns)    (ns)    (ns)
> 
>  -        201.30   216.14  216.22  228.46      (+- 1.45%)
>  +        196.63   207.86  206.60  230.98      (+- 3.92%)
> 
> Performance counter stats for 'bin/getpid' (3 runs) go from:
>     cycles               836.89  (  +-   .80% )
>     instructions        2000.19  (  +-   .03% )
>     IPC                    2.39  (  +-   .83% )
>     branches             430.14  (  +-   .03% )
>     branch-misses          1.48  (  +-  3.37% )
>     L1-dcache-loads      471.11  (  +-   .05% )
>     L1-dcache-load-misses  7.62  (  +- 46.98% )
> 
>  to:
>     cycles               805.58  (  +-  4.11% )
>     instructions        1654.11  (  +-   .05% )
>     IPC                    2.06  (  +-  3.39% )
>     branches             430.02  (  +-   .05% )
>     branch-misses          1.55  (  +-  7.09% )
>     L1-dcache-loads      440.01  (  +-   .09% )
>     L1-dcache-load-misses  9.05  (  +- 74.03% )
> 
> (Both aggregated over 12 boot cycles.)
> 
> instructions: we reduce around 8 instructions/iteration because some of
> the computation is now hoisted out of the loop (branch count does not
> change because GCC, for reasons unclear, only hoists the computations
> while keeping the basic-blocks.)
> 
> cycles: improve by about 5% (in aggregate and looking at individual run
> numbers.) This is likely because we now waste fewer pipeline resources
> on unnecessary instructions which allows the control flow to
> speculatively execute further ahead shortening the execution of the loop
> a little. The final gating factor on the performance of this loop
> remains the long dependence chain due to the linked-list load.
> 
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  kernel/auditsc.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> index 79a5da1bc5bb..533b087c3c02 100644
> --- a/kernel/auditsc.c
> +++ b/kernel/auditsc.c
> @@ -843,13 +843,14 @@ static void audit_filter_syscall(struct task_struct
> *tsk, {
>  	struct audit_entry *e;
>  	enum audit_state state;
> +	unsigned long major = ctx->major;
> 
>  	if (auditd_test_task(tsk))
>  		return;
> 
>  	rcu_read_lock();
>  	list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_EXIT], 
list) {
> -		if (audit_in_mask(&e->rule, ctx->major) &&
> +		if (audit_in_mask(&e->rule, major) &&
>  		    audit_filter_rules(tsk, &e->rule, ctx, NULL,
>  				       &state, false)) {
>  			rcu_read_unlock();




--
Linux-audit mailing list
Linux-audit@redhat.com
https://listman.redhat.com/mailman/listinfo/linux-audit


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/3] audit: cache ctx->major in audit_filter_syscall()
  2022-09-30 17:45     ` Steve Grubb
@ 2022-09-30 18:22       ` Paul Moore
  -1 siblings, 0 replies; 40+ messages in thread
From: Paul Moore @ 2022-09-30 18:22 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-audit, boris.ostrovsky, linux-kernel, eparis, konrad.wilk,
	Steve Grubb

On Fri, Sep 30, 2022 at 1:45 PM Steve Grubb <sgrubb@redhat.com> wrote:
> Hello,
>
> Thanks for the detailed notes on this investigation. It really is  a lot of
> good information backing this up. However, there will come a day when someone
> sees this "major = ctx->major" and they will send a patch to "fix" this
> unnecessary assignment. If you are sending a V2 of this set, I would suggest
> adding some comment in the code that this is for a performance improvement
> and to see the commit message for additional info.

Please do not send a v2 with the changes Steve is suggesting, I think
this patch is fine as-is.

-- 
paul-moore.com

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/3] audit: cache ctx->major in audit_filter_syscall()
@ 2022-09-30 18:22       ` Paul Moore
  0 siblings, 0 replies; 40+ messages in thread
From: Paul Moore @ 2022-09-30 18:22 UTC (permalink / raw)
  To: Ankur Arora
  Cc: konrad.wilk, linux-kernel, eparis, linux-audit, boris.ostrovsky

On Fri, Sep 30, 2022 at 1:45 PM Steve Grubb <sgrubb@redhat.com> wrote:
> Hello,
>
> Thanks for the detailed notes on this investigation. It really is  a lot of
> good information backing this up. However, there will come a day when someone
> sees this "major = ctx->major" and they will send a patch to "fix" this
> unnecessary assignment. If you are sending a V2 of this set, I would suggest
> adding some comment in the code that this is for a performance improvement
> and to see the commit message for additional info.

Please do not send a v2 with the changes Steve is suggesting, I think
this patch is fine as-is.

-- 
paul-moore.com

--
Linux-audit mailing list
Linux-audit@redhat.com
https://listman.redhat.com/mailman/listinfo/linux-audit


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/3] audit: annotate branch direction for audit_in_mask()
  2022-09-29 20:19       ` Ankur Arora
@ 2022-09-30 18:48         ` Paul Moore
  -1 siblings, 0 replies; 40+ messages in thread
From: Paul Moore @ 2022-09-30 18:48 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-audit, eparis, linux-kernel, boris.ostrovsky, konrad.wilk

On Thu, Sep 29, 2022 at 4:20 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
> Paul Moore <paul@paul-moore.com> writes:
> > I generally dislike merging likely()/unlikely() additions to code
> > paths that can have varying levels of performance depending on runtime
> > configuration.
>
> I think that's fair, and in this particular case the benchmark is quite
> contrived.
>
> But, just to elaborate a bit more on why that unlikely() clause made
> sense to me: it seems to me that audit typically would be triggered for
> control syscalls and the ratio between control and non-control ones
> would be fairly lopsided.

I understand, and there is definitely some precedence in the audit
code for using likely()/unlikely() in a manner similar as you
described, but I'll refer to my previous comments - it's not something
I like.  As a general rule, aside from the unlikely() calls in the
audit wrappers present in include/linux/audit.h I would suggest not
adding any new likely()/unlikely() calls.

> Let me see if I can rewrite the conditional in a different way to get a
> similar effect but I suspect that might be even more compiler dependent.

I am okay with ordering conditionals to make the common case the
short-circuit case.

> Also, let me run the audit-testsuite this time. Is there a good test
> there that you would recommend that might serve as a more representative
> workload?

Probably not.  The audit-testsuite is intended to be a quick, easy to
run regression test that can be used by developers to help reduce
audit breakage.  It is not representative of any particular workload,
and is definitely not comprehensive (it is woefully lacking in several
areas unfortunately).

-- 
paul-moore.com

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/3] audit: annotate branch direction for audit_in_mask()
@ 2022-09-30 18:48         ` Paul Moore
  0 siblings, 0 replies; 40+ messages in thread
From: Paul Moore @ 2022-09-30 18:48 UTC (permalink / raw)
  To: Ankur Arora
  Cc: boris.ostrovsky, linux-audit, linux-kernel, eparis, konrad.wilk

On Thu, Sep 29, 2022 at 4:20 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
> Paul Moore <paul@paul-moore.com> writes:
> > I generally dislike merging likely()/unlikely() additions to code
> > paths that can have varying levels of performance depending on runtime
> > configuration.
>
> I think that's fair, and in this particular case the benchmark is quite
> contrived.
>
> But, just to elaborate a bit more on why that unlikely() clause made
> sense to me: it seems to me that audit typically would be triggered for
> control syscalls and the ratio between control and non-control ones
> would be fairly lopsided.

I understand, and there is definitely some precedence in the audit
code for using likely()/unlikely() in a manner similar as you
described, but I'll refer to my previous comments - it's not something
I like.  As a general rule, aside from the unlikely() calls in the
audit wrappers present in include/linux/audit.h I would suggest not
adding any new likely()/unlikely() calls.

> Let me see if I can rewrite the conditional in a different way to get a
> similar effect but I suspect that might be even more compiler dependent.

I am okay with ordering conditionals to make the common case the
short-circuit case.

> Also, let me run the audit-testsuite this time. Is there a good test
> there that you would recommend that might serve as a more representative
> workload?

Probably not.  The audit-testsuite is intended to be a quick, easy to
run regression test that can be used by developers to help reduce
audit breakage.  It is not representative of any particular workload,
and is definitely not comprehensive (it is woefully lacking in several
areas unfortunately).

-- 
paul-moore.com

--
Linux-audit mailing list
Linux-audit@redhat.com
https://listman.redhat.com/mailman/listinfo/linux-audit


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v2] audit: unify audit_filter_{uring(),inode_name(),syscall()}
  2022-09-27 22:59   ` [PATCH 3/3] audit: unify audit_filter_{uring(), inode_name(), syscall()} Ankur Arora
@ 2022-10-07  0:49     ` Ankur Arora
  -1 siblings, 0 replies; 40+ messages in thread
From: Ankur Arora @ 2022-10-07  0:49 UTC (permalink / raw)
  To: linux-audit
  Cc: paul, eparis, sgrubb, linux-kernel, boris.ostrovsky, konrad.wilk

audit_filter_uring(), audit_filter_inode_name() are substantially
similar to audit_filter_syscall(). Move the core logic to
__audit_filter_op() which can be parametrized for all three.

On a Skylakex system, getpid() latency (all results aggregated
across 12 boot cycles):

         Min     Mean    Median   Max      pstdev
         (ns)    (ns)    (ns)     (ns)

 -    196.63   207.86  206.60  230.98      (+- 3.92%)
 +    183.73   196.95  192.31  232.49	   (+- 6.04%)

Performance counter stats for 'bin/getpid' (3 runs) go from:
    cycles               805.58  (  +-  4.11% )
    instructions        1654.11  (  +-   .05% )
    IPC                    2.06  (  +-  3.39% )
    branches             430.02  (  +-   .05% )
    branch-misses          1.55  (  +-  7.09% )
    L1-dcache-loads      440.01  (  +-   .09% )
    L1-dcache-load-misses  9.05  (  +- 74.03% )
to:
    cycles		 765.37  (  +-  6.66% )
    instructions        1677.07  (  +-  0.04% )
    IPC		           2.20  (  +-  5.90% )
    branches	         431.10  (  +-  0.04% )
    branch-misses	   1.60  (  +- 11.25% )
    L1-dcache-loads	 521.04  (  +-  0.05% )
    L1-dcache-load-misses  6.92  (  +- 77.60% )

(Both aggregated over 12 boot cycles.)

The increased L1-dcache-loads are due to some intermediate values now
coming from the stack.

The improvement in cycles is due to a slightly denser loop (the list
parameter in the list_for_each_entry_rcu() exit check now comes from
a register rather than a constant as before.)

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/auditsc.c | 76 +++++++++++++++++++++++++-----------------------
 1 file changed, 39 insertions(+), 37 deletions(-)

diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 2206cdf1ba2c..4991348e300a 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -807,6 +807,40 @@ static int audit_in_mask(const struct audit_krule *rule, unsigned long val)
 	return rule->mask[word] & bit;
 }
 
+/**
+ * __audit_filter_op - common filter helper for operations (syscall/uring/etc)
+ * @tsk: associated task
+ * @ctx: audit context
+ * @list: audit filter list
+ * @name: audit_name (can be NULL)
+ * @op: current syscall/uring_op
+ *
+ * Run the udit filters specified in @list against @tsk using @ctx,
+ * @name, and @op, as necessary; the caller is responsible for ensuring
+ * that the call is made while the RCU read lock is held. The @name
+ * parameter can be NULL, but all others must be specified.
+ * Returns 1/true if the filter finds a match, 0/false if none are found.
+ */
+static int __audit_filter_op(struct task_struct *tsk,
+			   struct audit_context *ctx,
+			   struct list_head *list,
+			   struct audit_names *name,
+			   unsigned long op)
+{
+	struct audit_entry *e;
+	enum audit_state state;
+
+	list_for_each_entry_rcu(e, list, list) {
+		if (audit_in_mask(&e->rule, op) &&
+		    audit_filter_rules(tsk, &e->rule, ctx, name,
+				       &state, false)) {
+			ctx->current_state = state;
+			return 1;
+		}
+	}
+	return 0;
+}
+
 /**
  * audit_filter_uring - apply filters to an io_uring operation
  * @tsk: associated task
@@ -815,23 +849,12 @@ static int audit_in_mask(const struct audit_krule *rule, unsigned long val)
 static void audit_filter_uring(struct task_struct *tsk,
 			       struct audit_context *ctx)
 {
-	struct audit_entry *e;
-	enum audit_state state;
-
 	if (auditd_test_task(tsk))
 		return;
 
 	rcu_read_lock();
-	list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_URING_EXIT],
-				list) {
-		if (audit_in_mask(&e->rule, ctx->uring_op) &&
-		    audit_filter_rules(tsk, &e->rule, ctx, NULL, &state,
-				       false)) {
-			rcu_read_unlock();
-			ctx->current_state = state;
-			return;
-		}
-	}
+	__audit_filter_op(tsk, ctx, &audit_filter_list[AUDIT_FILTER_URING_EXIT],
+			NULL, ctx->uring_op);
 	rcu_read_unlock();
 }
 
@@ -843,25 +866,13 @@ static void audit_filter_uring(struct task_struct *tsk,
 static void audit_filter_syscall(struct task_struct *tsk,
 				 struct audit_context *ctx)
 {
-	struct audit_entry *e;
-	enum audit_state state;
-	unsigned long major = ctx->major;
-
 	if (auditd_test_task(tsk))
 		return;
 
 	rcu_read_lock();
-	list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_EXIT], list) {
-		if (audit_in_mask(&e->rule, major) &&
-		    audit_filter_rules(tsk, &e->rule, ctx, NULL,
-				       &state, false)) {
-			rcu_read_unlock();
-			ctx->current_state = state;
-			return;
-		}
-	}
+	__audit_filter_op(tsk, ctx, &audit_filter_list[AUDIT_FILTER_EXIT],
+			NULL, ctx->major);
 	rcu_read_unlock();
-	return;
 }
 
 /*
@@ -873,17 +884,8 @@ static int audit_filter_inode_name(struct task_struct *tsk,
 				   struct audit_context *ctx) {
 	int h = audit_hash_ino((u32)n->ino);
 	struct list_head *list = &audit_inode_hash[h];
-	struct audit_entry *e;
-	enum audit_state state;
 
-	list_for_each_entry_rcu(e, list, list) {
-		if (audit_in_mask(&e->rule, ctx->major) &&
-		    audit_filter_rules(tsk, &e->rule, ctx, n, &state, false)) {
-			ctx->current_state = state;
-			return 1;
-		}
-	}
-	return 0;
+	return __audit_filter_op(tsk, ctx, list, n, ctx->major);
 }
 
 /* At syscall exit time, this filter is called if any audit_names have been
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v2] audit: unify audit_filter_{uring(), inode_name(), syscall()}
@ 2022-10-07  0:49     ` Ankur Arora
  0 siblings, 0 replies; 40+ messages in thread
From: Ankur Arora @ 2022-10-07  0:49 UTC (permalink / raw)
  To: linux-audit; +Cc: konrad.wilk, linux-kernel, eparis, boris.ostrovsky

audit_filter_uring(), audit_filter_inode_name() are substantially
similar to audit_filter_syscall(). Move the core logic to
__audit_filter_op() which can be parametrized for all three.

On a Skylakex system, getpid() latency (all results aggregated
across 12 boot cycles):

         Min     Mean    Median   Max      pstdev
         (ns)    (ns)    (ns)     (ns)

 -    196.63   207.86  206.60  230.98      (+- 3.92%)
 +    183.73   196.95  192.31  232.49	   (+- 6.04%)

Performance counter stats for 'bin/getpid' (3 runs) go from:
    cycles               805.58  (  +-  4.11% )
    instructions        1654.11  (  +-   .05% )
    IPC                    2.06  (  +-  3.39% )
    branches             430.02  (  +-   .05% )
    branch-misses          1.55  (  +-  7.09% )
    L1-dcache-loads      440.01  (  +-   .09% )
    L1-dcache-load-misses  9.05  (  +- 74.03% )
to:
    cycles		 765.37  (  +-  6.66% )
    instructions        1677.07  (  +-  0.04% )
    IPC		           2.20  (  +-  5.90% )
    branches	         431.10  (  +-  0.04% )
    branch-misses	   1.60  (  +- 11.25% )
    L1-dcache-loads	 521.04  (  +-  0.05% )
    L1-dcache-load-misses  6.92  (  +- 77.60% )

(Both aggregated over 12 boot cycles.)

The increased L1-dcache-loads are due to some intermediate values now
coming from the stack.

The improvement in cycles is due to a slightly denser loop (the list
parameter in the list_for_each_entry_rcu() exit check now comes from
a register rather than a constant as before.)

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/auditsc.c | 76 +++++++++++++++++++++++++-----------------------
 1 file changed, 39 insertions(+), 37 deletions(-)

diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 2206cdf1ba2c..4991348e300a 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -807,6 +807,40 @@ static int audit_in_mask(const struct audit_krule *rule, unsigned long val)
 	return rule->mask[word] & bit;
 }
 
+/**
+ * __audit_filter_op - common filter helper for operations (syscall/uring/etc)
+ * @tsk: associated task
+ * @ctx: audit context
+ * @list: audit filter list
+ * @name: audit_name (can be NULL)
+ * @op: current syscall/uring_op
+ *
+ * Run the udit filters specified in @list against @tsk using @ctx,
+ * @name, and @op, as necessary; the caller is responsible for ensuring
+ * that the call is made while the RCU read lock is held. The @name
+ * parameter can be NULL, but all others must be specified.
+ * Returns 1/true if the filter finds a match, 0/false if none are found.
+ */
+static int __audit_filter_op(struct task_struct *tsk,
+			   struct audit_context *ctx,
+			   struct list_head *list,
+			   struct audit_names *name,
+			   unsigned long op)
+{
+	struct audit_entry *e;
+	enum audit_state state;
+
+	list_for_each_entry_rcu(e, list, list) {
+		if (audit_in_mask(&e->rule, op) &&
+		    audit_filter_rules(tsk, &e->rule, ctx, name,
+				       &state, false)) {
+			ctx->current_state = state;
+			return 1;
+		}
+	}
+	return 0;
+}
+
 /**
  * audit_filter_uring - apply filters to an io_uring operation
  * @tsk: associated task
@@ -815,23 +849,12 @@ static int audit_in_mask(const struct audit_krule *rule, unsigned long val)
 static void audit_filter_uring(struct task_struct *tsk,
 			       struct audit_context *ctx)
 {
-	struct audit_entry *e;
-	enum audit_state state;
-
 	if (auditd_test_task(tsk))
 		return;
 
 	rcu_read_lock();
-	list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_URING_EXIT],
-				list) {
-		if (audit_in_mask(&e->rule, ctx->uring_op) &&
-		    audit_filter_rules(tsk, &e->rule, ctx, NULL, &state,
-				       false)) {
-			rcu_read_unlock();
-			ctx->current_state = state;
-			return;
-		}
-	}
+	__audit_filter_op(tsk, ctx, &audit_filter_list[AUDIT_FILTER_URING_EXIT],
+			NULL, ctx->uring_op);
 	rcu_read_unlock();
 }
 
@@ -843,25 +866,13 @@ static void audit_filter_uring(struct task_struct *tsk,
 static void audit_filter_syscall(struct task_struct *tsk,
 				 struct audit_context *ctx)
 {
-	struct audit_entry *e;
-	enum audit_state state;
-	unsigned long major = ctx->major;
-
 	if (auditd_test_task(tsk))
 		return;
 
 	rcu_read_lock();
-	list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_EXIT], list) {
-		if (audit_in_mask(&e->rule, major) &&
-		    audit_filter_rules(tsk, &e->rule, ctx, NULL,
-				       &state, false)) {
-			rcu_read_unlock();
-			ctx->current_state = state;
-			return;
-		}
-	}
+	__audit_filter_op(tsk, ctx, &audit_filter_list[AUDIT_FILTER_EXIT],
+			NULL, ctx->major);
 	rcu_read_unlock();
-	return;
 }
 
 /*
@@ -873,17 +884,8 @@ static int audit_filter_inode_name(struct task_struct *tsk,
 				   struct audit_context *ctx) {
 	int h = audit_hash_ino((u32)n->ino);
 	struct list_head *list = &audit_inode_hash[h];
-	struct audit_entry *e;
-	enum audit_state state;
 
-	list_for_each_entry_rcu(e, list, list) {
-		if (audit_in_mask(&e->rule, ctx->major) &&
-		    audit_filter_rules(tsk, &e->rule, ctx, n, &state, false)) {
-			ctx->current_state = state;
-			return 1;
-		}
-	}
-	return 0;
+	return __audit_filter_op(tsk, ctx, list, n, ctx->major);
 }
 
 /* At syscall exit time, this filter is called if any audit_names have been
-- 
2.31.1

--
Linux-audit mailing list
Linux-audit@redhat.com
https://listman.redhat.com/mailman/listinfo/linux-audit


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/3] audit: cache ctx->major in audit_filter_syscall()
  2022-09-30 17:45     ` Steve Grubb
@ 2022-10-07  0:55       ` Ankur Arora
  -1 siblings, 0 replies; 40+ messages in thread
From: Ankur Arora @ 2022-10-07  0:55 UTC (permalink / raw)
  To: Steve Grubb
  Cc: linux-audit, boris.ostrovsky, konrad.wilk, eparis, linux-kernel,
	Ankur Arora


Steve Grubb <sgrubb@redhat.com> writes:

> Hello,
>
> Thanks for the detailed notes on this investigation. It really is  a lot of
> good information backing this up. However, there will come a day when someone
> sees this "major = ctx->major" and they will send a patch to "fix" this
> unnecessary assignment. If you are sending a V2 of this set, I would suggest
> adding some comment in the code that this is for a performance improvement
> and to see the commit message for additional info.

Thanks for the comment. Just sent out v2 of the last patch which
addresses this tangentially -- it adds a separate function parameter
for ctx->major/uring_op, so this shouldn't be an issue anymore.

Thanks
Ankur

>
> Thanks,
> -Steve
>
> On Tuesday, September 27, 2022 6:59:42 PM EDT Ankur Arora wrote:
>> ctx->major contains the current syscall number. This is, of course, a
>> constant for the duration of the syscall. Unfortunately, GCC's alias
>> analysis cannot prove that it is not modified via a pointer in the
>> audit_filter_syscall() loop, and so always loads it from memory.
>>
>> In and of itself the load isn't very expensive (ops dependent on the
>> ctx->major load are only used to determine the direction of control flow
>> and have short dependence chains and, in any case the related branches
>> get predicted perfectly in the fastpath) but still cache ctx->major
>> in a local for two reasons:
>>
>> * ctx->major is in the first cacheline of struct audit_context and has
>>   similar alignment as audit_entry::list audit_entry. For cases
>>   with a lot of audit rules, doing this reduces one source of contention
>>   from a potentially busy cache-set.
>>
>> * audit_in_mask() (called in the hot loop in audit_filter_syscall())
>>   does cast manipulation and error checking on ctx->major:
>>
>>      audit_in_mask(const struct audit_krule *rule, unsigned long val):
>>              if (val > 0xffffffff)
>>                      return false;
>>
>>              word = AUDIT_WORD(val);
>>              if (word >= AUDIT_BITMASK_SIZE)
>>                      return false;
>>
>>              bit = AUDIT_BIT(val);
>>
>>              return rule->mask[word] & bit;
>>
>>   The clauses related to the rule need to be evaluated in the loop, but
>>   the rest is unnecessarily re-evaluated for every loop iteration.
>>   (Note, however, that most of these are cheap ALU ops and the branches
>>    are perfectly predicted. However, see discussion on cycles
>>    improvement below for more on why it is still worth hoisting.)
>>
>> On a Skylakex system change in getpid() latency (aggregated over
>> 12 boot cycles):
>>
>>              Min     Mean  Median     Max       pstdev
>>             (ns)     (ns)    (ns)    (ns)
>>
>>  -        201.30   216.14  216.22  228.46      (+- 1.45%)
>>  +        196.63   207.86  206.60  230.98      (+- 3.92%)
>>
>> Performance counter stats for 'bin/getpid' (3 runs) go from:
>>     cycles               836.89  (  +-   .80% )
>>     instructions        2000.19  (  +-   .03% )
>>     IPC                    2.39  (  +-   .83% )
>>     branches             430.14  (  +-   .03% )
>>     branch-misses          1.48  (  +-  3.37% )
>>     L1-dcache-loads      471.11  (  +-   .05% )
>>     L1-dcache-load-misses  7.62  (  +- 46.98% )
>>
>>  to:
>>     cycles               805.58  (  +-  4.11% )
>>     instructions        1654.11  (  +-   .05% )
>>     IPC                    2.06  (  +-  3.39% )
>>     branches             430.02  (  +-   .05% )
>>     branch-misses          1.55  (  +-  7.09% )
>>     L1-dcache-loads      440.01  (  +-   .09% )
>>     L1-dcache-load-misses  9.05  (  +- 74.03% )
>>
>> (Both aggregated over 12 boot cycles.)
>>
>> instructions: we reduce around 8 instructions/iteration because some of
>> the computation is now hoisted out of the loop (branch count does not
>> change because GCC, for reasons unclear, only hoists the computations
>> while keeping the basic-blocks.)
>>
>> cycles: improve by about 5% (in aggregate and looking at individual run
>> numbers.) This is likely because we now waste fewer pipeline resources
>> on unnecessary instructions which allows the control flow to
>> speculatively execute further ahead shortening the execution of the loop
>> a little. The final gating factor on the performance of this loop
>> remains the long dependence chain due to the linked-list load.
>>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>>  kernel/auditsc.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
>> index 79a5da1bc5bb..533b087c3c02 100644
>> --- a/kernel/auditsc.c
>> +++ b/kernel/auditsc.c
>> @@ -843,13 +843,14 @@ static void audit_filter_syscall(struct task_struct
>> *tsk, {
>>  	struct audit_entry *e;
>>  	enum audit_state state;
>> +	unsigned long major = ctx->major;
>>
>>  	if (auditd_test_task(tsk))
>>  		return;
>>
>>  	rcu_read_lock();
>>  	list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_EXIT],
> list) {
>> -		if (audit_in_mask(&e->rule, ctx->major) &&
>> +		if (audit_in_mask(&e->rule, major) &&
>>  		    audit_filter_rules(tsk, &e->rule, ctx, NULL,
>>  				       &state, false)) {
>>  			rcu_read_unlock();


--
ankur

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/3] audit: cache ctx->major in audit_filter_syscall()
@ 2022-10-07  0:55       ` Ankur Arora
  0 siblings, 0 replies; 40+ messages in thread
From: Ankur Arora @ 2022-10-07  0:55 UTC (permalink / raw)
  To: Steve Grubb
  Cc: konrad.wilk, linux-kernel, linux-audit, eparis, boris.ostrovsky


Steve Grubb <sgrubb@redhat.com> writes:

> Hello,
>
> Thanks for the detailed notes on this investigation. It really is  a lot of
> good information backing this up. However, there will come a day when someone
> sees this "major = ctx->major" and they will send a patch to "fix" this
> unnecessary assignment. If you are sending a V2 of this set, I would suggest
> adding some comment in the code that this is for a performance improvement
> and to see the commit message for additional info.

Thanks for the comment. Just sent out v2 of the last patch which
addresses this tangentially -- it adds a separate function parameter
for ctx->major/uring_op, so this shouldn't be an issue anymore.

Thanks
Ankur

>
> Thanks,
> -Steve
>
> On Tuesday, September 27, 2022 6:59:42 PM EDT Ankur Arora wrote:
>> ctx->major contains the current syscall number. This is, of course, a
>> constant for the duration of the syscall. Unfortunately, GCC's alias
>> analysis cannot prove that it is not modified via a pointer in the
>> audit_filter_syscall() loop, and so always loads it from memory.
>>
>> In and of itself the load isn't very expensive (ops dependent on the
>> ctx->major load are only used to determine the direction of control flow
>> and have short dependence chains and, in any case the related branches
>> get predicted perfectly in the fastpath) but still cache ctx->major
>> in a local for two reasons:
>>
>> * ctx->major is in the first cacheline of struct audit_context and has
>>   similar alignment as audit_entry::list audit_entry. For cases
>>   with a lot of audit rules, doing this reduces one source of contention
>>   from a potentially busy cache-set.
>>
>> * audit_in_mask() (called in the hot loop in audit_filter_syscall())
>>   does cast manipulation and error checking on ctx->major:
>>
>>      audit_in_mask(const struct audit_krule *rule, unsigned long val):
>>              if (val > 0xffffffff)
>>                      return false;
>>
>>              word = AUDIT_WORD(val);
>>              if (word >= AUDIT_BITMASK_SIZE)
>>                      return false;
>>
>>              bit = AUDIT_BIT(val);
>>
>>              return rule->mask[word] & bit;
>>
>>   The clauses related to the rule need to be evaluated in the loop, but
>>   the rest is unnecessarily re-evaluated for every loop iteration.
>>   (Note, however, that most of these are cheap ALU ops and the branches
>>    are perfectly predicted. However, see discussion on cycles
>>    improvement below for more on why it is still worth hoisting.)
>>
>> On a Skylakex system change in getpid() latency (aggregated over
>> 12 boot cycles):
>>
>>              Min     Mean  Median     Max       pstdev
>>             (ns)     (ns)    (ns)    (ns)
>>
>>  -        201.30   216.14  216.22  228.46      (+- 1.45%)
>>  +        196.63   207.86  206.60  230.98      (+- 3.92%)
>>
>> Performance counter stats for 'bin/getpid' (3 runs) go from:
>>     cycles               836.89  (  +-   .80% )
>>     instructions        2000.19  (  +-   .03% )
>>     IPC                    2.39  (  +-   .83% )
>>     branches             430.14  (  +-   .03% )
>>     branch-misses          1.48  (  +-  3.37% )
>>     L1-dcache-loads      471.11  (  +-   .05% )
>>     L1-dcache-load-misses  7.62  (  +- 46.98% )
>>
>>  to:
>>     cycles               805.58  (  +-  4.11% )
>>     instructions        1654.11  (  +-   .05% )
>>     IPC                    2.06  (  +-  3.39% )
>>     branches             430.02  (  +-   .05% )
>>     branch-misses          1.55  (  +-  7.09% )
>>     L1-dcache-loads      440.01  (  +-   .09% )
>>     L1-dcache-load-misses  9.05  (  +- 74.03% )
>>
>> (Both aggregated over 12 boot cycles.)
>>
>> instructions: we reduce around 8 instructions/iteration because some of
>> the computation is now hoisted out of the loop (branch count does not
>> change because GCC, for reasons unclear, only hoists the computations
>> while keeping the basic-blocks.)
>>
>> cycles: improve by about 5% (in aggregate and looking at individual run
>> numbers.) This is likely because we now waste fewer pipeline resources
>> on unnecessary instructions which allows the control flow to
>> speculatively execute further ahead shortening the execution of the loop
>> a little. The final gating factor on the performance of this loop
>> remains the long dependence chain due to the linked-list load.
>>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>>  kernel/auditsc.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
>> index 79a5da1bc5bb..533b087c3c02 100644
>> --- a/kernel/auditsc.c
>> +++ b/kernel/auditsc.c
>> @@ -843,13 +843,14 @@ static void audit_filter_syscall(struct task_struct
>> *tsk, {
>>  	struct audit_entry *e;
>>  	enum audit_state state;
>> +	unsigned long major = ctx->major;
>>
>>  	if (auditd_test_task(tsk))
>>  		return;
>>
>>  	rcu_read_lock();
>>  	list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_EXIT],
> list) {
>> -		if (audit_in_mask(&e->rule, ctx->major) &&
>> +		if (audit_in_mask(&e->rule, major) &&
>>  		    audit_filter_rules(tsk, &e->rule, ctx, NULL,
>>  				       &state, false)) {
>>  			rcu_read_unlock();


--
ankur

--
Linux-audit mailing list
Linux-audit@redhat.com
https://listman.redhat.com/mailman/listinfo/linux-audit


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/3] audit: annotate branch direction for audit_in_mask()
  2022-09-30 18:48         ` Paul Moore
@ 2022-10-07  0:57           ` Ankur Arora
  -1 siblings, 0 replies; 40+ messages in thread
From: Ankur Arora @ 2022-10-07  0:57 UTC (permalink / raw)
  To: Paul Moore
  Cc: Ankur Arora, linux-audit, eparis, linux-kernel, boris.ostrovsky,
	konrad.wilk


Paul Moore <paul@paul-moore.com> writes:

> On Thu, Sep 29, 2022 at 4:20 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>> Paul Moore <paul@paul-moore.com> writes:
>> > I generally dislike merging likely()/unlikely() additions to code
>> > paths that can have varying levels of performance depending on runtime
>> > configuration.
>>
>> I think that's fair, and in this particular case the benchmark is quite
>> contrived.
>>
>> But, just to elaborate a bit more on why that unlikely() clause made
>> sense to me: it seems to me that audit typically would be triggered for
>> control syscalls and the ratio between control and non-control ones
>> would be fairly lopsided.
>
> I understand, and there is definitely some precedence in the audit
> code for using likely()/unlikely() in a manner similar as you
> described, but I'll refer to my previous comments - it's not something
> I like.  As a general rule, aside from the unlikely() calls in the
> audit wrappers present in include/linux/audit.h I would suggest not
> adding any new likely()/unlikely() calls.
>
>> Let me see if I can rewrite the conditional in a different way to get a
>> similar effect but I suspect that might be even more compiler dependent.
>
> I am okay with ordering conditionals to make the common case the
> short-circuit case.

So I played around with a bunch of different combinations of the
conditionals but nothing really improved the code all that much.

Just sent out v2 dropping the unlikely() clause.


Thanks
Ankur

>
>> Also, let me run the audit-testsuite this time. Is there a good test
>> there that you would recommend that might serve as a more representative
>> workload?
>
> Probably not.  The audit-testsuite is intended to be a quick, easy to
> run regression test that can be used by developers to help reduce
> audit breakage.  It is not representative of any particular workload,
> and is definitely not comprehensive (it is woefully lacking in several
> areas unfortunately).


--
ankur

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/3] audit: annotate branch direction for audit_in_mask()
@ 2022-10-07  0:57           ` Ankur Arora
  0 siblings, 0 replies; 40+ messages in thread
From: Ankur Arora @ 2022-10-07  0:57 UTC (permalink / raw)
  To: Paul Moore
  Cc: konrad.wilk, linux-kernel, eparis, linux-audit, boris.ostrovsky


Paul Moore <paul@paul-moore.com> writes:

> On Thu, Sep 29, 2022 at 4:20 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>> Paul Moore <paul@paul-moore.com> writes:
>> > I generally dislike merging likely()/unlikely() additions to code
>> > paths that can have varying levels of performance depending on runtime
>> > configuration.
>>
>> I think that's fair, and in this particular case the benchmark is quite
>> contrived.
>>
>> But, just to elaborate a bit more on why that unlikely() clause made
>> sense to me: it seems to me that audit typically would be triggered for
>> control syscalls and the ratio between control and non-control ones
>> would be fairly lopsided.
>
> I understand, and there is definitely some precedence in the audit
> code for using likely()/unlikely() in a manner similar as you
> described, but I'll refer to my previous comments - it's not something
> I like.  As a general rule, aside from the unlikely() calls in the
> audit wrappers present in include/linux/audit.h I would suggest not
> adding any new likely()/unlikely() calls.
>
>> Let me see if I can rewrite the conditional in a different way to get a
>> similar effect but I suspect that might be even more compiler dependent.
>
> I am okay with ordering conditionals to make the common case the
> short-circuit case.

So I played around with a bunch of different combinations of the
conditionals but nothing really improved the code all that much.

Just sent out v2 dropping the unlikely() clause.


Thanks
Ankur

>
>> Also, let me run the audit-testsuite this time. Is there a good test
>> there that you would recommend that might serve as a more representative
>> workload?
>
> Probably not.  The audit-testsuite is intended to be a quick, easy to
> run regression test that can be used by developers to help reduce
> audit breakage.  It is not representative of any particular workload,
> and is definitely not comprehensive (it is woefully lacking in several
> areas unfortunately).


--
ankur

--
Linux-audit mailing list
Linux-audit@redhat.com
https://listman.redhat.com/mailman/listinfo/linux-audit


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2] audit: unify audit_filter_{uring(),inode_name(),syscall()}
  2022-10-07  0:49     ` [PATCH v2] audit: unify audit_filter_{uring(), inode_name(), syscall()} Ankur Arora
@ 2022-10-13 23:11       ` Paul Moore
  -1 siblings, 0 replies; 40+ messages in thread
From: Paul Moore @ 2022-10-13 23:11 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-audit, eparis, sgrubb, linux-kernel, boris.ostrovsky, konrad.wilk

On Thu, Oct 6, 2022 at 8:49 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> audit_filter_uring(), audit_filter_inode_name() are substantially
> similar to audit_filter_syscall(). Move the core logic to
> __audit_filter_op() which can be parametrized for all three.
>
> On a Skylakex system, getpid() latency (all results aggregated
> across 12 boot cycles):
>
>          Min     Mean    Median   Max      pstdev
>          (ns)    (ns)    (ns)     (ns)
>
>  -    196.63   207.86  206.60  230.98      (+- 3.92%)
>  +    183.73   196.95  192.31  232.49      (+- 6.04%)
>
> Performance counter stats for 'bin/getpid' (3 runs) go from:
>     cycles               805.58  (  +-  4.11% )
>     instructions        1654.11  (  +-   .05% )
>     IPC                    2.06  (  +-  3.39% )
>     branches             430.02  (  +-   .05% )
>     branch-misses          1.55  (  +-  7.09% )
>     L1-dcache-loads      440.01  (  +-   .09% )
>     L1-dcache-load-misses  9.05  (  +- 74.03% )
> to:
>     cycles               765.37  (  +-  6.66% )
>     instructions        1677.07  (  +-  0.04% )
>     IPC                    2.20  (  +-  5.90% )
>     branches             431.10  (  +-  0.04% )
>     branch-misses          1.60  (  +- 11.25% )
>     L1-dcache-loads      521.04  (  +-  0.05% )
>     L1-dcache-load-misses  6.92  (  +- 77.60% )
>
> (Both aggregated over 12 boot cycles.)
>
> The increased L1-dcache-loads are due to some intermediate values now
> coming from the stack.
>
> The improvement in cycles is due to a slightly denser loop (the list
> parameter in the list_for_each_entry_rcu() exit check now comes from
> a register rather than a constant as before.)
>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  kernel/auditsc.c | 76 +++++++++++++++++++++++++-----------------------
>  1 file changed, 39 insertions(+), 37 deletions(-)

Thanks, this looks good to me.  I'll queue this up for when the merge
window closes.

-- 
paul-moore.com

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2] audit: unify audit_filter_{uring(), inode_name(), syscall()}
@ 2022-10-13 23:11       ` Paul Moore
  0 siblings, 0 replies; 40+ messages in thread
From: Paul Moore @ 2022-10-13 23:11 UTC (permalink / raw)
  To: Ankur Arora
  Cc: konrad.wilk, linux-kernel, eparis, linux-audit, boris.ostrovsky

On Thu, Oct 6, 2022 at 8:49 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> audit_filter_uring(), audit_filter_inode_name() are substantially
> similar to audit_filter_syscall(). Move the core logic to
> __audit_filter_op() which can be parametrized for all three.
>
> On a Skylakex system, getpid() latency (all results aggregated
> across 12 boot cycles):
>
>          Min     Mean    Median   Max      pstdev
>          (ns)    (ns)    (ns)     (ns)
>
>  -    196.63   207.86  206.60  230.98      (+- 3.92%)
>  +    183.73   196.95  192.31  232.49      (+- 6.04%)
>
> Performance counter stats for 'bin/getpid' (3 runs) go from:
>     cycles               805.58  (  +-  4.11% )
>     instructions        1654.11  (  +-   .05% )
>     IPC                    2.06  (  +-  3.39% )
>     branches             430.02  (  +-   .05% )
>     branch-misses          1.55  (  +-  7.09% )
>     L1-dcache-loads      440.01  (  +-   .09% )
>     L1-dcache-load-misses  9.05  (  +- 74.03% )
> to:
>     cycles               765.37  (  +-  6.66% )
>     instructions        1677.07  (  +-  0.04% )
>     IPC                    2.20  (  +-  5.90% )
>     branches             431.10  (  +-  0.04% )
>     branch-misses          1.60  (  +- 11.25% )
>     L1-dcache-loads      521.04  (  +-  0.05% )
>     L1-dcache-load-misses  6.92  (  +- 77.60% )
>
> (Both aggregated over 12 boot cycles.)
>
> The increased L1-dcache-loads are due to some intermediate values now
> coming from the stack.
>
> The improvement in cycles is due to a slightly denser loop (the list
> parameter in the list_for_each_entry_rcu() exit check now comes from
> a register rather than a constant as before.)
>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  kernel/auditsc.c | 76 +++++++++++++++++++++++++-----------------------
>  1 file changed, 39 insertions(+), 37 deletions(-)

Thanks, this looks good to me.  I'll queue this up for when the merge
window closes.

-- 
paul-moore.com

--
Linux-audit mailing list
Linux-audit@redhat.com
https://listman.redhat.com/mailman/listinfo/linux-audit


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2] audit: unify audit_filter_{uring(),inode_name(),syscall()}
  2022-10-13 23:11       ` [PATCH v2] audit: unify audit_filter_{uring(), inode_name(), syscall()} Paul Moore
@ 2022-10-14 16:53         ` Ankur Arora
  -1 siblings, 0 replies; 40+ messages in thread
From: Ankur Arora @ 2022-10-14 16:53 UTC (permalink / raw)
  To: Paul Moore
  Cc: Ankur Arora, linux-audit, eparis, sgrubb, linux-kernel,
	boris.ostrovsky, konrad.wilk


Paul Moore <paul@paul-moore.com> writes:

> On Thu, Oct 6, 2022 at 8:49 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>> audit_filter_uring(), audit_filter_inode_name() are substantially
>> similar to audit_filter_syscall(). Move the core logic to
>> __audit_filter_op() which can be parametrized for all three.
>>
>> On a Skylakex system, getpid() latency (all results aggregated
>> across 12 boot cycles):
>>
>>          Min     Mean    Median   Max      pstdev
>>          (ns)    (ns)    (ns)     (ns)
>>
>>  -    196.63   207.86  206.60  230.98      (+- 3.92%)
>>  +    183.73   196.95  192.31  232.49      (+- 6.04%)
>>
>> Performance counter stats for 'bin/getpid' (3 runs) go from:
>>     cycles               805.58  (  +-  4.11% )
>>     instructions        1654.11  (  +-   .05% )
>>     IPC                    2.06  (  +-  3.39% )
>>     branches             430.02  (  +-   .05% )
>>     branch-misses          1.55  (  +-  7.09% )
>>     L1-dcache-loads      440.01  (  +-   .09% )
>>     L1-dcache-load-misses  9.05  (  +- 74.03% )
>> to:
>>     cycles               765.37  (  +-  6.66% )
>>     instructions        1677.07  (  +-  0.04% )
>>     IPC                    2.20  (  +-  5.90% )
>>     branches             431.10  (  +-  0.04% )
>>     branch-misses          1.60  (  +- 11.25% )
>>     L1-dcache-loads      521.04  (  +-  0.05% )
>>     L1-dcache-load-misses  6.92  (  +- 77.60% )
>>
>> (Both aggregated over 12 boot cycles.)
>>
>> The increased L1-dcache-loads are due to some intermediate values now
>> coming from the stack.
>>
>> The improvement in cycles is due to a slightly denser loop (the list
>> parameter in the list_for_each_entry_rcu() exit check now comes from
>> a register rather than a constant as before.)
>>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>>  kernel/auditsc.c | 76 +++++++++++++++++++++++++-----------------------
>>  1 file changed, 39 insertions(+), 37 deletions(-)
>
> Thanks, this looks good to me.  I'll queue this up for when the merge
> window closes.

Great. Thanks Paul.

--
ankur

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2] audit: unify audit_filter_{uring(),inode_name(),syscall()}
@ 2022-10-14 16:53         ` Ankur Arora
  0 siblings, 0 replies; 40+ messages in thread
From: Ankur Arora @ 2022-10-14 16:53 UTC (permalink / raw)
  To: Paul Moore
  Cc: konrad.wilk, linux-kernel, eparis, linux-audit, boris.ostrovsky


Paul Moore <paul@paul-moore.com> writes:

> On Thu, Oct 6, 2022 at 8:49 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>> audit_filter_uring(), audit_filter_inode_name() are substantially
>> similar to audit_filter_syscall(). Move the core logic to
>> __audit_filter_op() which can be parametrized for all three.
>>
>> On a Skylakex system, getpid() latency (all results aggregated
>> across 12 boot cycles):
>>
>>          Min     Mean    Median   Max      pstdev
>>          (ns)    (ns)    (ns)     (ns)
>>
>>  -    196.63   207.86  206.60  230.98      (+- 3.92%)
>>  +    183.73   196.95  192.31  232.49      (+- 6.04%)
>>
>> Performance counter stats for 'bin/getpid' (3 runs) go from:
>>     cycles               805.58  (  +-  4.11% )
>>     instructions        1654.11  (  +-   .05% )
>>     IPC                    2.06  (  +-  3.39% )
>>     branches             430.02  (  +-   .05% )
>>     branch-misses          1.55  (  +-  7.09% )
>>     L1-dcache-loads      440.01  (  +-   .09% )
>>     L1-dcache-load-misses  9.05  (  +- 74.03% )
>> to:
>>     cycles               765.37  (  +-  6.66% )
>>     instructions        1677.07  (  +-  0.04% )
>>     IPC                    2.20  (  +-  5.90% )
>>     branches             431.10  (  +-  0.04% )
>>     branch-misses          1.60  (  +- 11.25% )
>>     L1-dcache-loads      521.04  (  +-  0.05% )
>>     L1-dcache-load-misses  6.92  (  +- 77.60% )
>>
>> (Both aggregated over 12 boot cycles.)
>>
>> The increased L1-dcache-loads are due to some intermediate values now
>> coming from the stack.
>>
>> The improvement in cycles is due to a slightly denser loop (the list
>> parameter in the list_for_each_entry_rcu() exit check now comes from
>> a register rather than a constant as before.)
>>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>>  kernel/auditsc.c | 76 +++++++++++++++++++++++++-----------------------
>>  1 file changed, 39 insertions(+), 37 deletions(-)
>
> Thanks, this looks good to me.  I'll queue this up for when the merge
> window closes.

Great. Thanks Paul.

--
ankur

--
Linux-audit mailing list
Linux-audit@redhat.com
https://listman.redhat.com/mailman/listinfo/linux-audit


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/3] audit: cache ctx->major in audit_filter_syscall()
  2022-09-28 22:03     ` Paul Moore
@ 2022-10-17 18:23       ` Paul Moore
  -1 siblings, 0 replies; 40+ messages in thread
From: Paul Moore @ 2022-10-17 18:23 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-audit, eparis, linux-kernel, boris.ostrovsky, konrad.wilk

On Wed, Sep 28, 2022 at 6:03 PM Paul Moore <paul@paul-moore.com> wrote:
> On Tue, Sep 27, 2022 at 6:59 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
> >
> > ctx->major contains the current syscall number. This is, of course, a
> > constant for the duration of the syscall. Unfortunately, GCC's alias
> > analysis cannot prove that it is not modified via a pointer in the
> > audit_filter_syscall() loop, and so always loads it from memory.
> >
> > ...
> >
> > Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> > ---
> >  kernel/auditsc.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
>
> This looks pretty trivial to me, but it's too late in the current -rc
> cycle for this to be merged, I'll queue it up for after the upcoming
> merge window closes.  Thanks.

I just merged this into audit/next, thanks again!

-- 
paul-moore.com

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/3] audit: cache ctx->major in audit_filter_syscall()
@ 2022-10-17 18:23       ` Paul Moore
  0 siblings, 0 replies; 40+ messages in thread
From: Paul Moore @ 2022-10-17 18:23 UTC (permalink / raw)
  To: Ankur Arora
  Cc: boris.ostrovsky, linux-audit, linux-kernel, eparis, konrad.wilk

On Wed, Sep 28, 2022 at 6:03 PM Paul Moore <paul@paul-moore.com> wrote:
> On Tue, Sep 27, 2022 at 6:59 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
> >
> > ctx->major contains the current syscall number. This is, of course, a
> > constant for the duration of the syscall. Unfortunately, GCC's alias
> > analysis cannot prove that it is not modified via a pointer in the
> > audit_filter_syscall() loop, and so always loads it from memory.
> >
> > ...
> >
> > Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> > ---
> >  kernel/auditsc.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
>
> This looks pretty trivial to me, but it's too late in the current -rc
> cycle for this to be merged, I'll queue it up for after the upcoming
> merge window closes.  Thanks.

I just merged this into audit/next, thanks again!

-- 
paul-moore.com

--
Linux-audit mailing list
Linux-audit@redhat.com
https://listman.redhat.com/mailman/listinfo/linux-audit


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2] audit: unify audit_filter_{uring(),inode_name(),syscall()}
  2022-10-13 23:11       ` [PATCH v2] audit: unify audit_filter_{uring(), inode_name(), syscall()} Paul Moore
@ 2022-10-17 18:26         ` Paul Moore
  -1 siblings, 0 replies; 40+ messages in thread
From: Paul Moore @ 2022-10-17 18:26 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-audit, eparis, sgrubb, linux-kernel, boris.ostrovsky, konrad.wilk

On Thu, Oct 13, 2022 at 7:11 PM Paul Moore <paul@paul-moore.com> wrote:
> On Thu, Oct 6, 2022 at 8:49 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
> >
> > audit_filter_uring(), audit_filter_inode_name() are substantially
> > similar to audit_filter_syscall(). Move the core logic to
> > __audit_filter_op() which can be parametrized for all three.
> >
> > On a Skylakex system, getpid() latency (all results aggregated
> > across 12 boot cycles):
> >
> >          Min     Mean    Median   Max      pstdev
> >          (ns)    (ns)    (ns)     (ns)
> >
> >  -    196.63   207.86  206.60  230.98      (+- 3.92%)
> >  +    183.73   196.95  192.31  232.49      (+- 6.04%)
> >
> > Performance counter stats for 'bin/getpid' (3 runs) go from:
> >     cycles               805.58  (  +-  4.11% )
> >     instructions        1654.11  (  +-   .05% )
> >     IPC                    2.06  (  +-  3.39% )
> >     branches             430.02  (  +-   .05% )
> >     branch-misses          1.55  (  +-  7.09% )
> >     L1-dcache-loads      440.01  (  +-   .09% )
> >     L1-dcache-load-misses  9.05  (  +- 74.03% )
> > to:
> >     cycles               765.37  (  +-  6.66% )
> >     instructions        1677.07  (  +-  0.04% )
> >     IPC                    2.20  (  +-  5.90% )
> >     branches             431.10  (  +-  0.04% )
> >     branch-misses          1.60  (  +- 11.25% )
> >     L1-dcache-loads      521.04  (  +-  0.05% )
> >     L1-dcache-load-misses  6.92  (  +- 77.60% )
> >
> > (Both aggregated over 12 boot cycles.)
> >
> > The increased L1-dcache-loads are due to some intermediate values now
> > coming from the stack.
> >
> > The improvement in cycles is due to a slightly denser loop (the list
> > parameter in the list_for_each_entry_rcu() exit check now comes from
> > a register rather than a constant as before.)
> >
> > Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> > ---
> >  kernel/auditsc.c | 76 +++++++++++++++++++++++++-----------------------
> >  1 file changed, 39 insertions(+), 37 deletions(-)
>
> Thanks, this looks good to me.  I'll queue this up for when the merge
> window closes.

This is also merged into audit/next, thanks!

-- 
paul-moore.com

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2] audit: unify audit_filter_{uring(), inode_name(), syscall()}
@ 2022-10-17 18:26         ` Paul Moore
  0 siblings, 0 replies; 40+ messages in thread
From: Paul Moore @ 2022-10-17 18:26 UTC (permalink / raw)
  To: Ankur Arora
  Cc: konrad.wilk, linux-kernel, eparis, linux-audit, boris.ostrovsky

On Thu, Oct 13, 2022 at 7:11 PM Paul Moore <paul@paul-moore.com> wrote:
> On Thu, Oct 6, 2022 at 8:49 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
> >
> > audit_filter_uring(), audit_filter_inode_name() are substantially
> > similar to audit_filter_syscall(). Move the core logic to
> > __audit_filter_op() which can be parametrized for all three.
> >
> > On a Skylakex system, getpid() latency (all results aggregated
> > across 12 boot cycles):
> >
> >          Min     Mean    Median   Max      pstdev
> >          (ns)    (ns)    (ns)     (ns)
> >
> >  -    196.63   207.86  206.60  230.98      (+- 3.92%)
> >  +    183.73   196.95  192.31  232.49      (+- 6.04%)
> >
> > Performance counter stats for 'bin/getpid' (3 runs) go from:
> >     cycles               805.58  (  +-  4.11% )
> >     instructions        1654.11  (  +-   .05% )
> >     IPC                    2.06  (  +-  3.39% )
> >     branches             430.02  (  +-   .05% )
> >     branch-misses          1.55  (  +-  7.09% )
> >     L1-dcache-loads      440.01  (  +-   .09% )
> >     L1-dcache-load-misses  9.05  (  +- 74.03% )
> > to:
> >     cycles               765.37  (  +-  6.66% )
> >     instructions        1677.07  (  +-  0.04% )
> >     IPC                    2.20  (  +-  5.90% )
> >     branches             431.10  (  +-  0.04% )
> >     branch-misses          1.60  (  +- 11.25% )
> >     L1-dcache-loads      521.04  (  +-  0.05% )
> >     L1-dcache-load-misses  6.92  (  +- 77.60% )
> >
> > (Both aggregated over 12 boot cycles.)
> >
> > The increased L1-dcache-loads are due to some intermediate values now
> > coming from the stack.
> >
> > The improvement in cycles is due to a slightly denser loop (the list
> > parameter in the list_for_each_entry_rcu() exit check now comes from
> > a register rather than a constant as before.)
> >
> > Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> > ---
> >  kernel/auditsc.c | 76 +++++++++++++++++++++++++-----------------------
> >  1 file changed, 39 insertions(+), 37 deletions(-)
>
> Thanks, this looks good to me.  I'll queue this up for when the merge
> window closes.

This is also merged into audit/next, thanks!

-- 
paul-moore.com

--
Linux-audit mailing list
Linux-audit@redhat.com
https://listman.redhat.com/mailman/listinfo/linux-audit


^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2022-10-17 18:26 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-27 22:59 [PATCH 0/3] improve audit syscall-exit latency Ankur Arora
2022-09-27 22:59 ` Ankur Arora
2022-09-27 22:59 ` [PATCH 1/3] audit: cache ctx->major in audit_filter_syscall() Ankur Arora
2022-09-27 22:59   ` Ankur Arora
2022-09-28 22:03   ` Paul Moore
2022-09-28 22:03     ` Paul Moore
2022-09-29 20:20     ` Ankur Arora
2022-09-29 20:20       ` Ankur Arora
2022-10-17 18:23     ` Paul Moore
2022-10-17 18:23       ` Paul Moore
2022-09-30 17:45   ` Steve Grubb
2022-09-30 17:45     ` Steve Grubb
2022-09-30 18:22     ` Paul Moore
2022-09-30 18:22       ` Paul Moore
2022-10-07  0:55     ` Ankur Arora
2022-10-07  0:55       ` Ankur Arora
2022-09-27 22:59 ` [PATCH 2/3] audit: annotate branch direction for audit_in_mask() Ankur Arora
2022-09-27 22:59   ` Ankur Arora
2022-09-28 22:26   ` Paul Moore
2022-09-28 22:26     ` Paul Moore
2022-09-29 20:19     ` Ankur Arora
2022-09-29 20:19       ` Ankur Arora
2022-09-30 18:48       ` Paul Moore
2022-09-30 18:48         ` Paul Moore
2022-10-07  0:57         ` Ankur Arora
2022-10-07  0:57           ` Ankur Arora
2022-09-27 22:59 ` [PATCH 3/3] audit: unify audit_filter_{uring(),inode_name(),syscall()} Ankur Arora
2022-09-27 22:59   ` [PATCH 3/3] audit: unify audit_filter_{uring(), inode_name(), syscall()} Ankur Arora
2022-09-28 22:54   ` [PATCH 3/3] audit: unify audit_filter_{uring(),inode_name(),syscall()} Paul Moore
2022-09-28 22:54     ` [PATCH 3/3] audit: unify audit_filter_{uring(), inode_name(), syscall()} Paul Moore
2022-09-29 20:23     ` [PATCH 3/3] audit: unify audit_filter_{uring(),inode_name(),syscall()} Ankur Arora
2022-09-29 20:23       ` Ankur Arora
2022-10-07  0:49   ` [PATCH v2] " Ankur Arora
2022-10-07  0:49     ` [PATCH v2] audit: unify audit_filter_{uring(), inode_name(), syscall()} Ankur Arora
2022-10-13 23:11     ` [PATCH v2] audit: unify audit_filter_{uring(),inode_name(),syscall()} Paul Moore
2022-10-13 23:11       ` [PATCH v2] audit: unify audit_filter_{uring(), inode_name(), syscall()} Paul Moore
2022-10-14 16:53       ` [PATCH v2] audit: unify audit_filter_{uring(),inode_name(),syscall()} Ankur Arora
2022-10-14 16:53         ` Ankur Arora
2022-10-17 18:26       ` Paul Moore
2022-10-17 18:26         ` [PATCH v2] audit: unify audit_filter_{uring(), inode_name(), syscall()} Paul Moore

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.