netdev.vger.kernel.org archive mirror
* Severe performance regression in "net: macsec: preserve ingress frame ordering"
@ 2020-08-06 21:11 Ryan Cox
  2020-08-07  3:48 ` Scott Dial
  0 siblings, 1 reply; 12+ messages in thread
From: Ryan Cox @ 2020-08-06 21:11 UTC (permalink / raw)
  To: netdev, davem, sd, scott; +Cc: Antoine Tenart

Hello,

I have found two performance issues with MACsec on 10 Gb/s links (tested 
on Intel and Broadcom NICs):
1)  MACsec with encryption is much faster than MACsec without encryption 
(9.8 vs 7.4 Gb/s) until 5.7, where both have poor performance
2)  5.7 introduced a severe performance impact for MACsec with and 
without encryption at commit ab046a5d4be4c90a3952a0eae75617b49c0cb01b

I haven't been able to look at issue #1 yet (and I don't know where to 
start) since I got sidetracked looking at issue #2.

This email is about issue #2, which results in the following in my test 
setup:
* MACsec with encryption drops from 9.81 Gb/s to 1.00 Gb/s or sometimes 
worse
* MACsec without encryption drops from 7.40 Gb/s to 1.80 Gb/s

I have tested a number of configurations.  These tests were performed on 
the following hardware:
* dual Intel Xeon E5-2680 v4 @ 2.40GHz, 14 cores each
* Intel 82599ES 10 GbE NIC
* ixgbe driver, version 5.1.0-k

I also tested on the following hardware in a more limited fashion, but 
the results were consistent:
* dual Intel Xeon E5-2670 v3 @ 2.30GHz, 12 cores each
* Broadcom BCM57810 10 GbE NIC
* bnx2x driver, 1.713.36-0

Only one 10 Gb/s link was populated (i.e. no port-channel).  The MTU for 
the network is 9000, with a resulting MACsec MTU of 8968.  All tests 
were performed with only one switch in between the servers.

I tested three scenarios.  The tests were run on servers that are booted 
with NFS root from an identical image.  The only difference was the 
kernel.  A script was run to create the three scenarios and run the 
benchmarks, so the setups are identical across tests.

The scenarios all involved iperf3 tests of these conditions:
1) no MACsec
2) MACsec without encryption
3) MACsec with encryption

The MACsec setup was done as follows:
ip link add link em1 ms1 type macsec sci 1234 encrypt on  # "encrypt on" omitted for the unencrypted tests
ip macsec add ms1 tx sa 0 pn 1234 on key 01 $(printf %032d 1234)
ip macsec add ms1 rx sci 1234
ip macsec add ms1 rx sci 1234 sa 0 pn 1234 on key 01 $(printf %032d 1234)

That results in `ip macsec show` like this:
6: ms1: protect on validate strict sc off sa off encrypt on send_sci on 
end_station off scb off replay off
     cipher suite: GCM-AES-128, using ICV length 16
     TXSC: 0000000000001234 on SA 0
         0: PN 599345, state on, key 01000000000000000000000000000000
     RXSC: 0000000000001234, state on
         0: PN 5076769, state on, key 01000000000000000000000000000000

I tested a number of kernels (all 64 bit) including:
* 4.18.0-193.13.2.el8_2 (RHEL 8)
* 5.6.7-1.el8.elrepo (ELRepo)
* 5.7.11-1.el8.elrepo (ELRepo)
* 5.7 at tag v5.7.11 (I compiled)
* 5.7 at tag v5.7.11 with ab046a5d4be4c90a3952a0eae75617b49c0cb01b 
reverted (I compiled)

I did test 4.18 <-> 5.7 (bi-directional) and both directions resulted in 
poor performance.  Other than that, each test was between two servers of 
the same kernel version.

CONFIG_CRYPTO_AES_NI_INTEL=y is set in all kernels.

4.18 and 5.6 kernels both have very similar performance characteristics:
* 9.89 Gb/s with no macsec at all
* 7.40 Gb/s with macsec WITHOUT encryption  <--- not sure why, but 
turning OFF encryption slowed things down
* 9.81 Gb/s with macsec WITH encryption

With 5.7 I get:
* 9.90 Gb/s with no macsec at all
* 1.80 Gb/s with macsec WITHOUT encryption
* 1.00 Gb/s (sometimes, but often less) with macsec WITH encryption

With 5.7 but with ab046a5d4be4c90a3952a0eae75617b49c0cb01b reverted, I get:
* 9.90 Gb/s with no macsec at all
* 7.33 Gb/s with macsec WITHOUT encryption
* 9.83 Gb/s with macsec WITH encryption

On tests where performance is bad (including macsec without encryption), 
iperf3 is at 100% CPU usage.  I was able to run iperf3 under `perf 
record` in a number of the tests but, unfortunately, I have had trouble 
compiling perf for my own 5.7 compilations (definitely PEBKAC).  If it 
would be useful I can work on fixing the perf compilation issues.

For 5.7.11-1.el8.elrepo (which has the issue) I get the following top 10 
items in `perf report`:
* MACsec without encryption - iperf3 instance running as server 
(receives data)
     29.92%  iperf3   [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
      6.48%  iperf3   [kernel.kallsyms]  [k] do_syscall_64
      2.92%  iperf3   [kernel.kallsyms]  [k] syscall_return_via_sysret
      2.37%  iperf3   [kernel.kallsyms]  [k] entry_SYSCALL_64
      2.32%  iperf3   [kernel.kallsyms]  [k] __skb_datagram_iter
      2.26%  iperf3   [kernel.kallsyms]  [k] __free_pages_ok
      2.09%  iperf3   [kernel.kallsyms]  [k] tcp_poll
      1.75%  iperf3   [kernel.kallsyms]  [k] do_select
      1.48%  iperf3   [kernel.kallsyms]  [k] free_one_page
      1.44%  iperf3   [kernel.kallsyms]  [k] kmem_cache_free

* MACsec without encryption - iperf3 instance running as client (sends data)
     83.63%  iperf3   [kernel.kallsyms]  [k] gf128mul_4k_lle
      3.46%  iperf3   [kernel.kallsyms]  [k] ghash_update
      1.48%  iperf3   [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
      1.18%  iperf3   [kernel.kallsyms]  [k] memcpy_erms
      1.17%  iperf3   [kernel.kallsyms]  [k] do_csum
      0.50%  iperf3   [kernel.kallsyms]  [k] _raw_spin_lock
      0.44%  iperf3   [kernel.kallsyms]  [k] __copy_skb_header
      0.36%  iperf3   [kernel.kallsyms]  [k] get_page_from_freelist
      0.23%  iperf3   [kernel.kallsyms]  [k] ixgbe_xmit_frame_ring
      0.22%  iperf3   [kernel.kallsyms]  [k] skb_segment

* MACsec with encryption - iperf3 instance running as server (receives data)
     15.66%  iperf3   [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
      9.52%  iperf3   [kernel.kallsyms]  [k] do_syscall_64
      3.76%  iperf3   [kernel.kallsyms]  [k] syscall_return_via_sysret
      3.28%  iperf3   [kernel.kallsyms]  [k] entry_SYSCALL_64
      3.22%  iperf3   [kernel.kallsyms]  [k] do_select
      2.71%  iperf3   [kernel.kallsyms]  [k] tcp_poll
      1.84%  iperf3   [kernel.kallsyms]  [k] tcp_recvmsg
      1.59%  iperf3   [kernel.kallsyms]  [k] sock_poll
      1.38%  iperf3   [kernel.kallsyms]  [k] __skb_datagram_iter
      1.37%  iperf3   [kernel.kallsyms]  [k] __free_pages_ok

* MACsec with encryption - iperf3 instance running as client (sends data)
     43.95%  iperf3   [kernel.kallsyms]  [k] gf128mul_4k_lle
     17.48%  iperf3   [kernel.kallsyms]  [k] _aesni_enc1
      9.42%  iperf3   [kernel.kallsyms]  [k] kernel_fpu_begin
      7.75%  iperf3   [kernel.kallsyms]  [k] __crypto_xor
      3.18%  iperf3   [kernel.kallsyms]  [k] crypto_ctr_crypt
      2.67%  iperf3   [kernel.kallsyms]  [k] crypto_inc
      2.30%  iperf3   [kernel.kallsyms]  [k] aesni_encrypt
      2.05%  iperf3   [kernel.kallsyms]  [k] aesni_enc
      1.87%  iperf3   [kernel.kallsyms]  [k] ghash_update
      1.03%  iperf3   [kernel.kallsyms]  [k] kernel_fpu_end

Here is `ethtool -k em1` in case that is helpful:
Features for em1:
rx-checksumming: on
tx-checksumming: on
         tx-checksum-ipv4: off [fixed]
         tx-checksum-ip-generic: on
         tx-checksum-ipv6: off [fixed]
         tx-checksum-fcoe-crc: on [fixed]
         tx-checksum-sctp: on
scatter-gather: on
         tx-scatter-gather: on
         tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
         tx-tcp-segmentation: on
         tx-tcp-ecn-segmentation: off [fixed]
         tx-tcp-mangleid-segmentation: off
         tx-tcp6-segmentation: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: on
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: on [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: on
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on
tx-tunnel-remcsum-segmentation: off [fixed]
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: on
tx-udp-segmentation: on
tx-gso-list: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off
hw-tc-offload: off
esp-hw-offload: on
esp-tx-csum-hw-offload: on
rx-udp_tunnel-port-offload: on
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]
rx-gro-list: off
macsec-hw-offload: off [fixed]

I have lots of logs that I can provide if needed.

I thank Antoine Tenart for suggesting tests for this issue and for 
narrowing down which commits to check.

Thanks,
Ryan

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Severe performance regression in "net: macsec: preserve ingress frame ordering"
  2020-08-06 21:11 Severe performance regression in "net: macsec: preserve ingress frame ordering" Ryan Cox
@ 2020-08-07  3:48 ` Scott Dial
  2020-08-07 23:21   ` Ryan Cox
  2020-08-10 13:34   ` Sabrina Dubroca
  0 siblings, 2 replies; 12+ messages in thread
From: Scott Dial @ 2020-08-07  3:48 UTC (permalink / raw)
  To: Ryan Cox, netdev, davem, sd; +Cc: Antoine Tenart

On 8/6/2020 5:11 PM, Ryan Cox wrote:
> With 5.7 I get:
> * 9.90 Gb/s with no macsec at all
> * 1.80 Gb/s with macsec WITHOUT encryption
> * 1.00 Gb/s (sometimes, but often less) with macsec WITH encryption
> 
> With 5.7 but with ab046a5d4be4c90a3952a0eae75617b49c0cb01b reverted, I get:
> * 9.90 Gb/s with no macsec at all
> * 7.33 Gb/s with macsec WITHOUT encryption
> * 9.83 Gb/s with macsec WITH encryption
> 
> On tests where performance is bad (including macsec without encryption),
> iperf3 is at 100% CPU usage.  I was able to run it under `perf record`on
> iperf3 in a number of the tests but, unfortunately, I have had trouble
> compiling perf for my own 5.7 compilations (definitely PEBKAC).  If it
> would be useful I can work on fixing the perf compilation issues.

For certain, you are measuring the difference between AES-NI doing
gcm(aes) and gcm_base(ctr(aes-aesni),ghash-generic). Specifically, the
hotspot is ghash-generic's implementation of the ghash_update() function.
I appreciate your testing because I was limited in my ability to test
beyond 1Gb/s.

The aes-aesni driver is smart enough to use the FPU if it's not busy and
fall back to the CPU otherwise. Unfortunately, the ghash-clmulni driver
does not have that kind of logic in it and only provides an async version,
so we are forced to use the ghash-generic implementation, which is a pure
CPU implementation. The ideal would be for aesni_intel to provide a
synchronous version of gcm(aes) that falls back to the CPU if the FPU is
busy.
I don't know if the crypto maintainers would be open to such a change, but
if the choice was between reverting and patching the crypto code, then I
would work on patching the crypto code.

In any case, you didn't report how many packets arrived out of order, which
was the issue being addressed by my change. It would be helpful to get
the output of "ip -s macsec show" and specifically the InPktsDelayed
counter. Did iperf3 report out-of-order packets with the patch reverted?
Otherwise, if this is the only process running on your test servers,
then you may not be generating any contention for the FPU, which is the
source of the out-of-order issue. Maybe you could run prime95 to busy
the FPU to see the issue that I was seeing.

I have a product that is a secure router with a half-dozen MACsec
interfaces, boots from a LUKS-encrypted disk, and has a number of TLS
control and status interfaces for local devices attached to the product.
Without this patch, the system was completely unusable due to the
out-of-order issue causing TCP retries and UDP out-of-order issues. I
have not seen any examples of this MACsec driver in the wild, so I
assumed nobody had noticed the out-of-order issue because of synthetic
testing.
-- 
Scott Dial
scott@scottdial.com


* Re: Severe performance regression in "net: macsec: preserve ingress frame ordering"
  2020-08-07  3:48 ` Scott Dial
@ 2020-08-07 23:21   ` Ryan Cox
  2020-08-10 13:34   ` Sabrina Dubroca
  1 sibling, 0 replies; 12+ messages in thread
From: Ryan Cox @ 2020-08-07 23:21 UTC (permalink / raw)
  To: Scott Dial; +Cc: Antoine Tenart, netdev, davem, sd

On 8/6/20 9:48 PM, Scott Dial wrote:
> The aes-aesni driver is smart enough to use the FPU if it's not busy and
> fallback to the CPU otherwise. Unfortunately, the ghash-clmulni driver
> does not have that kind of logic in it and only provides an async version,
> so we are forced to use the ghash-generic implementation, which is a pure
> CPU implementation. The ideal would be for aesni_intel to provide a
> synchronous version of gcm(aes) that fell back to the CPU if the FPU is
> busy.

I don't know how the AES-NI support works, but I did see your specific 
mention of aesni_intel and figured I should mention that this also 
affects AMD. I just got access to AMD nodes (2 x EPYC 7302) with a 
Mellanox 10 GbE NIC.  I ran the same test and it showed a similar 
performance pattern.  I doubt this means much, but I figured I should 
mention it.

> I don't know if the crypto maintainers would be open to such a change, but
> if the choice was between reverting and patching the crypto code, then I
> would work on patching the crypto code.

I can't opine on anything crypto-related since it is well outside my 
area of expertise, though it is helpful to hear what is going on.

> In any case, you didn't report how many packets arrived out of order, which
> was the issue being addressed by my change. It would be helpful to get
> the output of "ip -s macsec show" and specifically the InPktsDelayed
> counter. Did iperf3 report out-of-order packets with the patch reverted?
> Otherwise, if this is the only process running on your test servers,
> then you may not be generating any contention for the FPU, which is the
> source of the out-of-order issue. Maybe you could run prime95 to busy
> the FPU to see the issue that I was seeing.

I ran some tests again on the same servers as before with the Intel 
NICs.  I tested with prime95 running on 27 of the 28 cores in *each* 
server simultaneously (allowing iperf3 to use a core on each) throughout 
the entire test.  This was using 5.7.11 with 
ab046a5d4be4c90a3952a0eae75617b49c0cb01b reverted, so pre-5.7 performance.

MACsec interfaces are deleted and recreated before each test, so 
counters are always fresh.

== MACSEC WITHOUT ENCRYPTION ==

* Server1:
18: ms1: protect on validate strict sc off sa off encrypt off send_sci on end_station off scb off replay off
     cipher suite: GCM-AES-128, using ICV length 16
     TXSC: 0000000000001234 on SA 0
     stats: OutPktsUntagged 0, InPktsUntagged 0, OutPktsTooLong 0,
            InPktsNoTag 1123, InPktsBadTag 0, InPktsUnknownSCI 0,
            InPktsNoSCI 1, InPktsOverrun 0
     stats: OutPktsProtected 3798421, OutPktsEncrypted 0,
            OutOctetsProtected 30889802591, OutOctetsEncrypted 0
         0: PN 3799655, state on, key 01000000000000000000000000000000
     stats: OutPktsProtected 3798421, OutPktsEncrypted 0
     RXSC: 0000000000001234, state on
     stats: InOctetsValidated 30042694872, InOctetsDecrypted 0,
            InPktsUnchecked 0, InPktsDelayed 218, InPktsOK 3675170,
            InPktsInvalid 0, InPktsLate 0, InPktsNotValid 0,
            InPktsNotUsingSA 0, InPktsUnusedSA 0
         0: PN 3676633, state on, key 01000000000000000000000000000000
     stats: InPktsOK 3675170, InPktsInvalid 0, InPktsNotValid 0,
            InPktsNotUsingSA 0, InPktsUnusedSA 0

* Server2:
18: ms1: protect on validate strict sc off sa off encrypt off send_sci on end_station off scb off replay off
     cipher suite: GCM-AES-128, using ICV length 16
     TXSC: 0000000000001234 on SA 0
     stats: OutPktsUntagged 0, InPktsUntagged 0, OutPktsTooLong 0,
            InPktsNoTag 1227, InPktsBadTag 0, InPktsUnknownSCI 0,
            InPktsNoSCI 1, InPktsOverrun 0
     stats: OutPktsProtected 3675399, OutPktsEncrypted 0,
            OutOctetsProtected 30042696158, OutOctetsEncrypted 0
         0: PN 3676633, state on, key 01000000000000000000000000000000
     stats: OutPktsProtected 3675399, OutPktsEncrypted 0
     RXSC: 0000000000001234, state on
     stats: InOctetsValidated 30889801305, InOctetsDecrypted 0,
            InPktsUnchecked 0, InPktsDelayed 0, InPktsOK 3798410,
            InPktsInvalid 0, InPktsLate 0, InPktsNotValid 0,
            InPktsNotUsingSA 0, InPktsUnusedSA 0
         0: PN 3799655, state on, key 01000000000000000000000000000000
     stats: InPktsOK 3798410, InPktsInvalid 0, InPktsNotValid 0,
            InPktsNotUsingSA 0, InPktsUnusedSA 0


InPktsDelayed was 218 for Server1 and 0 for Server2.

== MACSEC WITH ENCRYPTION ==

I got the following *with* encryption:
* Server1:
19: ms1: protect on validate strict sc off sa off encrypt on send_sci on end_station off scb off replay off
     cipher suite: GCM-AES-128, using ICV length 16
     TXSC: 0000000000001234 on SA 0
     stats: OutPktsUntagged 0, InPktsUntagged 0, OutPktsTooLong 0,
            InPktsNoTag 1397, InPktsBadTag 0, InPktsUnknownSCI 0,
            InPktsNoSCI 0, InPktsOverrun 0
     stats: OutPktsProtected 0, OutPktsEncrypted 5560714,
            OutOctetsProtected 0, OutOctetsEncrypted 46931594623
         0: PN 5561948, state on, key 01000000000000000000000000000000
     stats: OutPktsProtected 0, OutPktsEncrypted 5560714
     RXSC: 0000000000001234, state on
     stats: InOctetsValidated 0, InOctetsDecrypted 45977049585,
            InPktsUnchecked 0, InPktsDelayed 3771, InPktsOK 5417843,
            InPktsInvalid 0, InPktsLate 0, InPktsNotValid 0,
            InPktsNotUsingSA 0, InPktsUnusedSA 0
         0: PN 5422860, state on, key 01000000000000000000000000000000
     stats: InPktsOK 5417843, InPktsInvalid 0, InPktsNotValid 0,
            InPktsNotUsingSA 0, InPktsUnusedSA 0

* Server2:
19: ms1: protect on validate strict sc off sa off encrypt on send_sci on end_station off scb off replay off
     cipher suite: GCM-AES-128, using ICV length 16
     TXSC: 0000000000001234 on SA 0
     stats: OutPktsUntagged 0, InPktsUntagged 0, OutPktsTooLong 0,
            InPktsNoTag 1490, InPktsBadTag 0, InPktsUnknownSCI 0,
            InPktsNoSCI 0, InPktsOverrun 0
     stats: OutPktsProtected 0, OutPktsEncrypted 5421626,
            OutOctetsProtected 0, OutOctetsEncrypted 45977059885
         0: PN 5422860, state on, key 01000000000000000000000000000000
     stats: OutPktsProtected 0, OutPktsEncrypted 5421626
     RXSC: 0000000000001234, state on
     stats: InOctetsValidated 0, InOctetsDecrypted 46931106683,
            InPktsUnchecked 0, InPktsDelayed 109, InPktsOK 5560541,
            InPktsInvalid 0, InPktsLate 0, InPktsNotValid 0,
            InPktsNotUsingSA 0, InPktsUnusedSA 0
         0: PN 5561948, state on, key 01000000000000000000000000000000
     stats: InPktsOK 5560541, InPktsInvalid 0, InPktsNotValid 0,
            InPktsNotUsingSA 0, InPktsUnusedSA 0


InPktsDelayed was 3771 for Server1 and 109 for Server2.


The performance numbers were:
* 9.87 Gb/s without macsec
* 6.00 Gb/s with macsec WITHOUT encryption
* 9.19 Gb/s with macsec WITH encryption

iperf3 retransmits were:
* 27 without macsec
* 1211 with macsec WITHOUT encryption
* 721 with macsec WITH encryption


Thanks for the reply and for the background on this.

Ryan


* Re: Severe performance regression in "net: macsec: preserve ingress frame ordering"
  2020-08-07  3:48 ` Scott Dial
  2020-08-07 23:21   ` Ryan Cox
@ 2020-08-10 13:34   ` Sabrina Dubroca
  2020-08-10 16:09     ` Scott Dial
  1 sibling, 1 reply; 12+ messages in thread
From: Sabrina Dubroca @ 2020-08-10 13:34 UTC (permalink / raw)
  To: Scott Dial; +Cc: linux-crypto, Ryan Cox, netdev, davem, Antoine Tenart

[adding the linux-crypto list]

2020-08-06, 23:48:16 -0400, Scott Dial wrote:
> On 8/6/2020 5:11 PM, Ryan Cox wrote:
> > With 5.7 I get:
> > * 9.90 Gb/s with no macsec at all
> > * 1.80 Gb/s with macsec WITHOUT encryption
> > * 1.00 Gb/s (sometimes, but often less) with macsec WITH encryption
> > 
> > With 5.7 but with ab046a5d4be4c90a3952a0eae75617b49c0cb01b reverted, I get:
> > * 9.90 Gb/s with no macsec at all
> > * 7.33 Gb/s with macsec WITHOUT encryption
> > * 9.83 Gb/s with macsec WITH encryption
> > 
> > On tests where performance is bad (including macsec without encryption),
> > iperf3 is at 100% CPU usage.  I was able to run it under `perf record`on
> > iperf3 in a number of the tests but, unfortunately, I have had trouble
> > compiling perf for my own 5.7 compilations (definitely PEBKAC).  If it
> > would be useful I can work on fixing the perf compilation issues.
> 
> For certain, you are measuring the difference between AES-NI doing
> gcm(aes) and gcm_base(ctr(aes-aesni),ghash-generic). Specifically, the
> hotspot is ghash-generic's implementation of ghash_update() function.
> I appreciate your testing because I was limited in my ability to test
> beyond 1Gb/s.
> 
> The aes-aesni driver is smart enough to use the FPU if it's not busy and
> fallback to the CPU otherwise. Unfortunately, the ghash-clmulni driver
> does not have that kind of logic in it and only provides an async version,
> so we are forced to use the ghash-generic implementation, which is a pure
> CPU implementation. The ideal would be for aesni_intel to provide a
> synchronous version of gcm(aes) that fell back to the CPU if the FPU is
> busy.
> I don't know if the crypto maintainers would be open to such a change, but
> if the choice was between reverting and patching the crypto code, then I
> would work on patching the crypto code.

To the crypto folks, a bit of context: Scott wrote commit ab046a5d4be4
("net: macsec: preserve ingress frame ordering"), which made MACsec
use gcm(aes) with CRYPTO_ALG_ASYNC. This prevents out of order
decryption, but reduces performance. We'd like to restore performance
on systems where the FPU is available without breaking MACsec for
systems where the FPU is often busy.

A quick and dirty alternative might be to let the administrator decide
if they're OK with some reordering. Maybe they know that their FPU
will be mostly idle, so it won't even be an issue (or maybe the
opposite, i.e. keep the fast default and let admins fix their setups
with an extra flag).

> In any case, you didn't report how many packets arrived out of order, which
> was the issue being addressed by my change. It would be helpful to get
> the output of "ip -s macsec show" and specifically the InPktsDelayed
> counter. Did iperf3 report out-of-order packets with the patch reverted?
> Otherwise, if this is the only process running on your test servers,
> then you may not be generating any contention for the FPU, which is the
> source of the out-of-order issue. Maybe you could run prime95 to busy
> the FPU to see the issue that I was seeing.

But that's not necessarily a realistic workload for all machines.

> I have a product that is a secure router with a half-dozen MACsec
> interfaces, boots from a LUKS-encrypted disk, and has a number of TLS
> control and status interfaces for local devices attached to product.
> Without this patch, the system was completely unusable due to the
> out-of-order issue causing TCP retries and UDP out-of-order issues. I
> have not seen any examples of this MACsec driver in the wild, so I
> assumed nobody had noticed the out-of-order issue because of synthetic
> testing.

We have customers using MACsec, and I haven't heard of reports like
yours.

-- 
Sabrina



* Re: Severe performance regression in "net: macsec: preserve ingress frame ordering"
  2020-08-10 13:34   ` Sabrina Dubroca
@ 2020-08-10 16:09     ` Scott Dial
  2020-08-12 10:04       ` Sabrina Dubroca
  0 siblings, 1 reply; 12+ messages in thread
From: Scott Dial @ 2020-08-10 16:09 UTC (permalink / raw)
  To: Sabrina Dubroca
  Cc: linux-crypto, Ryan Cox, netdev, davem, Antoine Tenart, ebiggers

On 8/10/2020 9:34 AM, Sabrina Dubroca wrote:
> [adding the linux-crypto list]
> 
> 2020-08-06, 23:48:16 -0400, Scott Dial wrote:
>> On 8/6/2020 5:11 PM, Ryan Cox wrote:
>>> With 5.7 I get:
>>> * 9.90 Gb/s with no macsec at all
>>> * 1.80 Gb/s with macsec WITHOUT encryption
>>> * 1.00 Gb/s (sometimes, but often less) with macsec WITH encryption
>>>
>>> With 5.7 but with ab046a5d4be4c90a3952a0eae75617b49c0cb01b reverted, I get:
>>> * 9.90 Gb/s with no macsec at all
>>> * 7.33 Gb/s with macsec WITHOUT encryption
>>> * 9.83 Gb/s with macsec WITH encryption
>>>
>>> On tests where performance is bad (including macsec without encryption),
>>> iperf3 is at 100% CPU usage.  I was able to run it under `perf record`on
>>> iperf3 in a number of the tests but, unfortunately, I have had trouble
>>> compiling perf for my own 5.7 compilations (definitely PEBKAC).  If it
>>> would be useful I can work on fixing the perf compilation issues.
>>
>> For certain, you are measuring the difference between AES-NI doing
>> gcm(aes) and gcm_base(ctr(aes-aesni),ghash-generic). Specifically, the
>> hotspot is ghash-generic's implementation of ghash_update() function.
>> I appreciate your testing because I was limited in my ability to test
>> beyond 1Gb/s.
>>
>> The aes-aesni driver is smart enough to use the FPU if it's not busy and
>> fallback to the CPU otherwise. Unfortunately, the ghash-clmulni driver
>> does not have that kind of logic in it and only provides an async version,
>> so we are forced to use the ghash-generic implementation, which is a pure
>> CPU implementation. The ideal would be for aesni_intel to provide a
>> synchronous version of gcm(aes) that fell back to the CPU if the FPU is
>> busy.
>> I don't know if the crypto maintainers would be open to such a change, but
>> if the choice was between reverting and patching the crypto code, then I
>> would work on patching the crypto code.
> 
> To the crypto folks, a bit of context: Scott wrote commit ab046a5d4be4
> ("net: macsec: preserve ingress frame ordering"), which made MACsec
> use gcm(aes) with CRYPTO_ALG_ASYNC. This prevents out of order
> decryption, but reduces performance. We'd like to restore performance
> on systems where the FPU is available without breaking MACsec for
> systems where the FPU is often busy.
> 
> A quick and dirty alternative might be to let the administrator decide
> if they're ok with some out of order. Maybe they know that their FPU
> will be mostly idle so it won't even be an issue (or maybe the
> opposite, ie keep the fast default and let admins fix their setups
> with an extra flag).

I can appreciate favoring performance over correctness as a practical
concern, but I'd suggest that the out-of-order decryption *is* a
performance concern as well. We can debate the realness of my workload, but
even in Ryan's tests on an otherwise idle server, he showed that 0.07% of
the frames needed to be dispatched to cryptd, and that for whatever reason
it happens more often with encryption disabled, which correlates with his
decrease in throughput (9.83 Gb/s to 7.33 Gb/s, and 9.19 Gb/s to 6.00
Gb/s), perhaps causing exponential backoff from TCP retries. I can
resurrect my test setup, but my numbers were worse than Ryan's.

In any case, I counted 18 implementations of HW accelerated gcm(aes) in
the kernel, with 3 of those implementations in arch (x86, arm64, and
s390) and the rest in crypto device drivers. Of all those
implementations, the AES-NI implementation is the only one that
dispatches to cryptd (via code in crypto/simd.c). AFAICT, every other
implementation of gcm(aes) is synchronous, but they would require closer
inspection to be certain. So, I'd like to focus on what we can do to
improve crypto/simd.c to provide a synchronous implementation of
gcm(aes) for AES-NI when possible, which is the vast majority of the time.

I would be interested in proposing a change to improve this issue, but
I'm not sure which direction the maintainers of this code would
prefer. Since these changes to the crypto API are fairly recent, there
may be context that I am not aware of. However, I think it would be
straightforward to add another API to crypto/simd.c that allocates sync
algorithms, and I would be willing to do the work.

The only challenge I see in implementing such a change is deciding how
to select a fallback algorithm. The most flexible solution would be to
call crypto_alloc_aead with CRYPTO_ALG_ASYNC during the init to pick the
"best" fallback (in case there is alternative HW offloading available),
but that would almost certainly pick itself and it's not obvious to me
how to avoid that. On the other hand, the caller of the new API could
explicitly declare a fallback algorithm (e.g.,
"gcm_base(ctr(aes-aesni),ghash-generic)"), which is probably the correct
answer anyway -- what are the chances that there are multiple HW
offloads for gcm(aes)? In that case, a possible API would be:

int simd_register_aeads_compat_sync(struct aead_alg *algs,
                                    char **fallback_algs,
                                    int count,
                                    struct simd_aead_alg **simd_algs);

Beyond MACsec, it's worth noting that the mac80211 code for AES-GCMP and
BIP-GMAC also uses gcm(aes) in sync mode because decryption occurs in a
softirq; however, I imagine nobody has reported an issue because the link
speed is typically slower and those encryption modes are still uncommon.

-- 
Scott Dial
scott@scottdial.com


* Re: Severe performance regression in "net: macsec: preserve ingress frame ordering"
  2020-08-10 16:09     ` Scott Dial
@ 2020-08-12 10:04       ` Sabrina Dubroca
  2020-08-12 10:45         ` Van Leeuwen, Pascal
  0 siblings, 1 reply; 12+ messages in thread
From: Sabrina Dubroca @ 2020-08-12 10:04 UTC (permalink / raw)
  To: Scott Dial
  Cc: linux-crypto, Ryan Cox, netdev, davem, Antoine Tenart, ebiggers

2020-08-10, 12:09:40 -0400, Scott Dial wrote:
> On 8/10/2020 9:34 AM, Sabrina Dubroca wrote:
> > [adding the linux-crypto list]
> > 
> > 2020-08-06, 23:48:16 -0400, Scott Dial wrote:
> >> On 8/6/2020 5:11 PM, Ryan Cox wrote:
> >>> With 5.7 I get:
> >>> * 9.90 Gb/s with no macsec at all
> >>> * 1.80 Gb/s with macsec WITHOUT encryption
> >>> * 1.00 Gb/s (sometimes, but often less) with macsec WITH encryption
> >>>
> >>> With 5.7 but with ab046a5d4be4c90a3952a0eae75617b49c0cb01b reverted, I get:
> >>> * 9.90 Gb/s with no macsec at all
> >>> * 7.33 Gb/s with macsec WITHOUT encryption
> >>> * 9.83 Gb/s with macsec WITH encryption
> >>>
> >>> On tests where performance is bad (including macsec without encryption),
> >>> iperf3 is at 100% CPU usage.  I was able to run it under `perf record` on
> >>> iperf3 in a number of the tests but, unfortunately, I have had trouble
> >>> compiling perf for my own 5.7 compilations (definitely PEBKAC).  If it
> >>> would be useful I can work on fixing the perf compilation issues.
> >>
> >> For certain, you are measuring the difference between AES-NI doing
> >> gcm(aes) and gcm_base(ctr(aes-aesni),ghash-generic). Specifically, the
> >> hotspot is ghash-generic's implementation of ghash_update() function.
> >> I appreciate your testing because I was limited in my ability to test
> >> beyond 1Gb/s.
> >>
> >> The aes-aesni driver is smart enough to use the FPU if it's not busy and
> >> fallback to the CPU otherwise. Unfortunately, the ghash-clmulni driver
> >> does not have that kind of logic in it and only provides an async version,
> >> so we are forced to use the ghash-generic implementation, which is a pure
> >> CPU implementation. The ideal would be for aesni_intel to provide a
> >> synchronous version of gcm(aes) that fell back to the CPU if the FPU is
> >> busy.
> >> I don't know if the crypto maintainers would be open to such a change, but
> >> if the choice was between reverting and patching the crypto code, then I
> >> would work on patching the crypto code.
> > 
> > To the crypto folks, a bit of context: Scott wrote commit ab046a5d4be4
> > ("net: macsec: preserve ingress frame ordering"), which made MACsec
> > use gcm(aes) with CRYPTO_ALG_ASYNC. This prevents out of order
> > decryption, but reduces performance. We'd like to restore performance
> > on systems where the FPU is available without breaking MACsec for
> > systems where the FPU is often busy.
> > 
> > A quick and dirty alternative might be to let the administrator decide
> > if they're ok with some out of order. Maybe they know that their FPU
> > will be mostly idle so it won't even be an issue (or maybe the
> > opposite, ie keep the fast default and let admins fix their setups
> > with an extra flag).
> 
> I can appreciate favoring performance over correctness as a practical
> concern, but I'd suggest that the out-of-order decryption *is* a
> performance concern as well. We can debate realness of my workload, but
> even in Ryan's tests on an otherwise idle server, he showed 0.07% of the
> frames needed to be dispatched to cryptd, and that for whatever reason
> it's more often with encryption disabled, which correlates to his
> decrease in throughput (9.83 Gb/s to 7.33 Gb/s, and 9.19 Gb/s to 6.00
> Gb/s), perhaps causing exponential backoff from TCP retries. I can
> resurrect my test setup, but my numbers were worse than Ryan's.
> 
> In any case, I counted 18 implementations of HW accelerated gcm(aes) in
> the kernel, with 3 of those implementations in arch (x86, arm64, and
> s390) and the rest are crypto device drivers. Of all those
> implementations, the AES-NI implementation is the only one that
> dispatches to cryptd (via code in crypto/simd.c). AFAICT, every other
> implementation of gcm(aes) is synchronous, but they would require closer
> inspection to be certain.

I randomly picked 2 of them (chcr and inside-secure), and they both
set CRYPTO_ALG_ASYNC, so I guess not.

> So, I'd like to focus on what we can do to
> improve crypto/simd.c to provide a synchronous implementation of
> gcm(aes) for AES-NI when possible, which is the vast majority of the time.
>
> I would be interested in proposing a change to improve this issue, but
> I'm not sure the direction that the maintainers of this code would
> prefer. Since these changes to the crypto API are fairly recent, there
> may be context that I am not aware of. However, I think it would be
> straightforward to add another API to crypto/simd.c that allocates sync
> algorithms, and I would be willing to do the work.
> 
> The only challenge I see in implementing such a change is deciding how
> to select a fallback algorithm. The most flexible solution would be to
> call crypto_alloc_aead with CRYPTO_ALG_ASYNC during the init to pick the
> "best" fallback (in case there is alternative HW offloading available),
> but that would almost certainly pick itself and it's not obvious to me
> how to avoid that.

It's probably possible to add a PURE_SOFTWARE or whatever flag and
request one of those algorithms for the fallback.

> On the other hand, the caller to the new API could
> explicitly declare a fallback algorithm (e.g.,
> "gcm_base(ctr(aes-aesni),ghash-generic)"), which probably is the correct
> answer anyways --

I would try to avoid that, it seems too error-prone to me.

> what are the chances that there are multiple HW
> offloads for gcm(aes)? In that case, a possible API would be:
> int simd_register_aeads_compat_sync(struct aead_alg *algs,
>                                     char **fallback_algs,
>                                     int count,
> 			            struct simd_aead_alg **simd_algs);
> 
> Beyond MACsec, it's worth noting that the mac80211 code for AES-GCMP and
> BIP-GMAC also uses gcm(aes) in sync mode because decryption occurs in a
> softirq; however, I imagine nobody has reported an issue because the link
> speed is typically slower and those encryption modes are still uncommon.

Decent wireless cards would do the encryption in hw, no? Also, you
> can't notice a performance regression if it never used the fast
implementation :)

-- 
Sabrina



* RE: Severe performance regression in "net: macsec: preserve ingress frame ordering"
  2020-08-12 10:04       ` Sabrina Dubroca
@ 2020-08-12 10:45         ` Van Leeuwen, Pascal
  2020-08-12 12:42           ` Andrew Lunn
  0 siblings, 1 reply; 12+ messages in thread
From: Van Leeuwen, Pascal @ 2020-08-12 10:45 UTC (permalink / raw)
  To: Sabrina Dubroca, Scott Dial
  Cc: linux-crypto, Ryan Cox, netdev, davem, Antoine Tenart, ebiggers

> -----Original Message-----
> From: linux-crypto-owner@vger.kernel.org <linux-crypto-owner@vger.kernel.org> On Behalf Of Sabrina Dubroca
> Sent: Wednesday, August 12, 2020 12:05 PM
> To: Scott Dial <scott@scottdial.com>
> Cc: linux-crypto@vger.kernel.org; Ryan Cox <ryan_cox@byu.edu>; netdev@vger.kernel.org; davem@davemloft.net; Antoine Tenart
> <antoine.tenart@bootlin.com>; ebiggers@google.com
> Subject: Re: Severe performance regression in "net: macsec: preserve ingress frame ordering"
>
> <<< External Email >>>
> 2020-08-10, 12:09:40 -0400, Scott Dial wrote:
> > On 8/10/2020 9:34 AM, Sabrina Dubroca wrote:
> > > [adding the linux-crypto list]
> > >
> > > 2020-08-06, 23:48:16 -0400, Scott Dial wrote:
> > >> On 8/6/2020 5:11 PM, Ryan Cox wrote:
> > >>> With 5.7 I get:
> > >>> * 9.90 Gb/s with no macsec at all
> > >>> * 1.80 Gb/s with macsec WITHOUT encryption
> > >>> * 1.00 Gb/s (sometimes, but often less) with macsec WITH encryption
> > >>>
> > >>> With 5.7 but with ab046a5d4be4c90a3952a0eae75617b49c0cb01b reverted, I get:
> > >>> * 9.90 Gb/s with no macsec at all
> > >>> * 7.33 Gb/s with macsec WITHOUT encryption
> > >>> * 9.83 Gb/s with macsec WITH encryption
> > >>>
> > >>> On tests where performance is bad (including macsec without encryption),
> > >>> iperf3 is at 100% CPU usage.  I was able to run it under `perf record` on
> > >>> iperf3 in a number of the tests but, unfortunately, I have had trouble
> > >>> compiling perf for my own 5.7 compilations (definitely PEBKAC).  If it
> > >>> would be useful I can work on fixing the perf compilation issues.
> > >>
> > >> For certain, you are measuring the difference between AES-NI doing
> > >> gcm(aes) and gcm_base(ctr(aes-aesni),ghash-generic). Specifically, the
> > >> hotspot is ghash-generic's implementation of ghash_update() function.
> > >> I appreciate your testing because I was limited in my ability to test
> > >> beyond 1Gb/s.
> > >>
> > >> The aes-aesni driver is smart enough to use the FPU if it's not busy and
> > >> fallback to the CPU otherwise. Unfortunately, the ghash-clmulni driver
> > >> does not have that kind of logic in it and only provides an async version,
> > >> so we are forced to use the ghash-generic implementation, which is a pure
> > >> CPU implementation. The ideal would be for aesni_intel to provide a
> > >> synchronous version of gcm(aes) that fell back to the CPU if the FPU is
> > >> busy.
> > >> I don't know if the crypto maintainers would be open to such a change, but
> > >> if the choice was between reverting and patching the crypto code, then I
> > >> would work on patching the crypto code.
> > >
> > > To the crypto folks, a bit of context: Scott wrote commit ab046a5d4be4
> > > ("net: macsec: preserve ingress frame ordering"), which made MACsec
> > > use gcm(aes) with CRYPTO_ALG_ASYNC. This prevents out of order
> > > decryption, but reduces performance. We'd like to restore performance
> > > on systems where the FPU is available without breaking MACsec for
> > > systems where the FPU is often busy.
> > >
> > > A quick and dirty alternative might be to let the administrator decide
> > > if they're ok with some out of order. Maybe they know that their FPU
> > > will be mostly idle so it won't even be an issue (or maybe the
> > > opposite, ie keep the fast default and let admins fix their setups
> > > with an extra flag).
> >
> > I can appreciate favoring performance over correctness as a practical
> > concern, but I'd suggest that the out-of-order decryption *is* a
> > performance concern as well. We can debate realness of my workload, but
> > even in Ryan's tests on an otherwise idle server, he showed 0.07% of the
> > frames needed to be dispatched to cryptd, and that for whatever reason
> > it's more often with encryption disabled, which correlates to his
> > decrease in throughput (9.83 Gb/s to 7.33 Gb/s, and 9.19 Gb/s to 6.00
> > Gb/s), perhaps causing exponential backoff from TCP retries. I can
> > resurrect my test setup, but my numbers were worse than Ryan's.
> >
> > In any case, I counted 18 implementations of HW accelerated gcm(aes) in
> > the kernel, with 3 of those implementations in arch (x86, arm64, and
> > s390) and the rest are crypto device drivers. Of all those
> > implementations, the AES-NI implementation is the only one that
> > dispatches to cryptd (via code in crypto/simd.c). AFAICT, every other
> > implementation of gcm(aes) is synchronous, but they would require closer
> > inspection to be certain.
>
> I randomly picked 2 of them (chcr and inside-secure), and they both
> set CRYPTO_ALG_ASYNC, so I guess not.
>
You can expect most, if not all, HW accelerated crypto to be ASYNC. This is
important to achieve decent performance, as going through some external
(to the CPU) accelerator incurs significant latency. (Note that I don't consider
CPU extensions like AES-NI to be "HW accelerated"; anything that uses only
CPU instructions is "just" software in my world.) This implies you need to
pipeline requests to unleash its true performance. So if you need high
throughput crypto with low CPU utilization, you should write your
application appropriately, and not unnecessarily serialize your requests.

With networking protocols you often also have a requirement to minimize
packet reordering, so I understand it's a careful balance. But it is possible
to serialize the important stuff and still do the crypto out-of-order, which
would be really beneficial on _some_ platforms (which have HW crypto
acceleration but no such CPU extensions) at least.

> > So, I'd like to focus on what we can do to
> > improve crypto/simd.c to provide a synchronous implementation of
> > gcm(aes) for AES-NI when possible, which is the vast majority of the time.
> >
> > I would be interested in proposing a change to improve this issue, but
> > I'm not sure the direction that the maintainers of this code would
> > prefer. Since these changes to the crypto API are fairly recent, there
> > may be context that I am not aware of. However, I think it would be
> > straightforward to add another API to crypto/simd.c that allocates sync
> > algorithms, and I would be willing to do the work.
> >
> > The only challenge I see in implementing such a change is deciding how
> > to select a fallback algorithm. The most flexible solution would be to
> > call crypto_alloc_aead with CRYPTO_ALG_ASYNC during the init to pick the
> > "best" fallback (in case there is alternative HW offloading available),
> > but that would almost certainly pick itself and it's not obvious to me
> > how to avoid that.
>
> It's probably possible to add a PURE_SOFTWARE or whatever flag and
> request one of those algorithms for the fallback.
>
Forcing the use of sync algorithms only would be detrimental to platforms
that do not have CPU accelerated crypto, but do have HW acceleration
for crypto external to the CPU. I understand it's much easier to implement,
but that is just being lazy IMHO. For bulk crypto of relatively independent
blocks (networking packets, disk sectors), ASYNC should always be preferred.

> > On the other hand, the caller to the new API could
> > explicitly declare a fallback algorithm (e.g.,
> > "gcm_base(ctr(aes-aesni),ghash-generic)"), which probably is the correct
> > answer anyways --
>
> I would try to avoid that, it seems too error-prone to me.
>
> > what are the chances that there are multiple HW
> > offloads for gcm(aes)? In that case, a possible API would be:
> > int simd_register_aeads_compat_sync(struct aead_alg *algs,
> >                                     char **fallback_algs,
> >                                     int count,
> >             struct simd_aead_alg **simd_algs);
> >
> > Beyond MACsec, it's worth noting that the mac80211 code for AES-GCMP and
> > BIP-GMAC also uses gcm(aes) in sync mode because decryption occurs in a
> > softirq; however, I imagine nobody has reported an issue because the link
> > speed is typically slower and those encryption modes are still uncommon.
>
> Decent wireless cards would do the encryption in hw, no? Also, you
> can't notice a performance regression if it never used the fast
> implementation :)
>
> --
> Sabrina

Regards,
Pascal van Leeuwen
Silicon IP Architect Multi-Protocol Engines, Rambus Security
Rambus ROTW Holding BV
+31-73 6581953

Note: The Inside Secure/Verimatrix Silicon IP team was recently acquired by Rambus.
Please be so kind to update your e-mail address book with my new e-mail address.

** This message and any attachments are for the sole use of the intended recipient(s). It may contain information that is confidential and privileged. If you are not the intended recipient of this message, you are prohibited from printing, copying, forwarding or saving it. Please delete the message and attachments and notify the sender immediately. **

Rambus Inc.<http://www.rambus.com>


* Re: Severe performance regression in "net: macsec: preserve ingress frame ordering"
  2020-08-12 10:45         ` Van Leeuwen, Pascal
@ 2020-08-12 12:42           ` Andrew Lunn
  2020-08-24  9:07             ` Van Leeuwen, Pascal
  0 siblings, 1 reply; 12+ messages in thread
From: Andrew Lunn @ 2020-08-12 12:42 UTC (permalink / raw)
  To: Van Leeuwen, Pascal
  Cc: Sabrina Dubroca, Scott Dial, linux-crypto, Ryan Cox, netdev,
	davem, Antoine Tenart, ebiggers

> With networking protocols you often also have a requirement to minimize
> packet reordering, so I understand it's a careful balance. But it is possible
> to serialize the important stuff and still do the crypto out-of-order, which
> would be really beneficial on _some_ platforms (which have HW crypto
> acceleration but no such CPU extensions) at least.

Many Ethernet PHYs are also capable of doing MACsec as they
send/receive frames. Doing it in hardware in the PHY avoids all these
out-of-order and latency issues. Unfortunately, we are still in the
early days of PHY drivers actually implementing MACsec offload. At
the moment only the Microsemi PHY and Aquantia PHY via firmware in the
Atlantic NIC support this.

	 Andrew


* RE: Severe performance regression in "net: macsec: preserve ingress frame ordering"
  2020-08-12 12:42           ` Andrew Lunn
@ 2020-08-24  9:07             ` Van Leeuwen, Pascal
  2020-08-24 13:01               ` Andrew Lunn
  0 siblings, 1 reply; 12+ messages in thread
From: Van Leeuwen, Pascal @ 2020-08-24  9:07 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Sabrina Dubroca, Scott Dial, linux-crypto, Ryan Cox, netdev,
	davem, Antoine Tenart, ebiggers

> -----Original Message-----
> From: linux-crypto-owner@vger.kernel.org <linux-crypto-owner@vger.kernel.org> On Behalf Of Andrew Lunn
> Sent: Wednesday, August 12, 2020 2:42 PM
> To: Van Leeuwen, Pascal <pvanleeuwen@rambus.com>
> Cc: Sabrina Dubroca <sd@queasysnail.net>; Scott Dial <scott@scottdial.com>; linux-crypto@vger.kernel.org; Ryan Cox
> <ryan_cox@byu.edu>; netdev@vger.kernel.org; davem@davemloft.net; Antoine Tenart <antoine.tenart@bootlin.com>;
> ebiggers@google.com
> Subject: Re: Severe performance regression in "net: macsec: preserve ingress frame ordering"
>
> <<< External Email >>>
> > With networking protocols you often also have a requirement to minimize
> > packet reordering, so I understand it's a careful balance. But it is possible
> > to serialize the important stuff and still do the crypto out-of-order, which
> > would be really beneficial on _some_ platforms (which have HW crypto
> > acceleration but no such CPU extensions) at least.
>
> Many Ethernet PHYs are also capable of doing MACsec as they
> send/receive frames. Doing it in hardware in the PHY avoids all these
> out-of-order and latency issues. Unfortunately, we are still in the
> early days of PHY drivers actually implementing MACsec offload. At
> the moment only the Microsemi PHY and Aquantia PHY via firmware in the
> Atlantic NIC support this.
>
No need to point this out to me as we're the number one supplier of inline MACsec IP :-)
In fact, the Microsemi PHY solution you mention is ours, major parts of that design were
even created by these 2 hands here.  Full protocol offload is obviously the holy grail of HW
acceleration, and what we tend to strive for. The problem is always with the software
integration, so I'm happy to see a framework for inline MACsec acceleration being added to
the kernel.

Without such a protocol acceleration framework (which AFAIK doesn't exist for IPsec yet,
at least not in a generic form supporting all modes and ciphersuites), however, you fall
back to basic hash-encrypt/AEAD offload as the "best you can do".  And some low-cost
devices may still do it on the CPU to minimize silicon cost. So it is still very useful for the
crypto API path to be as efficient as possible, at least for the time being.

Regards,
Pascal van Leeuwen
Silicon IP Architect Multi-Protocol Engines, Rambus Security
Rambus ROTW Holding BV
+31-73 6581953



* Re: Severe performance regression in "net: macsec: preserve ingress frame ordering"
  2020-08-24  9:07             ` Van Leeuwen, Pascal
@ 2020-08-24 13:01               ` Andrew Lunn
  2020-08-25 13:09                 ` Van Leeuwen, Pascal
  0 siblings, 1 reply; 12+ messages in thread
From: Andrew Lunn @ 2020-08-24 13:01 UTC (permalink / raw)
  To: Van Leeuwen, Pascal
  Cc: Sabrina Dubroca, Scott Dial, linux-crypto, Ryan Cox, netdev,
	davem, Antoine Tenart, ebiggers

On Mon, Aug 24, 2020 at 09:07:26AM +0000, Van Leeuwen, Pascal wrote:
> No need to point this out to me as we're the number one supplier of inline MACsec IP :-)
> In fact, the Microsemi PHY solution you mention is ours, major parts of that design were
> even created by these 2 hands here.

Oh,  O.K.

Do you know of other silicon vendors which are using the same IP?
Maybe we can encourage them to share the driver, rather than re-invent
the wheel, which often happens when nobody realises it is basically
the same core with a different wrapper.

Thanks
	Andrew


* RE: Severe performance regression in "net: macsec: preserve ingress frame ordering"
  2020-08-24 13:01               ` Andrew Lunn
@ 2020-08-25 13:09                 ` Van Leeuwen, Pascal
  2020-08-25 13:33                   ` Andrew Lunn
  0 siblings, 1 reply; 12+ messages in thread
From: Van Leeuwen, Pascal @ 2020-08-25 13:09 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Sabrina Dubroca, Scott Dial, linux-crypto, Ryan Cox, netdev,
	davem, Antoine Tenart, ebiggers

> -----Original Message-----
> From: Andrew Lunn <andrew@lunn.ch>
> Sent: Monday, August 24, 2020 3:02 PM
> To: Van Leeuwen, Pascal <pvanleeuwen@rambus.com>
> Cc: Sabrina Dubroca <sd@queasysnail.net>; Scott Dial <scott@scottdial.com>; linux-crypto@vger.kernel.org; Ryan Cox
> <ryan_cox@byu.edu>; netdev@vger.kernel.org; davem@davemloft.net; Antoine Tenart <antoine.tenart@bootlin.com>;
> ebiggers@google.com
> Subject: Re: Severe performance regression in "net: macsec: preserve ingress frame ordering"
>
> <<< External Email >>>
> On Mon, Aug 24, 2020 at 09:07:26AM +0000, Van Leeuwen, Pascal wrote:
> > No need to point this out to me as we're the number one supplier of inline MACsec IP :-)
> > In fact, the Microsemi PHY solution you mention is ours, major parts of that design were
> > even created by these 2 hands here.
>
> Oh,  O.K.
>
> Do you know of other silicon vendors which are using the same IP?
>
I do, there are many. But unfortunately, I cannot disclose our customers unless this is already
public information, e.g. due to some press release or whatever.

> Maybe we can encourage them to share the driver, rather than re-invent
> the wheel, which often happens when nobody realises it is basically
> the same core with a different wrapper.
>
Yes, that could save a lot of duplication of code and effort. And it should be rather trivial to
move the MACsec stuff to a higher level as all it needs is some register access to PHY control
space and an interrupt callback. So it should be possible to define a simple API between the
MACsec driver and the PHY driver for that. I would expect a similar API to be useful for
MACsec-enabled PHYs using other MACsec solutions (i.e. not ours) as well ...

The problem is: who will do it? We can't do it, because we have no access to the actual HW.
Microsemi won't be motivated to do it, because it would only help the competition, so why
would they? So it would have to be some competitor also desiring MACsec support (for the
same MACsec IP), convincing the maintainer of the Microsemi driver to go along with the
changes. I guess it's not all that relevant until we hit that situation.

> Thanks
> Andrew

Regards,
Pascal van Leeuwen
Silicon IP Architect Multi-Protocol Engines, Rambus Security
Rambus ROTW Holding BV
+31-73 6581953



* Re: Severe performance regression in "net: macsec: preserve ingress frame ordering"
  2020-08-25 13:09                 ` Van Leeuwen, Pascal
@ 2020-08-25 13:33                   ` Andrew Lunn
  0 siblings, 0 replies; 12+ messages in thread
From: Andrew Lunn @ 2020-08-25 13:33 UTC (permalink / raw)
  To: Van Leeuwen, Pascal
  Cc: Sabrina Dubroca, Scott Dial, linux-crypto, Ryan Cox, netdev,
	davem, Antoine Tenart, ebiggers

On Tue, Aug 25, 2020 at 01:09:31PM +0000, Van Leeuwen, Pascal wrote:
> > -----Original Message-----
> > From: Andrew Lunn <andrew@lunn.ch>
> > Sent: Monday, August 24, 2020 3:02 PM
> > To: Van Leeuwen, Pascal <pvanleeuwen@rambus.com>
> > Cc: Sabrina Dubroca <sd@queasysnail.net>; Scott Dial <scott@scottdial.com>; linux-crypto@vger.kernel.org; Ryan Cox
> > <ryan_cox@byu.edu>; netdev@vger.kernel.org; davem@davemloft.net; Antoine Tenart <antoine.tenart@bootlin.com>;
> > ebiggers@google.com
> > Subject: Re: Severe performance regression in "net: macsec: preserve ingress frame ordering"
> >
> > <<< External Email >>>
> > On Mon, Aug 24, 2020 at 09:07:26AM +0000, Van Leeuwen, Pascal wrote:
> > > No need to point this out to me as we're the number one supplier of inline MACsec IP :-)
> > > In fact, the Microsemi PHY solution you mention is ours, major parts of that design were
> > > even created by these 2 hands here.
> >
> > Oh,  O.K.
> >
> > Do you know of other silicon vendors which are using the same IP?
> >
> I do, there are many. But unfortunately, I cannot disclose our customers unless this is already
> public information, e.g. due to some press release or whatever.

O.K. Maybe i should flip the question around. If somebody was to
submit a driver, how would i quickly determine it is your IP? Any
particular patterns i should look for?

> > Maybe we can encourage them to share the driver, rather than re-invent
> > the wheel, which often happens when nobody realises it is basically
> > the same core with a different wrapper.
> >
> Yes, that could save a lot of duplication of code and effort.

It would save a lot of effort. But not code duplication. Because if i
or one of the other maintainers notices it is just your IP with a
different wrapper, we would NACK the patch and tell them to refactor
the MSCC driver. There is a long established precedence for that.

> The problem is: who will do it? We can't do it, because we have no
> access to the actual HW.

Microsemi are very friendly. If you ask them, i'm sure they would send
you a board. I assume you also have some sort of FPGA setup you use
for your own testing? That gives you two platforms. And if there are
many PHYs using your IP, it should not be too hard to just go buy a
reference design kit from a vendor.

And there is the marketing aspect for Rambus. You can say your IP is
easy to use, the core code is already in the kernel, supported and
well tested, you just need to add a few wrapper functions in your
driver. No vendor crap driver needed.

	Andrew


end of thread, other threads:[~2020-08-25 13:33 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-06 21:11 Severe performance regression in "net: macsec: preserve ingress frame ordering" Ryan Cox
2020-08-07  3:48 ` Scott Dial
2020-08-07 23:21   ` Ryan Cox
2020-08-10 13:34   ` Sabrina Dubroca
2020-08-10 16:09     ` Scott Dial
2020-08-12 10:04       ` Sabrina Dubroca
2020-08-12 10:45         ` Van Leeuwen, Pascal
2020-08-12 12:42           ` Andrew Lunn
2020-08-24  9:07             ` Van Leeuwen, Pascal
2020-08-24 13:01               ` Andrew Lunn
2020-08-25 13:09                 ` Van Leeuwen, Pascal
2020-08-25 13:33                   ` Andrew Lunn
