netdev.vger.kernel.org archive mirror
* Severe performance regression in "net: macsec: preserve ingress frame ordering"
@ 2020-08-06 21:11 Ryan Cox
  2020-08-07  3:48 ` Scott Dial
  0 siblings, 1 reply; 12+ messages in thread
From: Ryan Cox @ 2020-08-06 21:11 UTC (permalink / raw)
  To: netdev, davem, sd, scott; +Cc: Antoine Tenart

Hello,

I have found two performance issues with MACsec on 10 Gb/s links (tested 
on Intel and Broadcom NICs):
1)  MACsec with encryption is much faster than MACsec without encryption 
(9.8 vs 7.4 Gb/s) until 5.7, where both have poor performance
2)  5.7 introduced a severe performance impact for MACsec with and 
without encryption at commit ab046a5d4be4c90a3952a0eae75617b49c0cb01b

I haven't been able to look at issue #1 yet (and I don't know where to 
start) since I got sidetracked looking at issue #2.

This email is about issue #2, which results in the following in my test 
setup:
* MACsec with encryption drops from 9.81 Gb/s to 1.00 Gb/s or sometimes 
worse
* MACsec without encryption drops from 7.40 Gb/s to 1.80 Gb/s

I have tested a number of configurations.  These tests were performed on 
the following hardware:
* dual Intel Xeon E5-2680 v4 @ 2.40GHz, 14 cores each
* Intel 82599ES 10 GbE NIC
* ixgbe driver, version 5.1.0-k

I also tested on the following hardware in a more limited fashion, but 
the results were consistent:
* dual Intel Xeon E5-2670 v3 @ 2.30GHz, 12 cores each
* Broadcom BCM57810 10 GbE NIC
* bnx2x driver, 1.713.36-0

Only one 10 Gb/s link was populated (i.e. no port-channel).  The MTU for 
the network is 9000, with a resulting MACsec MTU of 8968.  All tests 
were performed with only one switch in between the servers.

I tested three scenarios.  The tests were run on servers that are booted 
with NFS root from an identical image.  The only difference was the 
kernel.  A script was run to create the three scenarios and run the 
benchmarks, so the setups are identical across tests.

The scenarios all involved iperf3 tests of these conditions:
1) no MACsec
2) MACsec without encryption
3) MACsec with encryption

The MACsec setup was done as follows:
ip link add link em1 ms1 type macsec sci 1234 encrypt on  # "encrypt on" omitted for the unencrypted tests
ip macsec add ms1 tx sa 0 pn 1234 on key 01 $(printf %032d 1234)
ip macsec add ms1 rx sci 1234
ip macsec add ms1 rx sci 1234 sa 0 pn 1234 on key 01 $(printf %032d 1234)

That results in `ip macsec show` like this:
6: ms1: protect on validate strict sc off sa off encrypt on send_sci on 
end_station off scb off replay off
     cipher suite: GCM-AES-128, using ICV length 16
     TXSC: 0000000000001234 on SA 0
         0: PN 599345, state on, key 01000000000000000000000000000000
     RXSC: 0000000000001234, state on
         0: PN 5076769, state on, key 01000000000000000000000000000000

I tested a number of kernels (all 64 bit) including:
* 4.18.0-193.13.2.el8_2 (RHEL 8)
* 5.6.7-1.el8.elrepo (ELRepo)
* 5.7.11-1.el8.elrepo (ELRepo)
* 5.7 at tag v5.7.11 (I compiled)
* 5.7 at tag v5.7.11 with ab046a5d4be4c90a3952a0eae75617b49c0cb01b 
reverted (I compiled)

I did test 4.18 <-> 5.7 (bi-directional) and both directions resulted in 
poor performance.  Other than that, each test was between two servers of 
the same kernel version.

CONFIG_CRYPTO_AES_NI_INTEL=y is set in all kernels.

4.18 and 5.6 kernels both have very similar performance characteristics:
* 9.89 Gb/s with no macsec at all
* 7.40 Gb/s with macsec WITHOUT encryption  <--- not sure why, but 
turning OFF encryption slowed things down
* 9.81 Gb/s with macsec WITH encryption

With 5.7 I get:
* 9.90 Gb/s with no macsec at all
* 1.80 Gb/s with macsec WITHOUT encryption
* 1.00 Gb/s (sometimes, but often less) with macsec WITH encryption

With 5.7 but with ab046a5d4be4c90a3952a0eae75617b49c0cb01b reverted, I get:
* 9.90 Gb/s with no macsec at all
* 7.33 Gb/s with macsec WITHOUT encryption
* 9.83 Gb/s with macsec WITH encryption

On tests where performance is bad (including macsec without encryption), 
iperf3 is at 100% CPU usage.  I was able to run iperf3 under `perf 
record` in a number of the tests but, unfortunately, I have had trouble 
compiling perf for my own 5.7 compilations (definitely PEBKAC).  If it 
would be useful I can work on fixing the perf compilation issues.

For 5.7.11-1.el8.elrepo (which has the issue) I get the following top 10 
items in `perf report`:
* MACsec without encryption - iperf3 instance running as server 
(receives data)
     29.92%  iperf3   [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
      6.48%  iperf3   [kernel.kallsyms]  [k] do_syscall_64
      2.92%  iperf3   [kernel.kallsyms]  [k] syscall_return_via_sysret
      2.37%  iperf3   [kernel.kallsyms]  [k] entry_SYSCALL_64
      2.32%  iperf3   [kernel.kallsyms]  [k] __skb_datagram_iter
      2.26%  iperf3   [kernel.kallsyms]  [k] __free_pages_ok
      2.09%  iperf3   [kernel.kallsyms]  [k] tcp_poll
      1.75%  iperf3   [kernel.kallsyms]  [k] do_select
      1.48%  iperf3   [kernel.kallsyms]  [k] free_one_page
      1.44%  iperf3   [kernel.kallsyms]  [k] kmem_cache_free

* MACsec without encryption - iperf3 instance running as client (sends data)
     83.63%  iperf3   [kernel.kallsyms]  [k] gf128mul_4k_lle
      3.46%  iperf3   [kernel.kallsyms]  [k] ghash_update
      1.48%  iperf3   [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
      1.18%  iperf3   [kernel.kallsyms]  [k] memcpy_erms
      1.17%  iperf3   [kernel.kallsyms]  [k] do_csum
      0.50%  iperf3   [kernel.kallsyms]  [k] _raw_spin_lock
      0.44%  iperf3   [kernel.kallsyms]  [k] __copy_skb_header
      0.36%  iperf3   [kernel.kallsyms]  [k] get_page_from_freelist
      0.23%  iperf3   [kernel.kallsyms]  [k] ixgbe_xmit_frame_ring
      0.22%  iperf3   [kernel.kallsyms]  [k] skb_segment

* MACsec with encryption - iperf3 instance running as server (receives data)
     15.66%  iperf3   [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
      9.52%  iperf3   [kernel.kallsyms]  [k] do_syscall_64
      3.76%  iperf3   [kernel.kallsyms]  [k] syscall_return_via_sysret
      3.28%  iperf3   [kernel.kallsyms]  [k] entry_SYSCALL_64
      3.22%  iperf3   [kernel.kallsyms]  [k] do_select
      2.71%  iperf3   [kernel.kallsyms]  [k] tcp_poll
      1.84%  iperf3   [kernel.kallsyms]  [k] tcp_recvmsg
      1.59%  iperf3   [kernel.kallsyms]  [k] sock_poll
      1.38%  iperf3   [kernel.kallsyms]  [k] __skb_datagram_iter
      1.37%  iperf3   [kernel.kallsyms]  [k] __free_pages_ok

* MACsec with encryption - iperf3 instance running as client (sends data)
     43.95%  iperf3   [kernel.kallsyms]  [k] gf128mul_4k_lle
     17.48%  iperf3   [kernel.kallsyms]  [k] _aesni_enc1
      9.42%  iperf3   [kernel.kallsyms]  [k] kernel_fpu_begin
      7.75%  iperf3   [kernel.kallsyms]  [k] __crypto_xor
      3.18%  iperf3   [kernel.kallsyms]  [k] crypto_ctr_crypt
      2.67%  iperf3   [kernel.kallsyms]  [k] crypto_inc
      2.30%  iperf3   [kernel.kallsyms]  [k] aesni_encrypt
      2.05%  iperf3   [kernel.kallsyms]  [k] aesni_enc
      1.87%  iperf3   [kernel.kallsyms]  [k] ghash_update
      1.03%  iperf3   [kernel.kallsyms]  [k] kernel_fpu_end

Here is `ethtool -k em1` in case that is helpful:
Features for em1:
rx-checksumming: on
tx-checksumming: on
         tx-checksum-ipv4: off [fixed]
         tx-checksum-ip-generic: on
         tx-checksum-ipv6: off [fixed]
         tx-checksum-fcoe-crc: on [fixed]
         tx-checksum-sctp: on
scatter-gather: on
         tx-scatter-gather: on
         tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
         tx-tcp-segmentation: on
         tx-tcp-ecn-segmentation: off [fixed]
         tx-tcp-mangleid-segmentation: off
         tx-tcp6-segmentation: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: on
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: on [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: on
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on
tx-tunnel-remcsum-segmentation: off [fixed]
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: on
tx-udp-segmentation: on
tx-gso-list: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off
hw-tc-offload: off
esp-hw-offload: on
esp-tx-csum-hw-offload: on
rx-udp_tunnel-port-offload: on
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]
rx-gro-list: off
macsec-hw-offload: off [fixed]

I have lots of logs that I can provide if needed.

I thank Antoine Tenart for suggesting tests for this issue and for 
narrowing down which commits to check.

Thanks,
Ryan

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Severe performance regression in "net: macsec: preserve ingress frame ordering"
  2020-08-06 21:11 Severe performance regression in "net: macsec: preserve ingress frame ordering" Ryan Cox
@ 2020-08-07  3:48 ` Scott Dial
  2020-08-07 23:21   ` Ryan Cox
  2020-08-10 13:34   ` Sabrina Dubroca
  0 siblings, 2 replies; 12+ messages in thread
From: Scott Dial @ 2020-08-07  3:48 UTC (permalink / raw)
  To: Ryan Cox, netdev, davem, sd; +Cc: Antoine Tenart

On 8/6/2020 5:11 PM, Ryan Cox wrote:
> With 5.7 I get:
> * 9.90 Gb/s with no macsec at all
> * 1.80 Gb/s with macsec WITHOUT encryption
> * 1.00 Gb/s (sometimes, but often less) with macsec WITH encryption
> 
> With 5.7 but with ab046a5d4be4c90a3952a0eae75617b49c0cb01b reverted, I get:
> * 9.90 Gb/s with no macsec at all
> * 7.33 Gb/s with macsec WITHOUT encryption
> * 9.83 Gb/s with macsec WITH encryption
> 
> On tests where performance is bad (including macsec without encryption),
> iperf3 is at 100% CPU usage.  I was able to run it under `perf record`on
> iperf3 in a number of the tests but, unfortunately, I have had trouble
> compiling perf for my own 5.7 compilations (definitely PEBKAC).  If it
> would be useful I can work on fixing the perf compilation issues.

For certain, you are measuring the difference between AES-NI doing
gcm(aes) and gcm_base(ctr(aes-aesni),ghash-generic). Specifically, the
hotspot is ghash-generic's implementation of the ghash_update() function.
I appreciate your testing because I was limited in my ability to test
beyond 1Gb/s.

The aes-aesni driver is smart enough to use the FPU if it's not busy and
fall back to the CPU otherwise. Unfortunately, the ghash-clmulni driver
does not have that kind of logic in it and only provides an async version,
so we are forced to use the ghash-generic implementation, which is a pure
CPU implementation. The ideal would be for aesni_intel to provide a
synchronous version of gcm(aes) that falls back to the CPU if the FPU is
busy.
I don't know if the crypto maintainers would be open to such a change, but
if the choice was between reverting and patching the crypto code, then I
would work on patching the crypto code.

In any case, you didn't report how many packets arrived out of order, which
was the issue being addressed by my change. It would be helpful to get
the output of "ip -s macsec show" and specifically the InPktsDelayed
counter. Did iperf3 report out-of-order packets with the patch reverted?
Otherwise, if this is the only process running on your test servers,
then you may not be generating any contention for the FPU, which is the
source of the out-of-order issue. Maybe you could run prime95 to busy
the FPU to see the issue that I was seeing.

I have a product that is a secure router with a half-dozen MACsec
interfaces, boots from a LUKS-encrypted disk, and has a number of TLS
control and status interfaces for local devices attached to the product.
Without this patch, the system was completely unusable due to the
out-of-order issue causing TCP retries and UDP out-of-order issues. I
have not seen any examples of this MACsec driver in the wild, so I
assumed nobody had noticed the out-of-order issue because of synthetic
testing.
-- 
Scott Dial
scott@scottdial.com


* Re: Severe performance regression in "net: macsec: preserve ingress frame ordering"
  2020-08-07  3:48 ` Scott Dial
@ 2020-08-07 23:21   ` Ryan Cox
  2020-08-10 13:34   ` Sabrina Dubroca
  1 sibling, 0 replies; 12+ messages in thread
From: Ryan Cox @ 2020-08-07 23:21 UTC (permalink / raw)
  To: Scott Dial; +Cc: Antoine Tenart, netdev, davem, sd

On 8/6/20 9:48 PM, Scott Dial wrote:
> The aes-aesni driver is smart enough to use the FPU if it's not busy and
> fallback to the CPU otherwise. Unfortunately, the ghash-clmulni driver
> does not have that kind of logic in it and only provides an async version,
> so we are forced to use the ghash-generic implementation, which is a pure
> CPU implementation. The ideal would be for aesni_intel to provide a
> synchronous version of gcm(aes) that fell back to the CPU if the FPU is
> busy.

I don't know how the AES-NI support works, but I did see your specific 
mention of aesni_intel and figured I should mention that this also 
affects AMD. I just got access to AMD nodes (2 x EPYC 7302) with a 
Mellanox 10 GbE NIC.  I ran the same test and it showed a similar 
performance pattern.  I doubt this means much, but I figured I should 
mention it.

> I don't know if the crypto maintainers would be open to such a change, but
> if the choice was between reverting and patching the crypto code, then I
> would work on patching the crypto code.

I can't opine on anything crypto-related since it is well outside my 
area of expertise, though it is helpful to hear what is going on.

> In any case, you didn't report how many packets arrived out of order, which
> was the issue being addressed by my change. It would be helpful to get
> the output of "ip -s macsec show" and specifically the InPktsDelayed
> counter. Did iperf3 report out-of-order packets with the patch reverted?
> Otherwise, if this is the only process running on your test servers,
> then you may not be generating any contention for the FPU, which is the
> source of the out-of-order issue. Maybe you could run prime95 to busy
> the FPU to see the issue that I was seeing.

I ran some tests again on the same servers as before with the Intel 
NICs.  I tested with prime95 running on 27 of the 28 cores in *each* 
server simultaneously (allowing iperf3 to use a core on each) throughout 
the entire test.  This was using 5.7.11 with 
ab046a5d4be4c90a3952a0eae75617b49c0cb01b reverted, so pre-5.7 performance.

MACsec interfaces are deleted and recreated before each test, so 
counters are always fresh.

== MACSEC WITHOUT ENCRYPTION ==

* Server1:
18: ms1: protect on validate strict sc off sa off encrypt off send_sci on end_station off scb off replay off
     cipher suite: GCM-AES-128, using ICV length 16
     TXSC: 0000000000001234 on SA 0
     stats: OutPktsUntagged 0, InPktsUntagged 0, OutPktsTooLong 0,
            InPktsNoTag 1123, InPktsBadTag 0, InPktsUnknownSCI 0,
            InPktsNoSCI 1, InPktsOverrun 0
     stats: OutPktsProtected 3798421, OutPktsEncrypted 0,
            OutOctetsProtected 30889802591, OutOctetsEncrypted 0
         0: PN 3799655, state on, key 01000000000000000000000000000000
     stats: OutPktsProtected 3798421, OutPktsEncrypted 0
     RXSC: 0000000000001234, state on
     stats: InOctetsValidated 30042694872, InOctetsDecrypted 0,
            InPktsUnchecked 0, InPktsDelayed 218, InPktsOK 3675170,
            InPktsInvalid 0, InPktsLate 0, InPktsNotValid 0,
            InPktsNotUsingSA 0, InPktsUnusedSA 0
         0: PN 3676633, state on, key 01000000000000000000000000000000
     stats: InPktsOK 3675170, InPktsInvalid 0, InPktsNotValid 0,
            InPktsNotUsingSA 0, InPktsUnusedSA 0

* Server2:
18: ms1: protect on validate strict sc off sa off encrypt off send_sci on end_station off scb off replay off
     cipher suite: GCM-AES-128, using ICV length 16
     TXSC: 0000000000001234 on SA 0
     stats: OutPktsUntagged 0, InPktsUntagged 0, OutPktsTooLong 0,
            InPktsNoTag 1227, InPktsBadTag 0, InPktsUnknownSCI 0,
            InPktsNoSCI 1, InPktsOverrun 0
     stats: OutPktsProtected 3675399, OutPktsEncrypted 0,
            OutOctetsProtected 30042696158, OutOctetsEncrypted 0
         0: PN 3676633, state on, key 01000000000000000000000000000000
     stats: OutPktsProtected 3675399, OutPktsEncrypted 0
     RXSC: 0000000000001234, state on
     stats: InOctetsValidated 30889801305, InOctetsDecrypted 0,
            InPktsUnchecked 0, InPktsDelayed 0, InPktsOK 3798410,
            InPktsInvalid 0, InPktsLate 0, InPktsNotValid 0,
            InPktsNotUsingSA 0, InPktsUnusedSA 0
         0: PN 3799655, state on, key 01000000000000000000000000000000
     stats: InPktsOK 3798410, InPktsInvalid 0, InPktsNotValid 0,
            InPktsNotUsingSA 0, InPktsUnusedSA 0


InPktsDelayed was 218 for Server1 and 0 for Server2.

== MACSEC WITH ENCRYPTION ==

I got the following *with* encryption:
* Server1:
19: ms1: protect on validate strict sc off sa off encrypt on send_sci on end_station off scb off replay off
     cipher suite: GCM-AES-128, using ICV length 16
     TXSC: 0000000000001234 on SA 0
     stats: OutPktsUntagged 0, InPktsUntagged 0, OutPktsTooLong 0,
            InPktsNoTag 1397, InPktsBadTag 0, InPktsUnknownSCI 0,
            InPktsNoSCI 0, InPktsOverrun 0
     stats: OutPktsProtected 0, OutPktsEncrypted 5560714,
            OutOctetsProtected 0, OutOctetsEncrypted 46931594623
         0: PN 5561948, state on, key 01000000000000000000000000000000
     stats: OutPktsProtected 0, OutPktsEncrypted 5560714
     RXSC: 0000000000001234, state on
     stats: InOctetsValidated 0, InOctetsDecrypted 45977049585,
            InPktsUnchecked 0, InPktsDelayed 3771, InPktsOK 5417843,
            InPktsInvalid 0, InPktsLate 0, InPktsNotValid 0,
            InPktsNotUsingSA 0, InPktsUnusedSA 0
         0: PN 5422860, state on, key 01000000000000000000000000000000
     stats: InPktsOK 5417843, InPktsInvalid 0, InPktsNotValid 0,
            InPktsNotUsingSA 0, InPktsUnusedSA 0

* Server2:
19: ms1: protect on validate strict sc off sa off encrypt on send_sci on end_station off scb off replay off
     cipher suite: GCM-AES-128, using ICV length 16
     TXSC: 0000000000001234 on SA 0
     stats: OutPktsUntagged 0, InPktsUntagged 0, OutPktsTooLong 0,
            InPktsNoTag 1490, InPktsBadTag 0, InPktsUnknownSCI 0,
            InPktsNoSCI 0, InPktsOverrun 0
     stats: OutPktsProtected 0, OutPktsEncrypted 5421626,
            OutOctetsProtected 0, OutOctetsEncrypted 45977059885
         0: PN 5422860, state on, key 01000000000000000000000000000000
     stats: OutPktsProtected 0, OutPktsEncrypted 5421626
     RXSC: 0000000000001234, state on
     stats: InOctetsValidated 0, InOctetsDecrypted 46931106683,
            InPktsUnchecked 0, InPktsDelayed 109, InPktsOK 5560541,
            InPktsInvalid 0, InPktsLate 0, InPktsNotValid 0,
            InPktsNotUsingSA 0, InPktsUnusedSA 0
         0: PN 5561948, state on, key 01000000000000000000000000000000
     stats: InPktsOK 5560541, InPktsInvalid 0, InPktsNotValid 0,
            InPktsNotUsingSA 0, InPktsUnusedSA 0


InPktsDelayed was 3771 for Server1 and 109 for Server2.


The performance numbers were:
* 9.87 Gb/s without macsec
* 6.00 Gb/s with macsec WITHOUT encryption
* 9.19 Gb/s with macsec WITH encryption

iperf3 retransmits were:
* 27 without macsec
* 1211 with macsec WITHOUT encryption
* 721 with macsec WITH encryption


Thanks for the reply and for the background on this.

Ryan


* Re: Severe performance regression in "net: macsec: preserve ingress frame ordering"
  2020-08-07  3:48 ` Scott Dial
  2020-08-07 23:21   ` Ryan Cox
@ 2020-08-10 13:34   ` Sabrina Dubroca
  2020-08-10 16:09     ` Scott Dial
  1 sibling, 1 reply; 12+ messages in thread
From: Sabrina Dubroca @ 2020-08-10 13:34 UTC (permalink / raw)
  To: Scott Dial; +Cc: linux-crypto, Ryan Cox, netdev, davem, Antoine Tenart

[adding the linux-crypto list]

2020-08-06, 23:48:16 -0400, Scott Dial wrote:
> On 8/6/2020 5:11 PM, Ryan Cox wrote:
> > With 5.7 I get:
> > * 9.90 Gb/s with no macsec at all
> > * 1.80 Gb/s with macsec WITHOUT encryption
> > * 1.00 Gb/s (sometimes, but often less) with macsec WITH encryption
> > 
> > With 5.7 but with ab046a5d4be4c90a3952a0eae75617b49c0cb01b reverted, I get:
> > * 9.90 Gb/s with no macsec at all
> > * 7.33 Gb/s with macsec WITHOUT encryption
> > * 9.83 Gb/s with macsec WITH encryption
> > 
> > On tests where performance is bad (including macsec without encryption),
> > iperf3 is at 100% CPU usage.  I was able to run it under `perf record`on
> > iperf3 in a number of the tests but, unfortunately, I have had trouble
> > compiling perf for my own 5.7 compilations (definitely PEBKAC).  If it
> > would be useful I can work on fixing the perf compilation issues.
> 
> For certain, you are measuring the difference between AES-NI doing
> gcm(aes) and gcm_base(ctr(aes-aesni),ghash-generic). Specifically, the
> hotspot is ghash-generic's implementation of ghash_update() function.
> I appreciate your testing because I was limited in my ability to test
> beyond 1Gb/s.
> 
> The aes-aesni driver is smart enough to use the FPU if it's not busy and
> fallback to the CPU otherwise. Unfortunately, the ghash-clmulni driver
> does not have that kind of logic in it and only provides an async version,
> so we are forced to use the ghash-generic implementation, which is a pure
> CPU implementation. The ideal would be for aesni_intel to provide a
> synchronous version of gcm(aes) that fell back to the CPU if the FPU is
> busy.
> I don't know if the crypto maintainers would be open to such a change, but
> if the choice was between reverting and patching the crypto code, then I
> would work on patching the crypto code.

To the crypto folks, a bit of context: Scott wrote commit ab046a5d4be4
("net: macsec: preserve ingress frame ordering"), which made MACsec
use gcm(aes) with CRYPTO_ALG_ASYNC. This prevents out of order
decryption, but reduces performance. We'd like to restore performance
on systems where the FPU is available without breaking MACsec for
systems where the FPU is often busy.

A quick and dirty alternative might be to let the administrator decide
if they're OK with some reordering. Maybe they know that their FPU
will be mostly idle, so it won't even be an issue (or maybe the
opposite, i.e. keep the fast default and let admins fix their setups
with an extra flag).

> In any case, you didn't report how many packets arrived out of order, which
> was the issue being addressed by my change. It would be helpful to get
> the output of "ip -s macsec show" and specifically the InPktsDelayed
> counter. Did iperf3 report out-of-order packets with the patch reverted?
> Otherwise, if this is the only process running on your test servers,
> then you may not be generating any contention for the FPU, which is the
> source of the out-of-order issue. Maybe you could run prime95 to busy
> the FPU to see the issue that I was seeing.

But that's not necessarily a realistic workload for all machines.

> I have a product that is a secure router with a half-dozen MACsec
> interfaces, boots from a LUKS-encrypted disk, and has a number of TLS
> control and status interfaces for local devices attached to product.
> Without this patch, the system was completely unusable due to the
> out-of-order issue causing TCP retries and UDP out-of-order issues. I
> have not seen any examples of this MACsec driver in the wild, so I
> assumed nobody had noticed the out-of-order issue because of synthetic
> testing.

We have customers using MACsec, and I haven't heard of reports like
yours.

-- 
Sabrina



* Re: Severe performance regression in "net: macsec: preserve ingress frame ordering"
  2020-08-10 13:34   ` Sabrina Dubroca
@ 2020-08-10 16:09     ` Scott Dial
  2020-08-12 10:04       ` Sabrina Dubroca
  0 siblings, 1 reply; 12+ messages in thread
From: Scott Dial @ 2020-08-10 16:09 UTC (permalink / raw)
  To: Sabrina Dubroca
  Cc: linux-crypto, Ryan Cox, netdev, davem, Antoine Tenart, ebiggers

On 8/10/2020 9:34 AM, Sabrina Dubroca wrote:
> [adding the linux-crypto list]
> 
> 2020-08-06, 23:48:16 -0400, Scott Dial wrote:
>> On 8/6/2020 5:11 PM, Ryan Cox wrote:
>>> With 5.7 I get:
>>> * 9.90 Gb/s with no macsec at all
>>> * 1.80 Gb/s with macsec WITHOUT encryption
>>> * 1.00 Gb/s (sometimes, but often less) with macsec WITH encryption
>>>
>>> With 5.7 but with ab046a5d4be4c90a3952a0eae75617b49c0cb01b reverted, I get:
>>> * 9.90 Gb/s with no macsec at all
>>> * 7.33 Gb/s with macsec WITHOUT encryption
>>> * 9.83 Gb/s with macsec WITH encryption
>>>
>>> On tests where performance is bad (including macsec without encryption),
>>> iperf3 is at 100% CPU usage.  I was able to run it under `perf record`on
>>> iperf3 in a number of the tests but, unfortunately, I have had trouble
>>> compiling perf for my own 5.7 compilations (definitely PEBKAC).  If it
>>> would be useful I can work on fixing the perf compilation issues.
>>
>> For certain, you are measuring the difference between AES-NI doing
>> gcm(aes) and gcm_base(ctr(aes-aesni),ghash-generic). Specifically, the
>> hotspot is ghash-generic's implementation of ghash_update() function.
>> I appreciate your testing because I was limited in my ability to test
>> beyond 1Gb/s.
>>
>> The aes-aesni driver is smart enough to use the FPU if it's not busy and
>> fallback to the CPU otherwise. Unfortunately, the ghash-clmulni driver
>> does not have that kind of logic in it and only provides an async version,
>> so we are forced to use the ghash-generic implementation, which is a pure
>> CPU implementation. The ideal would be for aesni_intel to provide a
>> synchronous version of gcm(aes) that fell back to the CPU if the FPU is
>> busy.
>> I don't know if the crypto maintainers would be open to such a change, but
>> if the choice was between reverting and patching the crypto code, then I
>> would work on patching the crypto code.
> 
> To the crypto folks, a bit of context: Scott wrote commit ab046a5d4be4
> ("net: macsec: preserve ingress frame ordering"), which made MACsec
> use gcm(aes) with CRYPTO_ALG_ASYNC. This prevents out of order
> decryption, but reduces performance. We'd like to restore performance
> on systems where the FPU is available without breaking MACsec for
> systems where the FPU is often busy.
> 
> A quick and dirty alternative might be to let the administrator decide
> if they're ok with some out of order. Maybe they know that their FPU
> will be mostly idle so it won't even be an issue (or maybe the
> opposite, ie keep the fast default and let admins fix their setups
> with an extra flag).

I can appreciate favoring performance over correctness as a practical
concern, but I'd suggest that the out-of-order decryption *is* a
performance concern as well. We can debate the realness of my workload, but
even in Ryan's tests on an otherwise idle server, he showed that 0.07% of
the frames needed to be dispatched to cryptd, and that for whatever reason
it happens more often with encryption disabled, which correlates with his
decrease in throughput (9.83 Gb/s to 7.33 Gb/s, and 9.19 Gb/s to 6.00
Gb/s), perhaps causing exponential backoff from TCP retries. I can
resurrect my test setup, but my numbers were worse than Ryan's.

In any case, I counted 18 implementations of HW accelerated gcm(aes) in
the kernel, with 3 of those implementations in arch (x86, arm64, and
s390) and the rest in crypto device drivers. Of all those
implementations, the AES-NI implementation is the only one that
dispatches to cryptd (via code in crypto/simd.c). AFAICT, every other
implementation of gcm(aes) is synchronous, but they would require closer
inspection to be certain. So, I'd like to focus on what we can do to
improve crypto/simd.c to provide a synchronous implementation of
gcm(aes) for AES-NI when possible, which is the vast majority of the time.

I would be interested in proposing a change to improve this issue, but
I'm not sure which direction the maintainers of this code would
prefer. Since these changes to the crypto API are fairly recent, there
may be context that I am not aware of. However, I think it would be
straightforward to add another API to crypto/simd.c that allocates sync
algorithms, and I would be willing to do the work.

The only challenge I see in implementing such a change is deciding how
to select a fallback algorithm. The most flexible solution would be to
call crypto_alloc_aead with CRYPTO_ALG_ASYNC during the init to pick the
"best" fallback (in case there is alternative HW offloading available),
but that would almost certainly pick itself and it's not obvious to me
how to avoid that. On the other hand, the caller of the new API could
explicitly declare a fallback algorithm (e.g.,
"gcm_base(ctr(aes-aesni),ghash-generic)"), which is probably the correct
answer anyway -- what are the chances that there are multiple HW
offloads for gcm(aes)? In that case, a possible API would be:

int simd_register_aeads_compat_sync(struct aead_alg *algs,
                                    char **fallback_algs,
                                    int count,
                                    struct simd_aead_alg **simd_algs);

Beyond MACsec, it's worth noting that the mac80211 code for AES-GCMP and
BIP-GMAC also uses gcm(aes) in sync mode because decryption occurs in a
softirq; however, I imagine nobody has reported an issue because the link
speed is typically slower and those encryption modes are still uncommon.

-- 
Scott Dial
scott@scottdial.com


* Re: Severe performance regression in "net: macsec: preserve ingress frame ordering"
  2020-08-10 16:09     ` Scott Dial
@ 2020-08-12 10:04       ` Sabrina Dubroca
  2020-08-12 10:45         ` Van Leeuwen, Pascal
  0 siblings, 1 reply; 12+ messages in thread
From: Sabrina Dubroca @ 2020-08-12 10:04 UTC (permalink / raw)
  To: Scott Dial
  Cc: linux-crypto, Ryan Cox, netdev, davem, Antoine Tenart, ebiggers

2020-08-10, 12:09:40 -0400, Scott Dial wrote:
> On 8/10/2020 9:34 AM, Sabrina Dubroca wrote:
> > [adding the linux-crypto list]
> > 
> > 2020-08-06, 23:48:16 -0400, Scott Dial wrote:
> >> On 8/6/2020 5:11 PM, Ryan Cox wrote:
> >>> With 5.7 I get:
> >>> * 9.90 Gb/s with no macsec at all
> >>> * 1.80 Gb/s with macsec WITHOUT encryption
> >>> * 1.00 Gb/s (sometimes, but often less) with macsec WITH encryption
> >>>
> >>> With 5.7 but with ab046a5d4be4c90a3952a0eae75617b49c0cb01b reverted, I get:
> >>> * 9.90 Gb/s with no macsec at all
> >>> * 7.33 Gb/s with macsec WITHOUT encryption
> >>> * 9.83 Gb/s with macsec WITH encryption
> >>>
> >>> On tests where performance is bad (including macsec without encryption),
> >>> iperf3 is at 100% CPU usage.  I was able to run it under `perf record` on
> >>> iperf3 in a number of the tests but, unfortunately, I have had trouble
> >>> compiling perf for my own 5.7 compilations (definitely PEBKAC).  If it
> >>> would be useful I can work on fixing the perf compilation issues.
> >>
> >> For certain, you are measuring the difference between AES-NI doing
> >> gcm(aes) and gcm_base(ctr(aes-aesni),ghash-generic). Specifically, the
> >> hotspot is ghash-generic's implementation of ghash_update() function.
> >> I appreciate your testing because I was limited in my ability to test
> >> beyond 1Gb/s.
> >>
> >> The aes-aesni driver is smart enough to use the FPU if it's not busy and
> >> fallback to the CPU otherwise. Unfortunately, the ghash-clmulni driver
> >> does not have that kind of logic in it and only provides an async version,
> >> so we are forced to use the ghash-generic implementation, which is a pure
> >> CPU implementation. The ideal would be for aesni_intel to provide a
> >> synchronous version of gcm(aes) that fell back to the CPU if the FPU is
> >> busy.
> >> I don't know if the crypto maintainers would be open to such a change, but
> >> if the choice was between reverting and patching the crypto code, then I
> >> would work on patching the crypto code.
> > 
> > To the crypto folks, a bit of context: Scott wrote commit ab046a5d4be4
> > ("net: macsec: preserve ingress frame ordering"), which made MACsec
> > use gcm(aes) with CRYPTO_ALG_ASYNC. This prevents out of order
> > decryption, but reduces performance. We'd like to restore performance
> > on systems where the FPU is available without breaking MACsec for
> > systems where the FPU is often busy.
> > 
> > A quick and dirty alternative might be to let the administrator decide
> > if they're ok with some out of order. Maybe they know that their FPU
> > will be mostly idle so it won't even be an issue (or maybe the
> > opposite, ie keep the fast default and let admins fix their setups
> > with an extra flag).
> 
> I can appreciate favoring performance over correctness as a practical
> concern, but I'd suggest that the out-of-order decryption *is* a
> performance concern as well. We can debate realness of my workload, but
> even in Ryan's tests on an otherwise idle server, he showed 0.07% of the
> frames needed to be dispatched to cryptd, and that for whatever reason
> it's more often with encryption disabled, which correlates to his
> decrease in throughput (9.83 Gb/s to 7.33 Gb/s, and 9.19 Gb/s to 6.00
> Gb/s), perhaps causing exponential backoff from TCP retries. I can
> resurrect my test setup, but my numbers were worse than Ryan's.
> 
> In any case, I counted 18 implementations of HW accelerated gcm(aes) in
> the kernel, with 3 of those implementations in arch (x86, arm64, and
> s390) and the rest are crypto device drivers. Of all those
> implementations, the AES-NI implementation is the only one that
> dispatches to cryptd (via code in crypto/simd.c). AFAICT, every other
> implementation of gcm(aes) is synchronous, but they would require closer
> inspection to be certain.

I randomly picked 2 of them (chcr and inside-secure), and they both
set CRYPTO_ALG_ASYNC, so I guess not.

> So, I'd like to focus on what we can do to
> improve crypto/simd.c to provide a synchronous implementation of
> gcm(aes) for AES-NI when possible, which is the vast majority of the time.
>
> I would be interested in proposing a change to improve this issue, but
> I'm not sure the direction that the maintainers of this code would
> prefer. Since these changes to the crypto API are fairly recent, there
> may be context that I am not aware of. However, I think it would be
> straightforward to add another API to crypto/simd.c that allocates sync
> algorithms, and I would be willing to do the work.
> 
> The only challenge I see in implementing such a change is deciding how
> to select a fallback algorithm. The most flexible solution would be to
> call crypto_alloc_aead with CRYPTO_ALG_ASYNC during the init to pick the
> "best" fallback (in case there is alternative HW offloading available),
> but that would almost certainly pick itself and it's not obvious to me
> how to avoid that.

It's probably possible to add a PURE_SOFTWARE or whatever flag and
request one of those algorithms for the fallback.

> On the other hand, the caller to the new API could
> explicitly declare a fallback algorithm (e.g.,
> "gcm_base(ctr(aes-aesni),ghash-generic)"), which probably is the correct
> answer anyways --

I would try to avoid that, it seems too error-prone to me.

> what are the chances that there are multiple HW
> offloads for gcm(aes)? In that case, a possible API would be:
> int simd_register_aeads_compat_sync(struct aead_alg *algs,
>                                     char **fallback_algs,
>                                     int count,
> 			            struct simd_aead_alg **simd_algs);
> 
> Beyond MACsec, it's worth noting that the mac80211 code for AES-GCMP and
> BIP-GMAC also uses gcm(aes) in sync mode because decryption occurs in a
> softirq; however, I imagine nobody has reported an issue because the link
> speed is typically slower and those encryption modes are still uncommon.

Decent wireless cards would do the encryption in hw, no? Also, you
> can't notice a performance regression if it never used the fast
implementation :)

-- 
Sabrina



* RE: Severe performance regression in "net: macsec: preserve ingress frame ordering"
  2020-08-12 10:04       ` Sabrina Dubroca
@ 2020-08-12 10:45         ` Van Leeuwen, Pascal
  2020-08-12 12:42           ` Andrew Lunn
  0 siblings, 1 reply; 12+ messages in thread
From: Van Leeuwen, Pascal @ 2020-08-12 10:45 UTC (permalink / raw)
  To: Sabrina Dubroca, Scott Dial
  Cc: linux-crypto, Ryan Cox, netdev, davem, Antoine Tenart, ebiggers

> -----Original Message-----
> From: linux-crypto-owner@vger.kernel.org <linux-crypto-owner@vger.kernel.org> On Behalf Of Sabrina Dubroca
> Sent: Wednesday, August 12, 2020 12:05 PM
> To: Scott Dial <scott@scottdial.com>
> Cc: linux-crypto@vger.kernel.org; Ryan Cox <ryan_cox@byu.edu>; netdev@vger.kernel.org; davem@davemloft.net; Antoine Tenart
> <antoine.tenart@bootlin.com>; ebiggers@google.com
> Subject: Re: Severe performance regression in "net: macsec: preserve ingress frame ordering"
>
> <<< External Email >>>
> 2020-08-10, 12:09:40 -0400, Scott Dial wrote:
> > On 8/10/2020 9:34 AM, Sabrina Dubroca wrote:
> > > [adding the linux-crypto list]
> > >
> > > 2020-08-06, 23:48:16 -0400, Scott Dial wrote:
> > >> On 8/6/2020 5:11 PM, Ryan Cox wrote:
> > >>> With 5.7 I get:
> > >>> * 9.90 Gb/s with no macsec at all
> > >>> * 1.80 Gb/s with macsec WITHOUT encryption
> > >>> * 1.00 Gb/s (sometimes, but often less) with macsec WITH encryption
> > >>>
> > >>> With 5.7 but with ab046a5d4be4c90a3952a0eae75617b49c0cb01b reverted, I get:
> > >>> * 9.90 Gb/s with no macsec at all
> > >>> * 7.33 Gb/s with macsec WITHOUT encryption
> > >>> * 9.83 Gb/s with macsec WITH encryption
> > >>>
> > >>> On tests where performance is bad (including macsec without encryption),
> > >>> iperf3 is at 100% CPU usage.  I was able to run it under `perf record` on
> > >>> iperf3 in a number of the tests but, unfortunately, I have had trouble
> > >>> compiling perf for my own 5.7 compilations (definitely PEBKAC).  If it
> > >>> would be useful I can work on fixing the perf compilation issues.
> > >>
> > >> For certain, you are measuring the difference between AES-NI doing
> > >> gcm(aes) and gcm_base(ctr(aes-aesni),ghash-generic). Specifically, the
> > >> hotspot is ghash-generic's implementation of ghash_update() function.
> > >> I appreciate your testing because I was limited in my ability to test
> > >> beyond 1Gb/s.
> > >>
> > >> The aes-aesni driver is smart enough to use the FPU if it's not busy and
> > >> fallback to the CPU otherwise. Unfortunately, the ghash-clmulni driver
> > >> does not have that kind of logic in it and only provides an async version,
> > >> so we are forced to use the ghash-generic implementation, which is a pure
> > >> CPU implementation. The ideal would be for aesni_intel to provide a
> > >> synchronous version of gcm(aes) that fell back to the CPU if the FPU is
> > >> busy.
> > >> I don't know if the crypto maintainers would be open to such a change, but
> > >> if the choice was between reverting and patching the crypto code, then I
> > >> would work on patching the crypto code.
> > >
> > > To the crypto folks, a bit of context: Scott wrote commit ab046a5d4be4
> > > ("net: macsec: preserve ingress frame ordering"), which made MACsec
> > > use gcm(aes) with CRYPTO_ALG_ASYNC. This prevents out of order
> > > decryption, but reduces performance. We'd like to restore performance
> > > on systems where the FPU is available without breaking MACsec for
> > > systems where the FPU is often busy.
> > >
> > > A quick and dirty alternative might be to let the administrator decide
> > > if they're ok with some out of order. Maybe they know that their FPU
> > > will be mostly idle so it won't even be an issue (or maybe the
> > > opposite, ie keep the fast default and let admins fix their setups
> > > with an extra flag).
> >
> > I can appreciate favoring performance over correctness as a practical
> > concern, but I'd suggest that the out-of-order decryption *is* a
> > performance concern as well. We can debate realness of my workload, but
> > even in Ryan's tests on an otherwise idle server, he showed 0.07% of the
> > frames needed to be dispatched to cryptd, and that for whatever reason
> > it's more often with encryption disabled, which correlates to his
> > decrease in throughput (9.83 Gb/s to 7.33 Gb/s, and 9.19 Gb/s to 6.00
> > Gb/s), perhaps causing exponential backoff from TCP retries. I can
> > resurrect my test setup, but my numbers were worse than Ryan's.
> >
> > In any case, I counted 18 implementations of HW accelerated gcm(aes) in
> > the kernel, with 3 of those implementations in arch (x86, arm64, and
> > s390) and the rest are crypto device drivers. Of all those
> > implementations, the AES-NI implementation is the only one that
> > dispatches to cryptd (via code in crypto/simd.c). AFAICT, every other
> > implementation of gcm(aes) is synchronous, but they would require closer
> > inspection to be certain.
>
> I randomly picked 2 of them (chcr and inside-secure), and they both
> set CRYPTO_ALG_ASYNC, so I guess not.
>
You can expect most, if not all, HW accelerated crypto to be ASYNC. This is
important to achieve decent performance, as going through some external
(to the CPU) accelerator incurs significant latency. (Note that I don't consider
CPU extensions like AES-NI to be "HW accelerated"; anything that uses only
CPU instructions is "just" software in my world.) This implies you need to
pipeline requests to unleash its true performance. So if you need high
throughput crypto with low CPU utilization, you should write your
application appropriately, and not unnecessarily serialize your requests.

With networking protocols you often also have a requirement to minimize
packet reordering, so I understand it's a careful balance. But it is possible
to serialize the important stuff and still do the crypto out-of-order, which
would be really beneficial on _some_ platforms (which have HW crypto
acceleration but no such CPU extensions) at least.

> > So, I'd like to focus on what we can do to
> > improve crypto/simd.c to provide a synchronous implementation of
> > gcm(aes) for AES-NI when possible, which is the vast majority of the time.
> >
> > I would be interested in proposing a change to improve this issue, but
> > I'm not sure the direction that the maintainers of this code would
> > prefer. Since these changes to the crypto API are fairly recent, there
> > may be context that I am not aware of. However, I think it would be
> > straightforward to add another API to crypto/simd.c that allocates sync
> > algorithms, and I would be willing to do the work.
> >
> > The only challenge I see in implementing such a change is deciding how
> > to select a fallback algorithm. The most flexible solution would be to
> > call crypto_alloc_aead with CRYPTO_ALG_ASYNC during the init to pick the
> > "best" fallback (in case there is alternative HW offloading available),
> > but that would almost certainly pick itself and it's not obvious to me
> > how to avoid that.
>
> It's probably possible to add a PURE_SOFTWARE or whatever flag and
> request one of those algorithms for the fallback.
>
Forcing the use of sync algorithms only would be detrimental to platforms
that do not have CPU accelerated crypto, but do have HW acceleration
for crypto external to the CPU. I understand it's much easier to implement,
but that is just being lazy IMHO. For bulk crypto of relatively independent
blocks (networking packets, disk sectors), ASYNC should always be preferred.

> > On the other hand, the caller to the new API could
> > explicitly declare a fallback algorithm (e.g.,
> > "gcm_base(ctr(aes-aesni),ghash-generic)"), which probably is the correct
> > answer anyways --
>
> I would try to avoid that, it seems too error-prone to me.
>
> > what are the chances that there are multiple HW
> > offloads for gcm(aes)? In that case, a possible API would be:
> > int simd_register_aeads_compat_sync(struct aead_alg *algs,
> >                                     char **fallback_algs,
> >                                     int count,
> >             struct simd_aead_alg **simd_algs);
> >
> > Beyond MACsec, it's worth noting that the mac80211 code for AES-GCMP and
> > BIP-GMAC also uses gcm(aes) in sync mode because decryption occurs in a
> > softirq; however, I imagine nobody has reported an issue because the link
> > speed is typically slower and those encryption modes are still uncommon.
>
> Decent wireless cards would do the encryption in hw, no? Also, you
> can't notice a performance regression if it never used the fast
> implementation :)
>
> --
> Sabrina

Regards,
Pascal van Leeuwen
Silicon IP Architect Multi-Protocol Engines, Rambus Security
Rambus ROTW Holding BV
+31-73 6581953

Note: The Inside Secure/Verimatrix Silicon IP team was recently acquired by Rambus.
Please be so kind to update your e-mail address book with my new e-mail address.

** This message and any attachments are for the sole use of the intended recipient(s). It may contain information that is confidential and privileged. If you are not the intended recipient of this message, you are prohibited from printing, copying, forwarding or saving it. Please delete the message and attachments and notify the sender immediately. **

Rambus Inc.<http://www.rambus.com>


* Re: Severe performance regression in "net: macsec: preserve ingress frame ordering"
  2020-08-12 10:45         ` Van Leeuwen, Pascal
@ 2020-08-12 12:42           ` Andrew Lunn
  2020-08-24  9:07             ` Van Leeuwen, Pascal
  0 siblings, 1 reply; 12+ messages in thread
From: Andrew Lunn @ 2020-08-12 12:42 UTC (permalink / raw)
  To: Van Leeuwen, Pascal
  Cc: Sabrina Dubroca, Scott Dial, linux-crypto, Ryan Cox, netdev,
	davem, Antoine Tenart, ebiggers

> With networking protocols you often also have a requirement to minimize
> packet reordering, so I understand it's a careful balance. But it is possible
> to serialize the important stuff and still do the crypto out-of-order, which
> would be really beneficial on _some_ platforms (which have HW crypto
> acceleration but no such CPU extensions) at least.

Many Ethernet PHYs are also capable of doing MACsec as they
send/receive frames. Doing it in hardware in the PHY avoids all these
out-of-order and latency issues. Unfortunately, we are still in the
early days of PHY drivers actually implementing MACsec offload. At
the moment only the Microsemi PHY and Aquantia PHY via firmware in the
Atlantic NIC support this.

	 Andrew


* RE: Severe performance regression in "net: macsec: preserve ingress frame ordering"
  2020-08-12 12:42           ` Andrew Lunn
@ 2020-08-24  9:07             ` Van Leeuwen, Pascal
  2020-08-24 13:01               ` Andrew Lunn
  0 siblings, 1 reply; 12+ messages in thread
From: Van Leeuwen, Pascal @ 2020-08-24  9:07 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Sabrina Dubroca, Scott Dial, linux-crypto, Ryan Cox, netdev,
	davem, Antoine Tenart, ebiggers

> -----Original Message-----
> From: linux-crypto-owner@vger.kernel.org <linux-crypto-owner@vger.kernel.org> On Behalf Of Andrew Lunn
> Sent: Wednesday, August 12, 2020 2:42 PM
> To: Van Leeuwen, Pascal <pvanleeuwen@rambus.com>
> Cc: Sabrina Dubroca <sd@queasysnail.net>; Scott Dial <scott@scottdial.com>; linux-crypto@vger.kernel.org; Ryan Cox
> <ryan_cox@byu.edu>; netdev@vger.kernel.org; davem@davemloft.net; Antoine Tenart <antoine.tenart@bootlin.com>;
> ebiggers@google.com
> Subject: Re: Severe performance regression in "net: macsec: preserve ingress frame ordering"
>
> <<< External Email >>>
> > With networking protocols you often also have a requirement to minimize
> > packet reordering, so I understand it's a careful balance. But it is possible
> > to serialize the important stuff and still do the crypto out-of-order, which
> > would be really beneficial on _some_ platforms (which have HW crypto
> > acceleration but no such CPU extensions) at least.
>
> Many Ethernet PHYs are also capable of doing MACsec as they
> send/receive frames. Doing it in hardware in the PHY avoids all these
> out-of-order and latency issues. Unfortunately, we are still in the
> early days of PHY drivers actually implementing MACsec offload. At
> the moment only the Microsemi PHY and Aquantia PHY via firmware in the
> Atlantic NIC support this.
>
No need to point this out to me as we're the number one supplier of inline MACsec IP :-)
In fact, the Microsemi PHY solution you mention is ours, major parts of that design were
even created by these 2 hands here.  Full protocol offload is obviously the holy grail of HW
acceleration, and what we tend to strive for. The problem is always with the software
integration, so I'm happy to see a framework for inline MACsec acceleration being added to
the kernel.

Without such a protocol acceleration framework (which AFAIK doesn't exist for IPsec yet,
at least not in a generic form supporting all modes and ciphersuites), however, you fall
back to basic hash-encrypt/AEAD offload as the "best you can do".  And some low-cost
devices may still do it on the CPU to minimize silicon cost. So it is still very useful for the
crypto API path to be as efficient as possible, at least for the time being.

Regards,
Pascal van Leeuwen
Silicon IP Architect Multi-Protocol Engines, Rambus Security
Rambus ROTW Holding BV
+31-73 6581953



* Re: Severe performance regression in "net: macsec: preserve ingress frame ordering"
  2020-08-24  9:07             ` Van Leeuwen, Pascal
@ 2020-08-24 13:01               ` Andrew Lunn
  2020-08-25 13:09                 ` Van Leeuwen, Pascal
  0 siblings, 1 reply; 12+ messages in thread
From: Andrew Lunn @ 2020-08-24 13:01 UTC (permalink / raw)
  To: Van Leeuwen, Pascal
  Cc: Sabrina Dubroca, Scott Dial, linux-crypto, Ryan Cox, netdev,
	davem, Antoine Tenart, ebiggers

On Mon, Aug 24, 2020 at 09:07:26AM +0000, Van Leeuwen, Pascal wrote:
> No need to point this out to me as we're the number one supplier of inline MACsec IP :-)
> In fact, the Microsemi PHY solution you mention is ours, major parts of that design were
> even created by these 2 hands here.

Oh,  O.K.

Do you know of other silicon vendors which are using the same IP?
Maybe we can encourage them to share the driver, rather than re-invent
the wheel, which often happens when nobody realises it is basically
the same core with a different wrapper.

Thanks
	Andrew


* RE: Severe performance regression in "net: macsec: preserve ingress frame ordering"
  2020-08-24 13:01               ` Andrew Lunn
@ 2020-08-25 13:09                 ` Van Leeuwen, Pascal
  2020-08-25 13:33                   ` Andrew Lunn
  0 siblings, 1 reply; 12+ messages in thread
From: Van Leeuwen, Pascal @ 2020-08-25 13:09 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Sabrina Dubroca, Scott Dial, linux-crypto, Ryan Cox, netdev,
	davem, Antoine Tenart, ebiggers

> -----Original Message-----
> From: Andrew Lunn <andrew@lunn.ch>
> Sent: Monday, August 24, 2020 3:02 PM
> To: Van Leeuwen, Pascal <pvanleeuwen@rambus.com>
> Cc: Sabrina Dubroca <sd@queasysnail.net>; Scott Dial <scott@scottdial.com>; linux-crypto@vger.kernel.org; Ryan Cox
> <ryan_cox@byu.edu>; netdev@vger.kernel.org; davem@davemloft.net; Antoine Tenart <antoine.tenart@bootlin.com>;
> ebiggers@google.com
> Subject: Re: Severe performance regression in "net: macsec: preserve ingress frame ordering"
>
> <<< External Email >>>
> On Mon, Aug 24, 2020 at 09:07:26AM +0000, Van Leeuwen, Pascal wrote:
> > No need to point this out to me as we're the number one supplier of inline MACsec IP :-)
> > In fact, the Microsemi PHY solution you mention is ours, major parts of that design were
> > even created by these 2 hands here.
>
> Oh,  O.K.
>
> Do you know of other silicon vendors which are using the same IP?
>
I do, there are many. But unfortunately, I cannot disclose our customers unless this is already
public information, e.g. due to some press release or whatever.

> Maybe we can encourage them to share the driver, rather than re-invent
> the wheel, which often happens when nobody realises it is basically
> the same core with a different wrapper.
>
Yes, that could save a lot of duplication of code and effort. And it should be rather trivial to
move the MACsec stuff to a higher level as all it needs is some register access to PHY control
space and an interrupt callback. So it should be possible to define a simple API between the
MACsec driver and the PHY driver for that. I would expect a similar API to be useful for
MACsec-enabled PHYs using other MACsec solutions (i.e. not ours) as well ...

The problem is: who will do it? We can't do it, because we have no access to the actual HW.
Microsemi won't be motivated to do it, because it would only help the competition, so why
would they? So it would have to be some competitor also desiring MACsec support (for the
same MACsec IP), convincing the maintainer of the Microsemi driver to go along with the
changes. I guess it's not all that relevant until we hit that situation.

> Thanks
> Andrew

Regards,
Pascal van Leeuwen
Silicon IP Architect Multi-Protocol Engines, Rambus Security
Rambus ROTW Holding BV
+31-73 6581953



* Re: Severe performance regression in "net: macsec: preserve ingress frame ordering"
  2020-08-25 13:09                 ` Van Leeuwen, Pascal
@ 2020-08-25 13:33                   ` Andrew Lunn
  0 siblings, 0 replies; 12+ messages in thread
From: Andrew Lunn @ 2020-08-25 13:33 UTC (permalink / raw)
  To: Van Leeuwen, Pascal
  Cc: Sabrina Dubroca, Scott Dial, linux-crypto, Ryan Cox, netdev,
	davem, Antoine Tenart, ebiggers

On Tue, Aug 25, 2020 at 01:09:31PM +0000, Van Leeuwen, Pascal wrote:
> > -----Original Message-----
> > From: Andrew Lunn <andrew@lunn.ch>
> > Sent: Monday, August 24, 2020 3:02 PM
> > To: Van Leeuwen, Pascal <pvanleeuwen@rambus.com>
> > Cc: Sabrina Dubroca <sd@queasysnail.net>; Scott Dial <scott@scottdial.com>; linux-crypto@vger.kernel.org; Ryan Cox
> > <ryan_cox@byu.edu>; netdev@vger.kernel.org; davem@davemloft.net; Antoine Tenart <antoine.tenart@bootlin.com>;
> > ebiggers@google.com
> > Subject: Re: Severe performance regression in "net: macsec: preserve ingress frame ordering"
> >
> > <<< External Email >>>
> > On Mon, Aug 24, 2020 at 09:07:26AM +0000, Van Leeuwen, Pascal wrote:
> > > No need to point this out to me as we're the number one supplier of inline MACsec IP :-)
> > > In fact, the Microsemi PHY solution you mention is ours, major parts of that design were
> > > even created by these 2 hands here.
> >
> > Oh,  O.K.
> >
> > Do you know of other silicon vendors which are using the same IP?
> >
> I do, there are many. But unfortunately, I cannot disclose our customers unless this is already
> public information, e.g. due to some press release or whatever.

O.K. Maybe i should flip the question around. If somebody was to
submit a driver, how would i quickly determine it is your IP? Any
particular patterns i should look for?

> > Maybe we can encourage them to share the driver, rather than re-invent
> > the wheel, which often happens when nobody realises it is basically
> > the same core with a different wrapper.
> >
> Yes, that could save a lot of duplication of code and effort.

It would save a lot of effort. But not code duplication. Because if i
or one of the other maintainers notices it is just your IP with a
different wrapper, we would NACK the patch and tell them to refactor
the MSCC driver. There is a long established precedence for that.

> The problem is: who will do it? We can't do it, because we have no
> access to the actual HW.

Microsemi are very friendly. If you ask them, i'm sure they would send
you a board. I assume you also have some sort of FPGA setup you use
for your own testing? That gives you two platforms. And if there are
many PHYs using your IP, it should not be too hard to just go buy a
reference design kit from a vendor.

And there is the marketing aspect for Rambus. You can say your IP is
easy to use, the core code is already in the kernel, supported and
well tested, you just need to add a few wrapper functions in your
driver. No vendor crap driver needed.

	Andrew


end of thread, other threads:[~2020-08-25 13:33 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-06 21:11 Severe performance regression in "net: macsec: preserve ingress frame ordering" Ryan Cox
2020-08-07  3:48 ` Scott Dial
2020-08-07 23:21   ` Ryan Cox
2020-08-10 13:34   ` Sabrina Dubroca
2020-08-10 16:09     ` Scott Dial
2020-08-12 10:04       ` Sabrina Dubroca
2020-08-12 10:45         ` Van Leeuwen, Pascal
2020-08-12 12:42           ` Andrew Lunn
2020-08-24  9:07             ` Van Leeuwen, Pascal
2020-08-24 13:01               ` Andrew Lunn
2020-08-25 13:09                 ` Van Leeuwen, Pascal
2020-08-25 13:33                   ` Andrew Lunn
