* Looking for non-NIC hardware-offload for wpa2 decrypt.
@ 2014-03-31  4:40 Ben Greear
  2014-03-31 18:09 ` Christian Lamparter
  0 siblings, 1 reply; 21+ messages in thread
From: Ben Greear @ 2014-03-31  4:40 UTC (permalink / raw)
  To: linux-wireless

Hello!

Due to hardware/firmware limitations, it does not appear possible to
have a wifi NIC do hardware decrypt when using multiple stations on a single
NIC (and have both stations connected to the same AP).

This just happens to be one of my favourite things to do, and it kills
performance compared to normal 'Open' throughput.

I am curious if anyone knows of any way to accelerate rx-decrypt, perhaps by
using a specialized hardware board or maybe a feature of certain CPUs?

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



* Re: Looking for non-NIC hardware-offload for wpa2 decrypt.
  2014-03-31  4:40 Looking for non-NIC hardware-offload for wpa2 decrypt Ben Greear
@ 2014-03-31 18:09 ` Christian Lamparter
  2014-07-28 20:50   ` Ben Greear
  0 siblings, 1 reply; 21+ messages in thread
From: Christian Lamparter @ 2014-03-31 18:09 UTC (permalink / raw)
  To: Ben Greear; +Cc: linux-wireless

Hello,

On Sunday, March 30, 2014 09:40:24 PM Ben Greear wrote:
> Due to hardware/firmware limitations, it does not appear possible to
> have a wifi NIC do hardware decrypt when using multiple stations on a single
> NIC (and have both stations connected to the same AP).
> 
> This just happens to be one of my favourite things to do, and it kills
> performance compared to normal 'Open' throughput.
> 
> I am curious if anyone knows of any way to accelerate rx-decrypt, perhaps by
> using a specialized hardware board or maybe a feature of certain CPUs?

You could check whether your CPU (along with the BIOS and kernel) supports
AES-NI [0]. AFAICT mac80211 uses the cryptoapi, so anything that provides
the proper crypto bindings can be used to accelerate the encryption and
decryption process to some degree. And it just happens that, thanks to
AES-NI, parts of the math can be calculated efficiently by the CPU.
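
For the CPU part of that check, one rough approach is to look for the "aes"
token on the "flags" line of /proc/cpuinfo. The following is only a minimal
sketch: the helper names line_has_flag and cpuinfo_has_flag are made up for
illustration, and a set flag says nothing about whether the kernel's aesni
driver is actually loaded and selected.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return 1 if `flag` appears in `line` as a whole
 * whitespace-separated token, so "aes" does not match "vaes". */
int line_has_flag(const char *line, const char *flag)
{
    size_t flen = strlen(flag);
    const char *p = line;

    while (*p) {
        /* skip separators between tokens */
        while (*p == ' ' || *p == '\t' || *p == '\n')
            p++;
        const char *start = p;
        /* advance to the end of the current token */
        while (*p && *p != ' ' && *p != '\t' && *p != '\n')
            p++;
        if (flen > 0 && (size_t)(p - start) == flen &&
            strncmp(start, flag, flen) == 0)
            return 1;
    }
    return 0;
}

/* Hypothetical helper: scan a cpuinfo-style file for `flag` on a line that
 * starts with "flags". Returns 1 if found, 0 if not, -1 if the file cannot
 * be opened. Assumes a flags line fits into the 8 KiB buffer. */
int cpuinfo_has_flag(const char *path, const char *flag)
{
    char line[8192];
    int found = 0;
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    while (!found && fgets(line, sizeof(line), f)) {
        if (strncmp(line, "flags", 5) == 0)
            found = line_has_flag(line, flag);
    }
    fclose(f);
    return found;
}
```

Calling cpuinfo_has_flag("/proc/cpuinfo", "aes") on an x86 box should then
return 1 when the CPU advertises the AES instructions.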

Regards,
Chr

[0] <http://en.wikipedia.org/wiki/AES_instruction_set>



* Re: Looking for non-NIC hardware-offload for wpa2 decrypt.
  2014-03-31 18:09 ` Christian Lamparter
@ 2014-07-28 20:50   ` Ben Greear
  2014-07-29 22:29     ` Christian Lamparter
  0 siblings, 1 reply; 21+ messages in thread
From: Ben Greear @ 2014-07-28 20:50 UTC (permalink / raw)
  To: Christian Lamparter; +Cc: linux-wireless

On 03/31/2014 11:09 AM, Christian Lamparter wrote:
> Hello,
> 
> On Sunday, March 30, 2014 09:40:24 PM Ben Greear wrote:
>> Due to hardware/firmware limitations, it does not appear possible to
>> have a wifi NIC do hardware decrypt when using multiple stations on a single
>> NIC (and have both stations connected to the same AP).
>>
>> This just happens to be one of my favourite things to do, and it kills
>> performance compared to normal 'Open' throughput.
>>
>> I am curious if anyone knows of any way to accelerate rx-decrypt, perhaps by
>> using a specialized hardware board or maybe a feature of certain CPUs?
> 
> You could check if your CPU (bios and kernel) have support for AES-NI [0].
> AFAICT mac80211 utilizes the cryptoapi. Therefore anything that supports
> the proper crypto bindings can be used to accelerate the encryption and
> decryption process to some degree. And it just happens that thanks to
> AES-NI parts of math can be efficiently calculated by the CPU. 

I recently took a look at this again, and the Intel E5 I'm using
does use the aesni instructions/driver as far as I can tell.

Throughput is still around 500Mbps where open is around 800Mbps.

perf top shows this:

Samples: 37K of event 'cycles', Event count (approx.): 19360716192
 12.01%  [kernel]                                      [k] math_state_restore
 11.64%  [kernel]                                      [k] _aesni_enc1
  8.25%  [kernel]                                      [k] __save_init_fpu
  2.44%  [kernel]                                      [k] crypto_xor
  1.87%  [kernel]                                      [k] irq_fpu_usable
  1.30%  [kernel]                                      [k] aes_encrypt
  0.76%  [kernel]                                      [k] __kernel_fpu_end
....


Any other magic add-in cards that would somehow just make this all faster w/out
having to do any real programming work? :)

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



* Re: Looking for non-NIC hardware-offload for wpa2 decrypt.
  2014-07-28 20:50   ` Ben Greear
@ 2014-07-29 22:29     ` Christian Lamparter
  2014-07-29 22:50       ` Ben Greear
  2014-07-30  7:06       ` Johannes Berg
  0 siblings, 2 replies; 21+ messages in thread
From: Christian Lamparter @ 2014-07-29 22:29 UTC (permalink / raw)
  To: Ben Greear; +Cc: linux-wireless

On Monday, July 28, 2014 01:50:22 PM Ben Greear wrote:
> On 03/31/2014 11:09 AM, Christian Lamparter wrote:
> > Hello,
> > 
> > On Sunday, March 30, 2014 09:40:24 PM Ben Greear wrote:
> >> Due to hardware/firmware limitations, it does not appear possible to
> >> have a wifi NIC do hardware decrypt when using multiple stations on a single
> >> NIC (and have both stations connected to the same AP).
> >>
> >> This just happens to be one of my favourite things to do, and it kills
> >> performance compared to normal 'Open' throughput.
> >>
> >> I am curious if anyone knows of any way to accelerate rx-decrypt, perhaps by
> >> using a specialized hardware board or maybe a feature of certain CPUs?
> > 
> > You could check if your CPU (bios and kernel) have support for AES-NI [0].
> > AFAICT mac80211 utilizes the cryptoapi. Therefore anything that supports
> > the proper crypto bindings can be used to accelerate the encryption and
> > decryption process to some degree. And it just happens that thanks to
> > AES-NI parts of math can be efficiently calculated by the CPU. 
> 
> I recently took a look at this again, and the Intel E5 I'm using
> does use the aesni instructions/driver as far as I can tell.
Which E5 exactly? There are many different E5. 

> Throughput is still around 500Mbps where open is around 800Mbps.
I can't test ath10k or your multiple station on a single NIC thing. But
can you run a test for a "simple" single station - single AP wpa2 setup?
I want to know how close to the 800Mbps it actually goes.

> perf top shows this:
> 
> Samples: 37K of event 'cycles', Event count (approx.): 19360716192
>  12.01%  [kernel]                                      [k] math_state_restore
>  11.64%  [kernel]                                      [k] _aesni_enc1
>   8.25%  [kernel]                                      [k] __save_init_fpu
>   2.44%  [kernel]                                      [k] crypto_xor
>   1.87%  [kernel]                                      [k] irq_fpu_usable
>   1.30%  [kernel]                                      [k] aes_encrypt
>   0.76%  [kernel]                                      [k] __kernel_fpu_end
> ....
Yes, aesni is doing some of the heavy lifting! But in your original post,
you said you were interested in accelerating rx-decrypt... Now it's about
encryption offload?! [please make up your mind :-D]

That being said, 12.01% (math_state_restore, called by kernel_fpu_end)
and 8.25% (__save_init_fpu, called by kernel_fpu_begin) of the cycles,
about 20% in total, are wasted on FPU save and restore overhead.
[You have noticed that before, haven't you ;-) ]

I think part of the poor performance is due to the design of
aes_encrypt in arch/x86/crypto/aesni-intel_glue.c:

> static void aes_encrypt(struct crypto_tfm *tfm, u8 *dst, const u8 *src)
> {
>        struct crypto_aes_ctx *ctx = aes_ctx(crypto_tfm_ctx(tfm));
>        [...]
>                kernel_fpu_begin();
>                aesni_enc(ctx, dst, src);
>                kernel_fpu_end();
>        [...]
> }

Ideally you would want something like:

>                kernel_fpu_begin();
>                aesni_enc(ctx, dst_frame1, src_frame1);
>                aesni_enc(ctx, dst_frame2, src_frame2);
>                ...
>                aesni_enc(ctx, dst_frameN, src_frameN);
>                kernel_fpu_end();

But getting there might not be easy and may involve more than a bit
of "real programming".
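
To get a feel for how much bracketing could be saved, here is a small
userspace model. It is only a sketch under assumptions: model_fpu_begin,
model_fpu_end and model_encrypt_block are hypothetical counters standing in
for kernel_fpu_begin(), kernel_fpu_end() and aesni_enc(), and it counts calls
rather than measuring cycles.

```c
#include <stddef.h>

/* Call counters for the model; nothing here touches real FPU state. */
struct fpu_stats {
    unsigned long saves;    /* modelled kernel_fpu_begin() calls */
    unsigned long restores; /* modelled kernel_fpu_end() calls */
    unsigned long blocks;   /* modelled single-block encryptions */
};

/* Hypothetical stand-ins: they only count how often they are called. */
static void model_fpu_begin(struct fpu_stats *s)     { s->saves++; }
static void model_fpu_end(struct fpu_stats *s)       { s->restores++; }
static void model_encrypt_block(struct fpu_stats *s) { s->blocks++; }

/* Pattern of the quoted aes_encrypt(): one begin/end pair per block. */
void encrypt_per_block(struct fpu_stats *s, size_t nblocks)
{
    for (size_t i = 0; i < nblocks; i++) {
        model_fpu_begin(s);
        model_encrypt_block(s);
        model_fpu_end(s);
    }
}

/* Batched pattern: one begin/end pair around all blocks. */
void encrypt_batched(struct fpu_stats *s, size_t nblocks)
{
    if (nblocks == 0)
        return;
    model_fpu_begin(s);
    for (size_t i = 0; i < nblocks; i++)
        model_encrypt_block(s);
    model_fpu_end(s);
}

/* Number of 16-byte AES blocks needed to cover `len` bytes. */
size_t aes_blocks(size_t len)
{
    return (len + 15) / 16;
}
```

A 1500-byte payload covers 94 blocks, so the per-block pattern pays for 94
save/restore pairs where the batched one pays for a single pair. CCM touches
each block more than once (CTR plus CBC-MAC), so the real number of block
operations per frame would be higher than in this model.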

In theory, it should be enough to test if there is some potential
in this approach by "enhancing" the tx-path in the following way:

1. the kernel_fpu_begin and kernel_fpu_end calls should be added to
ieee80211_crypto_ccmp_encrypt in net/mac80211/wpa.c.

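>+       kernel_fpu_begin();
>        skb_queue_walk(&tx->skbs, skb) {
>-               if (ccmp_encrypt_skb(tx, skb) < 0)
>+               if (ccmp_encrypt_skb(tx, skb) < 0) {
>+                       /* leave the FPU section before bailing out */
>+                       kernel_fpu_end();
>                        return TX_DROP;
>+               }
>        }
>+       kernel_fpu_end();
>
>        return TX_CONTINUE;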

2. ieee80211_aes_ccm_encrypt in net/mac80211/aes_ccm.c
has to call __aes_encrypt instead of aes_encrypt in crypto_aead_encrypt.
[I can't think of a sane way to make this work. Of course, it's possible to
make a copy of ccm(aes) crypto_alg* and overwrite aes_encrypt with
__aes_encrypt. But that's not very nice... (It should work though) ]

> Any other magic add-in cards that would somehow just make this all faster w/out
> having to do any real programming work? :)
I doubt there is a magic add-in card for such a use case. I think most of
them directly target applications/libraries and not the kernel crypto
interface mac80211 is using.

[It would be really nice to know what E5 you actually have]

Regards
Christian


* Re: Looking for non-NIC hardware-offload for wpa2 decrypt.
  2014-07-29 22:29     ` Christian Lamparter
@ 2014-07-29 22:50       ` Ben Greear
  2014-07-30 18:59         ` Christian Lamparter
  2014-07-30  7:06       ` Johannes Berg
  1 sibling, 1 reply; 21+ messages in thread
From: Ben Greear @ 2014-07-29 22:50 UTC (permalink / raw)
  To: Christian Lamparter; +Cc: linux-wireless

On 07/29/2014 03:29 PM, Christian Lamparter wrote:
> On Monday, July 28, 2014 01:50:22 PM Ben Greear wrote:
>> On 03/31/2014 11:09 AM, Christian Lamparter wrote:
>>> Hello,
>>>
>>> On Sunday, March 30, 2014 09:40:24 PM Ben Greear wrote:
>>>> Due to hardware/firmware limitations, it does not appear possible to
>>>> have a wifi NIC do hardware decrypt when using multiple stations on a single
>>>> NIC (and have both stations connected to the same AP).
>>>>
>>>> This just happens to be one of my favourite things to do, and it kills
>>>> performance compared to normal 'Open' throughput.
>>>>
>>>> I am curious if anyone knows of any way to accelerate rx-decrypt, perhaps by
>>>> using a specialized hardware board or maybe a feature of certain CPUs?
>>>
>>> You could check if your CPU (bios and kernel) have support for AES-NI [0].
>>> AFAICT mac80211 utilizes the cryptoapi. Therefore anything that supports
>>> the proper crypto bindings can be used to accelerate the encryption and
>>> decryption process to some degree. And it just happens that thanks to
>>> AES-NI parts of math can be efficiently calculated by the CPU. 
>>
>> I recently took a look at this again, and the Intel E5 I'm using
>> does use the aesni instructions/driver as far as I can tell.
> Which E5 exactly? There are many different E5. 
> 
>> Throughput is still around 500Mbps where open is around 800Mbps.
> I can't test ath10k or your multiple station on a single NIC thing. But
> can you run a test for a "simple" single station - single AP wpa2 setup?
> I want to know how close to the 800Mbps it actually goes.
> 
>> perf top shows this:
>>
>> Samples: 37K of event 'cycles', Event count (approx.): 19360716192
>>  12.01%  [kernel]                                      [k] math_state_restore
>>  11.64%  [kernel]                                      [k] _aesni_enc1
>>   8.25%  [kernel]                                      [k] __save_init_fpu
>>   2.44%  [kernel]                                      [k] crypto_xor
>>   1.87%  [kernel]                                      [k] irq_fpu_usable
>>   1.30%  [kernel]                                      [k] aes_encrypt
>>   0.76%  [kernel]                                      [k] __kernel_fpu_end
>> ....
> Yes, aesni is doing some of the heavy lifting! But in your original post,
> you said you are interested in accelerate rx-decrypt... Now it's about 
> encryption offload?! [please make up your mind :-D]

The perf top results above are from receiving (and decoding) wpa2 wifi
frames that were not decoded by the NIC because NIC rx-decrypt logic was
disabled.  I think this means I want to accelerate the rx-decrypt.

Transmit is not a problem for me because I can make the NIC do the
encryption in its hardware.


My E5 is:

[root@ct525-2u-3ac-3n]# cat /proc/cpuinfo
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 62
model name	: Intel(R) Xeon(R) CPU E5-1660 v2 @ 3.70GHz
stepping	: 4
microcode	: 0x427
cpu MHz		: 2163.054
cache size	: 15360 KB
physical id	: 0
siblings	: 12
core id		: 0
cpu cores	: 6
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm
constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr
pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority
ept vpid fsgsbase smep erms
bogomips	: 7400.31
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

.... 11 more entries.


Thanks for the suggestions below.  I have managed to find yet another
way to crash my firmware so I have to pay attention to that for a bit,
but will look into that decrypt code in more detail when I get a chance.

Thanks,
Ben


> That being said 12.01% (math_state_restore -  
> called by kernel_fpu_end) and 8.25% (__save_init_fpu - called 
> by kernel_fpu_begin) cycles are wasted due fpu save and 
> restore overhead. [You have noticed that before, didn't you ;-) ]
> 
> I think part of the poor performance is due to the design of
> aes_encrypt in arch/x86/crypto/aesni-intel_glue.c:
> 
>> static void aes_encrypt(struct crypto_tfm *tfm, u8 *dst, const u8 *src)
>> {
>>        struct crypto_aes_ctx *ctx = aes_ctx(crypto_tfm_ctx(tfm));
>>        [...]
>>                kernel_fpu_begin();
>>                aesni_enc(ctx, dst, src);
>>                kernel_fpu_end();
>>        [...]
>> }
> 
> Ideally you would want something like:
> 
>>                kernel_fpu_begin();
>>                aesni_enc(ctx, dst_frame1, src_frame1);
>>                aesni_enc(ctx, dst_frame2, src_frame2);
>>                ...
>>                aesni_enc(ctx, dst_frameN, src_frameN);
>>                kernel_fpu_end();
> 
> But getting there might not be easy and involve more than a bit
> of "real programming".
> 
> In theory, it should be enough to test if there is some potential
> in this approach by "enhancing" the tx-path in the following way:
> 
> 1. the fpu_begin and fpu_end calls should be added to
> ieee80211_crypto_ccmp_encrypt in net/mac80211/wpa.c.
> 
>> +     kernel_fpu_begin();
>>        skb_queue_walk(&tx->skbs, skb) {
>>                if (ccmp_encrypt_skb(tx, skb) < 0)
>>                        return TX_DROP;
>>        }
>> +      kernel_fpu_end();
>>
>>       return TX_CONTINUE;
> 
> 2. ieee80211_aes_ccm_encrypt in net/mac80211/aes_ccm.c
> has to call __aes_encrypt instead of aes_encrypt in crypto_aead_encrypt.
> [I can't think of a sane way to make this work. Of course, it's possible to
> make a copy of ccm(aes) crypto_alg* and overwrite aes_encrypt with
> __aes_encrypt. But that's not very nice... (It should work though) ]
> 
>> Any other magic add-in cards that would somehow just make this all faster w/out
>> having to do any real programming work? :)
> I doubt there is an magic add-in card for such a use-case. I think most of
> them target directly applications/libraries and not the crypto-kernel
> interface mac80211 is using.
> 
> [It would be really nice to know what E5 you actually have]
> 
> Regards
> Christian
> 


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



* Re: Looking for non-NIC hardware-offload for wpa2 decrypt.
  2014-07-29 22:29     ` Christian Lamparter
  2014-07-29 22:50       ` Ben Greear
@ 2014-07-30  7:06       ` Johannes Berg
  1 sibling, 0 replies; 21+ messages in thread
From: Johannes Berg @ 2014-07-30  7:06 UTC (permalink / raw)
  To: Christian Lamparter; +Cc: Ben Greear, linux-wireless

On Wed, 2014-07-30 at 00:29 +0200, Christian Lamparter wrote:

> 1. the fpu_begin and fpu_end calls should be added to
> ieee80211_crypto_ccmp_encrypt in net/mac80211/wpa.c.
> 
> >+     kernel_fpu_begin();
> >        skb_queue_walk(&tx->skbs, skb) {
> >                if (ccmp_encrypt_skb(tx, skb) < 0)
> >                        return TX_DROP;
> >        }
> >+      kernel_fpu_end();
> >
> >       return TX_CONTINUE;

I don't really want to jump in here but I'll point out that this would
be mostly useless afaict as the list is only iterated if you have
software fragmentation.

johannes



* Re: Looking for non-NIC hardware-offload for wpa2 decrypt.
  2014-07-29 22:50       ` Ben Greear
@ 2014-07-30 18:59         ` Christian Lamparter
  2014-07-30 19:08           ` Ben Greear
  2014-07-31 20:05           ` Jouni Malinen
  0 siblings, 2 replies; 21+ messages in thread
From: Christian Lamparter @ 2014-07-30 18:59 UTC (permalink / raw)
  To: Ben Greear; +Cc: linux-wireless, Johannes Berg

On Tuesday, July 29, 2014 03:50:40 PM Ben Greear wrote:
> On 07/29/2014 03:29 PM, Christian Lamparter wrote:
> > On Monday, July 28, 2014 01:50:22 PM Ben Greear wrote:
> >> On 03/31/2014 11:09 AM, Christian Lamparter wrote:
> >>> Hello,
> >>>
> >>> On Sunday, March 30, 2014 09:40:24 PM Ben Greear wrote:
> >>>> Due to hardware/firmware limitations, it does not appear possible to
> >>>> have a wifi NIC do hardware decrypt when using multiple stations on a single
> >>>> NIC (and have both stations connected to the same AP).
> >>>>
> >>>> This just happens to be one of my favourite things to do, and it kills
> >>>> performance compared to normal 'Open' throughput.
> >>>>
> >>>> I am curious if anyone knows of any way to accelerate rx-decrypt, perhaps by
> >>>> using a specialized hardware board or maybe a feature of certain CPUs?
> >>>
> >>> You could check if your CPU (bios and kernel) have support for AES-NI [0].
> >>> AFAICT mac80211 utilizes the cryptoapi. Therefore anything that supports
> >>> the proper crypto bindings can be used to accelerate the encryption and
> >>> decryption process to some degree. And it just happens that thanks to
> >>> AES-NI parts of math can be efficiently calculated by the CPU. 
> >>
> >> I recently took a look at this again, and the Intel E5 I'm using
> >> does use the aesni instructions/driver as far as I can tell.
> > Which E5 exactly? There are many different E5. 

> model name	: Intel(R) Xeon(R) CPU E5-1660 v2 @ 3.70GHz
> stepping	: 4
> microcode	: 0x427
> cpu MHz		: 2163.054
Thanks. 500Mbps should not be an issue though. At 3.70GHz a single
core should be able to encrypt/decrypt several Gbps.

> >> Throughput is still around 500Mbps where open is around 800Mbps.
> > I can't test ath10k or your multiple station on a single NIC thing. But
> > can you run a test for a "simple" single station - single AP wpa2 setup?
> > I want to know how close to the 800Mbps it actually goes.
Any data for the single station, single AP, wpa2 setup? I would like to know
what ath10k is able to achieve in this case.

> >> perf top shows this:
> >>
> >> Samples: 37K of event 'cycles', Event count (approx.): 19360716192
> >>  12.01%  [kernel]                                      [k] math_state_restore
> >>  11.64%  [kernel]                                      [k] _aesni_enc1
> >>   8.25%  [kernel]                                      [k] __save_init_fpu
> >>   2.44%  [kernel]                                      [k] crypto_xor
> >>   1.87%  [kernel]                                      [k] irq_fpu_usable
> >>   1.30%  [kernel]                                      [k] aes_encrypt
> >>   0.76%  [kernel]                                      [k] __kernel_fpu_end
> >> ....
> > Yes, aesni is doing some of the heavy lifting! But in your original post,
> > you said you are interested in accelerate rx-decrypt... Now it's about 
> > encryption offload?! [please make up your mind :-D]
> 
> The perf top results above are from receiving (and decoding) wpa2 wifi
> frames that were not decoded by the NIC because NIC rx-decrypt logic was
> disabled.  I think this means I want to accelerate the rx-decrypt.
Wait.

If you have disabled the rx-decrypt logic of ath10k, then why aren't
_aesni_dec1 or aes_decrypt listed in the perf top result? I think they
should be. Have you removed them from the "perf top results" or are they
really absent altogether?

Because, from this perf result, it looks like your CPU is not burdened by
the incoming RX at all?! Instead it is busy with the encryption of frames
it will be transmitting (in the case of tcp, these could be tcp acks).

It could be that I missed something important about the setup.
For example, I assumed that you have a dedicated 802.11ac AP
and the perf results are coming from the E5 machine with the ath10k
in multi-station mode. The AP would be transmitting, whereas
the E5 would be receiving. Is this assumption correct or not?

> Transmit is not a problem for me because I can make the NIC do the
> encryption in it's hardware.

> Thanks for the suggestions below.  I have managed to find yet another
> way to crash my firmware so I have to pay attention to that for a bit,
> but will look into that decrypt code in more detail when I get a chance.

Yeah, but don't bother with the suggestions. Johannes pointed out "that
this would be mostly useless afaict as the list is only iterated if you have
software fragmentation." Furthermore, they only covered the ENcryption 
process of the TX path and not the DEcryption part of the RX path (which
is what you are interested in).

Regards
Christian


* Re: Looking for non-NIC hardware-offload for wpa2 decrypt.
  2014-07-30 18:59         ` Christian Lamparter
@ 2014-07-30 19:08           ` Ben Greear
  2014-07-31 20:05           ` Jouni Malinen
  1 sibling, 0 replies; 21+ messages in thread
From: Ben Greear @ 2014-07-30 19:08 UTC (permalink / raw)
  To: Christian Lamparter; +Cc: linux-wireless, Johannes Berg

On 07/30/2014 11:59 AM, Christian Lamparter wrote:
> On Tuesday, July 29, 2014 03:50:40 PM Ben Greear wrote:
>> On 07/29/2014 03:29 PM, Christian Lamparter wrote:
>>> On Monday, July 28, 2014 01:50:22 PM Ben Greear wrote:
>>>> On 03/31/2014 11:09 AM, Christian Lamparter wrote:
>>>>> Hello,
>>>>>
>>>>> On Sunday, March 30, 2014 09:40:24 PM Ben Greear wrote:
>>>>>> Due to hardware/firmware limitations, it does not appear possible to
>>>>>> have a wifi NIC do hardware decrypt when using multiple stations on a single
>>>>>> NIC (and have both stations connected to the same AP).
>>>>>>
>>>>>> This just happens to be one of my favourite things to do, and it kills
>>>>>> performance compared to normal 'Open' throughput.
>>>>>>
>>>>>> I am curious if anyone knows of any way to accelerate rx-decrypt, perhaps by
>>>>>> using a specialized hardware board or maybe a feature of certain CPUs?
>>>>>
>>>>> You could check if your CPU (bios and kernel) have support for AES-NI [0].
>>>>> AFAICT mac80211 utilizes the cryptoapi. Therefore anything that supports
>>>>> the proper crypto bindings can be used to accelerate the encryption and
>>>>> decryption process to some degree. And it just happens that thanks to
>>>>> AES-NI parts of math can be efficiently calculated by the CPU. 
>>>>
>>>> I recently took a look at this again, and the Intel E5 I'm using
>>>> does use the aesni instructions/driver as far as I can tell.
>>> Which E5 exactly? There are many different E5. 
> 
>> model name	: Intel(R) Xeon(R) CPU E5-1660 v2 @ 3.70GHz
>> stepping	: 4
>> microcode	: 0x427
>> cpu MHz		: 2163.054
> Thanks. 500Mbps should not be a issue though. At 3,70GHz one single
> core should be able to encrypt/decrypt several Gbps.
> 
>>>> Throughput is still around 500Mbps where open is around 800Mbps.
>>> I can't test ath10k or your multiple station on a single NIC thing. But
>>> can you run a test for a "simple" single station - single AP wpa2 setup?
>>> I want to know how close to the 800Mbps it actually goes.
> Any data for the single station, single AP, wpa2 setup? I would like to know
> what ath10k is able to achieve in this case.

I will run this when I get a chance and let you know.

But, exact same setup (same number of stations, etc), but just with
open authentication, runs 800+Mbps.

>>>> perf top shows this:
>>>>
>>>> Samples: 37K of event 'cycles', Event count (approx.): 19360716192
>>>>  12.01%  [kernel]                                      [k] math_state_restore
>>>>  11.64%  [kernel]                                      [k] _aesni_enc1
>>>>   8.25%  [kernel]                                      [k] __save_init_fpu
>>>>   2.44%  [kernel]                                      [k] crypto_xor
>>>>   1.87%  [kernel]                                      [k] irq_fpu_usable
>>>>   1.30%  [kernel]                                      [k] aes_encrypt
>>>>   0.76%  [kernel]                                      [k] __kernel_fpu_end
>>>> ....
>>> Yes, aesni is doing some of the heavy lifting! But in your original post,
>>> you said you are interested in accelerate rx-decrypt... Now it's about 
>>> encryption offload?! [please make up your mind :-D]
>>
>> The perf top results above are from receiving (and decoding) wpa2 wifi
>> frames that were not decoded by the NIC because NIC rx-decrypt logic was
>> disabled.  I think this means I want to accelerate the rx-decrypt.
> Wait.
> 
> If you have disabled rx-decrypt logic of ath10k, then why isn't _aesni_dec1
> or aes_decrypt listed in the perf top result? I think they should be. Have you
> removed them from the "perf top results" or are they really absent 
> altogether? 
> 
> Because, from this perf result, it looks like your CPU is not burden by the
> incoming RX at all?! Instead it is busy with the encryption of frames
> it will be transmitting (in case of tcp, this could be tcp acks).
> 
> It could be that I missed something important about the setup.
> For example, I assumed that you have a dedicated 802.11ac AP
> and the perf results are coming from the E5 machine with the ath10k
> in multi-station mode. The AP would be transmitting, whereas
> the E5 would be receiving. Is this assumption correct or not?

My setup is where AP is transmitting and E5 is receiving.

Test case is UDP, so very little upstream traffic.

I did not trim anything off the top of perf top, and did not notice
any other aesni calls listed.  I do not particularly know why it is
calling _aesni_enc1; I had assumed that was how it decoded.

I will double-check all of this and try to figure out why
it is calling the enc1 instead of the dec1 logic.  Possibly I
have something that is actually configured differently than
I think it is.

Also, good to hear my E5 should be able to handle higher
speeds, gives me something to hope for :)

Thanks,
Ben


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



* Re: Looking for non-NIC hardware-offload for wpa2 decrypt.
  2014-07-30 18:59         ` Christian Lamparter
  2014-07-30 19:08           ` Ben Greear
@ 2014-07-31 20:05           ` Jouni Malinen
  2014-07-31 20:45             ` Christian Lamparter
  1 sibling, 1 reply; 21+ messages in thread
From: Jouni Malinen @ 2014-07-31 20:05 UTC (permalink / raw)
  To: Christian Lamparter; +Cc: Ben Greear, linux-wireless, Johannes Berg

On Wed, Jul 30, 2014 at 08:59:33PM +0200, Christian Lamparter wrote:
> If you have disabled rx-decrypt logic of ath10k, then why isn't _aesni_dec1
> or aes_decrypt listed in the perf top result? I think they should be. Have you
> removed them from the "perf top results" or are they really absent 
> altogether? 
> 
> Because, from this perf result, it looks like your CPU is not burden by the
> incoming RX at all?! Instead it is busy with the encryption of frames
> it will be transmitting (in case of tcp, this could be tcp acks).

Keep in mind that this is CCMP, i.e., AES in CCM (Counter with CBC-MAC)
mode. The CCM mode uses only the block cipher encryption function, i.e.,
you won't be seeing aes_decrypt or _aesni_dec1 for this even on the RX
path (AES encryption operations are used to generate the key stream
blocks for CCM decryption).
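
To illustrate why only the forward direction shows up: in counter mode the
same routine both encrypts and decrypts, because the data is simply XORed
with E(key, counter block). A toy sketch, with loud assumptions:
toy_encrypt_block is a made-up mixing function and NOT AES, the counter block
layout (12-byte nonce plus 32-bit counter) is simplified and is not the CCMP
one, and the CBC-MAC part of CCM is left out entirely.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLK 16

/* Toy keyed mixing function standing in for the block cipher's forward
 * (encrypt) operation. NOT AES and not secure; illustration only. */
void toy_encrypt_block(const uint8_t key[BLK], const uint8_t in[BLK],
                       uint8_t out[BLK])
{
    uint8_t acc = 0x5a;

    for (int i = 0; i < BLK; i++) {
        uint8_t v = (uint8_t)(in[i] ^ key[i]);

        /* rotate left by 3 and accumulate, then mix in the next key byte */
        acc = (uint8_t)(acc + (uint8_t)((v << 3) | (v >> 5)));
        out[i] = (uint8_t)(acc ^ key[(i + 1) % BLK]);
    }
}

/* CTR-style transform: XOR `data` in place with the key stream
 * E(key, nonce || counter). Calling it twice with the same key and nonce
 * restores the input, so "decryption" never needs an inverse cipher. */
void ctr_crypt(const uint8_t key[BLK], const uint8_t nonce[12],
               uint8_t *data, size_t len)
{
    uint8_t ctr_block[BLK];
    uint8_t keystream[BLK];
    uint32_t counter = 1;

    memcpy(ctr_block, nonce, 12);
    for (size_t off = 0; off < len; off += BLK, counter++) {
        /* big-endian 32-bit block counter in the last four bytes */
        ctr_block[12] = (uint8_t)(counter >> 24);
        ctr_block[13] = (uint8_t)(counter >> 16);
        ctr_block[14] = (uint8_t)(counter >> 8);
        ctr_block[15] = (uint8_t)counter;

        toy_encrypt_block(key, ctr_block, keystream);

        size_t n = (len - off < BLK) ? (len - off) : BLK;
        for (size_t i = 0; i < n; i++)
            data[off + i] ^= keystream[i];
    }
}
```

Note that the toy function does not even have to be invertible for this to
work, which is the same reason no decrypt primitive is needed on the RX side.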

-- 
Jouni Malinen                                            PGP id EFC895FA


* Re: Looking for non-NIC hardware-offload for wpa2 decrypt.
  2014-07-31 20:05           ` Jouni Malinen
@ 2014-07-31 20:45             ` Christian Lamparter
  2014-08-05 23:09               ` Ben Greear
  0 siblings, 1 reply; 21+ messages in thread
From: Christian Lamparter @ 2014-07-31 20:45 UTC (permalink / raw)
  To: Jouni Malinen; +Cc: Ben Greear, linux-wireless, Johannes Berg

On Thursday, July 31, 2014 11:05:22 PM Jouni Malinen wrote:
> On Wed, Jul 30, 2014 at 08:59:33PM +0200, Christian Lamparter wrote:
> > If you have disabled rx-decrypt logic of ath10k, then why isn't _aesni_dec1
> > or aes_decrypt listed in the perf top result? I think they should be. Have you
> > removed them from the "perf top results" or are they really absent 
> > altogether? 
> > 
> > Because, from this perf result, it looks like your CPU is not burden by the
> > incoming RX at all?! Instead it is busy with the encryption of frames
> > it will be transmitting (in case of tcp, this could be tcp acks).
> 
> Keep in mind that this is CCMP, i.e., AES in CCM (Counter with CBC-MAC)
> mode. The CCM mode uses only the block cipher encryption function, i.e.,
> you won't be seeing aes_decrypt or _aesni_dec1 for this even on the RX
> path (AES encryption operations are used to generate the key stream
> blocks for CCM decryption).
Yes, I remember this detail/the old days (before 3.12/3.13?). Back then
ieee80211_aes_ccm_decrypt did exactly that. But these semantic pitfalls
were taken care of by the following commit:

commit 7ec7c4a9a686c608315739ab6a2b0527a240883c (from wireless-testing.git)
Author: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Date:   Thu Oct 10 09:55:20 2013 +0200

  mac80211: port CCMP to cryptoapi's CCM driver
   
  Use the generic CCM aead chaining mode driver rather than a local
  implementation that sits right on top of the core AES cipher.
  
  This allows the use of accelerated implementations of either
  CCM as a whole or the CTR mode which it encapsulates.
  [...]

Regards
Christian


* Re: Looking for non-NIC hardware-offload for wpa2 decrypt.
  2014-07-31 20:45             ` Christian Lamparter
@ 2014-08-05 23:09               ` Ben Greear
  2014-08-07 14:05                 ` Christian Lamparter
  0 siblings, 1 reply; 21+ messages in thread
From: Ben Greear @ 2014-08-05 23:09 UTC (permalink / raw)
  To: Christian Lamparter; +Cc: Jouni Malinen, linux-wireless, Johannes Berg

On 07/31/2014 01:45 PM, Christian Lamparter wrote:
> On Thursday, July 31, 2014 11:05:22 PM Jouni Malinen wrote:
>> On Wed, Jul 30, 2014 at 08:59:33PM +0200, Christian Lamparter wrote:
>>> If you have disabled rx-decrypt logic of ath10k, then why isn't _aesni_dec1
>>> or aes_decrypt listed in the perf top result? I think they should be. Have you
>>> removed them from the "perf top results" or are they really absent 
>>> altogether? 
>>>
>>> Because, from this perf result, it looks like your CPU is not burdened by the
>>> incoming RX at all?! Instead it is busy with the encryption of frames
>>> it will be transmitting (in case of tcp, this could be tcp acks).
>>
>> Keep in mind that this is CCMP, i.e., AES in CCM (Counter with CBC-MAC)
>> mode. The CCM mode uses only the block cipher encryption function, i.e.,
>> you won't be seeing aes_decrypt or _aesni_dec1 for this even on the RX
>> path (AES encryption operations are used to generate the key stream
>> blocks for CCM decryption).
> Yes, I remember this detail/the old days (before 3.12/3.13?). Back then
> ieee80211_aes_ccm_decrypt did exactly that. But these semantic pitfalls
> were taken care of by the following commit:
> 
> commit 7ec7c4a9a686c608315739ab6a2b0527a240883c (from wireless-testing.git)
> Author: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> Date:   Thu Oct 10 09:55:20 2013 +0200

This patch is in my tree (I'm using 3.14.14 kernel currently).

Here is a perf top from a different machine, with a single wlan interface
running UDP download (btserver is user-space app that is generating/receiving
the traffic).  I can do about 200Mbps download with WPA2 encryption enabled
on this machine, and ksoftirqd is using about 76% of a core according to top.

Samples: 154K of event 'cycles', Event count (approx.): 45228404083
  9.92%  [kernel]                 [k] __lock_acquire
  7.79%  btserver                 [.] 0x0000000000349d73
  6.44%  [kernel]                 [k] math_state_restore
  5.47%  [kernel]                 [k] _aesni_enc1
  4.36%  [kernel]                 [k] fpu_save_init
  2.88%  [kernel]                 [k] arch_local_save_flags
  2.68%  [kernel]                 [k] arch_local_irq_restore
  2.29%  [kernel]                 [k] lock_release
  1.80%  [kernel]                 [k] mark_lock
  1.58%  [kernel]                 [k] lock_acquire
  1.50%  [kernel]                 [k] irq_fpu_usable
  1.43%  [kernel]                 [k] crypto_xor
  1.35%  [kernel]                 [k] mark_held_locks
  1.30%  [kernel]                 [k] fib_rules_lookup
  1.25%  [kernel]                 [k] hlock_class
  1.20%  [kernel]                 [k] trace_hardirqs_on_caller
  0.99%  [kernel]                 [k] copy_user_generic_string
  0.92%  [kernel]                 [k] __netif_receive_skb_core
  0.88%  [kernel]                 [k] trace_hardirqs_off_caller
  0.87%  [kernel]                 [k] arch_local_irq_save
  0.85%  [kernel]                 [k] dev_queue_xmit_nit
  0.84%  [kernel]                 [k] aes_encrypt
  0.59%  [kernel]                 [k] do_raw_spin_lock
  0.55%  [kernel]                 [k] get_data_to_compute
  0.53%  [kernel]                 [k] __rcu_read_unlock
  0.52%  [kernel]                 [k] crypto_ctr_crypt

A second test where the station machine was not generating to itself
(ie, tx on Ethernet, to AP, receive back on wlan), but only
receiving traffic from the AP, shows this perf top:

Samples: 126K of event 'cycles', Event count (approx.): 29019221373
 10.74%  [kernel]                 [k] math_state_restore
 10.50%  btserver                 [.] 0x000000000033260d
  9.00%  [kernel]                 [k] _aesni_enc1
  7.33%  [kernel]                 [k] fpu_save_init
  6.70%  [kernel]                 [k] __lock_acquire
  2.46%  [kernel]                 [k] irq_fpu_usable
  2.34%  [kernel]                 [k] crypto_xor
  1.88%  [kernel]                 [k] arch_local_save_flags
  1.83%  [kernel]                 [k] arch_local_irq_restore
  1.58%  [kernel]                 [k] lock_release
  1.48%  [kernel]                 [k] aes_encrypt
  1.27%  [kernel]                 [k] mark_lock
  1.12%  [kernel]                 [k] lock_acquire
  1.02%  [kernel]                 [k] mark_held_locks
  0.96%  [kernel]                 [k] trace_hardirqs_on_caller
  0.93%  [kernel]                 [k] get_data_to_compute
  0.83%  [kernel]                 [k] hlock_class
  0.81%  [kernel]                 [k] __kernel_fpu_begin
  0.81%  [kernel]                 [k] crypto_ctr_crypt
  0.80%  [kernel]                 [k] crypto_inc


[greearb@ben-dt2 linux-3.14.x64]$ grep CCM .config
CONFIG_LIB80211_CRYPT_CCMP=m
# CONFIG_RTLLIB_CRYPTO_CCMP is not set
CONFIG_CRYPTO_CCM=y
[greearb@ben-dt2 linux-3.14.x64]$

[root@ct523-9292 lanforge]# cat /proc/cpuinfo
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 42
model name	: Intel(R) Core(TM) i7-2655LE CPU @ 2.20GHz
stepping	: 7
microcode	: 0x28
cpu MHz		: 2200.000
cache size	: 4096 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 2
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm
constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr
pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4390.31
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:


Out of curiosity, might it help to prefetch the entire skb when
getting it from the NIC, since we are about to have to read it all
to do the decrypt?

Any idea how to prefetch the skb?

Thanks,
Ben


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



* Re: Looking for non-NIC hardware-offload for wpa2 decrypt.
  2014-08-05 23:09               ` Ben Greear
@ 2014-08-07 14:05                 ` Christian Lamparter
  2014-08-07 17:45                   ` Ben Greear
  0 siblings, 1 reply; 21+ messages in thread
From: Christian Lamparter @ 2014-08-07 14:05 UTC (permalink / raw)
  To: Ben Greear; +Cc: Jouni Malinen, linux-wireless, Johannes Berg

On Tuesday, August 05, 2014 04:09:42 PM Ben Greear wrote:
> On 07/31/2014 01:45 PM, Christian Lamparter wrote:
> > On Thursday, July 31, 2014 11:05:22 PM Jouni Malinen wrote:
> >> On Wed, Jul 30, 2014 at 08:59:33PM +0200, Christian Lamparter wrote:
> >>> If you have disabled rx-decrypt logic of ath10k, then why isn't _aesni_dec1
> >>> or aes_decrypt listed in the perf top result? I think they should be. Have you
> >>> removed them from the "perf top results" or are they really absent 
> >>> altogether? 
> >>>
> >>> Because, from this perf result, it looks like your CPU is not burdened by the
> >>> incoming RX at all?! Instead it is busy with the encryption of frames
> >>> it will be transmitting (in case of tcp, this could be tcp acks).
> >>
> >> Keep in mind that this is CCMP, i.e., AES in CCM (Counter with CBC-MAC)
> >> mode. The CCM mode uses only the block cipher encryption function, i.e.,
> >> you won't be seeing aes_decrypt or _aesni_dec1 for this even on the RX
> >> path (AES encryption operations are used to generate the key stream
> >> blocks for CCM decryption).
> > Yes, I remember this detail/the old days (before 3.12/3.13?). Back then
> > ieee80211_aes_ccm_decrypt did exactly that. But these semantic pitfalls
> > were taken care of by the following commit:
> > 
> > commit 7ec7c4a9a686c608315739ab6a2b0527a240883c (from wireless-testing.git)
> > Author: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> > Date:   Thu Oct 10 09:55:20 2013 +0200
> 
> This patch is in my tree (I'm using 3.14.14 kernel currently).
> 
> Here is a perf top from a different machine, with a single wlan interface
> running UDP download (btserver is user-space app that is generating/receiving
> the traffic).  I can do about 200Mbps download with WPA2 encryption enabled
> on this machine, and ksoftirqd is using about 76% of a core according to top.

Thanks. I read the CCM (Counter with CBC-MAC) specification instead of
ccm.c, and guess what: "Both the CCM encryption and CCM decryption operations
require only the block cipher encryption function." [0]. (Yes, same as
Jouni said in his mail).
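
A minimal user-space sketch of why that holds, under loud assumptions:
toy_block_encrypt() below is a hypothetical stand-in for the AES forward
transform (it is not AES and not secure), and the MAC leaves out CCM's B0
and length formatting. CTR mode only ever encrypts counter blocks and XORs
the resulting key stream into the data, and CBC-MAC only ever encrypts the
running MAC, so neither path needs the inverse of the block cipher.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLK 16

/* Toy keyed block function standing in for the AES encrypt direction.
 * It does not even need to be invertible for CTR or CBC-MAC to work. */
static void toy_block_encrypt(const uint8_t key[BLK], const uint8_t in[BLK],
			      uint8_t out[BLK])
{
	for (size_t i = 0; i < BLK; i++)
		out[i] = (uint8_t)((in[i] ^ key[i]) * 167u + 13u);
}

/* CTR mode: key stream block = E(key, counter); output = input XOR key
 * stream. The same routine is used for encryption and decryption. */
static void toy_ctr_crypt(const uint8_t key[BLK], const uint8_t iv[BLK],
			  const uint8_t *in, uint8_t *out, size_t len)
{
	uint8_t ctr[BLK], ks[BLK];

	assert(in != NULL && out != NULL);
	memcpy(ctr, iv, BLK);

	for (size_t off = 0; off < len; off += BLK) {
		size_t n = (len - off < BLK) ? len - off : BLK;

		toy_block_encrypt(key, ctr, ks);
		for (size_t i = 0; i < n; i++)
			out[off + i] = in[off + i] ^ ks[i];

		/* big-endian increment of the counter block, with carry */
		for (int i = BLK - 1; i >= 0; i--) {
			ctr[i]++;
			if (ctr[i] != 0)
				break;
		}
	}
}

/* CBC-MAC: mac = E(key, mac XOR block) per block. A short final block is
 * implicitly zero-padded, since XOR with zero changes nothing. */
static void toy_cbc_mac(const uint8_t key[BLK], const uint8_t *data,
			size_t len, uint8_t mac[BLK])
{
	assert(data != NULL);
	memset(mac, 0, BLK);

	for (size_t off = 0; off < len; off += BLK) {
		size_t n = (len - off < BLK) ? len - off : BLK;

		for (size_t i = 0; i < n; i++)
			mac[i] ^= data[off + i];
		toy_block_encrypt(key, mac, mac);
	}
}
```

Running toy_ctr_crypt() twice with the same key and IV gives back the
original buffer, which is consistent with only encrypt-side symbols such as
_aesni_enc1 and aes_encrypt appearing in a profile of the RX path.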

Now to the perf:
 
> Samples: 126K of event 'cycles', Event count (approx.): 29019221373
>  10.74%  [kernel]                 [k] math_state_restore
>  10.50%  btserver                 [.] 0x000000000033260d
>   9.00%  [kernel]                 [k] _aesni_enc1
>   7.33%  [kernel]                 [k] fpu_save_init
>   6.70%  [kernel]                 [k] __lock_acquire
>   2.46%  [kernel]                 [k] irq_fpu_usable
>   2.34%  [kernel]                 [k] crypto_xor
>   1.88%  [kernel]                 [k] arch_local_save_flags
>   1.83%  [kernel]                 [k] arch_local_irq_restore
>   1.58%  [kernel]                 [k] lock_release
>   1.48%  [kernel]                 [k] aes_encrypt
>   1.27%  [kernel]                 [k] mark_lock
>   1.12%  [kernel]                 [k] lock_acquire
>   1.02%  [kernel]                 [k] mark_held_locks
>   0.96%  [kernel]                 [k] trace_hardirqs_on_caller
>   0.93%  [kernel]                 [k] get_data_to_compute
>   0.83%  [kernel]                 [k] hlock_class
>   0.81%  [kernel]                 [k] __kernel_fpu_begin
>   0.81%  [kernel]                 [k] crypto_ctr_crypt
>   0.80%  [kernel]                 [k] crypto_inc

The high overhead (math_state_restore and fpu_save_init) is caused by 
the way ccm.c interacts with the aesni implementation when calculating
the MAC [1] (in compute_mac). 

>    [ ... ]
>	/* now encrypt rest of data */
>	while (datalen >= 16) {
>		crypto_xor(odata, data, bs);
>		crypto_cipher_encrypt_one(tfm, odata, odata);
>
>		datalen -= 16;
>		data += 16;
>	}
>   [...]

crypto_cipher_encrypt_one is a wrapper which in your case calls 
aesni's aes_encrypt [2].

And aes_encrypt looks like this: 

>	[...]
>	kernel_fpu_begin();
>	aesni_enc(ctx, dst, src); <-- this is where it goes to _aesni_enc1
>	kernel_fpu_end();
>	[...] 

Or: for every 16 Bytes of payload there is one fpu context save and
restore... ouch!

[0] http://tools.ietf.org/html/rfc3610
[1] http://lxr.free-electrons.com/source/crypto/ccm.c#L164
[2] http://lxr.free-electrons.com/source/arch/x86/crypto/aesni-intel_glue.c#L323


Regards

Christian


* Re: Looking for non-NIC hardware-offload for wpa2 decrypt.
  2014-08-07 14:05                 ` Christian Lamparter
@ 2014-08-07 17:45                   ` Ben Greear
  2014-08-10 13:44                     ` Christian Lamparter
  0 siblings, 1 reply; 21+ messages in thread
From: Ben Greear @ 2014-08-07 17:45 UTC (permalink / raw)
  To: Christian Lamparter; +Cc: Jouni Malinen, linux-wireless, Johannes Berg

On 08/07/2014 07:05 AM, Christian Lamparter wrote:

> The high overhead (math_state_restore and fpu_save_init) is caused by 
> the way ccm.c interacts with the aesni implementation when calculating
> the MAC [1] (in compute_mac). 
> 
>>    [ ... ]
>> 	/* now encrypt rest of data */
>> 	while (datalen >= 16) {
>> 		crypto_xor(odata, data, bs);
>> 		crypto_cipher_encrypt_one(tfm, odata, odata);
>>
>> 		datalen -= 16;
>> 		data += 16;
>> 	}
>>   [...]
> 
> crypto_cipher_encrypt_one is a wrapper which in your case calls 
> aesni's aes_encrypt [2].
> 
> And aes_encrypt looks like this: 
> 
>> 	[...]
>> 	kernel_fpu_begin();
>> 	aesni_enc(ctx, dst, src); <-- this is where it goes to _aesni_enc1
>> 	kernel_fpu_end();
>> 	[...] 
> 
> Or: for every 16 Bytes of payload there is one fpu context save and
> restore... ouch!

I have never messed with this kind of stuff...

Any idea if it would work to put the fpu_begin/end a bit higher and do all those 16 byte
chunks in a batch without messing with the FPU for each chunk?
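
A rough user-space model of that idea, assuming made-up stand-ins:
model_fpu_begin(), model_fpu_end() and model_aes_block() are hypothetical
counters and do not touch the FPU or call the kernel API. It only counts how
many begin/end sections a MAC pass over datalen bytes would need when the
pair wraps every 16-byte block versus when it is hoisted out of the loop.

```c
#include <assert.h>
#include <stddef.h>

#define AES_BLOCK_SIZE 16

struct fpu_stats {
	unsigned long sections;	/* begin/end pairs entered */
	unsigned long blocks;	/* single-block encryptions done */
};

/* Hypothetical stand-ins for kernel_fpu_begin()/kernel_fpu_end() and for
 * one AES block encryption; they only bump counters. */
static void model_fpu_begin(struct fpu_stats *s) { s->sections++; }
static void model_fpu_end(struct fpu_stats *s) { (void)s; }
static void model_aes_block(struct fpu_stats *s) { s->blocks++; }

/* Shape of the current path: each block goes through a wrapper that opens
 * and closes its own FPU section. */
static void mac_per_block(struct fpu_stats *s, size_t datalen)
{
	assert(s != NULL);
	while (datalen >= AES_BLOCK_SIZE) {
		model_fpu_begin(s);
		model_aes_block(s);
		model_fpu_end(s);
		datalen -= AES_BLOCK_SIZE;
	}
}

/* Shape of the proposed path: one FPU section around the whole loop. */
static void mac_batched(struct fpu_stats *s, size_t datalen)
{
	assert(s != NULL);
	model_fpu_begin(s);
	while (datalen >= AES_BLOCK_SIZE) {
		model_aes_block(s);
		datalen -= AES_BLOCK_SIZE;
	}
	model_fpu_end(s);
}
```

For a 1500-byte payload the per-block variant enters 93 sections and the
batched one enters 1, with the same 93 block encryptions in both. Whether
the real code may keep the FPU claimed for that long is a separate question
this model does not answer.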

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



* Re: Looking for non-NIC hardware-offload for wpa2 decrypt.
  2014-08-07 17:45                   ` Ben Greear
@ 2014-08-10 13:44                     ` Christian Lamparter
  2014-08-12 18:34                       ` Ben Greear
  0 siblings, 1 reply; 21+ messages in thread
From: Christian Lamparter @ 2014-08-10 13:44 UTC (permalink / raw)
  To: Ben Greear; +Cc: Jouni Malinen, linux-wireless, Johannes Berg

On Thursday, August 07, 2014 10:45:01 AM Ben Greear wrote:
> On 08/07/2014 07:05 AM, Christian Lamparter wrote:
> > Or: for every 16 Bytes of payload there is one fpu context save and
> > restore... ouch!
>
> Any idea if it would work to put the fpu_begin/end a bit higher
> and do all those 16 byte chunks in a batch without messing with
> the FPU for each chunk?

It sort of works - see sample feature patch for aesni-intel-glue 
(taken from 3.16-wl). Older kernels (like 3.15, 3.14) need:
"crypto: allow blkcipher walks over AEAD data" [0] (and maybe more).

The FPU save/restore overhead should be gone. Also, if the aesni
instructions can't be used, the implementation will fall back
to the original ccm(aes) code. Calculating the MAC is still much
more expensive than the payload encryption or decryption. However,
I can't see a way of making this more efficient without rewriting
and combining the parts I took from crypto/ccm.c into several
dedicated assembler functions.

Regards
Christian
---
 arch/x86/crypto/aesni-intel_glue.c | 484 +++++++++++++++++++++++++++++++++++++
 1 file changed, 484 insertions(+)

diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c
index 948ad0e..beab823 100644
--- a/arch/x86/crypto/aesni-intel_glue.c
+++ b/arch/x86/crypto/aesni-intel_glue.c
@@ -36,6 +36,7 @@
 #include <asm/crypto/aes.h>
 #include <crypto/ablk_helper.h>
 #include <crypto/scatterwalk.h>
+#include <crypto/aead.h>
 #include <crypto/internal/aead.h>
 #include <linux/workqueue.h>
 #include <linux/spinlock.h>
@@ -499,6 +500,448 @@ static int ctr_crypt(struct blkcipher_desc *desc,
 
 	return err;
 }
+
+static int __ccm_setkey(struct crypto_aead *tfm, const u8 *in_key,
+		      unsigned int key_len)
+{
+	struct crypto_aes_ctx *ctx = crypto_aead_ctx(tfm);
+
+	return aes_set_key_common(crypto_aead_tfm(tfm), ctx, in_key, key_len);
+}
+
+static int __ccm_setauthsize(struct crypto_aead *tfm, unsigned int authsize)
+{
+	if ((authsize & 1) || authsize < 4)
+		return -EINVAL;
+	return 0;
+}
+
+static int set_msg_len(u8 *block, unsigned int msglen, int csize)
+{
+	__be32 data;
+
+	memset(block, 0, csize);
+	block += csize;
+
+	if (csize >= 4)
+		csize = 4;
+	else if (msglen > (1 << (8 * csize)))
+		return -EOVERFLOW;
+
+	data = cpu_to_be32(msglen);
+	memcpy(block - csize, (u8 *)&data + 4 - csize, csize);
+
+	return 0;
+}
+
+static int ccm_init_mac(struct aead_request *req, u8 maciv[], u32 msglen)
+{
+	struct crypto_aead *aead = crypto_aead_reqtfm(req);
+	__be32 *n = (__be32 *)&maciv[AES_BLOCK_SIZE - 8];
+	u32 l = req->iv[0] + 1;
+
+	/* verify that CCM dimension 'L' is set correctly in the IV */
+	if (l < 2 || l > 8)
+		return -EINVAL;
+
+	/* verify that msglen can in fact be represented in L bytes */
+	if (l < 4 && msglen >> (8 * l))
+		return -EOVERFLOW;
+
+	/*
+	 * Even if the CCM spec allows L values of up to 8, the Linux cryptoapi
+	 * uses a u32 type to represent msglen so the top 4 bytes are always 0.
+	 */
+	n[0] = 0;
+	n[1] = cpu_to_be32(msglen);
+
+	memcpy(maciv, req->iv, AES_BLOCK_SIZE - l);
+
+	/*
+	 * Meaning of byte 0 according to CCM spec (RFC 3610/NIST 800-38C)
+	 * - bits 0..2	: max # of bytes required to represent msglen, minus 1
+	 *                (already set by caller)
+	 * - bits 3..5	: size of auth tag (1 => 4 bytes, 2 => 6 bytes, etc)
+	 * - bit 6	: indicates presence of authenticate-only data
+	 */
+	maciv[0] |= (crypto_aead_authsize(aead) - 2) << 2;
+	if (req->assoclen)
+		maciv[0] |= 0x40;
+
+	memset(&req->iv[AES_BLOCK_SIZE - l], 0, l);
+	return set_msg_len(maciv + AES_BLOCK_SIZE - l, msglen, l);
+}
+
+static int compute_mac(struct crypto_aes_ctx *ctx, u8 mac[], u8 *data, int n,
+		       unsigned int ilen, u8 *idata)
+{
+	unsigned int bs = AES_BLOCK_SIZE;
+	u8 *odata = mac;
+	int datalen, getlen;
+
+	datalen = n;
+
+	/* first time in here, block may be partially filled. */
+	getlen = bs - ilen;
+	if (datalen >= getlen) {
+		memcpy(idata + ilen, data, getlen);
+		crypto_xor(odata, idata, bs);
+
+		aesni_enc(ctx, odata, odata);
+		datalen -= getlen;
+		data += getlen;
+		ilen = 0;
+	}
+
+	/* now encrypt rest of data */
+	while (datalen >= bs) {
+		crypto_xor(odata, data, bs);
+
+		aesni_enc(ctx, odata, odata);
+
+		datalen -= bs;
+		data += bs;
+	}
+
+	/* check and see if there's leftover data that wasn't
+	 * enough to fill a block.
+	 */
+	if (datalen) {
+		memcpy(idata + ilen, data, datalen);
+		ilen += datalen;
+	}
+	return ilen;
+}
+
+static unsigned int get_data_to_compute(struct crypto_aes_ctx *ctx, u8 mac[],
+					u8 *idata, struct scatterlist *sg,
+					unsigned int len, unsigned int ilen)
+{
+	struct scatter_walk walk;
+	u8 *data_src;
+	int n;
+
+	scatterwalk_start(&walk, sg);
+
+	while (len) {
+		n = scatterwalk_clamp(&walk, len);
+		if (!n) {
+			scatterwalk_start(&walk, sg_next(walk.sg));
+			n = scatterwalk_clamp(&walk, len);
+		}
+		data_src = scatterwalk_map(&walk);
+
+		ilen = compute_mac(ctx, mac, data_src, n, ilen, idata);
+		len -= n;
+
+		scatterwalk_unmap(data_src);
+		scatterwalk_advance(&walk, n);
+		scatterwalk_done(&walk, 0, len);
+	}
+
+	/* any leftover needs padding and then encrypted */
+	if (ilen) {
+		int padlen;
+		u8 *odata = mac;
+
+		padlen = AES_BLOCK_SIZE - ilen;
+		memset(idata + ilen, 0, padlen);
+		crypto_xor(odata, idata, AES_BLOCK_SIZE);
+
+		aesni_enc(ctx, odata, odata);
+		ilen = 0;
+	}
+	return ilen;
+}
+
+static void ccm_calculate_auth_mac(struct aead_request *req,
+				   struct crypto_aes_ctx *ctx, u8 mac[],
+				   struct scatterlist *src,
+				   unsigned int cryptlen)
+{
+	unsigned int ilen;
+	u8 idata[AES_BLOCK_SIZE];
+	u32 len = req->assoclen;
+
+	aesni_enc(ctx, mac, mac);
+
+	if (len) {
+		struct __packed {
+			__be16 l;
+			__be32 h;
+		} *ltag = (void *)idata;
+
+		/* prepend the AAD with a length tag */
+		if (len < 0xff00) {
+			ltag->l = cpu_to_be16(len);
+			ilen = 2;
+		} else  {
+			ltag->l = cpu_to_be16(0xfffe);
+			ltag->h = cpu_to_be32(len);
+			ilen = 6;
+		}
+
+		ilen = get_data_to_compute(ctx, mac, idata,
+					   req->assoc, req->assoclen,
+					   ilen);
+	} else {
+		ilen = 0;
+	}
+
+	/* compute plaintext into mac */
+	if (cryptlen) {
+		ilen = get_data_to_compute(ctx, mac, idata,
+					   src, cryptlen, ilen);
+	}
+}
+
+static int __ccm_encrypt(struct aead_request *req)
+{
+	struct crypto_aead *aead = crypto_aead_reqtfm(req);
+	struct crypto_aes_ctx *ctx = aes_ctx(crypto_aead_ctx(aead));
+	struct blkcipher_desc desc = { .info = req->iv };
+	struct blkcipher_walk walk;
+	struct scatterlist src[2], dst[2], *pdst;
+	u8 __aligned(8) mac[AES_BLOCK_SIZE];
+	u32 len = req->cryptlen;
+	int err;
+
+	err = ccm_init_mac(req, mac, len);
+	if (err)
+		return err;
+
+	ccm_calculate_auth_mac(req, ctx, mac, req->src, len);
+
+	sg_init_table(src, 2);
+	sg_set_buf(src, mac, sizeof(mac));
+	scatterwalk_sg_chain(src, 2, req->src);
+
+	pdst = src;
+	if (req->src != req->dst) {
+		sg_init_table(dst, 2);
+		sg_set_buf(dst, mac, sizeof(mac));
+		scatterwalk_sg_chain(dst, 2, req->dst);
+		pdst = dst;
+	}
+
+	len += sizeof(mac);
+	blkcipher_walk_init(&walk, pdst, src, len);
+	err = blkcipher_aead_walk_virt_block(&desc, &walk, aead,
+					     AES_BLOCK_SIZE);
+
+	while ((len = walk.nbytes) >= AES_BLOCK_SIZE) {
+		aesni_ctr_enc(ctx, walk.dst.virt.addr, walk.src.virt.addr,
+			      len & AES_BLOCK_MASK, walk.iv);
+		len &= AES_BLOCK_SIZE - 1;
+		err = blkcipher_walk_done(&desc, &walk, len);
+	}
+	if (walk.nbytes) {
+		ctr_crypt_final(ctx, &walk);
+		err = blkcipher_walk_done(&desc, &walk, 0);
+	}
+
+	if (err)
+		return err;
+
+	/* copy authtag to end of dst */
+	scatterwalk_map_and_copy(mac, req->dst, req->cryptlen,
+				 crypto_aead_authsize(aead), 1);
+	return 0;
+}
+
+static int __ccm_decrypt(struct aead_request *req)
+{
+	struct crypto_aead *aead = crypto_aead_reqtfm(req);
+	struct crypto_aes_ctx *ctx = aes_ctx(crypto_aead_ctx(aead));
+	unsigned int authsize = crypto_aead_authsize(aead);
+	struct blkcipher_desc desc = { .info = req->iv };
+	struct blkcipher_walk walk;
+	struct scatterlist src[2], dst[2], *pdst;
+	u8 __aligned(8) authtag[AES_BLOCK_SIZE], mac[AES_BLOCK_SIZE];
+	u32 len;
+	int err;
+
+	if (req->cryptlen < authsize)
+		return -EINVAL;
+
+	scatterwalk_map_and_copy(authtag, req->src,
+				 req->cryptlen - authsize, authsize, 0);
+
+	err = ccm_init_mac(req, mac, req->cryptlen - authsize);
+	if (err)
+		return err;
+
+	sg_init_table(src, 2);
+	sg_set_buf(src, authtag, sizeof(authtag));
+	scatterwalk_sg_chain(src, 2, req->src);
+
+	pdst = src;
+	if (req->src != req->dst) {
+		sg_init_table(dst, 2);
+		sg_set_buf(dst, authtag, sizeof(authtag));
+		scatterwalk_sg_chain(dst, 2, req->dst);
+		pdst = dst;
+	}
+
+	blkcipher_walk_init(&walk, pdst, src,
+			    req->cryptlen - authsize + sizeof(mac));
+	err = blkcipher_aead_walk_virt_block(&desc, &walk, aead,
+					     AES_BLOCK_SIZE);
+
+	while ((len = walk.nbytes) >= AES_BLOCK_SIZE) {
+		aesni_ctr_enc(ctx, walk.dst.virt.addr, walk.src.virt.addr,
+			      len & AES_BLOCK_MASK, walk.iv);
+		len &= AES_BLOCK_SIZE - 1;
+		err = blkcipher_walk_done(&desc, &walk, len);
+	}
+	if (walk.nbytes) {
+		ctr_crypt_final(ctx, &walk);
+		err = blkcipher_walk_done(&desc, &walk, 0);
+	}
+
+	ccm_calculate_auth_mac(req, ctx, mac, req->dst,
+			       req->cryptlen - authsize);
+	if (err)
+		return err;
+
+	/* compare calculated auth tag with the stored one */
+	if (crypto_memneq(mac, authtag, authsize))
+		return -EBADMSG;
+	return 0;
+}
+
+struct ccm_async_ctx {
+	struct crypto_aes_ctx ctx;
+	struct crypto_aead *fallback;
+};
+
+static inline struct
+ccm_async_ctx *get_ccm_ctx(struct crypto_aead *aead)
+{
+	return (struct ccm_async_ctx *)
+		PTR_ALIGN((u8 *)
+		crypto_tfm_ctx(crypto_aead_tfm(aead)), AESNI_ALIGN);
+}
+
+static int ccm_init(struct crypto_tfm *tfm)
+{
+	struct crypto_aead *crypto_tfm;
+	struct ccm_async_ctx *ctx = (struct ccm_async_ctx *)
+		PTR_ALIGN((u8 *)crypto_tfm_ctx(tfm), AESNI_ALIGN);
+
+	crypto_tfm = crypto_alloc_aead("ccm(aes)", 0,
+		CRYPTO_ALG_ASYNC | CRYPTO_ALG_NEED_FALLBACK);
+	if (IS_ERR(crypto_tfm))
+		return PTR_ERR(crypto_tfm);
+
+	ctx->fallback = crypto_tfm;
+	return 0;
+}
+
+static void ccm_exit(struct crypto_tfm *tfm)
+{
+	struct ccm_async_ctx *ctx = (struct ccm_async_ctx *)
+		PTR_ALIGN((u8 *)crypto_tfm_ctx(tfm), AESNI_ALIGN);
+
+	if (!IS_ERR_OR_NULL(ctx->fallback))
+		crypto_free_aead(ctx->fallback);
+}
+
+static int ccm_setkey(struct crypto_aead *aead, const u8 *in_key,
+		      unsigned int key_len)
+{
+	struct crypto_tfm *tfm = crypto_aead_tfm(aead);
+	struct ccm_async_ctx *ctx = (struct ccm_async_ctx *)
+		PTR_ALIGN((u8 *)crypto_tfm_ctx(tfm), AESNI_ALIGN);
+	int err;
+
+	err = __ccm_setkey(aead, in_key, key_len);
+	if (err)
+		return err;
+
+	/*
+	 * Set the fallback transform to use the same request flags as
+	 * the hardware transform.
+	 */
+	ctx->fallback->base.crt_flags &= ~CRYPTO_TFM_REQ_MASK;
+	ctx->fallback->base.crt_flags |=
+			tfm->crt_flags & CRYPTO_TFM_REQ_MASK;
+	return crypto_aead_setkey(ctx->fallback, in_key, key_len);
+}
+
+static int ccm_setauthsize(struct crypto_aead *aead, unsigned int authsize)
+{
+	struct crypto_tfm *tfm = crypto_aead_tfm(aead);
+	struct ccm_async_ctx *ctx = (struct ccm_async_ctx *)
+		PTR_ALIGN((u8 *)crypto_tfm_ctx(tfm), AESNI_ALIGN);
+	int err;
+
+	err = __ccm_setauthsize(aead, authsize);
+	if (err)
+		return err;
+
+	return crypto_aead_setauthsize(ctx->fallback, authsize);
+}
+
+static int ccm_encrypt(struct aead_request *req)
+{
+	int ret;
+
+	if (!irq_fpu_usable()) {
+		struct crypto_aead *aead = crypto_aead_reqtfm(req);
+		struct ccm_async_ctx *ctx = get_ccm_ctx(aead);
+		struct crypto_aead *fallback = ctx->fallback;
+
+		char aead_req_data[sizeof(struct aead_request) +
+				   crypto_aead_reqsize(fallback)]
+		__aligned(__alignof__(struct aead_request));
+		struct aead_request *aead_req = (void *) aead_req_data;
+
+		memset(aead_req, 0, sizeof(aead_req_data));
+		aead_request_set_tfm(aead_req, fallback);
+		aead_request_set_assoc(aead_req, req->assoc, req->assoclen);
+		aead_request_set_crypt(aead_req, req->src, req->dst,
+				       req->cryptlen, req->iv);
+		aead_request_set_callback(aead_req, req->base.flags,
+					  req->base.complete, req->base.data);
+		ret = crypto_aead_encrypt(aead_req);
+	} else {
+		kernel_fpu_begin();
+		ret = __ccm_encrypt(req);
+		kernel_fpu_end();
+	}
+	return ret;
+}
+
+static int ccm_decrypt(struct aead_request *req)
+{
+	int ret;
+
+	if (!irq_fpu_usable()) {
+		struct crypto_aead *aead = crypto_aead_reqtfm(req);
+		struct ccm_async_ctx *ctx = get_ccm_ctx(aead);
+		struct crypto_aead *fallback = ctx->fallback;
+
+		char aead_req_data[sizeof(struct aead_request) +
+				   crypto_aead_reqsize(fallback)]
+		__aligned(__alignof__(struct aead_request));
+		struct aead_request *aead_req = (void *) aead_req_data;
+
+		memset(aead_req, 0, sizeof(aead_req_data));
+		aead_request_set_tfm(aead_req, fallback);
+		aead_request_set_assoc(aead_req, req->assoc, req->assoclen);
+		aead_request_set_crypt(aead_req, req->src, req->dst,
+				       req->cryptlen, req->iv);
+		aead_request_set_callback(aead_req, req->base.flags,
+					  req->base.complete, req->base.data);
+		ret = crypto_aead_decrypt(aead_req);
+	} else {
+		kernel_fpu_begin();
+		ret = __ccm_decrypt(req);
+		kernel_fpu_end();
+	}
+	return ret;
+}
 #endif
 
 static int ablk_ecb_init(struct crypto_tfm *tfm)
@@ -1308,6 +1751,47 @@ static struct crypto_alg aesni_algs[] = { {
 		},
 	},
 }, {
+	.cra_name		= "__ccm-aes-aesni",
+	.cra_driver_name	= "__driver-ccm-aes-aesni",
+	.cra_priority		= 0,
+	.cra_flags		= CRYPTO_ALG_TYPE_AEAD,
+	.cra_blocksize		= 1,
+	.cra_ctxsize		= sizeof(struct crypto_aes_ctx) +
+				  AESNI_ALIGN - 1,
+	.cra_alignmask		= 0,
+	.cra_type		= &crypto_aead_type,
+	.cra_module		= THIS_MODULE,
+	.cra_aead = {
+		.ivsize		= AES_BLOCK_SIZE,
+		.maxauthsize	= AES_BLOCK_SIZE,
+		.setkey		= __ccm_setkey,
+		.setauthsize	= __ccm_setauthsize,
+		.encrypt	= __ccm_encrypt,
+		.decrypt	= __ccm_decrypt,
+	},
+}, {
+	.cra_name		= "ccm(aes)",
+	.cra_driver_name	= "ccm-aes-aesni",
+	.cra_priority		= 700,
+	.cra_flags		= CRYPTO_ALG_TYPE_AEAD |
+				  CRYPTO_ALG_NEED_FALLBACK,
+	.cra_blocksize		= 1,
+	.cra_ctxsize		= AESNI_ALIGN - 1 +
+				  sizeof(struct ccm_async_ctx),
+	.cra_alignmask		= 0,
+	.cra_type		= &crypto_aead_type,
+	.cra_module		= THIS_MODULE,
+	.cra_init		= ccm_init,
+	.cra_exit		= ccm_exit,
+	.cra_aead = {
+		.ivsize		= AES_BLOCK_SIZE,
+		.maxauthsize	= AES_BLOCK_SIZE,
+		.setkey		= ccm_setkey,
+		.setauthsize	= ccm_setauthsize,
+		.encrypt	= ccm_encrypt,
+		.decrypt	= ccm_decrypt,
+	},
+}, {
 	.cra_name		= "__gcm-aes-aesni",
 	.cra_driver_name	= "__driver-gcm-aes-aesni",
 	.cra_priority		= 0,
-- 
2.0.1

[0] <https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=4f7f1d7cff8f2c170ce0319eb4c01a82c328d34f>


* Re: Looking for non-NIC hardware-offload for wpa2 decrypt.
  2014-08-10 13:44                     ` Christian Lamparter
@ 2014-08-12 18:34                       ` Ben Greear
  2014-08-14 12:39                         ` Christian Lamparter
  0 siblings, 1 reply; 21+ messages in thread
From: Ben Greear @ 2014-08-12 18:34 UTC (permalink / raw)
  To: Christian Lamparter; +Cc: Jouni Malinen, linux-wireless, Johannes Berg

On 08/10/2014 06:44 AM, Christian Lamparter wrote:
> On Thursday, August 07, 2014 10:45:01 AM Ben Greear wrote:
>> On 08/07/2014 07:05 AM, Christian Lamparter wrote:
>>> Or: for every 16 Bytes of payload there is one fpu context save and
>>> restore... ouch!
>>
>> Any idea if it would work to put the fpu_begin/end a bit higher
>> and do all those 16 byte chunks in a batch without messing with
>> the FPU for each chunk?
> 
> It sort of works - see sample feature patch for aesni-intel-glue 
> (taken from 3.16-wl). Older kernels (like 3.15, 3.14) need:
> "crypto: allow blkcipher walks over AEAD data" [0] (and maybe more).
> 
> The FPU save/restore overhead should be gone. Also, if the aesni
> instructions can't be used, the implementation will fall back
> to the original ccm(aes) code. Calculating the MAC is still much
> more expensive than the payload encryption or decryption. However,
> I can't see a way of making this more efficient without rewriting
> and combining the parts I took from crypto/ccm.c into several
> dedicated assembler functions.

I tried this patch on my i7 machine, on the 3.16+ kernel.  Without your
patch, I see about 260Mbps download.  With it, performance improves
to around 350Mbps - 375Mbps.

Without encryption, I see download rate of around 400 - 420Mbps.
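
For reference, the relative change behind those figures can be worked out
with a small helper. This is only a sketch: percent_gain() and percent_of()
are hypothetical names, and the inputs are just the Mbps numbers quoted
above.

```c
#include <assert.h>

/* Relative change from 'before' to 'after', in percent. */
static double percent_gain(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}

/* How much of a reference rate a measured rate reaches, in percent. */
static double percent_of(double reference, double measured)
{
	assert(reference > 0.0);
	return measured / reference * 100.0;
}
```

percent_gain(260, 350) and percent_gain(260, 375) come out at roughly 35%
and 44%, and percent_of(420, 375) at roughly 89% of the open-authentication
rate.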

So, your patch looks like a good improvement to me, and I'll be
happy to test further patches if you happen to do those assembler
optimizations you talk about above.

Let me know if you would like more/different performance
stats.


Here is perf top of open authentication, download, UDP:

Samples: 64K of event 'cycles', Event count (approx.): 8792558478
 30.78%  btserver                [.] 0x0000000000100501
  2.73%  [kernel]                [k] copy_user_generic_string
  2.02%  [kernel]                [k] swiotlb_tbl_unmap_single
  1.43%  [kernel]                [k] ioread32
  1.40%  [ath10k_core]           [k] ath10k_htt_txrx_compl_task
  1.38%  [kernel]                [k] csum_partial
  1.22%  [kernel]                [k] _raw_spin_lock_irqsave
  0.97%  [cfg80211]              [k] ftrace_define_fields_rdev_return_int_survey_info
  0.97%  [kernel]                [k] pskb_expand_head
  0.95%  [kernel]                [k] do_raw_spin_lock
  0.82%  [kernel]                [k] __slab_free
  0.78%  [kernel]                [k] __sk_run_filter
  0.71%  [kernel]                [k] __rcu_read_unlock
  0.67%  [kernel]                [k] __netif_receive_skb_core
  0.65%  [kernel]                [k] __rcu_read_lock
  0.62%  [kernel]                [k] build_skb
  0.59%  [mac80211]              [k] ieee80211_rx_handlers
  0.55%  [kernel]                [k] nf_iterate
  0.52%  [kernel]                [k] arch_local_irq_restore


Using WPA2, sw-crypt, download, UDP:


Samples: 52K of event 'cycles', Event count (approx.): 13162827574
 24.78%  btserver              [.] 0x00000000000c598c
 10.97%  [kernel]              [k] _aesni_enc1
  2.75%  [kernel]              [k] _aesni_enc4
  2.26%  [kernel]              [k] crypto_xor
  1.69%  [kernel]              [k] aesni_enc
  1.29%  [kernel]              [k] swiotlb_tbl_unmap_single
  1.21%  [kernel]              [k] copy_user_generic_string
  1.17%  [kernel]              [k] ioread32
  1.13%  [kernel]              [k] get_data_to_compute
  0.99%  [kernel]              [k] _raw_spin_lock_irqsave
  0.91%  [ath10k_core]         [k] ath10k_htt_txrx_compl_task
  0.70%  [kernel]              [k] __schedule
  0.70%  [kernel]              [k] native_write_msr_safe
  0.69%  [kernel]              [k] csum_partial
  0.62%  [kernel]              [k] pskb_expand_head
  0.62%  [kernel]              [k] __switch_to
  0.58%  [kernel]              [k] do_raw_spin_lock
  0.53%  [kernel]              [k] menu_select
  0.51%  [kernel]              [k] __rcu_read_unlock
  0.47%  [cfg80211]            [k] ftrace_define_fields_rdev_return_int_survey_info
  0.47%  [kernel]              [k] _aesni_inc
  0.47%  [kernel]              [k] __rcu_read_lock
  0.47%  [kernel]              [k] __sk_run_filter
  0.44%  [kernel]              [k] aesni_ctr_enc
  0.43%  [kernel]              [k] arch_local_irq_restore
  0.43%  [kernel]              [k] do_sys_poll
  0.42%  [kernel]              [k] __netif_receive_skb_core
  0.41%  [mac80211]            [k] ieee80211_rx_handlers
  0.38%  [kernel]              [k] update_cfs_shares


Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Looking for non-NIC hardware-offload for wpa2 decrypt.
  2014-08-12 18:34                       ` Ben Greear
@ 2014-08-14 12:39                         ` Christian Lamparter
  2014-08-14 17:09                           ` Ben Greear
  0 siblings, 1 reply; 21+ messages in thread
From: Christian Lamparter @ 2014-08-14 12:39 UTC (permalink / raw)
  To: Ben Greear; +Cc: Jouni Malinen, linux-wireless, Johannes Berg

On Tuesday, August 12, 2014 11:34:59 AM Ben Greear wrote:
> On 08/10/2014 06:44 AM, Christian Lamparter wrote:
> > On Thursday, August 07, 2014 10:45:01 AM Ben Greear wrote:
> >> On 08/07/2014 07:05 AM, Christian Lamparter wrote:
> >>> Or: for every 16 Bytes of payload there is one fpu context save and
> >>> restore... ouch!
> >>
> >> Any idea if it would work to put the fpu_begin/end a bit higher
> >> and do all those 16 byte chunks in a batch without messing with
> >> the FPU for each chunk?
> > 
> > It sort of works - see sample feature patch for aesni-intel-glue 
> > (taken from 3.16-wl). Older kernels (like 3.15, 3.14) need:
> > "crypto: allow blkcipher walks over AEAD data" [0] (and maybe more).
> > 
> > The FPU save/restore overhead should be gone. Also, if the aesni
> > instructions can't be used, the implementation will fall back
> > to the original ccm(aes) code. Calculating the MAC is still much
> > more expensive than the payload encryption or decryption. However,
> > I can't see a way of making this more efficient without rewriting
> > and combining the parts I took from crypto/ccm.c into several
> > dedicated assembler functions.
> 
> Without encryption, I see download rate of around 400 - 420Mbps.
>
> So, your patch looks like a good improvement to me, and I'll be
> happy to test further patches if you happen to do those assembler
> optimizations you talk about above.
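
The batching idea quoted above (moving the FPU begin/end pair out of the
per-16-byte loop) can be modelled in user space. This is only a sketch:
model_fpu_begin(), model_fpu_end() and model_aes_block() are hypothetical
stand-ins for kernel_fpu_begin(), kernel_fpu_end() and the AES-NI block
operation, and they merely count how often the FPU state would be saved.
It is not the code from the aesni-intel-glue patch.

```c
#include <stddef.h>

#define AES_BLOCK_SIZE 16

/* Hypothetical stand-ins for kernel_fpu_begin()/kernel_fpu_end():
 * they only count how many save/restore sections are entered. */
static unsigned long fpu_sections;
static void model_fpu_begin(void) { fpu_sections++; }
static void model_fpu_end(void) { }

/* Placeholder for one AES-NI block operation; performs no real crypto. */
static void model_aes_block(size_t offset) { (void)offset; }

/* One FPU section per 16-byte block: the pattern described as
 * "for every 16 Bytes of payload there is one fpu context save". */
static unsigned long fpu_sections_per_block(size_t len)
{
	size_t off;

	fpu_sections = 0;
	for (off = 0; off < len; off += AES_BLOCK_SIZE) {
		model_fpu_begin();
		model_aes_block(off);
		model_fpu_end();
	}
	return fpu_sections;
}

/* One FPU section around the whole payload: the batched variant. */
static unsigned long fpu_sections_batched(size_t len)
{
	size_t off;

	fpu_sections = 0;
	model_fpu_begin();
	for (off = 0; off < len; off += AES_BLOCK_SIZE)
		model_aes_block(off);
	model_fpu_end();
	return fpu_sections;
}
```

For a 1500-byte payload the per-block variant enters 94 sections while the
batched variant enters one. The model says nothing about what a single
save/restore actually costs; it only shows how the count scales with frame size.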

Maybe, that will depend on what the results for: "wpa2, *HW*-crypt,
download, udp" are.

> Let me know if you would like more/different performance
> stats. 

There's a test bench tool (tcrypt) to measure the performance
of any cipher. It would be interesting to know what
performance/throughput it can produce without the overhead
of any application. [Yep, I'm making a small patch to test that,
but not before Saturday next week].
  
> Here is perf top of open authentication, download, UDP:
> 
> Using WPA2, sw-crypt, download, UDP:
> 
> Samples: 52K of event 'cycles', Event count (approx.): 13162827574
>  24.78%  btserver              [.] 0x00000000000c598c
Is btserver your "udp download" test application? What does it do, as
it is accounting for nearly 25%?

Regards
Christian

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Looking for non-NIC hardware-offload for wpa2 decrypt.
  2014-08-14 12:39                         ` Christian Lamparter
@ 2014-08-14 17:09                           ` Ben Greear
  2014-08-19 18:18                             ` Ben Greear
  0 siblings, 1 reply; 21+ messages in thread
From: Ben Greear @ 2014-08-14 17:09 UTC (permalink / raw)
  To: Christian Lamparter; +Cc: Jouni Malinen, linux-wireless, Johannes Berg

On 08/14/2014 05:39 AM, Christian Lamparter wrote:
> On Tuesday, August 12, 2014 11:34:59 AM Ben Greear wrote:
>> On 08/10/2014 06:44 AM, Christian Lamparter wrote:
>>> On Thursday, August 07, 2014 10:45:01 AM Ben Greear wrote:
>>>> On 08/07/2014 07:05 AM, Christian Lamparter wrote:
>>>>> Or: for every 16 Bytes of payload there is one fpu context save and
>>>>> restore... ouch!
>>>>
>>>> Any idea if it would work to put the fpu_begin/end a bit higher
>>>> and do all those 16 byte chunks in a batch without messing with
>>>> the FPU for each chunk?
>>>
>>> It sort of works - see sample feature patch for aesni-intel-glue 
>>> (taken from 3.16-wl). Older kernels (like 3.15, 3.14) need:
>>> "crypto: allow blkcipher walks over AEAD data" [0] (and maybe more).
>>>
>>> The FPU save/restore overhead should be gone. Also, if the aesni
>>> instructions can't be used, the implementation will fall back
>>> to the original ccm(aes) code. Calculating the MAC is still much
>>> more expensive than the payload encryption or decryption. However,
>>> I can't see a way of making this more efficient without rewriting
>>> and combining the parts I took from crypto/ccm.c into several
>>> dedicated assembler functions.
>>
>> Without encryption, I see download rate of around 400 - 420Mbps.
>>
>> So, your patch looks like a good improvement to me, and I'll be
>> happy to test further patches if you happen to do those assembler
>> optimizations you talk about above.
> 
> Maybe, that will depend on what the results for: "wpa2, *HW*-crypt,
> download, udp" are.

I'll do that test sometime soon and post the results.

>> Let me know if you would like more/different performance
>> stats. 
> 
> There's a test bench tool (tcrypt) to measure the performance
> of any cipher. It would be interesting to know what
> performance/throughput it can produce without the overhead
> of any application. [Yep, I'm making a small patch to test that,
> but not before Saturday next week].
>   
>> Here is perf top of open authentication, download, UDP:
>>
>> Using WPA2, sw-crypt, download, UDP:
>>
>> Samples: 52K of event 'cycles', Event count (approx.): 13162827574
>>  24.78%  btserver              [.] 0x00000000000c598c
> Is btserver your "udp download" test application? What does it do, as
> it is accounting for nearly 25%?

btserver is our traffic generator.  In this case, it is mostly just
receiving UDP frames using non-blocking IO (via recvmmsg), but it also
does a fair bit of stats gathering and such.  It typically compares well
with iperf as far as throughput goes, but I'm sure it uses at least a bit
more CPU than iperf.
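
As a rough illustration of the receive pattern described above (this is not
btserver's code, which is not shown in this thread), a non-blocking batch
read with recvmmsg() might look like the sketch below. drain_batch(),
selftest_drain(), BATCH and PKT_SIZE are names made up for this example.
The self-test helper uses an AF_UNIX datagram socketpair so it runs without
any network setup; drain_batch() would be called the same way on a UDP socket.

```c
#define _GNU_SOURCE		/* recvmmsg() is a Linux/glibc extension */
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

#define BATCH    8		/* datagrams fetched per system call */
#define PKT_SIZE 2048		/* per-datagram receive buffer */

/* Fetch up to BATCH queued datagrams from fd in one system call.
 * Adds the received byte count to *bytes. Returns the number of
 * datagrams, 0 if nothing was queued, or -1 on error. */
static int drain_batch(int fd, unsigned long long *bytes)
{
	static char bufs[BATCH][PKT_SIZE];
	struct mmsghdr msgs[BATCH];
	struct iovec iov[BATCH];
	int i, n;

	memset(msgs, 0, sizeof(msgs));
	for (i = 0; i < BATCH; i++) {
		iov[i].iov_base = bufs[i];
		iov[i].iov_len = PKT_SIZE;
		msgs[i].msg_hdr.msg_iov = &iov[i];
		msgs[i].msg_hdr.msg_iovlen = 1;
	}

	/* MSG_DONTWAIT makes this call non-blocking. */
	n = recvmmsg(fd, msgs, BATCH, MSG_DONTWAIT, NULL);
	if (n < 0)
		return (errno == EAGAIN || errno == EWOULDBLOCK) ? 0 : -1;

	for (i = 0; i < n; i++)
		*bytes += msgs[i].msg_len;	/* simple stats gathering */
	return n;
}

/* Queue 'count' datagrams on a local socketpair and drain them. */
static int selftest_drain(int count)
{
	const char payload[] = "test datagram";
	unsigned long long bytes = 0;
	int sv[2], i, n;

	if (count > BATCH)
		count = BATCH;
	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
		return -1;
	for (i = 0; i < count; i++) {
		if (send(sv[0], payload, sizeof(payload), 0) < 0) {
			close(sv[0]);
			close(sv[1]);
			return -1;
		}
	}
	n = drain_batch(sv[1], &bytes);
	close(sv[0]);
	close(sv[1]);
	if (n >= 0 && bytes != (unsigned long long)n * sizeof(payload))
		return -1;
	return n;
}
```

Batching several datagrams per system call reduces syscall overhead per
frame, but the copy to user space and any per-packet bookkeeping still
show up as CPU time in the receiving process.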

Thanks,
Ben


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Looking for non-NIC hardware-offload for wpa2 decrypt.
  2014-08-14 17:09                           ` Ben Greear
@ 2014-08-19 18:18                             ` Ben Greear
  2014-08-20 20:47                               ` Christian Lamparter
  0 siblings, 1 reply; 21+ messages in thread
From: Ben Greear @ 2014-08-19 18:18 UTC (permalink / raw)
  To: Christian Lamparter; +Cc: Jouni Malinen, linux-wireless, Johannes Berg

On 08/14/2014 10:09 AM, Ben Greear wrote:
> On 08/14/2014 05:39 AM, Christian Lamparter wrote:
>> On Tuesday, August 12, 2014 11:34:59 AM Ben Greear wrote:
>>> On 08/10/2014 06:44 AM, Christian Lamparter wrote:
>>>> On Thursday, August 07, 2014 10:45:01 AM Ben Greear wrote:
>>>>> On 08/07/2014 07:05 AM, Christian Lamparter wrote:
>>>>>> Or: for every 16 Bytes of payload there is one fpu context save and
>>>>>> restore... ouch!
>>>>>
>>>>> Any idea if it would work to put the fpu_begin/end a bit higher
>>>>> and do all those 16 byte chunks in a batch without messing with
>>>>> the FPU for each chunk?
>>>>
>>>> It sort of works - see sample feature patch for aesni-intel-glue 
>>>> (taken from 3.16-wl). Older kernels (like 3.15, 3.14) need:
>>>> "crypto: allow blkcipher walks over AEAD data" [0] (and maybe more).
>>>>
>>>> The FPU save/restore overhead should be gone. Also, if the aesni
>>>> instructions can't be used, the implementation will fall back
>>>> to the original ccm(aes) code. Calculating the MAC is still much
>>>> more expensive than the payload encryption or decryption. However,
>>>> I can't see a way of making this more efficient without rewriting
>>>> and combining the parts I took from crypto/ccm.c into several
>>>> dedicated assembler functions.
>>>
>>> Without encryption, I see download rate of around 400 - 420Mbps.
>>>
>>> So, your patch looks like a good improvement to me, and I'll be
>>> happy to test further patches if you happen to do those assembler
>>> optimizations you talk about above.
>>
>> Maybe, that will depend on what the results for: "wpa2, *HW*-crypt,
>> download, udp" are.
> 
> I'll do that test sometime soon and post the results.

I ran that today, and I get about the same throughput with hw-crypt or
sw-crypt (350-355Mbps UDP download goodput).

I still see 400+Mbps with Open authentication.

So, maybe the bottleneck now is elsewhere...

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Looking for non-NIC hardware-offload for wpa2 decrypt.
  2014-08-19 18:18                             ` Ben Greear
@ 2014-08-20 20:47                               ` Christian Lamparter
  2014-08-20 21:04                                 ` Ben Greear
  0 siblings, 1 reply; 21+ messages in thread
From: Christian Lamparter @ 2014-08-20 20:47 UTC (permalink / raw)
  To: Ben Greear; +Cc: Jouni Malinen, linux-wireless, Johannes Berg

On Tuesday, August 19, 2014 11:18:39 AM Ben Greear wrote:
> On 08/14/2014 10:09 AM, Ben Greear wrote:
> > On 08/14/2014 05:39 AM, Christian Lamparter wrote:
> >> On Tuesday, August 12, 2014 11:34:59 AM Ben Greear wrote:
> >>>
> >>> Without encryption, I see download rate of around 400 - 420Mbps.
> >>>
> >>> So, your patch looks like a good improvement to me, and I'll be
> >>> happy to test further patches if you happen to do those assembler
> >>> optimizations you talk about above.
> >>
> >> Maybe, that will depend on what the results for: "wpa2, *HW*-crypt,
> >> download, udp" are.
> > 
> > I'll do that test sometime soon and post the results.
> 
> I ran that today, and I get about the same throughput with hw-crypt or
> sw-crypt (350-355Mbps UDP download goodput).
> 
> I still see 400+Mbps with Open authentication.
> 
> So, maybe the bottleneck now is elsewhere...
Can you rule out that the "udp generator" (either the application
or the hardware) is now the bottleneck for this test? [Does the
datasheet mention the throughput of the hw-crypto? Or do you know
someone at QCA who can tell you if the hardware is filling up
the aggregates with additional padding to meet the MPDU start
spacing?]

I'll look into the assembler implementation of aes-ccm. But I'm
afraid that this won't increase the throughput (and only decrease
the load on the CPU a bit).

Also, just for fun: what goodput can you achieve over gbit ethernet?
[Because ethernet is also affected by filtering, bridging,
pcie-throughput... if it is set up in the same way, you could
rule out that iptables, its friends or the pcie-port is a
bottleneck].

Regards
Christian

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Looking for non-NIC hardware-offload for wpa2 decrypt.
  2014-08-20 20:47                               ` Christian Lamparter
@ 2014-08-20 21:04                                 ` Ben Greear
  2014-08-22 22:55                                   ` Christian Lamparter
  0 siblings, 1 reply; 21+ messages in thread
From: Ben Greear @ 2014-08-20 21:04 UTC (permalink / raw)
  To: Christian Lamparter; +Cc: Jouni Malinen, linux-wireless, Johannes Berg



On 08/20/2014 01:47 PM, Christian Lamparter wrote:
> On Tuesday, August 19, 2014 11:18:39 AM Ben Greear wrote:
>> On 08/14/2014 10:09 AM, Ben Greear wrote:
>>> On 08/14/2014 05:39 AM, Christian Lamparter wrote:
>>>> On Tuesday, August 12, 2014 11:34:59 AM Ben Greear wrote:
>>>>>
>>>>> Without encryption, I see download rate of around 400 - 420Mbps.
>>>>>
>>>>> So, your patch looks like a good improvement to me, and I'll be
>>>>> happy to test further patches if you happen to do those assembler
>>>>> optimizations you talk about above.
>>>>
>>>> Maybe, that will depend on what the results for: "wpa2, *HW*-crypt,
>>>> download, udp" are.
>>>
>>> I'll do that test sometime soon and post the results.
>>
>> I ran that today, and I get about the same throughput with hw-crypt or
>> sw-crypt (350-355Mbps UDP download goodput).
>>
>> I still see 400+Mbps with Open authentication.
>>
>> So, maybe the bottleneck now is elsewhere...
> Can you rule out that the "udp generator" (either the application
> or the hardware) is now the bottleneck for this test? [Does the
> datasheet mention the throughput of the hw-crypto? Or do you know
> someone at QCA who can tell you if the hardware is filling up
> the aggregates with additional padding to meet the MPDU start
> spacing?]

It is unlikely the UDP generator acts differently for encrypted vs. open
traffic, and since the NIC is supposed to do offload in hw-crypt mode,
the rest of the stack should be similar as well.

Other ath10k users report similar open & wpa2 throughput, so
it may be something in my kernel or firmware or configs.
I will run some additional tests when I get a chance...

> I'll look into the assembler implementation of aes-ccm. But I'm
> afraid that this won't increase the throughput (and only decrease
> the load on the CPU a bit).

I think you are right, and probably it is not worth much effort at
this point, at least as far as my setup is concerned.

> Also, just for fun: what goodput can you achieve over gbit ethernet?
> [Because ethernet is also affected by filtering, bridging,
> pcie-throughput... if it is set up in the same way, you could
> rule out that iptables, its friends or the pcie-port is a
> bottleneck].

Since Open runs faster, it shouldn't be a PCIe bus or CPU bottleneck.
This class of system can generally sustain near 1 Gbps throughput
on wired Ethernet.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Looking for non-NIC hardware-offload for wpa2 decrypt.
  2014-08-20 21:04                                 ` Ben Greear
@ 2014-08-22 22:55                                   ` Christian Lamparter
  0 siblings, 0 replies; 21+ messages in thread
From: Christian Lamparter @ 2014-08-22 22:55 UTC (permalink / raw)
  To: Ben Greear; +Cc: Jouni Malinen, linux-wireless, Johannes Berg

On Wednesday, August 20, 2014 02:04:35 PM Ben Greear wrote:
> On 08/20/2014 01:47 PM, Christian Lamparter wrote:
> 
> > I'll look into the assembler implementation of aes-ccm. But I'm
> > afraid that this won't increase the throughput (and only decrease
> > the load on the CPU a bit).
> 
> I think you are right, and probably it is not worth much effort at
> this point, at least as far as my setup is concerned.

"There's a test bench tool (tcrypt) to measure the performance
of any cipher. It would be interesting to know what
performance/throughput it can produce without the overhead
of any application. ..."

Here it is: the module is located in crypto/tcrypt

module parameters:
 - mode=212 (original ccm)
 - mode=213 (ccm-aesni)
 - sec=1 (length in seconds of the speed tests)

This will test the speed of the ccm implementation at
different block sizes for one second.

BTW: any luck figuring out whether there are any other obvious
bottlenecks (other than btserver, checksumming, ...)?

Regards
Christian

---
diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c
index 890449e..7675a13 100644
--- a/crypto/tcrypt.c
+++ b/crypto/tcrypt.c
@@ -354,8 +354,10 @@ static void test_aead_speed(const char *algo, int enc, unsigned int secs,
 			ret = crypto_aead_setauthsize(tfm, authsize);
 
 			iv_len = crypto_aead_ivsize(tfm);
-			if (iv_len)
-				memset(&iv, 0xff, iv_len);
+			if (iv_len) {
+				for (j = 0; j < iv_len; j++)
+					iv[j] = j + 1;
+			}
 
 			crypto_aead_clear_flags(tfm, ~0);
 			printk(KERN_INFO "test %u (%d bit key, %d byte blocks): ",
@@ -1751,6 +1753,15 @@ static int do_test(int m)
 				NULL, 0, 16, 8, aead_speed_template_20);
 		break;
 
+	case 212:
+		test_aead_speed("ccm_base(ctr(aes-aesni),aes-aesni)", ENCRYPT, sec,
+				NULL, 0, 16, 8, aead_speed_template_16);
+		break;
+	case 213:
+		test_aead_speed("ccm-aes-aesni", ENCRYPT, sec,
+				NULL, 0, 16, 8, aead_speed_template_16);
+		break;
+
 	case 300:
 		/* fall through */
 
diff --git a/crypto/tcrypt.h b/crypto/tcrypt.h
index 6c7e21a..88f152d 100644
--- a/crypto/tcrypt.h
+++ b/crypto/tcrypt.h
@@ -66,6 +66,7 @@ static u8 speed_template_32_64[] = {32, 64, 0};
  * AEAD speed tests
  */
 static u8 aead_speed_template_20[] = {20, 0};
+static u8 aead_speed_template_16[] = {16, 0};
 
 /*
  * Digest speed tests
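
The message does not say why the IV fill in test_aead_speed() changes from
memset(0xff) to iv[j] = j + 1. One plausible reason, as far as I can tell:
CCM interprets the first IV byte as L' = L - 1 and crypto/ccm.c only accepts
values from 1 to 7, so an all-0xff IV would be rejected. The user-space
sketch below mirrors that range check; ccm_iv0_ok(), old_fill_ok(),
new_fill_ok() and CCM_IV_LEN are names invented here, and the 16-byte IV
length is an assumption.

```c
#include <stddef.h>
#include <string.h>

#define CCM_IV_LEN 16	/* assumed IV size for ccm(aes) */

/* Range check on the first IV byte (L' = L - 1, 1 <= L' <= 7).
 * Returns 1 if the byte is acceptable, 0 otherwise. */
static int ccm_iv0_ok(const unsigned char *iv)
{
	return iv[0] >= 1 && iv[0] <= 7;
}

/* Old tcrypt behaviour: every IV byte is 0xff. */
static int old_fill_ok(void)
{
	unsigned char iv[CCM_IV_LEN];

	memset(iv, 0xff, sizeof(iv));
	return ccm_iv0_ok(iv);
}

/* Patched behaviour: iv[j] = j + 1, so iv[0] is 1. */
static int new_fill_ok(void)
{
	unsigned char iv[CCM_IV_LEN];
	size_t j;

	for (j = 0; j < sizeof(iv); j++)
		iv[j] = (unsigned char)(j + 1);
	return ccm_iv0_ok(iv);
}
```

With the patched fill the first byte is 1, which corresponds to a 2-byte
length field; the old fill fails the check because 0xff is outside 1..7.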
 


^ permalink raw reply related	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2014-08-22 22:55 UTC | newest]

Thread overview: 21+ messages
2014-03-31  4:40 Looking for non-NIC hardware-offload for wpa2 decrypt Ben Greear
2014-03-31 18:09 ` Christian Lamparter
2014-07-28 20:50   ` Ben Greear
2014-07-29 22:29     ` Christian Lamparter
2014-07-29 22:50       ` Ben Greear
2014-07-30 18:59         ` Christian Lamparter
2014-07-30 19:08           ` Ben Greear
2014-07-31 20:05           ` Jouni Malinen
2014-07-31 20:45             ` Christian Lamparter
2014-08-05 23:09               ` Ben Greear
2014-08-07 14:05                 ` Christian Lamparter
2014-08-07 17:45                   ` Ben Greear
2014-08-10 13:44                     ` Christian Lamparter
2014-08-12 18:34                       ` Ben Greear
2014-08-14 12:39                         ` Christian Lamparter
2014-08-14 17:09                           ` Ben Greear
2014-08-19 18:18                             ` Ben Greear
2014-08-20 20:47                               ` Christian Lamparter
2014-08-20 21:04                                 ` Ben Greear
2014-08-22 22:55                                   ` Christian Lamparter
2014-07-30  7:06       ` Johannes Berg
