* [WireGuard] News about MIPS and ARM optimized code?
@ 2016-08-08 13:23 René van Dorst
2016-08-08 14:29 ` Jason A. Donenfeld
0 siblings, 1 reply; 12+ messages in thread
From: René van Dorst @ 2016-08-08 13:23 UTC (permalink / raw)
To: wireguard
News about MIPS and ARM optimized code?
Greetings,
René van Dorst.
* Re: [WireGuard] News about MIPS and ARM optimized code?
2016-08-08 13:23 [WireGuard] News about MIPS and ARM optimized code? René van Dorst
@ 2016-08-08 14:29 ` Jason A. Donenfeld
2016-09-08 11:57 ` René van Dorst
0 siblings, 1 reply; 12+ messages in thread
From: Jason A. Donenfeld @ 2016-08-08 14:29 UTC (permalink / raw)
To: René van Dorst; +Cc: WireGuard mailing list
Would you like to write it?
* Re: [WireGuard] News about MIPS and ARM optimized code?
2016-08-08 14:29 ` Jason A. Donenfeld
@ 2016-09-08 11:57 ` René van Dorst
2016-09-09 13:46 ` René van Dorst
0 siblings, 1 reply; 12+ messages in thread
From: René van Dorst @ 2016-09-08 11:57 UTC (permalink / raw)
Cc: WireGuard mailing list
I tried writing some MIPS32r2 code.
I wrote chacha20_keysetup, chacha20_generic_block and
poly1305_generic_blocks in assembly.
I tried to keep all needed variables in registers, which should
reduce the number of memory accesses.
However, it is very difficult for me to profile the code or to isolate
it into standalone benchmark programs the way SUPERCOP does.
So testing was simple: cross-compile the code, copy the module to the
target and load it, then run the setup script and iperf.
#ifdef CONFIG_CPU_MIPS32_R2
asmlinkage void chacha20_keysetup(struct chacha20_ctx *ctx,
                                  const u8 key[static 32],
                                  const u8 nonce[static 8]);
asmlinkage void chacha20_generic_block(struct chacha20_ctx *ctx);
asmlinkage unsigned int poly1305_generic_blocks(struct poly1305_ctx *ctx,
                                                const u8 *src,
                                                unsigned int srclen,
                                                u32 hibit);
#endif
However, the speed is equal or lower on my TP-Link WR1043ND, which is a
big-endian MIPS32r2 24Kc.
So GCC does a good job. Also, the 24Kc has no special coprocessors or FPU.
The biggest improvement I got came from changing the buildroot default
optimization from -Os to -O2.
This gives a speed improvement of around 1-3%.
Ideas:
- Remove the little-endian conversions on MIPS.
Of course, the other side would have to do the same.
On this device I can't switch endianness.
But I did not see any improvement. Swapping a 32-bit register needs
2 instructions.
After a quick calculation it could save around 0.4%, which is
~0.1 Mbit/s on this device.
Greetings,
René van Dorst.
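For readers unfamiliar with the cost mentioned above: a 32-bit byte swap can be built from two simple steps, which is presumably what the "2 instructions" refers to (on MIPS32r2 these would be wsbh followed by a 16-bit rotr; that mapping is an assumption, the message does not name the instructions). A minimal C sketch, with hypothetical function names:

```c
#include <stdint.h>

/* Step 1: swap the two bytes inside each 16-bit half of the word.
 * (Assumed to correspond to the MIPS32r2 wsbh instruction.) */
static uint32_t swap_bytes_in_halves(uint32_t x)
{
    return ((x & 0x00ff00ffu) << 8) | ((x & 0xff00ff00u) >> 8);
}

/* Step 2: rotate the word right by 16 bits, exchanging the two halves. */
static uint32_t rotr16(uint32_t x)
{
    return (x >> 16) | (x << 16);
}

/* Full 32-bit endian swap built from the two steps above. */
static uint32_t bswap32_two_step(uint32_t x)
{
    return rotr16(swap_bytes_in_halves(x));
}
```

Calling bswap32_two_step(0x11223344) yields 0x44332211. Because the swap is this cheap per word, removing it would be expected to save little, which fits the ~0.4% estimate in the message.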
* Re: [WireGuard] News about MIPS and ARM optimized code?
2016-09-08 11:57 ` René van Dorst
@ 2016-09-09 13:46 ` René van Dorst
2016-09-09 13:52 ` Baptiste Jonglez
0 siblings, 1 reply; 12+ messages in thread
From: René van Dorst @ 2016-09-09 13:46 UTC (permalink / raw)
To: wireguard
Misaligned data fetches in functions like poly1305 cause the slowdown
on MIPS:
h0 += (le32_to_cpuvp(src + 0) >> 0) & 0x3ffffff;
h1 += (le32_to_cpuvp(src + 3) >> 2) & 0x3ffffff;
h2 += (le32_to_cpuvp(src + 6) >> 4) & 0x3ffffff;
h3 += (le32_to_cpuvp(src + 9) >> 6) & 0x3ffffff;
h4 += (le32_to_cpuvp(src + 12) >> 8) | hibit;
I had 26 Mbit/s before; now it is over 42 Mbit/s.
root@lede:~# iperf3 -c 10.0.0.1 -i 10
Connecting to host 10.0.0.1, port 5201
[ 4] local 10.0.0.2 port 36216 connected to 10.0.0.1 port 5201
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 0.00-10.08 sec 51.2 MBytes 42.7 Mbits/sec 0 171 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-10.08 sec 51.2 MBytes 42.7 Mbits/sec 0 sender
[ 4] 0.00-10.08 sec 51.2 MBytes 42.7 Mbits/sec receiver
iperf Done.
root@lede:~# iperf3 -c 10.0.0.1 -u -b 1G -i 10
Connecting to host 10.0.0.1, port 5201
[ 4] local 10.0.0.2 port 60714 connected to 10.0.0.1 port 5201
[ ID] Interval Transfer Bandwidth Total Datagrams
[ 4] 0.00-10.00 sec 56.3 MBytes 47.2 Mbits/sec 7209
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams
[ 4] 0.00-10.00 sec 56.3 MBytes 47.2 Mbits/sec 0.034 ms 0/7209 (0%)
[ 4] Sent 7209 datagrams
iperf Done.
root@lede:~#
The work is not done yet, but it is a good start.
Greetings,
René van Dorst.
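The loads at src + 3, src + 6 and src + 9 in the lines quoted above can never all be word-aligned, which is why they hurt on CPUs without cheap unaligned access. The thread does not show how the fix was implemented; one common approach, sketched here under that caveat, is to assemble each little-endian word from single bytes (the kernel's get_unaligned_le32() helper serves the same purpose). The names load_le32 and poly1305_add_block are hypothetical.

```c
#include <stdint.h>

/* Read a 32-bit little-endian value one byte at a time. This never issues
 * a misaligned word load and gives the same result on big-endian and
 * little-endian CPUs, so no separate byte swap is needed. */
static uint32_t load_le32(const uint8_t *p)
{
    return (uint32_t)p[0] |
           ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) |
           ((uint32_t)p[3] << 24);
}

/* Add one 16-byte message block to the accumulator h as five 26-bit
 * limbs, mirroring the five lines quoted in the message above. */
static void poly1305_add_block(uint32_t h[5], const uint8_t src[16],
                               uint32_t hibit)
{
    h[0] += (load_le32(src + 0) >> 0) & 0x3ffffff;
    h[1] += (load_le32(src + 3) >> 2) & 0x3ffffff;
    h[2] += (load_le32(src + 6) >> 4) & 0x3ffffff;
    h[3] += (load_le32(src + 9) >> 6) & 0x3ffffff;
    h[4] += (load_le32(src + 12) >> 8) | hibit;
}
```

Whether byte-wise loads or another technique (for example the MIPS lwl/lwr pair) is faster on a given core would need to be measured; this sketch only shows the portable form.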
* Re: [WireGuard] News about MIPS and ARM optimized code?
2016-09-09 13:46 ` René van Dorst
@ 2016-09-09 13:52 ` Baptiste Jonglez
2016-09-09 15:22 ` René van Dorst
2016-09-14 8:10 ` jens
0 siblings, 2 replies; 12+ messages in thread
From: Baptiste Jonglez @ 2016-09-09 13:52 UTC (permalink / raw)
To: René van Dorst; +Cc: wireguard
Nice work! I had tried to write chacha20_generic_block in MIPS assembly,
but I got confused with endianness issues and the code didn't work in the
end.
Is your code available somewhere? I'd be happy to test on a variety of
MIPS routers.
* Re: [WireGuard] News about MIPS and ARM optimized code?
2016-09-09 13:52 ` Baptiste Jonglez
@ 2016-09-09 15:22 ` René van Dorst
2016-09-09 19:49 ` René van Dorst
2016-09-14 8:10 ` jens
1 sibling, 1 reply; 12+ messages in thread
From: René van Dorst @ 2016-09-09 15:22 UTC (permalink / raw)
To: Baptiste Jonglez; +Cc: wireguard
Not yet.
But I think more platforms suffer from this misaligned memory fetching.
So if someone also fixes this in the C code, it will boost
performance even without the assembly version.
Greetings,
René
* Re: [WireGuard] News about MIPS and ARM optimized code?
2016-09-09 15:22 ` René van Dorst
@ 2016-09-09 19:49 ` René van Dorst
2016-09-14 7:16 ` René van Dorst
0 siblings, 1 reply; 12+ messages in thread
From: René van Dorst @ 2016-09-09 19:49 UTC (permalink / raw)
To: wireguard
Here is my latest source code: https://github.com/vDorst/wireguard/tree/mips32r2
It includes the long history of trial and error ;).
But it also contains good ideas, such as reordering the code for better
data dependencies, which makes the code less readable but more efficient.
This is the assembly part:
https://github.com/vDorst/wireguard/blob/mips32r2/src/crypto/chacha20-mips32r2.S
Functions created:
* asmlinkage void chacha20_keysetup(struct chacha20_ctx *ctx, const u8 key[static 32], const u8 nonce[static 8]);
* asmlinkage void chacha20_generic_block(struct chacha20_ctx *ctx);
* asmlinkage unsigned int poly1305_generic_blocks(struct poly1305_ctx *ctx, const u8 *src, unsigned int srclen, u32 hibit);
poly1305_generic_blocks is fixed in the latest commit.
The code is written for big-endian MIPS32r2.
It has some defines keyed on __ORDER_BIG_ENDIAN__ that enable the endian
swap for the data, but it has not been tested on little endian.
Todo:
* Change the C code to see how fast that is and set a benchmark baseline.
* See whether I can optimize the assembly version even more.
Greetings,
René van Dorst.
* Re: [WireGuard] News about MIPS and ARM optimized code?
2016-09-09 19:49 ` René van Dorst
@ 2016-09-14 7:16 ` René van Dorst
2016-09-20 20:39 ` Jason A. Donenfeld
2016-09-27 1:48 ` Jason A. Donenfeld
0 siblings, 2 replies; 12+ messages in thread
From: René van Dorst @ 2016-09-14 7:16 UTC (permalink / raw)
To: wireguard
An update on my current findings.
The biggest improvement I have seen so far comes from writing and
optimizing the poly1305_generic_blocks function.
This gives an improvement of more than 1%.
I also noticed that the ping time does not change.
The improvement at the moment is around UDP: ~1.47%, TCP: ~1.68% on large
transfers like iperf.
WireGuard, mixed assembly and C variant:
https://github.com/vDorst/wireguard/commit/6f9187c325ee883b1f2b9f9da3deb0a61655b504
root@lede:~# iperf3 -c 10.0.0.1 -i 10 -u -b 1G -t 60
Connecting to host 10.0.0.1, port 5201
[ 4] local 10.0.0.2 port 47996 connected to 10.0.0.1 port 5201
[ ID] Interval Transfer Bandwidth Total Datagrams
[ 4] 0.00-10.00 sec 57.5 MBytes 48.2 Mbits/sec 7354
[ 4] 10.00-20.00 sec 57.4 MBytes 48.2 Mbits/sec 7350
[ 4] 20.00-30.00 sec 57.4 MBytes 48.2 Mbits/sec 7353
[ 4] 30.00-40.00 sec 57.5 MBytes 48.2 Mbits/sec 7356
[ 4] 40.00-50.00 sec 57.5 MBytes 48.2 Mbits/sec 7357
[ 4] 50.00-60.00 sec 57.5 MBytes 48.2 Mbits/sec 7358
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams
[ 4] 0.00-60.00 sec 345 MBytes 48.2 Mbits/sec 0.037 ms 0/44128 (0%)
[ 4] Sent 44128 datagrams
root@lede:~# iperf3 -c 10.0.0.1 -i 10 -b 1G -t 60
Connecting to host 10.0.0.1, port 5201
[ 4] local 10.0.0.2 port 37950 connected to 10.0.0.1 port 5201
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 0.00-10.14 sec 52.5 MBytes 43.4 Mbits/sec 0 147 KBytes
[ 4] 10.14-20.02 sec 51.2 MBytes 43.5 Mbits/sec 0 147 KBytes
[ 4] 20.02-30.14 sec 52.5 MBytes 43.5 Mbits/sec 0 147 KBytes
[ 4] 30.14-40.01 sec 51.2 MBytes 43.5 Mbits/sec 0 147 KBytes
[ 4] 40.01-50.16 sec 52.5 MBytes 43.4 Mbits/sec 0 220 KBytes
[ 4] 50.16-60.01 sec 42.5 MBytes 36.2 Mbits/sec 0 220 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-60.01 sec 302 MBytes 42.3 Mbits/sec 0 sender
[ 4] 0.00-60.01 sec 302 MBytes 42.3 Mbits/sec receiver
WireGuard, C-only variant:
https://github.com/vDorst/wireguard/commit/13fae657624aac6b9c1f411aa6472a91aae7fcc3
root@lede:~# iperf3 -c 10.0.0.1 -i 10 -u -b 1G -t 60
Connecting to host 10.0.0.1, port 5201
[ 4] local 10.0.0.2 port 40439 connected to 10.0.0.1 port 5201
[ ID] Interval Transfer Bandwidth Total Datagrams
[ 4] 0.00-10.00 sec 56.6 MBytes 47.5 Mbits/sec 7246
[ 4] 10.00-20.00 sec 56.6 MBytes 47.5 Mbits/sec 7243
[ 4] 20.00-30.00 sec 56.6 MBytes 47.5 Mbits/sec 7244
[ 4] 30.00-40.00 sec 56.6 MBytes 47.5 Mbits/sec 7245
[ 4] 40.00-50.00 sec 56.6 MBytes 47.5 Mbits/sec 7245
[ 4] 50.00-60.00 sec 56.6 MBytes 47.5 Mbits/sec 7247
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams
[ 4] 0.00-60.00 sec 340 MBytes 47.5 Mbits/sec 0.039 ms 0/43470 (0%)
[ 4] Sent 43470 datagrams
root@lede:~# iperf3 -c 10.0.0.1 -i 10 -b 1G -t 60
Connecting to host 10.0.0.1, port 5201
[ 4] local 10.0.0.2 port 37956 connected to 10.0.0.1 port 5201
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 0.00-10.02 sec 49.6 MBytes 41.5 Mbits/sec 0 137 KBytes
[ 4] 10.02-20.00 sec 49.6 MBytes 41.7 Mbits/sec 0 209 KBytes
[ 4] 20.00-30.02 sec 49.6 MBytes 41.6 Mbits/sec 0 209 KBytes
[ 4] 30.02-40.01 sec 49.2 MBytes 41.3 Mbits/sec 0 209 KBytes
[ 4] 40.01-50.02 sec 49.6 MBytes 41.6 Mbits/sec 0 209 KBytes
[ 4] 50.02-60.02 sec 49.6 MBytes 41.6 Mbits/sec 0 209 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-60.02 sec 297 MBytes 41.6 Mbits/sec 0 sender
[ 4] 0.00-60.02 sec 297 MBytes 41.6 Mbits/sec receiver
Greetings,
René van Dorst.
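The ~1.47% (UDP) and ~1.68% (TCP) figures quoted in this message follow from the 60-second summary rows of the two runs: 48.2 vs 47.5 Mbit/s for UDP and 42.3 vs 41.6 Mbit/s for TCP. A small sketch of that arithmetic, with a hypothetical helper name:

```c
/* Relative improvement of new_rate over old_rate, in percent.
 * Both rates must use the same unit (here Mbit/s). */
static double improvement_percent(double old_rate, double new_rate)
{
    return (new_rate - old_rate) / old_rate * 100.0;
}
```

With the values above, improvement_percent(47.5, 48.2) is about 1.47 and improvement_percent(41.6, 42.3) is about 1.68. Note that the TCP average of the assembly run includes one slower final interval (36.2 Mbit/s), so the steady-state difference may be somewhat larger.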
* Re: [WireGuard] News about MIPS and ARM optimized code?
2016-09-09 13:52 ` Baptiste Jonglez
2016-09-09 15:22 ` René van Dorst
@ 2016-09-14 8:10 ` jens
1 sibling, 0 replies; 12+ messages in thread
From: jens @ 2016-09-14 8:10 UTC (permalink / raw)
To: wireguard
On 09.09.2016 15:52, Baptiste Jonglez wrote:
> Nice work! I had tried to write chacha20_generic_block in MIPS assembly,
> but I got confused with endianness issues and the code didn't work in the
> end.
>
> Is your code available somewhere? I'd be happy to test on a variety of
> MIPS routers.
I built some LEDE images with René van Dorst's patch, but I have no time
to actually test them. If someone has time ...
Here are the links for the 841-v11, which we want to test specifically,
and the link for more devices (only built with the patch applied):
patch: openfreiburg.de/freifunk/firmware/lede/chacha20poly1305.c_patch1
841 stuff: openfreiburg.de/freifunk/firmware/lede/
More LEDE build output is also there (other images and packages).
jens
* Re: [WireGuard] News about MIPS and ARM optimized code?
2016-09-14 7:16 ` René van Dorst
@ 2016-09-20 20:39 ` Jason A. Donenfeld
2016-09-22 18:27 ` René van Dorst
2016-09-27 1:48 ` Jason A. Donenfeld
1 sibling, 1 reply; 12+ messages in thread
From: Jason A. Donenfeld @ 2016-09-20 20:39 UTC (permalink / raw)
To: René van Dorst; +Cc: WireGuard mailing list
Hey René,
That's excellent. Thanks for writing that. I'll review this implementation.
Is your speed up compared to your unaligned optimization from the
other patch? Or is that against vanilla?
With only a 1% increase, I'm first interested to see where precisely
that improvement is coming from, and if we could squeeze that out of
gcc instead, so that they're producing more or less the same code.
Regards,
Jason
* Re: [WireGuard] News about MIPS and ARM optimized code?
2016-09-20 20:39 ` Jason A. Donenfeld
@ 2016-09-22 18:27 ` René van Dorst
0 siblings, 0 replies; 12+ messages in thread
From: René van Dorst @ 2016-09-22 18:27 UTC (permalink / raw)
To: Jason A. Donenfeld; +Cc: WireGuard mailing list
Hi Jason,
I am using the LEDE project's default kernel.
My comparison is only between the patched C version with the aligned
memory reads and my assembly version of the module.
I think the code is too complex for GCC to optimize, so it follows the
code to the letter.
This results in a lot of data hazards.
By scheduling the instructions by hand you can avoid many of them.
The trick is to do two things at once by weaving the code together,
which results in less maintainable code.
Greetings,
René van Dorst.
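The "weaving" idea can be illustrated in C, although the message does not say which pieces of code were interleaved; the choice of two ChaCha20 column quarter rounds below is an assumption made for illustration, and all function names are hypothetical. Inside one quarter round every step depends on the previous result, but two different columns touch disjoint state words, so their steps can alternate and an in-order pipeline always has an independent instruction available.

```c
#include <stdint.h>

#define ROTL32(v, n) (((v) << (n)) | ((v) >> (32 - (n))))

/* Plain ChaCha quarter round: each step needs the result of the one
 * before it, so the steps form a single dependency chain. */
static void quarter_round(uint32_t x[16], int a, int b, int c, int d)
{
    x[a] += x[b]; x[d] = ROTL32(x[d] ^ x[a], 16);
    x[c] += x[d]; x[b] = ROTL32(x[b] ^ x[c], 12);
    x[a] += x[b]; x[d] = ROTL32(x[d] ^ x[a], 8);
    x[c] += x[d]; x[b] = ROTL32(x[b] ^ x[c], 7);
}

/* Reference order: finish column 0, then column 1. */
static void two_columns_sequential(uint32_t x[16])
{
    quarter_round(x, 0, 4, 8, 12);
    quarter_round(x, 1, 5, 9, 13);
}

/* The same work with the two chains woven together: the left statement
 * belongs to column 0, the right one to column 1, and neither waits on
 * the other. */
static void two_columns_interleaved(uint32_t x[16])
{
    x[0] += x[4];                     x[1] += x[5];
    x[12] = ROTL32(x[12] ^ x[0], 16); x[13] = ROTL32(x[13] ^ x[1], 16);
    x[8] += x[12];                    x[9] += x[13];
    x[4] = ROTL32(x[4] ^ x[8], 12);   x[5] = ROTL32(x[5] ^ x[9], 12);
    x[0] += x[4];                     x[1] += x[5];
    x[12] = ROTL32(x[12] ^ x[0], 8);  x[13] = ROTL32(x[13] ^ x[1], 8);
    x[8] += x[12];                    x[9] += x[13];
    x[4] = ROTL32(x[4] ^ x[8], 7);    x[5] = ROTL32(x[5] ^ x[9], 7);
}
```

Both functions produce identical state. An optimizing compiler is free to reorder either version, so in C the difference is mostly illustrative; in hand-written assembly the instruction order is exactly what the CPU executes, which is where the gain described above would come from.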
* Re: [WireGuard] News about MIPS and ARM optimized code?
2016-09-14 7:16 ` René van Dorst
2016-09-20 20:39 ` Jason A. Donenfeld
@ 2016-09-27 1:48 ` Jason A. Donenfeld
1 sibling, 0 replies; 12+ messages in thread
From: Jason A. Donenfeld @ 2016-09-27 1:48 UTC (permalink / raw)
To: René van Dorst; +Cc: WireGuard mailing list
Hey René,
I've begun trying to integrate your excellent work into WireGuard in
the branch rvh/mips:
https://git.zx2c4.com/WireGuard/commit/?h=rvd/mips
It seems like there's still a bit of cleaning up and polishing to do,
but it's headed in a great direction. There's a lot of weird
formatting and general inconsistency to clean up. I'll do a review of
the crypto as we get rolling here.
To make things easier, I gave you commit access to the rvh/mips branch
in the repo. Feel free to do with this what you like, and when we're
ready, I'll merge it to master.
$ git clone ssh://git@git.zx2c4.com/WireGuard
$ cd WireGuard
$ git checkout -b rvh/mips origin/rvh/mips
$ edit code...
$ git commit...
$ git push
That general flow should work for you, using your Github SSH key. Let
me know if there are any issues, and feel free to poke me on irc
(zx2c4 on freenode -- #wireguard).
Talk soon,
Jason
end of thread, other threads:[~2016-09-27 1:38 UTC | newest]
Thread overview: 12+ messages
2016-08-08 13:23 [WireGuard] News about MIPS and ARM optimized code? René van Dorst
2016-08-08 14:29 ` Jason A. Donenfeld
2016-09-08 11:57 ` René van Dorst
2016-09-09 13:46 ` René van Dorst
2016-09-09 13:52 ` Baptiste Jonglez
2016-09-09 15:22 ` René van Dorst
2016-09-09 19:49 ` René van Dorst
2016-09-14 7:16 ` René van Dorst
2016-09-20 20:39 ` Jason A. Donenfeld
2016-09-22 18:27 ` René van Dorst
2016-09-27 1:48 ` Jason A. Donenfeld
2016-09-14 8:10 ` jens