* crypto: caam from tasklet to threadirq
       [not found] <DB5PR04MB130229BBD433C8FF7BDD975FEEF90@DB5PR04MB1302.eurprd04.prod.outlook.com>
@ 2016-09-16 14:01 ` Cata Vasile
  2016-09-16 16:53   ` Russell King - ARM Linux
  2016-09-20 16:12   ` Russell King - ARM Linux
  0 siblings, 2 replies; 6+ messages in thread
From: Cata Vasile @ 2016-09-16 14:01 UTC (permalink / raw)
  To: rmk+kernel; +Cc: Horia Geanta Neag, linux-crypto

Hi,

We've tried to test and benchmark your submitted work[1].

Cryptographic offloading is also used in IPsec in the Linux Kernel. In heavy traffic scenarios, the NIC driver competes with the crypto device driver. Most NICs use the NAPI context, which is one of the most prioritized context types. In IPsec scenarios the performance is trashed because, although raw data gets into the device and is encrypted/decrypted, the dequeue code in the CAAM driver has a hard time being scheduled to actually call the callback that notifies the networking stack it can continue working with that data.

In this scenario, at heavy load, the kernel warns about RCU stalls and the forwarding path has a lot of latency.
Have you tried benchmarking the board you used for testing?

I have run some on our other platforms. The after benchmark fails to run at the top level of the before results. The RCU stall does not always occur in the same place. The after ping latency is greater, and oscillates a lot.

It might be a good idea for the codebase to change to a threaded IRQ, but from a pragmatic perspective the whole system has to suffer for it. That is one of the reasons most crypto accelerators try to run their dequeue primitives in high priority contexts.


Regards,
Catalin Vasile


[1] https://git.kernel.org/cgit/linux/kernel/git/herbert/cryptodev-2.6.git/commit/?id=66d2e2028091a074aa1290d2eeda5ddb1a6c329c

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: crypto: caam from tasklet to threadirq
  2016-09-16 14:01 ` crypto: caam from tasklet to threadirq Cata Vasile
@ 2016-09-16 16:53   ` Russell King - ARM Linux
  2016-09-20 16:12   ` Russell King - ARM Linux
  1 sibling, 0 replies; 6+ messages in thread
From: Russell King - ARM Linux @ 2016-09-16 16:53 UTC (permalink / raw)
  To: Cata Vasile; +Cc: Horia Geanta Neag, linux-crypto

On Fri, Sep 16, 2016 at 02:01:00PM +0000, Cata Vasile wrote:
> Hi,
> 
> We've tried to test and benchmark your submitted work[1].
> 
> Cryptographic offloading is also used in IPsec in the Linux Kernel. In
> heavy traffic scenarios, the NIC driver competes with the crypto device
> driver. Most NICs use the NAPI context, which is one of the most
> prioritized context types. In IPsec scenarios the performance is
> trashed because, although raw data gets into the device and is
> encrypted/decrypted, the dequeue code in the CAAM driver has a hard
> time being scheduled to actually call the callback that notifies the
> networking stack it can continue working with that data.

Having received a reply from Thomas Gleixner today, there appears to be
some disagreement with your findings, and a suggestion that the problem
needs proper and more in-depth investigation.

Thomas indicates that the NAPI processing shows an improvement when
moved to the same context that threaded interrupts run in, as opposed
to the current softirq context - which also would run the tasklets.

What I would say is that if threaded IRQs are causing harm, then there
seems to be something very wrong somewhere.

> In this scenario, at heavy load, the kernel warns about RCU stalls and
> the forwarding path has a lot of latency.  Have you tried benchmarking
> the board you used for testing?

It's way too long ago for me to remember - these patches were created
almost a year ago - October 20th 2015, which is when I'd have tested
them.  So, I'm afraid I can't help very much at this point, apart from
trying to re-run some benchmarks.

I'd suggest testing openssl (with AF_ALG support), which is probably
what I tested and benchmarked.  However, as I say, it's far too long
ago for me to really remember at this point.

> I have run some on our other platforms. The after benchmark fails to
> run at the top level of the before results.

Sorry, that last sentence doesn't make any sense to me.

I don't have the bandwidth to look at this, and IPsec doesn't interest
me one bit - I've never been able to work out how to setup IPsec
locally.  From what I remember when I looked into it many years ago,
you had to have significant information about ipsec to get it up and
running.  Maybe things have changed since then, I don't know.

If you want me to reproduce it, please send me a step-by-step idiots
guide on setting up a working test scenario which reproduces your
problem.

Thanks.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: crypto: caam from tasklet to threadirq
  2016-09-16 14:01 ` crypto: caam from tasklet to threadirq Cata Vasile
  2016-09-16 16:53   ` Russell King - ARM Linux
@ 2016-09-20 16:12   ` Russell King - ARM Linux
  2016-09-20 20:10     ` Thomas Gleixner
  1 sibling, 1 reply; 6+ messages in thread
From: Russell King - ARM Linux @ 2016-09-20 16:12 UTC (permalink / raw)
  To: Cata Vasile, Thomas Gleixner; +Cc: Horia Geanta Neag, linux-crypto

Okay, I've re-tested, using a different way of measuring, because using
openssl speed is impractical for off-loaded engines.  I've decided to
use this way to measure the performance:

dd if=/dev/zero bs=1048576 count=128 | /usr/bin/time openssl dgst -md5

For the threaded IRQs case gives:

0.05user 2.74system 0:05.30elapsed 52%CPU (0avgtext+0avgdata 2400maxresident)k
0.06user 2.52system 0:05.18elapsed 49%CPU (0avgtext+0avgdata 2404maxresident)k
0.12user 2.60system 0:05.61elapsed 48%CPU (0avgtext+0avgdata 2460maxresident)k
	=> 5.36s => 25.0MB/s

and the tasklet case:

0.08user 2.53system 0:04.83elapsed 54%CPU (0avgtext+0avgdata 2468maxresident)k
0.09user 2.47system 0:05.16elapsed 49%CPU (0avgtext+0avgdata 2368maxresident)k
0.10user 2.51system 0:04.87elapsed 53%CPU (0avgtext+0avgdata 2460maxresident)k
	=> 4.95s => 27.1MB/s

which corresponds to an 8% slowdown for the threaded IRQ case.  So,
tasklets are indeed faster than threaded IRQs.

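(For reference, the throughput figures are just 128 x 1048576 bytes over
the averaged elapsed times: ~134.2 MB / 5.36 s = 25.0 MB/s and
~134.2 MB / 4.95 s = 27.1 MB/s, and 5.36 / 4.95 is about 1.08, hence
the 8%.)
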
I guess the reason is that tasklets are much simpler, being able to
run just before we return to userspace without involving scheduler
overheads, but that's speculation.

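To be explicit about what's being compared - this is only a simplified
sketch of the two completion paths, not the literal caam/jr.c code, and
the names below are made up:

        #include <linux/interrupt.h>

        struct my_jr {                          /* made-up driver state */
                struct tasklet_struct done_task;
        };

        /* Tasklet flavour (sketch): the hard IRQ handler just schedules
         * the tasklet, which then runs in softirq context on this CPU. */
        static irqreturn_t jr_irq_tasklet(int irq, void *dev_id)
        {
                struct my_jr *jr = dev_id;

                /* ack/mask the job ring interrupt here */
                tasklet_schedule(&jr->done_task);
                return IRQ_HANDLED;
        }

        /* Threaded flavour (sketch): the hard handler only asks for the
         * IRQ thread to be woken; dequeueing the completed jobs and
         * running their callbacks then waits on the scheduler. */
        static irqreturn_t jr_irq_hard(int irq, void *dev_id)
        {
                /* ack/mask the job ring interrupt here */
                return IRQ_WAKE_THREAD;
        }

        static irqreturn_t jr_irq_thread(int irq, void *dev_id)
        {
                /* dequeue completed jobs and invoke their callbacks */
                return IRQ_HANDLED;
        }

registered with request_irq(irq, jr_irq_tasklet, ...) in the tasklet
case versus request_threaded_irq(irq, jr_irq_hard, jr_irq_thread, ...)
in the threaded case.
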
I've tried to perf it, but...

Samples: 31K of event 'cycles', Event count (approx.): 3552246846
  Overhead  Command          Shared Object     Symbol
+   33.22%  kworker/0:1      [kernel.vmlinux]  [k] __do_softirq
+   15.78%  irq/311-2101000  [kernel.vmlinux]  [k] __do_softirq
+    7.49%  irqbalance       [kernel.vmlinux]  [k] __do_softirq
+    7.26%  openssl          [kernel.vmlinux]  [k] __do_softirq
+    5.71%  ksoftirqd/0      [kernel.vmlinux]  [k] __do_softirq
+    3.64%  kworker/0:2      [kernel.vmlinux]  [k] __do_softirq
+    3.52%  swapper          [kernel.vmlinux]  [k] __do_softirq
+    3.14%  kworker/0:1      [kernel.vmlinux]  [k] _raw_spin_unlock_irq

I was going to try to get the threaded IRQ case, but I've ended up with
perf getting buggered because of the iMX6 SMP perf dysfunctionality:

[ 3448.810416] irq 24: nobody cared (try booting with the "irqpoll" option)
...
[ 3448.824528] Disabling IRQ #24

caused by FSL's utterly brain-dead idea of routing all the perf
interrupts to a single non-CPU-local interrupt input, and the refusal of
kernel folk to find an acceptable solution to support this.

So, sorry, I'm not going to bother trying to get any further with this.
If the job was not made harder by stupid hardware design and kernel
politics, then I might be more inclined to do deeper investigation, but
right now I'm finding that I'm not interested in trying to jump through
these stupid hoops.

I think I've proven from the above that this patch needs to be reverted
due to the performance regression, and that there _is_ most definitely
a detrimental effect of switching from tasklets to threaded IRQs.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: crypto: caam from tasklet to threadirq
  2016-09-20 16:12   ` Russell King - ARM Linux
@ 2016-09-20 20:10     ` Thomas Gleixner
  2016-09-20 21:32       ` Russell King - ARM Linux
  0 siblings, 1 reply; 6+ messages in thread
From: Thomas Gleixner @ 2016-09-20 20:10 UTC (permalink / raw)
  To: Russell King - ARM Linux; +Cc: Cata Vasile, Horia Geanta Neag, linux-crypto

On Tue, 20 Sep 2016, Russell King - ARM Linux wrote:
> which corresponds to an 8% slowdown for the threaded IRQ case.  So,
> tasklets are indeed faster than threaded IRQs.

Fair enough.

> I've tried to perf it, but...
>  ....
 
> So, sorry, I'm not going to bother trying to get any further with this.
> If the job was not made harder by stupid hardware design and kernel
> politics, then I might be more inclined to do deeper investigation, but
> right now I'm finding that I'm not interested in trying to jump through
> these stupid hoops.

I'd be very interested in a sched_switch + irq + softirq trace which does
not involve PMU hardware for both irqthreads and tasklets, but I can
understand if you can't be bothered to gather it.

Vs. the PMU interrupt thing. What's the politics about that? Do you have
any pointers?
 
> I think I've proven from the above that this patch needs to be reverted
> due to the performance regression, and that there _is_ most definitely
> a detrimental effect of switching from tasklets to threaded IRQs.

I agree that the revert should happen, but I'd rather see a bit more
information on why this regression happens with the switch from tasklets to
threaded irqs.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: crypto: caam from tasklet to threadirq
  2016-09-20 20:10     ` Thomas Gleixner
@ 2016-09-20 21:32       ` Russell King - ARM Linux
  2016-09-20 23:31         ` Thomas Gleixner
  0 siblings, 1 reply; 6+ messages in thread
From: Russell King - ARM Linux @ 2016-09-20 21:32 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Cata Vasile, Horia Geanta Neag, linux-crypto

[-- Attachment #1: Type: text/plain, Size: 8782 bytes --]

On Tue, Sep 20, 2016 at 10:10:20PM +0200, Thomas Gleixner wrote:
> On Tue, 20 Sep 2016, Russell King - ARM Linux wrote:
> > which corresponds to an 8% slowdown for the threaded IRQ case.  So,
> > tasklets are indeed faster than threaded IRQs.
> 
> Fair enough.
> 
> > I've tried to perf it, but...
> >  ....
>  
> > So, sorry, I'm not going to bother trying to get any further with this.
> > If the job was not made harder by stupid hardware design and kernel
> > politics, then I might be more inclined to do deeper investigation, but
> > right now I'm finding that I'm not interested in trying to jump through
> > these stupid hoops.
> 
> I'd be very interested in a sched_switch + irq + softirq trace which does
> not involve PMU hardware for both irqthreads and tasklets, but I can
> understand if you can't be bothered to gather it.

That's involved a rebuild of perf to get it to see the trace events.
What I see probably indicates why the crypto AF_ALG way of doing
things sucks big time - the interface to the crypto backends is
entirely synchronous.

This means that we're guaranteed to get one interrupt per msghdr
entry:

crypto/algif_hash.c:
        lock_sock(sk);
        if (!ctx->more) {
                err = af_alg_wait_for_completion(crypto_ahash_init(&ctx->req),
                                                &ctx->completion);
                if (err)
                        goto unlock;
        }
...
        while (msg_data_left(msg)) {
...
                ahash_request_set_crypt(&ctx->req, ctx->sgl.sg, NULL, len);

                err = af_alg_wait_for_completion(crypto_ahash_update(&ctx->req),
                                                 &ctx->completion);
...
        }
...
        ctx->more = msg->msg_flags & MSG_MORE;
        if (!ctx->more) {
                ahash_request_set_crypt(&ctx->req, NULL, ctx->result, 0);
                err = af_alg_wait_for_completion(crypto_ahash_final(&ctx->req),
                                                 &ctx->completion);
        }

At each of those places where af_alg_wait_for_completion() is called, we
end up submitting a bunch of data and then immediately waiting for
the operation to complete... and this can be seen in the perf
trace logs.

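For anyone following along, af_alg_wait_for_completion() is basically a
wrapper around wait_for_completion(), something along these lines
(paraphrased from memory, so treat it as a sketch rather than the exact
source):

        /* Sketch (paraphrased): if the backend queued the request and
         * returned -EINPROGRESS or -EBUSY, sleep until the driver's
         * completion callback fires, then pick up the real result. */
        int af_alg_wait_for_completion(int err,
                                       struct af_alg_completion *completion)
        {
                switch (err) {
                case -EINPROGRESS:
                case -EBUSY:
                        wait_for_completion(&completion->completion);
                        reinit_completion(&completion->completion);
                        err = completion->err;
                        break;
                }

                return err;
        }

so every call site above is a full submit-then-sleep round trip, one
completion interrupt each.
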
So, unless I'm mistaken, there's no way for a crypto backend to run
asynchronously, and there's no way for a crypto backend to batch up
the "job" - in order to do that, I think it would have to store quite
a lot of state.

Now, I'm not entirely sure that asking perf to record irq:* and
sched:* events was what we were after - there appears to be no trace
events recorded for entering a threaded IRQ handler.

swapper     0 [000]  3600.260199:             irq:irq_handler_entry: irq=311 name=2101000.jr0
swapper     0 [000]  3600.260209:              irq:irq_handler_exit: irq=311 ret=handled
swapper     0 [000]  3600.260219:                sched:sched_waking: comm=irq/311-2101000 pid=2426 prio=49 target_cpu=000
swapper     0 [000]  3600.260236:                sched:sched_wakeup: comm=irq/311-2101000 pid=2426 prio=49 target_cpu=000
swapper     0 [000]  3600.260258:                sched:sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=irq/311-2101000 next_pid=2426 next_prio=49
irq/311-2101000  2426 [000]  3600.260278:                sched:sched_waking: comm=openssl pid=8005 prio=120 target_cpu=000
irq/311-2101000  2426 [000]  3600.260296:                sched:sched_wakeup: comm=openssl pid=8005 prio=120 target_cpu=000
irq/311-2101000  2426 [000]  3600.260322:                sched:sched_switch: prev_comm=irq/311-2101000 prev_pid=2426 prev_prio=49 prev_state=S ==> next_comm=openssl next_pid=8005 next_prio=120
openssl  8005 [000]  3600.260369:                sched:sched_waking: comm=dd pid=8002 prio=120 target_cpu=000
openssl  8005 [000]  3600.260388:          sched:sched_stat_runtime: comm=openssl pid=8005 runtime=71000 [ns] vruntime=419661176835 [ns]
openssl  8005 [000]  3600.260401:                sched:sched_wakeup: comm=dd pid=8002 prio=120 target_cpu=000
openssl  8005 [000]  3600.260421:                sched:sched_switch: prev_comm=openssl prev_pid=8005 prev_prio=120 prev_state=R ==> next_comm=dd next_pid=8002 next_prio=120
dd  8002 [000]  3600.260459:          sched:sched_stat_runtime: comm=dd pid=8002 runtime=71667 [ns] vruntime=419655248502 [ns]
dd  8002 [000]  3600.260473:                sched:sched_switch: prev_comm=dd prev_pid=8002 prev_prio=120 prev_state=S ==> next_comm=openssl next_pid=8005 next_prio=120
openssl  8005 [000]  3600.260572:          sched:sched_stat_runtime: comm=openssl pid=8005 runtime=112666 [ns] vruntime=419661289501 [ns]
openssl  8005 [000]  3600.260587:                sched:sched_switch: prev_comm=openssl prev_pid=8005 prev_prio=120 prev_state=D ==> next_comm=swapper/0 next_pid=0 next_prio=120
swapper     0 [000]  3600.260638:             irq:irq_handler_entry: irq=311 name=2101000.jr0
...

So 123us (260322 - 260199) to the switch to openssl via the threaded IRQ.

tasklet case:

swapper     0 [000]  5082.667101:             irq:irq_handler_entry: irq=311 name=2101000.jr0
swapper     0 [000]  5082.667111:                 irq:softirq_raise: vec=6 [action=TASKLET]
swapper     0 [000]  5082.667119:              irq:irq_handler_exit: irq=311 ret=handled
swapper     0 [000]  5082.667134:                 irq:softirq_entry: vec=6 [action=TASKLET]
swapper     0 [000]  5082.667151:                sched:sched_waking: comm=openssl pid=8251 prio=120 target_cpu=000
swapper     0 [000]  5082.667169:                sched:sched_wakeup: comm=openssl pid=8251 prio=120 target_cpu=000
swapper     0 [000]  5082.667183:                  irq:softirq_exit: vec=6 [action=TASKLET]
swapper     0 [000]  5082.667202:                sched:sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=openssl next_pid=8251 next_prio=120
openssl  8251 [000]  5082.667248:                sched:sched_waking: comm=dd pid=8248 prio=120 target_cpu=000
openssl  8251 [000]  5082.667267:          sched:sched_stat_runtime: comm=openssl pid=8251 runtime=39668 [ns] vruntime=444714027428 [ns]
openssl  8251 [000]  5082.667280:                sched:sched_wakeup: comm=dd pid=8248 prio=120 target_cpu=000
openssl  8251 [000]  5082.667301:                sched:sched_switch: prev_comm=openssl prev_pid=8251 prev_prio=120 prev_state=R ==> next_comm=dd next_pid=8248 next_prio=120
dd  8248 [000]  5082.667339:          sched:sched_stat_runtime: comm=dd pid=8248 runtime=70000 [ns] vruntime=444708097428 [ns]
dd  8248 [000]  5082.667354:                sched:sched_switch: prev_comm=dd prev_pid=8248 prev_prio=120 prev_state=S ==> next_comm=openssl next_pid=8251 next_prio=120
openssl  8251 [000]  5082.667451:          sched:sched_stat_runtime: comm=openssl pid=8251 runtime=113666 [ns] vruntime=444714141094 [ns]
openssl  8251 [000]  5082.667466:                sched:sched_switch: prev_comm=openssl prev_pid=8251 prev_prio=120 prev_state=D ==> next_comm=swapper/0 next_pid=0 next_prio=120
swapper     0 [000]  5082.667517:             irq:irq_handler_entry: irq=311 name=2101000.jr0

101us (667202 - 667101) between the same two events, which is 22us
faster than the above.

In both cases, I picked out an irq_handler_entry event which was near
to a whole number of 100us.  I haven't looked any deeper to see what
the variations are in the hard IRQ->openssl schedule latency - I
expect that needs some scripts written.

Attached are compressed files of the perf script -G output.  If you
want the perf.data files, I have them (I'm not sure how useful they
are without the binaries though.)

> Vs. the PMU interrupt thing. What's the politics about that? Do you have
> any pointers?

I just remember there being a discussion about how stupid FSL have been
and "we're not going to support that" - the perf code wants the per-CPU
performance unit interrupts delivered _on_ the CPU to which the perf
unit is attached.  FSL decided in their stupidity to OR all the perf
unit interrupts together and route them to a single common interrupt.

This means that we end up with one CPU taking the perf interrupt for
any perf unit - and the CPUs can only access their local perf unit.
So, if (eg) CPU1's perf unit fires an interrupt, but the common
interrupt is routed to CPU0, CPU0 checks its perf unit, finds no
interrupt, and returns with IRQ_NONE.

There's no mechanism in perf (or anywhere else) to hand the interrupt
over to another CPU.

The result is that trying to run perf on the multi-core iMX SoCs ends
up with the perf interrupt disabled, at which point perf collapses in
a sad pile.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

[-- Attachment #2: threaded.txt.bz2 --]
[-- Type: application/x-bzip2, Size: 23786 bytes --]

[-- Attachment #3: tasklet.txt.bz2 --]
[-- Type: application/x-bzip2, Size: 23617 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: crypto: caam from tasklet to threadirq
  2016-09-20 21:32       ` Russell King - ARM Linux
@ 2016-09-20 23:31         ` Thomas Gleixner
  0 siblings, 0 replies; 6+ messages in thread
From: Thomas Gleixner @ 2016-09-20 23:31 UTC (permalink / raw)
  To: Russell King - ARM Linux; +Cc: Cata Vasile, Horia Geanta Neag, linux-crypto

On Tue, 20 Sep 2016, Russell King - ARM Linux wrote:
> At each of those places where af_alg_wait_for_completion() is called,
> we end up submitting a bunch of data and then immediately waiting for
> the operation to complete... and this can be seen in the perf
> trace logs.

That'd explain it.
 
> So, unless I'm mistaken, there's no way for a crypto backend to run
> asynchronously, and there's no way for a crypto backend to batch up
> the "job" - in order to do that, I think it would have to store quite
> a lot of state.

Hmm.

> Now, I'm not entirely sure that asking perf to record irq:* and
> sched:* events was what we were after - there appears to be no trace
> events recorded for entering a threaded IRQ handler.

Indeed. We can only deduce it from the thread being woken and scheduled
in/out. /me makes note to add a tracepoint in the thread handler
invocation.

> So 123us (260322 - 260199) to the switch to openssl via the threaded IRQ.

> 101us (667202 - 667101) between the same two events, which is 22us
> faster than the above.

So it looks like the two extra context switches are responsible for that
delta.
 
> Attached are compressed files of the perf script -G output.  If you
> want the perf.data files, I have them (I'm not sure how useful they
> are without the binaries though.)

Thanks. I'll have a look tomorrow when brain is unfried.

> > Vs. the PMU interrupt thing. What's the politics about that? Do you have
> > any pointers?
> 
> I just remember there being a discussion about how stupid FSL have been
> and "we're not going to support that" - the perf code wants the per-CPU
> performance unit interrupts delivered _on_ the CPU to which the perf
> unit is attached.  FSL decided in their stupidity to OR all the perf
> unit interrupts together and route them to a single common interrupt.

Brilliant.
 
> This means that we end up with one CPU taking the perf interrupt for
> any perf unit - and the CPUs can only access their local perf unit.
> So, if (eg) CPU1's perf unit fires an interrupt, but the common
> interrupt is routed to CPU0, CPU0 checks its perf unit, finds no
> interrupt, and returns with IRQ_NONE.
> 
> There's no mechanism in perf (or anywhere else) to hand the interrupt
> over to another CPU.
> 
> The result is that trying to run perf on the multi-core iMX SoCs ends
> up with the perf interrupt disabled, at which point perf collapses in
> a sad pile.

Not surprising.

Solving that in perf is probably the wrong place. So what we'd need is some
kind of special irq flow handler which does:

        ret = handle_irq(desc);
        if (ret == IRQ_NONE && desc->ipi_next) {
                dest = get_next_cpu(this_cpu);
                if (dest != this_cpu)
                        desc->ipi_next(dest);
        }

get_next_cpu() would just pick the next cpu in the online mask or the first
when this_cpu is the last one in the mask.

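Something like this for the helper, I'd guess - only a sketch, using the
existing cpumask helpers:

        /* Sketch: next online CPU after this_cpu, wrapping back to the
         * first online CPU at the end of the mask. */
        static unsigned int get_next_cpu(unsigned int this_cpu)
        {
                unsigned int dest = cpumask_next(this_cpu, cpu_online_mask);

                if (dest >= nr_cpu_ids)
                        dest = cpumask_first(cpu_online_mask);

                return dest;
        }
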
That shouldn't be overly complex to implement. All you'd need to do in the
PMU driver is to hook into that IPI vector.

If you're interested then I can hack the core bits.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2016-09-20 23:34 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <DB5PR04MB130229BBD433C8FF7BDD975FEEF90@DB5PR04MB1302.eurprd04.prod.outlook.com>
2016-09-16 14:01 ` crypto: caam from tasklet to threadirq Cata Vasile
2016-09-16 16:53   ` Russell King - ARM Linux
2016-09-20 16:12   ` Russell King - ARM Linux
2016-09-20 20:10     ` Thomas Gleixner
2016-09-20 21:32       ` Russell King - ARM Linux
2016-09-20 23:31         ` Thomas Gleixner
