* irq_fpu_usable() is false in ndo_start_xmit() for UDP packets
@ 2015-11-16 19:52 Jason A. Donenfeld
  2015-11-16 20:32 ` David Miller
  0 siblings, 1 reply; 11+ messages in thread
From: Jason A. Donenfeld @ 2015-11-16 19:52 UTC (permalink / raw)
  To: Netdev, LKML; +Cc: David Miller

Hi David & Folks,

I have a virtual device driver that does some fancy processing of
packets in ndo_start_xmit before forwarding them onward out of a
tunnel elsewhere. In order to make that fancy processing fast, I have
AVX and AVX2 implementations. This means I need to use the FPU.

So, I do the usual pattern found throughout the kernel:

        if (!irq_fpu_usable()) {
                generic_c(...);
        } else {
                kernel_fpu_begin();
                optimized_avx(...);
                kernel_fpu_end();
        }

This works fine with, say, iperf3 in TCP mode. The AVX performance is
great. However, when using iperf3 in UDP mode, irq_fpu_usable() is
mostly false! I added a dump_stack() call to see why, except nothing
looks strange; the initial call in the stack trace is
entry_SYSCALL_64_fastpath. Why would irq_fpu_usable() return false
when we're in a syscall? Doesn't that mean this is in process context?
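
For concreteness, the instrumentation is roughly just:

        if (!irq_fpu_usable()) {
                dump_stack();   /* temporary: why can't we use the FPU here? */
                generic_c(...);
        }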

So, I find this a bit disturbing. If anybody has an explanation, and a
way to work around it, I'd be quite happy. Or, simply if there is a
debugging technique you'd recommend, I'd be happy to try something and
report back.

Thanks,
Jason


* Re: irq_fpu_usable() is false in ndo_start_xmit() for UDP packets
  2015-11-16 19:52 irq_fpu_usable() is false in ndo_start_xmit() for UDP packets Jason A. Donenfeld
@ 2015-11-16 20:32 ` David Miller
  2015-11-16 20:58   ` Jason A. Donenfeld
  2015-11-17 12:38   ` David Laight
  0 siblings, 2 replies; 11+ messages in thread
From: David Miller @ 2015-11-16 20:32 UTC (permalink / raw)
  To: Jason; +Cc: netdev, linux-kernel

From: "Jason A. Donenfeld" <Jason@zx2c4.com>
Date: Mon, 16 Nov 2015 20:52:28 +0100

> This works fine with, say, iperf3 in TCP mode. The AVX performance
> is great. However, when using iperf3 in UDP mode, irq_fpu_usable()
> is mostly false! I added a dump_stack() call to see why, except
> nothing looks strange; the initial call in the stack trace is
> entry_SYSCALL_64_fastpath. Why would irq_fpu_usable() return false
> when we're in a syscall? Doesn't that mean this is in process
> context?

Network device driver transmit executes with software interrupts
disabled.

Therefore on x86, you cannot use the FPU.
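
The transmit path runs under rcu_read_lock_bh(), i.e. with
local_bh_disable(), which raises the SOFTIRQ bits of preempt_count,
and that is exactly what in_interrupt() looks at (roughly, from a 4.x
include/linux/preempt.h):

#define irq_count()    (preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK | NMI_MASK))
#define in_interrupt() (irq_count())

So a transmit that started life in a syscall still sees in_interrupt()
non-zero.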


* Re: irq_fpu_usable() is false in ndo_start_xmit() for UDP packets
  2015-11-16 20:32 ` David Miller
@ 2015-11-16 20:58   ` Jason A. Donenfeld
  2015-11-16 21:17     ` David Miller
  2015-11-16 22:27     ` Hannes Frederic Sowa
  2015-11-17 12:38   ` David Laight
  1 sibling, 2 replies; 11+ messages in thread
From: Jason A. Donenfeld @ 2015-11-16 20:58 UTC (permalink / raw)
  To: David Miller; +Cc: Netdev, LKML

Hi David,

On Mon, Nov 16, 2015 at 9:32 PM, David Miller <davem@davemloft.net> wrote:
> Network device driver transmit executes with software interrupts
> disabled.
>
> Therefore on x86, you cannot use the FPU.

That is extremely problematic for me. Is there a way to make this not
so? A driver flag that would allow this?

Also - how come irq_fpu_usable() is true when using TCP but not
when using UDP?

Further, irq_fpu_usable() doesn't only check for interrupts. There are
two other conditions that allow the FPU's usage, from
arch/x86/kernel/fpu/core.c:

bool irq_fpu_usable(void)
{
        return !in_interrupt() ||
                interrupted_user_mode() ||
                interrupted_kernel_fpu_idle();
}
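
For reference, the first of those helpers is trivial (same file, around
a 4.3 tree, if I'm reading it right):

static bool interrupted_user_mode(void)
{
        struct pt_regs *regs = get_irq_regs();
        return regs && user_mode(regs);
}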


* Re: irq_fpu_usable() is false in ndo_start_xmit() for UDP packets
  2015-11-16 20:58   ` Jason A. Donenfeld
@ 2015-11-16 21:17     ` David Miller
  2015-11-16 21:28       ` Jason A. Donenfeld
  2015-11-16 22:27     ` Hannes Frederic Sowa
  1 sibling, 1 reply; 11+ messages in thread
From: David Miller @ 2015-11-16 21:17 UTC (permalink / raw)
  To: Jason; +Cc: netdev, linux-kernel

From: "Jason A. Donenfeld" <Jason@zx2c4.com>
Date: Mon, 16 Nov 2015 21:58:49 +0100

> That is extremely problematic for me. Is there a way to make this
> not so?

Not without a complete redesign of the x86 fpu save/restore mechanism.

The driver is the wrong place to do software cryptographic transforms
anyways.

Judging from your other emails, you're doing a lot of weird shit in
your driver.

Maybe you should just tell us exactly what kind of device it is for
and exactly what the features and offloads are, and then we can tell
you what kind of facilities would match that situation best.

You're currently trying to do it the other way around: you know
everything about your device and goals, and you're sending us small
piecemeal questions.  We lack the full high-level picture of your
device, so it's hard for us to give you good answers.





* Re: irq_fpu_usable() is false in ndo_start_xmit() for UDP packets
  2015-11-16 21:17     ` David Miller
@ 2015-11-16 21:28       ` Jason A. Donenfeld
  2015-11-16 21:33         ` David Miller
  0 siblings, 1 reply; 11+ messages in thread
From: Jason A. Donenfeld @ 2015-11-16 21:28 UTC (permalink / raw)
  To: David Miller; +Cc: Netdev, LKML

On Mon, Nov 16, 2015 at 10:17 PM, David Miller <davem@davemloft.net> wrote:
>
> Not without a complete redesign of the x86 fpu save/restore mechanism.

Urg, okay. I still wonder why irq_fpu_usable() is true when using TCP but not
when using UDP... Any ideas on this?

> The driver is the wrong place to do software cryptographic transforms
> anyways.
> Judging from your other emails, you're doing a lot of weird shit in
> your driver.
> Maybe you should just tell us exactly what kind of device it is for
> and exactly what the features and offloads are, and then we can tell
> you what kind of facilities would match that situation best.
> You're currently trying to do it the other way around: you know
> everything about your device and goals, and you're sending us small
> piecemeal questions.  We lack the full high-level picture of your
> device, so it's hard for us to give you good answers.

Yes, this is fair to ask. Here it goes:

I'm making a simpler replacement for IPSec that operates as a device
driver on the interface level, rather than through the IPSec xfrm
machinery. The methodology is going to be controversial, so I'm taking
my time perfecting each component, and then I'm planning on writing in
with a big email explaining why, with justifications and numbers. It
has some real-world benefits that are already quantifiable. If you're
curious, I did a talk on it at Kernel Recipes in Paris [1]. But
please, give me some time to finish things and prepare myself. I want
to present it to you in the best way possible. I'd hate for it to be
dismissed too early or too hastily, before it's had its chance.

[1] https://www.youtube.com/watch?v=9Rk4doELmwM


* Re: irq_fpu_usable() is false in ndo_start_xmit() for UDP packets
  2015-11-16 21:28       ` Jason A. Donenfeld
@ 2015-11-16 21:33         ` David Miller
  2015-11-16 21:37           ` Jason A. Donenfeld
  0 siblings, 1 reply; 11+ messages in thread
From: David Miller @ 2015-11-16 21:33 UTC (permalink / raw)
  To: Jason; +Cc: netdev, linux-kernel

From: "Jason A. Donenfeld" <Jason@zx2c4.com>
Date: Mon, 16 Nov 2015 22:28:52 +0100

> I'm making a simpler replacement for IPSec that operates as a
> device driver on the interface level

Someone already did a Linux implementation of exactly that two decades
ago; we rejected that design and did what we did.


* Re: irq_fpu_usable() is false in ndo_start_xmit() for UDP packets
  2015-11-16 21:33         ` David Miller
@ 2015-11-16 21:37           ` Jason A. Donenfeld
  0 siblings, 0 replies; 11+ messages in thread
From: Jason A. Donenfeld @ 2015-11-16 21:37 UTC (permalink / raw)
  To: David Miller; +Cc: Netdev, LKML

On Mon, Nov 16, 2015 at 10:33 PM, David Miller <davem@davemloft.net> wrote:
> Someone already did a Linux implementation of exactly that two decades
> ago; we rejected that design and did what we did.

It's not exactly IPSec though, so don't worry. It's a totally new
thing that combines a lot of different recent ideas, in the virtual
networking arena as well as in crypto. *It's going to be cool.* And
I'm determined to please you with the design and implementation in the
end. So let me finish off my first implementation, and then we can do
some back and forth on the overall high level ideas and iron out some
of the potential implementation issues that you find. I'll incorporate
your feedback iteratively until it's in a state that you like. I've
been working on this day and night for months, and I'm *almost* there.
Bear with me a little longer, and expect a nice "[RFC]" email to be
sent your way soon.


* Re: irq_fpu_usable() is false in ndo_start_xmit() for UDP packets
  2015-11-16 20:58   ` Jason A. Donenfeld
  2015-11-16 21:17     ` David Miller
@ 2015-11-16 22:27     ` Hannes Frederic Sowa
  2015-11-16 23:57       ` Jason A. Donenfeld
  1 sibling, 1 reply; 11+ messages in thread
From: Hannes Frederic Sowa @ 2015-11-16 22:27 UTC (permalink / raw)
  To: Jason A. Donenfeld, David Miller; +Cc: Netdev, LKML

Hi Jason,

On Mon, Nov 16, 2015, at 21:58, Jason A. Donenfeld wrote:
> Hi David,
> 
> On Mon, Nov 16, 2015 at 9:32 PM, David Miller <davem@davemloft.net>
> wrote:
> > Network device driver transmit executes with software interrupts
> > disabled.
> >
> > Therefore on x86, you cannot use the FPU.
> 
> That is extremely problematic for me. Is there a way to make this not
> so? A driver flag that would allow this?
> 
> Also - how come irq_fpu_usable() is true when using TCP but not
> when using UDP?
> 
> Further, irq_fpu_usable() doesn't only check for interrupts. There are
> two other conditions that allow the FPU's usage, from
> arch/x86/kernel/fpu/core.c:
> 
> bool irq_fpu_usable(void)
> {
>         return !in_interrupt() ||
>                 interrupted_user_mode() ||
>                 interrupted_kernel_fpu_idle();
> }

Use the irqsoff tracer to find the problematic functions that disable
interrupts, and try to work around them in the UDP case. This could
benefit the whole stack.

Bye,
Hannes


* Re: irq_fpu_usable() is false in ndo_start_xmit() for UDP packets
  2015-11-16 22:27     ` Hannes Frederic Sowa
@ 2015-11-16 23:57       ` Jason A. Donenfeld
  2015-11-17  0:04         ` Jason A. Donenfeld
  0 siblings, 1 reply; 11+ messages in thread
From: Jason A. Donenfeld @ 2015-11-16 23:57 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: David Miller, Netdev, LKML

Hi Hannes,

Thanks for your response.

On Mon, Nov 16, 2015 at 11:27 PM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> Use the irqsoff tracer to find the problematic functions that disable
> interrupts, and try to work around them in the UDP case. This could
> benefit the whole stack.

I didn't know about the irqsoff tracer. This looks very useful.
Unfortunately, it turns out that David was right: in_interrupt() is
always true, anyway, when ndo_start_xmit is called. This means, based
on this function:

bool irq_fpu_usable(void)
{
        return !in_interrupt() ||
                interrupted_user_mode() ||
                interrupted_kernel_fpu_idle();
}


1. irq_fpu_usable() is true for TCP. Since in_interrupt() is always
true in ndo_start_xmit, this means that in this case, we're lucky and
either interrupted_user_mode() is true, or
interrupted_kernel_fpu_idle() is true.

2. irq_fpu_usable() is FALSE for UDP! Since in_interrupt() is always
true in ndo_start_xmit, this means that in this case, both
interrupted_user_mode() and interrupted_kernel_fpu_idle() are false!

I now need to determine precisely why these are false in that case. Is
there other UDP code that's somehow making use of the FPU? Some
strange accelerated CRC32 perhaps? Or is there a weird situation
happening in which user mode isn't being interrupted? I suspect not,
since tracing always shows a syscall as the entry point.
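
Something like this should show exactly which condition fails (rough
sketch, assuming a 4.3-ish tree; the fields are the main inputs that
interrupted_user_mode() and the idle check read):

static void dump_fpu_conditions(void)
{
        struct pt_regs *regs = get_irq_regs();

        pr_info("in_interrupt=%d user_mode=%d fpregs_active=%d cr0.TS=%d\n",
                !!in_interrupt(),
                regs ? !!user_mode(regs) : -1,
                current->thread.fpu.fpregs_active,
                !!(read_cr0() & X86_CR0_TS));
}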

Investigating further, will report back.

Jason


* Re: irq_fpu_usable() is false in ndo_start_xmit() for UDP packets
  2015-11-16 23:57       ` Jason A. Donenfeld
@ 2015-11-17  0:04         ` Jason A. Donenfeld
  0 siblings, 0 replies; 11+ messages in thread
From: Jason A. Donenfeld @ 2015-11-17  0:04 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: David Miller, Netdev, LKML

On Tue, Nov 17, 2015 at 12:57 AM, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> 2. irq_fpu_usable() is FALSE for UDP! Since in_interrupt() is always
> true in ndo_start_xmit, this means that in this case, both
> interrupted_user_mode() and interrupted_kernel_fpu_idle() are false!
> Investigating further, will report back.

GASP, the plot thickens.

It turns out that when it is working, for TCP, interrupted_user_mode()
is false. This means that the only reason it's succeeding for TCP is
that interrupted_kernel_fpu_idle() is true. Therefore, for some reason,
interrupted_kernel_fpu_idle() is false for UDP!

That function is defined as:

static bool interrupted_kernel_fpu_idle(void)
{
        if (kernel_fpu_disabled())
                return false;

        if (use_eager_fpu())
                return true;

        return !current->thread.fpu.fpregs_active && (read_cr0() & X86_CR0_TS);
}

So now the big question is: what in the UDP pipeline is using the FPU?
And why is that usage not being released by the time it gets to
ndo_start_xmit? Or, alternatively, why would X86_CR0_TS be unset in
the UDP path? Is it possible this is related to UFO?
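
One way to find out: a throwaway module that kprobes
__kernel_fpu_begin() and dumps the stack of whoever calls it, so any
FPU user earlier in the UDP path shows up in dmesg. Rough sketch,
assuming the 4.3-era symbol name (and it will be noisy):

#include <linux/module.h>
#include <linux/kprobes.h>
#include <linux/sched.h>

static int kfb_pre(struct kprobe *p, struct pt_regs *regs)
{
        pr_info("kernel_fpu_begin() hit in %s\n", current->comm);
        dump_stack();
        return 0;
}

static struct kprobe kfb_probe = {
        .symbol_name = "__kernel_fpu_begin",
        .pre_handler = kfb_pre,
};

static int __init kfb_init(void)
{
        return register_kprobe(&kfb_probe);
}

static void __exit kfb_exit(void)
{
        unregister_kprobe(&kfb_probe);
}

module_init(kfb_init);
module_exit(kfb_exit);
MODULE_LICENSE("GPL");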


* RE: irq_fpu_usable() is false in ndo_start_xmit() for UDP packets
  2015-11-16 20:32 ` David Miller
  2015-11-16 20:58   ` Jason A. Donenfeld
@ 2015-11-17 12:38   ` David Laight
  1 sibling, 0 replies; 11+ messages in thread
From: David Laight @ 2015-11-17 12:38 UTC (permalink / raw)
  To: 'David Miller', Jason; +Cc: netdev, linux-kernel

From: David Miller
> Sent: 16 November 2015 20:32
> From: "Jason A. Donenfeld" <Jason@zx2c4.com>
> Date: Mon, 16 Nov 2015 20:52:28 +0100
> 
> > This works fine with, say, iperf3 in TCP mode. The AVX performance
> > is great. However, when using iperf3 in UDP mode, irq_fpu_usable()
> > is mostly false! I added a dump_stack() call to see why, except
> > nothing looks strange; the initial call in the stack trace is
> > entry_SYSCALL_64_fastpath. Why would irq_fpu_usable() return false
> > when we're in a syscall? Doesn't that mean this is in process
> > context?
> 
> Network device driver transmit executes with software interrupts
> disabled.
> 
> Therefore on x86, you cannot use the FPU.

I had some thoughts about driver access to AVX instructions when
I was adding AVX support to NetBSD.

The fpu state is almost certainly 'lazy switched'; this means that
the fpu registers can contain data for a process that is currently
running on a different cpu.
At any time the other cpu might request (by IPI) that they be flushed
to the process data area so that it can reload them.
Kernel code could request that they be flushed, but that can only
happen once.
If a nested function requires them, it would have to supply a local
save area. But the full save area is too big to go on the stack.
Not only that, the save and restore instructions are slow.

It is also worth noting that all the AVX registers are caller-saved
(not preserved across calls). This means that the syscall entry need
not preserve them; instead it can mark that they will be 'restored as
zero'. However, this isn't true of any other kernel entry point.

Back to my thoughts...

Kernel code is likely to want to use special SSE/AVX instructions
(e.g. the crc and crypto ones) rather than generic FP calculations.
As such, just saving two or three of the AVX registers would suffice.
This could be done using a small on-stack save structure that
can be referenced from the process's save area, so that any IPI
can copy over the correct values after saving the full state.

This would allow kernel code (including interrupts) to execute
some AVX instructions without having to save the entire cpu
extended state anywhere.
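
To make that concrete, the spill/reload half might look something like
this (names entirely made up; it deliberately skips the hard part,
coordinating with the lazy-switch IPI so the owner's full state can
still be reconstructed):

struct ymm_scratch {
        u8 regs[2][32];
};

static inline void ymm_spill2(struct ymm_scratch *s)
{
        /* vmovdqu, so no 32-byte stack alignment requirement */
        asm volatile("vmovdqu %%ymm0, 0(%0)\n\t"
                     "vmovdqu %%ymm1, 32(%0)"
                     : : "r" (s->regs) : "memory");
}

static inline void ymm_reload2(const struct ymm_scratch *s)
{
        asm volatile("vmovdqu 0(%0), %%ymm0\n\t"
                     "vmovdqu 32(%0), %%ymm1"
                     : : "r" (s->regs) : "memory");
}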

I suspect there is a big hole in the above though...

	David

