* Lazy FPU restoration / moving kernel_fpu_end() to context switch
@ 2018-06-15 13:11 Jason A. Donenfeld
  2018-06-15 16:25 ` Thomas Gleixner
  2018-06-15 18:31 ` Andy Lutomirski
  0 siblings, 2 replies; 23+ messages in thread
From: Jason A. Donenfeld @ 2018-06-15 13:11 UTC (permalink / raw)
  To: LKML, X86 ML, Andy Lutomirski

Hi Andy & folks,

Lots of crypto routines look like this:

kernel_fpu_begin();
encrypt();
kernel_fpu_end();

If you call such a routine twice, you get:

kernel_fpu_begin();
encrypt();
kernel_fpu_end();
kernel_fpu_begin();
encrypt();
kernel_fpu_end();

In a loop this looks like:

for (thing) {
  kernel_fpu_begin();
  encrypt(thing);
  kernel_fpu_end();
}

This is obviously very bad, because begin() and end() are slow, so
WireGuard does the obvious:

kernel_fpu_begin();
for (thing)
  encrypt(thing);
kernel_fpu_end();

This is fine and well, and the crypto API I'm working on will enable
this to be done in a clear way, but I do wonder if maybe this is not
something that should be happening at the level of the caller, but
rather in the fpu functions themselves. Namely, what are your thoughts
on modifying kernel_fpu_end() so that it doesn't actually restore the
state, but just reenables preemption and marks that on the next
context switch, the state should be restored? Then, essentially,
kernel_fpu_begin() and end() become free after the first usage of
kernel_fpu_begin().
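
In rough sketch form -- with a made-up flag name, purely to show the
shape of the idea -- I mean something like:

/* kernel_fpu_begin() stays as-is: it saves the user fpstate on first use. */
void kernel_fpu_end(void)
{
	/* Leave the kernel's FPU state live in the registers. */
	set_thread_flag(TIF_RESTORE_FPU);	/* hypothetical flag */
	preempt_enable();
}

/* Then, at the next context switch (or return to user mode): */
if (test_and_clear_thread_flag(TIF_RESTORE_FPU))
	copy_kernel_to_fpregs(&current->thread.fpu.state);

(switch_fpu_prepare() would also have to learn not to save the trashed
registers over the still-valid fpstate in memory.)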

Is this feasible? I know that, performance-wise, I'm really gaining a
lot from hoisting those functions out of the loops, and API-wise, it'd
be slightly simpler to implement if I didn't have to allow for that
hoisting.

Regards,
Jason

* Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-06-15 13:11 Lazy FPU restoration / moving kernel_fpu_end() to context switch Jason A. Donenfeld
@ 2018-06-15 16:25 ` Thomas Gleixner
  2018-06-15 18:33   ` Brian Gerst
  2018-06-15 19:34   ` Peter Zijlstra
  2018-06-15 18:31 ` Andy Lutomirski
  1 sibling, 2 replies; 23+ messages in thread
From: Thomas Gleixner @ 2018-06-15 16:25 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: LKML, X86 ML, Andy Lutomirski, Peter Zijlstra

On Fri, 15 Jun 2018, Jason A. Donenfeld wrote:
> In a loop this looks like:
> 
> for (thing) {
>   kernel_fpu_begin();
>   encrypt(thing);
>   kernel_fpu_end();
> }
> 
> This is obviously very bad, because begin() and end() are slow, so
> WireGuard does the obvious:
> 
> kernel_fpu_begin();
> for (thing)
>   encrypt(thing);
> kernel_fpu_end();
> 
> This is fine and well, and the crypto API I'm working on will enable

It might be fine crypto performance wise, but it's a total nightmare
latency wise because kernel_fpu_begin() disables preemption. We've seen
latencies in the larger millisecond range due to processing large data sets
with kernel FPU.

If you want to go there then we really need a better approach which allows
kernel FPU usage in preemptible context and in case of preemption a way to
stash the preempted FPU context and restore it when the task gets scheduled
in again. Just using the existing FPU stuff and moving the loops inside the
begin/end section and keeping preemption disabled for arbitrary time spans
is not going to fly.
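
Roughly, the context switch path would then need something along these
lines (a sketch only; the per task 'kernel_fpu' stash and the
TIF_KERNEL_FPU flag are made up here):

/* Preempting a task that is inside a kernel FPU section: */
if (test_tsk_thread_flag(prev, TIF_KERNEL_FPU))
	copy_fpregs_to_fpstate(&prev->thread.kernel_fpu);

/* Scheduling that task back in: */
if (test_tsk_thread_flag(next, TIF_KERNEL_FPU))
	copy_kernel_to_fpregs(&next->thread.kernel_fpu.state);

That implies a second per task save area, because thread.fpu still
holds the user state.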

Thanks,

	tglx

* Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-06-15 13:11 Lazy FPU restoration / moving kernel_fpu_end() to context switch Jason A. Donenfeld
  2018-06-15 16:25 ` Thomas Gleixner
@ 2018-06-15 18:31 ` Andy Lutomirski
  2018-06-15 18:40   ` Rik van Riel
                     ` (3 more replies)
  1 sibling, 4 replies; 23+ messages in thread
From: Andy Lutomirski @ 2018-06-15 18:31 UTC (permalink / raw)
  To: Jason A. Donenfeld, Rik van Riel, Dave Hansen; +Cc: LKML, X86 ML

On Fri, Jun 15, 2018 at 6:11 AM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> Hi Andy & folks,
>
> Lots of crypto routines look like this:
>
> kernel_fpu_begin();
> encrypt();
> kernel_fpu_end();
>
> If you call such a routine twice, you get:
>
> kernel_fpu_begin();
> encrypt();
> kernel_fpu_end();
> kernel_fpu_begin();
> encrypt();
> kernel_fpu_end();
>
> In a loop this looks like:
>
> for (thing) {
>   kernel_fpu_begin();
>   encrypt(thing);
>   kernel_fpu_end();
> }
>
> This is obviously very bad, because begin() and end() are slow, so
> WireGuard does the obvious:
>
> kernel_fpu_begin();
> for (thing)
>   encrypt(thing);
> kernel_fpu_end();
>
> This is fine and well, and the crypto API I'm working on will enable
> this to be done in a clear way, but I do wonder if maybe this is not
> something that should be happening at the level of the caller, but
> rather in the fpu functions themselves. Namely, what are your thoughts
> on modifying kernel_fpu_end() so that it doesn't actually restore the
> state, but just reenables preemption and marks that on the next
> context switch, the state should be restored? Then, essentially,
> kernel_fpu_begin() and end() become free after the first usage of
> kernel_fpu_begin().
>
> Is this feasible? I know that, performance-wise, I'm really gaining a
> lot from hoisting those functions out of the loops, and API-wise, it'd
> be slightly simpler to implement if I didn't have to allow for that
> hoisting.

Hi Jason-

Funny you asked.  This has been discussed a few times, although not
quite in the form you imagined.  The idea that we've tossed around is
to restore FPU state on return to user mode.  Roughly, we'd introduce
a new thread flag TIF_FPU_UNLOADED (name TBD).
prepare_exit_to_usermode() would notice this flag, copy the fpstate to
fpregs, and clear the flag.  (Or maybe exit_to_usermode_loop() -- No
one has quite thought it through, but I think it should be outside the
loop.)  We'd update all the FPU accessors to understand the flag.
We'd probably have a helper like:

void unload_user_fpstate(void)

that would do nothing if TIF_FPU_UNLOADED is set.  If TIF_FPU_UNLOADED
is clear, it would move the state to memory and set TIF_FPU_UNLOADED.
We'd call unload_user_fpstate() during context switch in the prev task
context and in kernel_fpu_begin().
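
As a sketch -- nothing here exists yet, and the hook placement is
exactly the open question above:

void unload_user_fpstate(void)
{
	if (test_thread_flag(TIF_FPU_UNLOADED))
		return;
	/* Move the live user state to memory; fpregs become don't-care. */
	copy_fpregs_to_fpstate(&current->thread.fpu);
	set_thread_flag(TIF_FPU_UNLOADED);
}

/* And in prepare_exit_to_usermode() (or exit_to_usermode_loop()): */
if (test_and_clear_thread_flag(TIF_FPU_UNLOADED))
	copy_kernel_to_fpregs(&current->thread.fpu.state);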

The one major complication I know of is that PKRU is different from
all other FPU state in that it needs to be loaded when *kernel* code
does any user memory access.  So we'd need to make sure that context
switches load the new task's PKRU eagerly.  Using WRPKRU is easy, but,
unless we do something very clever, actually finding PKRU in the
in-memory fpstate image may be slightly nontrivial.

I suppose we could eagerly load the new FPU state on context switches,
but I'm not sure I see any point.  We'd still have to load it when we
switch back to user mode, so it would be slower but not necessarily
any simpler.

I'd love to review patches to make this change, but I don't have the
bandwidth to write them right now.

* Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-06-15 16:25 ` Thomas Gleixner
@ 2018-06-15 18:33   ` Brian Gerst
  2018-06-15 19:34   ` Peter Zijlstra
  1 sibling, 0 replies; 23+ messages in thread
From: Brian Gerst @ 2018-06-15 18:33 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jason A. Donenfeld, LKML, X86 ML, Andy Lutomirski, Peter Zijlstra

On Fri, Jun 15, 2018 at 12:25 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Fri, 15 Jun 2018, Jason A. Donenfeld wrote:
>> In a loop this looks like:
>>
>> for (thing) {
>>   kernel_fpu_begin();
>>   encrypt(thing);
>>   kernel_fpu_end();
>> }
>>
>> This is obviously very bad, because begin() and end() are slow, so
>> WireGuard does the obvious:
>>
>> kernel_fpu_begin();
>> for (thing)
>>   encrypt(thing);
>> kernel_fpu_end();
>>
>> This is fine and well, and the crypto API I'm working on will enable
>
> It might be fine crypto performance wise, but it's a total nightmare
> latency wise because kernel_fpu_begin() disables preemption. We've seen
> latencies in the larger millisecond range due to processing large data sets
> with kernel FPU.
>
> If you want to go there then we really need a better approach which allows
> kernel FPU usage in preemptible context and in case of preemption a way to
> stash the preempted FPU context and restore it when the task gets scheduled
> in again. Just using the existing FPU stuff and moving the loops inside the
> begin/end section and keeping preemption disabled for arbitrary time spans
> is not going to fly.

One optimization that can be done is to delay restoring the user FPU
state until we exit to userspace.  That way the FPU state is saved and
restored only once, no matter how many times
kernel_fpu_begin()/kernel_fpu_end() are called.

--
Brian Gerst

* Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-06-15 18:31 ` Andy Lutomirski
@ 2018-06-15 18:40   ` Rik van Riel
  2018-06-15 18:41   ` Dave Hansen
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 23+ messages in thread
From: Rik van Riel @ 2018-06-15 18:40 UTC (permalink / raw)
  To: Andy Lutomirski, Jason A. Donenfeld, Dave Hansen; +Cc: LKML, X86 ML

On Fri, 2018-06-15 at 11:31 -0700, Andy Lutomirski wrote:
> On Fri, Jun 15, 2018 at 6:11 AM Jason A. Donenfeld <Jason@zx2c4.com>
> wrote:
> > 
> > Hi Andy & folks,
> > 
> > Lots of crypto routines look like this:
> > 
> > kernel_fpu_begin();
> > encrypt();
> > kernel_fpu_end();
> > 
> > If you call such a routine twice, you get:
> > 
> > kernel_fpu_begin();
> > encrypt();
> > kernel_fpu_end();
> > kernel_fpu_begin();
> > encrypt();
> > kernel_fpu_end();
> > 
> > In a loop this looks like:
> > 
> > for (thing) {
> >   kernel_fpu_begin();
> >   encrypt(thing);
> >   kernel_fpu_end();
> > }
> > 
> > This is obviously very bad, because begin() and end() are slow, so
> > WireGuard does the obvious:
> > 
> > kernel_fpu_begin();
> > for (thing)
> >   encrypt(thing);
> > kernel_fpu_end();
> > 
> > This is fine and well, and the crypto API I'm working on will enable
> > this to be done in a clear way, but I do wonder if maybe this is not
> > something that should be happening at the level of the caller, but
> > rather in the fpu functions themselves. Namely, what are your
> > thoughts on modifying kernel_fpu_end() so that it doesn't actually
> > restore the state, but just reenables preemption and marks that on
> > the next context switch, the state should be restored? Then,
> > essentially, kernel_fpu_begin() and end() become free after the
> > first usage of kernel_fpu_begin().
> > 
> > Is this feasible? I know that, performance-wise, I'm really gaining
> > a lot from hoisting those functions out of the loops, and API-wise,
> > it'd be slightly simpler to implement if I didn't have to allow for
> > that hoisting.
> 
> Hi Jason-
> 
> Funny you asked.  This has been discussed a few times, although not
> quite in the form you imagined.  The idea that we've tossed around is
> to restore FPU state on return to user mode.  Roughly, we'd introduce
> a new thread flag TIF_FPU_UNLOADED (name TBD).
> prepare_exit_to_usermode() would notice this flag, copy the fpstate
> to fpregs, and clear the flag.  (Or maybe exit_to_usermode_loop() --
> No one has quite thought it through, but I think it should be outside
> the loop.)  We'd update all the FPU accessors to understand the flag.
> We'd probably have a helper like:
> 
> void unload_user_fpstate(void)
> 
> that would do nothing if TIF_FPU_UNLOADED is set.  If TIF_FPU_UNLOADED
> is clear, it would move the state to memory and set TIF_FPU_UNLOADED.
> We'd call unload_user_fpstate() during context switch in the prev
> task context and in kernel_fpu_begin().
> 
> The one major complication I know of is that PKRU is different from
> all other FPU state in that it needs to be loaded when *kernel* code
> does any user memory access.  So we'd need to make sure that context
> switches load the new task's PKRU eagerly.  Using WRPKRU is easy, but,
> unless we do something very clever, actually finding PKRU in the
> in-memory fpstate image may be slightly nontrivial.

IIRC the in-kernel FPU state always has every FPU field
at a constant location.

> I suppose we could eagerly load the new FPU state on context
> switches,
> but I'm not sure I see any point.  We'd still have to load it when we
> switch back to user mode, so it would be slower but not necessarily
> any simpler.
> 
> I'd love to review patches to make this change, but I don't have the
> bandwidth to write them right now.

I can take a look at this, I am already looking at
some context switch code right now, anyway.

I also should still have the TIF_FPU_UNLOADED patches
(or whatever the flag was called) code around somewhere.

-- 
All Rights Reversed.


* Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-06-15 18:31 ` Andy Lutomirski
  2018-06-15 18:40   ` Rik van Riel
@ 2018-06-15 18:41   ` Dave Hansen
  2018-06-15 18:49     ` Andy Lutomirski
  2018-06-15 18:48   ` Dave Hansen
  2018-06-15 20:33   ` Jason A. Donenfeld
  3 siblings, 1 reply; 23+ messages in thread
From: Dave Hansen @ 2018-06-15 18:41 UTC (permalink / raw)
  To: Andy Lutomirski, Jason A. Donenfeld, Rik van Riel; +Cc: LKML, X86 ML

On 06/15/2018 11:31 AM, Andy Lutomirski wrote:
> Using WRPKRU is easy, but,
> unless we do something very clever, actually finding PKRU in the
> in-memory fpstate image may be slightly nontrivial.

Why?

It's at a constant offset during any one boot, for sure.  XSAVEC/XSAVES
can move it around, but only if you change the Requested Feature BitMap
(RFBM) that we pass to XSAVES or change the control register that
enables XSAVE states (XCR0).

We don't change XCR0 after boot, and RFBM is hard-coded to -1 as far as
I remember.

* Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-06-15 18:31 ` Andy Lutomirski
  2018-06-15 18:40   ` Rik van Riel
  2018-06-15 18:41   ` Dave Hansen
@ 2018-06-15 18:48   ` Dave Hansen
  2018-06-15 18:53     ` Andy Lutomirski
  2018-06-15 20:33   ` Jason A. Donenfeld
  3 siblings, 1 reply; 23+ messages in thread
From: Dave Hansen @ 2018-06-15 18:48 UTC (permalink / raw)
  To: Andy Lutomirski, Jason A. Donenfeld, Rik van Riel; +Cc: LKML, X86 ML

On 06/15/2018 11:31 AM, Andy Lutomirski wrote:
> for (thing) {
>   kernel_fpu_begin();
>   encrypt(thing);
>   kernel_fpu_end();
> }

Don't forget that the processor has optimizations for this, too.  The
"modified optimization" will notice that between:

	kernel_fpu_end();   -> XRSTOR
and
	kernel_fpu_begin(); -> XSAVE(S|OPT)

the processor has not modified the states.  It'll skip doing any writes
of the state.  Doing what Andy is describing is still way better than
letting the processor do it, but you should just know up front that this
may not be as much of a win as you would expect.

* Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-06-15 18:41   ` Dave Hansen
@ 2018-06-15 18:49     ` Andy Lutomirski
  0 siblings, 0 replies; 23+ messages in thread
From: Andy Lutomirski @ 2018-06-15 18:49 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Lutomirski, Jason A. Donenfeld, Rik van Riel, LKML, X86 ML

On Fri, Jun 15, 2018 at 11:43 AM Dave Hansen
<dave.hansen@linux.intel.com> wrote:
>
> On 06/15/2018 11:31 AM, Andy Lutomirski wrote:
> > Using WRPKRU is easy, but,
> > unless we do something very clever, actually finding PKRU in the
> > in-memory fpstate image may be slightly nontrivial.
>
> Why?
>
> It's at a constant offset during any one boot, for sure.  XSAVEC/XSAVES
> can move it around, but only if you change the Requested Feature BitMap
> (RFBM) that we pass to XSAVES or change the control register that
> enables XSAVE states (XCR0).
>
> We don't change XCR0 after boot, and RFBM is hard-coded to -1 as far as
> I remember.

I thought that XSAVES didn't allocate space for parts of the state
that are in the init state. I guess I was wrong :)

So never mind, context switch should just need to WRPKRU the field at
the predetermined offset in fpstate for the new task.  And, if some
appropriate debug option is set, it should warn if TIF_FPU_UNLOADED is
clear for the new task.
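
Sketch of the context switch piece, assuming get_xsave_addr() can hand
back the PKRU field and with the debug knob chosen arbitrarily:

/* For the incoming task, at context switch: */
struct pkru_state *pk;

if (IS_ENABLED(CONFIG_X86_DEBUG_FPU))
	WARN_ON_ONCE(!test_tsk_thread_flag(next, TIF_FPU_UNLOADED));

pk = get_xsave_addr(&next->thread.fpu.state.xsave, XFEATURE_MASK_PKRU);
write_pkru(pk ? pk->pkru : 0);	/* NULL: PKRU is in its init state */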

I suspect that the whole patch should only be a couple hundred lines
of code.  And I think there are VM workloads where it would be a
*huge* win.

--Andy

* Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-06-15 18:48   ` Dave Hansen
@ 2018-06-15 18:53     ` Andy Lutomirski
  2018-06-15 20:27       ` Jason A. Donenfeld
  2018-06-19 11:43       ` David Laight
  0 siblings, 2 replies; 23+ messages in thread
From: Andy Lutomirski @ 2018-06-15 18:53 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Lutomirski, Jason A. Donenfeld, Rik van Riel, LKML, X86 ML

On Fri, Jun 15, 2018 at 11:50 AM Dave Hansen
<dave.hansen@linux.intel.com> wrote:
>
> On 06/15/2018 11:31 AM, Andy Lutomirski wrote:
> > for (thing) {
> >   kernel_fpu_begin();
> >   encrypt(thing);
> >   kernel_fpu_end();
> > }
>
> Don't forget that the processor has optimizations for this, too.  The
> "modified optimization" will notice that between:
>
>         kernel_fpu_end();   -> XRSTOR
> and
>         kernel_fpu_begin(); -> XSAVE(S|OPT)
>
> the processor has not modified the states.  It'll skip doing any writes
> of the state.  Doing what Andy is describing is still way better than
> letting the processor do it, but you should just know up front that this
> may not be as much of a win as you would expect.

Even with the modified optimization, kernel_fpu_end() still needs to
reload the state that was trashed by the kernel FPU use.  If the
kernel is using something like AVX512 state, then kernel_fpu_end()
will transfer an enormous amount of data no matter how clever the CPU
is.  And I think I once measured XSAVEOPT taking a hundred cycles or
so even when RFBM==0, so it's not exactly super fast.

* Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-06-15 16:25 ` Thomas Gleixner
  2018-06-15 18:33   ` Brian Gerst
@ 2018-06-15 19:34   ` Peter Zijlstra
  2018-06-15 20:30     ` Jason A. Donenfeld
  1 sibling, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2018-06-15 19:34 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Jason A. Donenfeld, LKML, X86 ML, Andy Lutomirski

On Fri, Jun 15, 2018 at 06:25:39PM +0200, Thomas Gleixner wrote:
> On Fri, 15 Jun 2018, Jason A. Donenfeld wrote:
> > In a loop this looks like:
> > 
> > for (thing) {
> >   kernel_fpu_begin();
> >   encrypt(thing);
> >   kernel_fpu_end();
> > }
> > 
> > This is obviously very bad, because begin() and end() are slow, so
> > WireGuard does the obvious:
> > 
> > kernel_fpu_begin();
> > for (thing)
> >   encrypt(thing);
> > kernel_fpu_end();
> > 
> > This is fine and well, and the crypto API I'm working on will enable
> 
> It might be fine crypto performance wise, but it's a total nightmare
> latency wise because kernel_fpu_begin() disables preemption. We've seen
> latencies in the larger millisecond range due to processing large data sets
> with kernel FPU.
> 
> If you want to go there then we really need a better approach which allows
> kernel FPU usage in preemptible context and in case of preemption a way to
> stash the preempted FPU context and restore it when the task gets scheduled
> in again. Just using the existing FPU stuff and moving the loops inside the
> begin/end section and keeping preemption disabled for arbitrary time spans
> is not going to fly.

Didn't we recently do a bunch of crypto patches to help with this?

I think they had the pattern:

	kernel_fpu_begin();
	for (units-of-work) {
		do_unit_of_work();
		if (need_resched()) {
			kernel_fpu_end();
			cond_resched();
			kernel_fpu_begin();
		}
	}
	kernel_fpu_end();

I'd have to go dig out the actual series, but I think they had a bunch
of helpers to deal with that.

* Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-06-15 18:53     ` Andy Lutomirski
@ 2018-06-15 20:27       ` Jason A. Donenfeld
  2018-06-15 20:48         ` Jason A. Donenfeld
  2018-06-19 11:43       ` David Laight
  1 sibling, 1 reply; 23+ messages in thread
From: Jason A. Donenfeld @ 2018-06-15 20:27 UTC (permalink / raw)
  To: Andrew Lutomirski; +Cc: dave.hansen, riel, LKML, X86 ML

On Fri, Jun 15, 2018 at 8:53 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> On Fri, Jun 15, 2018 at 11:50 AM Dave Hansen
> <dave.hansen@linux.intel.com> wrote:
> Even with the modified optimization, kernel_fpu_end() still needs to
> reload the state that was trashed by the kernel FPU use.  If the
> kernel is using something like AVX512 state, then kernel_fpu_end()
> will transfer an enormous amount of data no matter how clever the CPU
> is.  And I think I once measured XSAVEOPT taking a hundred cycles or
> so even when RFBM==0, so it's not exactly super fast.

Indeed, the speedup is really significant, especially in the AVX512
case. Here are some numbers from my laptop and a server, taken a few
seconds ago:

AVX2 - Intel(R) Xeon(R) CPU E3-1505M v5 @ 2.80GHz
Inside: 684617437
Outside: 547710093
Percent speedup: 24

AVX512 - Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz
Inside: 634415672
Outside: 286698960
Percent speedup: 121

This is from this test -- https://xn--4db.cc/F7RF2fhv/c . There are
probably various issues with that test case, and it's possible there
are other effects going on (the avx512 case looks particularly insane)
to make the difference _that_ drastic, but I think there's no doubt
that the optimization here is a meaningful one.
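
For reference, the gist of the test -- paraphrased here, not the exact
linked code -- is just:

u64 t0, inside, outside;
int i;

t0 = rdtsc();
for (i = 0; i < TRIALS; ++i) {
	kernel_fpu_begin();
	encrypt(buf);		/* one AVX2/AVX512 primitive invocation */
	kernel_fpu_end();
}
inside = rdtsc() - t0;

t0 = rdtsc();
kernel_fpu_begin();
for (i = 0; i < TRIALS; ++i)
	encrypt(buf);
kernel_fpu_end();
outside = rdtsc() - t0;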

* Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-06-15 19:34   ` Peter Zijlstra
@ 2018-06-15 20:30     ` Jason A. Donenfeld
  2018-06-18  9:44       ` Peter Zijlstra
  0 siblings, 1 reply; 23+ messages in thread
From: Jason A. Donenfeld @ 2018-06-15 20:30 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Thomas Gleixner, LKML, X86 ML, Andy Lutomirski

On Fri, Jun 15, 2018 at 9:34 PM Peter Zijlstra <peterz@infradead.org> wrote:
> Didn't we recently do a bunch of crypto patches to help with this?
>
> I think they had the pattern:
>
>         kernel_fpu_begin();
>         for (units-of-work) {
>                 do_unit_of_work();
>                 if (need_resched()) {
>                         kernel_fpu_end();
>                         cond_resched();
>                         kernel_fpu_begin();
>                 }
>         }
>         kernel_fpu_end();

Right, so that's the thing -- this is an optimization easily available
to individual crypto primitives. But I'm interested in applying this
kind of optimization to an entire queue of, say, tiny packets, where
each packet is processed individually. Or, to a cryptographic
construction, where several different primitives are used, such that
it'd be meaningful not to have to take the performance hit of
end()/begin() in between each and every one of them.

* Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-06-15 18:31 ` Andy Lutomirski
                     ` (2 preceding siblings ...)
  2018-06-15 18:48   ` Dave Hansen
@ 2018-06-15 20:33   ` Jason A. Donenfeld
  2018-06-15 20:42     ` Dave Hansen
  2018-07-11 16:28     ` Sebastian Andrzej Siewior
  3 siblings, 2 replies; 23+ messages in thread
From: Jason A. Donenfeld @ 2018-06-15 20:33 UTC (permalink / raw)
  To: Andrew Lutomirski; +Cc: riel, dave.hansen, LKML, X86 ML

On Fri, Jun 15, 2018 at 8:32 PM Andy Lutomirski <luto@kernel.org> wrote:
> quite in the form you imagined.  The idea that we've tossed around is
> to restore FPU state on return to user mode.  Roughly, we'd introduce
> a new thread flag TIF_FPU_UNLOADED (name TBD).
> prepare_exit_to_usermode() would notice this flag, copy the fpstate to
> fpregs, and clear the flag.  (Or maybe exit_to_usermode_loop() -- No
> one has quite thought it through, but I think it should be outside the
> loop.)  We'd update all the FPU accessors to understand the flag.

Yes! This is exactly what I was thinking. Then those calls to begin()
and end() could be placed as close to the actual FPU usage as
possible.

* Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-06-15 20:33   ` Jason A. Donenfeld
@ 2018-06-15 20:42     ` Dave Hansen
  2018-06-15 20:52       ` Andy Lutomirski
  2018-06-15 20:56       ` Rik van Riel
  2018-07-11 16:28     ` Sebastian Andrzej Siewior
  1 sibling, 2 replies; 23+ messages in thread
From: Dave Hansen @ 2018-06-15 20:42 UTC (permalink / raw)
  To: Jason A. Donenfeld, Andrew Lutomirski; +Cc: riel, LKML, X86 ML

On 06/15/2018 01:33 PM, Jason A. Donenfeld wrote:
> On Fri, Jun 15, 2018 at 8:32 PM Andy Lutomirski <luto@kernel.org> wrote:
>> quite in the form you imagined.  The idea that we've tossed around is
>> to restore FPU state on return to user mode.  Roughly, we'd introduce
>> a new thread flag TIF_FPU_UNLOADED (name TBD).
>> prepare_exit_to_usermode() would notice this flag, copy the fpstate to
>> fpregs, and clear the flag.  (Or maybe exit_to_usermode_loop() -- No
>> one has quite thought it through, but I think it should be outside the
>> loop.)  We'd update all the FPU accessors to understand the flag.
> Yes! This is exactly what I was thinking. Then those calls to begin()
> and end() could be placed as close to the actual FPU usage as
> possible.

Andy, what was the specific concern about PKRU?  That we might do:

	kernel_fpu_begin(); <- Saves the first time
	something()
	kernel_fpu_end(); <- Does not XRSTOR

	copy_from_user(); <- Sees old PKRU, does the wrong thing

	prepare_exit_to_usermode(); <- Does the XRSTOR
	// only now does PKRU have the right value
	SYSRET/IRET

?

Does that *matter* unless something() modified PKRU?  We could just make
the rule that nobody is supposed to mess with it and that it's not
covered by kernel_fpu_begin/end() semantics.  We could even
theoretically enforce that in a debug environment if we watch its value.
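
Enforcing it could be pretty small, something like this sketch (the
per-cpu variable is invented; preemption being disabled between begin
and end is what makes per-cpu storage OK):

static DEFINE_PER_CPU(u32, kfpu_entry_pkru);

void kernel_fpu_begin(void)
{
	/* ... existing save logic ... */
	if (IS_ENABLED(CONFIG_X86_DEBUG_FPU))
		this_cpu_write(kfpu_entry_pkru, read_pkru());
}

void kernel_fpu_end(void)
{
	if (IS_ENABLED(CONFIG_X86_DEBUG_FPU))
		WARN_ON_ONCE(read_pkru() != this_cpu_read(kfpu_entry_pkru));
	/* ... existing restore logic ... */
}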

* Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-06-15 20:27       ` Jason A. Donenfeld
@ 2018-06-15 20:48         ` Jason A. Donenfeld
  0 siblings, 0 replies; 23+ messages in thread
From: Jason A. Donenfeld @ 2018-06-15 20:48 UTC (permalink / raw)
  To: Andrew Lutomirski; +Cc: dave.hansen, riel, LKML, X86 ML

On Fri, Jun 15, 2018 at 10:27 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> AVX512 - Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz
> Inside: 634415672
> Outside: 286698960
> Percent speedup: 121

More realistic results, with turbo turned off:

Inside: 616851298
Outside: 339606790
Percent speedup: 81

Still a pretty outstanding speedup.

* Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-06-15 20:42     ` Dave Hansen
@ 2018-06-15 20:52       ` Andy Lutomirski
  2018-06-15 20:56       ` Rik van Riel
  1 sibling, 0 replies; 23+ messages in thread
From: Andy Lutomirski @ 2018-06-15 20:52 UTC (permalink / raw)
  To: Dave Hansen; +Cc: Jason A. Donenfeld, Andrew Lutomirski, riel, LKML, X86 ML



> On Jun 15, 2018, at 1:42 PM, Dave Hansen <dave.hansen@linux.intel.com> wrote:
> 
>> On 06/15/2018 01:33 PM, Jason A. Donenfeld wrote:
>>> On Fri, Jun 15, 2018 at 8:32 PM Andy Lutomirski <luto@kernel.org> wrote:
>>> quite in the form you imagined.  The idea that we've tossed around is
>>> to restore FPU state on return to user mode.  Roughly, we'd introduce
>>> a new thread flag TIF_FPU_UNLOADED (name TBD).
>>> prepare_exit_to_usermode() would notice this flag, copy the fpstate to
>>> fpregs, and clear the flag.  (Or maybe exit_to_usermode_loop() -- No
>>> one has quite thought it through, but I think it should be outside the
>>> loop.)  We'd update all the FPU accessors to understand the flag.
>> Yes! This is exactly what I was thinking. Then those calls to begin()
>> and end() could be placed as close to the actual FPU usage as
>> possible.
> 
> Andy, what was the specific concern about PKRU?  That we might do:
> 
>    kernel_fpu_begin(); <- Saves the first time
>    something()
>    kernel_fpu_end(); <- Does not XRSTOR
> 
>    copy_from_user(); <- Sees old PKRU, does the wrong thing
> 
>    prepare_exit_to_usermode(); <- Does the XRSTOR
>    // only now does PKRU have the right value
>    SYSRET/IRET
> 
> ?
> 
> Does that *matter* unless something() modified PKRU?  We could just make
> the rule that nobody is supposed to mess with it and that it's not
> covered by kernel_fpu_begin/end() semantics.  We could even
> theoretically enforce that in a debug environment if we watch its value.

Indeed, this case seems highly unlikely. I was imagining we have two
tasks. Task A enters the kernel and sleeps. Task B runs and gets its
PKRU loaded. Then it enters the kernel and we switch back to A. Now we
have A’s FPU state in memory and B’s PKRU in the register. If A does
copy_to_user, we lose.

* Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-06-15 20:42     ` Dave Hansen
  2018-06-15 20:52       ` Andy Lutomirski
@ 2018-06-15 20:56       ` Rik van Riel
  1 sibling, 0 replies; 23+ messages in thread
From: Rik van Riel @ 2018-06-15 20:56 UTC (permalink / raw)
  To: Dave Hansen, Jason A. Donenfeld, Andrew Lutomirski; +Cc: LKML, X86 ML

On Fri, 2018-06-15 at 13:42 -0700, Dave Hansen wrote:
> On 06/15/2018 01:33 PM, Jason A. Donenfeld wrote:
> > On Fri, Jun 15, 2018 at 8:32 PM Andy Lutomirski <luto@kernel.org>
> > wrote:
> > > quite in the form you imagined.  The idea that we've tossed
> > > around is to restore FPU state on return to user mode.  Roughly,
> > > we'd introduce a new thread flag TIF_FPU_UNLOADED (name TBD).
> > > prepare_exit_to_usermode() would notice this flag, copy the
> > > fpstate to fpregs, and clear the flag.  (Or maybe
> > > exit_to_usermode_loop() -- No one has quite thought it through,
> > > but I think it should be outside the loop.)  We'd update all the
> > > FPU accessors to understand the flag.
> > 
> > Yes! This is exactly what I was thinking. Then those calls to
> > begin() and end() could be placed as close to the actual FPU usage
> > as possible.
> 
> Andy, what was the specific concern about PKRU?  That we might do:
> 
> 	kernel_fpu_begin(); <- Saves the first time
> 	something()
> 	kernel_fpu_end(); <- Does not XRSTOR
> 
> 	copy_from_user(); <- Sees old PKRU, does the wrong thing
> 
> 	prepare_exit_to_usermode(); <- Does the XRSTOR
> 	// only now does PKRU have the right value
> 	SYSRET/IRET
> 
> ?
> 
> Does that *matter* unless something() modified PKRU?  We could just
> make the rule that nobody is supposed to mess with it and that it's
> not covered by kernel_fpu_begin/end() semantics.  We could even
> theoretically enforce that in a debug environment if we watch its
> value.

KVM needs to change out guest and host PKRU values when switching
between guest and host mode, but since f775b13eedee ("x86,kvm: move
qemu/guest FPU switching out to vcpu_run") that no longer happens under
kernel_fpu_begin/end, so we don't need to care about that :)

-- 
All Rights Reversed.


* Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-06-15 20:30     ` Jason A. Donenfeld
@ 2018-06-18  9:44       ` Peter Zijlstra
  2018-06-18 15:25         ` Jason A. Donenfeld
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2018-06-18  9:44 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: Thomas Gleixner, LKML, X86 ML, Andy Lutomirski

On Fri, Jun 15, 2018 at 10:30:46PM +0200, Jason A. Donenfeld wrote:
> On Fri, Jun 15, 2018 at 9:34 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > Didn't we recently do a bunch of crypto patches to help with this?
> >
> > I think they had the pattern:
> >
> >         kernel_fpu_begin();
> >         for (units-of-work) {
> >                 do_unit_of_work();
> >                 if (need_resched()) {
> >                         kernel_fpu_end();
> >                         cond_resched();
> >                         kernel_fpu_begin();
> >                 }
> >         }
> >         kernel_fpu_end();
> 
> Right, so that's the thing -- this is an optimization easily available
> to individual crypto primitives. But I'm interested in applying this
> kind of optimization to an entire queue of, say, tiny packets, where
> each packet is processed individually. Or, to a cryptographic
> construction, where several different primitives are used, such that
> it'd be meaningful not to have to take the performance hit of
> end()/begin() in between each and every one of them.

I'm confused... how does the above not apply to your situation?

* Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-06-18  9:44       ` Peter Zijlstra
@ 2018-06-18 15:25         ` Jason A. Donenfeld
  0 siblings, 0 replies; 23+ messages in thread
From: Jason A. Donenfeld @ 2018-06-18 15:25 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Thomas Gleixner, LKML, X86 ML, Andy Lutomirski

On Mon, Jun 18, 2018 at 11:44 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Jun 15, 2018 at 10:30:46PM +0200, Jason A. Donenfeld wrote:
> > On Fri, Jun 15, 2018 at 9:34 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > > Didn't we recently do a bunch of crypto patches to help with this?
> > >
> > > I think they had the pattern:
> > >
> > >         kernel_fpu_begin();
> > >         for (units-of-work) {
> > >                 do_unit_of_work();
> > >                 if (need_resched()) {
> > >                         kernel_fpu_end();
> > >                         cond_resched();
> > >                         kernel_fpu_begin();
> > >                 }
> > >         }
> > >         kernel_fpu_end();
> >
> > Right, so that's the thing -- this is an optimization easily available
> > to individual crypto primitives. But I'm interested in applying this
> > kind of optimization to an entire queue of, say, tiny packets, where
> > each packet is processed individually. Or, to a cryptographic
> > construction, where several different primitives are used, such that
> > it'd be meaningful not to have to take the performance hit of
> > end()/begin() in between each and every one of them.
>
> I'm confused.. how does the above not apply to your situation?

In the example you've given, the optimization is applied at the level
of the, say, encryption function. Suppose you send a scatter-gather
list off to an encryption function, which then walks the sglist and
encrypts each of the parts using some particular key. For each of the
parts, it benefits from the above optimization.

But what I'm referring to is encrypting multiple different things,
with different keys. In the case I'd like to optimize, I have a worker
thread that's processing a large queue of separate sglists and
encrypting them separately under different keys. In this case, having
kernel_fpu_begin/end inside the encryption function itself is a
problem, since that means toggling the FPU in between every queue
item. The solution, for now, is to just hoist the kernel_fpu_begin/end
out of the encryption function, and put them instead at the beginning
and end of my worker thread that handles all the items of the queue.
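
I.e., the current shape is roughly this (the queue and helper names are
made up for illustration):

static void crypt_worker(struct work_struct *work)
{
	struct sk_buff *skb;

	kernel_fpu_begin();	/* hoisted out of the crypto primitives */
	while ((skb = skb_dequeue(&encrypt_queue)) != NULL)
		encrypt_packet(skb, peer_key_of(skb));	/* key varies per item */
	kernel_fpu_end();	/* plus need_resched() breaks, per tglx */
}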

This is fine and dandy, but far from ideal, as putting that kind of
logic inside the encryption function itself makes more sense. For
example, the encryption function can decide whether or not it even
wants to use the FPU before calling kernel_fpu_begin. Ostensibly this
logic too could be hoisted outside, but at what point do you draw the
line and decide these optimizations are leading the API in the wrong
direction?

Hence, the idea here in this thread is to make it cost-free to place
kernel_fpu_begin/end as close as possible to the actual use of the
FPU. The solution, it seems, is to have the actual kernel_fpu_end work
occur on context switch.

* RE: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-06-15 18:53     ` Andy Lutomirski
  2018-06-15 20:27       ` Jason A. Donenfeld
@ 2018-06-19 11:43       ` David Laight
  2018-06-19 13:08         ` Thomas Gleixner
  1 sibling, 1 reply; 23+ messages in thread
From: David Laight @ 2018-06-19 11:43 UTC (permalink / raw)
  To: 'Andy Lutomirski', Dave Hansen
  Cc: Jason A. Donenfeld, Rik van Riel, LKML, X86 ML

From: Andy Lutomirski
> Sent: 15 June 2018 19:54
> On Fri, Jun 15, 2018 at 11:50 AM Dave Hansen
> <dave.hansen@linux.intel.com> wrote:
> >
> > On 06/15/2018 11:31 AM, Andy Lutomirski wrote:
> > > for (thing) {
> > >   kernel_fpu_begin();
> > >   encrypt(thing);
> > >   kernel_fpu_end();
> > > }
> >
> > Don't forget that the processor has optimizations for this, too.  The
> > "modified optimization" will notice that between:
> >
> >         kernel_fpu_end();   -> XRSTOR
> > and
> >         kernel_fpu_begin(); -> XSAVE(S|OPT)
> >
> > the processor has not modified the states.  It'll skip doing any writes
> > of the state.  Doing what Andy is describing is still way better than
> > letting the processor do it, but you should just know up front that this
> > may not be as much of a win as you would expect.
> 
> Even with the modified optimization, kernel_fpu_end() still needs to
> reload the state that was trashed by the kernel FPU use.  If the
> kernel is using something like AVX512 state, then kernel_fpu_end()
> will transfer an enormous amount of data no matter how clever the CPU
> is.  And I think I once measured XSAVEOPT taking a hundred cycles or
> so even when RFBM==0, so it's not exactly super fast.

If the kernel was entered by a system call, do you need to save the
AVX512 state at all?
IIRC the registers are all defined as 'caller-saved', so there is no
expectation that they will be saved across the syscall wrapper function
call.
All you need to do is ensure that 'kernel' values aren't passed back to
userspace.
There is a single instruction to zero all the AVX512 registers.

	David


* RE: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-06-19 11:43       ` David Laight
@ 2018-06-19 13:08         ` Thomas Gleixner
  0 siblings, 0 replies; 23+ messages in thread
From: Thomas Gleixner @ 2018-06-19 13:08 UTC (permalink / raw)
  To: David Laight
  Cc: 'Andy Lutomirski',
	Dave Hansen, Jason A. Donenfeld, Rik van Riel, LKML, X86 ML

On Tue, 19 Jun 2018, David Laight wrote:
> From: Andy Lutomirski
> > Sent: 15 June 2018 19:54
> > On Fri, Jun 15, 2018 at 11:50 AM Dave Hansen
> > <dave.hansen@linux.intel.com> wrote:
> > >
> > > On 06/15/2018 11:31 AM, Andy Lutomirski wrote:
> > > > for (thing) {
> > > >   kernel_fpu_begin();
> > > >   encrypt(thing);
> > > >   kernel_fpu_end();
> > > > }
> > >
> > > Don't forget that the processor has optimizations for this, too.  The
> > > "modified optimization" will notice that between:
> > >
> > >         kernel_fpu_end();   -> XRSTOR
> > > and
> > >         kernel_fpu_begin(); -> XSAVE(S|OPT)
> > >
> > > the processor has not modified the states.  It'll skip doing any writes
> > > of the state.  Doing what Andy is describing is still way better than
> > > letting the processor do it, but you should just know up front that this
> > > may not be as much of a win as you would expect.
> > 
> > Even with the modified optimization, kernel_fpu_end() still needs to
> > reload the state that was trashed by the kernel FPU use.  If the
> > kernel is using something like AVX512 state, then kernel_fpu_end()
> > will transfer an enormous amount of data no matter how clever the CPU
> > is.  And I think I once measured XSAVEOPT taking a hundred cycles or
> > so even when RFBM==0, so it's not exactly super fast.
> 
> If the kernel was entered by a system call, do you need to save the
> AVX512 state at all?
> IIRC the registers are all defined as 'caller-saved', so there is no
> expectation that they will be saved across the syscall wrapper function
> call.
> All you need to do is ensure that 'kernel' values aren't passed back to
> userspace.
> There is a single instruction to zero all the AVX512 registers.

Then we need different treatment for exception entries and consecutive
preemption. Lots of corner cases to cover ...

Thanks,

	tglx

* Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-06-15 20:33   ` Jason A. Donenfeld
  2018-06-15 20:42     ` Dave Hansen
@ 2018-07-11 16:28     ` Sebastian Andrzej Siewior
  2018-07-11 20:10       ` Rik van Riel
  1 sibling, 1 reply; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2018-07-11 16:28 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: Andrew Lutomirski, riel, dave.hansen, LKML, X86 ML

On 2018-06-15 22:33:47 [+0200], Jason A. Donenfeld wrote:
> On Fri, Jun 15, 2018 at 8:32 PM Andy Lutomirski <luto@kernel.org> wrote:
> > quite in the form you imagined.  The idea that we've tossed around is
> > to restore FPU state on return to user mode.  Roughly, we'd introduce
> > a new thread flag TIF_FPU_UNLOADED (name TBD).
> > prepare_exit_to_usermode() would notice this flag, copy the fpstate to
> > fpregs, and clear the flag.  (Or maybe exit_to_usermode_loop() -- No
> > one has quite thought it through, but I think it should be outside the
> > loop.)  We'd update all the FPU accessors to understand the flag.
> 
> Yes! This is exactly what I was thinking. Then those calls to begin()
> and end() could be placed as close to the actual FPU usage as
> possible.

I was thinking about this myself. Did anyone try to hack something in
the meantime? I might want to look into this, too :)

Sebastian

* Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch
  2018-07-11 16:28     ` Sebastian Andrzej Siewior
@ 2018-07-11 20:10       ` Rik van Riel
  0 siblings, 0 replies; 23+ messages in thread
From: Rik van Riel @ 2018-07-11 20:10 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Jason A. Donenfeld
  Cc: Andrew Lutomirski, dave.hansen, LKML, X86 ML

On Wed, 2018-07-11 at 18:28 +0200, Sebastian Andrzej Siewior wrote:
> On 2018-06-15 22:33:47 [+0200], Jason A. Donenfeld wrote:
> > On Fri, Jun 15, 2018 at 8:32 PM Andy Lutomirski <luto@kernel.org>
> > wrote:
> > > quite in the form you imagined.  The idea that we've tossed
> > > around is to restore FPU state on return to user mode.  Roughly,
> > > we'd introduce a new thread flag TIF_FPU_UNLOADED (name TBD).
> > > prepare_exit_to_usermode() would notice this flag, copy the
> > > fpstate to fpregs, and clear the flag.  (Or maybe
> > > exit_to_usermode_loop() -- No one has quite thought it through,
> > > but I think it should be outside the loop.)  We'd update all the
> > > FPU accessors to understand the flag.
> > 
> > Yes! This is exactly what I was thinking. Then those calls to
> > begin() and end() could be placed as close to the actual FPU usage
> > as possible.
> 
> I was thinking about this myself. Did anyone try to hack something in
> the meantime? I might want to look into this, too :)

I have implemented this before, back before the
big rewrite of the syscall entry and exit code.

It seemed to work fine.

-- 
All Rights Reversed.

