All of lore.kernel.org
 help / color / mirror / Atom feed
From: Thomas Gleixner <tglx@linutronix.de>
To: Andy Lutomirski <luto@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>,
	LKML <linux-kernel@vger.kernel.org>, X86 ML <x86@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Will Deacon <will@kernel.org>,
	Paolo Bonzini <pbonzini@redhat.com>,
	kvm list <kvm@vger.kernel.org>,
	linux-arch <linux-arch@vger.kernel.org>,
	Mike Rapoport <rppt@linux.ibm.com>,
	Miroslav Benes <mbenes@suse.cz>
Subject: Re: [patch V2 07/17] x86/entry/64: Remove redundant interrupt disable
Date: Fri, 25 Oct 2019 00:33:52 +0200 (CEST)	[thread overview]
Message-ID: <alpine.DEB.2.21.1910242353330.1783@nanos.tec.linutronix.de> (raw)
In-Reply-To: <CALCETrXyhnMwwOyWQ-FtsNFAsrcG41-pPrAp8Wj2vc0N9JzP-Q@mail.gmail.com>

On Thu, 24 Oct 2019, Andy Lutomirski wrote:
> On Thu, Oct 24, 2019 at 1:53 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> > I spent quite some time digging deeper into this. Finding all corner cases
> > which eventually enable interrupts from an exception handler is not as
> > trivial as it looked in the first place. Especially the fault handler is a
> > nightmare. Also PeterZ's approach of doing
> >
> >            if (regs->eflags & IF)
> >                 local_irq_disable();
> >
> > is doomed due to sys_iopl(). See below.
> 
> I missed something in the discussion.  What breaks?

Assume user space has issued CLI then the above check is giving the wrong
answer because it assumes that all faults in user mode have IF set.

> Can you check user_mode(regs) too?

Yes, but I still hate it with a passion :)

> > What's your plan with cr2? Stash it in pt_regs or something else?
> 
> Just read it from CR2.  I added a new idtentry macro arg called
> "entry_work", and setting it to 0 causes the enter_from_user_mode to
> be skipped.  Then C code calls enter_from_user_mode() after reading
> CR2 (and DR7).  WIP code is here:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/idtentry
> 
> The idea is that, if everything is converted, then we get rid of the
> entry_work=1 case, which is easier if there's a macro.
> 
> So my suggestion is to use a macro for the 2-arg version and open-code
> all the 3-arg cases.  Then, when the dust settles, we get rid of the
> third arg and they can use the macro.

I'll have a look tomorrow with brain awake.
 
> > The interesting bells and whistels result from sys_iopl(). If user space
> > has been granted iopl(level = 3) it gains cli/sti priviledges. When the
> > application has interrupts disabled in userspace:
> >
> >   - invocation of a syscall
> >
> >   - any exception (aside of NMI/MCE) which conditionally enables interrupts
> >     depending on user_mode(regs) and therefor can be preempted and
> >     schedule
> >
> > is just undefined behaviour and I personally consider it to be a plain bug.
> >
> > Just for the record: This results in running a resulting or even completely
> > unrelated signal handler with interrupts disabled as well.
> 
> I am seriously tempted to say that the solution is to remove iopl(),
> at least on 64-bit kernels.  Doing STI in user mode is BS :)

STI would be halfways sane. CLI is the problem. And yes I agree it's BS :)

> Otherwise we need to give it semantics, no?  I personally have no
> actual problem with the fact that an NMI can cause scheduling to
> happen.  Big fscking deal.

Right, I don't care either. I do neither care that any exception/syscall
which hits a user space CLI region might schedule. It's been that way
forever.

But giving this semantics is insanely hard at least if you want sensible,
useful, consistent and understandable semantics. I know that's overrated.

> > Whatever we decide it is, leaving it completely inconsistent is not a
> > solution at all. The options are:
> >
> >   1)  Always do conditional tracing depending on the user_regs->eflags.IF
> >       state.
> 
> I'm okay with always tracing like user mode means IRQs on or doing it
> "correctly".  I consider the former to be simpler and therefore quite
> possibly better.
> 
> >
> >   2)  #1 + warn once when syscalls and exceptions (except NMI/MCE) happen
> >       and user_regs->eflags.IF is cleared.
> >
> >   3a) #2 + enforce signal handling to run with interrupts enabled.
> >
> >   3b) #2 + set regs->eflags.IF. So the state is always correct from the
> >       kernel POV. Of course that changes existing behaviour, but its
> >       changing undefined and inconsistent behaviour.
> >
> >   4) Let iopl(level) return -EPERM if level == 3.
> >
> >      Yeah, I know it's not possible due to regressions (DPKD uses iopl(3)),
> >      but TBH that'd be the sanest option of all.
> >
> >      Of course the infinite wisdom of hardware designers tied IN, INS, OUT,
> >      OUTS and CLI/STI together on IOPL so we cannot even distangle them in
> >      any way.
> 
> >
> >      The only way out would be to actually use a full 8K sized I/O bitmap,
> >      but that's a massive pain as it has to be copied on every context
> >      switch.
> 
> Hmm.  This actually doesn't seem that bad.  We already have a TIF_
> flag to optimize this.  So basically iopl() would effectively become
> ioperm(everything on).

Yes, and the insane user space would:

     1) Pay the latency price for copying 8K bitmap on every context switch
     	IN

     2) Inflict latency on the next task due to requiring memset of 8K
     	bitmap on every context switch OUT

     3) #GP when issuing CLI/STI

I personally have no problem with that. #1 and #3 are sane and as iopl()
requires CAP_RAW_IO it's not available to Joe User, so the sysadmin is
responsible for eventual issues resulting from #2.

Though the no-regression hammer might pound on #3 as it breaks random
engineering trainwrecks from hell.

#1/#2 could be easily mitigated though.

      struct tss_struct {
      	struct x86_hw_tss       x86_tss;
	unsigned long           io_bitmap[IO_BITMAP_LONGS + 1];
      };

and x86_tss has

    u16	io_bitmap_base;

which is either set to

  INVALID_IO_BITMAP_OFFSET ( 0x8000 )

or

  IO_BITMAP_OFFSET
    (offsetof(struct tss_struct, io_bitmap) - offsetof(struct tss_struct, x86_tss))

So we could add

	unsigned long           io_bitmap_all[IO_BITMAP_LONGS + 1];

and just set the base to this one.

But that involves also upping __KERNEL_TSS_LIMIT. Too tired to think about
the implications of that right now.

Thanks,

	tglx

  reply	other threads:[~2019-10-24 22:34 UTC|newest]

Thread overview: 64+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-10-23 12:27 [patch V2 00/17] entry: Provide generic implementation for host and guest entry/exit work Thomas Gleixner
2019-10-23 12:27 ` [patch V2 01/17] x86/entry/32: Remove unused resume_userspace label Thomas Gleixner
2019-10-23 13:43   ` Sean Christopherson
2019-11-06 15:26   ` Alexandre Chartre
2019-11-16 12:02   ` [tip: x86/asm] " tip-bot2 for Thomas Gleixner
2019-10-23 12:27 ` [patch V2 02/17] x86/entry/64: Remove pointless jump in paranoid_exit Thomas Gleixner
2019-10-23 13:45   ` Sean Christopherson
2019-11-06 15:29   ` Alexandre Chartre
2019-11-16 12:02   ` [tip: x86/asm] " tip-bot2 for Thomas Gleixner
2019-10-23 12:27 ` [patch V2 03/17] x86/traps: Remove pointless irq enable from do_spurious_interrupt_bug() Thomas Gleixner
2019-10-23 13:52   ` Sean Christopherson
2019-10-23 21:31   ` Josh Poimboeuf
2019-10-23 22:35     ` Thomas Gleixner
2019-10-23 22:49       ` Josh Poimboeuf
2019-10-23 23:18         ` Thomas Gleixner
2019-11-06 15:33   ` Alexandre Chartre
2020-02-27 14:15   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
2019-10-23 12:27 ` [patch V2 04/17] x86/entry: Make DEBUG_ENTRY_ASSERT_IRQS_OFF available for 32bit Thomas Gleixner
2019-10-23 14:16   ` Sean Christopherson
2019-11-06 15:50   ` Alexandre Chartre
2019-10-23 12:27 ` [patch V2 05/17] x86/traps: Make interrupt enable/disable symmetric in C code Thomas Gleixner
2019-10-23 14:16   ` Sean Christopherson
2019-10-23 22:01   ` Josh Poimboeuf
2019-10-23 23:23     ` Thomas Gleixner
2019-11-06 16:19   ` Alexandre Chartre
2019-10-23 12:27 ` [patch V2 06/17] x86/entry/32: Remove redundant interrupt disable Thomas Gleixner
2019-10-23 14:17   ` Sean Christopherson
2019-11-08 10:41   ` Alexandre Chartre
2019-10-23 12:27 ` [patch V2 07/17] x86/entry/64: " Thomas Gleixner
2019-10-23 14:20   ` Sean Christopherson
2019-10-23 22:06   ` Josh Poimboeuf
2019-10-23 23:52     ` Thomas Gleixner
2019-10-24 16:18       ` Andy Lutomirski
2019-10-24 20:52         ` Thomas Gleixner
2019-10-24 20:59           ` Thomas Gleixner
2019-10-24 21:21           ` Peter Zijlstra
2019-10-24 21:24           ` Andy Lutomirski
2019-10-24 22:33             ` Thomas Gleixner [this message]
2019-11-08 11:07   ` Alexandre Chartre
2019-10-23 12:27 ` [patch V2 08/17] x86/entry: Move syscall irq tracing to C code Thomas Gleixner
2019-10-23 21:30   ` Andy Lutomirski
2019-10-23 21:35     ` Andy Lutomirski
2019-10-23 23:31       ` Thomas Gleixner
2019-10-23 23:16     ` Thomas Gleixner
2019-10-24 16:24     ` Andy Lutomirski
2019-10-24 17:40       ` Peter Zijlstra
2019-10-24 20:54         ` Thomas Gleixner
2019-10-23 12:27 ` [patch V2 09/17] x86/entry: Remove _TIF_NOHZ from _TIF_WORK_SYSCALL_ENTRY Thomas Gleixner
2020-01-06  4:11   ` Frederic Weisbecker
2019-10-23 12:27 ` [patch V2 10/17] entry: Provide generic syscall entry functionality Thomas Gleixner
2019-10-23 12:27 ` [patch V2 11/17] x86/entry: Use generic syscall entry function Thomas Gleixner
2019-10-23 12:27 ` [patch V2 12/17] entry: Provide generic syscall exit function Thomas Gleixner
2019-10-23 12:27 ` [patch V2 13/17] x86/entry: Use generic syscall exit functionality Thomas Gleixner
2019-10-23 12:27 ` [patch V2 14/17] entry: Provide generic exit to usermode functionality Thomas Gleixner
2019-10-23 21:34   ` Andy Lutomirski
2019-10-23 23:20     ` Thomas Gleixner
2019-10-23 12:27 ` [patch V2 15/17] x86/entry: Use generic exit to usermode Thomas Gleixner
2019-10-23 12:27 ` [patch V2 16/17] kvm/workpending: Provide infrastructure for work before entering a guest Thomas Gleixner
2019-10-23 14:55   ` Sean Christopherson
2019-10-23 12:27 ` [patch V2 17/17] x86/kvm: Use generic exit to guest work function Thomas Gleixner
2019-10-23 14:48   ` Sean Christopherson
2019-10-23 14:37 ` [patch V2 00/17] entry: Provide generic implementation for host and guest entry/exit work Peter Zijlstra
2019-10-23 21:20 ` Josh Poimboeuf
2019-10-29 11:28 ` Will Deacon

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=alpine.DEB.2.21.1910242353330.1783@nanos.tec.linutronix.de \
    --to=tglx@linutronix.de \
    --cc=jpoimboe@redhat.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-arch@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=luto@kernel.org \
    --cc=mbenes@suse.cz \
    --cc=pbonzini@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rppt@linux.ibm.com \
    --cc=will@kernel.org \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.