Re: [RFC][PATCHSET] VM_FAULT_RETRY fixes

From: Al Viro <viro@zeniv.linux.org.uk>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Xu <peterx@redhat.com>,
	linux-arch@vger.kernel.org, linux-alpha@vger.kernel.org,
	Richard Henderson <richard.henderson@linaro.org>
Subject: Re: [RFC][PATCHSET] VM_FAULT_RETRY fixes
Date: Thu, 2 Feb 2023 06:58:03 +0000	[thread overview]
Message-ID: <Y9te+4n4ajSF++Ex@ZenIV> (raw)
In-Reply-To: <CAHk-=wjiwFzEGd_60H3nbgVB=R_8KTcfUJmXy=hSXCvLrXQRFA@mail.gmail.com>

On Tue, Jan 31, 2023 at 01:19:59PM -0800, Linus Torvalds wrote:
> On Tue, Jan 31, 2023 at 1:10 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Umm...  What about the semantics of get_user() of unmapped address?
> > Some architectures do quiet EFAULT; some (including alpha) hit
> > the sucker with SIGBUS, no matter what.
> 
> I think we should strive to just make this all common.
> 
> The reason alpha is different is almost certainly not intentional, but
> a combination of "pure accident" and "nobody actually cares".

BTW, speaking of alpha page faults - maybe I'm misreading the manual,
but it seems to imply that interrupts are *not* disabled when entering
page fault handler:

Table 24–2 Entry Point Address Registers
Entry Point Value in a0             Value in a1   Value in a2         PS<IPL>
entArith    Exception summary       Register mask UNPREDICTABLE       Unchanged
entIF       Fault or trap type code UNPREDICTABLE UNPREDICTABLE       Unchanged
entInt      Interrupt type          Vector        Interrupt parameter Priority of interrupt
entMM       VA                      MMCSR         Cause               Unchanged
entSys      p0                      p1            p2                  Unchanged
entUna      VA                      Opcode        Src/Dst             Unchanged

So there's nothing to prevent an interrupt hitting just as we reach
entMM, with interrupt handler stepping on a vmalloc'ed area and
triggering another page fault.

If that is correct, this
                /* Synchronize this task's top level page-table
                   with the "reference" page table from init.  */
                long index = pgd_index(address);
                pgd_t *pgd, *pgd_k;

                pgd = current->active_mm->pgd + index;
                pgd_k = swapper_pg_dir + index;
                if (!pgd_present(*pgd) && pgd_present(*pgd_k)) {
                        pgd_val(*pgd) = pgd_val(*pgd_k);
                        return;
                }
                goto no_context;
is not just missing local_irq_save()/local_irq_restore() around that
fragment - if it finds pgd already present, it needs to check pte
before deciding to proceed to no_context.

Suppose we access vmalloc area and corresponding pgd is not
present in current->active_mm.  Just as we get to entMM,
an interrupt arrives and proceeds to access something
covered by the same pgd.  OK, current->active_mm is still
not present, we get another page fault and do_page_fault()
gets to the quoted code.  pgd is copied from the swapper_pg_dir,
do_page_fault() returns and we get back to the instruction in
interrupt handler that had triggered the second #PF.  This
time around it succeeds.  Once the interrupt handler completes
we are back to entMM.  Once *that* gets to do_page_fault()
we hit the quoted code again.  Only this time around pgd
*is* present and instead of returning we get to no_context.
And since it's been a normal access to vmalloc'ed memory,
there's nothing to be found in exception table.  Oops...

	AFAICS, one way to deal with that is to treat
(unlikely) pgd_present(*pgd) as "get to pte, return if
it looks legitimate, proceed to no_context if it isn't".

Other bugs in the same area:
	* we ought to compare address with VMALLOC_START,
not TASK_SIZE.
	* we ought to do that *before* checking for
kernel threads/pagefault_disable() being in effect.

Wait a minute - pgd_present() on alpha has become constant 1
since a73c948952cc "alpha: use pgtable-nopud instead of 4level-fixup"

So that thing had been completely broken for 3 years and nobody
had noticed.  And that's really completely broken - it stopped
copying top-level entries since that commit.

That's also not hard to fix, but...
	* CONFIG_ALPHA_LARGE_VMALLOC had been racy all along
	* it had very limited use (need for >8Gb of vmalloc
space)
	* it had stopped working in late 2019 and nobody cared.

How about removing that kludge?  Richard?