Athlon Prefetch workaround for 2.6.0test6
diff mbox series

Message ID 20030929125629.GA1746@averell
State New, archived
Series
  • Athlon Prefetch workaround for 2.6.0test6

Commit Message

Andi Kleen Sept. 29, 2003, 12:56 p.m. UTC
Here is a new version of the Athlon/Opteron prefetch issue workaround
for 2.6.0test6.  The issue was hit regularly by the 2.6 kernel
while doing prefetches on NULL-terminated hlists.

These CPUs sometimes generate an illegal exception for prefetch
instructions. The operating system can work around this by checking
whether the faulting instruction is a prefetch and ignoring the fault
when it is.
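The check described above can be modeled in plain user-space C. The sketch below is illustrative only (it is not the kernel code, and `looks_like_prefetch` is a made-up name): skip the legal single-byte x86 prefixes, then test for the two-byte prefetch opcodes 0x0F 0x0D (3DNow!) and 0x0F 0x18 (SSE):

```c
#include <stddef.h>

/* Illustrative user-space model of the decoder idea: scan up to 15
 * bytes (the x86 instruction length limit), skip legal prefixes, and
 * report whether the instruction is a prefetch (0x0F 0x0D or 0x0F 0x18). */
static int looks_like_prefetch(const unsigned char *instr, size_t len)
{
	int scan_more = 1, prefetch = 0;
	size_t i = 0;

	while (scan_more && i < 15 && i < len) {
		unsigned char opcode = instr[i++];
		unsigned char hi = opcode & 0xf0, lo = opcode & 0x0f;

		switch (hi) {
		case 0x20: case 0x30:
			/* 0x26, 0x2E, 0x36, 0x3E: segment-override prefixes */
			scan_more = ((lo & 7) == 0x6);
			break;
		case 0x60:
			/* 0x64..0x67: FS/GS overrides, operand/address size */
			scan_more = (lo & 0xC) == 0x4;
			break;
		case 0xF0:
			/* 0xF0, 0xF2, 0xF3: lock/rep prefixes */
			scan_more = !lo || (lo >> 1) == 1;
			break;
		case 0x00:
			/* possible two-byte opcode starting with 0x0F */
			scan_more = 0;
			if (i < len)
				prefetch = (lo == 0xF) &&
					   (instr[i] == 0x0D || instr[i] == 0x18);
			break;
		default:
			scan_more = 0;	/* a real instruction; not a prefetch */
			break;
		}
	}
	return prefetch;
}
```

For example, the byte sequence 0x66 0x0F 0x18 (an SSE prefetch with an operand-size prefix) is accepted, while an ordinary mov opcode such as 0x89 is rejected on the first byte.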

The code is structured carefully to ensure that the page fault
will never recurse more than once. Unmapped EIPs are also special-cased
to give friendlier oopses when the kernel jumps to unmapped addresses.

It removes the previous dumb in-kernel workaround for this and shrinks
the kernel by >10k.

A small behaviour change: a SIGBUS fault for a *_user access will now
cause an EFAULT instead of a SIGBUS.

This version addresses all criticism that I got for previous versions.

- Only checks on AMD K7+ CPUs.
- Computes the linear address for VM86 mode or code segments
  with a non-zero base.
- Some cleanup
- No pointer comparisons
- More comments
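The linear address mentioned in the second point is formed differently in the two cases. A minimal user-space model (illustrative only; the function names are made up, and this is not the kernel code):

```c
#include <stdint.h>

/* Illustrative model of linear-address formation.  In VM86 mode the CS
 * value is a real-mode segment, so the linear address is (cs << 4) plus
 * the offset; for a protected-mode code segment with a non-zero
 * descriptor base it is base plus offset, in 32-bit arithmetic. */
static uint32_t vm86_linear(uint16_t cs, uint32_t eip)
{
	return ((uint32_t)cs << 4) + eip;
}

static uint32_t pm_linear(uint32_t seg_base, uint32_t eip)
{
	return seg_base + eip;
}
```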

Please consider applying,

-Andi


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Comments

Jamie Lokier Sept. 29, 2003, 5:03 p.m. UTC | #1
Andi Kleen wrote:
> +/* 
> + * Find a segment base in the LDT/GDT.
> + * Don't need to do any boundary checking because the CPU did that already 
> + * when the instruction was executed 
> + */

I'm not 100% sure, but I see some possibly dangerous race conditions.

  1. Userspace thread A has a page fault at CS:EIP == 0x000f:0xbffff000.

     Simultaneously, userspace thread B calls modify_ldt() to change
     the LDT descriptor base to 0x40000000.

     The LDT descriptor is changed after A enters the page fault
     handler, but before it takes mmap_sem.  This can happen on SMP or
     on a pre-empt kernel.

     __is_prefetch() calls __get_user(0xfffff000), or whatever
     address userspace wants to trick it into reading.

>      Result: unprivileged userspace causes an MMIO read and crashes
     the system, or worse.

  2. segment_base sets "desc = (u32 *)&cpu_gdt_table[smp_processor_id()]".

     Pre-empt switches CPU.

     desc now points to the _old_ CPU, where the GDT for this task is
     no longer valid, and that is read in the next few lines.

I think for completeness you should check the segment type and limit
too (because they can be changed, see 1.).  These are easier to check
than they look (w.r.t. 16 bit segments etc.): you don't have to decode
the descriptor; just use the "lar" and "lsl" instructions.

If you can't decode the instruction because of transient access
problems, just return as if it was a prefetch, so that the faulting
instruction will be retried, or access permissions will cause the
program to trap in the normal ways.

-- Jamie
Andi Kleen Sept. 29, 2003, 5:49 p.m. UTC | #2
> I'm not 100% sure, but I see some possibly dangerous race conditions.

Thanks for the review.
> 
>   1. Userspace thread A has a page fault at CS:EIP == 0x000f:0xbffff000.
> 
>      Simultaneously, userspace thread B calls modify_ldt() to change
>      the LDT descriptor base to 0x40000000.
> 
>      The LDT descriptor is changed after A enters the page fault
>      handler, but before it takes mmap_sem.  This can happen on SMP or
>      on a pre-empt kernel.
> 
>      __is_prefetch() calls __get_user(0xfffff000), or whatever
>      address userspace wants to trick it into reading.
> 
>      Result: unprivileged userspace causes an MMIO read and crashes
>      the system, or worse.


Ok, I added access_ok() checks when the fault came from ring 3 code.

That should catch everything.

I guess they should be added to the AMD64 version too. It ignores
all bases, but I'm not sure if the CPU catches the case where the linear
address computation wraps.

> 
>   2. segment_base sets "desc = (u32 *)&cpu_gdt_table[smp_processor_id()]".
> 
>      Pre-empt switches CPU.

True. Fixed.

> 
>      desc now points to the _old_ CPU, where the GDT for this task is
>      no longer valid, and that is read in the next few lines.
> 
> I think for completeness you should check the segment type and limit
> too (because they can be changed, see 1.).  These are easier to check
> than they look (w.r.t. 16 bit segments etc.): you don't have to decode
> the descriptor; just use the "lar" and "lsl" instructions.

I don't want to do that; the code is already too complicated, and I
don't plan to reimplement an x86 here (just dealing with this
segmentation horror is bad enough).

If it gets any more complicated I would be inclined to just handle the
in-kernel prefetches using __ex_table entries and give up on user space.

The x86-64 version just ignores all bases; that should be fine too.
Anybody who uses non-zero code segments likely doesn't care about
performance and won't use prefetch ;-)

-Andi

New patch with the fixes attached:

Linus, can you consider merging it?

------------------

Here is a new version of the Athlon/Opteron prefetch issue workaround
for 2.6.0test6.  The issue was hit regularly by the 2.6 kernel
while doing prefetches on NULL-terminated hlists.

These CPUs sometimes generate an illegal exception for prefetch
instructions. The operating system can work around this by checking
whether the faulting instruction is a prefetch and ignoring the fault
when it is.

The code is structured carefully to ensure that the page fault
will never recurse more than once. Unmapped EIPs are also special-cased
to give friendlier oopses when the kernel jumps to unmapped addresses.

It removes the previous dumb in-kernel workaround for this and shrinks
the kernel by >10k.

A small behaviour change: a SIGBUS fault for a *_user access will now
cause an EFAULT instead of a SIGBUS.


diff -u linux-2.6.0test6-work/include/asm-i386/processor.h-PRE linux-2.6.0test6-work/include/asm-i386/processor.h
--- linux-2.6.0test6-work/include/asm-i386/processor.h-PRE	2003-09-11 04:12:39.000000000 +0200
+++ linux-2.6.0test6-work/include/asm-i386/processor.h	2003-09-28 10:52:55.000000000 +0200
@@ -578,8 +589,6 @@
 #define ARCH_HAS_PREFETCH
 extern inline void prefetch(const void *x)
 {
-	if (cpu_data[0].x86_vendor == X86_VENDOR_AMD)
-		return;		/* Some athlons fault if the address is bad */
 	alternative_input(ASM_NOP4,
 			  "prefetchnta (%1)",
 			  X86_FEATURE_XMM,
diff -u linux-2.6.0test6-work/arch/i386/mm/fault.c-PRE linux-2.6.0test6-work/arch/i386/mm/fault.c
--- linux-2.6.0test6-work/arch/i386/mm/fault.c-PRE	2003-05-27 03:00:20.000000000 +0200
+++ linux-2.6.0test6-work/arch/i386/mm/fault.c	2003-09-29 19:42:44.000000000 +0200
@@ -55,6 +55,108 @@
 	console_loglevel = loglevel_save;
 }
 
+/* 
+ * Find a segment base in the LDT/GDT.
+ * Don't need to do any boundary checking because the CPU did that already 
+ * when the instruction was executed 
+ */
+static unsigned long segment_base(unsigned seg) 
+{ 
+	u32 *desc;
+	/* 
+	 * No need to use get/put_cpu here because when we switch CPUs the
+	 * segment base is always switched too.
+	 */
+	if (seg & (1<<2))
+		desc = current->mm->context.ldt;
+	else
+		desc = (u32 *)&cpu_gdt_table[smp_processor_id()];
+	desc = (void *)desc + (seg & ~7); 	
+	return  (desc[0] >> 16) | 
+	       ((desc[1] & 0xFF) << 16) | 
+	        (desc[1] & 0xFF000000);
+}  
+
+/* 
+ * Sometimes AMD Athlon/Opteron CPUs report invalid exceptions on prefetch.
+ * Check that here and ignore it.
+ */
+static int __is_prefetch(struct pt_regs *regs, unsigned long addr)
+{ 
+	unsigned char *instr = (unsigned char *)(regs->eip);
+	int scan_more = 1;
+	int prefetch = 0; 
+	int i;
+
+	/* 
+	 * Avoid recursive faults. This catches the kernel jumping to nirvana.
+	 * More complicated races with unmapped EIP are handled elsewhere for 
+	 * user space.
+	 */
+	if (regs->eip == addr)
+		return 0; 
+
+	if (unlikely(regs->eflags & VM_MASK))
+		addr += regs->xcs << 4; 
+	else if (unlikely(regs->xcs != __USER_CS && regs->xcs != __KERNEL_CS))
+		addr += segment_base(regs->xcs);
+
+	for (i = 0; scan_more && i < 15; i++) { 
+		unsigned char opcode;
+		unsigned char instr_hi;
+		unsigned char instr_lo;
+
+		if ((regs->xcs & 3) && !access_ok(VERIFY_READ, instr, 1))
+			break;
+		if (__get_user(opcode, instr))
+			break; 
+
+		instr_hi = opcode & 0xf0; 
+		instr_lo = opcode & 0x0f; 
+		instr++;
+
+		switch (instr_hi) { 
+		case 0x20:
+		case 0x30:
+			/* Values 0x26,0x2E,0x36,0x3E are valid x86 prefixes. */
+			scan_more = ((instr_lo & 7) == 0x6);
+			break;
+			
+		case 0x60:
+			/* 0x64 thru 0x67 are valid prefixes in all modes. */
+			scan_more = (instr_lo & 0xC) == 0x4;
+			break;		
+		case 0xF0:
+			/* 0xF0, 0xF2, and 0xF3 are valid prefixes in all modes. */
+			scan_more = !instr_lo || (instr_lo>>1) == 1;
+			break;			
+		case 0x00:
+			/* Prefetch instruction is 0x0F0D or 0x0F18 */
+			scan_more = 0;
+			if ((regs->xcs & 3) && !access_ok(VERIFY_READ, instr, 1))
+				break;
+			if (__get_user(opcode, instr)) 
+				break;
+			prefetch = (instr_lo == 0xF) &&
+				(opcode == 0x0D || opcode == 0x18);
+			break;			
+		default:
+			scan_more = 0;
+			break;
+		} 
+	}
+
+	return prefetch;
+}
+
+static inline int is_prefetch(struct pt_regs *regs, unsigned long addr)
+{ 
+	if (likely(boot_cpu_data.x86_vendor != X86_VENDOR_AMD ||
+		   boot_cpu_data.x86 < 6))
+		return 0;
+	return __is_prefetch(regs, addr); 
+} 
+
 asmlinkage void do_invalid_op(struct pt_regs *, unsigned long);
 
 /*
@@ -110,7 +212,7 @@
 	 * atomic region then we must not take the fault..
 	 */
 	if (in_atomic() || !mm)
-		goto no_context;
+		goto bad_area_nosemaphore;
 
 	down_read(&mm->mmap_sem);
 
@@ -198,8 +300,16 @@
 bad_area:
 	up_read(&mm->mmap_sem);
 
+bad_area_nosemaphore:
 	/* User mode accesses just cause a SIGSEGV */
 	if (error_code & 4) {
+		/* 
+		 * Valid to do another page fault here because this one came 
+		 * from user space.
+		 */
+		if (is_prefetch(regs, address))
+			return;
+
 		tsk->thread.cr2 = address;
 		tsk->thread.error_code = error_code;
 		tsk->thread.trap_no = 14;
@@ -232,6 +342,14 @@
 	if (fixup_exception(regs))
 		return;
 
+	/* 
+	 * Valid to do another page fault here, because if this fault
+	 * had been triggered by is_prefetch fixup_exception would have 
+	 * handled it.
+	 */
+ 	if (is_prefetch(regs, address))
+ 		return;
+
 /*
  * Oops. The kernel tried to access some bad page. We'll have to
  * terminate things with extreme prejudice.
@@ -286,10 +404,14 @@
 do_sigbus:
 	up_read(&mm->mmap_sem);
 
-	/*
-	 * Send a sigbus, regardless of whether we were in kernel
-	 * or user mode.
-	 */
+	/* Kernel mode? Handle exceptions or die */
+	if (!(error_code & 4))
+		goto no_context;
+
+	/* User space => ok to do another page fault */
+	if (is_prefetch(regs, address))
+		return;
+
 	tsk->thread.cr2 = address;
 	tsk->thread.error_code = error_code;
 	tsk->thread.trap_no = 14;
@@ -298,10 +420,6 @@
 	info.si_code = BUS_ADRERR;
 	info.si_addr = (void *)address;
 	force_sig_info(SIGBUS, &info, tsk);
-
-	/* Kernel mode? Handle exceptions or die */
-	if (!(error_code & 4))
-		goto no_context;
 	return;
 
 vmalloc_fault:



Jamie Lokier Sept. 29, 2003, 8:08 p.m. UTC | #3
Andi Kleen wrote:
> Ok, I added access_ok() checks when the fault came from ring 3 code.

That looks fine.

> >   2. segment_base sets "desc = (u32 *)&cpu_gdt_table[smp_processor_id()]".
> > 
> >      Pre-empt switches CPU.
> 
> True. Fixed.

Where is this fixed?

> > I think for completeness you should check the segment type and limit
> > too (because they can be changed, see 1.).  These are easier to check
> > than they look (w.r.t. 16 bit segments etc.): you don't have to decode
> > the descriptor; just use the "lar" and "lsl" instructions.
> 
> I don't want to do that, the code is already too complicated and I don't
> plan to reimplement an x86 here (just dealing with this segmentation
> horror is bad enough) 

I sympathise with that aversion :)

My concern is that, by being complicated enough to decode segments
when CS is not equal to either of those, you risk opening a loophole
where userspace can trick, if not the kernel (with the access_ok
checks), then surrounding userspace virtualisation due to this code in
the kernel.

My thinking: Segments can be used by x86 virtualising programs which
use a segment to move a "window" of address range accessible to the
virtualised code.  It should _never_ be possible for the virtualised
code, which may be malicious, to trigger reads outside the protected
address range by various combinations of threads and triggering LDT
modifications in the virtualiser.

Btw, you assume that regs->xcs is a valid segment value.  I think that
the upper 16 bits are not guaranteed to be zero in general on the
IA32, although they clearly are zero for the majority of IA-32 chips.
Are they guaranteed to be zero on AMD's processors?

> If it gets any more complicated I would be inclined to just
> handle the in kernel prefetches using __ex_table entries and give up
> on user space.

__ex_table entries would have been a less controversial fix all along :)

Perhaps it is better to just not decode at all when CS != __KERNEL_CS
&& CS != __USER_CS?  Just fault.

Then you can drop the segment code, which you don't like anyway.
Anything as sophisticated as x86 virtualisation in userspace can fixup
spurious faults itself.  (That isn't perfect, but it's imperfect in a
safe way).

If you want to keep the segment base code, then I'll work out a patch
on top of yours to do the right access checks.  It is quite small, I
nearly have it already.

> The x86-64 version just ignores all bases, that should be fine
> too. Anybody who uses non zero code segments likely doesn't care about
> performance and won't use prefetch ;-)

Does the x86-64 version ignore bases in 32 bit programs?

-- Jamie
Bill Davidsen Sept. 29, 2003, 9:02 p.m. UTC | #4
In article <20030929125629.GA1746@averell>, Andi Kleen  <ak@muc.de> wrote:

| It removes the previous dumb in kernel workaround for this and shrinks the 
| kernel by >10k.
| 
| Small behaviour change is that a SIGBUS fault for a *_user access will
| cause an EFAULT now, no SIGBUS.
| 
| This version addresses all criticism that I got for previous versions.
| 
| - Only checks on AMD K7+ CPUs. 
| - Computes linear address for VM86 mode or code segments
| with non zero base.
| - Some cleanup
| - No pointer comparisons
| - More comments

I have to try this on a P4 and K7, but WRT "Only checks on AMD K7+ CPUs"
I hope you meant "only generates code if AMD CPU is target" and not that
the code size penalty is still there for CPUs which don't need it.

Will check Wednesday, life is very busy right now.
Jamie Lokier Sept. 29, 2003, 10:13 p.m. UTC | #5
Andi Kleen wrote:
> I guess they should be added to the AMD64 version too. It ignores
> all bases, but I'm not sure if the CPU catches the case where the linear
> address computation wraps.

The linear address is _allowed_ to wrap on x86, and there are real
DOS-era programs which take advantage of this.  It is used to create
the illusion of access to high addresses, by making them wrap to low
ones which can be mapped.

Some DOS extenders did this so that 32-bit programs could access BIOS
and video memory in the same DS segment as normal code, despite having
a large segment base address so that null pointers would be trapped by
page protection.

Of course no linux programs do this :)
(Well, maybe some do under DOSEMU?)

So it seems quite likely that the AMD64 CPU simply allows wrapping in
the linear address calculation.

-- Jamie
Jamie Lokier Sept. 30, 2003, 12:19 a.m. UTC | #6
Andi Kleen wrote:
> +	/* 
> +	 * Avoid recursive faults. This catches the kernel jumping to nirvana.
> +	 * More complicated races with unmapped EIP are handled elsewhere for 
> +	 * user space.
> +	 */
> +	if (regs->eip == addr)
> +		return 0; 

I'm curious - how does this help?

If the kernel jumps into nirvana, it will page fault.  is_prefetch()
will do a __get_user which will fault - recursion, ok.  The inner
fault handler will reach fixup_exception(), fixup the __get_user, and
return without recursing further.  is_prefetch() will simply return.

So how does the above test help?

> +	if (seg & (1<<2))
> +		desc = current->mm->context.ldt;
> +	else
> +		desc = (u32 *)&cpu_gdt_table[smp_processor_id()];
> +	desc = (void *)desc + (seg & ~7); 	
> +	return  (desc[0] >> 16) | 
> +	       ((desc[1] & 0xFF) << 16) | 
> +	        (desc[1] & 0xFF000000);

In addition to needing get_cpu() to protect the GDT access, this code
needs to take down(&current->mm->context.sem) for the LDT access.

Otherwise, context.ldt may have been vfree()'d by the time you use it,
and the desc[0..1] accesses will panic.

Thanks,
-- Jamie

Nick Piggin Sept. 30, 2003, 12:50 a.m. UTC | #7
bill davidsen wrote:

>In article <20030929125629.GA1746@averell>, Andi Kleen  <ak@muc.de> wrote:
>
>| It removes the previous dumb in kernel workaround for this and shrinks the 
>| kernel by >10k.
>| 
>| Small behaviour change is that a SIGBUS fault for a *_user access will
>| cause an EFAULT now, no SIGBUS.
>| 
>| This version addresses all criticism that I got for previous versions.
>| 
>| - Only checks on AMD K7+ CPUs. 
>| - Computes linear address for VM86 mode or code segments
>| with non zero base.
>| - Some cleanup
>| - No pointer comparisons
>| - More comments
>
>I have to try this on a P4 and K7, but WRT "Only checks on AMD K7+ CPUs"
>I hope you meant "only generates code if AMD CPU is target" and not that
>the code size penalty is still there for CPUs which don't need it.
>
>Will check Wednesday, life is very busy right now.
>

No, the code is not conditionally compiled. That is a different issue
from this patch, though. The target CPU selection scheme doesn't work at
all like you would expect, and it's impossible to compile this sort of
code out (when on the x86 arch). See Adrian's code to rationalise cpu
selection.

Andi Kleen Sept. 30, 2003, 5:38 a.m. UTC | #8
On Mon, Sep 29, 2003 at 11:13:46PM +0100, Jamie Lokier wrote:
> Andi Kleen wrote:
> > I guess they should be added to the AMD64 version too. It ignores
> > all bases, but I'm not sure if the CPU catches the case where the linear
> > address computation wraps.
> 
> The linear address is _allowed_ to wrap on x86, and there are real
> DOS-era programs which take advantage of this.  It is used to create
> the illusion of access to high addresses, by making them wrap to low
> ones which can be mapped.

Only when the A20 gate is set (?), which it never is in Linux.

-Andi
Andi Kleen Sept. 30, 2003, 5:50 a.m. UTC | #9
On Mon, Sep 29, 2003 at 09:08:20PM +0100, Jamie Lokier wrote:
> My thinking: Segments can be used by x86 virtualising programs which
> use a segment to move a "window" of address range accessible to the
> virtualised code.  It should _never_ be possible for the virtualised
> code, which may be malicious, to trigger reads outside the protected
> address range by various combinations of threads and triggering LDT
> modifications in the virtualiser.

Agreed. But access_ok() takes care of that. It does not guarantee
that the right instruction is checked, but whoever uses non-zero
code segment bases can add their own checker or, more likely,
not use prefetch at all (it is likely ancient legacy code anyways).

> Btw, you assume that regs->xcs is a valid segment value.  I think that
> the upper 16 bits are not guaranteed to be zero in general on the
> IA32, although they clearly are zero for the majority of IA-32 chips.
> Are they guaranteed to be zero on AMD's processors?

That would be new to me. Can you quote a line that says that from
the architecture manual?

Also remember the code only runs on AMD.

> > If it gets any more complicated I would be inclined to just
> > handle the in kernel prefetches using __ex_table entries and give up
> > on user space.
> 
> __ex_table entries would have been a less controversial fix all along :)
> 
> Perhaps it is better to just not decode at all when CS != __KERNEL_CS
> && CS != __USER_CS?  Just fault.

Yep, I considered that too.

I think I will do that for x86-64 and 32bit can handle it all in 
user space as I suspect the patch is unmergeable anyways.

> > The x86-64 version just ignores all bases, that should be fine
> > too. Anybody who uses non zero code segments likely doesn't care about
> > performance and won't use prefetch ;-)
> 
> Does the x86-64 version ignore bases in 32 bit programs?

The CPU doesn't, the prefetch fix does.

I will add a check for __USER32_CS/__USER_CS and __KERNEL_CS to make
sure it will never wrap.

I'm dropping the original segment checking/code checking proposal for 32bit. 
It obviously was misguided and Linus was right from the beginning 
to reject it.

-Andi
Gabriel Paubert Sept. 30, 2003, 9:35 a.m. UTC | #10
On Mon, Sep 29, 2003 at 09:08:20PM +0100, Jamie Lokier wrote:
> Btw, you assume that regs->xcs is a valid segment value.  I think that
> the upper 16 bits are not guaranteed to be zero in general on the
> IA32, although they clearly are zero for the majority of IA-32 chips.
> Are they guaranteed to be zero on AMD's processors?

At least for pushes of segment registers a 486 decrements
the stack pointer by 4 but only writes the 2 least significant
bytes, leaving garbage in the upper half. 

That's documented in the list of differences between 486 and Pentium.
It may be different for the frame pushed on the stack during interrupt
entry; my take on it is that it is inherently dangerous to rely
on the upper 16 bits being zero.

Just my 2 cents,

	Regards,
	Gabriel
Dave Jones Sept. 30, 2003, 1:27 p.m. UTC | #11
On Mon, Sep 29, 2003 at 09:02:39PM +0000, bill davidsen wrote:
 > In article <20030929125629.GA1746@averell>, Andi Kleen  <ak@muc.de> wrote:
 > 
 > | It removes the previous dumb in kernel workaround for this and shrinks the 
 > | kernel by >10k.
 > | 
 > | Small behaviour change is that a SIGBUS fault for a *_user access will
 > | cause an EFAULT now, no SIGBUS.
 > | 
 > | This version addresses all criticism that I got for previous versions.
 > | 
 > | - Only checks on AMD K7+ CPUs. 
 > | - Computes linear address for VM86 mode or code segments
 > | with non zero base.
 > | - Some cleanup
 > | - No pointer comparisons
 > | - More comments
 > 
 > I have to try this on a P4 and K7, but WRT "Only checks on AMD K7+ CPUs"
 > I hope you meant "only generates code if AMD CPU is target" and not that
 > the code size penalty is still there for CPUs which don't need it.

NO NO NO NO NO.
This *has* to be there on a P4 kernel too, as we can now boot those on a K7 too.
The 'code size penalty' you talk about is in the region of a few hundred
bytes. Much less than a page. There are far more obvious bloat candidates.

		Dave
Bill Davidsen Sept. 30, 2003, 3:36 p.m. UTC | #12
On Tue, 30 Sep 2003, Dave Jones wrote:

> On Mon, Sep 29, 2003 at 09:02:39PM +0000, bill davidsen wrote:
>  > In article <20030929125629.GA1746@averell>, Andi Kleen  <ak@muc.de> wrote:

>  > I have to try this on a P4 and K7, but WRT "Only checks on AMD K7+ CPUs"
>  > I hope you meant "only generates code if AMD CPU is target" and not that
>  > the code size penalty is still there for CPUs which don't need it.
> 
> NO NO NO NO NO.
> This *has* to be there on a P4 kernel too, as we can now boot those on a K7 too.
> The 'code size penalty' you talk about is in the region of a few hundred
> bytes. Much less than a page. There are far more obvious bloat candidates.

Clearly you don't do embedded or small applications, and take the attitude
that people should try to find hundreds of bytes elsewhere to compensate
for the bytes you want to force into the kernel to support a bugfix for
one vendor.

You can build a 386 kernel which will run fine on K7 if you are too
limited in resources to build a proper matching kernel for the target
system. The attitude that there is bloat in the kernel already so we can
just add more as long as it doesn't inconvenience you is totally at odds
with the long tradition of Linux being lean and mean.

Linus wouldn't let dump to disk in the kernel because it wasn't generally
useful, I am amazed that he is letting 300 bytes of vendor bugfix which is
totally without value to any other system go in without ifdef's to limit
the memory penalty to the relevant CPUs.

Patch
diff mbox series

diff -u linux-2.6.0test6-work/include/asm-i386/processor.h-PRE linux-2.6.0test6-work/include/asm-i386/processor.h
--- linux-2.6.0test6-work/include/asm-i386/processor.h-PRE	2003-09-11 04:12:39.000000000 +0200
+++ linux-2.6.0test6-work/include/asm-i386/processor.h	2003-09-28 10:52:55.000000000 +0200
@@ -578,8 +589,6 @@ 
 #define ARCH_HAS_PREFETCH
 extern inline void prefetch(const void *x)
 {
-	if (cpu_data[0].x86_vendor == X86_VENDOR_AMD)
-		return;		/* Some athlons fault if the address is bad */
 	alternative_input(ASM_NOP4,
 			  "prefetchnta (%1)",
 			  X86_FEATURE_XMM,
diff -u linux-2.6.0test6-work/arch/i386/mm/fault.c-PRE linux-2.6.0test6-work/arch/i386/mm/fault.c
--- linux-2.6.0test6-work/arch/i386/mm/fault.c-PRE	2003-05-27 03:00:20.000000000 +0200
+++ linux-2.6.0test6-work/arch/i386/mm/fault.c	2003-09-29 14:48:44.000000000 +0200
@@ -55,6 +55,104 @@ 
 	console_loglevel = loglevel_save;
 }
 
+/* 
+ * Find a segment base in the LDT/GDT.
+ * Don't need to do any boundary checking because the CPU did that already 
+ * when the instruction was executed 
+ */
+static unsigned long segment_base(unsigned seg) 
+{ 
+	u32 *desc;
+	/* 
+	 * No need to use get/put_cpu here because when we switch CPUs the
+	 * segment base is always switched too.
+	 */
+	if (seg & (1<<2))
+		desc = current->mm->context.ldt;
+	else
+		desc = (u32 *)&cpu_gdt_table[smp_processor_id()];
+	desc = (void *)desc + (seg & ~7); 	
+	return  (desc[0] >> 16) | 
+	       ((desc[1] & 0xFF) << 16) | 
+	        (desc[1] & 0xFF000000);
+}  
+
+/* 
+ * Sometimes AMD Athlon/Opteron CPUs report invalid exceptions on prefetch.
+ * Check that here and ignore it.
+ */
+static int __is_prefetch(struct pt_regs *regs, unsigned long addr)
+{ 
+	unsigned char *instr = (unsigned char *)(regs->eip);
+	int scan_more = 1;
+	int prefetch = 0; 
+	int i;
+
+	/* 
+	 * Avoid recursive faults. This catches the kernel jumping to nirvana.
+	 * More complicated races with unmapped EIP are handled elsewhere for 
+	 * user space.
+	 */
+	if (regs->eip == addr)
+		return 0; 
+
+	if (unlikely(regs->eflags & VM_MASK))
+		addr += regs->xcs << 4; 
+	else if (unlikely(regs->xcs != __USER_CS && regs->xcs != __KERNEL_CS))
+		addr += segment_base(regs->xcs);
+
+	for (i = 0; scan_more && i < 15; i++) { 
+		unsigned char opcode;
+		unsigned char instr_hi;
+		unsigned char instr_lo;
+
+		if (__get_user(opcode, instr))
+			break; 
+
+		instr_hi = opcode & 0xf0; 
+		instr_lo = opcode & 0x0f; 
+		instr++;
+
+		switch (instr_hi) { 
+		case 0x20:
+		case 0x30:
+			/* Values 0x26,0x2E,0x36,0x3E are valid x86 prefixes. */
+			scan_more = ((instr_lo & 7) == 0x6);
+			break;
+			
+		case 0x60:
+			/* 0x64 thru 0x67 are valid prefixes in all modes. */
+			scan_more = (instr_lo & 0xC) == 0x4;
+			break;		
+		case 0xF0:
+			/* 0xF0, 0xF2, and 0xF3 are valid prefixes in all modes. */
+			scan_more = !instr_lo || (instr_lo>>1) == 1;
+			break;			
+		case 0x00:
+			/* Prefetch instruction is 0x0F0D or 0x0F18 */
+			scan_more = 0;
+			if (__get_user(opcode, instr)) 
+				break;
+			prefetch = (instr_lo == 0xF) &&
+				(opcode == 0x0D || opcode == 0x18);
+			break;			
+		default:
+			scan_more = 0;
+			break;
+		} 
+	}
+
+	return prefetch;
+}
+
+static inline int is_prefetch(struct pt_regs *regs, unsigned long addr)
+{ 
+	if (likely(boot_cpu_data.x86_vendor != X86_VENDOR_AMD ||
+		   boot_cpu_data.x86 < 6))
+		return 0;
+	return __is_prefetch(regs, addr); 
+} 
+
 asmlinkage void do_invalid_op(struct pt_regs *, unsigned long);
 
 /*
@@ -110,7 +208,7 @@ 
 	 * atomic region then we must not take the fault..
 	 */
 	if (in_atomic() || !mm)
-		goto no_context;
+		goto bad_area_nosemaphore;
 
 	down_read(&mm->mmap_sem);
 
@@ -198,8 +296,16 @@ 
 bad_area:
 	up_read(&mm->mmap_sem);
 
+bad_area_nosemaphore:
 	/* User mode accesses just cause a SIGSEGV */
 	if (error_code & 4) {
+		/* 
+		 * Valid to do another page fault here because this one came 
+		 * from user space.
+		 */
+		if (is_prefetch(regs, address))
+			return;
+
 		tsk->thread.cr2 = address;
 		tsk->thread.error_code = error_code;
 		tsk->thread.trap_no = 14;
@@ -232,6 +338,14 @@ 
 	if (fixup_exception(regs))
 		return;
 
+	/* 
+	 * Valid to do another page fault here, because if this fault
+	 * had been triggered by is_prefetch fixup_exception would have 
+	 * handled it.
+	 */
+ 	if (is_prefetch(regs, address))
+ 		return;
+
 /*
  * Oops. The kernel tried to access some bad page. We'll have to
  * terminate things with extreme prejudice.
@@ -286,10 +400,14 @@ 
 do_sigbus:
 	up_read(&mm->mmap_sem);
 
-	/*
-	 * Send a sigbus, regardless of whether we were in kernel
-	 * or user mode.
-	 */
+	/* Kernel mode? Handle exceptions or die */
+	if (!(error_code & 4))
+		goto no_context;
+
+	/* User space => ok to do another page fault */
+	if (is_prefetch(regs, address))
+		return;
+
 	tsk->thread.cr2 = address;
 	tsk->thread.error_code = error_code;
 	tsk->thread.trap_no = 14;
@@ -298,10 +416,6 @@ 
 	info.si_code = BUS_ADRERR;
 	info.si_addr = (void *)address;
 	force_sig_info(SIGBUS, &info, tsk);
-
-	/* Kernel mode? Handle exceptions or die */
-	if (!(error_code & 4))
-		goto no_context;
 	return;
 
 vmalloc_fault: