linux-kernel.vger.kernel.org archive mirror
* RE: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata patch
@ 2003-10-01  2:23 Nakajima, Jun
  2003-10-01  2:51 ` Jamie Lokier
  0 siblings, 1 reply; 45+ messages in thread
From: Nakajima, Jun @ 2003-10-01  2:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: jamie, davej, torvalds, linux-kernel, richard.brunner

Oh, I thought you already closed this issue and you were discussing a
different thing.

I agree. The kernel should fix up userspace for this CPU errata.

	Jun

> -----Original Message-----
> From: Andrew Morton [mailto:akpm@osdl.org]
> Sent: Tuesday, September 30, 2003 7:07 PM
> To: Nakajima, Jun
> Cc: jamie@shareable.org; davej@redhat.com; torvalds@osdl.org; linux-
> kernel@vger.kernel.org; richard.brunner@amd.com
> Subject: Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata
> patch
> 
> "Nakajima, Jun" <jun.nakajima@intel.com> wrote:
> >
> >  > I think we should fix up userspace.
> >  What do you mean by this? Patch user code at runtime (it's much more
> >  complex than it sounds) or remove prefetch instructions from userspace?
> 
> Detect when user code stumbles over this CPU errata and make it look like
> nothing happened.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata patch
  2003-10-01  2:23 [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata patch Nakajima, Jun
@ 2003-10-01  2:51 ` Jamie Lokier
  2003-10-01  3:14   ` Andrew Morton
  0 siblings, 1 reply; 45+ messages in thread
From: Jamie Lokier @ 2003-10-01  2:51 UTC (permalink / raw)
  To: Nakajima, Jun
  Cc: Andrew Morton, davej, torvalds, linux-kernel, richard.brunner

Nakajima, Jun wrote:
> Oh, I thought you already closed this issue and you were discussing a
> different thing.
> 
> I agree. The kernel should fix up userspace for this CPU errata.

Our question is whether kernels configured specifically for non-AMD
(e.g. Winchip or Elan kernels) must have the AMD fixup code (it is a
few hundred bytes), refuse to boot on AMD, ignore the problem, or work
but not fix up userspace.

-- Jamie

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata patch
  2003-10-01  2:51 ` Jamie Lokier
@ 2003-10-01  3:14   ` Andrew Morton
  0 siblings, 0 replies; 45+ messages in thread
From: Andrew Morton @ 2003-10-01  3:14 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: jun.nakajima, davej, torvalds, linux-kernel, richard.brunner

Jamie Lokier <jamie@shareable.org> wrote:
>
> Nakajima, Jun wrote:
> > Oh, I thought you already closed this issue and you were discussing a
> > different thing.
> > 
> > I agree. The kernel should fix up userspace for this CPU errata.
> 
> Our question is whether kernels configured specifically for non-AMD
> (e.g. Winchip or Elan kernels) must have the AMD fixup code (it is a
> few hundred bytes), refuse to boot on AMD, ignore the problem, or work
> but not fix up userspace.

Refusing to boot would be best I think.  Fixing it up anyway would be OK
too.

Ignoring the problem in kernel and/or userspace could be viewed as a
response to pilot error but as always it would be nice if, given a choice,
we can help the pilot.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata patch
  2003-10-01 14:51                     ` Andrew Morton
  2003-10-01 14:56                       ` Andi Kleen
@ 2003-10-01 16:18                       ` Jamie Lokier
  1 sibling, 0 replies; 45+ messages in thread
From: Jamie Lokier @ 2003-10-01 16:18 UTC (permalink / raw)
  To: Andrew Morton; +Cc: hugh, ak, linux-kernel

Andrew Morton wrote:
> I'm a bit confused as to what significance the faulting address has btw:
> kernel code can raise prefetch faults against addresses which are less
> than, and presumably greater than TASK_SIZE.

The address itself is not important.

What's important is that if a prefetch erratum fault happens while
mmap_sem or any spinlock is held, then do_page_fault must not try to
acquire mmap_sem.  That can deadlock or schedule; both are bad.

Checking the address is an easy way to ensure that: addr >= TASK_SIZE
implies that no vma lookup is needed, hence mmap_sem is not needed.

There is already an addr >= TASK_SIZE test, so this can be hooked in
without any penalty; in fact it will streamline the existing code a bit.

-- Jamie


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata patch
  2003-10-01 15:19                         ` Andrew Morton
@ 2003-10-01 15:24                           ` Andi Kleen
  0 siblings, 0 replies; 45+ messages in thread
From: Andi Kleen @ 2003-10-01 15:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andi Kleen, jamie, hugh, linux-kernel

> Why is it not sufficient to do something like:
> 
> 	if (!(error_code & 4) && is_prefetch(...))
> 		return;
> 
> near the start of do_page_fault()?

That could fault recursively when the __get_user in is_prefetch faults. 
You would have to check the exception table first too.
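
Sketch of that ordering for the kernel-mode bad-address path (2.6-era
helper names; illustrative only, not the posted patch):

	if (!(error_code & 4)) {	/* fault happened in kernel mode */
		const struct exception_table_entry *fixup;

		/* honour __ex_table before trying to decode, so the
		   __get_user inside is_prefetch() cannot recurse deeper */
		fixup = search_exception_tables(regs->eip);
		if (fixup) {
			regs->eip = fixup->fixup;
			return;
		}
		if (is_prefetch(regs, address))
			return;
	}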

-Andi

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata patch
  2003-10-01 14:56                       ` Andi Kleen
@ 2003-10-01 15:19                         ` Andrew Morton
  2003-10-01 15:24                           ` Andi Kleen
  0 siblings, 1 reply; 45+ messages in thread
From: Andrew Morton @ 2003-10-01 15:19 UTC (permalink / raw)
  To: Andi Kleen; +Cc: jamie, hugh, ak, linux-kernel

Andi Kleen <ak@suse.de> wrote:
>
> > I'm a bit confused as to what significance the faulting address has btw:
>  > kernel code can raise prefetch faults against addresses which are less
>  > than, and presumably greater than TASK_SIZE.
> 
>  Currently it can't - hlist either prefetches to zero or to a valid 
>  address and everybody else using prefetch should also use valid addresses.

mumble, mutter.

>  But it's conceivable that future kernels make more extensive use of
>  prefetch.

yup.

Why is it not sufficient to do something like:

	if (!(error_code & 4) && is_prefetch(...))
		return;

near the start of do_page_fault()?

kernel-mode faults aren't very common.  Most of them occur in
generic_file_read(), for reads into freshly-allocated memory (I think). 
And the frequency of these is limited by disk bandwidth...

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata patch
  2003-10-01 14:51                     ` Andrew Morton
@ 2003-10-01 14:56                       ` Andi Kleen
  2003-10-01 15:19                         ` Andrew Morton
  2003-10-01 16:18                       ` Jamie Lokier
  1 sibling, 1 reply; 45+ messages in thread
From: Andi Kleen @ 2003-10-01 14:56 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jamie Lokier, hugh, ak, linux-kernel

> I'm a bit confused as to what significance the faulting address has btw:
> kernel code can raise prefetch faults against addresses which are less
> than, and presumably greater than TASK_SIZE.

Currently it can't - hlist either prefetches to zero or to a valid 
address and everybody else using prefetch should also use valid addresses.

But it's conceivable that future kernels make more extensive use of
prefetch.

e.g. x86-64 hit it because it has prefetch in copy_{from/to}_user / csum_copy_*
and everybody can pass arbitrary addresses to that.

-Andi
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata patch
  2003-10-01  9:33                   ` Jamie Lokier
@ 2003-10-01 14:51                     ` Andrew Morton
  2003-10-01 14:56                       ` Andi Kleen
  2003-10-01 16:18                       ` Jamie Lokier
  0 siblings, 2 replies; 45+ messages in thread
From: Andrew Morton @ 2003-10-01 14:51 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: hugh, ak, linux-kernel

Jamie Lokier <jamie@shareable.org> wrote:
>
>  > What about the 4G+4G split?
> 
>  Whoever is looking after the 4G+4G patch can deal with it, presumably.
>  It'll be the same condition as 4G+4G uses to decide if it's an access
>  to userspace (needing a search of the vma list), or not.

It should "just work".   4G+4G is false advertising:

#ifdef CONFIG_X86_4G_VM_LAYOUT
#define __PAGE_OFFSET		(0x02000000)
#define TASK_SIZE		(0xff000000)
#else

I'm a bit confused as to what significance the faulting address has btw:
kernel code can raise prefetch faults against addresses which are less
than, and presumably greater than TASK_SIZE.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata patch
  2003-10-01  8:02                 ` Hugh Dickins
  2003-10-01  8:49                   ` Andi Kleen
@ 2003-10-01  9:33                   ` Jamie Lokier
  2003-10-01 14:51                     ` Andrew Morton
  1 sibling, 1 reply; 45+ messages in thread
From: Jamie Lokier @ 2003-10-01  9:33 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andi Kleen, Andrew Morton, linux-kernel

Hugh Dickins wrote:
> On Wed, 1 Oct 2003, Jamie Lokier wrote:
> > 
> > See recent message from me.  All you need is a check "address >=
> > TASK_SIZE", which is thread already at the start of do_page_fault.
> 
> What about the 4G+4G split?

Whoever is looking after the 4G+4G patch can deal with it, presumably.
It'll be the same condition as 4G+4G uses to decide if it's an access
to userspace (needing a search of the vma list), or not.

-- Jamie

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata patch
  2003-10-01  8:02                 ` Hugh Dickins
@ 2003-10-01  8:49                   ` Andi Kleen
  2003-10-01  9:33                   ` Jamie Lokier
  1 sibling, 0 replies; 45+ messages in thread
From: Andi Kleen @ 2003-10-01  8:49 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Jamie Lokier, Andi Kleen, Andrew Morton, linux-kernel

On Wed, Oct 01, 2003 at 09:02:11AM +0100, Hugh Dickins wrote:
> On Wed, 1 Oct 2003, Jamie Lokier wrote:
> > 
> > See recent message from me.  All you need is a check "address >=
> > TASK_SIZE", which is thread already at the start of do_page_fault.
> 
> What about the 4G+4G split?

Whoever runs 4G+4G split surely doesn't care about fast paths because
of the large overhead it adds everywhere ;-)

They can check it always.

-Andi


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata  patch
  2003-10-01  7:39             ` Andi Kleen
@ 2003-10-01  8:20               ` Jamie Lokier
  0 siblings, 0 replies; 45+ messages in thread
From: Jamie Lokier @ 2003-10-01  8:20 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andrew Morton, linux-kernel

Andi Kleen wrote:
> On Wed, Oct 01, 2003 at 08:20:11AM +0100, Jamie Lokier wrote:
> > I think the mmap_sem problems are fixed by an appropriate "address >=
> > TASK_SIZE" check at the beginning do_page_fault, which should jump
> 
> Assuming vsyscalls never contain prefetch. 

Fine as long as it doesn't need a vma.

> Imho that's the best way for 32bit too, non zero segment bases are
> just not worth caring about.

I could agree.  I was most concerned about the lack of limit check in
your last patch, allowing malicious code to trigger reads outside of
userspace x86 virtualisation jails which are built using segments.  An
obscure one, to be sure, but userspace assumptions being broken by kernel
surprises is not good.

Just checking the standard segments is quite safe :)

Btw, I have a version of the segment code for x86_64 if you would take it.

> I had the same idea earlier, but discarded it because it would make
> the code much more ugly. It's better to just keep that stuff out of
> the fast path, not optimize it to the last cycle.

Personally I think the code would be nicer, but opinions vary about my
coding style ;)

> > Fifth, the "if (regs->eip == addr)" check - is it helpful on 32-bit?
> 
> It avoids one fault recursion for the kernel jumping to zero.

You wrote before that it makes a prettier oops.  Does it?  AFAICT the
extra recursion is benign and doesn't change the oops.  Maybe I missed
something.

Thanks,
-- Jamie

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata patch
  2003-10-01  7:31               ` Jamie Lokier
  2003-10-01  7:41                 ` Andi Kleen
@ 2003-10-01  8:02                 ` Hugh Dickins
  2003-10-01  8:49                   ` Andi Kleen
  2003-10-01  9:33                   ` Jamie Lokier
  1 sibling, 2 replies; 45+ messages in thread
From: Hugh Dickins @ 2003-10-01  8:02 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Andi Kleen, Andrew Morton, linux-kernel

On Wed, 1 Oct 2003, Jamie Lokier wrote:
> 
> See recent message from me.  All you need is a check "address >=
> TASK_SIZE", which is thread already at the start of do_page_fault.

What about the 4G+4G split?

Hugh


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata patch
  2003-10-01  7:55             ` Jamie Lokier
@ 2003-10-01  8:00               ` Andi Kleen
  0 siblings, 0 replies; 45+ messages in thread
From: Andi Kleen @ 2003-10-01  8:00 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Andi Kleen, akpm, linux-kernel

On Wed, Oct 01, 2003 at 08:55:51AM +0100, Jamie Lokier wrote:
> Andi Kleen wrote:
> > Jamie Lokier <jamie@shareable.org> writes:
> > > It is easy enough to fix by making the fault handler not take
> > > mmap_sem if the fault's in the kernel address range.  (With apologies
> > > to the folk running kernel mode userspace...)
> > 
> > It won't work because kernel can cause user space faults
> > (think get_user). And handling these must be protected.
> 
> Are we mis-communicating?  By "fault in the kernel address range", I

Yep, we were. I read it as "the faulting instruction is in kernel range"
(aka you check the ring0 bit in the error_code), not checking cr2 >= TASK_SIZE.

-Andi

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata patch
  2003-10-01  7:24           ` Andi Kleen
@ 2003-10-01  7:55             ` Jamie Lokier
  2003-10-01  8:00               ` Andi Kleen
  0 siblings, 1 reply; 45+ messages in thread
From: Jamie Lokier @ 2003-10-01  7:55 UTC (permalink / raw)
  To: Andi Kleen; +Cc: akpm, linux-kernel

Andi Kleen wrote:
> Jamie Lokier <jamie@shareable.org> writes:
> > It is easy enough to fix by making the fault handler not take
> > mmap_sem if the fault's in the kernel address range.  (With apologies
> > to the folk running kernel mode userspace...)
> 
> It won't work because kernel can cause user space faults
> (think get_user). And handling these must be protected.

Are we mis-communicating?  By "fault in the kernel address range", I
mean that the faulting address is in that range, not where the
instruction is located.

get_user faults are obviously not in the kernel address range (except
when set_fs has been used, which is rare and not a problem here).

When __get_user faults in __is_prefetch, while decoding an instruction
in kernel space, it is good for do_page_fault to take the short-cut I
suggested, skip locks, and reach the exception table fixup.  The address
will turn out to be bad, __is_prefetch will report that it isn't a
prefetch, and the outer fault will carry on to bust_spinlocks(), oops etc.

When __get_user faults in __is_prefetch, while decoding a userspace
instruction, then it won't take that short-cut because the (inner)
faulting address won't be in the kernel address range.  The normal vma
lookup will take place, install a user page, and the userspace
instruction is decoded just fine.

What have I missed?

-- Jamie

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata patch
  2003-10-01  7:31               ` Jamie Lokier
@ 2003-10-01  7:41                 ` Andi Kleen
  2003-10-01  8:02                 ` Hugh Dickins
  1 sibling, 0 replies; 45+ messages in thread
From: Andi Kleen @ 2003-10-01  7:41 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Andi Kleen, Andrew Morton, linux-kernel

On Wed, Oct 01, 2003 at 08:31:32AM +0100, Jamie Lokier wrote:
> doesn't that fix the problem while also being an improvement in
> lots of other ways?  The only reason it wouldn't work is if the VMA
> list can contain regions >= TASK_SIZE, but I don't think that is done.

vsyscalls has at least one vma > TASK_SIZE for ptrace, but it is afaik not
included in the vma tree.

> search_exception_table doesn't take any locks for non-module entries,
> so it can fixup the __get_user in __is_prefetch when that occurs
> inside locked regions of the kernel.

Hmm. Good point. Yes, I agree the check could be removed now.

-Andi

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata  patch
  2003-10-01  7:20           ` Jamie Lokier
@ 2003-10-01  7:39             ` Andi Kleen
  2003-10-01  8:20               ` Jamie Lokier
  0 siblings, 1 reply; 45+ messages in thread
From: Andi Kleen @ 2003-10-01  7:39 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Andi Kleen, Andrew Morton, linux-kernel

On Wed, Oct 01, 2003 at 08:20:11AM +0100, Jamie Lokier wrote:
> I think the mmap_sem problems are fixed by an appropriate "address >=
> TASK_SIZE" check at the beginning do_page_fault, which should jump

Assuming vsyscalls never contain prefetch. 

> straight to bad_area_nosemaphore.  As there is already such a check,
> there's no penalty for rearranging the logic there, and it will in
> fact speed up kernel faults slightly by avoiding the mmap_sem and
> find_vma() which are redundant for kernel faults.

I guess that would work, agreed.

I will fix it this way for x86-64.

> 
> I have some ideas for speeding up __is_prefetch().  First, take the
> get_segment_eip() stuff from my patch - that should speed up your
> latest "more checking" slightly, because it replaces the access_ok()
> checks with something slightly tighter.

At least for x86-64 I just switched to checking the three possible
segments explicitly.

Imho that's the best way for 32bit too; non-zero segment bases are
just not worth caring about.
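
On 32bit that would look something like this (hypothetical sketch, not
the x86-64 hunk; only the two standard flat code segments are trusted,
anything else is given up on instead of decoded):

	unsigned long seg = regs->xcs & 0xffff;
	unsigned char *instr;

	/* only decode through the standard flat code segments; a
	   non-zero base means we just say "not a prefetch" and bail */
	if (seg != __USER_CS && seg != __KERNEL_CS)
		return 0;
	instr = (unsigned char *)regs->eip;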

> 
> Second, instead of masking and a switch statement, do test_bit on a
> 256-bit vector.  (I admit I'm not quite sure how fast variable
> test_bit is; this is assuming it is quite fast).  If it's zero, return
> 0 from __is_prefetch().  If it's one, it's either a prefix byte to
> skip or 0x0f, to check the next byte for a prefetch.  That'll probably
> make the code smaller too, because the vector is only 32 bytes,
> shorter than the logic in the switch().

I had the same idea earlier, but discarded it because it would make
the code much more ugly. It's better to just keep that stuff out of
the fast path, not optimize it to the last cycle.

Also I have already wasted too much time on this errata, so I won't
do further updates. Feel free to take up the ball.

> Fifth, the "if (regs->eip == addr)" check - is it helpful on 32-bit?

It avoids one fault recursion for the kernel jumping to zero.

-Andi

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata patch
  2003-10-01  7:06             ` Andi Kleen
@ 2003-10-01  7:31               ` Jamie Lokier
  2003-10-01  7:41                 ` Andi Kleen
  2003-10-01  8:02                 ` Hugh Dickins
  0 siblings, 2 replies; 45+ messages in thread
From: Jamie Lokier @ 2003-10-01  7:31 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andrew Morton, linux-kernel

Andi Kleen wrote:
> Hmm.   I guess that's possible, but will be somewhat intrusive in 
> do_page_fault()

See recent message from me.  All you need is a check "address >=
TASK_SIZE", which is thread already at the start of do_page_fault.  So
if you split that conditional like this, from:

	if (address >= TASK_SIZE && !(error_code & 5))
		goto vmalloc_fault;
	mm = tsk->mm;
	info.si_code = SEGV_MAPERR;
	if (in_atomic() || !mm)
		goto bad_area_nosemaphore;

to:

	if (address >= TASK_SIZE) {
		if (!(error_code & 5))
			goto vmalloc_fault;
		mm = tsk->mm;
		info.si_code = SEGV_MAPERR;
		goto bad_area_nosemaphore;
	}

doesn't that fix the problem while also being an improvement in
lots of other ways?  The only reason it wouldn't work is if the VMA
list can contain regions >= TASK_SIZE, but I don't think that is done.

> Also you have to be very careful to avoid recursive faults (EIP unmapped) 
> recursing further. In the original patch I did that for kernel mode by always
> checking the exception table first to catch the __get_user in __is_prefetch
> early.

I'm looking at "[PATCH] Athlon Prefetch workaround for 2.6.0test6" and
it appears to check the exception table first.  That's why I was
wondering why you have the regs->eip == addr check.

search_exception_table doesn't take any locks for non-module entries,
so it can fixup the __get_user in __is_prefetch when that occurs
inside locked regions of the kernel.

-- Jamie

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata patch
       [not found]         ` <20031001065705.GI1131@mail.shareable.org.suse.lists.linux.kernel>
  2003-10-01  7:15           ` Andi Kleen
@ 2003-10-01  7:24           ` Andi Kleen
  2003-10-01  7:55             ` Jamie Lokier
  1 sibling, 1 reply; 45+ messages in thread
From: Andi Kleen @ 2003-10-01  7:24 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: akpm, linux-kernel

Jamie Lokier <jamie@shareable.org> writes:

> It is easy enough to fix by making the fault handler not take
> mmap_sem if the fault's in the kernel address range.  (With apologies
> to the folk running kernel mode userspace...)

It won't work because kernel can cause user space faults
(think get_user). And handling these must be protected.

-Andi

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata  patch
  2003-10-01  6:47         ` Andi Kleen
  2003-10-01  7:00           ` Andrew Morton
@ 2003-10-01  7:20           ` Jamie Lokier
  2003-10-01  7:39             ` Andi Kleen
  1 sibling, 1 reply; 45+ messages in thread
From: Jamie Lokier @ 2003-10-01  7:20 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andrew Morton, linux-kernel

Andi Kleen wrote:
> Andrew Morton <akpm@osdl.org> writes:
> 
> > Looking at Andi's patch, it is also a dead box if the fault happens inside
> > down_write(mmap_sem).  That should be fixed, methinks.
> 
> The only way to fix all that would be to move the instruction checks early
> into the fast path.
> 
> [On a P4 the overhead is 3.7268 vs 3.6594 microseconds for a fault that 
> doesn't hit as measured by lmbench2's lat_sig. This was before the latest 
> changes which added more checking, so the overhead is probably bigger now]

Hi Andi!

I think the mmap_sem problems are fixed by an appropriate "address >=
TASK_SIZE" check at the beginning do_page_fault, which should jump
straight to bad_area_nosemaphore.  As there is already such a check,
there's no penalty for rearranging the logic there, and it will in
fact speed up kernel faults slightly by avoiding the mmap_sem and
find_vma() which are redundant for kernel faults.

I have some ideas for speeding up __is_prefetch().  First, take the
get_segment_eip() stuff from my patch - that should speed up your
latest "more checking" slightly, because it replaces the access_ok()
checks with something slightly tighter.

Second, instead of masking and a switch statement, do test_bit on a
256-bit vector.  (I admit I'm not quite sure how fast variable
test_bit is; this is assuming it is quite fast).  If it's zero, return
0 from __is_prefetch().  If it's one, it's either a prefix byte to
skip or 0x0f, to check the next byte for a prefetch.  That'll probably
make the code smaller too, because the vector is only 32 bytes,
shorter than the logic in the switch().
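
A sketch of that idea (hypothetical, nothing like this was posted;
check_second_byte() stands in for the existing 0x0d/0x18 test):

static unsigned long scan_more_map[256 / BITS_PER_LONG];

static void __init init_scan_more_map(void)
{
	static const unsigned char keep[] = {
		0x26, 0x2e, 0x36, 0x3e,		/* es/cs/ss/ds overrides */
		0x64, 0x65, 0x66, 0x67,		/* fs/gs, opsize, addrsize */
		0xf0, 0xf2, 0xf3,		/* lock, repne, rep */
		0x0f,				/* two-byte opcode escape */
	};
	int i;

	for (i = 0; i < sizeof(keep); i++)
		set_bit(keep[i], scan_more_map);
}

	/* ...and the per-byte test in __is_prefetch() collapses to: */
	if (!test_bit(opcode, scan_more_map))
		return 0;				/* can't be a prefetch */
	if (opcode == 0x0f)
		return check_second_byte(instr + 1);	/* 0x0d or 0x18? */
	instr++;					/* prefix: keep scanning */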

Third, once you have taken the "limit" variable idea from my patch,
clamp that at "instr + 14", and so remove the small amount of loop
setup and iteration overhead which is there to restrict the
instruction to 15 bytes.  Build it in to the other limit check.

Fourth, without a switch statement there's no need for a scan_more
variable.  Setting a boolean variable in one place to test it in
another is still surprisingly poor with GCC - it doesn't remove the
variable and turn the code into direct jumps as you'd think, much of
the time.  Just test the bit vector and if it's zero, return 0,
otherwise carry on to check if the byte is 0x0f, and if not, loop.

Fifth, the "if (regs->eip == addr)" check - is it helpful on 32-bit?

Thoughts for the day,
-- Jamie

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata patch
       [not found]         ` <20031001065705.GI1131@mail.shareable.org.suse.lists.linux.kernel>
@ 2003-10-01  7:15           ` Andi Kleen
  2003-10-01  7:24           ` Andi Kleen
  1 sibling, 0 replies; 45+ messages in thread
From: Andi Kleen @ 2003-10-01  7:15 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel

Jamie Lokier <jamie@shareable.org> writes:

> 
> Btw, does anyone know if the "prefetchw" instruction is affected by
> the erratum?  I see that in current 2.6, only the "prefetchnta"
> instruction is disabled so presumably "prefetchw" is ok?

All prefetch instructions on Athlon/Opteron including prefetchw have the 
problem. prefetchw requires a store instead of a load to hit it though.

-Andi

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata patch
  2003-10-01  7:00           ` Andrew Morton
@ 2003-10-01  7:06             ` Andi Kleen
  2003-10-01  7:31               ` Jamie Lokier
  0 siblings, 1 reply; 45+ messages in thread
From: Andi Kleen @ 2003-10-01  7:06 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andi Kleen, jamie, linux-kernel

On Wed, Oct 01, 2003 at 12:00:15AM -0700, Andrew Morton wrote:
> Andi Kleen <ak@suse.de> wrote:
> >
> > Andrew Morton <akpm@osdl.org> writes:
> > 
> > > Looking at Andi's patch, it is also a dead box if the fault happens inside
> > > down_write(mmap_sem).  That should be fixed, methinks.
> > 
> > The only way to fix all that would be to move the instruction checks early
> > into the fast path.
> 
> Well the deadlock avoidance only needs to happen if the fault occurred in
> kernel mode.  Presumably most faults are in userspace, so most of the
> overhead can be avoided.

Hmm.   I guess that's possible, but will be somewhat intrusive in 
do_page_fault()

Also you have to be very careful to avoid recursive faults (EIP unmapped) 
recursing further. In the original patch I did that for kernel mode by always
checking the exception table first to catch the __get_user in __is_prefetch
early.

Maybe it would be better to just use a down_read_timeout().

-Andi

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata  patch
  2003-10-01  6:47         ` Andi Kleen
@ 2003-10-01  7:00           ` Andrew Morton
  2003-10-01  7:06             ` Andi Kleen
  2003-10-01  7:20           ` Jamie Lokier
  1 sibling, 1 reply; 45+ messages in thread
From: Andrew Morton @ 2003-10-01  7:00 UTC (permalink / raw)
  To: Andi Kleen; +Cc: jamie, linux-kernel

Andi Kleen <ak@suse.de> wrote:
>
> Andrew Morton <akpm@osdl.org> writes:
> 
> > Looking at Andi's patch, it is also a dead box if the fault happens inside
> > down_write(mmap_sem).  That should be fixed, methinks.
> 
> The only way to fix all that would be to move the instruction checks early
> into the fast path.

Well the deadlock avoidance only needs to happen if the fault occurred in
kernel mode.  Presumably most faults are in userspace, so most of the
overhead can be avoided.



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata patch
  2003-10-01  6:32       ` Andrew Morton
@ 2003-10-01  6:57         ` Jamie Lokier
  0 siblings, 0 replies; 45+ messages in thread
From: Jamie Lokier @ 2003-10-01  6:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: jun.nakajima, davej, torvalds, linux-kernel, richard.brunner

Andrew Morton wrote:
> Looking at Andi's patch, it is also a dead box if the fault happens inside
> down_write(mmap_sem).  That should be fixed, methinks.
> 
> And I think we're also a bit deadlocky if it happens inside down_read(),
> because double down_read() is illegal because an intervening down_write()
> from another thread can block the second down_read().  Or maybe not: the
> rwsem semantics may have changed since I last looked.
> 
> And if the fault happens inside spinlock on a !CONFIG_PREEMPT kernel we end
> up doing down_read() with spinlocks held, I think?

Ouch, ouch and ouch.  You're right, it is a serious problem with the
patch.  It is easy enough to fix by making the fault handler not take
mmap_sem if the fault's in the kernel address range.  (With apologies
to the folk running kernel mode userspace...)

> Yup.  Although prefetch-loop-arrays doesn't sound like something which will
> deref a dud pointer?

It might deref off the end of a mapped region.

> > I understand you're advocating a policy that says we can't do anything
> > about old systems, but from 2.6 onwards apps can depend on not being
> > hit by that erratum in userspace, is that right?
> 
> I expect this will be backported to 2.4 asap.

I agree, but there's going to be an installed base of _current_ 2.4
kernels, on these AMD CPUs, for some years to come.  App and library
writers will have to know about the erratum if they want to use the
prefetch instruction, for a while yet.

Btw, does anyone know if the "prefetchw" instruction is affected by
the erratum?  I see that in current 2.6, only the "prefetchnta"
instruction is disabled so presumably "prefetchw" is ok?

-- Jamie

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata  patch
       [not found]       ` <20030930233258.37ed9f7f.akpm@osdl.org.suse.lists.linux.kernel>
@ 2003-10-01  6:47         ` Andi Kleen
  2003-10-01  7:00           ` Andrew Morton
  2003-10-01  7:20           ` Jamie Lokier
       [not found]         ` <20031001065705.GI1131@mail.shareable.org.suse.lists.linux.kernel>
  1 sibling, 2 replies; 45+ messages in thread
From: Andi Kleen @ 2003-10-01  6:47 UTC (permalink / raw)
  To: Andrew Morton; +Cc: jamie, linux-kernel

Andrew Morton <akpm@osdl.org> writes:

> Looking at Andi's patch, it is also a dead box if the fault happens inside
> down_write(mmap_sem).  That should be fixed, methinks.

The only way to fix all that would be to move the instruction checks early
into the fast path.

[On a P4 the overhead is 3.7268 vs 3.6594 microseconds for a fault that 
doesn't hit as measured by lmbench2's lat_sig. This was before the latest 
changes which added more checking, so the overhead is probably bigger now]

-Andi

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata patch
  2003-10-01  6:13     ` Jamie Lokier
@ 2003-10-01  6:32       ` Andrew Morton
  2003-10-01  6:57         ` Jamie Lokier
  0 siblings, 1 reply; 45+ messages in thread
From: Andrew Morton @ 2003-10-01  6:32 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: jun.nakajima, davej, torvalds, linux-kernel, richard.brunner

Jamie Lokier <jamie@shareable.org> wrote:
>
> Andrew Morton wrote:
> > > Doesn't refusing to boot seem too heavy-handed for this bug?  The buggy
> > > CPUs have been around for many years (it is practically the entire AMD
> > > line for the last 4 years or so), and nobody in userspace has
> > > complained about the 2.4 behaviour so far.  (Linux 2.4 behaviour is,
> > > of course, to ignore the errata).
> > 
> > That is the case at present.  But the 2.6 kernel was hitting this
> > erracularity daily.
> 
> We're talking about what to offer userspace now...  I think we all
> agree that the kernel itself shouldn't be allowed to hit it, one way
> or another.

Oh yes, if it hits in-kernel you get a dead box.




Looking at Andi's patch, it is also a dead box if the fault happens inside
down_write(mmap_sem).  That should be fixed, methinks.

And I think we're also a bit deadlocky if it happens inside down_read(),
because double down_read() is illegal because an intervening down_write()
from another thread can block the second down_read().  Or maybe not: the
rwsem semantics may have changed since I last looked.

And if the fault happens inside spinlock on a !CONFIG_PREEMPT kernel we end
up doing down_read() with spinlocks held, I think?

> > If some smart cookie decides to add prefetches to some STL implementation
> > or something, they are likely to start hitting it with the same frequency.
> 
> Especially now that GCC has intrinsics for prefetches, and GCC's
> optimiser can generate prefetches automatically (-fprefetch-loop-arrays).

Yup.  Although prefetch-loop-arrays doesn't sound like something which will
deref a dud pointer?

> ...
> I understand you're advocating a policy that says we can't do anything
> about old systems, but from 2.6 onwards apps can depend on not being
> hit by that erratum in userspace, is that right?

I expect this will be backported to 2.4 asap.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata patch
  2003-10-01  5:48   ` Andrew Morton
@ 2003-10-01  6:13     ` Jamie Lokier
  2003-10-01  6:32       ` Andrew Morton
  0 siblings, 1 reply; 45+ messages in thread
From: Jamie Lokier @ 2003-10-01  6:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: jun.nakajima, davej, torvalds, linux-kernel, richard.brunner

Andrew Morton wrote:
> > Doesn't refusing to boot seem too heavy-handed for this bug?  The buggy
> > CPUs have been around for many years (it is practically the entire AMD
> > line for the last 4 years or so), and nobody in userspace has
> > complained about the 2.4 behaviour so far.  (Linux 2.4 behaviour is,
> > of course, to ignore the errata).
> 
> That is the case at present.  But the 2.6 kernel was hitting this
> erracularity daily.

We're talking about what to offer userspace now...  I think we all
agree that the kernel itself shouldn't be allowed to hit it, one way
or another.

> If some smart cookie decides to add prefetches to some STL implementation
> or something, they are likely to start hitting it with the same frequency.

Especially now that GCC has intrinsics for prefetches, and GCC's
optimiser can generate prefetches automatically (-fprefetch-loop-arrays).
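
For instance (userspace, purely illustrative), the sort of loop people
now write by hand, or that -fprefetch-loop-arrays generates for them:

	long sum_array(const long *a, int n)
	{
		long sum = 0;
		int i;

		for (i = 0; i < n; i++) {
			/* read-ahead can run past the end of a[], into an
			   unmapped page; on the affected CPUs that prefetch
			   may raise the spurious fault */
			__builtin_prefetch(&a[i + 16]);
			sum += a[i];
		}
		return sum;
	}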

But if they assume the kernel hides it, they'll still hit it on any of
the huge installed base of <=4 year old AMD boxes running 2.2 and 2.4.

Ideally, userspace should just not compile with prefetch if they think
they might run on one of those - or select different code at run time
- it is not so different from the constraints which say SSE code
simply won't run on some CPUs or old kernels anyway.  (prefetch is
worse on a P5 than a K7 - it'll throw an illegal opcode exception :)

I understand you're advocating a policy that says we can't do anything
about old systems, but from 2.6 onwards apps can depend on not being
hit by that erratum in userspace, is that right?

-- Jamie

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata patch
  2003-10-01  5:38 ` Jamie Lokier
@ 2003-10-01  5:48   ` Andrew Morton
  2003-10-01  6:13     ` Jamie Lokier
  0 siblings, 1 reply; 45+ messages in thread
From: Andrew Morton @ 2003-10-01  5:48 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: jun.nakajima, davej, torvalds, linux-kernel, richard.brunner

Jamie Lokier <jamie@shareable.org> wrote:
>
> Nakajima, Jun wrote:
> > Yes. It would kind to tell the pilot about the errata of the engine (and
> > refuse to start), rather than telling him that the engine might break
> > down anytime after the takeoff.
> 
> Doesn't refusing to boot seem too heavy-handed for this bug?  The buggy
> CPUs have been around for many years (it is practically the entire AMD
> line for the last 4 years or so), and nobody in userspace has
> complained about the 2.4 behaviour so far.  (Linux 2.4 behaviour is,
> of course, to ignore the errata).

That is the case at present.  But the 2.6 kernel was hitting this
erracularity daily.

If some smart cookie decides to add prefetches to some STL implementation
or something, they are likely to start hitting it with the same frequency.



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata patch
  2003-10-01  4:30 Nakajima, Jun
@ 2003-10-01  5:38 ` Jamie Lokier
  2003-10-01  5:48   ` Andrew Morton
  0 siblings, 1 reply; 45+ messages in thread
From: Jamie Lokier @ 2003-10-01  5:38 UTC (permalink / raw)
  To: Nakajima, Jun
  Cc: Andrew Morton, davej, torvalds, linux-kernel, richard.brunner

Nakajima, Jun wrote:
> Yes. It would be kind to tell the pilot about the errata of the engine (and
> refuse to start), rather than telling him that the engine might break
> down anytime after the takeoff.

Doesn't refusing to boot seem too heavy-handed for this bug?  The buggy
CPUs have been around for many years (it is practically the entire AMD
line for the last 4 years or so), and nobody in userspace has
complained about the 2.4 behaviour so far.  (Linux 2.4 behaviour is,
of course, to ignore the errata).

ps. Why are we all saying errata - isn't erratum good enough? :)

-- Jamie

^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata patch
@ 2003-10-01  4:30 Nakajima, Jun
  2003-10-01  5:38 ` Jamie Lokier
  0 siblings, 1 reply; 45+ messages in thread
From: Nakajima, Jun @ 2003-10-01  4:30 UTC (permalink / raw)
  To: Andrew Morton, Jamie Lokier
  Cc: davej, torvalds, linux-kernel, richard.brunner

> Refusing to boot would be best I think.  Fixing it up anyway would be OK
> too.
> 
I agree.

> Ignoring the problem in kernel and/or userspace could be viewed as a
> response to pilot error but as always it would be nice if, given a choice,
> we can help the pilot.
Yes. It would be kind to tell the pilot about the errata of the engine (and
refuse to start), rather than telling him that the engine might break
down anytime after the takeoff.

	Jun

> -----Original Message-----
> From: Andrew Morton [mailto:akpm@osdl.org]
> Sent: Tuesday, September 30, 2003 8:14 PM
> To: Jamie Lokier
> Cc: Nakajima, Jun; davej@redhat.com; torvalds@osdl.org; linux-
> kernel@vger.kernel.org; richard.brunner@amd.com
> Subject: Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata
> patch
> 
> Jamie Lokier <jamie@shareable.org> wrote:
> >
> > Nakajima, Jun wrote:
> > > Oh, I thought you already closed this issue and you were discussing a
> > > different thing.
> > >
> > > I agree. The kernel should fix up userspace for this CPU errata.
> >
> > Our question is whether kernels configured specifically for non-AMD
> > (e.g. Winchip or Elan kernels) must have the AMD fixup code (it is a
> > few hundred bytes), refuse to boot on AMD, ignore the problem, or work
> > but not fix up userspace.
> 
> Refusing to boot would be best I think.  Fixing it up anyway would be OK
> too.
> 
> Ignoring the problem in kernel and/or userspace could be viewed as a
> response to pilot error but as always it would be nice if, given a choice,
> we can help the pilot.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata patch
  2003-10-01  1:54 Nakajima, Jun
  2003-10-01  2:07 ` Andrew Morton
@ 2003-10-01  2:08 ` Mike Fedyk
  1 sibling, 0 replies; 45+ messages in thread
From: Mike Fedyk @ 2003-10-01  2:08 UTC (permalink / raw)
  To: Nakajima, Jun
  Cc: Andrew Morton, Jamie Lokier, davej, torvalds, linux-kernel,
	richard.brunner

On Tue, Sep 30, 2003 at 06:54:06PM -0700, Nakajima, Jun wrote:
> > I think we should fix up userspace.
> What do you mean by this? Patch user code at runtime (it's much more
> complex than it sounds) or remove prefetch instructions from userspace?

No, just make sure that userspace doesn't see the exceptions that the errata
will produce in obscure cases.  It is all explained in several threads about
this.  Search for "Andi Kleen" and "prefetch".

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata patch
  2003-10-01  1:54 Nakajima, Jun
@ 2003-10-01  2:07 ` Andrew Morton
  2003-10-01  2:08 ` Mike Fedyk
  1 sibling, 0 replies; 45+ messages in thread
From: Andrew Morton @ 2003-10-01  2:07 UTC (permalink / raw)
  To: Nakajima, Jun; +Cc: jamie, davej, torvalds, linux-kernel, richard.brunner

"Nakajima, Jun" <jun.nakajima@intel.com> wrote:
>
>  > I think we should fix up userspace.
>  What do you mean by this? Patch user code at runtime (it's much more
>  complex than it sounds) or remove prefetch instructions from userspace?

Detect when user code stumbles over this CPU errata and make it look like
nothing happened.
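
Roughly, in the path that would otherwise send the signal (sketch only;
the argument list of is_prefetch() is illustrative):

	if (error_code & 4) {			/* fault came from user mode */
		if (is_prefetch(regs, address))	/* spurious erratum fault?   */
			return;			/* pretend nothing happened  */
		/* otherwise fall through and deliver SIGSEGV as usual */
	}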


^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata patch
@ 2003-10-01  1:54 Nakajima, Jun
  2003-10-01  2:07 ` Andrew Morton
  2003-10-01  2:08 ` Mike Fedyk
  0 siblings, 2 replies; 45+ messages in thread
From: Nakajima, Jun @ 2003-10-01  1:54 UTC (permalink / raw)
  To: Andrew Morton, Jamie Lokier
  Cc: davej, torvalds, linux-kernel, richard.brunner

> I think we should fix up userspace.
What do you mean by this? Patch user code at runtime (it's much more
complex than it sounds) or remove prefetch instructions from userspace?

	Jun

> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
> owner@vger.kernel.org] On Behalf Of Andrew Morton
> Sent: Tuesday, September 30, 2003 5:27 PM
> To: Jamie Lokier
> Cc: davej@redhat.com; torvalds@osdl.org; linux-kernel@vger.kernel.org;
> richard.brunner@amd.com
> Subject: Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata
> patch
> 
> Jamie Lokier <jamie@shareable.org> wrote:
> >
> > What I'd really like your opinion on is the appropriate userspace
> > behaviour.  If we don't care about fixing up userspace, then
> > __ex_table is a much tidier workaround for the prefetch bug.
> 
> I think we should fix up userspace.
> 
> >  If we do
> > care about fixing up userspace, then do we need a policy decision that
> > says it's not acceptable to run on AMD without userspace fixups from
> > 2.6.0 onwards - it must fixup userspace or refuse to run?
> 
> If you're saying that the kernel should refuse to boot if it discovers
> that a) the CPU needs the workaround and b) the kernel does not implement
> the workaround then yup.
> 
> -
> To unsubscribe from this list: send the line "unsubscribe
linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata patch
  2003-09-30 23:55               ` Jamie Lokier
@ 2003-10-01  0:27                 ` Andrew Morton
  0 siblings, 0 replies; 45+ messages in thread
From: Andrew Morton @ 2003-10-01  0:27 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: davej, torvalds, linux-kernel, richard.brunner

Jamie Lokier <jamie@shareable.org> wrote:
>
> What I'd really like your opinion on is the appropriate userspace
> behaviour.  If we don't care about fixing up userspace, then
> __ex_table is a much tidier workaround for the prefetch bug.

I think we should fix up userspace.

>  If we do
> care about fixing up userspace, then do we need a policy decision that
> says it's not acceptable to run on AMD without userspace fixups from
> 2.6.0 onwards - it must fixup userspace or refuse to run?

If you're saying that the kernel should refuse to boot if it discovers that
a) the CPU needs the workaround and b) the kernel does not implement the
workaround then yup.
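
A boot-time check along those lines might be (hypothetical, no such hunk
was posted; the names follow the rest of the thread):

#ifndef CONFIG_X86_PREFETCH_FIXUP
	/* built without the workaround: refuse to run on a CPU that needs
	   it rather than letting userspace silently hit the erratum */
	if (c->x86_vendor == X86_VENDOR_AMD && c->x86 >= 6)
		panic("CPU needs the AMD prefetch erratum workaround, "
		      "but this kernel does not include it");
#endif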


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata patch
  2003-09-30 17:26             ` Dave Jones
@ 2003-09-30 23:55               ` Jamie Lokier
  2003-10-01  0:27                 ` Andrew Morton
  0 siblings, 1 reply; 45+ messages in thread
From: Jamie Lokier @ 2003-09-30 23:55 UTC (permalink / raw)
  To: Dave Jones, akpm, torvalds, linux-kernel, richard.brunner

Dave Jones wrote:
> Which gets us back to the question of why this is needed at all ?
> You said earlier "In case you hadn't fully grokked it, my code doesn't
> disable the workaround!"  So why do you need this ?

To change the prefetch workaround from a critical requirement to an
optimisation knob.

X86_USE_3DNOW is very similar: if it's enabled, the kernel has some
extra code to make certain CPUs run faster, but they also run fine
without it.  X86_OOSTORE is another.

What I'd really like your opinion on is the appropriate userspace
behaviour.  If we don't care about fixing up userspace, then
__ex_table is a much tidier workaround for the prefetch bug.  If we do
care about fixing up userspace, then do we need a policy decision that
says it's not acceptable to run on AMD without userspace fixups from
2.6.0 onwards - it must fixup userspace or refuse to run?

-- Jamie

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata patch
  2003-09-30 19:08               ` Andi Kleen
@ 2003-09-30 20:08                 ` H. Peter Anvin
  0 siblings, 0 replies; 45+ messages in thread
From: H. Peter Anvin @ 2003-09-30 20:08 UTC (permalink / raw)
  To: linux-kernel

Followup to:  <p73pthiyu0e.fsf@oldwotan.suse.de>
By author:    Andi Kleen <ak@suse.de>
In newsgroup: linux.dev.kernel
> 
> I implemented it for long mode on x86-64.
> 
> It has to be done before the vesafb is initialized too, otherwise
> you cannot see the error message.
> 
> You could copy the code from arch/x86_64/boot/setup.S 
> (starting with /* check for long mode. */) and change it to
> check for the CPUID bits you want. x86-64 checks a basic collection
> that has been determined to be the base set of x86-64 supporting CPUs.
> But it could be less or more.
> 

Do it in boot/setup.S so that you can still issue a message via the
BIOS.  However, don't forget you might need bug fixes/workarounds like
the P6 SEP in there, too.  From that perspective it would be nice to
have something that can be written in C.  Recent binutils actually
allow gcc-generated .s-files to be assembled for a 16-bit environment
(with lots of overrides.)

	-hpa
-- 
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
If you send me mail in HTML format I will assume it's spam.
"Unix gives you enough rope to shoot yourself in the foot."
Architectures needed: ia64 m68k mips64 ppc ppc64 s390 s390x sh v850 x86-64

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata patch
       [not found]             ` <20030930172618.GE5507@redhat.com.suse.lists.linux.kernel>
@ 2003-09-30 19:08               ` Andi Kleen
  2003-09-30 20:08                 ` H. Peter Anvin
  0 siblings, 1 reply; 45+ messages in thread
From: Andi Kleen @ 2003-09-30 19:08 UTC (permalink / raw)
  To: Dave Jones; +Cc: linux-kernel

Dave Jones <davej@redhat.com> writes:
> 
>  > Anyway, it should complain about lack of cmov not crash :)
> 
> not easy, given we execute cmov instructions before we even hit
> a printk. Such a test & output needs to be done in assembly in early
> startup.

I implemented it for long mode on x86-64.

It has to be done before the vesafb is initialized too, otherwise
you cannot see the error message.

You could copy the code from arch/x86_64/boot/setup.S 
(starting with /* check for long mode. */) and change it to
check for the CPUID bits you want. x86-64 checks a basic collection
that has been determined to be the base set of x86-64 supporting CPUs.
But it could be less or more.

-Andi

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata patch
  2003-09-30 16:54           ` Jamie Lokier
@ 2003-09-30 17:26             ` Dave Jones
  2003-09-30 23:55               ` Jamie Lokier
  0 siblings, 1 reply; 45+ messages in thread
From: Dave Jones @ 2003-09-30 17:26 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: akpm, torvalds, linux-kernel, richard.brunner

On Tue, Sep 30, 2003 at 05:54:50PM +0100, Jamie Lokier wrote:

 > > >    - I don't see anything that prevents a PPro-compiled kernel from booting
 > > >      on a P5MMX with the F00F erratum.
 > > Compiled with -m686 - Uses CMOV, won't boot.
 > Ok, not PPro, but with Processor Family set to K6, CYRIXIII, or any of
 > the 3 WINCHIP choices, it is compiled with -march=i585 and without F00F.

in theory, these should be interchangeable. Not tested though.

 > (Fixing that by adding F00F to all those non-Intel processors, just to
 > make sure non-F00F kernels crash with a cmov instruction is too subtle
 > for my taste.)

this case is a little more obscure imo, and not worth the effort.

 > Anyway, it should complain about lack of cmov not crash :)

not easy, given we execute cmov instructions before we even hit
a printk. Such a test & output needs to be done in assembly in early
startup.

 > > 1. The splitting of X86_FEATURE_XMM into X86_FEATURE_XMM_PREFETCH and
 > >    X86_FEATURE_3DNOW_PREFETCH doesn't seem to really buy us anything
 > >    other than complication.
 > I once suggested turning off XMM entirely when prefetch is broken and
 > not fixed, but that resulted in a mild toasting, hence the extra
 > synthetic flag.

Which gets us back to the question of why this is needed at all ?
You said earlier "In case you hadn't fully grokked it, my code doesn't
disable the workaround!"  So why do you need this ?

 > > - If we haven't set CONFIG_X86_PREFETCH_FIXUP (say a P4 kernel), this
 > >   code path isn't taken, and we end up not doing prefetches on P4's too
 > >   as you're not setting X86_FEATURE_XMM_PREFETCH anywhere else, and apply_alternatives
 > >   leaves them as NOPs.
 > > - Newer C3s are 686's with prefetch, this nobbles them too.
 > Read the code again.  It does what you think it doesn't do, so to speak.

Yep, brain fart.

		Dave

-- 
 Dave Jones     http://www.codemonkey.org.uk

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata patch
  2003-09-30 15:08         ` Dave Jones
@ 2003-09-30 16:54           ` Jamie Lokier
  2003-09-30 17:26             ` Dave Jones
  0 siblings, 1 reply; 45+ messages in thread
From: Jamie Lokier @ 2003-09-30 16:54 UTC (permalink / raw)
  To: Dave Jones, akpm, torvalds, linux-kernel, richard.brunner

Dave Jones wrote:
> >    - I don't see anything that prevents a PPro-compiled kernel from booting
> >      on a P5MMX with the F00F erratum.
> 
> Compiled with -m686 - Uses CMOV, won't boot.

Ok, not PPro, but with Processor Family set to K6, CYRIXIII, or any of
the 3 WINCHIP choices, it is compiled with -march=i586 and without F00F.

(Fixing that by adding F00F to all those non-Intel processors, just to
make sure non-F00F kernels crash with a cmov instruction is too subtle
for my taste.)

Anyway, it should complain about lack of cmov not crash :)

> 1. The splitting of X86_FEATURE_XMM into X86_FEATURE_XMM_PREFETCH and
>    X86_FEATURE_3DNOW_PREFETCH doesn't seem to really buy us anything
>    other than complication.

I once suggested turning off XMM entirely when prefetch is broken and
not fixed, but that resulted in a mild toasting, hence the extra
synthetic flag.

It's not really split: XMM_PREFETCH is an additional flag, on top of
XMM, which simply means Linux decided it's safe to use the prefetch
instruction.

The only reason it's a feature flag is because the "alternative" macro
needs one.
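
i.e. roughly the existing 2.6 prefetch() with the synthetic flag swapped
in (illustrative, not the exact hunk from the patch):

	static inline void prefetch(const void *x)
	{
		alternative_input(ASM_NOP4,
				  "prefetchnta (%1)",
				  X86_FEATURE_XMM_PREFETCH,
				  "r" (x));
	}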

> +       /* Prefetch works ok? */
> +#ifndef CONFIG_X86_PREFETCH_FIXUP
> +       if (c->x86_vendor != X86_VENDOR_AMD || c->x86 < 6)
> +#endif
> +       {
> +               if (cpu_has(c, X86_FEATURE_XMM))
> +                       set_bit(X86_FEATURE_XMM_PREFETCH, c->x86_capability);
> +               if (cpu_has(c, X86_FEATURE_3DNOW))
> +                       set_bit(X86_FEATURE_3DNOW_PREFETCH, c->x86_capability);
> +       }
> 
> - If we haven't set CONFIG_X86_PREFETCH_FIXUP (say a P4 kernel), this
>   code path isn't taken, and we end up not doing prefetches on P4's too
>   as you're not setting X86_FEATURE_XMM_PREFETCH anywhere else, and apply_alternatives
>   leaves them as NOPs.
> - Newer C3s are 686's with prefetch, this nobbles them too.

Read the code again.  It does what you think it doesn't do, so to speak.

-- Jamie

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata patch
  2003-09-30 14:45       ` Jamie Lokier
@ 2003-09-30 15:08         ` Dave Jones
  2003-09-30 16:54           ` Jamie Lokier
  0 siblings, 1 reply; 45+ messages in thread
From: Dave Jones @ 2003-09-30 15:08 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: akpm, torvalds, linux-kernel, richard.brunner

On Tue, Sep 30, 2003 at 03:45:26PM +0100, Jamie Lokier wrote:

 > > And those people are wrong. If they want to save bloat, instead of
 > > 'fixing' things by removing <1 page of .text, how about working on
 > > some of the real problems like shrinking some of the growth of various
 > > data structures that actually *matter*.
 > How about both?

Sounds like wasted effort, in the same sense that rewriting a crap
algorithm in assembly won't be better than using a more efficient
algorithm in C.

 > I'm talking about people with embedded 486s or old 486s donated.  P4s
 > are abundant in RAM

Mine has 256MB. Sure, it's a huge amount in comparison to how we kitted
out 486s a few years back, but still hardly an abundance in today's bloatware...

 > but 2MB is still not unheard of in the small
 > boxes, and in 2MB, 512 bytes of code (which is about the size of the
 > prefetch workaround) is more significant.

I'll be *amazed* if you manage to get a 2.6 kernel to boot and function
efficiently in 2MB without config'ing away a lot of the 'growth' like
sysfs. (Sidenote: Before some loon actually tries this, by function
efficiently, I mean is actually usable for some purpose, not "it booted
to a bash prompt")
 
 > > F00F workaround was enabled on every kernel that is possible
 > > to boot on affected hardware last time I looked.
 > > This is what you seem to be missing, it's not optional.
 > > If its possible to boot that kernel on affected hardware, 
 > > it needs errata workarounds.
 > 
 > We have a few confusing issues here.
 > 
 > 1. First, your point about affected hardware.
 > 
 >    - I don't see anything that prevents a PPro-compiled kernel from booting
 >      on a P5MMX with the F00F erratum.

Compiled with -m686 - Uses CMOV, won't boot.

 >    - Nor do I see anything that prevents a PII-compiled kernel from booting
 >      on a PPro with the store ordering erratum (X86_PPRO_FENCE).

Correct. As noted in another mail, it arguably should contain the
workaround.

 >    Perhaps it's this apparent hypocrisy which needs healing.

Agreed.

 > 2. I'm not sure if you're criticising the other chap who wants
 >    rid of the AMD errata workaround, or my X86_PREFETCH_FIXUP code.

My criticism was twofold.

1. The splitting of X86_FEATURE_XMM into X86_FEATURE_XMM_PREFETCH and
   X86_FEATURE_3DNOW_PREFETCH doesn't seem to really buy us anything
   other than complication.
2. This chunk...

+       /* Prefetch works ok? */
+#ifndef CONFIG_X86_PREFETCH_FIXUP
+       if (c->x86_vendor != X86_VENDOR_AMD || c->x86 < 6)
+#endif
+       {
+               if (cpu_has(c, X86_FEATURE_XMM))
+                       set_bit(X86_FEATURE_XMM_PREFETCH, c->x86_capability);
+               if (cpu_has(c, X86_FEATURE_3DNOW))
+                       set_bit(X86_FEATURE_3DNOW_PREFETCH, c->x86_capability);
+       }

- If we haven't set CONFIG_X86_PREFETCH_FIXUP (say a P4 kernel), this
  code path isn't taken, and we end up not doing prefetches on P4's too
  as you're not setting X86_FEATURE_XMM_PREFETCH anywhere else, and apply_alternatives
  leaves them as NOPs.
- Newer C3s are 686s with prefetch; this nobbles them too.


 >    In case you hadn't fully grokked it, my code doesn't disable the
 >    workaround!  It simply replaces it with a smaller, slightly
 >    slower one, on kernels which are not optimised for AMD.

See above. Or have I missed something ?

 >    Given that, I'm not sure what the thrust of your argument is.

It's possible I'm missing something silly..

		Dave

-- 
 Dave Jones     http://www.codemonkey.org.uk

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata patch
  2003-09-30 13:53     ` Dave Jones
@ 2003-09-30 14:45       ` Jamie Lokier
  2003-09-30 15:08         ` Dave Jones
  0 siblings, 1 reply; 45+ messages in thread
From: Jamie Lokier @ 2003-09-30 14:45 UTC (permalink / raw)
  To: Dave Jones, akpm, torvalds, linux-kernel, richard.brunner

Dave Jones wrote:
> On Tue, Sep 30, 2003 at 02:39:36PM +0100, Jamie Lokier wrote:
>  > Dave Jones wrote:
>  > > This looks to be completely gratuitous. Why disable it when we have the
>  > > ability to work around it ?
>  > 
>  > Because some people expressed a wish to have kernels that don't
>  > contain the workaround code, be they P4-optimised or 486-optimised
>  > kernels.
> 
> And those people are wrong. If they want to save bloat, instead of
> 'fixing' things by removing <1 page of .text, how about working on
> some of the real problems like shrinking some of the growth of various
> data structures that actually *matter*.

How about both?

> The "I don't want Athlon code in my kernel because I run a P4 and
> it makes it slow/bigger" argument is totally bogus. It's akin to
> the gentoo-esque "I compiled my distro with -march=p4 and now
> my /bin/ls is faster than yours" argument.

I'm talking about people with embedded 486s or old 486s donated.  P4s
are abundant in RAM, but 2MB is still not unheard of in the small
boxes, and in 2MB, 512 bytes of code (which is about the size of the
prefetch workaround) is more significant.  Not by itself, but by
virtue of being yet another unnecessary, and very clearly removable
chunk.  Diligence and all that.

>  > After all we have kernels that don't contain the F00F
>  > workaround too.  I'm not pushing this patch as is, it's for
>  > considering the pros and cons.
> 
> F00F workaround was enabled on every kernel that is possible
> to boot on affected hardware last time I looked.
> This is what you seem to be missing, it's not optional.
> If its possible to boot that kernel on affected hardware, 
> it needs errata workarounds.

We have a few confusing issues here.

1. First, your point about affected hardware.

   I see your point, and the new "alternative" macro is allowing things
   to develop along those lines even better: using fancier features (like
   prefetch) but not requiring them.  Most of the x86 kernel is like that now.
   All the stuff in arch/i386/kernel/cpu/* is mandatory.  Yet:

   - I don't see anything that prevents a PPro-compiled kernel from booting
     on a P5MMX with the F00F erratum.

   - Nor do I see anything that prevents a PII-compiled kernel from booting
     on a PPro with the store ordering erratum (X86_PPRO_FENCE).

   Those are both nasty bugs, and the latter can corrupt data on an SMP PPro.

   Currently, it seems quite hypocritical to treat the AMD workaround
   as, in effect, mandatory, while there are others which aren't.
   Perhaps it's this apparent hypocrisy which needs healing.

2. I'm not sure if you're criticising the other chap who wants
   rid of the AMD errata workaround, or my X86_PREFETCH_FIXUP code.

   In case you hadn't fully grokked it, my code doesn't disable the
   workaround!  It simply replaces it with a smaller, slightly
   slower one, on kernels which are not optimised for AMD.  Those
   kernels continue to work just fine on AMD processors.

   Given that, I'm not sure what the thrust of your argument is.

 > CONFIG_X86_PREFETCH_WORKAROUND makes more sense with the
>  > recently available "split every x86 CPU into individually selectable
>  > options" patch, and, on reflection, that's probably where it belongs.
> 
> Said patch isn't included in mainline, so this argument is bogus.
> Relative merits of that patch were already discussed in another thread.

That wasn't an argument, and my patch isn't in mainline either; anyway,
I agree that topic has its own thread.  Let's
please leave it out of this one and focus on the merits, or otherwise,
of my patch and Andi's.

-- Jamie

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata patch
  2003-09-30 13:39   ` Jamie Lokier
@ 2003-09-30 13:53     ` Dave Jones
  2003-09-30 14:45       ` Jamie Lokier
  0 siblings, 1 reply; 45+ messages in thread
From: Dave Jones @ 2003-09-30 13:53 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: akpm, torvalds, linux-kernel, richard.brunner

On Tue, Sep 30, 2003 at 02:39:36PM +0100, Jamie Lokier wrote:
 > Dave Jones wrote:
 > > This looks to be completely gratuitous. Why disable it when we have the
 > > ability to work around it ?
 > 
 > Because some people expressed a wish to have kernels that don't
 > contain the workaround code, be they P4-optimised or 486-optimised
 > kernels.

And those people are wrong. If they want to save bloat, instead of
'fixing' things by removing <1 page of .text, how about working on
some of the real problems like shrinking some of the growth of various
data structures that actually *matter*.

The "I don't want Athlon code in my kernel because I run a P4 and
it makes it slow/bigger" argument is totally bogus. It's akin to
the gentoo-esque "I compiled my distro with -march=p4 and now
my /bin/ls is faster than yours" argument.

 > After all we have kernels that don't contain the F00F
 > workaround too.  I'm not pushing this patch as is, it's for
 > considering the pros and cons.

F00F workaround was enabled on every kernel that is possible
to boot on affected hardware last time I looked.
This is what you seem to be missing, it's not optional.
If its possible to boot that kernel on affected hardware, 
it needs errata workarounds.

 > CONFIG_X86_PREFETCH_WORKAROUND makes more sense with the
 > recently available "split every x86 CPU into individually selectable
 > options" patch, and, on reflection, that's probably where it belongs.

Said patch isn't included in mainline, so this argument is bogus.
Relative merits of that patch were already discussed in another thread.

		Dave

-- 
 Dave Jones     http://www.codemonkey.org.uk

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata patch
  2003-09-30 13:22 ` Dave Jones
@ 2003-09-30 13:39   ` Jamie Lokier
  2003-09-30 13:53     ` Dave Jones
  0 siblings, 1 reply; 45+ messages in thread
From: Jamie Lokier @ 2003-09-30 13:39 UTC (permalink / raw)
  To: Dave Jones, akpm, torvalds, linux-kernel, richard.brunner

Dave Jones wrote:
> This looks to be completely gratuitous. Why disable it when we have the
> ability to work around it ?

Because some people expressed a wish to have kernels that don't
contain the workaround code, be they P4-optimised or 486-optimised
kernels.  After all we have kernels that don't contain the F00F
workaround too.  I'm not pushing this patch as is, it's for
considering the pros and cons.

CONFIG_X86_PREFETCH_WORKAROUND makes more sense with the
recently available "split every x86 CPU into individually selectable
options" patch, and, on reflection, that's probably where it belongs.

-- Jamie

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata patch
  2003-09-30  7:38 Jamie Lokier
  2003-09-30  8:01 ` Nick Piggin
@ 2003-09-30 13:22 ` Dave Jones
  2003-09-30 13:39   ` Jamie Lokier
  1 sibling, 1 reply; 45+ messages in thread
From: Dave Jones @ 2003-09-30 13:22 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: akpm, torvalds, linux-kernel, richard.brunner

On Tue, Sep 30, 2003 at 08:38:14AM +0100, Jamie Lokier wrote:
 > diff -urN --exclude-from=dontdiff orig-2.6.0-test6/arch/i386/Kconfig dual-2.6.0-test6/arch/i386/Kconfig
 > --- orig-2.6.0-test6/arch/i386/Kconfig	2003-09-30 05:39:33.467103692 +0100
 > +++ dual-2.6.0-test6/arch/i386/Kconfig	2003-09-30 06:05:19.868623767 +0100
 > @@ -397,6 +397,12 @@
 >  	depends on MWINCHIP3D || MWINCHIP2 || MWINCHIPC6
 >  	default y
 >  
 > +config X86_PREFETCH_FIXUP
 > +	bool
 > +	depends on MK7 || MK8
 > +	default y
 > +
 > +
 >  config HPET_TIMER
 >  	bool "HPET Timer Support"
 >  	help
 > diff -urN --exclude-from=dontdiff orig-2.6.0-test6/arch/i386/kernel/cpu/common.c dual-2.6.0-test6/arch/i386/kernel/cpu/common.c
 > --- orig-2.6.0-test6/arch/i386/kernel/cpu/common.c	2003-09-30 05:39:33.871035696 +0100
 > +++ dual-2.6.0-test6/arch/i386/kernel/cpu/common.c	2003-09-30 06:03:46.222851345 +0100
 > @@ -327,6 +327,17 @@
 >  	 * we do "generic changes."
 >  	 */
 >  
 > +	/* Prefetch works ok? */
 > +#ifndef CONFIG_X86_PREFETCH_FIXUP
 > +	if (c->x86_vendor != X86_VENDOR_AMD || c->x86 < 6)
 > +#endif
 > +	{
 > +		if (cpu_has(c, X86_FEATURE_XMM))
 > +			set_bit(X86_FEATURE_XMM_PREFETCH, c->x86_capability);
 > +		if (cpu_has(c, X86_FEATURE_3DNOW))
 > +			set_bit(X86_FEATURE_3DNOW_PREFETCH, c->x86_capability);
 > +	}
 > +
 >  	/* TSC disabled? */
 >  	if ( tsc_disable )

This looks to be completely gratuitous. Why disable it when we have the
ability to work around it ?

		Dave

-- 
 Dave Jones     http://www.codemonkey.org.uk

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata patch
  2003-09-30  7:38 Jamie Lokier
@ 2003-09-30  8:01 ` Nick Piggin
  2003-09-30 13:22 ` Dave Jones
  1 sibling, 0 replies; 45+ messages in thread
From: Nick Piggin @ 2003-09-30  8:01 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: akpm, torvalds, linux-kernel, richard.brunner



Jamie Lokier wrote:

>This builds upon Andi Kleen's excellent patch for the AMD prefetch bug
>workaround.  It applies to 2.6.0-test6.
>
>There are four changes on top of Andi's recent patch:
>
>   1. The workaround code is only included when compiling a kernel
>      optimised for AMD processors with the bug.  For kernels optimised
>      for a different processor, there is no code size overhead.
>
>      This will make some people happier.
>
>      It will make some people unhappier, because it means a generic
>      kernel on an AMD will run fine, but won't fixup _userspace_
>      prefetch faults.  More on this below.
>
>   2. Nevertheless, when a kernel _not_ optimised for those AMD processors
>      is run on one, the "alternative" mechanism now correctly replaces
>      prefetch instructions with nops.
>
>   3. The is_prefetch() instruction decoder now handles all cases of
>      funny GDT and LDT segments, checks limits and so on.  As far as I
>      can see, there are no further bugs or flaws in that part.
>
>   4. A consequence of the is_prefetch() change is that, despite the
>      longer C code (rarely used code, for segments), is_prefetch()
>      should run marginally faster now because the access_ok() calls are
>      not required.  They're subsumed into the segment limit
>      calculation.
> 
>Like Andi's patch, this removes 10k or so from current x86 kernels by
>removing the silly conditional from the prefetch() function.
>
>Controversial point coming up!
>

It sounds good, Jamie, but could you possibly compile it in
unconditionally? Otherwise it introduces yet another caveat to the CPU
selection / build system. This can be addressed separately.

I don't think anyone advocates attempting to handle this nicely with the
current CPU selection system, nor were they targeting this patch in
particular. It just kicked off a discussion about the shortfalls of the
selection system.



^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata patch
@ 2003-09-30  7:38 Jamie Lokier
  2003-09-30  8:01 ` Nick Piggin
  2003-09-30 13:22 ` Dave Jones
  0 siblings, 2 replies; 45+ messages in thread
From: Jamie Lokier @ 2003-09-30  7:38 UTC (permalink / raw)
  To: akpm; +Cc: torvalds, linux-kernel, richard.brunner

This builds upon Andi Kleen's excellent patch for the AMD prefetch bug
workaround.  It applies to 2.6.0-test6.

There are four changes on top of Andi's recent patch:

   1. The workaround code is only included when compiling a kernel
      optimised for AMD processors with the bug.  For kernels optimised
      for a different processor, there is no code size overhead.

      This will make some people happier.

      It will make some people unhappier, because it means a generic
      kernel on an AMD will run fine, but won't fixup _userspace_
      prefetch faults.  More on this below.

   2. Nevertheless, when a kernel _not_ optimised for those AMD processors
      is run on one, the "alternative" mechanism now correctly replaces
      prefetch instructions with nops.

   3. The is_prefetch() instruction decoder now handles all cases of
      funny GDT and LDT segments, checks limits and so on.  As far as I
      can see, there are no further bugs or flaws in that part.

   4. A consequence of the is_prefetch() change is that, despite the
      longer C code (rarely used code, for segments), is_prefetch()
      should run marginally faster now because the access_ok() calls are
      not required.  They're subsumed into the segment limit
      calculation.
 
Like Andi's patch, this removes 10k or so from current x86 kernels by
removing the silly conditional from the prefetch() function.

Controversial point coming up!

It's a reasonable argument that if a kernel version, say 2.6.0, promises
to hide the errata from _userspace_ programs, then the generic or
not-AMD-optimised kernels should do that too.  Otherwise userspace may
as well provide the workaround itself, by catching SIGSEGV - because it
is perfectly normal and common to run a generic kernel on an AMD.

If userspace has to do that, then there's no point in the kernel having
the fixup at all - it may as well use __ex_table entries for the
kernel-space prefetch instructions, and give up on hiding the processor
bug from userspace.  That would be much simpler than any of the patches
so far.

Arguably the bug has been around for years (all K7s), so the rare
userspace programs which use prefetch should already be aware of it.

Something to think about.
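
For concreteness, here is a minimal sketch of what such a userspace
workaround could look like on i386 -- a SIGSEGV handler that skips the
faulting instruction when it is a prefetch.  This is not part of the
patch; the decoder below is a simplified assumption (it ignores
instruction prefixes and 16-bit code, and does not validate that EIP
itself is readable before peeking at it):

#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <ucontext.h>

/* Length of "0F 0D/18 modrm [sib] [disp]" in 32-bit code, 0 if not a prefetch. */
static int prefetch_len(const unsigned char *ip)
{
	unsigned char modrm, mod, rm;
	int len = 3;                        /* 0F, 0D/18, modrm */

	if (ip[0] != 0x0f || (ip[1] != 0x0d && ip[1] != 0x18))
		return 0;
	modrm = ip[2];
	mod = modrm >> 6;
	rm  = modrm & 7;
	if (mod == 3)
		return len;                 /* register form: no SIB, no disp */
	if (rm == 4)
		len += 1;                   /* SIB byte present */
	if (mod == 1)
		len += 1;                   /* disp8 */
	else if (mod == 2 || (mod == 0 && rm == 5))
		len += 4;                   /* disp32 */
	else if (mod == 0 && rm == 4 && (ip[3] & 7) == 5)
		len += 4;                   /* SIB with no base: disp32 */
	return len;
}

static void segv_handler(int sig, siginfo_t *si, void *ctx)
{
	ucontext_t *uc = ctx;
	unsigned char *ip =
		(unsigned char *)(unsigned long)uc->uc_mcontext.gregs[REG_EIP];
	int len;

	(void)sig; (void)si;
	/* A real handler would first check that ip points at readable memory. */
	len = prefetch_len(ip);
	if (len) {
		uc->uc_mcontext.gregs[REG_EIP] += len;   /* skip the prefetch */
		return;
	}
	signal(SIGSEGV, SIG_DFL);           /* genuine fault: die normally */
	raise(SIGSEGV);
}

int main(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = segv_handler;
	sa.sa_flags = SA_SIGINFO;
	sigaction(SIGSEGV, &sa, NULL);

	/* ... application code that may prefetch a bad address ... */
	return 0;
}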

Here is Andi's comment from the patch this is derived from:

  Here is a new version of the Athlon/Opteron prefetch issue workaround
  for 2.6.0test6.  The issue was hit regularly by the 2.6 kernel
  while doing prefetches on NULL terminated hlists.

  These CPUs sometimes generate an illegal exception for prefetch
  instructions. The operating system can work around this by checking
  whether the faulting instruction is a prefetch and, if so, ignoring
  the fault.

  The code is structured carefully to ensure that the page fault
  will never recurse more than once. Also unmapped EIPs are special
  cased to give more friendly oopses when the kernel jumps to
  unmapped addresses.

  It removes the previous dumb in-kernel workaround for this and shrinks the
  kernel by >10k.

  A small behaviour change is that a SIGBUS fault for a *_user access will
  now cause an EFAULT, not a SIGBUS.


I'm running this now, on my dual AMD Athlon, and it's working fine.  (It
works when I comment out the fast path and force the full segment check,
before you ask ;)

Enjoy,
-- Jamie

diff -urN --exclude-from=dontdiff orig-2.6.0-test6/arch/i386/Kconfig dual-2.6.0-test6/arch/i386/Kconfig
--- orig-2.6.0-test6/arch/i386/Kconfig	2003-09-30 05:39:33.467103692 +0100
+++ dual-2.6.0-test6/arch/i386/Kconfig	2003-09-30 06:05:19.868623767 +0100
@@ -397,6 +397,12 @@
 	depends on MWINCHIP3D || MWINCHIP2 || MWINCHIPC6
 	default y
 
+config X86_PREFETCH_FIXUP
+	bool
+	depends on MK7 || MK8
+	default y
+
+
 config HPET_TIMER
 	bool "HPET Timer Support"
 	help
diff -urN --exclude-from=dontdiff orig-2.6.0-test6/arch/i386/kernel/cpu/common.c dual-2.6.0-test6/arch/i386/kernel/cpu/common.c
--- orig-2.6.0-test6/arch/i386/kernel/cpu/common.c	2003-09-30 05:39:33.871035696 +0100
+++ dual-2.6.0-test6/arch/i386/kernel/cpu/common.c	2003-09-30 06:03:46.222851345 +0100
@@ -327,6 +327,17 @@
 	 * we do "generic changes."
 	 */
 
+	/* Prefetch works ok? */
+#ifndef CONFIG_X86_PREFETCH_FIXUP
+	if (c->x86_vendor != X86_VENDOR_AMD || c->x86 < 6)
+#endif
+	{
+		if (cpu_has(c, X86_FEATURE_XMM))
+			set_bit(X86_FEATURE_XMM_PREFETCH, c->x86_capability);
+		if (cpu_has(c, X86_FEATURE_3DNOW))
+			set_bit(X86_FEATURE_3DNOW_PREFETCH, c->x86_capability);
+	}
+
 	/* TSC disabled? */
 	if ( tsc_disable )
 		clear_bit(X86_FEATURE_TSC, c->x86_capability);
diff -urN --exclude-from=dontdiff orig-2.6.0-test6/arch/i386/mm/fault.c dual-2.6.0-test6/arch/i386/mm/fault.c
--- orig-2.6.0-test6/arch/i386/mm/fault.c	2003-07-08 21:31:09.000000000 +0100
+++ dual-2.6.0-test6/arch/i386/mm/fault.c	2003-09-30 06:03:46.284839268 +0100
@@ -55,6 +55,157 @@
 	console_loglevel = loglevel_save;
 }
 
+#ifdef CONFIG_X86_PREFETCH_FIXUP
+/*
+ * Return EIP plus the CS segment base.  The segment limit is also
+ * adjusted, clamped to the kernel/user address space (whichever is
+ * appropriate), and returned in *eip_limit.
+ *
+ * The segment is checked, because it might have been changed by another
+ * task between the original faulting instruction and here.
+ *
+ * If CS is no longer a valid code segment, or if EIP is beyond the
+ * limit, or if it is a kernel address when CS is not a kernel segment,
+ * then the returned value will be greater than *eip_limit.
+ */
+static inline unsigned long get_segment_eip(struct pt_regs *regs,
+					    unsigned long *eip_limit)
+{
+	unsigned long eip = regs->eip;
+	unsigned seg = regs->xcs & 0xffff;
+	u32 seg_ar, seg_limit, base, *desc;
+
+	/* The standard kernel/user address space limit. */
+	*eip_limit = (seg & 3) ? USER_DS.seg : KERNEL_DS.seg;
+
+	/* Unlikely, but must come before segment checks. */
+	if (unlikely((regs->eflags & VM_MASK) != 0))
+		return eip + (seg << 4);
+	
+	/* By far the commonest cases. */
+	if (likely(seg == __USER_CS || seg == __KERNEL_CS))
+		return eip;
+
+	/* Check the segment exists, is within the current LDT/GDT size,
+	   that kernel/user (ring 0..3) has the appropriate privilege,
+	   that it's a code segment, and get the limit. */
+	__asm__ ("larl %3,%0; lsll %3,%1"
+		 : "=&r" (seg_ar), "=r" (seg_limit) : "0" (0), "rm" (seg));
+	if ((~seg_ar & 0x9800) || eip > seg_limit) {
+		*eip_limit = 0;
+		return 1;	 /* So that returned eip > *eip_limit. */
+	}
+
+	/* Get the GDT/LDT descriptor base. */
+	if (seg & (1<<2)) {
+		/* Must lock the LDT while reading it. */
+		down(&current->mm->context.sem);
+		desc = current->mm->context.ldt;
+	} else {
+		/* Must disable preemption while reading the GDT. */
+		desc = (u32 *)&cpu_gdt_table[get_cpu()];
+	}
+	desc = (void *)desc + (seg & ~7);
+	base = (desc[0] >> 16) |
+		((desc[1] & 0xff) << 16) |
+		(desc[1] & 0xff000000);
+	if (seg & (1<<2))
+		up(&current->mm->context.sem);
+	else
+		put_cpu();
+
+	/* Adjust EIP and segment limit, and clamp at the kernel limit.
+	   It's legitimate for segments to wrap at 0xffffffff. */
+	seg_limit += base;
+	if (seg_limit < *eip_limit && seg_limit >= base)
+		*eip_limit = seg_limit;
+	return eip + base;
+}
+
+/* 
+ * Sometimes AMD Athlon/Opteron CPUs report invalid exceptions on prefetch.
+ * Check that here and ignore it.
+ */
+static int __is_prefetch(struct pt_regs *regs, unsigned long addr)
+{ 
+	unsigned long limit;
+	unsigned long instr = get_segment_eip (regs, &limit);
+	int scan_more = 1;
+	int prefetch = 0; 
+	int i;
+
+	/* 
+	 * Avoid recursive faults. This catches the kernel jumping to nirvana.
+	 * More complicated races with unmapped EIP are handled elsewhere for 
+	 * user space.
+	 */
+	if (instr == addr)
+		return 0; 
+
+	for (i = 0; scan_more && i < 15; i++) { 
+		unsigned char opcode;
+		unsigned char instr_hi;
+		unsigned char instr_lo;
+
+		if (instr > limit)
+			break;
+		if (__get_user(opcode, (unsigned char *) instr))
+			break; 
+
+		instr_hi = opcode & 0xf0; 
+		instr_lo = opcode & 0x0f; 
+		instr++;
+
+		switch (instr_hi) { 
+		case 0x20:
+		case 0x30:
+			/* Values 0x26,0x2E,0x36,0x3E are valid x86 prefixes. */
+			scan_more = ((instr_lo & 7) == 0x6);
+			break;
+			
+		case 0x60:
+			/* 0x64 thru 0x67 are valid prefixes in all modes. */
+			scan_more = (instr_lo & 0xC) == 0x4;
+			break;		
+		case 0xF0:
+			/* 0xF0, 0xF2, and 0xF3 are valid prefixes in all modes. */
+			scan_more = !instr_lo || (instr_lo>>1) == 1;
+			break;			
+		case 0x00:
+			/* Prefetch instruction is 0x0F0D or 0x0F18 */
+			scan_more = 0;
+			if (instr > limit)
+				break;
+			if (__get_user(opcode, (unsigned char *) instr)) 
+				break;
+			prefetch = (instr_lo == 0xF) &&
+				(opcode == 0x0D || opcode == 0x18);
+			break;			
+		default:
+			scan_more = 0;
+			break;
+		} 
+	}
+
+	return prefetch;
+}
+
+static inline int is_prefetch(struct pt_regs *regs, unsigned long addr)
+{
+	/* This code is included only when optimising for AMD, so we
+	   may as well bias in favour of it. */
+	if (likely(boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
+		   boot_cpu_data.x86 >= 6))
+		return __is_prefetch(regs, addr);
+	return 0;
+} 
+
+#else /* CONFIG_X86_PREFETCH_FIXUP */
+
+#define is_prefetch(regs, addr)	(0)
+
+#endif
+
 asmlinkage void do_invalid_op(struct pt_regs *, unsigned long);
 
 /*
@@ -110,7 +261,7 @@
 	 * atomic region then we must not take the fault..
 	 */
 	if (in_atomic() || !mm)
-		goto no_context;
+		goto bad_area_nosemaphore;
 
 	down_read(&mm->mmap_sem);
 
@@ -198,8 +349,16 @@
 bad_area:
 	up_read(&mm->mmap_sem);
 
+bad_area_nosemaphore:
 	/* User mode accesses just cause a SIGSEGV */
 	if (error_code & 4) {
+		/* 
+		 * Valid to do another page fault here because this one came 
+		 * from user space.
+		 */
+		if (is_prefetch(regs, address))
+			return;
+
 		tsk->thread.cr2 = address;
 		tsk->thread.error_code = error_code;
 		tsk->thread.trap_no = 14;
@@ -232,6 +391,14 @@
 	if (fixup_exception(regs))
 		return;
 
+	/* 
+	 * Valid to do another page fault here, because if this fault
+	 * had been triggered by is_prefetch fixup_exception would have 
+	 * handled it.
+	 */
+ 	if (is_prefetch(regs, address))
+ 		return;
+
 /*
  * Oops. The kernel tried to access some bad page. We'll have to
  * terminate things with extreme prejudice.
@@ -286,10 +453,14 @@
 do_sigbus:
 	up_read(&mm->mmap_sem);
 
-	/*
-	 * Send a sigbus, regardless of whether we were in kernel
-	 * or user mode.
-	 */
+	/* Kernel mode? Handle exceptions or die */
+	if (!(error_code & 4))
+		goto no_context;
+
+	/* User space => ok to do another page fault */
+	if (is_prefetch(regs, address))
+		return;
+
 	tsk->thread.cr2 = address;
 	tsk->thread.error_code = error_code;
 	tsk->thread.trap_no = 14;
@@ -298,10 +469,6 @@
 	info.si_code = BUS_ADRERR;
 	info.si_addr = (void *)address;
 	force_sig_info(SIGBUS, &info, tsk);
-
-	/* Kernel mode? Handle exceptions or die */
-	if (!(error_code & 4))
-		goto no_context;
 	return;
 
 vmalloc_fault:
diff -urN --exclude-from=dontdiff orig-2.6.0-test6/include/asm-i386/cpufeature.h dual-2.6.0-test6/include/asm-i386/cpufeature.h
--- orig-2.6.0-test6/include/asm-i386/cpufeature.h	2003-09-30 05:41:04.520720265 +0100
+++ dual-2.6.0-test6/include/asm-i386/cpufeature.h	2003-09-30 06:03:46.302835762 +0100
@@ -68,6 +68,9 @@
 #define X86_FEATURE_K7		(3*32+ 5) /* Athlon */
 #define X86_FEATURE_P3		(3*32+ 6) /* P3 */
 #define X86_FEATURE_P4		(3*32+ 7) /* P4 */
+/* Prefetch insns without AMD spurious faults, or with fixup for them. */
+#define X86_FEATURE_XMM_PREFETCH	(3*32+ 8) /* SSE */
+#define X86_FEATURE_3DNOW_PREFETCH	(3*32+ 9) /* 3DNow! */
 
 /* Intel-defined CPU features, CPUID level 0x00000001 (ecx), word 4 */
 #define X86_FEATURE_EST		(4*32+ 7) /* Enhanced SpeedStep */
diff -urN --exclude-from=dontdiff orig-2.6.0-test6/include/asm-i386/processor.h dual-2.6.0-test6/include/asm-i386/processor.h
--- orig-2.6.0-test6/include/asm-i386/processor.h	2003-09-30 05:41:04.861655987 +0100
+++ dual-2.6.0-test6/include/asm-i386/processor.h	2003-09-30 06:03:46.334829529 +0100
@@ -589,11 +589,9 @@
 #define ARCH_HAS_PREFETCH
 extern inline void prefetch(const void *x)
 {
-	if (cpu_data[0].x86_vendor == X86_VENDOR_AMD)
-		return;		/* Some athlons fault if the address is bad */
 	alternative_input(ASM_NOP4,
 			  "prefetchnta (%1)",
-			  X86_FEATURE_XMM,
+			  X86_FEATURE_XMM_PREFETCH,
 			  "r" (x));
 }
 
@@ -607,7 +605,7 @@
 {
 	alternative_input(ASM_NOP4,
 			  "prefetchw (%1)",
-			  X86_FEATURE_3DNOW,
+			  X86_FEATURE_3DNOW_PREFETCH,
 			  "r" (x));
 }
 #define spin_lock_prefetch(x)	prefetchw(x)

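For readers who haven't met the "alternative" mechanism that the
prefetch()/prefetchw() hunks above rely on, here is a toy userspace
illustration of the idea -- assumed names, operating on a plain byte
buffer rather than live kernel text, so it is not the kernel's real
apply_alternatives(): each padded-out code site is recorded in a table,
and at boot the table is walked once so that sites whose CPU feature
bit is present get the better instruction written over the NOPs.

#include <stdio.h>
#include <string.h>

#define FEAT_XMM_PREFETCH   0x01u
#define FEAT_3DNOW_PREFETCH 0x02u

struct alt_site {
	unsigned char *site;            /* start of the NOP-padded site   */
	const unsigned char *repl;      /* replacement instruction bytes  */
	unsigned char len;              /* length of site and replacement */
	unsigned int feature;           /* feature that enables the repl. */
};

static void patch_alternatives(struct alt_site *a, int n, unsigned int feats)
{
	for (int i = 0; i < n; i++)
		if (feats & a[i].feature)
			memcpy(a[i].site, a[i].repl, a[i].len);
}

int main(void)
{
	/* Stand-in for a code site: three NOPs where a prefetch could go. */
	unsigned char site[3] = { 0x90, 0x90, 0x90 };
	/* prefetchnta (%eax) encodes as 0F 18 00 in 32-bit code. */
	static const unsigned char prefetchnta_eax[3] = { 0x0f, 0x18, 0x00 };

	struct alt_site table[] = {
		{ site, prefetchnta_eax, 3, FEAT_XMM_PREFETCH },
	};

	/* Pretend the boot CPU advertised a working SSE prefetch. */
	patch_alternatives(table, 1, FEAT_XMM_PREFETCH);

	for (int i = 0; i < 3; i++)
		printf("%02x ", site[i]);       /* prints: 0f 18 00 */
	printf("\n");
	return 0;
}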
^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2003-10-01 16:18 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-10-01  2:23 [PATCH] Mutilated form of Andi Kleen's AMD prefetch errata patch Nakajima, Jun
2003-10-01  2:51 ` Jamie Lokier
2003-10-01  3:14   ` Andrew Morton
     [not found] <7F740D512C7C1046AB53446D3720017304AFCF@scsmsx402.sc.intel.com.suse.lists.linux.kernel>
     [not found] ` <20031001053833.GB1131@mail.shareable.org.suse.lists.linux.kernel>
     [not found]   ` <20030930224853.15073447.akpm@osdl.org.suse.lists.linux.kernel>
     [not found]     ` <20031001061348.GE1131@mail.shareable.org.suse.lists.linux.kernel>
     [not found]       ` <20030930233258.37ed9f7f.akpm@osdl.org.suse.lists.linux.kernel>
2003-10-01  6:47         ` Andi Kleen
2003-10-01  7:00           ` Andrew Morton
2003-10-01  7:06             ` Andi Kleen
2003-10-01  7:31               ` Jamie Lokier
2003-10-01  7:41                 ` Andi Kleen
2003-10-01  8:02                 ` Hugh Dickins
2003-10-01  8:49                   ` Andi Kleen
2003-10-01  9:33                   ` Jamie Lokier
2003-10-01 14:51                     ` Andrew Morton
2003-10-01 14:56                       ` Andi Kleen
2003-10-01 15:19                         ` Andrew Morton
2003-10-01 15:24                           ` Andi Kleen
2003-10-01 16:18                       ` Jamie Lokier
2003-10-01  7:20           ` Jamie Lokier
2003-10-01  7:39             ` Andi Kleen
2003-10-01  8:20               ` Jamie Lokier
     [not found]         ` <20031001065705.GI1131@mail.shareable.org.suse.lists.linux.kernel>
2003-10-01  7:15           ` Andi Kleen
2003-10-01  7:24           ` Andi Kleen
2003-10-01  7:55             ` Jamie Lokier
2003-10-01  8:00               ` Andi Kleen
  -- strict thread matches above, loose matches on Subject: below --
2003-10-01  4:30 Nakajima, Jun
2003-10-01  5:38 ` Jamie Lokier
2003-10-01  5:48   ` Andrew Morton
2003-10-01  6:13     ` Jamie Lokier
2003-10-01  6:32       ` Andrew Morton
2003-10-01  6:57         ` Jamie Lokier
2003-10-01  1:54 Nakajima, Jun
2003-10-01  2:07 ` Andrew Morton
2003-10-01  2:08 ` Mike Fedyk
     [not found] <20030930073814.GA26649@mail.jlokier.co.uk.suse.lists.linux.kernel>
     [not found] ` <20030930132211.GA23333@redhat.com.suse.lists.linux.kernel>
     [not found]   ` <20030930133936.GA28876@mail.shareable.org.suse.lists.linux.kernel>
     [not found]     ` <20030930135324.GC5507@redhat.com.suse.lists.linux.kernel>
     [not found]       ` <20030930144526.GC28876@mail.shareable.org.suse.lists.linux.kernel>
     [not found]         ` <20030930150825.GD5507@redhat.com.suse.lists.linux.kernel>
     [not found]           ` <20030930165450.GF28876@mail.shareable.org.suse.lists.linux.kernel>
     [not found]             ` <20030930172618.GE5507@redhat.com.suse.lists.linux.kernel>
2003-09-30 19:08               ` Andi Kleen
2003-09-30 20:08                 ` H. Peter Anvin
2003-09-30  7:38 Jamie Lokier
2003-09-30  8:01 ` Nick Piggin
2003-09-30 13:22 ` Dave Jones
2003-09-30 13:39   ` Jamie Lokier
2003-09-30 13:53     ` Dave Jones
2003-09-30 14:45       ` Jamie Lokier
2003-09-30 15:08         ` Dave Jones
2003-09-30 16:54           ` Jamie Lokier
2003-09-30 17:26             ` Dave Jones
2003-09-30 23:55               ` Jamie Lokier
2003-10-01  0:27                 ` Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).