* Reproducable data corruption on xen-unstable
@ 2005-01-30 21:09 Robin Green
  2005-01-30 22:30 ` Rik van Riel
  0 siblings, 1 reply; 23+ messages in thread
From: Robin Green @ 2005-01-30 21:09 UTC (permalink / raw)
  To: xen-devel; +Cc: riel

With the xen-unstable snapshot from today (and also the fedora-patched
one from the 25th) I am seeing lots of display corruption, weird
behaviour and crashes and hangs in X. Here is a reproducible test case
(non-deterministic, but it fails every time for me) for crashing or
incorrect behaviour, in case this is useful:

Note when I say "crashes", I'm referring to userspace crashes.

To reproduce:

1. Boot into Fedora Core 3 under Xen (see 
http://www.fedoraproject.org/wiki/FedoraXenQuickstart )
[not sure if this is necessary]

2. Disable X acceleration in Xorg.conf [not sure if this is necessary]

3. Download http://www.greenrd.org/sw/fptest/ which should be 100%
deterministic, but running under Xen-unstable, it isn't. It reads no
input, and just does lots of floating point tests.

4. Build it with ./build

5. Start up a Konsole and run ./test to run the test 100 times.

- Note it will NOT fail if you are using an xterm (presumably because they 
use different rendering techniques, and presumably the technique used by
xterm makes this memory corruption or whatever it is much less likely to 
occur).

Nor will it fail on the console. I haven't tried other terminal emulators.

Expected results: The last test should complete with no errors

Actual results: After a while, one of the test runs either crashes, or 
detects floating-point errors, or both.

None of the anomalous behaviour occurs under the same Fedora-patched 
kernel when it is not compiled for Xen.
-- 
Robin


-------------------------------------------------------
This SF.Net email is sponsored by: IntelliVIEW -- Interactive Reporting
Tool for open source databases. Create drag-&-drop reports. Save time
by over 75%! Publish reports on the web. Export to DOC, XLS, RTF, etc.
Download a FREE copy at http://www.intelliview.com/go/osdn_nl

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Reproducable data corruption on xen-unstable
  2005-01-30 21:09 Reproducable data corruption on xen-unstable Robin Green
@ 2005-01-30 22:30 ` Rik van Riel
  2005-01-31  1:03   ` Robin Green
  0 siblings, 1 reply; 23+ messages in thread
From: Rik van Riel @ 2005-01-30 22:30 UTC (permalink / raw)
  To: Robin Green; +Cc: xen-devel

On Sun, 30 Jan 2005, Robin Green wrote:

> With the xen-unstable snapshot from today (and also the fedora-patched
> one from the 25th) I am seeing lots of display corruption, weird
> behaviour and crashes and hangs in X.

I'll:
1) upgrade Fedora rawhide to the latest xen-unstable code 
2) apply the AGP patch ;)

I will let the list know when a new Fedora package with these
changes is available.

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Re: Reproducable data corruption on xen-unstable
  2005-01-30 22:30 ` Rik van Riel
@ 2005-01-31  1:03   ` Robin Green
  2005-02-02  2:54     ` Rik van Riel
  0 siblings, 1 reply; 23+ messages in thread
From: Robin Green @ 2005-01-31  1:03 UTC (permalink / raw)
  To: Rik van Riel; +Cc: xen-devel

On Sun, 30 Jan 2005, Rik van Riel wrote:

> On Sun, 30 Jan 2005, Robin Green wrote:
>
>> With the xen-unstable snapshot from today (and also the fedora-patched
>> one from the 25th) I am seeing lots of display corruption, weird
>> behaviour and crashes and hangs in X.
>
> I'll:
> 1) upgrade Fedora rawhide to the latest xen-unstable code
> 2) apply the AGP patch ;)

I was under the impression that the Xorg server I'm using (savage) 
doesn't use AGP yet, at least not in the release that's in FC3.
Er, how exactly do I tell if it's using AGP?



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Re: Reproducable data corruption on xen-unstable
  2005-01-31  1:03   ` Robin Green
@ 2005-02-02  2:54     ` Rik van Riel
  2005-02-03  1:21       ` Robin Green
  0 siblings, 1 reply; 23+ messages in thread
From: Rik van Riel @ 2005-02-02  2:54 UTC (permalink / raw)
  To: Robin Green; +Cc: xen-devel

On Sun, 30 Jan 2005, Robin Green wrote:

> I was under the impression that the Xorg server I'm using (savage) doesn't 
> use AGP yet, at least not in the release that's in FC3.
> Er, how exactly do I tell if it's using AGP?

Tonight's rawhide kernel (2.6.10-1.1120_FC4) has the
Xen agpgart patch, as well as last night's xen-unstable
source tree.  Could you please try reproducing the bug
with the latest kernel ?

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Re: Reproducable data corruption on xen-unstable
  2005-02-02  2:54     ` Rik van Riel
@ 2005-02-03  1:21       ` Robin Green
  0 siblings, 0 replies; 23+ messages in thread
From: Robin Green @ 2005-02-03  1:21 UTC (permalink / raw)
  To: Rik van Riel; +Cc: xen-devel

On Tue, 1 Feb 2005, Rik van Riel wrote:

> On Sun, 30 Jan 2005, Robin Green wrote:
>
>> I was under the impression that the Xorg server I'm using (savage) doesn't 
>> use AGP yet, at least not in the release that's in FC3.
>> Er, how exactly do I tell if it's using AGP?
>
> Tonight's rawhide kernel (2.6.10-1.1120_FC4) has the
> Xen agpgart patch, as well as last night's xen-unstable
> source tree.  Could you please try reproducing the bug
> with the latest kernel ?

It is still reproducible, with the latest Xen as well.

I also rebuilt 2.6.10-1.1121_FC4 without any AGP options enabled, and
it is still reproducible.

And whether savage or vesa X server is used, or whether NoAccel is on or 
off, it still occurs. (However, the konsole window should be quite tall or
it may not occur - mine is 800 pixels [44 lines] high.)

I also found that

  su -c 'renice 19 `/sbin/pidof X`'

makes the probability of the bug occurring higher. Counterintuitive, maybe,
but true.

If no-one else can reproduce it, I can attempt to put some assertions in 
the paranoia.c code to determine if this is main memory corruption or 
register corruption. (I suspect the latter.) Let me know if you want me to 
try this.

Just to reiterate: no, this is not a hardware error, because it does not
occur outside of Xen :)

Could this possibly be related to the other bug I found, the "APIC error
on CPU0"? That interrupt handler may be still operating, for all I know - 
my patch doesn't _disable_ it, it just shuts it up.

-- 
Robin



^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: Re: Reproducable data corruption on xen-unstable
@ 2005-02-15  4:08 Ian Pratt
  0 siblings, 0 replies; 23+ messages in thread
From: Ian Pratt @ 2005-02-15  4:08 UTC (permalink / raw)
  To: Robin Green; +Cc: Rik van Riel, xen-devel, ian.pratt

> On Sun, 6 Feb 2005, I wrote:
> > A syscall was made (connect). Immediately before the syscall, the
> > floating-point stack was empty; immediately after the syscall, the
> > floating-point stack was nonempty, and the TS flag (Task Switch) was
> > _cleared_.
>
> I now have an "easier" way to reproduce this problem. Apply the patch
> below to a xen0-kernel, which checks the FPU state against TS. What it
> basically does is:
>
>   if (TS == 0 && fpu_stack_size > 0) panic ("Corrupt FPU");
>
> An equivalent patch against a non-xen kernel yields no problems that I can
> detect, but patching a xen0-kernel with this patch, causes it to panic and
> reboot as soon as it hits the graphical login manager (in my case, kdm).
> (Of course, it might be specific to kdm, or my hardware, or who knows
> what.)

The fact that the bug is triggered when the Xserver starts makes me
suspect that the vm86 system call may have something to do with this.
Please can you find out whether your Xserver is using the vm86 bios or
vesa modules. Also instrument the vm86 syscall in linux just to make
sure. It may be possible to get the Xserver to run without those modules --
you could try moving them off the module search path.

Also, what CPU type do you compile your kernel for? I'm wondering
whether this is an AMD-specific issue.

Another place to look is the fpu_kernel_begin()/end() to see whether
they're correct.

Ian




^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: Re: Reproducable data corruption on xen-unstable
  2005-02-07  3:05   ` Robin Green
  2005-02-07  4:05     ` Robin Green
@ 2005-02-13 22:12     ` Robin Green
  1 sibling, 0 replies; 23+ messages in thread
From: Robin Green @ 2005-02-13 22:12 UTC (permalink / raw)
  To: Ian Pratt; +Cc: Rik van Riel, xen-devel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2208 bytes --]

On Sun, 6 Feb 2005, I wrote:
> A syscall was made (connect). Immediately before the syscall, the 
> floating-point stack was empty; immediately after the syscall, the 
> floating-point stack was nonempty, and the TS flag (Task Switch) was 
> _cleared_.

I now have an "easier" way to reproduce this problem. Apply the patch 
below to a xen0-kernel, which checks the FPU state against TS. What it
basically does is:

  if (TS == 0 && fpu_stack_size > 0) panic ("Corrupt FPU");

An equivalent patch against a non-xen kernel yields no problems that I can
detect, but patching a xen0-kernel with this patch, causes it to panic and
reboot as soon as it hits the graphical login manager (in my case, kdm).
(Of course, it might be specific to kdm, or my hardware, or who knows 
what.)

*** HELP WANTED! ***
If someone on a machine with a debug console could reproduce this, I'd be
most grateful. I don't have a serial console yet, so I'm a bit stuck.
********************

The logic behind this patch is, if there is something on the FPU stack 
from _another_ process, TS should be 1 to prevent data leakage between
processes. If, on the other hand, there is something on the FPU stack from 
the _same_ process being switched to, TS should still be 1, because who 
would have cleared it since it was set when that process was last 
switched away from? So, in either case, TS should be 1.
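
As a rough illustration of that invariant, here is a small userspace C
sketch that applies the same test to plain integers instead of reading
the real registers. It assumes the usual x87 conventions: TS is bit 3 of
the machine status word (the 8 tested in the patch), and the tag word
holds two bits per register, with 0b11 meaning "empty" (so a tag word of
0xFFFF, i.e. -1 as a signed short, is a completely empty stack). The
names fpu_regs_in_use and fpu_state_plausible are invented for this
example and are not kernel functions.

```c
#include <stdbool.h>
#include <stdint.h>

#define CR0_TS 0x8u   /* bit 3 of CR0 / the machine status word */

/* Count the x87 registers that are NOT tagged empty.
 * Each register has a 2-bit tag; the value 3 (0b11) means empty. */
int fpu_regs_in_use(uint16_t tag_word)
{
    int used = 0;
    for (int i = 0; i < 8; i++) {
        if (((tag_word >> (2 * i)) & 0x3u) != 0x3u)
            used++;
    }
    return used;
}

/* The condition the debugging patch relies on: whenever TS is clear,
 * the FPU stack is expected to be empty. Returns false when the
 * combination looks corrupt (TS == 0 with live entries on the stack). */
bool fpu_state_plausible(uint16_t msw, uint16_t tag_word)
{
    if (msw & CR0_TS)
        return true;    /* TS set: the next FPU use traps, so stale data is guarded */
    return fpu_regs_in_use(tag_word) == 0;
}
```

For example, fpu_state_plausible(0x0, 0x3FFF) is false: TS is clear, yet
one register carries a non-empty tag.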

Also, I was wrong in my previous post:

> So, in theory there are two possible algorithms which the kernel could be 
> supposed to be following to avoid this situation.
>
> A. Always set TS on task switch (Seems like the logical choice!)
>
> B. Always set TS on task switch - except when the FPU has not been used
> by the switched-to process, in which case do an FINIT on task switch. (This 
> seems pointlessly complicated and slow, so I doubt the kernel follows this 
> approach.)

The _actual_ algorithm appears to be:

C. Always set TS on task switch - except when the FPU has not been used
in the previous timeslice by the switched-FROM process, in which case we 
assume (incorrectly in the case of xen0-kernels, but correctly in the case
of normal kernels!) that TS must be _already_ set if the FPU is dirty.
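
Algorithm C can be sketched as a toy model in userspace C. This is not
kernel code: the structs, the function names and the "rogue clear"
scenario are all hypothetical, chosen only to show why the assumption
in C breaks down if anything clears TS without marking the current task
as an FPU user. The FPU is reduced to a single number, the count of
live stack entries.

```c
#include <stdbool.h>

/* Toy model only: every name here is hypothetical. */
struct cpu  { bool ts; int fpu_depth; };          /* fpu_depth: live x87 stack entries */
struct task { bool used_fpu; int saved_depth; };

/* First FPU use while TS is set: trap, clear TS, reload the task's own state. */
void fpu_trap(struct cpu *c, struct task *t)
{
    c->ts = false;
    c->fpu_depth = t->saved_depth;
    t->used_fpu = true;
}

/* The running task pushes n values onto the FPU stack. */
void task_fpu_push(struct cpu *c, struct task *t, int n)
{
    if (c->ts)
        fpu_trap(c, t);
    c->fpu_depth += n;
}

/* Algorithm C. Returns true if the debugging check would fire. */
bool switch_away(struct cpu *c, struct task *prev)
{
    if (prev->used_fpu) {
        prev->saved_depth = c->fpu_depth;  /* save state; registers stay as they are */
        prev->used_fpu = false;
        c->ts = true;                      /* TS now guards the stale registers */
        return false;
    }
    /* prev never touched the FPU, so TS is assumed to be set already. */
    return !c->ts && c->fpu_depth > 0;
}

/* Task A uses the FPU, task B does not, then task C uses it.
 * If rogue_clts is true, TS gets cleared behind the scheduler's back
 * while B runs. Returns the stack depth C sees after pushing one value
 * (1 is the correct answer), or -1 if the check fired and was enabled. */
int run_scenario(bool rogue_clts, bool check_enabled)
{
    struct cpu cpu = { .ts = true, .fpu_depth = 0 };
    struct task a = {0}, b = {0}, c = {0};

    task_fpu_push(&cpu, &a, 2);
    switch_away(&cpu, &a);                 /* A -> B */
    if (rogue_clts)
        cpu.ts = false;
    bool fired = switch_away(&cpu, &b);    /* B -> C */
    if (fired && check_enabled)
        return -1;
    task_fpu_push(&cpu, &c, 1);
    return cpu.fpu_depth;
}
```

In this model, run_scenario(false, true) gives task C a depth of 1, while
run_scenario(true, false) gives 3: C pushes on top of A's two leftover
entries because no trap occurs to reload its own (empty) state.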

-- 
Robin

[-- Attachment #2: task switcher debugging patch --]
[-- Type: TEXT/PLAIN, Size: 537 bytes --]

--- arch/xen/i386/kernel/process.c.orig	2005-02-12 03:39:44.000000000 +0000
+++ arch/xen/i386/kernel/process.c	2005-02-13 02:46:03.000000000 +0000
@@ -563,6 +563,14 @@
 	if (prev_p->thread_info->status & TS_USEDFPU) {
 		save_init_fpu(prev_p);
 		queue_multicall0(__HYPERVISOR_fpu_taskswitch);
+	} else {
+		short msw;
+		asm ("smsw %[msw]" : [msw] "=g"(msw));
+		if (!(msw & 8)) {
+			signed short fenv[14];
+			asm ("fstenv %[fenv]" : [fenv] "=g"(fenv));
+			if (fenv [4] != -1) panic ("corrupt FPU");
+		}
 	}
 
 	/*

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: Re: Reproducable data corruption on xen-unstable
  2005-02-07 16:50 Ian Pratt
@ 2005-02-07 17:03 ` Rik van Riel
  0 siblings, 0 replies; 23+ messages in thread
From: Rik van Riel @ 2005-02-07 17:03 UTC (permalink / raw)
  To: Ian Pratt; +Cc: Robin Green, xen-devel

On Mon, 7 Feb 2005, Ian Pratt wrote:

> No, this changeset:
>
> http://xen.bkbits.net:8080/xeno-unstable.bk/cset@1.1550.1.175?nav=index.html|ChangeSet@-1d

Doh!  No wonder I couldn't find it in the changes I
checked out this morning, since it was committed a
little bit later.

*bk pulls and restarts test rpm build*

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan



^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: Re: Reproducable data corruption on xen-unstable
@ 2005-02-07 16:50 Ian Pratt
  2005-02-07 17:03 ` Rik van Riel
  0 siblings, 1 reply; 23+ messages in thread
From: Ian Pratt @ 2005-02-07 16:50 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Robin Green, xen-devel, ian.pratt


No, this changeset:

http://xen.bkbits.net:8080/xeno-unstable.bk/cset@1.1550.1.175?nav=index.html|ChangeSet@-1d

Ian

> -----Original Message-----
> From: Rik van Riel [mailto:riel@redhat.com] 
> Sent: 07 February 2005 16:40
> To: Ian Pratt
> Cc: Robin Green; xen-devel@lists.sourceforge.net; 
> ian.pratt@cl.cam.ac.uk
> Subject: RE: [Xen-devel] Re: Reproducable data corruption on 
> xen-unstable
> 
> On Mon, 7 Feb 2005, Ian Pratt wrote:
> 
> > The vm86 'Oops' should now be fixed, although it's not Robin's real
> > problem.
> 
> This fragment ? ;)
> 
> @@ -126,7 +129,7 @@
>          if (direct_remap_area_pages(&init_mm, (unsigned long) addr, phys_addr,
>                                      size, __pgprot(_PAGE_PRESENT | _PAGE_RW |
>                                                     _PAGE_DIRTY | _PAGE_ACCESSED
> -                                                  | flags), DOMID_IO)) {
> +                                                  | flags), domid)) {
>                  vunmap((void __force *) addr);
>                  return NULL;
>          }
> 
> 
> -- 
> "Debugging is twice as hard as writing the code in the first place.
> Therefore, if you write the code as cleverly as possible, you are,
> by definition, not smart enough to debug it." - Brian W. Kernighan
> 



^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: Re: Reproducable data corruption on xen-unstable
  2005-02-07 16:26 Ian Pratt
@ 2005-02-07 16:39 ` Rik van Riel
  0 siblings, 0 replies; 23+ messages in thread
From: Rik van Riel @ 2005-02-07 16:39 UTC (permalink / raw)
  To: Ian Pratt; +Cc: Robin Green, xen-devel, ian.pratt

On Mon, 7 Feb 2005, Ian Pratt wrote:

> The vm86 'Oops' should now be fixed, although it's not Robin's real
> problem.

This fragment ? ;)

@@ -126,7 +129,7 @@
         if (direct_remap_area_pages(&init_mm, (unsigned long) addr, phys_addr,
                                     size, __pgprot(_PAGE_PRESENT | _PAGE_RW |
                                                    _PAGE_DIRTY | _PAGE_ACCESSED
-                                                  | flags), DOMID_IO)) {
+                                                  | flags), domid)) {
                 vunmap((void __force *) addr);
                 return NULL;
         }


-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan



^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: Re: Reproducable data corruption on xen-unstable
@ 2005-02-07 16:26 Ian Pratt
  2005-02-07 16:39 ` Rik van Riel
  0 siblings, 1 reply; 23+ messages in thread
From: Ian Pratt @ 2005-02-07 16:26 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Robin Green, xen-devel, ian.pratt


The vm86 'Oops' should now be fixed, although it's not Robin's real
problem.

Ian 

> -----Original Message-----
> From: Rik van Riel [mailto:riel@redhat.com] 
> Sent: 04 February 2005 02:27
> To: Ian Pratt
> Cc: Robin Green; xen-devel@lists.sourceforge.net
> Subject: RE: [Xen-devel] Re: Reproducable data corruption on 
> xen-unstable
> 
> On Fri, 4 Feb 2005, Ian Pratt wrote:
> 
> > (vm86 is not widely used, so I can believe we could have lurking bugs on
> > that path).
> 
> Confirmed, by running Dave Jones's scrashme program, inside a
> xenU domain with 32 virtual CPUs:
> 
> vm86: could not access userspace vm86_info
> Unable to handle kernel paging request at virtual address 000171d3
>   printing eip:
> 0000ec83
> *pde = ma 308ac067 pa 01709067
> *pte = ma 00000000 pa 55555000
>   [<c0108d4b>] syscall_call+0x7/0xb
> Oops: 0000 [#2]
> SMP
> Modules linked in: nfs lockd md5 ipv6 autofs4 sunrpc dm_mod
> CPU:    0
> EIP:    0855:[<0000ec83>]    Not tainted VLI
> EFLAGS: 00030f42   (2.6.10-1.1121_FC4xenU)
> EIP is at 0xec83
> eax: 64e9000b   ebx: 0038cc83   ecx: 2cd0e800   edx: 1ee9000b
> esi: 8dffffff   edi: 0038cc83   ebp: 2cc0e800   esp: c62fdf24
> ds: 0000   es: 0000   ss: 0069
> Process scrashme (pid: 19875, threadinfo=c62fc000 task=c82fd020)
> Stack: 104613c3 00008d00 00000003 00000fc2 00000001 00006693 31ff31ff f05589f6
>         0026b48d 8d000000 000027bc fe830000 8b187406 448b084d 453b40b1 890c74f0
>         07e82404 8d00047f 4601387c 7e0cfe83 74778ddd e8243489 ffff4162 c789c085
> Call Trace:
>   [<c0108d4b>] syscall_call+0x7/0xb
> Code:  Bad EIP value.
> 
> -- 
> "Debugging is twice as hard as writing the code in the first place.
> Therefore, if you write the code as cleverly as possible, you are,
> by definition, not smart enough to debug it." - Brian W. Kernighan
> 



^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: Re: Reproducable data corruption on xen-unstable
@ 2005-02-07 15:58 Ian Pratt
  0 siblings, 0 replies; 23+ messages in thread
From: Ian Pratt @ 2005-02-07 15:58 UTC (permalink / raw)
  To: Robin Green; +Cc: Rik van Riel, xen-devel, ian.pratt

> Ah, sorry, I'm already using the unstable tree, but I'd forgotten about
> the emulation. So, never mind - it's not the use of mov _, cr0 in ring 1
> that's the problem; it must be something else. Disregard that post.

It's possible the emulation is broken, so be suspicious... 

Ian



^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: Re: Reproducable data corruption on xen-unstable
@ 2005-02-07 15:41 Ian Pratt
  0 siblings, 0 replies; 23+ messages in thread
From: Ian Pratt @ 2005-02-07 15:41 UTC (permalink / raw)
  To: Robin Green; +Cc: Rik van Riel, xen-devel, ian.pratt

It would be interesting to know whether you can reproduce this in 2.0-testing.
Ian 

> -----Original Message-----
> From: Robin Green [mailto:greenrd@presidium.org] 
> Sent: 07 February 2005 15:36
> To: Ian Pratt
> Cc: Rik van Riel; xen-devel@lists.sourceforge.net; 
> ian.pratt@cl.cam.ac.uk
> Subject: RE: [Xen-devel] Re: Reproducable data corruption on 
> xen-unstable
> 
> On Mon, 7 Feb 2005, Ian Pratt wrote:
> >> A few minutes ago, I wrote:
> >> Aha! Shouldn't the stts macro in xeno-linux be calling
> >> __HYPERVISOR_fpu_taskswitch instead of trying to write to CR0 itself?
> >> Writing to CR0 directly is impossible in ring 1, isn't it?
> >
> > Please can you try the unstable tree: it has an extended instruction
> > emulator that was introduced to avoid some the code edits that some
> > people on the linux-kernel list were complaining about.
> 
> Ah, sorry, I'm already using the unstable tree, but I'd forgotten about
> the emulation. So, never mind - it's not the use of mov _, cr0 in ring 1
> that's the problem; it must be something else. Disregard that post.
> 
> -- 
> Robin
> 
> 



^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: Re: Reproducable data corruption on xen-unstable
  2005-02-07 13:05 Ian Pratt
@ 2005-02-07 15:35 ` Robin Green
  0 siblings, 0 replies; 23+ messages in thread
From: Robin Green @ 2005-02-07 15:35 UTC (permalink / raw)
  To: Ian Pratt; +Cc: Rik van Riel, xen-devel, ian.pratt

On Mon, 7 Feb 2005, Ian Pratt wrote:
>> A few minutes ago, I wrote:
>> Aha! Shouldn't the stts macro in xeno-linux be calling
>> __HYPERVISOR_fpu_taskswitch instead of trying to write to CR0 itself?
>> Writing to CR0 directly is impossible in ring 1, isn't it?
>
> Please can you try the unstable tree: it has an extended instruction
> emulator that was introduced to avoid some the code edits that some
> people on the linux-kernel list were complaining about.

Ah, sorry, I'm already using the unstable tree, but I'd forgotten about 
the emulation. So, never mind - it's not the use of mov _, cr0 in ring 1 
that's the problem; it must be something else. Disregard that post.

-- 
Robin




^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: Re: Reproducable data corruption on xen-unstable
@ 2005-02-07 13:05 Ian Pratt
  2005-02-07 15:35 ` Robin Green
  0 siblings, 1 reply; 23+ messages in thread
From: Ian Pratt @ 2005-02-07 13:05 UTC (permalink / raw)
  To: Robin Green; +Cc: Rik van Riel, xen-devel, ian.pratt


> A few minutes ago, I wrote:
> > So, it looks like we are looking for a code path in which TS doesn't end
> > up set after a task switch.
> 
> Aha! Shouldn't the stts macro in xeno-linux be calling 
> __HYPERVISOR_fpu_taskswitch instead of trying to write to CR0 itself?
> Writing to CR0 directly is impossible in ring 1, isn't it?

Please can you try the unstable tree: it has an extended instruction
emulator that was introduced to avoid some of the code edits that some
people on the linux-kernel list were complaining about.


Ian



^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: Re: Reproducable data corruption on xen-unstable
  2005-02-07  3:05   ` Robin Green
@ 2005-02-07  4:05     ` Robin Green
  2005-02-13 22:12     ` Robin Green
  1 sibling, 0 replies; 23+ messages in thread
From: Robin Green @ 2005-02-07  4:05 UTC (permalink / raw)
  To: Ian Pratt; +Cc: Rik van Riel, xen-devel

A few minutes ago, I wrote:
> So, it looks like we are looking for a code path in which TS doesn't end
> up set after a task switch.

Aha! Shouldn't the stts macro in xeno-linux be calling 
__HYPERVISOR_fpu_taskswitch instead of trying to write to CR0 itself?
Writing to CR0 directly is impossible in ring 1, isn't it?

I think I may have solved the mystery! I'll have to try that out in the 
next few days.

stts is called by _mmx_memcpy, which is called by memcpy on Athlons. That 
_might_ explain why people who aren't using Athlons haven't seen this.

-- 
Robin




^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: Re: Reproducable data corruption on xen-unstable
  2005-02-05 16:52 ` Robin Green
@ 2005-02-07  3:05   ` Robin Green
  2005-02-07  4:05     ` Robin Green
  2005-02-13 22:12     ` Robin Green
  0 siblings, 2 replies; 23+ messages in thread
From: Robin Green @ 2005-02-07  3:05 UTC (permalink / raw)
  To: Ian Pratt; +Cc: Rik van Riel, xen-devel

On Sat, 5 Feb 2005, Robin Green wrote:
> On the assumption that this _is_ an FP save/restore bug,

Update: I have narrowed down this bug.

I have confirmed that there IS definitely an FP save/restore bug 
with this kernel/xen combination (i.e. I've eliminated the possibility 
that it was just a non-floating-point-related bug)! I identified it using 
a different test case (running wget -d in a konsole), and I have established
that it is case 1 in the list of possible causes I gave, namely:

> 1. Something leaves the FPU in a state where it has bogus data in it,
>    but it won't trap to tell the kernel to restore the old, correct data

More specifically, in this particular case, according to my printf's, what 
happened was:

A syscall was made (connect). Immediately before the syscall, the 
floating-point stack was empty; immediately after the syscall, the 
floating-point stack was nonempty, and the TS flag (Task Switch) was _cleared_.
(Source code and output available on request.)

This may not immediately cause problems. But over time, it would tend to 
lead to floating-point stack overflow, which leads to floating-point 
calculations generating bogus output.
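
To put rough numbers on "over time", here is a back-of-the-envelope C
sketch. It assumes, purely for illustration, that each such event leaves
a fixed number of stale entries on the eight-slot x87 register stack and
that nothing ever pops them; neither assumption has been measured here,
and the function names are made up for the example.

```c
#define X87_STACK_REGS 8   /* the x87 register stack has eight slots */

/* Nonzero if a computation that needs `needed` free slots would
 * overflow when `stale` slots are already occupied by leftovers. */
int x87_would_overflow(int stale, int needed)
{
    return stale + needed > X87_STACK_REGS;
}

/* Hypothetical: count how many leak events, each leaving `per_event`
 * stale entries, happen before such a computation first overflows.
 * Returns -1 if per_event is not positive (the stack never fills up). */
int leaks_until_overflow(int per_event, int needed)
{
    if (per_event <= 0)
        return -1;

    int stale = 0, events = 0;
    while (!x87_would_overflow(stale, needed)) {
        stale += per_event;
        events++;
    }
    return events;
}
```

Under these assumptions, a program whose expressions need three stack
slots keeps working through the first five single-entry leaks and starts
overflowing after the sixth, which would fit the observation that
failures only show up after a while.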

So, in theory there are two possible algorithms which the kernel could be 
supposed to be following to avoid this situation.

A. Always set TS on task switch (Seems like the logical choice!)

B. Always set TS on task switch - except when the FPU has not been used
by the switched-to process, in which case do an FINIT on task switch. 
(This seems pointlessly complicated and slow, so I doubt the kernel 
follows this approach.)

So, it looks like we are looking for a code path in which TS doesn't end
up set after a task switch. (And it might be specifically to do with
syscalls.)

I will look for one - but does anyone have any ideas for what that code 
path might be, or how I could efficiently debug the kernel (while in X, 
remember, because this doesn't seem to occur in text mode!) to find out 
what that code path is? I don't have a serial console.

-- 
Robin



^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: Re: Reproducable data corruption on xen-unstable
  2005-02-05 17:47 Ian Pratt
@ 2005-02-05 18:52 ` Robin Green
  0 siblings, 0 replies; 23+ messages in thread
From: Robin Green @ 2005-02-05 18:52 UTC (permalink / raw)
  To: Ian Pratt; +Cc: Rik van Riel, xen-devel, ian.pratt

On Sat, 5 Feb 2005, Ian Pratt wrote:
> What happens if you run two copies of fptest in parallel on the text
> console?

They work perfectly.

> Your
> problem seems to be quite specific to also having the framebuffer
> active.

If by "framebuffer" you mean "kernel fb driver", it doesn't use that.
The X server just talks to the graphics card directly.

> What kernel modules is your X server using?

As far as I know, it isn't using any. I've specifically deselected agp
at kernel compile time.

My suspicion is that there is some unusual code path where the FP 
save/restore doesn't work, and the fact of konsole doing large amounts
of text rendering (which, I believe, involves FP calculations) and/or
scrolling, makes this code path more likely. (The window that is rendering
doesn't have to be the tty of fptest. Any konsole window that's displaying 
a large amount of scrolling text is enough to trigger it.)

-- 
Robin



^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: Re: Reproducable data corruption on xen-unstable
@ 2005-02-05 17:47 Ian Pratt
  2005-02-05 18:52 ` Robin Green
  0 siblings, 1 reply; 23+ messages in thread
From: Ian Pratt @ 2005-02-05 17:47 UTC (permalink / raw)
  To: Robin Green; +Cc: Rik van Riel, xen-devel, ian.pratt

> On the assumption that this _is_ an FP save/restore bug, I was trying to
> find the FP save/restore code in the xen-patched kernel.

It's likely to be quite specific to either your CPU, or other hardware
drivers. If it was a general problem it would almost certainly have been
spotted by someone else.

What happens if you run two copies of fptest in parallel on the text
console? This would fail if it was a general fpsave/restore bug. Your
problem seems to be quite specific to also having the framebuffer
active. What kernel modules is your X server using?
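
The two-copies-in-parallel check suggested above can be approximated without fptest itself. What follows is only a minimal sketch, not the fptest/paranoia program: the function names (fp_workload, run_parallel_fp_check) and the workload are made up for illustration. It forks several processes that each repeat the same deterministic floating-point loop and compare the result bit-for-bit against a reference computed by the parent, so any cross-process FP register corruption should show up as a mismatch.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical deterministic FP workload: the same iteration count must
 * always produce a bit-identical result on a healthy system. */
double fp_workload(long iters)
{
    volatile double acc = 1.0;  /* volatile forces a store each iteration */
    for (long i = 1; i <= iters; i++)
        acc = acc * 1.000000119 + 1.0 / (double)i;
    return acc;
}

/* Child body: recompute the workload 'rounds' times and return 1 on the
 * first result that differs from the reference, 0 otherwise. */
static int fp_child(double expected, long iters, int rounds)
{
    for (int r = 0; r < rounds; r++) {
        double got = fp_workload(iters);
        if (memcmp(&got, &expected, sizeof got) != 0)
            return 1;
    }
    return 0;
}

/* Run nproc copies concurrently. Returns the number of children that saw
 * a mismatch or exited abnormally, or -1 if not all children could be
 * started. */
int run_parallel_fp_check(int nproc, long iters, int rounds)
{
    double expected = fp_workload(iters);  /* reference value */
    int started = 0, failures = 0;

    for (int i = 0; i < nproc; i++) {
        pid_t pid = fork();
        if (pid < 0)
            break;
        if (pid == 0)
            _exit(fp_child(expected, iters, rounds));
        started++;
    }
    for (int i = 0; i < started; i++) {
        int status;
        if (wait(&status) < 0 || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
            failures++;
    }
    return started == nproc ? failures : -1;
}
```

Calling run_parallel_fp_check(2, 1000000, 1000) from a small main() on the text console, and again while a konsole window is scrolling, would give a rough comparison of the two situations; a non-zero return means at least one copy saw a wrong result or crashed.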

Ian



^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: Re: Reproducable data corruption on xen-unstable
  2005-02-04  0:58 Ian Pratt
  2005-02-04  2:26 ` Rik van Riel
@ 2005-02-05 16:52 ` Robin Green
  2005-02-07  3:05   ` Robin Green
  1 sibling, 1 reply; 23+ messages in thread
From: Robin Green @ 2005-02-05 16:52 UTC (permalink / raw)
  To: Ian Pratt; +Cc: Rik van Riel, xen-devel

On Fri, 4 Feb 2005, Ian Pratt wrote:
>
> If it occurred just under the 'vesa' X server I'd be very suspicious
> that we had an FP save/restore bug in the vm86 support code. I'm not
> sure whether the savage server uses vm86 or not. Probably not.

It doesn't, according to strace.

On the assumption that this _is_ an FP save/restore bug, I was trying to 
find the FP save/restore code in the xen-patched kernel. I'm not familiar 
with low-level kernel issues like this. What I am trying to find is, where
does fxrstor (or whatever) get invoked from, when the kernel is doing a 
normal user-process-to-user-process context switch? As far as I can tell,
the idea seems to be that you don't bother to restore the FPU state
immediately - if and when the new process tries to access the FPU for the 
first time, the CPU automatically generates a trap, and only then does 
Linux restore the saved FPU state for that process - apparently in
arch/xen/i386/kernel/entry.S under ENTRY(device_not_available), if my guess
is right.

Is that correct?
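
The lazy-restore scheme described above can be sketched as a toy userspace model. This is only an illustration under stated assumptions, not the kernel's code: every name here (toy_cpu, toy_task, toy_switch, toy_fpu_trap, toy_fpu_add, toy_demo) is hypothetical, a single double stands in for the whole FPU register file, and the skip_save flag exists purely to show what a forgotten save on context switch would look like.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy "hardware": one FPU register plus a TS (task switched) flag. */
struct toy_cpu {
    double fpu_reg;  /* stands in for the whole FPU register file */
    bool ts;         /* when set, the next FPU use traps */
    int owner;       /* task whose state is live in fpu_reg, -1 if none */
};

struct toy_task {
    double saved_fpu;  /* per-task save area */
    bool used_fpu;     /* task touched the FPU since it was switched in */
};

/* Trap on first FPU use after a switch: clear TS, restore saved state. */
void toy_fpu_trap(struct toy_cpu *cpu, struct toy_task *tasks, int cur)
{
    cpu->ts = false;
    cpu->fpu_reg = tasks[cur].saved_fpu;
    cpu->owner = cur;
    tasks[cur].used_fpu = true;
}

/* Context switch away from prev: save its state only if it used the FPU,
 * then set TS so the next task traps on its first FPU instruction.
 * skip_save simulates a bug where the save is forgotten. */
void toy_switch(struct toy_cpu *cpu, struct toy_task *tasks, int prev,
                bool skip_save)
{
    if (tasks[prev].used_fpu) {
        if (!skip_save)
            tasks[prev].saved_fpu = cpu->fpu_reg;
        tasks[prev].used_fpu = false;
    }
    cpu->ts = true;
}

/* One FPU instruction executed by task cur: add delta to the register. */
double toy_fpu_add(struct toy_cpu *cpu, struct toy_task *tasks, int cur,
                   double delta)
{
    if (cpu->ts)
        toy_fpu_trap(cpu, tasks, cur);
    assert(cpu->owner == cur);  /* live state must belong to this task */
    cpu->fpu_reg += delta;
    return cpu->fpu_reg;
}

/* Interleave two tasks and return what task 0 computes at the end. */
double toy_demo(bool skip_save)
{
    struct toy_cpu cpu = { 0.0, true, -1 };
    struct toy_task tasks[2] = { { 1.0, false }, { 100.0, false } };

    toy_fpu_add(&cpu, tasks, 0, 1.0);       /* task 0: 1 + 1 */
    toy_switch(&cpu, tasks, 0, skip_save);  /* switch 0 -> 1 */
    toy_fpu_add(&cpu, tasks, 1, 1.0);       /* task 1: 100 + 1 */
    toy_switch(&cpu, tasks, 1, false);      /* switch 1 -> 0 */
    return toy_fpu_add(&cpu, tasks, 0, 1.0);
}
```

In this model toy_demo(false) gives task 0 the value it expects, while toy_demo(true) hands it a stale value because its earlier work was never saved. A real bug would be far less tidy, but the symptom (a process silently seeing wrong FP state after being switched out) would be of the same kind.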

If that's the case, then, again still working on the assumption that it's 
an FPU state bug, looks like it could only be one of the following:

  1. Something leaves the FPU in a state where it has bogus data in it,
     but it won't trap to tell the kernel to restore the old, correct data
  2. Something forgets to save the FPU state when context-switching from
     one userland process to another
  3. Something is overwriting the saved FPU state with bogus data (seems
     unlikely)
  4. Something in the kernel or in Xen is using the FPU (extremely
     unlikely, since both appear to now be compiled with soft-math).

Is my reasoning sound? I'm a _little_ out of my depth here!

-- 
Robin



^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: Re: Reproducable data corruption on xen-unstable
@ 2005-02-04  2:59 Ian Pratt
  0 siblings, 0 replies; 23+ messages in thread
From: Ian Pratt @ 2005-02-04  2:59 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Robin Green, xen-devel

 
> > (vm86 is not widely used, so I can believe we could have lurking bugs on
> > that path).
> 
> Confirmed, by running Dave Jones's scrashme program, inside a
> xenU domain with 32 virtual CPUs:

Even with a single vcpu it's easy to cause an Oops with scrashme.
However, an fptest running in another process at the same time doesn't
seem to experience register corruption or anything else nasty.

(http://www.codemonkey.org.uk/projects/scrashme/scrashme-1.0.tar.gz)

./scrashme -r -c113 >/dev/null	(vm86old)
./scrashme -r -c166 >/dev/null	(vm86)

It shouldn't be too hard to fix the Oops, but I'm not seeing anything
that would explain Robin's problem.

Ian




^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: Re: Reproducable data corruption on xen-unstable
  2005-02-04  0:58 Ian Pratt
@ 2005-02-04  2:26 ` Rik van Riel
  2005-02-05 16:52 ` Robin Green
  1 sibling, 0 replies; 23+ messages in thread
From: Rik van Riel @ 2005-02-04  2:26 UTC (permalink / raw)
  To: Ian Pratt; +Cc: Robin Green, xen-devel

On Fri, 4 Feb 2005, Ian Pratt wrote:

> (vm86 is not widely used, so I can believe we could have lurking bugs on
> that path).

Confirmed, by running Dave Jones's scrashme program, inside a
xenU domain with 32 virtual CPUs:

vm86: could not access userspace vm86_info
Unable to handle kernel paging request at virtual address 000171d3
  printing eip:
0000ec83
*pde = ma 308ac067 pa 01709067
*pte = ma 00000000 pa 55555000
  [<c0108d4b>] syscall_call+0x7/0xb
Oops: 0000 [#2]
SMP
Modules linked in: nfs lockd md5 ipv6 autofs4 sunrpc dm_mod
CPU:    0
EIP:    0855:[<0000ec83>]    Not tainted VLI
EFLAGS: 00030f42   (2.6.10-1.1121_FC4xenU)
EIP is at 0xec83
eax: 64e9000b   ebx: 0038cc83   ecx: 2cd0e800   edx: 1ee9000b
esi: 8dffffff   edi: 0038cc83   ebp: 2cc0e800   esp: c62fdf24
ds: 0000   es: 0000   ss: 0069
Process scrashme (pid: 19875, threadinfo=c62fc000 task=c82fd020)
Stack: 104613c3 00008d00 00000003 00000fc2 00000001 00006693 31ff31ff f05589f6
        0026b48d 8d000000 000027bc fe830000 8b187406 448b084d 453b40b1 890c74f0
        07e82404 8d00047f 4601387c 7e0cfe83 74778ddd e8243489 ffff4162 c789c085
Call Trace:
  [<c0108d4b>] syscall_call+0x7/0xb
Code:  Bad EIP value.

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan



^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: Re: Reproducable data corruption on xen-unstable
@ 2005-02-04  0:58 Ian Pratt
  2005-02-04  2:26 ` Rik van Riel
  2005-02-05 16:52 ` Robin Green
  0 siblings, 2 replies; 23+ messages in thread
From: Ian Pratt @ 2005-02-04  0:58 UTC (permalink / raw)
  To: Robin Green, Rik van Riel; +Cc: xen-devel


> And whether savage or vesa X server is used, or whether NoAccel is on or
> off, it still occurs. (However, the konsole window should be quite tall or
> it may not occur - mine is 800 pixels [44 lines] high.)

If it occurred just under the 'vesa' X server I'd be very suspicious
that we had an FP save/restore bug in the vm86 support code. I'm not
sure whether the savage server uses vm86 or not. Probably not.

I'd certainly be very interested to hear if anyone else running the vesa
X server can reproduce the problem using the fptest/paranoia program.
(vm86 is not widely used, so I can believe we could have lurking bugs on
that path).

> Could this possibly be related to the other bug I found, the "APIC error
> on CPU0"? That interrupt handler may be still operating, for all I know -
> my patch doesn't _disable_ it, it just shuts it up.

Xen certainly doesn't sound too happy on your machine....


Ian



^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2005-02-15  4:08 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
2005-01-30 21:09 Reproducable data corruption on xen-unstable Robin Green
2005-01-30 22:30 ` Rik van Riel
2005-01-31  1:03   ` Robin Green
2005-02-02  2:54     ` Rik van Riel
2005-02-03  1:21       ` Robin Green
2005-02-04  0:58 Ian Pratt
2005-02-04  2:26 ` Rik van Riel
2005-02-05 16:52 ` Robin Green
2005-02-07  3:05   ` Robin Green
2005-02-07  4:05     ` Robin Green
2005-02-13 22:12     ` Robin Green
2005-02-04  2:59 Ian Pratt
2005-02-05 17:47 Ian Pratt
2005-02-05 18:52 ` Robin Green
2005-02-07 13:05 Ian Pratt
2005-02-07 15:35 ` Robin Green
2005-02-07 15:41 Ian Pratt
2005-02-07 15:58 Ian Pratt
2005-02-07 16:26 Ian Pratt
2005-02-07 16:39 ` Rik van Riel
2005-02-07 16:50 Ian Pratt
2005-02-07 17:03 ` Rik van Riel
2005-02-15  4:08 Ian Pratt
