* State of GPLPV tests - 28.11.11
From: Andreas Kinzler @ 2011-11-28 13:49 UTC
  To: James Harper, xen-devel

Hello James,

I am still running tests 7 days a week on two test systems. The results
are quite discouraging, though. After experiencing crash after crash, I
wanted to test whether the configuration I called "stable" (Xen 4.0.1,
GPLPV 0.11.0.213, dom0 kernel 2.6.32.18-pvops0-ak3) was indeed stable.
But even that config crashed when running my torture test. It is stable
on our production systems - running other workloads, of course.

 > One thing I thought of... virtualisation gives an interesting
 > opportunity to exaggerate race conditions. If you have 8 vCPUs in a
 > DomU but only let one or two physical CPUs service those 8 vCPUs, then
 > it can give rise to race conditions which could only be rarely seen
 > (or never seen) in normal operation. It's awful for performance but
 > if you could try that and see if it gives rise to crashes a bit
 > more frequently it might help us track down the problem.

What exactly is the config you are talking about in terms of Xen/dom0 
command line? In terms of domU config files?

As always, I monitor your mercurial repo ;-) How would you see the 
relationship of commits 952+953 to our problem? 952 seems to affect LSO 
in some way, since LsoV1TransmitComplete.TcpPayload ends up wrong (could 
it be negative, since tx_length is smaller than the fixed tx_length?). 
What about 953?

One more thought: as mentioned earlier, crashes often occurred after an 
uptime of 9-10 days, and these crashes occurred too consistently to be a 
"by chance" event. In my torture tests I am NOT using a Windows NTP 
service (I use the Meinberg NTP daemon on Windows). But in production I 
do. Can you see any possible impact here?

Regards Andreas


* Re: State of GPLPV tests - 28.11.11
From: James Harper @ 2011-11-28 23:16 UTC
  To: Andreas Kinzler, xen-devel

> Hello James,
> 
> I am still running tests 7 days a week on two test systems. The results
> are quite discouraging, though. After experiencing crash after crash, I
> wanted to test whether the configuration I called "stable" (Xen 4.0.1,
> GPLPV 0.11.0.213, dom0 kernel 2.6.32.18-pvops0-ak3) was indeed stable.
> But even that config crashed when running my torture test. It is stable
> on our production systems - running other workloads, of course.

What crash are you getting these days? Is it the same one as you used to
get?

> > One thing I thought of... virtualisation gives an interesting
> > opportunity to exaggerate race conditions. If you have 8 vCPUs in a
> > DomU but only let one or two physical CPUs service those 8 vCPUs, then
> > it can give rise to race conditions which could only be rarely seen
> > (or never seen) in normal operation. It's awful for performance but
> > if you could try that and see if it gives rise to crashes a bit
> > more frequently it might help us track down the problem.
> 
> What exactly is the config you are talking about in terms of Xen/dom0
> command line? In terms of domU config files?
> 
> What exactly is the config you are talking about in terms of Xen/dom0
> command line? In terms of domU config files?

I don't remember the exact syntax, but if you specify vcpus=4 and only
let the DomU run on one physical CPU, it might trip up more often if the
problem is caused by a race. If the problem is an arithmetic error in
xennet then it won't help.
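
From memory it would be something like this in the domU config file
(untested, and the option names are from memory too, so please check
them against your Xen version):

    vcpus = 4     # the guest sees 4 virtual CPUs
    cpus = "0"    # but they are all confined to physical CPU 0

With 4 vCPUs contending for a single physical CPU, any race window in
the drivers should open up much more often.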

> 
> As always, I monitor your mercurial repo ;-) How would you see the
> relationship of commits 952+953 to our problem? 952 seems to affect LSO
> in some way, since LsoV1TransmitComplete.TcpPayload ends up wrong (could
> it be negative, since tx_length is smaller than the fixed tx_length?).
> What about 953?

Not sure.

> One more thought: as mentioned earlier, crashes often occurred after an
> uptime of 9-10 days, and these crashes occurred too consistently to be a
> "by chance" event. In my torture tests I am NOT using a Windows NTP
> service (I use the Meinberg NTP daemon on Windows). But in production I
> do. Can you see any possible impact here?
> 

It's certainly more likely for a stray UDP packet to cause an upset, I
guess. As the packets pass through a Linux firewall (iptables in Dom0),
it's more likely that errant TCP packets will be dropped there.

Do you have a crash dump against 0.11.0.323?

James


* Re: State of GPLPV tests - 28.11.11
From: Andreas Kinzler @ 2011-11-29 17:05 UTC
  To: James Harper; +Cc: xen-devel

On 29.11.2011 00:16, James Harper wrote:
>> I am still running tests 7 days a week on two test systems. The results
>> are quite discouraging, though. After experiencing crash after crash, I
>> wanted to test whether the configuration I called "stable" (Xen 4.0.1,
>> GPLPV 0.11.0.213, dom0 kernel 2.6.32.18-pvops0-ak3) was indeed stable.
>> But even that config crashed when running my torture test. It is stable
>> on our production systems - running other workloads, of course.
> What crash are you getting these days? Is it the same one as you used to
> get?

Yes, still exactly the same crashes.

Good news: I think I have found the bug. Since I am not really a Xen or 
Windows kernel developer I cannot say for sure, but here is what I 
found:

When the domU hung I ran xentop and found out that the number of vbd 
read requests was a number like 0x7FFFzzzz in hex, which led me to a 
hypothesis: GPLPV crashes as soon as the number of disk requests reaches 
2^32. On my hardware with 5000 IOPS this is reached in
2^32 / 5000 IOPS / 3600 sec-per-hour / 24 hours-per-day = 9.94 days
And there we go: those are the 9-10 days I was always seeing.

I studied the source code of blkback/blktap/aio and found nothing. But 
in GPLPV and its use of the ring macros I found suspicious code in every 
version of GPLPV I ever used:

   while (more_to_do)
   {
     /* rsp_prod and rsp_cons are free-running 32-bit counters */
     rp = xvdd->ring.sring->rsp_prod;
     KeMemoryBarrier();
     /* '<' breaks once the counters wrap around past 2^32 */
     for (i = xvdd->ring.rsp_cons; i < rp; i++)
     {
       rep = XenVbd_GetResponse(xvdd, i);

If rp is, for example, 10 and xvdd->ring.rsp_cons is 0xFFFFFFF7, then 
the for loop is skipped, responses are not delivered, and we see the hang.
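
If my reading is right, comparing with != instead of < (which is what
the standard Xen ring macros do) would survive the wrap, because the
unsigned index simply increments through 0xFFFFFFFF back around to 10.
A sketch of what I mean (untested):

   while (more_to_do)
   {
     rp = xvdd->ring.sring->rsp_prod;
     KeMemoryBarrier();
     /* '!=' still terminates correctly after the 32-bit indices wrap */
     for (i = xvdd->ring.rsp_cons; i != rp; i++)
     {
       rep = XenVbd_GetResponse(xvdd, i);
       /* ... process the response as before ... */
     }
     xvdd->ring.rsp_cons = i;   /* presumably updated here as in the
                                   original code */
   }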

Regards Andreas


* Re: State of GPLPV tests - 28.11.11
From: James Harper @ 2011-11-29 22:39 UTC
  To: Andreas Kinzler; +Cc: xen-devel

> 
> On 29.11.2011 00:16, James Harper wrote:
> >> I am still running tests 7 days a week on two test systems. The
> >> results are quite discouraging, though. After experiencing crash
> >> after crash, I wanted to test whether the configuration I called
> >> "stable" (Xen 4.0.1, GPLPV 0.11.0.213, dom0 kernel
> >> 2.6.32.18-pvops0-ak3) was indeed stable. But even that config crashed
> >> when running my torture test. It is stable on our production systems
> >> - running other workloads, of course.
> > What crash are you getting these days? Is it the same one as you used
> > to get?
> 
> Yes, still exactly the same crashes.
> 
> Good news: I think I have found the bug. Since I am not really a Xen or
> Windows kernel developer I cannot say for sure, but here is what I
> found:
> 
> When the domU hung I ran xentop and found out that the number of vbd
> read requests was a number like 0x7FFFzzzz in hex, which led me to a
> hypothesis: GPLPV crashes as soon as the number of disk requests reaches
> 2^32. On my hardware with 5000 IOPS this is reached in
> 2^32 / 5000 IOPS / 3600 sec-per-hour / 24 hours-per-day = 9.94 days
> And there we go: those are the 9-10 days I was always seeing.
> 
> I studied the source code of blkback/blktap/aio and found nothing. But
> in GPLPV and its use of the ring macros I found suspicious code in every
> version of GPLPV I ever used:
> 
>    while (more_to_do)
>    {
>      rp = xvdd->ring.sring->rsp_prod;
>      KeMemoryBarrier();
>      for (i = xvdd->ring.rsp_cons; i < rp; i++)
>      {
>        rep = XenVbd_GetResponse(xvdd, i);
> 
> If rp is, for example, 10 and xvdd->ring.rsp_cons is 0xFFFFFFF7, then
> the for loop is skipped, responses are not delivered, and we see the
> hang.
> 

Good work! I'm impressed :)

I'll get straight on that... I must have gone wrong somewhere very early
on in development.

James


* Re: State of GPLPV tests - 28.11.11
From: Vasiliy Tolstov @ 2012-02-10  8:52 UTC
  To: James Harper; +Cc: xen-devel, Andreas Kinzler

2012/1/31 Vasiliy Tolstov <v.tolstov@selfip.ru>:
> 2012/1/31 James Harper <james.harper@bendigoit.com.au>:
>>>
>>> Sorry for bumping an old thread, but where can I find the latest
>>> signed drivers that contain all the fixes? =)
>>> http://www.meadowcourt.org/downloads/ says that the latest version was
>>> uploaded on Sunday, 10 July 2011...
>>
>> http://www.meadowcourt.org/private/<filename>
>>
>> where <filename> is one of:
>>
>> gplpv_2000_0.11.0.357_debug.msi
>> gplpv_XP_0.11.0.357_debug.msi
>> gplpv_2003x32_0.11.0.357_debug.msi
>> gplpv_2003x64_0.11.0.357_debug.msi
>> gplpv_Vista2008x32_0.11.0.357_debug.msi
>> gplpv_Vista2008x64_0.11.0.357_debug.msi
>> gplpv_2000_0.11.0.357.msi
>> gplpv_XP_0.11.0.357.msi
>> gplpv_2003x32_0.11.0.357.msi
>> gplpv_2003x64_0.11.0.357.msi
>> gplpv_Vista2008x32_0.11.0.357.msi
>> gplpv_Vista2008x64_0.11.0.357.msi
>>
>> james
>>
>


I ran some simple tests: Windows no longer BSODs, and I get good network
speed (download is about 70-80 Mb/s, upload ~40 Mb/s), but now I get
very poor disk performance =(
I don't have test results at hand, but six months ago a Windows 2008
install took about 30 minutes; now it takes about an hour. I use a
self-made WinPE image with the Xen GPL PV drivers integrated.


-- 
Vasiliy Tolstov,
Clodo.ru
e-mail: v.tolstov@selfip.ru
jabber: vase@selfip.ru

