* State of GPLPV tests - 28.11.11
From: Andreas Kinzler @ 2011-11-28 13:49 UTC (permalink / raw)
To: James Harper, xen-devel
Hello James,
I am still running tests 7 days a week on two test systems. Results are
quite discouraging though. After experiencing crash after crash I wanted
to test if the configuration I called "stable" (Xen 4.0.1, GPLPV
0.11.0.213, dom0 kernel 2.6.32.18-pvops0-ak3) was stable indeed. But
even that config crashed when running my torture test. It is stable on
our production systems - running other workloads of course.
> One thing I thought of... virtualisation gives an interesting
> opportunity to exaggerate race conditions. If you have 8 vCPU's in a
> DomU but only let one or two physical CPUs service those 8 vCPU's,then
> it can give rise to race conditions which could only be rarely seen
> (or never seen) in normal operation. It's awful for performance but
> if you could try that and see if it gives rise to crashes a bit
> more frequently it might help us track down the problem.
What exactly is the config you are talking about in terms of Xen/dom0
command line? In terms of domU config files?
As always, I monitor your mercurial repo ;-) How would you see the
relationship of commits 952+953 to our problem? 952 seems to affect LSO
in some way since LsoV1TransmitComplete.TcpPayload is finally wrong
(could it be negative since tx_length is smaller than the fixed
tx_length?). What about 953?
One more thought: As mentioned earlier crashes often occurred after an
uptime of 9-10 days and these crashes occurred too consistently to be a
"by chance" event. In my torture tests I am NOT USING a Windows NTP
service (I use the meinberg NTP daemon on Windows). But on production I
do. Can you see any possible impact here?
Regards Andreas
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: State of GPLPV tests - 28.11.11
From: James Harper @ 2011-11-28 23:16 UTC (permalink / raw)
To: Andreas Kinzler, xen-devel
> Hello James,
>
> I am still running tests 7 days a week on two test systems. Results are
> quite discouraging though. After experiencing crash after crash I wanted
> to test if the configuration I called "stable" (Xen 4.0.1, GPLPV
> 0.11.0.213, dom0 kernel 2.6.32.18-pvops0-ak3) was stable indeed. But
> even that config crashed when running my torture test. It is stable on
> our production systems - running other workloads of course.
What crash are you getting these days? Is it the same one as you used to
get?
> > One thing I thought of... virtualisation gives an interesting
> > opportunity to exaggerate race conditions. If you have 8 vCPU's in a
> > DomU but only let one or two physical CPUs service those 8 vCPU's, then
> > it can give rise to race conditions which could only be rarely seen
> > (or never seen) in normal operation. It's awful for performance but
> > if you could try that and see if it gives rise to crashes a bit
> > more frequently it might help us track down the problem.
>
> What exactly is the config you are talking about in terms of Xen/dom0
> command line? In terms of domU config files?
I don't remember the exact syntax, but if you specify vcpus=4 but only
let the DomU run on one physical cpu it might trip up more often, if the
problem is caused by a race. If the problem is an arithmetic error in
xennet then it won't help.
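For reference, a domU config fragment along these lines should reproduce
that setup (the exact pinning syntax varies between xm and xl toolstack
versions, so treat this as a sketch rather than the precise config James
had in mind):

```
# Oversubscribe: 4 vCPUs all serviced by a single physical CPU,
# to widen any race windows in the PV drivers.
# The "cpus" pinning syntax differs between xm and xl - check your docs.
vcpus = 4
cpus = "0"
```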
>
> As always, I monitor your mercurial repo ;-) How would you see the
> relationship of commits 952+953 to our problem? 952 seems to affect LSO
> in some way since LsoV1TransmitComplete.TcpPayload is finally wrong
> (could it be negative since tx_length is smaller than the fixed
> tx_length?). What about 953?
Not sure.
> One more thought: As mentioned earlier crashes often occurred after an
> uptime of 9-10 days and these crashes occurred too consistently to be a
> "by chance" event. In my torture tests I am NOT USING a Windows NTP
> service (I use the meinberg NTP daemon on Windows). But on production I
> do. Can you see any possible impact here?
>
It's certainly more likely for a stray UDP packet to cause an upset I
guess. As the packets pass through a Linux firewall (iptables in Dom0)
it's more likely that errant TCP packets will be dropped there.
Do you have a crash dump against 0.11.0.323?
James
* Re: State of GPLPV tests - 28.11.11
From: Andreas Kinzler @ 2011-11-29 17:05 UTC (permalink / raw)
To: James Harper; +Cc: xen-devel
On 29.11.2011 00:16, James Harper wrote:
>> I am still running tests 7 days a week on two test systems. Results are quite
>> discouraging though. After experiencing crash after crash I wanted to test if
>> the configuration I called "stable" (Xen 4.0.1, GPLPV 0.11.0.213, dom0 kernel
>> 2.6.32.18-pvops0-ak3) was stable indeed. But even that config crashed when
>> running my torture test. It is stable on our production systems - running
>> other workloads of course.
> What crash are you getting these days? Is it the same one as you used to
> get?
Yes, still exactly the same crashes.
Good news: I think I have found the bug. Since I am not really a
Xen or Windows kernel developer I cannot say for sure, but here is what
I found:
When the domU hung I ran xentop and found that the number of vbd read
requests was a number like 0x7FFFzzzz in hex, which led me to a hypothesis:
GPLPV crashes as soon as the number of disk requests reaches 2^32. On my
hardware with 5000 IOPS this point is reached in
2^32 / 5000 IOPS / 3600 sec-per-hour / 24 hours-per-day = 9.94 days
And there we go: those are the 9-10 days I was always seeing.
I studied the source code of blkback/blktap/aio and found nothing. But
in GPLPV and its use of the ring macros I found suspicious code in every
version of GPLPV I have ever used:
while (more_to_do)
{
    rp = xvdd->ring.sring->rsp_prod;
    KeMemoryBarrier();
    for (i = xvdd->ring.rsp_cons; i < rp; i++)
    {
        rep = XenVbd_GetResponse(xvdd, i);
If rp is now 10, for example, and xvdd->ring.rsp_cons is 0xFFFFFFF7,
then the for loop is skipped, responses are not delivered, and we see
the hang.
Regards Andreas
* Re: State of GPLPV tests - 28.11.11
From: James Harper @ 2011-11-29 22:39 UTC (permalink / raw)
To: Andreas Kinzler; +Cc: xen-devel
>
> On 29.11.2011 00:16, James Harper wrote:
> >> I am still running tests 7 days a week on two test systems. Results
> >> are quite discouraging though. After experiencing crash after crash
> >> I wanted to test if the configuration I called "stable" (Xen 4.0.1,
> >> GPLPV 0.11.0.213, dom0 kernel 2.6.32.18-pvops0-ak3) was stable
> >> indeed. But even that config crashed when running my torture test.
> >> It is stable on our production systems - running other workloads of
> >> course.
> > What crash are you getting these days? Is it the same one as you
> > used to get?
>
> Yes, still exactly the same crashes.
>
> Good news: I think I have found the bug. Since I am not really a Xen
> or Windows kernel developer I cannot say for sure, but here is what I
> found:
>
> When the domU hung I ran xentop and found that the number of vbd read
> requests was a number like 0x7FFFzzzz in hex, which led me to a
> hypothesis: GPLPV crashes as soon as the number of disk requests
> reaches 2^32. On my hardware with 5000 IOPS this point is reached in
> 2^32 / 5000 IOPS / 3600 sec-per-hour / 24 hours-per-day = 9.94 days
> And there we go: those are the 9-10 days I was always seeing.
>
> I studied the source code of blkback/blktap/aio and found nothing. But
> in GPLPV and its use of the ring macros I found suspicious code in
> every version of GPLPV I have ever used:
>
> while (more_to_do)
> {
>     rp = xvdd->ring.sring->rsp_prod;
>     KeMemoryBarrier();
>     for (i = xvdd->ring.rsp_cons; i < rp; i++)
>     {
>         rep = XenVbd_GetResponse(xvdd, i);
>
> If rp is now 10, for example, and xvdd->ring.rsp_cons is 0xFFFFFFF7,
> then the for loop is skipped, responses are not delivered, and we see
> the hang.
>
Good work! I'm impressed :)
I'll get straight on that... I must have gone wrong somewhere very early
on in development.
James
* Re: State of GPLPV tests - 28.11.11
From: Vasiliy Tolstov @ 2012-02-10 8:52 UTC (permalink / raw)
To: James Harper; +Cc: xen-devel, Andreas Kinzler
2012/1/31 Vasiliy Tolstov <v.tolstov@selfip.ru>:
> 2012/1/31 James Harper <james.harper@bendigoit.com.au>:
>>>
>>> Sorry for bumping an old thread - where can I find the latest signed
>>> drivers that contain all fixes? =) http://www.meadowcourt.org/downloads/
>>> says that the latest version was uploaded on Sunday, 10 July 2011...
>>
>> http://www.meadowcourt.org/private/<filename>
>>
>> where <filename> is one of:
>>
>> gplpv_2000_0.11.0.357_debug.msi
>> gplpv_XP_0.11.0.357_debug.msi
>> gplpv_2003x32_0.11.0.357_debug.msi
>> gplpv_2003x64_0.11.0.357_debug.msi
>> gplpv_Vista2008x32_0.11.0.357_debug.msi
>> gplpv_Vista2008x64_0.11.0.357_debug.msi
>> gplpv_2000_0.11.0.357.msi
>> gplpv_XP_0.11.0.357.msi
>> gplpv_2003x32_0.11.0.357.msi
>> gplpv_2003x64_0.11.0.357.msi
>> gplpv_Vista2008x32_0.11.0.357.msi
>> gplpv_Vista2008x64_0.11.0.357.msi
>>
>> james
>>
>
I ran some simple tests and Windows no longer gets a BSOD and shows good
network speed (download is about 70-80 Mb/s, upload ~40 Mb/s), but now
I get very poor disk performance =(
I don't have any test results right now, but six months ago a Windows
2008 install took about 30 minutes; now it takes 1 hour. I use a
self-made WinPE image with the Xen GPL PV drivers integrated.
--
Vasiliy Tolstov,
Clodo.ru
e-mail: v.tolstov@selfip.ru
jabber: vase@selfip.ru