* [Qemu-devel] Migration sometimes fails with IDE and Qemu 2.2.1
@ 2015-04-06 18:47 Peter Lieven
  2015-04-06 18:50 ` [Qemu-devel] [Qemu-block] " John Snow
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Lieven @ 2015-04-06 18:47 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel

Hi all,

is there a known issue in Qemu 2.2.1 where IDE sometimes stalls after a migration?
The migration succeeds, but it seems that all I/O hangs afterwards. This happens only sometimes
and only with extremely old Linux guests (SLES 10 with kernel 2.6.16), which use the IDE controller as
the storage controller for the system disk.

Maybe this sounds familiar to someone.

Thank you,
Peter

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [Qemu-block] Migration sometimes fails with IDE and Qemu 2.2.1
  2015-04-06 18:47 [Qemu-devel] Migration sometimes fails with IDE and Qemu 2.2.1 Peter Lieven
@ 2015-04-06 18:50 ` John Snow
  2015-04-06 19:02   ` Peter Lieven
  0 siblings, 1 reply; 28+ messages in thread
From: John Snow @ 2015-04-06 18:50 UTC (permalink / raw)
  To: Peter Lieven, qemu-block; +Cc: qemu-devel



On 04/06/2015 02:47 PM, Peter Lieven wrote:
> Hi all,
>
> is there a known issue in Qemu 2.2.1 where IDE stalls sometimes after a migration with Qemu 2.2.1?
> The migration succeeds, but it seems that the complete I/O is hanging. This happens only sometimes
> and only with extreme old Linux Guests (SLES 10 with Kernel 2.6.16) thus the IDE controller as
> storage controller for the system disk.
>
> Maybe this sounds familiar to someone.
>
> Thank you,
> Peter
>

It's news to me.

Is this a regression?
Any particular workload or reproducer?

--js


* Re: [Qemu-devel] [Qemu-block] Migration sometimes fails with IDE and Qemu 2.2.1
  2015-04-06 18:50 ` [Qemu-devel] [Qemu-block] " John Snow
@ 2015-04-06 19:02   ` Peter Lieven
  2015-04-06 19:10     ` Peter Lieven
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Lieven @ 2015-04-06 19:02 UTC (permalink / raw)
  To: John Snow, qemu-block; +Cc: qemu-devel

On 06.04.2015 at 20:50, John Snow wrote:
>
>
> On 04/06/2015 02:47 PM, Peter Lieven wrote:
>> Hi all,
>>
>> is there a known issue in Qemu 2.2.1 where IDE stalls sometimes after a migration with Qemu 2.2.1?
>> The migration succeeds, but it seems that the complete I/O is hanging. This happens only sometimes
>> and only with extreme old Linux Guests (SLES 10 with Kernel 2.6.16) thus the IDE controller as
>> storage controller for the system disk.
>>
>> Maybe this sounds familiar to someone.
>>
>> Thank you,
>> Peter
>>
>
> It's news to me.

Okay, I was hoping that I just missed a patch or someone forgot to CC qemu-stable... :-)

>
> Is this a regression?

I can't say for sure. We have seen those vServers sometimes hang after migration ever since we changed the hypervisor from qemu-kvm-1.2.0 to qemu-2.2.0.


> Any particular workload or reproducer?

Workload is almost zero. I am trying to figure out whether there is a way to trigger it.

Possibly relevant: the machine type is -M pc-1.2 and we set -kvmclock as a
CPU flag, since kvmclock seemed to be quite buggy in 2.6.16...

Exact cmdline is:
/usr/bin/qemu-2.2.1  -enable-kvm  -M pc-1.2  -nodefaults -netdev type=tap,id=guest2,script=no,downscript=no,ifname=tap2  -device e1000,netdev=guest2,mac=52:54:00:ff:00:65 -drive format=raw,file=iscsi://172.21.200.53/iqn.2001-05.com.equallogic:4-52aed6-88a7e99a4-d9e00040fdc509a3-XXX-hd0/0,if=ide,cache=writeback,aio=native  -serial null  -parallel null  -m 1024 -smp 2,sockets=1,cores=2,threads=1  -monitor tcp:0:4003,server,nowait -vnc :3 -qmp tcp:0:3003,server,nowait -name 'XXX' -boot order=c,once=dc,menu=off  -drive index=2,media=cdrom,if=ide,cache=unsafe,aio=native,readonly=on  -k de  -incoming tcp:0:5003  -pidfile /var/run/qemu/vm-146.pid  -mem-path /hugepages  -mem-prealloc  -rtc base=utc -usb -usbdevice tablet -no-hpet -vga cirrus  -cpu qemu64,-kvmclock

Exact kernel is:
2.6.16.46-0.12-smp (I think this is SLES 10 or something similar.)

The machine does not hang; it seems that just I/O is hanging. You can still type at the console or ping the system, but can no longer log in.

Thank you,
Peter


* Re: [Qemu-devel] [Qemu-block] Migration sometimes fails with IDE and Qemu 2.2.1
  2015-04-06 19:02   ` Peter Lieven
@ 2015-04-06 19:10     ` Peter Lieven
  2015-04-07  8:43       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Lieven @ 2015-04-06 19:10 UTC (permalink / raw)
  To: John Snow, qemu-block; +Cc: qemu-devel

On 06.04.2015 at 21:02, Peter Lieven wrote:
> Am 06.04.2015 um 20:50 schrieb John Snow:
>>
>> On 04/06/2015 02:47 PM, Peter Lieven wrote:
>>> Hi all,
>>>
>>> is there a known issue in Qemu 2.2.1 where IDE stalls sometimes after a migration with Qemu 2.2.1?
>>> The migration succeeds, but it seems that the complete I/O is hanging. This happens only sometimes
>>> and only with extreme old Linux Guests (SLES 10 with Kernel 2.6.16) thus the IDE controller as
>>> storage controller for the system disk.
>>>
>>> Maybe this sounds familiar to someone.
>>>
>>> Thank you,
>>> Peter
>>>
>> It's news to me.
> Okay, I was hoping that I just missed a patch or someone forgot to CC qemu-stable... :-)
>
>> Is this a regression?
> I can't say we see those vServers hang sometime after migration since we changed the hypervisor from qemu-kvm-1.2.0 to qemu-2.2.0.
>
>
>> Any particular workload or reproducer?
> Workload is almost zero. I try to figure out if there is a way to trigger it.
>
> Maybe playing a role: Machine type is -M pc1.2 and we set -kvmclock as
> CPU flag since kvmclock seemed to be quite buggy in 2.6.16...
>
> Exact cmdline is:
> /usr/bin/qemu-2.2.1  -enable-kvm  -M pc-1.2  -nodefaults -netdev type=tap,id=guest2,script=no,downscript=no,ifname=tap2  -device e1000,netdev=guest2,mac=52:54:00:ff:00:65 -drive format=raw,file=iscsi://172.21.200.53/iqn.2001-05.com.equallogic:4-52aed6-88a7e99a4-d9e00040fdc509a3-XXX-hd0/0,if=ide,cache=writeback,aio=native  -serial null  -parallel null  -m 1024 -smp 2,sockets=1,cores=2,threads=1  -monitor tcp:0:4003,server,nowait -vnc :3 -qmp tcp:0:3003,server,nowait -name 'XXX' -boot order=c,once=dc,menu=off  -drive index=2,media=cdrom,if=ide,cache=unsafe,aio=native,readonly=on  -k de  -incoming tcp:0:5003  -pidfile /var/run/qemu/vm-146.pid  -mem-path /hugepages  -mem-prealloc  -rtc base=utc -usb -usbdevice tablet -no-hpet -vga cirrus  -cpu qemu64,-kvmclock
>
> Exact kernel is:
> 2.6.16.46-0.12-smp (i think this is SLES10 or sth.)
>
> The machine does not hang. It seems just I/O is hanging. So you can type at the console or ping the system, but no longer login.
>
> Thank you,
> Peter

Interesting observation: Migrating the vServer again seems to fix the problem (at least in the one case I could test just now).

2.6.8-24-smp is also affected.

Thanks,
Peter


* Re: [Qemu-devel] [Qemu-block] Migration sometimes fails with IDE and Qemu 2.2.1
  2015-04-06 19:10     ` Peter Lieven
@ 2015-04-07  8:43       ` Dr. David Alan Gilbert
  2015-04-07 15:11         ` Peter Lieven
  0 siblings, 1 reply; 28+ messages in thread
From: Dr. David Alan Gilbert @ 2015-04-07  8:43 UTC (permalink / raw)
  To: Peter Lieven; +Cc: John Snow, qemu-devel, qemu-block

* Peter Lieven (pl@kamp.de) wrote:
> Am 06.04.2015 um 21:02 schrieb Peter Lieven:
> > Am 06.04.2015 um 20:50 schrieb John Snow:
> >>
> >> On 04/06/2015 02:47 PM, Peter Lieven wrote:
> >>> Hi all,
> >>>
> >>> is there a known issue in Qemu 2.2.1 where IDE stalls sometimes after a migration with Qemu 2.2.1?
> >>> The migration succeeds, but it seems that the complete I/O is hanging. This happens only sometimes
> >>> and only with extreme old Linux Guests (SLES 10 with Kernel 2.6.16) thus the IDE controller as
> >>> storage controller for the system disk.
> >>>
> >>> Maybe this sounds familiar to someone.
> >>>
> >>> Thank you,
> >>> Peter
> >>>
> >> It's news to me.
> > Okay, I was hoping that I just missed a patch or someone forgot to CC qemu-stable... :-)
> >
> >> Is this a regression?
> > I can't say we see those vServers hang sometime after migration since we changed the hypervisor from qemu-kvm-1.2.0 to qemu-2.2.0.
> >
> >
> >> Any particular workload or reproducer?
> > Workload is almost zero. I try to figure out if there is a way to trigger it.
> >
> > Maybe playing a role: Machine type is -M pc1.2 and we set -kvmclock as
> > CPU flag since kvmclock seemed to be quite buggy in 2.6.16...
> >
> > Exact cmdline is:
> > /usr/bin/qemu-2.2.1  -enable-kvm  -M pc-1.2  -nodefaults -netdev type=tap,id=guest2,script=no,downscript=no,ifname=tap2  -device e1000,netdev=guest2,mac=52:54:00:ff:00:65 -drive format=raw,file=iscsi://172.21.200.53/iqn.2001-05.com.equallogic:4-52aed6-88a7e99a4-d9e00040fdc509a3-XXX-hd0/0,if=ide,cache=writeback,aio=native  -serial null  -parallel null  -m 1024 -smp 2,sockets=1,cores=2,threads=1  -monitor tcp:0:4003,server,nowait -vnc :3 -qmp tcp:0:3003,server,nowait -name 'XXX' -boot order=c,once=dc,menu=off  -drive index=2,media=cdrom,if=ide,cache=unsafe,aio=native,readonly=on  -k de  -incoming tcp:0:5003  -pidfile /var/run/qemu/vm-146.pid  -mem-path /hugepages  -mem-prealloc  -rtc base=utc -usb -usbdevice tablet -no-hpet -vga cirrus  -cpu qemu64,-kvmclock
> >
> > Exact kernel is:
> > 2.6.16.46-0.12-smp (i think this is SLES10 or sth.)
> >
> > The machine does not hang. It seems just I/O is hanging. So you can type at the console or ping the system, but no longer login.
> >
> > Thank you,
> > Peter
> 
> Interesting observation: Migrating the vServer again seems to fix to problem (at least in one case I could test just now).
> 
> 2.6.8-24-smp is also affected.

How often does it fail - you say 'sometimes' - is it a 1/10 or a 1/1000 ?

I'm not sure at which kernel version the switch happened, but newer kernels use some
code shared with the newer SATA world (libata?), whereas older kernels had
separate IDE code, so the behaviour of the two can be quite different.

Dave

> Thanks,
> Peter
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [Qemu-block] Migration sometimes fails with IDE and Qemu 2.2.1
  2015-04-07  8:43       ` Dr. David Alan Gilbert
@ 2015-04-07 15:11         ` Peter Lieven
  2015-04-07 15:14           ` Paolo Bonzini
  2015-04-07 15:29           ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 28+ messages in thread
From: Peter Lieven @ 2015-04-07 15:11 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: Paolo Bonzini, John Snow, qemu-devel, qemu-block

Hi David,

On 07.04.2015 at 10:43, Dr. David Alan Gilbert wrote:
>>>> Any particular workload or reproducer?
>>> Workload is almost zero. I try to figure out if there is a way to trigger it.
>>>
>>> Maybe playing a role: Machine type is -M pc1.2 and we set -kvmclock as
>>> CPU flag since kvmclock seemed to be quite buggy in 2.6.16...
>>>
>>> Exact cmdline is:
>>> /usr/bin/qemu-2.2.1  -enable-kvm  -M pc-1.2  -nodefaults -netdev type=tap,id=guest2,script=no,downscript=no,ifname=tap2  -device e1000,netdev=guest2,mac=52:54:00:ff:00:65 -drive format=raw,file=iscsi://172.21.200.53/iqn.2001-05.com.equallogic:4-52aed6-88a7e99a4-d9e00040fdc509a3-XXX-hd0/0,if=ide,cache=writeback,aio=native  -serial null  -parallel null  -m 1024 -smp 2,sockets=1,cores=2,threads=1  -monitor tcp:0:4003,server,nowait -vnc :3 -qmp tcp:0:3003,server,nowait -name 'XXX' -boot order=c,once=dc,menu=off  -drive index=2,media=cdrom,if=ide,cache=unsafe,aio=native,readonly=on  -k de  -incoming tcp:0:5003  -pidfile /var/run/qemu/vm-146.pid  -mem-path /hugepages  -mem-prealloc  -rtc base=utc -usb -usbdevice tablet -no-hpet -vga cirrus  -cpu qemu64,-kvmclock
>>>
>>> Exact kernel is:
>>> 2.6.16.46-0.12-smp (i think this is SLES10 or sth.)
>>>
>>> The machine does not hang. It seems just I/O is hanging. So you can type at the console or ping the system, but no longer login.
>>>
>>> Thank you,
>>> Peter
>> Interesting observation: Migrating the vServer again seems to fix to problem (at least in one case I could test just now).
>>
>> 2.6.8-24-smp is also affected.
> How often does it fail - you say 'sometimes' - is it a 1/10 or a 1/1000 ?
It's more often than 1/10, I would say.

>
> I'm not sure at what kernel version the switch is, but newer kernels use some
> code shared with the newer SATA world (libata?)  where as older kernels had
> separate IDE code, so the behaviour of the two can be quite different.

That's a good point. I will check what the kernels have.
I remember that there was something like a problem with error handling in
the old drivers? Paolo, you worked a lot on IDE lately. Do you remember?

Thanks,
Peter


* Re: [Qemu-devel] [Qemu-block] Migration sometimes fails with IDE and Qemu 2.2.1
  2015-04-07 15:11         ` Peter Lieven
@ 2015-04-07 15:14           ` Paolo Bonzini
  2015-04-07 18:54             ` Peter Lieven
  2015-04-07 15:29           ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 28+ messages in thread
From: Paolo Bonzini @ 2015-04-07 15:14 UTC (permalink / raw)
  To: Peter Lieven, Dr. David Alan Gilbert; +Cc: John Snow, qemu-devel, qemu-block



On 07/04/2015 17:11, Peter Lieven wrote:
> > I'm not sure at what kernel version the switch is, but newer kernels use some
> > code shared with the newer SATA world (libata?)  where as older kernels had
> > separate IDE code, so the behaviour of the two can be quite different.
>
> Thats a good point. I will check what the kernels have.
> I remember that there was sth like a problem with error handling in
> the old drivers? Paolo, you worked a lot on IDE lately. Do you remember?

No, I don't know...  I didn't work a lot on IDE, those patches were all
several years old and finally John shepherded them in. :)

Still, what David says is correct.  At least RHEL5's 2.6.18 defaulted to
the old IDE drivers (/dev/hdXNN).

Paolo


* Re: [Qemu-devel] [Qemu-block] Migration sometimes fails with IDE and Qemu 2.2.1
  2015-04-07 15:11         ` Peter Lieven
  2015-04-07 15:14           ` Paolo Bonzini
@ 2015-04-07 15:29           ` Dr. David Alan Gilbert
  2015-04-07 18:44             ` Peter Lieven
  1 sibling, 1 reply; 28+ messages in thread
From: Dr. David Alan Gilbert @ 2015-04-07 15:29 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Paolo Bonzini, John Snow, qemu-devel, qemu-block

* Peter Lieven (pl@kamp.de) wrote:
> Hi David,
> 
> Am 07.04.2015 um 10:43 schrieb Dr. David Alan Gilbert:
> >>>> Any particular workload or reproducer?
> >>> Workload is almost zero. I try to figure out if there is a way to trigger it.
> >>>
> >>> Maybe playing a role: Machine type is -M pc1.2 and we set -kvmclock as
> >>> CPU flag since kvmclock seemed to be quite buggy in 2.6.16...
> >>>
> >>> Exact cmdline is:
> >>> /usr/bin/qemu-2.2.1  -enable-kvm  -M pc-1.2  -nodefaults -netdev type=tap,id=guest2,script=no,downscript=no,ifname=tap2  -device e1000,netdev=guest2,mac=52:54:00:ff:00:65 -drive format=raw,file=iscsi://172.21.200.53/iqn.2001-05.com.equallogic:4-52aed6-88a7e99a4-d9e00040fdc509a3-XXX-hd0/0,if=ide,cache=writeback,aio=native  -serial null  -parallel null  -m 1024 -smp 2,sockets=1,cores=2,threads=1  -monitor tcp:0:4003,server,nowait -vnc :3 -qmp tcp:0:3003,server,nowait -name 'XXX' -boot order=c,once=dc,menu=off  -drive index=2,media=cdrom,if=ide,cache=unsafe,aio=native,readonly=on  -k de  -incoming tcp:0:5003  -pidfile /var/run/qemu/vm-146.pid  -mem-path /hugepages  -mem-prealloc  -rtc base=utc -usb -usbdevice tablet -no-hpet -vga cirrus  -cpu qemu64,-kvmclock
> >>>
> >>> Exact kernel is:
> >>> 2.6.16.46-0.12-smp (i think this is SLES10 or sth.)
> >>>
> >>> The machine does not hang. It seems just I/O is hanging. So you can type at the console or ping the system, but no longer login.
> >>>
> >>> Thank you,
> >>> Peter
> >> Interesting observation: Migrating the vServer again seems to fix to problem (at least in one case I could test just now).
> >>
> >> 2.6.8-24-smp is also affected.
> > How often does it fail - you say 'sometimes' - is it a 1/10 or a 1/1000 ?
> Its more often than 1/10 I would say.

OK, that's not too bad - it's the 1/1000 that are really nasty to find.
In your setup, how easy would it be for you to try :
    with either 2.1 or current head?
    with a newer machine-type?
    without the cdrom?
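
As a rough guide to how many migrations each of these tests needs: if the hang shows up independently with probability p per migration, the chance of seeing no hang in n runs is (1 - p)^n. A minimal sketch, assuming independent runs and a 95% confidence target (both are assumptions, not something measured in this thread; the function name is made up for illustration):

```python
import math


def runs_needed(p_fail: float, confidence: float = 0.95) -> int:
    """Smallest n such that 1 - (1 - p_fail)**n >= confidence.

    Assumes every migration fails independently with probability p_fail.
    """
    if not 0.0 < p_fail < 1.0:
        raise ValueError("p_fail must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))


# The two failure rates mentioned above.
for label, p in (("1/10", 0.1), ("1/1000", 0.001)):
    print(label, runs_needed(p))
```

Under these assumptions a 1/10 bug needs roughly 30 clean migrations before a configuration can be called good, while a 1/1000 bug needs roughly 3000, which is why the rare ones are so much harder to pin down.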
                                  
Dave

> 
> >
> > I'm not sure at what kernel version the switch is, but newer kernels use some
> > code shared with the newer SATA world (libata?)  where as older kernels had
> > separate IDE code, so the behaviour of the two can be quite different.
> 
> Thats a good point. I will check what the kernels have.
> I remember that there was sth like a problem with error handling in
> the old drivers? Paolo, you worked a lot on IDE lately. Do you remember?
> 
> Thanks,
> Peter
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [Qemu-block] Migration sometimes fails with IDE and Qemu 2.2.1
  2015-04-07 15:29           ` Dr. David Alan Gilbert
@ 2015-04-07 18:44             ` Peter Lieven
  2015-04-07 18:56               ` John Snow
                                 ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Peter Lieven @ 2015-04-07 18:44 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: Paolo Bonzini, John Snow, qemu-devel, qemu-block

On 07.04.2015 at 17:29, Dr. David Alan Gilbert wrote:
> * Peter Lieven (pl@kamp.de) wrote:
>> Hi David,
>>
>> Am 07.04.2015 um 10:43 schrieb Dr. David Alan Gilbert:
>>>>>> Any particular workload or reproducer?
>>>>> Workload is almost zero. I try to figure out if there is a way to trigger it.
>>>>>
>>>>> Maybe playing a role: Machine type is -M pc1.2 and we set -kvmclock as
>>>>> CPU flag since kvmclock seemed to be quite buggy in 2.6.16...
>>>>>
>>>>> Exact cmdline is:
>>>>> /usr/bin/qemu-2.2.1  -enable-kvm  -M pc-1.2  -nodefaults -netdev type=tap,id=guest2,script=no,downscript=no,ifname=tap2  -device e1000,netdev=guest2,mac=52:54:00:ff:00:65 -drive format=raw,file=iscsi://172.21.200.53/iqn.2001-05.com.equallogic:4-52aed6-88a7e99a4-d9e00040fdc509a3-XXX-hd0/0,if=ide,cache=writeback,aio=native  -serial null  -parallel null  -m 1024 -smp 2,sockets=1,cores=2,threads=1  -monitor tcp:0:4003,server,nowait -vnc :3 -qmp tcp:0:3003,server,nowait -name 'XXX' -boot order=c,once=dc,menu=off  -drive index=2,media=cdrom,if=ide,cache=unsafe,aio=native,readonly=on  -k de  -incoming tcp:0:5003  -pidfile /var/run/qemu/vm-146.pid  -mem-path /hugepages  -mem-prealloc  -rtc base=utc -usb -usbdevice tablet -no-hpet -vga cirrus  -cpu qemu64,-kvmclock
>>>>>
>>>>> Exact kernel is:
>>>>> 2.6.16.46-0.12-smp (i think this is SLES10 or sth.)
>>>>>
>>>>> The machine does not hang. It seems just I/O is hanging. So you can type at the console or ping the system, but no longer login.
>>>>>
>>>>> Thank you,
>>>>> Peter
>>>> Interesting observation: Migrating the vServer again seems to fix to problem (at least in one case I could test just now).
>>>>
>>>> 2.6.8-24-smp is also affected.
>>> How often does it fail - you say 'sometimes' - is it a 1/10 or a 1/1000 ?
>> Its more often than 1/10 I would say.
> OK, that's not too bad - it's the 1/1000 that are really nasty to find.
> In your setup, how easy would it be for you to try :
>     with either 2.1 or current head?
>     with a newer machine-type?
>     without the cdrom?

It's all possible. I can clone the system and try everything on my test systems. I hope
it reproduces there.
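
One way to run such tests repeatedly on cloned systems would be to drive the migration over QMP. A minimal sketch that only builds the QMP messages, assuming the destination is started with -incoming tcp:0:5003 as in the command line above; the destination host name and the helper function are hypothetical:

```python
import json
from typing import List


def qmp_migration_script(dest_host: str, incoming_port: int = 5003) -> List[str]:
    """Return the QMP commands for one migration, one JSON document per line."""
    commands = [
        # QMP requires capability negotiation before any other command.
        {"execute": "qmp_capabilities"},
        # Start the migration towards the destination's -incoming port.
        {"execute": "migrate",
         "arguments": {"uri": "tcp:%s:%d" % (dest_host, incoming_port)}},
        # Would be polled until the reported status is "completed".
        {"execute": "query-migrate"},
    ]
    return [json.dumps(command) for command in commands]


# "test-host-b" is a placeholder for the destination hypervisor.
for line in qmp_migration_script("test-host-b"):
    print(line)
```

These lines would be written to the source's QMP socket (tcp port 3003 in the command line above). Whether the guest's I/O still works afterwards has to be checked separately, for example by logging in.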

Can the cdrom take down the whole bus?

Peter


* Re: [Qemu-devel] [Qemu-block] Migration sometimes fails with IDE and Qemu 2.2.1
  2015-04-07 15:14           ` Paolo Bonzini
@ 2015-04-07 18:54             ` Peter Lieven
  0 siblings, 0 replies; 28+ messages in thread
From: Peter Lieven @ 2015-04-07 18:54 UTC (permalink / raw)
  To: Paolo Bonzini, Dr. David Alan Gilbert; +Cc: John Snow, qemu-devel, qemu-block

On 07.04.2015 at 17:14, Paolo Bonzini wrote:
>
> On 07/04/2015 17:11, Peter Lieven wrote:
>>> I'm not sure at what kernel version the switch is, but newer kernels use some
>>> code shared with the newer SATA world (libata?)  where as older kernels had
>>> separate IDE code, so the behaviour of the two can be quite different.
>> Thats a good point. I will check what the kernels have.
>> I remember that there was sth like a problem with error handling in
>> the old drivers? Paolo, you worked a lot on IDE lately. Do you remember?
> No, I don't know...  I didn't work a lot on IDE, those patches were all
> several years old and finally John shepherded them in. :)

Okay, I was thinking you just worked on that code and were the
new IDE expert ;-)

>
> Still, what David says is correct.  At least RHEL5's 2.6.18 defaulted to
> the old IDE drivers (/dev/hdXNN).

I can confirm that it's /dev/hdX.

Peter


* Re: [Qemu-devel] [Qemu-block] Migration sometimes fails with IDE and Qemu 2.2.1
  2015-04-07 18:44             ` Peter Lieven
@ 2015-04-07 18:56               ` John Snow
  2015-04-07 19:02                 ` Peter Lieven
  2015-04-07 19:01               ` Dr. David Alan Gilbert
  2015-04-07 20:05               ` Paolo Bonzini
  2 siblings, 1 reply; 28+ messages in thread
From: John Snow @ 2015-04-07 18:56 UTC (permalink / raw)
  To: Peter Lieven, Dr. David Alan Gilbert
  Cc: Paolo Bonzini, qemu-devel, qemu-block



On 04/07/2015 02:44 PM, Peter Lieven wrote:
> Am 07.04.2015 um 17:29 schrieb Dr. David Alan Gilbert:
>> * Peter Lieven (pl@kamp.de) wrote:
>>> Hi David,
>>>
>>> Am 07.04.2015 um 10:43 schrieb Dr. David Alan Gilbert:
>>>>>>> Any particular workload or reproducer?
>>>>>> Workload is almost zero. I try to figure out if there is a way to trigger it.
>>>>>>
>>>>>> Maybe playing a role: Machine type is -M pc1.2 and we set -kvmclock as
>>>>>> CPU flag since kvmclock seemed to be quite buggy in 2.6.16...
>>>>>>
>>>>>> Exact cmdline is:
>>>>>> /usr/bin/qemu-2.2.1  -enable-kvm  -M pc-1.2  -nodefaults -netdev type=tap,id=guest2,script=no,downscript=no,ifname=tap2  -device e1000,netdev=guest2,mac=52:54:00:ff:00:65 -drive format=raw,file=iscsi://172.21.200.53/iqn.2001-05.com.equallogic:4-52aed6-88a7e99a4-d9e00040fdc509a3-XXX-hd0/0,if=ide,cache=writeback,aio=native  -serial null  -parallel null  -m 1024 -smp 2,sockets=1,cores=2,threads=1  -monitor tcp:0:4003,server,nowait -vnc :3 -qmp tcp:0:3003,server,nowait -name 'XXX' -boot order=c,once=dc,menu=off  -drive index=2,media=cdrom,if=ide,cache=unsafe,aio=native,readonly=on  -k de  -incoming tcp:0:5003  -pidfile /var/run/qemu/vm-146.pid  -mem-path /hugepages  -mem-prealloc  -rtc base=utc -usb -usbdevice tablet -no-hpet -vga cirrus  -cpu qemu64,-kvmclock
>>>>>>
>>>>>> Exact kernel is:
>>>>>> 2.6.16.46-0.12-smp (i think this is SLES10 or sth.)
>>>>>>
>>>>>> The machine does not hang. It seems just I/O is hanging. So you can type at the console or ping the system, but no longer login.
>>>>>>
>>>>>> Thank you,
>>>>>> Peter
>>>>> Interesting observation: Migrating the vServer again seems to fix to problem (at least in one case I could test just now).
>>>>>
>>>>> 2.6.8-24-smp is also affected.
>>>> How often does it fail - you say 'sometimes' - is it a 1/10 or a 1/1000 ?
>>> Its more often than 1/10 I would say.
>> OK, that's not too bad - it's the 1/1000 that are really nasty to find.
>> In your setup, how easy would it be for you to try :
>>      with either 2.1 or current head?
>>      with a newer machine-type?
>>      without the cdrom?
>
> Its all possible. I can clone the system and try everything on my test systems. I hope
> it reproduces there.
>
> Has the cdrom the power of taking down the bus?
>
> Peter
>

I don't know if the CDROM could stall the entire bus, but I suspect the
reason for asking is this: dgilbert and I previously tracked down a problem
where outstanding requests being handled by the ATAPI code can get lost
during migration if, for instance, the user has only prepared the command
(via bmdma) but has not yet written to the register that activates it.

So if something like this happens:

- User writes to the ATA registers to program a command
- Migration occurs
- User writes to the BMDMA register to initiate the command

We can lose some of the state and data of the request. David had checked 
in a workaround for at least ATAPI that simply coaxes the guest OS into 
trying the command again to unstick it.

I think we determined last time that we couldn't fix this problem 
without changing the migration format, so we opted not to do it for 2.3. 
We had also only noticed it with ATAPI drives, not HDDs, so a proper fix 
got kicked down the road since we thought the workaround was sufficient.

IIRC our success rate with reproducing it was something on the order of 
1/50, too.

If you can reproduce it without a CDROM but using the BMDMA interface, 
that's a good data point. If you can't reproduce it using the ISA 
interface, that's a phenomenal data point and implicates BMDMA pretty 
heavily.
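
The sequence described above can be illustrated with a toy model. This is purely illustrative and not QEMU code: the class, its fields and its methods are hypothetical, and the split between "migrated" and "not migrated" state is an assumption made only to show how a command prepared before migration can get stuck afterwards.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyIdeDevice:
    # Guest-visible register contents; carried across migration in this toy.
    command_reg: Optional[str] = None
    # Internal state of a prepared request; NOT carried across in this toy.
    prepared_request: Optional[str] = None

    def write_ata_registers(self, command: str) -> None:
        """Guest programs a command; the device prepares the request."""
        self.command_reg = command
        self.prepared_request = "request for " + command

    def write_bmdma_start(self) -> str:
        """Guest kicks off the transfer; needs the prepared request."""
        if self.prepared_request is None:
            return "stalled"  # nothing to run, the guest waits forever
        self.prepared_request = None
        return "completed"


def migrate(src: ToyIdeDevice) -> ToyIdeDevice:
    """Copy only the register state, dropping the prepared request."""
    return ToyIdeDevice(command_reg=src.command_reg)


src = ToyIdeDevice()
src.write_ata_registers("READ DMA")   # step 1: program the command
dst = migrate(src)                    # step 2: migration occurs
print(dst.write_bmdma_start())        # step 3 on the destination: stalled
print(src.write_bmdma_start())        # same step without migration: completed
```

In this toy the destination stalls because the request prepared in step 1 never arrives, while the same write on the source completes normally.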

--js


* Re: [Qemu-devel] [Qemu-block] Migration sometimes fails with IDE and Qemu 2.2.1
  2015-04-07 18:44             ` Peter Lieven
  2015-04-07 18:56               ` John Snow
@ 2015-04-07 19:01               ` Dr. David Alan Gilbert
  2015-04-07 19:04                 ` Peter Lieven
  2015-04-09 12:49                 ` Peter Lieven
  2015-04-07 20:05               ` Paolo Bonzini
  2 siblings, 2 replies; 28+ messages in thread
From: Dr. David Alan Gilbert @ 2015-04-07 19:01 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Paolo Bonzini, John Snow, qemu-devel, qemu-block

* Peter Lieven (pl@kamp.de) wrote:
> Am 07.04.2015 um 17:29 schrieb Dr. David Alan Gilbert:
> > * Peter Lieven (pl@kamp.de) wrote:
> >> Hi David,
> >>
> >> Am 07.04.2015 um 10:43 schrieb Dr. David Alan Gilbert:
> >>>>>> Any particular workload or reproducer?
> >>>>> Workload is almost zero. I try to figure out if there is a way to trigger it.
> >>>>>
> >>>>> Maybe playing a role: Machine type is -M pc1.2 and we set -kvmclock as
> >>>>> CPU flag since kvmclock seemed to be quite buggy in 2.6.16...
> >>>>>
> >>>>> Exact cmdline is:
> >>>>> /usr/bin/qemu-2.2.1  -enable-kvm  -M pc-1.2  -nodefaults -netdev type=tap,id=guest2,script=no,downscript=no,ifname=tap2  -device e1000,netdev=guest2,mac=52:54:00:ff:00:65 -drive format=raw,file=iscsi://172.21.200.53/iqn.2001-05.com.equallogic:4-52aed6-88a7e99a4-d9e00040fdc509a3-XXX-hd0/0,if=ide,cache=writeback,aio=native  -serial null  -parallel null  -m 1024 -smp 2,sockets=1,cores=2,threads=1  -monitor tcp:0:4003,server,nowait -vnc :3 -qmp tcp:0:3003,server,nowait -name 'XXX' -boot order=c,once=dc,menu=off  -drive index=2,media=cdrom,if=ide,cache=unsafe,aio=native,readonly=on  -k de  -incoming tcp:0:5003  -pidfile /var/run/qemu/vm-146.pid  -mem-path /hugepages  -mem-prealloc  -rtc base=utc -usb -usbdevice tablet -no-hpet -vga cirrus  -cpu qemu64,-kvmclock
> >>>>>
> >>>>> Exact kernel is:
> >>>>> 2.6.16.46-0.12-smp (i think this is SLES10 or sth.)
> >>>>>
> >>>>> The machine does not hang. It seems just I/O is hanging. So you can type at the console or ping the system, but no longer login.
> >>>>>
> >>>>> Thank you,
> >>>>> Peter
> >>>> Interesting observation: Migrating the vServer again seems to fix to problem (at least in one case I could test just now).
> >>>>
> >>>> 2.6.8-24-smp is also affected.
> >>> How often does it fail - you say 'sometimes' - is it a 1/10 or a 1/1000 ?
> >> Its more often than 1/10 I would say.
> > OK, that's not too bad - it's the 1/1000 that are really nasty to find.
> > In your setup, how easy would it be for you to try :
> >     with either 2.1 or current head?
> >     with a newer machine-type?
> >     without the cdrom?
> 
> Its all possible. I can clone the system and try everything on my test systems. I hope
> it reproduces there.

Great.  I think the order I would go would be:
    Try head - if it works we know we've already got the fix somewhere
    Try 2.1  - if it works we know it's something we introduced between
               2.1 and 2.2.1
    Try a newer machine type - because pc-1.2 probably isn't tested much
    CDROM at the end.

> Has the cdrom the power of taking down the bus?

I just know the cdrom migration is a bit lacking and the simpler
the test case the better.

Dave

> 
> Peter
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [Qemu-block] Migration sometimes fails with IDE and Qemu 2.2.1
  2015-04-07 18:56               ` John Snow
@ 2015-04-07 19:02                 ` Peter Lieven
  2015-04-07 19:13                   ` John Snow
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Lieven @ 2015-04-07 19:02 UTC (permalink / raw)
  To: John Snow, Dr. David Alan Gilbert; +Cc: Paolo Bonzini, qemu-devel, qemu-block

On 07.04.2015 at 20:56, John Snow wrote:
>
>
> On 04/07/2015 02:44 PM, Peter Lieven wrote:
>> Am 07.04.2015 um 17:29 schrieb Dr. David Alan Gilbert:
>>> * Peter Lieven (pl@kamp.de) wrote:
>>>> Hi David,
>>>>
>>>> Am 07.04.2015 um 10:43 schrieb Dr. David Alan Gilbert:
>>>>>>>> Any particular workload or reproducer?
>>>>>>> Workload is almost zero. I try to figure out if there is a way to trigger it.
>>>>>>>
>>>>>>> Maybe playing a role: Machine type is -M pc1.2 and we set -kvmclock as
>>>>>>> CPU flag since kvmclock seemed to be quite buggy in 2.6.16...
>>>>>>>
>>>>>>> Exact cmdline is:
>>>>>>> /usr/bin/qemu-2.2.1  -enable-kvm  -M pc-1.2  -nodefaults -netdev type=tap,id=guest2,script=no,downscript=no,ifname=tap2  -device e1000,netdev=guest2,mac=52:54:00:ff:00:65 -drive format=raw,file=iscsi://172.21.200.53/iqn.2001-05.com.equallogic:4-52aed6-88a7e99a4-d9e00040fdc509a3-XXX-hd0/0,if=ide,cache=writeback,aio=native  -serial null  -parallel null  -m 1024 -smp 2,sockets=1,cores=2,threads=1  -monitor tcp:0:4003,server,nowait -vnc :3 -qmp tcp:0:3003,server,nowait -name 'XXX' -boot order=c,once=dc,menu=off  -drive index=2,media=cdrom,if=ide,cache=unsafe,aio=native,readonly=on  -k de  -incoming tcp:0:5003  -pidfile /var/run/qemu/vm-146.pid  -mem-path /hugepages  -mem-prealloc  -rtc base=utc -usb -usbdevice tablet -no-hpet -vga cirrus  -cpu qemu64,-kvmclock
>>>>>>>
>>>>>>> Exact kernel is:
>>>>>>> 2.6.16.46-0.12-smp (i think this is SLES10 or sth.)
>>>>>>>
>>>>>>> The machine does not hang. It seems just I/O is hanging. So you can type at the console or ping the system, but no longer login.
>>>>>>>
>>>>>>> Thank you,
>>>>>>> Peter
>>>>>> Interesting observation: Migrating the vServer again seems to fix to problem (at least in one case I could test just now).
>>>>>>
>>>>>> 2.6.8-24-smp is also affected.
>>>>> How often does it fail - you say 'sometimes' - is it a 1/10 or a 1/1000 ?
>>>> Its more often than 1/10 I would say.
>>> OK, that's not too bad - it's the 1/1000 that are really nasty to find.
>>> In your setup, how easy would it be for you to try :
>>>      with either 2.1 or current head?
>>>      with a newer machine-type?
>>>      without the cdrom?
>>
>> Its all possible. I can clone the system and try everything on my test systems. I hope
>> it reproduces there.
>>
>> Has the cdrom the power of taking down the bus?
>>
>> Peter
>>
>
> I don't know if CDROM could stall the entire bus, but I suspect the reason for asking is this: dgilbert and I had tracked down a problem previously where during migration, outstanding requests being handled by the ATAPI code can get lost during migration if, for instance, the user has only prepared the command (via bmdma) but has not yet written to the register to activate the command yet.

That sounds like it could be related.

>
> So if something like this happens:
>
> - User writes to the ATA registers to program a command
> - Migration occurs
> - User writes to the BMDMA register to initiate the command
>
> We can lose some of the state and data of the request. David had checked in a workaround for at least ATAPI that simply coaxes the guest OS into trying the command again to unstick it.

Do you have the commit for me?

>
> I think we determined last time that we couldn't fix this problem without changing the migration format, so we opted not to do it for 2.3. We had also only noticed it with ATAPI drives, not HDDs, so a proper fix got kicked down the road since we thought the workaround was sufficient.

Maybe this goes unnoticed because virtio is normally used nowadays, and maybe the newer kernel implementation (libata, /dev/sdX) cannot get stuck this way? What I do not understand is how a second migration can recover the guest from this state.

>
> IIRC our success rate with reproducing it was something on the order of 1/50, too.
>
> If you can reproduce it without a CDROM but using the BMDMA interface, that's a good data point. If you can't reproduce it using the ISA interface, that's a phenomenal data point and implicates BMDMA pretty heavily.

Just to be 100% sure we are talking about the same thing: how would I use the ISA interface, and how would I use the BMDMA interface?

Thanks,
Peter


* Re: [Qemu-devel] [Qemu-block] Migration sometimes fails with IDE and Qemu 2.2.1
  2015-04-07 19:01               ` Dr. David Alan Gilbert
@ 2015-04-07 19:04                 ` Peter Lieven
  2015-04-09 12:49                 ` Peter Lieven
  1 sibling, 0 replies; 28+ messages in thread
From: Peter Lieven @ 2015-04-07 19:04 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: Paolo Bonzini, John Snow, qemu-devel, qemu-block

Am 07.04.2015 um 21:01 schrieb Dr. David Alan Gilbert:
> * Peter Lieven (pl@kamp.de) wrote:
>> Am 07.04.2015 um 17:29 schrieb Dr. David Alan Gilbert:
>>> * Peter Lieven (pl@kamp.de) wrote:
>>>> Hi David,
>>>>
>>>> Am 07.04.2015 um 10:43 schrieb Dr. David Alan Gilbert:
>>>>>>>> Any particular workload or reproducer?
>>>>>>> Workload is almost zero. I try to figure out if there is a way to trigger it.
>>>>>>>
>>>>>>> Maybe playing a role: Machine type is -M pc1.2 and we set -kvmclock as
>>>>>>> CPU flag since kvmclock seemed to be quite buggy in 2.6.16...
>>>>>>>
>>>>>>> Exact cmdline is:
>>>>>>> /usr/bin/qemu-2.2.1  -enable-kvm  -M pc-1.2  -nodefaults -netdev type=tap,id=guest2,script=no,downscript=no,ifname=tap2  -device e1000,netdev=guest2,mac=52:54:00:ff:00:65 -drive format=raw,file=iscsi://172.21.200.53/iqn.2001-05.com.equallogic:4-52aed6-88a7e99a4-d9e00040fdc509a3-XXX-hd0/0,if=ide,cache=writeback,aio=native  -serial null  -parallel null  -m 1024 -smp 2,sockets=1,cores=2,threads=1  -monitor tcp:0:4003,server,nowait -vnc :3 -qmp tcp:0:3003,server,nowait -name 'XXX' -boot order=c,once=dc,menu=off  -drive index=2,media=cdrom,if=ide,cache=unsafe,aio=native,readonly=on  -k de  -incoming tcp:0:5003  -pidfile /var/run/qemu/vm-146.pid  -mem-path /hugepages  -mem-prealloc  -rtc base=utc -usb -usbdevice tablet -no-hpet -vga cirrus  -cpu qemu64,-kvmclock
>>>>>>>
>>>>>>> Exact kernel is:
>>>>>>> 2.6.16.46-0.12-smp (i think this is SLES10 or sth.)
>>>>>>>
>>>>>>> The machine does not hang. It seems just I/O is hanging. So you can type at the console or ping the system, but no longer login.
>>>>>>>
>>>>>>> Thank you,
>>>>>>> Peter
>>>>>> Interesting observation: Migrating the vServer again seems to fix to problem (at least in one case I could test just now).
>>>>>>
>>>>>> 2.6.8-24-smp is also affected.
>>>>> How often does it fail - you say 'sometimes' - is it a 1/10 or a 1/1000 ?
>>>> Its more often than 1/10 I would say.
>>> OK, that's not too bad - it's the 1/1000 that are really nasty to find.
>>> In your setup, how easy would it be for you to try :
>>>     with either 2.1 or current head?
>>>     with a newer machine-type?
>>>     without the cdrom?
>> Its all possible. I can clone the system and try everything on my test systems. I hope
>> it reproduces there.
> Great.  I think the order I would go would be:
>     Try head - if it works we know we've already got the fix somewhere
>     Try 2.1  - if it works we know it's something we introduced between
>                2.1 and 2.2.1
>     Try a newer machine type - because pc-1.2 probably isn't tested much

I don't mind changing the machine type. It is pc-1.2 because we always
set the machine type the vServer was created with.

>     CDROM at the end.
>
>> Has the cdrom the power of taking down the bus?
> I just know the cdrom migration is a bit lacking and the simpler
> the test case the better.

Just for the record, there was no CD inserted during migration.

Peter


* Re: [Qemu-devel] [Qemu-block] Migration sometimes fails with IDE and Qemu 2.2.1
  2015-04-07 19:02                 ` Peter Lieven
@ 2015-04-07 19:13                   ` John Snow
  2015-04-09  6:34                     ` Peter Lieven
  2015-04-09 12:46                     ` Peter Lieven
  0 siblings, 2 replies; 28+ messages in thread
From: John Snow @ 2015-04-07 19:13 UTC (permalink / raw)
  To: Peter Lieven, Dr. David Alan Gilbert
  Cc: Paolo Bonzini, qemu-devel, qemu-block



On 04/07/2015 03:02 PM, Peter Lieven wrote:
> Am 07.04.2015 um 20:56 schrieb John Snow:
>>
>>
>> On 04/07/2015 02:44 PM, Peter Lieven wrote:
>>> Am 07.04.2015 um 17:29 schrieb Dr. David Alan Gilbert:
>>>> * Peter Lieven (pl@kamp.de) wrote:
>>>>> Hi David,
>>>>>
>>>>> Am 07.04.2015 um 10:43 schrieb Dr. David Alan Gilbert:
>>>>>>>>> Any particular workload or reproducer?
>>>>>>>> Workload is almost zero. I try to figure out if there is a way to trigger it.
>>>>>>>>
>>>>>>>> Maybe playing a role: Machine type is -M pc1.2 and we set -kvmclock as
>>>>>>>> CPU flag since kvmclock seemed to be quite buggy in 2.6.16...
>>>>>>>>
>>>>>>>> Exact cmdline is:
>>>>>>>> /usr/bin/qemu-2.2.1  -enable-kvm  -M pc-1.2  -nodefaults -netdev type=tap,id=guest2,script=no,downscript=no,ifname=tap2  -device e1000,netdev=guest2,mac=52:54:00:ff:00:65 -drive format=raw,file=iscsi://172.21.200.53/iqn.2001-05.com.equallogic:4-52aed6-88a7e99a4-d9e00040fdc509a3-XXX-hd0/0,if=ide,cache=writeback,aio=native  -serial null  -parallel null  -m 1024 -smp 2,sockets=1,cores=2,threads=1  -monitor tcp:0:4003,server,nowait -vnc :3 -qmp tcp:0:3003,server,nowait -name 'XXX' -boot order=c,once=dc,menu=off  -drive index=2,media=cdrom,if=ide,cache=unsafe,aio=native,readonly=on  -k de  -incoming tcp:0:5003  -pidfile /var/run/qemu/vm-146.pid  -mem-path /hugepages  -mem-prealloc  -rtc base=utc -usb -usbdevice tablet -no-hpet -vga cirrus  -cpu qemu64,-kvmclock
>>>>>>>>
>>>>>>>> Exact kernel is:
>>>>>>>> 2.6.16.46-0.12-smp (i think this is SLES10 or sth.)
>>>>>>>>
>>>>>>>> The machine does not hang. It seems just I/O is hanging. So you can type at the console or ping the system, but no longer login.
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>> Peter
>>>>>>> Interesting observation: Migrating the vServer again seems to fix to problem (at least in one case I could test just now).
>>>>>>>
>>>>>>> 2.6.8-24-smp is also affected.
>>>>>> How often does it fail - you say 'sometimes' - is it a 1/10 or a 1/1000 ?
>>>>> Its more often than 1/10 I would say.
>>>> OK, that's not too bad - it's the 1/1000 that are really nasty to find.
>>>> In your setup, how easy would it be for you to try :
>>>>       with either 2.1 or current head?
>>>>       with a newer machine-type?
>>>>       without the cdrom?
>>>
>>> Its all possible. I can clone the system and try everything on my test systems. I hope
>>> it reproduces there.
>>>
>>> Has the cdrom the power of taking down the bus?
>>>
>>> Peter
>>>
>>
>> I don't know if CDROM could stall the entire bus, but I suspect the reason for asking is this: dgilbert and I had tracked down a problem previously where during migration, outstanding requests being handled by the ATAPI code can get lost during migration if, for instance, the user has only prepared the command (via bmdma) but has not yet written to the register to activate the command yet.
>
> That sounds like it could be related.
>
>>
>> So if something like this happens:
>>
>> - User writes to the ATA registers to program a command
>> - Migration occurs
>> - User writes to the BMDMA register to initiate the command
>>
>> We can lose some of the state and data of the request. David had checked in a workaround for at least ATAPI that simply coaxes the guest OS into trying the command again to unstick it.
>
> Do you have the commit for me?
>

http://lists.gnu.org/archive/html/qemu-devel/2014-12/msg01109.html
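
For illustration only, the sequence quoted above (program the ATA registers, migrate, then write the BMDMA register) can be modelled with a toy Python sketch. This is not QEMU code: the class, the function and all field names are hypothetical, and "flag_retry" is just a stand-in for the idea of coaxing the guest into retrying, not a description of what the linked patch actually does.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyIdeBus:
    """Toy model of one IDE bus; hypothetical names, not QEMU code."""
    programmed_cmd: Optional[str] = None          # set via the ATA registers
    completed: list = field(default_factory=list)
    needs_retry: bool = False                     # hint that the guest should retry

    def program(self, cmd: str) -> None:
        # Step 1: the guest writes the ATA registers to program a command.
        self.programmed_cmd = cmd

    def start_dma(self) -> bool:
        # Step 3: the guest writes the BMDMA register to start the command.
        if self.programmed_cmd is None:
            return False   # nothing to run: the guest would wait forever
        self.completed.append(self.programmed_cmd)
        self.programmed_cmd = None
        return True


def migrate(src: ToyIdeBus, carry_pending: bool,
            flag_retry: bool = False) -> ToyIdeBus:
    # Step 2: the destination is rebuilt only from what was transferred.
    dst = ToyIdeBus(completed=list(src.completed))
    if carry_pending:
        dst.programmed_cmd = src.programmed_cmd
    elif flag_retry and src.programmed_cmd is not None:
        dst.needs_retry = True
    return dst
```

In this model, a migration with carry_pending=False makes the later start_dma() return False, which corresponds to the stuck I/O; carrying the pending command across corresponds to a real fix that needs extra migration state.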

>>
>> I think we determined last time that we couldn't fix this problem without changing the migration format, so we opted not to do it for 2.3. We had also only noticed it with ATAPI drives, not HDDs, so a proper fix got kicked down the road since we thought the workaround was sufficient.
>
> Maybe normally we use virtio nowadays and maybe the new kernel implementation (libata /dev/sdX) can't get locked? What I do not understand is how a second migration can unlock from this state?
>
>>
>> IIRC our success rate with reproducing it was something on the order of 1/50, too.
>>
>> If you can reproduce it without a CDROM but using the BMDMA interface, that's a good data point. If you can't reproduce it using the ISA interface, that's a phenomenal data point and implicates BMDMA pretty heavily.
>
> To be 100% sure we are talking about the same? How would I use the ISA and how would I use the BMDMA interface?
>
> Thanks,
> Peter
>

BMDMA is the PCI HBA for IDE; I think it's the default for most machines
that aren't using the AHCI HBA.

To get ISA, try launching with the machine type "isapc", which will force
it, or add the device manually; it's named "isa-ide".
The BMDMA PCI device is just named "ide".
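
As a rough sketch of the two launch variants (untested against any particular QEMU version), a small Python helper could build the argument lists. The helper name, the binary name "qemu-system-x86_64" and the disk path are assumptions for illustration; only "-M isapc" and "if=ide" come from this thread.

```python
from typing import List


def qemu_ide_args(disk: str, use_isapc: bool) -> List[str]:
    """Build an illustrative QEMU argument list; 'disk' is a placeholder path."""
    args = ["qemu-system-x86_64"]   # assumed binary name
    if use_isapc:
        # ISA-only machine type, which forces the ISA IDE interface.
        args += ["-M", "isapc"]
    # Same IDE-attached raw disk in both variants.
    args += ["-drive", f"file={disk},format=raw,if=ide"]
    return args
```

Calling qemu_ide_args("disk.img", True) gives the ISA variant and qemu_ide_args("disk.img", False) leaves the machine type at its default, so the only difference between the two test runs is the IDE interface.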


* Re: [Qemu-devel] [Qemu-block] Migration sometimes fails with IDE and Qemu 2.2.1
  2015-04-07 18:44             ` Peter Lieven
  2015-04-07 18:56               ` John Snow
  2015-04-07 19:01               ` Dr. David Alan Gilbert
@ 2015-04-07 20:05               ` Paolo Bonzini
  2015-04-09  6:43                 ` Peter Lieven
  2 siblings, 1 reply; 28+ messages in thread
From: Paolo Bonzini @ 2015-04-07 20:05 UTC (permalink / raw)
  To: Peter Lieven, Dr. David Alan Gilbert; +Cc: John Snow, qemu-devel, qemu-block



On 07/04/2015 20:44, Peter Lieven wrote:
> Has the cdrom the power of taking down the bus?

IDE can only issue one command per bus, so hda/hdb can take down each
other, and hdc/hdd can take down each other.  However, hda cannot take
down hdc and vice versa---so likely the CDROM cannot take down the hard
disk.
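
The rule above can be written down as a small Python sketch. The function names are hypothetical, and it only covers the four legacy names hda..hdd on the first two buses.

```python
def ide_slot(name: str) -> tuple:
    """Map a legacy drive name (hda..hdd) to (bus, unit)."""
    if len(name) != 3 or not name.startswith("hd") or name[2] not in "abcd":
        raise ValueError(f"not one of hda..hdd: {name!r}")
    index = ord(name[2]) - ord("a")
    # Two units (master/slave) per bus: hda=(0,0), hdb=(0,1), hdc=(1,0), hdd=(1,1).
    return divmod(index, 2)


def can_block(a: str, b: str) -> bool:
    """Two drives can stall each other only if they sit on the same bus."""
    return ide_slot(a)[0] == ide_slot(b)[0]
```

With the disk on hda and the CDROM on hdc, can_block("hda", "hdc") is False, which matches the argument that the CDROM is unlikely to take down the hard disk.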

Paolo


* Re: [Qemu-devel] [Qemu-block] Migration sometimes fails with IDE and Qemu 2.2.1
  2015-04-07 19:13                   ` John Snow
@ 2015-04-09  6:34                     ` Peter Lieven
  2015-04-09 12:46                     ` Peter Lieven
  1 sibling, 0 replies; 28+ messages in thread
From: Peter Lieven @ 2015-04-09  6:34 UTC (permalink / raw)
  To: John Snow, Dr. David Alan Gilbert; +Cc: Paolo Bonzini, qemu-devel, qemu-block

Am 07.04.2015 um 21:13 schrieb John Snow:
>
>
> On 04/07/2015 03:02 PM, Peter Lieven wrote:
>> Am 07.04.2015 um 20:56 schrieb John Snow:
>>>
>>>
>>> On 04/07/2015 02:44 PM, Peter Lieven wrote:
>>>> Am 07.04.2015 um 17:29 schrieb Dr. David Alan Gilbert:
>>>>> * Peter Lieven (pl@kamp.de) wrote:
>>>>>> Hi David,
>>>>>>
>>>>>> Am 07.04.2015 um 10:43 schrieb Dr. David Alan Gilbert:
>>>>>>>>>> Any particular workload or reproducer?
>>>>>>>>> Workload is almost zero. I try to figure out if there is a way to trigger it.
>>>>>>>>>
>>>>>>>>> Maybe playing a role: Machine type is -M pc1.2 and we set -kvmclock as
>>>>>>>>> CPU flag since kvmclock seemed to be quite buggy in 2.6.16...
>>>>>>>>>
>>>>>>>>> Exact cmdline is:
>>>>>>>>> /usr/bin/qemu-2.2.1  -enable-kvm  -M pc-1.2 -nodefaults -netdev type=tap,id=guest2,script=no,downscript=no,ifname=tap2 -device e1000,netdev=guest2,mac=52:54:00:ff:00:65 -drive 
>>>>>>>>> format=raw,file=iscsi://172.21.200.53/iqn.2001-05.com.equallogic:4-52aed6-88a7e99a4-d9e00040fdc509a3-XXX-hd0/0,if=ide,cache=writeback,aio=native -serial null  -parallel null  -m 1024 -smp 2,sockets=1,cores=2,threads=1  -monitor 
>>>>>>>>> tcp:0:4003,server,nowait -vnc :3 -qmp tcp:0:3003,server,nowait -name 'XXX' -boot order=c,once=dc,menu=off  -drive index=2,media=cdrom,if=ide,cache=unsafe,aio=native,readonly=on -k de  -incoming tcp:0:5003  -pidfile /var/run/qemu/vm-146.pid  
>>>>>>>>> -mem-path /hugepages -mem-prealloc  -rtc base=utc -usb -usbdevice tablet -no-hpet -vga cirrus  -cpu qemu64,-kvmclock
>>>>>>>>>
>>>>>>>>> Exact kernel is:
>>>>>>>>> 2.6.16.46-0.12-smp (i think this is SLES10 or sth.)
>>>>>>>>>
>>>>>>>>> The machine does not hang. It seems just I/O is hanging. So you can type at the console or ping the system, but no longer login.
>>>>>>>>>
>>>>>>>>> Thank you,
>>>>>>>>> Peter
>>>>>>>> Interesting observation: Migrating the vServer again seems to fix to problem (at least in one case I could test just now).
>>>>>>>>
>>>>>>>> 2.6.8-24-smp is also affected.
>>>>>>> How often does it fail - you say 'sometimes' - is it a 1/10 or a 1/1000 ?
>>>>>> Its more often than 1/10 I would say.
>>>>> OK, that's not too bad - it's the 1/1000 that are really nasty to find.
>>>>> In your setup, how easy would it be for you to try :
>>>>>       with either 2.1 or current head?
>>>>>       with a newer machine-type?
>>>>>       without the cdrom?
>>>>
>>>> Its all possible. I can clone the system and try everything on my test systems. I hope
>>>> it reproduces there.
>>>>
>>>> Has the cdrom the power of taking down the bus?
>>>>
>>>> Peter
>>>>
>>>
>>> I don't know if CDROM could stall the entire bus, but I suspect the reason for asking is this: dgilbert and I had tracked down a problem previously where during migration, outstanding requests being handled by the ATAPI code can get lost during 
>>> migration if, for instance, the user has only prepared the command (via bmdma) but has not yet written to the register to activate the command yet.
>>
>> That sounds like it could be related.
>>
>>>
>>> So if something like this happens:
>>>
>>> - User writes to the ATA registers to program a command
>>> - Migration occurs
>>> - User writes to the BMDMA register to initiate the command
>>>
>>> We can lose some of the state and data of the request. David had checked in a workaround for at least ATAPI that simply coaxes the guest OS into trying the command again to unstick it.
>>
>> Do you have the commit for me?
>>
>
> http://lists.gnu.org/archive/html/qemu-devel/2014-12/msg01109.html
>
>>>
>>> I think we determined last time that we couldn't fix this problem without changing the migration format, so we opted not to do it for 2.3. We had also only noticed it with ATAPI drives, not HDDs, so a proper fix got kicked down the road since we 
>>> thought the workaround was sufficient.
>>
>> Maybe normally we use virtio nowadays and maybe the new kernel implementation (libata /dev/sdX) can't get locked? What I do not understand is how a second migration can unlock from this state?
>>
>>>
>>> IIRC our success rate with reproducing it was something on the order of 1/50, too.
>>>
>>> If you can reproduce it without a CDROM but using the BMDMA interface, that's a good data point. If you can't reproduce it using the ISA interface, that's a phenomenal data point and implicates BMDMA pretty heavily.
>>
>> To be 100% sure we are talking about the same? How would I use the ISA and how would I use the BMDMA interface?
>>
>> Thanks,
>> Peter
>>
>
> BMDMA is the PCI HBA for IDE, I think it's the default for most machines that aren't using the AHCI HBA.
>
> To get ISA, try launching with the machine "isapc" which will force it, or add the device manually, it's named "isa-ide".
> The BMDMA PCI device is just named "ide".

I will start more debugging today. I found that other SuSE servers which use the newer interface (presenting the disk as /dev/sdX)
do not suffer from the problem.

Peter


* Re: [Qemu-devel] [Qemu-block] Migration sometimes fails with IDE and Qemu 2.2.1
  2015-04-07 20:05               ` Paolo Bonzini
@ 2015-04-09  6:43                 ` Peter Lieven
  0 siblings, 0 replies; 28+ messages in thread
From: Peter Lieven @ 2015-04-09  6:43 UTC (permalink / raw)
  To: Paolo Bonzini, Dr. David Alan Gilbert; +Cc: John Snow, qemu-devel, qemu-block

Am 07.04.2015 um 22:05 schrieb Paolo Bonzini:
>
> On 07/04/2015 20:44, Peter Lieven wrote:
>> Has the cdrom the power of taking down the bus?
> IDE can only issue one command per bus, so hda/hdb can take down each
> other, and hdc/hdd can take down each other.  However, hda cannot take
> down hdc and vice versa---so likely the CDROM cannot take down the hard
> disk.

Right. I confirmed that the machines use BMDMA and that the CDROM is hdc while
the boot disk is hda. The IDE driver reports itself as E-IDE Revision 7.0.0alpha2.

Peter


* Re: [Qemu-devel] [Qemu-block] Migration sometimes fails with IDE and Qemu 2.2.1
  2015-04-07 19:13                   ` John Snow
  2015-04-09  6:34                     ` Peter Lieven
@ 2015-04-09 12:46                     ` Peter Lieven
  2015-04-09 12:50                       ` Paolo Bonzini
  1 sibling, 1 reply; 28+ messages in thread
From: Peter Lieven @ 2015-04-09 12:46 UTC (permalink / raw)
  To: John Snow, Dr. David Alan Gilbert; +Cc: Paolo Bonzini, qemu-devel, qemu-block

Am 07.04.2015 um 21:13 schrieb John Snow:
>
>
> On 04/07/2015 03:02 PM, Peter Lieven wrote:
>> Am 07.04.2015 um 20:56 schrieb John Snow:
>>>
>>>
>>> On 04/07/2015 02:44 PM, Peter Lieven wrote:
>>>> Am 07.04.2015 um 17:29 schrieb Dr. David Alan Gilbert:
>>>>> * Peter Lieven (pl@kamp.de) wrote:
>>>>>> Hi David,
>>>>>>
>>>>>> Am 07.04.2015 um 10:43 schrieb Dr. David Alan Gilbert:
>>>>>>>>>> Any particular workload or reproducer?
>>>>>>>>> Workload is almost zero. I try to figure out if there is a way to trigger it.
>>>>>>>>>
>>>>>>>>> Maybe playing a role: Machine type is -M pc1.2 and we set -kvmclock as
>>>>>>>>> CPU flag since kvmclock seemed to be quite buggy in 2.6.16...
>>>>>>>>>
>>>>>>>>> Exact cmdline is:
>>>>>>>>> /usr/bin/qemu-2.2.1  -enable-kvm  -M pc-1.2 -nodefaults -netdev type=tap,id=guest2,script=no,downscript=no,ifname=tap2 -device e1000,netdev=guest2,mac=52:54:00:ff:00:65 -drive 
>>>>>>>>> format=raw,file=iscsi://172.21.200.53/iqn.2001-05.com.equallogic:4-52aed6-88a7e99a4-d9e00040fdc509a3-XXX-hd0/0,if=ide,cache=writeback,aio=native -serial null  -parallel null  -m 1024 -smp 2,sockets=1,cores=2,threads=1  -monitor 
>>>>>>>>> tcp:0:4003,server,nowait -vnc :3 -qmp tcp:0:3003,server,nowait -name 'XXX' -boot order=c,once=dc,menu=off  -drive index=2,media=cdrom,if=ide,cache=unsafe,aio=native,readonly=on -k de  -incoming tcp:0:5003  -pidfile /var/run/qemu/vm-146.pid  
>>>>>>>>> -mem-path /hugepages -mem-prealloc  -rtc base=utc -usb -usbdevice tablet -no-hpet -vga cirrus  -cpu qemu64,-kvmclock
>>>>>>>>>
>>>>>>>>> Exact kernel is:
>>>>>>>>> 2.6.16.46-0.12-smp (i think this is SLES10 or sth.)
>>>>>>>>>
>>>>>>>>> The machine does not hang. It seems just I/O is hanging. So you can type at the console or ping the system, but no longer login.
>>>>>>>>>
>>>>>>>>> Thank you,
>>>>>>>>> Peter
>>>>>>>> Interesting observation: Migrating the vServer again seems to fix to problem (at least in one case I could test just now).
>>>>>>>>
>>>>>>>> 2.6.8-24-smp is also affected.
>>>>>>> How often does it fail - you say 'sometimes' - is it a 1/10 or a 1/1000 ?
>>>>>> Its more often than 1/10 I would say.
>>>>> OK, that's not too bad - it's the 1/1000 that are really nasty to find.
>>>>> In your setup, how easy would it be for you to try :
>>>>>       with either 2.1 or current head?
>>>>>       with a newer machine-type?
>>>>>       without the cdrom?
>>>>
>>>> Its all possible. I can clone the system and try everything on my test systems. I hope
>>>> it reproduces there.
>>>>
>>>> Has the cdrom the power of taking down the bus?
>>>>
>>>> Peter
>>>>
>>>
>>> I don't know if CDROM could stall the entire bus, but I suspect the reason for asking is this: dgilbert and I had tracked down a problem previously where during migration, outstanding requests being handled by the ATAPI code can get lost during 
>>> migration if, for instance, the user has only prepared the command (via bmdma) but has not yet written to the register to activate the command yet.
>>
>> That sounds like it could be related.
>>
>>>
>>> So if something like this happens:
>>>
>>> - User writes to the ATA registers to program a command
>>> - Migration occurs
>>> - User writes to the BMDMA register to initiate the command
>>>
>>> We can lose some of the state and data of the request. David had checked in a workaround for at least ATAPI that simply coaxes the guest OS into trying the command again to unstick it.
>>
>> Do you have the commit for me?
>>
>
> http://lists.gnu.org/archive/html/qemu-devel/2014-12/msg01109.html
>
>>>
>>> I think we determined last time that we couldn't fix this problem without changing the migration format, so we opted not to do it for 2.3. We had also only noticed it with ATAPI drives, not HDDs, so a proper fix got kicked down the road since we 
>>> thought the workaround was sufficient.
>>
>> Maybe normally we use virtio nowadays and maybe the new kernel implementation (libata /dev/sdX) can't get locked? What I do not understand is how a second migration can unlock from this state?
>>
>>>
>>> IIRC our success rate with reproducing it was something on the order of 1/50, too.
>>>
>>> If you can reproduce it without a CDROM but using the BMDMA interface, that's a good data point. If you can't reproduce it using the ISA interface, that's a phenomenal data point and implicates BMDMA pretty heavily.
>>
>> To be 100% sure we are talking about the same? How would I use the ISA and how would I use the BMDMA interface?
>>
>> Thanks,
>> Peter
>>
>
> BMDMA is the PCI HBA for IDE, I think it's the default for most machines that aren't using the AHCI HBA.
>
> To get ISA, try launching with the machine "isapc" which will force it, or add the device manually, it's named "isa-ide".
> The BMDMA PCI device is just named "ide".

Unfortunately, the BIOS can't boot if I specify device isa-ide.

Peter


* Re: [Qemu-devel] [Qemu-block] Migration sometimes fails with IDE and Qemu 2.2.1
  2015-04-07 19:01               ` Dr. David Alan Gilbert
  2015-04-07 19:04                 ` Peter Lieven
@ 2015-04-09 12:49                 ` Peter Lieven
  2015-04-09 13:32                   ` Peter Lieven
  2015-04-09 13:43                   ` Dr. David Alan Gilbert
  1 sibling, 2 replies; 28+ messages in thread
From: Peter Lieven @ 2015-04-09 12:49 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: Paolo Bonzini, John Snow, qemu-devel, qemu-block

Am 07.04.2015 um 21:01 schrieb Dr. David Alan Gilbert:
> * Peter Lieven (pl@kamp.de) wrote:
>> Am 07.04.2015 um 17:29 schrieb Dr. David Alan Gilbert:
>>> * Peter Lieven (pl@kamp.de) wrote:
>>>> Hi David,
>>>>
>>>> Am 07.04.2015 um 10:43 schrieb Dr. David Alan Gilbert:
>>>>>>>> Any particular workload or reproducer?
>>>>>>> Workload is almost zero. I try to figure out if there is a way to trigger it.
>>>>>>>
>>>>>>> Maybe playing a role: Machine type is -M pc1.2 and we set -kvmclock as
>>>>>>> CPU flag since kvmclock seemed to be quite buggy in 2.6.16...
>>>>>>>
>>>>>>> Exact cmdline is:
>>>>>>> /usr/bin/qemu-2.2.1  -enable-kvm  -M pc-1.2  -nodefaults -netdev type=tap,id=guest2,script=no,downscript=no,ifname=tap2  -device e1000,netdev=guest2,mac=52:54:00:ff:00:65 -drive format=raw,file=iscsi://172.21.200.53/iqn.2001-05.com.equallogic:4-52aed6-88a7e99a4-d9e00040fdc509a3-XXX-hd0/0,if=ide,cache=writeback,aio=native  -serial null  -parallel null  -m 1024 -smp 2,sockets=1,cores=2,threads=1  -monitor tcp:0:4003,server,nowait -vnc :3 -qmp tcp:0:3003,server,nowait -name 'XXX' -boot order=c,once=dc,menu=off  -drive index=2,media=cdrom,if=ide,cache=unsafe,aio=native,readonly=on  -k de  -incoming tcp:0:5003  -pidfile /var/run/qemu/vm-146.pid  -mem-path /hugepages  -mem-prealloc  -rtc base=utc -usb -usbdevice tablet -no-hpet -vga cirrus  -cpu qemu64,-kvmclock
>>>>>>>
>>>>>>> Exact kernel is:
>>>>>>> 2.6.16.46-0.12-smp (i think this is SLES10 or sth.)
>>>>>>>
>>>>>>> The machine does not hang. It seems just I/O is hanging. So you can type at the console or ping the system, but no longer login.
>>>>>>>
>>>>>>> Thank you,
>>>>>>> Peter
>>>>>> Interesting observation: Migrating the vServer again seems to fix to problem (at least in one case I could test just now).
>>>>>>
>>>>>> 2.6.8-24-smp is also affected.
>>>>> How often does it fail - you say 'sometimes' - is it a 1/10 or a 1/1000 ?
>>>> Its more often than 1/10 I would say.
>>> OK, that's not too bad - it's the 1/1000 that are really nasty to find.
>>> In your setup, how easy would it be for you to try :
>>>      with either 2.1 or current head?
>>>      with a newer machine-type?
>>>      without the cdrom?
>> Its all possible. I can clone the system and try everything on my test systems. I hope
>> it reproduces there.
> Great.  I think the order I would go would be:
>      Try head - if it works we know we've already got the fix somewhere
>      Try 2.1  - if it works we know it's something we introduced between
>                 2.1 and 2.2.1
>      Try a newer machine type - because pc-1.2 probably isn't tested much
>      CDROM at the end.

Update:
  - head -> not working
  - 2.1.3 -> not working
  - without CDROM -> not working
  - with head and no machine type specified -> not working
  - with -device isa-ide -> BIOS not booting harddisk

Will now try 1.3.1 just to be sure.

Any ideas on how to debug the IDE state after migration and/or how to check whether the issue is similar to the ATAPI
problem?

Thanks,
Peter


* Re: [Qemu-devel] [Qemu-block] Migration sometimes fails with IDE and Qemu 2.2.1
  2015-04-09 12:46                     ` Peter Lieven
@ 2015-04-09 12:50                       ` Paolo Bonzini
  0 siblings, 0 replies; 28+ messages in thread
From: Paolo Bonzini @ 2015-04-09 12:50 UTC (permalink / raw)
  To: Peter Lieven, John Snow, Dr. David Alan Gilbert; +Cc: qemu-devel, qemu-block



On 09/04/2015 14:46, Peter Lieven wrote:
>>>
>>>
>>
>> BMDMA is the PCI HBA for IDE, I think it's the default for most
>> machines that aren't using the AHCI HBA.
>>
>> To get ISA, try launching with the machine "isapc" which will force
>> it, or add the device manually, it's named "isa-ide".
>> The BMDMA PCI device is just named "ide".
> 
> Unfortunately, the BIOS can't boot if I specify device isa-ide.

Booting with ide-core.nodma=1 should have the same effect.

Paolo


* Re: [Qemu-devel] [Qemu-block] Migration sometimes fails with IDE and Qemu 2.2.1
  2015-04-09 12:49                 ` Peter Lieven
@ 2015-04-09 13:32                   ` Peter Lieven
  2015-04-09 13:43                   ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 28+ messages in thread
From: Peter Lieven @ 2015-04-09 13:32 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: Paolo Bonzini, John Snow, qemu-devel, qemu-block

Am 09.04.2015 um 14:49 schrieb Peter Lieven:
> Am 07.04.2015 um 21:01 schrieb Dr. David Alan Gilbert:
>> * Peter Lieven (pl@kamp.de) wrote:
>>> Am 07.04.2015 um 17:29 schrieb Dr. David Alan Gilbert:
>>>> * Peter Lieven (pl@kamp.de) wrote:
>>>>> Hi David,
>>>>>
>>>>> Am 07.04.2015 um 10:43 schrieb Dr. David Alan Gilbert:
>>>>>>>>> Any particular workload or reproducer?
>>>>>>>> Workload is almost zero. I try to figure out if there is a way to trigger it.
>>>>>>>>
>>>>>>>> Maybe playing a role: Machine type is -M pc1.2 and we set -kvmclock as
>>>>>>>> CPU flag since kvmclock seemed to be quite buggy in 2.6.16...
>>>>>>>>
>>>>>>>> Exact cmdline is:
>>>>>>>> /usr/bin/qemu-2.2.1  -enable-kvm  -M pc-1.2 -nodefaults -netdev type=tap,id=guest2,script=no,downscript=no,ifname=tap2 -device e1000,netdev=guest2,mac=52:54:00:ff:00:65 -drive 
>>>>>>>> format=raw,file=iscsi://172.21.200.53/iqn.2001-05.com.equallogic:4-52aed6-88a7e99a4-d9e00040fdc509a3-XXX-hd0/0,if=ide,cache=writeback,aio=native -serial null  -parallel null  -m 1024 -smp 2,sockets=1,cores=2,threads=1  -monitor 
>>>>>>>> tcp:0:4003,server,nowait -vnc :3 -qmp tcp:0:3003,server,nowait -name 'XXX' -boot order=c,once=dc,menu=off  -drive index=2,media=cdrom,if=ide,cache=unsafe,aio=native,readonly=on -k de  -incoming tcp:0:5003  -pidfile /var/run/qemu/vm-146.pid  
>>>>>>>> -mem-path /hugepages -mem-prealloc  -rtc base=utc -usb -usbdevice tablet -no-hpet -vga cirrus  -cpu qemu64,-kvmclock
>>>>>>>>
>>>>>>>> Exact kernel is:
>>>>>>>> 2.6.16.46-0.12-smp (i think this is SLES10 or sth.)
>>>>>>>>
>>>>>>>> The machine does not hang. It seems just I/O is hanging. So you can type at the console or ping the system, but no longer login.
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>> Peter
>>>>>>> Interesting observation: Migrating the vServer again seems to fix to problem (at least in one case I could test just now).
>>>>>>>
>>>>>>> 2.6.8-24-smp is also affected.
>>>>>> How often does it fail - you say 'sometimes' - is it a 1/10 or a 1/1000 ?
>>>>> Its more often than 1/10 I would say.
>>>> OK, that's not too bad - it's the 1/1000 that are really nasty to find.
>>>> In your setup, how easy would it be for you to try :
>>>>      with either 2.1 or current head?
>>>>      with a newer machine-type?
>>>>      without the cdrom?
>>> Its all possible. I can clone the system and try everything on my test systems. I hope
>>> it reproduces there.
>> Great.  I think the order I would go would be:
>>      Try head - if it works we know we've already got the fix somewhere
>>      Try 2.1  - if it works we know it's something we introduced between
>>                 2.1 and 2.2.1
>>      Try a newer machine type - because pc-1.2 probably isn't tested much
>>      CDROM at the end.
>
> Update:
>  - head -> not working
>  - 2.1.3 -> not working
>  - without CROM -> not working
>  - with head and no machine type specified -> not working
>  - with -device isa-ide -> BIOS not booting harddisk
>
> Will now try 1.3.1 just to be sure.

1.3.1 => not working
kernel parameter ide=nodma (ide-core.nodma is not supported by this kernel) => not working.

In my test setup the guest usually hangs after around 50-80 migrations. In production it seems to happen more often.
Maybe it is load dependent.
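
A rough way to judge how many clean migrations are needed before a configuration can be called unaffected, as a Python sketch. It assumes each migration hangs independently with a constant probability p; taking p = 1/65 (the middle of the 50-80 range) is my assumption, and the function names are hypothetical.

```python
import math


def p_at_least_one_hang(p: float, n: int) -> float:
    """Probability of seeing at least one hang in n independent migrations."""
    return 1.0 - (1.0 - p) ** n


def clean_runs_needed(p: float, confidence: float = 0.95) -> int:
    """Smallest n such that n clean migrations would be seen with
    probability at most (1 - confidence) if the bug were still present."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))
```

With p = 1/65 this comes out at a bit under 200 migrations for 95% confidence, so a handful of clean runs with a changed setting says very little either way.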

Peter


* Re: [Qemu-devel] [Qemu-block] Migration sometimes fails with IDE and Qemu 2.2.1
  2015-04-09 12:49                 ` Peter Lieven
  2015-04-09 13:32                   ` Peter Lieven
@ 2015-04-09 13:43                   ` Dr. David Alan Gilbert
  2015-04-09 14:54                     ` Peter Lieven
  1 sibling, 1 reply; 28+ messages in thread
From: Dr. David Alan Gilbert @ 2015-04-09 13:43 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Paolo Bonzini, John Snow, qemu-devel, qemu-block

* Peter Lieven (pl@kamp.de) wrote:
> Am 07.04.2015 um 21:01 schrieb Dr. David Alan Gilbert:
> >* Peter Lieven (pl@kamp.de) wrote:
> >>Am 07.04.2015 um 17:29 schrieb Dr. David Alan Gilbert:
> >>>* Peter Lieven (pl@kamp.de) wrote:
> >>>>Hi David,
> >>>>
> >>>>Am 07.04.2015 um 10:43 schrieb Dr. David Alan Gilbert:
> >>>>>>>>Any particular workload or reproducer?
> >>>>>>>Workload is almost zero. I try to figure out if there is a way to trigger it.
> >>>>>>>
> >>>>>>>Maybe playing a role: Machine type is -M pc1.2 and we set -kvmclock as
> >>>>>>>CPU flag since kvmclock seemed to be quite buggy in 2.6.16...
> >>>>>>>
> >>>>>>>Exact cmdline is:
> >>>>>>>/usr/bin/qemu-2.2.1  -enable-kvm  -M pc-1.2  -nodefaults -netdev type=tap,id=guest2,script=no,downscript=no,ifname=tap2  -device e1000,netdev=guest2,mac=52:54:00:ff:00:65 -drive format=raw,file=iscsi://172.21.200.53/iqn.2001-05.com.equallogic:4-52aed6-88a7e99a4-d9e00040fdc509a3-XXX-hd0/0,if=ide,cache=writeback,aio=native  -serial null  -parallel null  -m 1024 -smp 2,sockets=1,cores=2,threads=1  -monitor tcp:0:4003,server,nowait -vnc :3 -qmp tcp:0:3003,server,nowait -name 'XXX' -boot order=c,once=dc,menu=off  -drive index=2,media=cdrom,if=ide,cache=unsafe,aio=native,readonly=on  -k de  -incoming tcp:0:5003  -pidfile /var/run/qemu/vm-146.pid  -mem-path /hugepages  -mem-prealloc  -rtc base=utc -usb -usbdevice tablet -no-hpet -vga cirrus  -cpu qemu64,-kvmclock
> >>>>>>>
> >>>>>>>Exact kernel is:
> >>>>>>>2.6.16.46-0.12-smp (i think this is SLES10 or sth.)
> >>>>>>>
> >>>>>>>The machine does not hang. It seems just I/O is hanging. So you can type at the console or ping the system, but no longer login.
> >>>>>>>
> >>>>>>>Thank you,
> >>>>>>>Peter
> >>>>>>Interesting observation: Migrating the vServer again seems to fix to problem (at least in one case I could test just now).
> >>>>>>
> >>>>>>2.6.8-24-smp is also affected.
> >>>>>How often does it fail - you say 'sometimes' - is it a 1/10 or a 1/1000 ?
> >>>>Its more often than 1/10 I would say.
> >>>OK, that's not too bad - it's the 1/1000 that are really nasty to find.
> >>>In your setup, how easy would it be for you to try :
> >>>     with either 2.1 or current head?
> >>>     with a newer machine-type?
> >>>     without the cdrom?
> >>Its all possible. I can clone the system and try everything on my test systems. I hope
> >>it reproduces there.
> >Great.  I think the order I would go would be:
> >     Try head - if it works we know we've already got the fix somewhere
> >     Try 2.1  - if it works we know it's something we introduced between
> >                2.1 and 2.2.1
> >     Try a newer machine type - because pc-1.2 probably isn't tested much
> >     CDROM at the end.
> 
> Update:
>  - head -> not working
>  - 2.1.3 -> not working
>  - without CROM -> not working
>  - with head and no machine type specified -> not working
>  - with -device isa-ide -> BIOS not booting harddisk

Well, at least it's consistent....

> Will now try 1.3.1 just to be sure.
> 
> Any ideas how to debug the IDE state after migration and/or check if the issue is similar to the ATAPI IDE
> problem?

It's unlikely to be quite the same - most of the ATAPI problems were related to ATAPI
being quite separate and not saving much state.

The way I found the CDROM problems was to turn on most of the debugging in the ide and bmdma code
and on a failed migrate try and see what the state of any IO was at the point it migrated.

One other thing to check; I found the newer kernel code recovers better after
IDE problems; so on a newer guest kernel are there any log warnings about IDE problems,
even if the guests are otherwise apparently happy?

Dave

> Thanks,
> Peter
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] [Qemu-block] Migration sometimes fails with IDE and Qemu 2.2.1
  2015-04-09 13:43                   ` Dr. David Alan Gilbert
@ 2015-04-09 14:54                     ` Peter Lieven
  2015-04-09 15:17                       ` Paolo Bonzini
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Lieven @ 2015-04-09 14:54 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: Paolo Bonzini, John Snow, qemu-devel, qemu-block

Am 09.04.2015 um 15:43 schrieb Dr. David Alan Gilbert:
> * Peter Lieven (pl@kamp.de) wrote:
>> Am 07.04.2015 um 21:01 schrieb Dr. David Alan Gilbert:
>>> * Peter Lieven (pl@kamp.de) wrote:
>>>> Am 07.04.2015 um 17:29 schrieb Dr. David Alan Gilbert:
>>>>> * Peter Lieven (pl@kamp.de) wrote:
>>>>>> Hi David,
>>>>>>
>>>>>> Am 07.04.2015 um 10:43 schrieb Dr. David Alan Gilbert:
>>>>>>>>>> Any particular workload or reproducer?
>>>>>>>>> Workload is almost zero. I try to figure out if there is a way to trigger it.
>>>>>>>>>
>>>>>>>>> Maybe playing a role: Machine type is -M pc1.2 and we set -kvmclock as
>>>>>>>>> CPU flag since kvmclock seemed to be quite buggy in 2.6.16...
>>>>>>>>>
>>>>>>>>> Exact cmdline is:
>>>>>>>>> /usr/bin/qemu-2.2.1  -enable-kvm  -M pc-1.2  -nodefaults -netdev type=tap,id=guest2,script=no,downscript=no,ifname=tap2  -device e1000,netdev=guest2,mac=52:54:00:ff:00:65 -drive format=raw,file=iscsi://172.21.200.53/iqn.2001-05.com.equallogic:4-52aed6-88a7e99a4-d9e00040fdc509a3-XXX-hd0/0,if=ide,cache=writeback,aio=native  -serial null  -parallel null  -m 1024 -smp 2,sockets=1,cores=2,threads=1  -monitor tcp:0:4003,server,nowait -vnc :3 -qmp tcp:0:3003,server,nowait -name 'XXX' -boot order=c,once=dc,menu=off  -drive index=2,media=cdrom,if=ide,cache=unsafe,aio=native,readonly=on  -k de  -incoming tcp:0:5003  -pidfile /var/run/qemu/vm-146.pid  -mem-path /hugepages  -mem-prealloc  -rtc base=utc -usb -usbdevice tablet -no-hpet -vga cirrus  -cpu qemu64,-kvmclock
>>>>>>>>>
>>>>>>>>> Exact kernel is:
>>>>>>>>> 2.6.16.46-0.12-smp (i think this is SLES10 or sth.)
>>>>>>>>>
>>>>>>>>> The machine does not hang. It seems just I/O is hanging. So you can type at the console or ping the system, but no longer login.
>>>>>>>>>
>>>>>>>>> Thank you,
>>>>>>>>> Peter
>>>>>>>> Interesting observation: Migrating the vServer again seems to fix to problem (at least in one case I could test just now).
>>>>>>>>
>>>>>>>> 2.6.8-24-smp is also affected.
>>>>>>> How often does it fail - you say 'sometimes' - is it a 1/10 or a 1/1000 ?
>>>>>> Its more often than 1/10 I would say.
>>>>> OK, that's not too bad - it's the 1/1000 that are really nasty to find.
>>>>> In your setup, how easy would it be for you to try :
>>>>>      with either 2.1 or current head?
>>>>>      with a newer machine-type?
>>>>>      without the cdrom?
>>>> Its all possible. I can clone the system and try everything on my test systems. I hope
>>>> it reproduces there.
>>> Great.  I think the order I would go would be:
>>>      Try head - if it works we know we've already got the fix somewhere
>>>      Try 2.1  - if it works we know it's something we introduced between
>>>                 2.1 and 2.2.1
>>>      Try a newer machine type - because pc-1.2 probably isn't tested much
>>>      CDROM at the end.
>> Update:
>>   - head -> not working
>>   - 2.1.3 -> not working
>>   - without CROM -> not working
>>   - with head and no machine type specified -> not working
>>   - with -device isa-ide -> BIOS not booting harddisk
> Well, at least it's consistent....
>
>> Will now try 1.3.1 just to be sure.
>>
>> Any ideas how to debug the IDE state after migration and/or check if the issue is similar to the ATAPI IDE
>> problem?
> It's unlikely to be quite the same - most of the ATAPI problems were related to ATAPI
> being quite separate and not saving much state.
>
> The way I found the CDROM problems was to turn on most of the debugging in the ide and bmdma code
> and on a failed migrate try and see what the state of any IO was at the point it migrated.

That's tough. I enabled DEBUG_IDE and DEBUG_AIO at first. But I have never debugged IDE before, so I first
have to understand how that works....

What debugging confirms is that the IDE interface indeed stalls completely.

One thing I found curious in pci.c:

#define BM_MIGRATION_COMPAT_STATUS_BITS \
         (IDE_RETRY_DMA | IDE_RETRY_PIO | \
         IDE_RETRY_READ | IDE_RETRY_FLUSH)

Why is there no IDE_RETRY_WRITE ?
Honestly, I have not yet understood what BM_MIGRATION_COMPAT_STATUS_BITS is for.

>
> One other thing to check; I found the newer kernel code recovers better after
> IDE problems; so on a newer guest kernel are there any log warnings about IDE problems,
> even if the guests are otherwise apparently happy?

I will check for that.

Thanks,
Peter

* Re: [Qemu-devel] [Qemu-block] Migration sometimes fails with IDE and Qemu 2.2.1
  2015-04-09 14:54                     ` Peter Lieven
@ 2015-04-09 15:17                       ` Paolo Bonzini
  2015-04-11 13:11                         ` Peter Lieven
  0 siblings, 1 reply; 28+ messages in thread
From: Paolo Bonzini @ 2015-04-09 15:17 UTC (permalink / raw)
  To: Peter Lieven, Dr. David Alan Gilbert; +Cc: John Snow, qemu-devel, qemu-block



On 09/04/2015 16:54, Peter Lieven wrote:
> 
> #define BM_MIGRATION_COMPAT_STATUS_BITS \
>         (IDE_RETRY_DMA | IDE_RETRY_PIO | \
>         IDE_RETRY_READ | IDE_RETRY_FLUSH)
> 
> Why is there no IDE_RETRY_WRITE ?

Because a write is represented by neither the read nor the flush bit being set. :)

> Honestly, I have not yet understood that that
> BM_MIGRATION_COMPAT_STATUS_BITS is for.

It's just for migrations while the VM is stopped due to I/O errors
(rerror=stop/werror=stop).
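
To make the encoding concrete, here is a minimal C sketch of how such retry bits could be decoded. The bit values are assumed for illustration only (check hw/ide/internal.h in your tree), retry_op_of() and enum retry_op are hypothetical names and not QEMU functions, and any other retry flags that may exist are simply ignored:

```c
/* Assumed values for illustration only; verify against hw/ide/internal.h. */
#define IDE_RETRY_DMA   0x08
#define IDE_RETRY_PIO   0x10
#define IDE_RETRY_READ  0x20
#define IDE_RETRY_FLUSH 0x40

#define BM_MIGRATION_COMPAT_STATUS_BITS \
        (IDE_RETRY_DMA | IDE_RETRY_PIO | \
        IDE_RETRY_READ | IDE_RETRY_FLUSH)

/* Hypothetical result type, not part of QEMU. */
enum retry_op { RETRY_NONE, RETRY_WRITE, RETRY_READ, RETRY_FLUSH };

/* Hypothetical helper: which operation does a pending retry stand for? */
static enum retry_op retry_op_of(unsigned status)
{
    unsigned retry = status & BM_MIGRATION_COMPAT_STATUS_BITS;

    if (retry == 0) {
        return RETRY_NONE;      /* nothing is waiting to be retried */
    }
    if (retry & IDE_RETRY_FLUSH) {
        return RETRY_FLUSH;
    }
    if (retry & IDE_RETRY_READ) {
        return RETRY_READ;
    }
    /* DMA or PIO is set, but neither READ nor FLUSH: an implicit write. */
    return RETRY_WRITE;
}
```

With these assumed values, a status of IDE_RETRY_DMA alone decodes as a write, while IDE_RETRY_DMA | IDE_RETRY_READ decodes as a read.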

Paolo

* Re: [Qemu-devel] [Qemu-block] Migration sometimes fails with IDE and Qemu 2.2.1
  2015-04-09 15:17                       ` Paolo Bonzini
@ 2015-04-11 13:11                         ` Peter Lieven
  2015-04-11 15:00                           ` Peter Lieven
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Lieven @ 2015-04-11 13:11 UTC (permalink / raw)
  To: Paolo Bonzini, Dr. David Alan Gilbert; +Cc: John Snow, qemu-devel, qemu-block

Am 09.04.2015 um 17:17 schrieb Paolo Bonzini:
>
> On 09/04/2015 16:54, Peter Lieven wrote:
>> #define BM_MIGRATION_COMPAT_STATUS_BITS \
>>         (IDE_RETRY_DMA | IDE_RETRY_PIO | \
>>         IDE_RETRY_READ | IDE_RETRY_FLUSH)
>>
>> Why is there no IDE_RETRY_WRITE ?
> Because that's represented by none of read and flush being set. :)
>
>> Honestly, I have not yet understood that that
>> BM_MIGRATION_COMPAT_STATUS_BITS is for.
> It's just for migrations while the VM is stopped due to I/O errors
> (rerror=stop/werror=stop).

My migration problem seems to be a regression or incompatibility in kvm-kmod. I accidentally started
debugging with an old kvm module. It seems to work with the old module shipped
with the kernel (3.13) and fails with (3.19).

Any ideas, Paolo?

Peter

* Re: [Qemu-devel] [Qemu-block] Migration sometimes fails with IDE and Qemu 2.2.1
  2015-04-11 13:11                         ` Peter Lieven
@ 2015-04-11 15:00                           ` Peter Lieven
  2015-04-13  7:20                             ` Peter Lieven
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Lieven @ 2015-04-11 15:00 UTC (permalink / raw)
  To: Paolo Bonzini, Dr. David Alan Gilbert; +Cc: John Snow, qemu-devel, qemu-block

Am 11.04.2015 um 15:11 schrieb Peter Lieven:
> Am 09.04.2015 um 17:17 schrieb Paolo Bonzini:
>> On 09/04/2015 16:54, Peter Lieven wrote:
>>> #define BM_MIGRATION_COMPAT_STATUS_BITS \
>>>         (IDE_RETRY_DMA | IDE_RETRY_PIO | \
>>>         IDE_RETRY_READ | IDE_RETRY_FLUSH)
>>>
>>> Why is there no IDE_RETRY_WRITE ?
>> Because that's represented by none of read and flush being set. :)
>>
>>> Honestly, I have not yet understood that that
>>> BM_MIGRATION_COMPAT_STATUS_BITS is for.
>> It's just for migrations while the VM is stopped due to I/O errors
>> (rerror=stop/werror=stop).
> My migration problem seems to be a regression or incompatiblity in kvm-kmod. I started debugging
> with an old kvm module accidently. It seems to work with the old module shipped
> with the kernel (3.13) and fails with (3.19).

3.17 (kvm-kmod master) also seems to work. I had to move to 3.19 some time ago to
mitigate another bug that triggered a new check in Qemu.

kvm-kmod next currently does not compile under my 3.13 host kernel. And according to
the buildbot output for kvm-kmod it seems to fail for almost all kernels <= 3.18.

I will keep my tests running with 3.17 kvm-kmod. Currently it has done nearly 1000 migrations
in a row without crashing.
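
As a rough sanity check on that number: if the hang struck about once every 50-80 migrations with the failing module, and assuming each migration fails independently with a fixed probability (an assumption, not something measured here), then 1000 clean migrations by pure luck would be extremely unlikely. A small C sketch, where prob_all_ok() is just an illustrative helper:

```c
/* Probability that n independent migrations all succeed when each one
 * fails with probability 1/one_in (e.g. one_in = 80 means 1 in 80). */
static double prob_all_ok(unsigned n, double one_in)
{
    double p_ok = 1.0 - 1.0 / one_in;
    double p = 1.0;
    unsigned i;

    for (i = 0; i < n; i++) {
        p *= p_ok;              /* one more successful migration */
    }
    return p;
}
```

Even for the optimistic 1-in-80 rate, prob_all_ok(1000, 80.0) comes out well below 1e-5, so a run of nearly 1000 successes is very unlikely to be chance.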

Peter

* Re: [Qemu-devel] [Qemu-block] Migration sometimes fails with IDE and Qemu 2.2.1
  2015-04-11 15:00                           ` Peter Lieven
@ 2015-04-13  7:20                             ` Peter Lieven
  0 siblings, 0 replies; 28+ messages in thread
From: Peter Lieven @ 2015-04-13  7:20 UTC (permalink / raw)
  To: Paolo Bonzini, Dr. David Alan Gilbert
  Cc: Marcelo Tosatti, John Snow, qemu-devel, qemu-block

Am 11.04.2015 um 17:00 schrieb Peter Lieven:
> Am 11.04.2015 um 15:11 schrieb Peter Lieven:
>> Am 09.04.2015 um 17:17 schrieb Paolo Bonzini:
>>> On 09/04/2015 16:54, Peter Lieven wrote:
>>>> #define BM_MIGRATION_COMPAT_STATUS_BITS \
>>>>          (IDE_RETRY_DMA | IDE_RETRY_PIO | \
>>>>          IDE_RETRY_READ | IDE_RETRY_FLUSH)
>>>>
>>>> Why is there no IDE_RETRY_WRITE ?
>>> Because that's represented by none of read and flush being set. :)
>>>
>>>> Honestly, I have not yet understood that that
>>>> BM_MIGRATION_COMPAT_STATUS_BITS is for.
>>> It's just for migrations while the VM is stopped due to I/O errors
>>> (rerror=stop/werror=stop).
>> My migration problem seems to be a regression or incompatiblity in kvm-kmod. I started debugging
>> with an old kvm module accidently. It seems to work with the old module shipped
>> with the kernel (3.13) and fails with (3.19).
> 3.17 (kvm-kmod master) also seems to work. I had to move to 3.19 some time ago to
> mititage another bug that triggered a new check in Qemu.
>
> kvm-kmod next currently does not compile under my 3.13 host kernel. And according to
> the buildbot output for kvm-kmod it seems to fail for almost all kernels <= 3.18.

Meanwhile I managed to compile kvm-kmod next. The bug is still there.

I will now try kvm-kmod master with

KVM: x86: update masterclock values on TSC writes

on top.

Help appreciated.

Thanks,
Peter

Thread overview: 28+ messages
2015-04-06 18:47 [Qemu-devel] Migration sometimes fails with IDE and Qemu 2.2.1 Peter Lieven
2015-04-06 18:50 ` [Qemu-devel] [Qemu-block] " John Snow
2015-04-06 19:02   ` Peter Lieven
2015-04-06 19:10     ` Peter Lieven
2015-04-07  8:43       ` Dr. David Alan Gilbert
2015-04-07 15:11         ` Peter Lieven
2015-04-07 15:14           ` Paolo Bonzini
2015-04-07 18:54             ` Peter Lieven
2015-04-07 15:29           ` Dr. David Alan Gilbert
2015-04-07 18:44             ` Peter Lieven
2015-04-07 18:56               ` John Snow
2015-04-07 19:02                 ` Peter Lieven
2015-04-07 19:13                   ` John Snow
2015-04-09  6:34                     ` Peter Lieven
2015-04-09 12:46                     ` Peter Lieven
2015-04-09 12:50                       ` Paolo Bonzini
2015-04-07 19:01               ` Dr. David Alan Gilbert
2015-04-07 19:04                 ` Peter Lieven
2015-04-09 12:49                 ` Peter Lieven
2015-04-09 13:32                   ` Peter Lieven
2015-04-09 13:43                   ` Dr. David Alan Gilbert
2015-04-09 14:54                     ` Peter Lieven
2015-04-09 15:17                       ` Paolo Bonzini
2015-04-11 13:11                         ` Peter Lieven
2015-04-11 15:00                           ` Peter Lieven
2015-04-13  7:20                             ` Peter Lieven
2015-04-07 20:05               ` Paolo Bonzini
2015-04-09  6:43                 ` Peter Lieven
