* [Xen-devel] [xen-4.13-testing test] 144736: regressions - FAIL
@ 2019-12-12 22:35 osstest service owner
  2019-12-13  8:31 ` Jürgen Groß
  0 siblings, 1 reply; 16+ messages in thread
From: osstest service owner @ 2019-12-12 22:35 UTC (permalink / raw)
  To: xen-devel, osstest-admin

flight 144736 xen-4.13-testing real [real]
http://logs.test-lab.xenproject.org/osstest/logs/144736/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 test-arm64-arm64-xl-credit1   7 xen-boot                 fail REGR. vs. 144673
 test-armhf-armhf-xl-vhd      18 leak-check/check         fail REGR. vs. 144673

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-xl-rtds     18 guest-localmigrate/x10       fail  like 144673
 test-armhf-armhf-xl-rtds     16 guest-start/debian.repeat    fail  like 144673
 test-amd64-i386-xl-pvshim    12 guest-start                  fail   never pass
 test-amd64-amd64-libvirt     13 migrate-support-check        fail   never pass
 test-arm64-arm64-xl-seattle  13 migrate-support-check        fail   never pass
 test-arm64-arm64-xl-seattle  14 saverestore-support-check    fail   never pass
 test-amd64-i386-libvirt-xsm  13 migrate-support-check        fail   never pass
 test-amd64-i386-libvirt      13 migrate-support-check        fail   never pass
 test-amd64-amd64-libvirt-xsm 13 migrate-support-check        fail   never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 11 migrate-support-check fail never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 11 migrate-support-check fail never pass
 test-amd64-amd64-qemuu-nested-amd 17 debian-hvm-install/l1/l2  fail never pass
 test-arm64-arm64-xl-xsm      13 migrate-support-check        fail   never pass
 test-arm64-arm64-xl-xsm      14 saverestore-support-check    fail   never pass
 test-arm64-arm64-xl          13 migrate-support-check        fail   never pass
 test-arm64-arm64-xl          14 saverestore-support-check    fail   never pass
 test-arm64-arm64-xl-credit2  13 migrate-support-check        fail   never pass
 test-arm64-arm64-xl-credit2  14 saverestore-support-check    fail   never pass
 test-arm64-arm64-libvirt-xsm 13 migrate-support-check        fail   never pass
 test-arm64-arm64-libvirt-xsm 14 saverestore-support-check    fail   never pass
 test-arm64-arm64-xl-thunderx 13 migrate-support-check        fail   never pass
 test-arm64-arm64-xl-thunderx 14 saverestore-support-check    fail   never pass
 test-amd64-amd64-xl-qemuu-win7-amd64 17 guest-stop             fail never pass
 test-amd64-amd64-libvirt-vhd 12 migrate-support-check        fail   never pass
 test-amd64-i386-xl-qemuu-win7-amd64 17 guest-stop              fail never pass
 test-amd64-i386-xl-qemut-win7-amd64 17 guest-stop              fail never pass
 test-armhf-armhf-xl-rtds     13 migrate-support-check        fail   never pass
 test-armhf-armhf-xl-rtds     14 saverestore-support-check    fail   never pass
 test-armhf-armhf-xl          13 migrate-support-check        fail   never pass
 test-armhf-armhf-xl          14 saverestore-support-check    fail   never pass
 test-armhf-armhf-xl-credit2  13 migrate-support-check        fail   never pass
 test-armhf-armhf-xl-credit2  14 saverestore-support-check    fail   never pass
 test-armhf-armhf-xl-cubietruck 13 migrate-support-check        fail never pass
 test-armhf-armhf-xl-multivcpu 13 migrate-support-check        fail  never pass
 test-armhf-armhf-xl-cubietruck 14 saverestore-support-check    fail never pass
 test-armhf-armhf-xl-multivcpu 14 saverestore-support-check    fail  never pass
 test-armhf-armhf-libvirt     13 migrate-support-check        fail   never pass
 test-armhf-armhf-libvirt     14 saverestore-support-check    fail   never pass
 test-amd64-i386-xl-qemut-ws16-amd64 17 guest-stop              fail never pass
 test-amd64-amd64-xl-qemut-win7-amd64 17 guest-stop             fail never pass
 test-armhf-armhf-libvirt-raw 12 migrate-support-check        fail   never pass
 test-armhf-armhf-libvirt-raw 13 saverestore-support-check    fail   never pass
 test-amd64-i386-xl-qemuu-ws16-amd64 17 guest-stop              fail never pass
 test-armhf-armhf-xl-vhd      12 migrate-support-check        fail   never pass
 test-armhf-armhf-xl-vhd      13 saverestore-support-check    fail   never pass
 test-amd64-amd64-xl-qemuu-ws16-amd64 17 guest-stop             fail never pass
 test-armhf-armhf-xl-arndale  13 migrate-support-check        fail   never pass
 test-armhf-armhf-xl-arndale  14 saverestore-support-check    fail   never pass
 test-armhf-armhf-xl-credit1  13 migrate-support-check        fail   never pass
 test-armhf-armhf-xl-credit1  14 saverestore-support-check    fail   never pass
 test-amd64-amd64-xl-qemut-ws16-amd64 17 guest-stop             fail never pass

version targeted for testing:
 xen                  ecd3e34ff88b4a8130e7bc6dc18b09682ac3da2b
baseline version:
 xen                  b0f0bbca95bd532212fb1956f3e23d1ab13a53cf

Last test of basis   144673  2019-12-10 19:07:50 Z    2 days
Failing since        144708  2019-12-11 11:38:22 Z    1 days    2 attempts
Testing same since   144736  2019-12-11 19:06:02 Z    1 days    1 attempts

------------------------------------------------------------
People who touched revisions under test:
  Andrew Cooper <andrew.cooper3@citrix.com>
  George Dunlap <george.dunlap@citrix.com>
  Jan Beulich <jbeulich@suse.com>
  Juergen Gross <jgross@suse.com>
  Julien Grall <julien@xen.org>
  Kevin Tian <kevin.tian@intel.com>

jobs:
 build-amd64-xsm                                              pass    
 build-arm64-xsm                                              pass    
 build-i386-xsm                                               pass    
 build-amd64-xtf                                              pass    
 build-amd64                                                  pass    
 build-arm64                                                  pass    
 build-armhf                                                  pass    
 build-i386                                                   pass    
 build-amd64-libvirt                                          pass    
 build-arm64-libvirt                                          pass    
 build-armhf-libvirt                                          pass    
 build-i386-libvirt                                           pass    
 build-amd64-prev                                             pass    
 build-i386-prev                                              pass    
 build-amd64-pvops                                            pass    
 build-arm64-pvops                                            pass    
 build-armhf-pvops                                            pass    
 build-i386-pvops                                             pass    
 test-xtf-amd64-amd64-1                                       pass    
 test-xtf-amd64-amd64-2                                       pass    
 test-xtf-amd64-amd64-3                                       pass    
 test-xtf-amd64-amd64-4                                       pass    
 test-xtf-amd64-amd64-5                                       pass    
 test-amd64-amd64-xl                                          pass    
 test-arm64-arm64-xl                                          pass    
 test-armhf-armhf-xl                                          pass    
 test-amd64-i386-xl                                           pass    
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm           pass    
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm            pass    
 test-amd64-amd64-xl-qemut-stubdom-debianhvm-amd64-xsm        pass    
 test-amd64-i386-xl-qemut-stubdom-debianhvm-amd64-xsm         pass    
 test-amd64-amd64-xl-qemut-debianhvm-i386-xsm                 pass    
 test-amd64-i386-xl-qemut-debianhvm-i386-xsm                  pass    
 test-amd64-amd64-xl-qemuu-debianhvm-i386-xsm                 pass    
 test-amd64-i386-xl-qemuu-debianhvm-i386-xsm                  pass    
 test-amd64-amd64-libvirt-xsm                                 pass    
 test-arm64-arm64-libvirt-xsm                                 pass    
 test-amd64-i386-libvirt-xsm                                  pass    
 test-amd64-amd64-xl-xsm                                      pass    
 test-arm64-arm64-xl-xsm                                      pass    
 test-amd64-i386-xl-xsm                                       pass    
 test-amd64-amd64-qemuu-nested-amd                            fail    
 test-amd64-amd64-xl-pvhv2-amd                                pass    
 test-amd64-i386-qemut-rhel6hvm-amd                           pass    
 test-amd64-i386-qemuu-rhel6hvm-amd                           pass    
 test-amd64-amd64-xl-qemut-debianhvm-amd64                    pass    
 test-amd64-i386-xl-qemut-debianhvm-amd64                     pass    
 test-amd64-amd64-xl-qemuu-debianhvm-amd64                    pass    
 test-amd64-i386-xl-qemuu-debianhvm-amd64                     pass    
 test-amd64-i386-freebsd10-amd64                              pass    
 test-amd64-amd64-xl-qemuu-ovmf-amd64                         pass    
 test-amd64-i386-xl-qemuu-ovmf-amd64                          pass    
 test-amd64-amd64-xl-qemut-win7-amd64                         fail    
 test-amd64-i386-xl-qemut-win7-amd64                          fail    
 test-amd64-amd64-xl-qemuu-win7-amd64                         fail    
 test-amd64-i386-xl-qemuu-win7-amd64                          fail    
 test-amd64-amd64-xl-qemut-ws16-amd64                         fail    
 test-amd64-i386-xl-qemut-ws16-amd64                          fail    
 test-amd64-amd64-xl-qemuu-ws16-amd64                         fail    
 test-amd64-i386-xl-qemuu-ws16-amd64                          fail    
 test-armhf-armhf-xl-arndale                                  pass    
 test-amd64-amd64-xl-credit1                                  pass    
 test-arm64-arm64-xl-credit1                                  fail    
 test-armhf-armhf-xl-credit1                                  pass    
 test-amd64-amd64-xl-credit2                                  pass    
 test-arm64-arm64-xl-credit2                                  pass    
 test-armhf-armhf-xl-credit2                                  pass    
 test-armhf-armhf-xl-cubietruck                               pass    
 test-amd64-amd64-xl-qemuu-dmrestrict-amd64-dmrestrict        pass    
 test-amd64-i386-xl-qemuu-dmrestrict-amd64-dmrestrict         pass    
 test-amd64-i386-freebsd10-i386                               pass    
 test-amd64-amd64-qemuu-nested-intel                          pass    
 test-amd64-amd64-xl-pvhv2-intel                              pass    
 test-amd64-i386-qemut-rhel6hvm-intel                         pass    
 test-amd64-i386-qemuu-rhel6hvm-intel                         pass    
 test-amd64-amd64-libvirt                                     pass    
 test-armhf-armhf-libvirt                                     pass    
 test-amd64-i386-libvirt                                      pass    
 test-amd64-amd64-livepatch                                   pass    
 test-amd64-i386-livepatch                                    pass    
 test-amd64-amd64-migrupgrade                                 pass    
 test-amd64-i386-migrupgrade                                  pass    
 test-amd64-amd64-xl-multivcpu                                pass    
 test-armhf-armhf-xl-multivcpu                                pass    
 test-amd64-amd64-pair                                        pass    
 test-amd64-i386-pair                                         pass    
 test-amd64-amd64-libvirt-pair                                pass    
 test-amd64-i386-libvirt-pair                                 pass    
 test-amd64-amd64-amd64-pvgrub                                pass    
 test-amd64-amd64-i386-pvgrub                                 pass    
 test-amd64-amd64-xl-pvshim                                   pass    
 test-amd64-i386-xl-pvshim                                    fail    
 test-amd64-amd64-pygrub                                      pass    
 test-amd64-amd64-xl-qcow2                                    pass    
 test-armhf-armhf-libvirt-raw                                 pass    
 test-amd64-i386-xl-raw                                       pass    
 test-amd64-amd64-xl-rtds                                     fail    
 test-armhf-armhf-xl-rtds                                     fail    
 test-arm64-arm64-xl-seattle                                  pass    
 test-amd64-amd64-xl-qemuu-debianhvm-amd64-shadow             pass    
 test-amd64-i386-xl-qemuu-debianhvm-amd64-shadow              pass    
 test-amd64-amd64-xl-shadow                                   pass    
 test-amd64-i386-xl-shadow                                    pass    
 test-arm64-arm64-xl-thunderx                                 pass    
 test-amd64-amd64-libvirt-vhd                                 pass    
 test-armhf-armhf-xl-vhd                                      fail    


------------------------------------------------------------
sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
    http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
    http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
    http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
    http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Not pushing.

(No revision log; it would be 305 lines long.)

* Re: [Xen-devel] [xen-4.13-testing test] 144736: regressions - FAIL
  2019-12-12 22:35 [Xen-devel] [xen-4.13-testing test] 144736: regressions - FAIL osstest service owner
@ 2019-12-13  8:31 ` Jürgen Groß
  2019-12-13 11:14   ` Julien Grall
  0 siblings, 1 reply; 16+ messages in thread
From: Jürgen Groß @ 2019-12-13  8:31 UTC (permalink / raw)
  To: osstest service owner, xen-devel, Ian Jackson, Julien Grall

On 12.12.19 23:35, osstest service owner wrote:
> flight 144736 xen-4.13-testing real [real]
> http://logs.test-lab.xenproject.org/osstest/logs/144736/
> 
> Regressions :-(
> 
> Tests which did not succeed and are blocking,
> including tests which could not be run:
>   test-arm64-arm64-xl-credit1   7 xen-boot                 fail REGR. vs. 144673

Looking at the serial log, this looks like a hardware problem to me.

Ian, do you agree?

>   test-armhf-armhf-xl-vhd      18 leak-check/check         fail REGR. vs. 144673

That one is strange. A qemu process seems to have died, producing a
core file, but I couldn't find any log containing any other indication
of a crashed program.

And I can't believe the ARM changes in the hypervisor would result in
qemu crashing now...

Julien, could you please have a look?


Juergen

* Re: [Xen-devel] [xen-4.13-testing test] 144736: regressions - FAIL
  2019-12-13  8:31 ` Jürgen Groß
@ 2019-12-13 11:14   ` Julien Grall
  2019-12-13 11:24     ` Jürgen Groß
  2019-12-13 11:40     ` Ian Jackson
  0 siblings, 2 replies; 16+ messages in thread
From: Julien Grall @ 2019-12-13 11:14 UTC (permalink / raw)
  To: Jürgen Groß,
	osstest service owner, xen-devel, Ian Jackson,
	Stefano Stabellini

Hi Juergen,

On 13/12/2019 08:31, Jürgen Groß wrote:
> On 12.12.19 23:35, osstest service owner wrote:
>> flight 144736 xen-4.13-testing real [real]
>> http://logs.test-lab.xenproject.org/osstest/logs/144736/
>>
>> Regressions :-(
>>
>> Tests which did not succeed and are blocking,
>> including tests which could not be run:
>>   test-arm64-arm64-xl-credit1   7 xen-boot                 fail REGR. 
>> vs. 144673
> 
> Looking into the serial log this looks like a hardware problem to me.

Looking at [1], the board was able to pick up new jobs afterwards, so I
would assume this is just a temporary failure.

AMD Seattle boards (laxton*) are known to fail to boot from time to
time because of a PCI training issue. We have a workaround for it
(involving a longer power cycle), but it is not 100% reliable.

> 
> Ian, do you agree?
> 
>>   test-armhf-armhf-xl-vhd      18 leak-check/check         fail REGR. 
>> vs. 144673
> 
> That one is strange. A qemu process seems to have have died producing
> a core file, but I couldn't find any log containing any other indication
> of a crashed program.

I haven't found anything interesting in the log. @Ian, could you set up
a repro for this?

For the future, it would be worth collecting core files.

> 
> And I can't believe the ARM changes in the hypervisor would result in
> qemu crashing now...

I have seen weird behavior happen in Dom0 because of changes in Xen
before. :) For instance, get_cycles() was wrongly implemented and
resulted in network loss.

Anyway, QEMU is the same as in the previous flight. The only difference
here is in Xen:

d8538f71edc954f8c518de2f9cc9ae89ee05f6a1
x86+Arm32: make find_next_{,zero_}bit() have well defined behavior

> Julien, could you please have a look?

I don't have much idea of what's happening. A repro would be useful to
be able to do more debugging.

Cheers,

[1] http://logs.test-lab.xenproject.org/osstest/results/host/laxton0.html

-- 
Julien Grall

* Re: [Xen-devel] [xen-4.13-testing test] 144736: regressions - FAIL
  2019-12-13 11:14   ` Julien Grall
@ 2019-12-13 11:24     ` Jürgen Groß
  2019-12-13 11:28       ` Julien Grall
  2019-12-13 11:40     ` Ian Jackson
  1 sibling, 1 reply; 16+ messages in thread
From: Jürgen Groß @ 2019-12-13 11:24 UTC (permalink / raw)
  To: Julien Grall, osstest service owner, xen-devel, Ian Jackson,
	Stefano Stabellini

On 13.12.19 12:14, Julien Grall wrote:
> Hi Juergen,
> 
> On 13/12/2019 08:31, Jürgen Groß wrote:
>> On 12.12.19 23:35, osstest service owner wrote:
>>> flight 144736 xen-4.13-testing real [real]
>>> http://logs.test-lab.xenproject.org/osstest/logs/144736/
>>>
>>> Regressions :-(
>>>
>>> Tests which did not succeed and are blocking,
>>> including tests which could not be run:
>>>   test-arm64-arm64-xl-credit1   7 xen-boot                 fail REGR. 
>>> vs. 144673
>>
>> Looking into the serial log this looks like a hardware problem to me.
> 
> Looking at [1], the board were able to pick up new job. So I would 
> assume this just a temporary failure.
> 
> AMD Seattle boards (laxton*) are known to fail booting time to time 
> because of PCI training issue. We have workaround for it (involving 
> longer power cycle) but this is not 100% reliable.

I guess repeating the power cycle should work, too (especially as the
new job did work as you said)?

> 
>>
>> Ian, do you agree?
>>
>>>   test-armhf-armhf-xl-vhd      18 leak-check/check         fail REGR. 
>>> vs. 144673
>>
>> That one is strange. A qemu process seems to have have died producing
>> a core file, but I couldn't find any log containing any other indication
>> of a crashed program.
> 
> I haven't found anything interesting in the log. @Ian could you set up a 
> repro for this?
> 
> For the future, it would be worth considering to collect core files.

OSStest does:

http://logs.test-lab.xenproject.org/osstest/logs/144736/test-armhf-armhf-xl-vhd/cubietruck-metzinger---var-core-1576147280.1979.qemu-system-i38.core.gz

> 
>>
>> And I can't believe the ARM changes in the hypervisor would result in
>> qemu crashing now...
> 
> I have seen weird behavior happening in Dom0 because of changes in Xen 
> before. :) For instance, get_cycles() was wrongly implemented and 
> resulted to network loss.
> 
> Anyway,  QEMU is the same as the previous flight. The only difference 
> here is in Xen:
> 
> d8538f71edc954f8c518de2f9cc9ae89ee05f6a1
> x86+Arm32: make find_next_{,zero_}bit() have well defined behavior

Right, that was what I meant. :-)

> 
>> Julien, could you please have a look?
> 
> I don't have much idea what's happening. A repro would useful to be able 
> to do more debug.
> 
> Cheers,
> 
> [1] http://logs.test-lab.xenproject.org/osstest/results/host/laxton0.html
> 

Juergen

* Re: [Xen-devel] [xen-4.13-testing test] 144736: regressions - FAIL
  2019-12-13 11:24     ` Jürgen Groß
@ 2019-12-13 11:28       ` Julien Grall
  0 siblings, 0 replies; 16+ messages in thread
From: Julien Grall @ 2019-12-13 11:28 UTC (permalink / raw)
  To: Jürgen Groß,
	osstest service owner, xen-devel, Ian Jackson,
	Stefano Stabellini



On 13/12/2019 11:24, Jürgen Groß wrote:
> On 13.12.19 12:14, Julien Grall wrote:
>> Hi Juergen,
>>
>> On 13/12/2019 08:31, Jürgen Groß wrote:
>>> On 12.12.19 23:35, osstest service owner wrote:
>>>> flight 144736 xen-4.13-testing real [real]
>>>> http://logs.test-lab.xenproject.org/osstest/logs/144736/
>>>>
>>>> Regressions :-(
>>>>
>>>> Tests which did not succeed and are blocking,
>>>> including tests which could not be run:
>>>>   test-arm64-arm64-xl-credit1   7 xen-boot                 fail 
>>>> REGR. vs. 144673
>>>
>>> Looking into the serial log this looks like a hardware problem to me.
>>
>> Looking at [1], the board were able to pick up new job. So I would 
>> assume this just a temporary failure.
>>
>> AMD Seattle boards (laxton*) are known to fail booting time to time 
>> because of PCI training issue. We have workaround for it (involving 
>> longer power cycle) but this is not 100% reliable.
> 
> I guess repeating the power cycle should work, too (especially as the
> new job did work as you said)?
Well, how do you decide whether this is stuck because of a hardware
failure or because Xen crashed?

But with the current workaround, we already see only limited failures.
So I don't think it is worth the trouble to try power cycling again.

> 
>>
>>>
>>> Ian, do you agree?
>>>
>>>>   test-armhf-armhf-xl-vhd      18 leak-check/check         fail 
>>>> REGR. vs. 144673
>>>
>>> That one is strange. A qemu process seems to have have died producing
>>> a core file, but I couldn't find any log containing any other indication
>>> of a crashed program.
>>
>> I haven't found anything interesting in the log. @Ian could you set up 
>> a repro for this?
>>
>> For the future, it would be worth considering to collect core files.
> 
> OSStest does:
> 
> http://logs.test-lab.xenproject.org/osstest/logs/144736/test-armhf-armhf-xl-vhd/cubietruck-metzinger---var-core-1576147280.1979.qemu-system-i38.core.gz 

Damn, I didn't spot it. Sorry for the noise.

I will have a look.

Cheers,

-- 
Julien Grall

* Re: [Xen-devel] [xen-4.13-testing test] 144736: regressions - FAIL
  2019-12-13 11:14   ` Julien Grall
  2019-12-13 11:24     ` Jürgen Groß
@ 2019-12-13 11:40     ` Ian Jackson
  2019-12-13 15:36       ` Julien Grall
  1 sibling, 1 reply; 16+ messages in thread
From: Ian Jackson @ 2019-12-13 11:40 UTC (permalink / raw)
  To: Julien Grall
  Cc: Jürgen Groß,
	xen-devel, Stefano Stabellini, osstest service owner

Julien Grall writes ("Re: [Xen-devel] [xen-4.13-testing test] 144736: regressions - FAIL"):
> AMD Seattle boards (laxton*) are known to fail booting time to time 
> because of PCI training issue. We have workaround for it (involving 
> longer power cycle) but this is not 100% reliable.

This wasn't a power cycle.  It was a software-initiated reboot.  It
does appear to hang in the firmware somewhere.  Do we expect the PCI
training issue to occur in this case?

> >>   test-armhf-armhf-xl-vhd      18 leak-check/check         fail REGR. 
> >> vs. 144673
> > 
> > That one is strange. A qemu process seems to have have died producing
> > a core file, but I couldn't find any log containing any other indication
> > of a crashed program.
> 
> I haven't found anything interesting in the log. @Ian could you set up 
> a repro for this?

There is some heisenbug where qemu crashes with very low probability.
(I forget whether only on arm or on x86 too).  This has been around
for a little while.  I doubt this particular failure will be
reproducible.

Ian.

* Re: [Xen-devel] [xen-4.13-testing test] 144736: regressions - FAIL
  2019-12-13 11:40     ` Ian Jackson
@ 2019-12-13 15:36       ` Julien Grall
  2019-12-13 15:55         ` Durrant, Paul
  0 siblings, 1 reply; 16+ messages in thread
From: Julien Grall @ 2019-12-13 15:36 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Jürgen Groß,
	xen-devel, Stefano Stabellini, osstest service owner,
	Anthony Perard

+Anthony

On 13/12/2019 11:40, Ian Jackson wrote:
> Julien Grall writes ("Re: [Xen-devel] [xen-4.13-testing test] 144736: regressions - FAIL"):
>> AMD Seattle boards (laxton*) are known to fail booting time to time
>> because of PCI training issue. We have workaround for it (involving
>> longer power cycle) but this is not 100% reliable.
> 
> This wasn't a power cycle.  It was a software-initiated reboot.  It
> does appear to hang in the firmware somewhere.  Do we expect the pci
> training issue to occur in this case ?

The PCI training happens at every reset (including software resets). So
I may have confused the workaround for firmware corruption with the one
for PCI training. We definitely have a workaround for the former.

For the latter, I can't remember whether we used a new firmware or just
hoped it would not happen often.

I think we had a thread on infra@ about the workaround some time last
year. Sadly it was sent to my Arm e-mail address and I didn't archive
it before leaving :(. Can you have a look and see if you can find the
thread?

> 
>>>>    test-armhf-armhf-xl-vhd      18 leak-check/check         fail REGR.
>>>> vs. 144673
>>>
>>> That one is strange. A qemu process seems to have have died producing
>>> a core file, but I couldn't find any log containing any other indication
>>> of a crashed program.
>>
>> I haven't found anything interesting in the log. @Ian could you set up
>> a repro for this?
> 
> There is some heisenbug where qemu crashes with very low probability.
> (I forget whether only on arm or on x86 too).  This has been around
> for a little while.  I doubt this particular failure will be
> reproducible.	

I can't remember such a bug being reported on Arm before. Anyway, I
managed to get the stack trace from gdb:

Core was generated by `/usr/local/lib/xen/bin/qemu-system-i386 
-xen-domid 1 -chardev socket,id=libxl-c'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x006342be in xen_block_handle_requests (dataplane=0x108e600) at 
/home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/hw/block/dataplane/xen-block.c:531
531 
/home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/hw/block/dataplane/xen-block.c: 
No such file or directory.
[Current thread is 1 (LWP 1987)]
(gdb) bt
#0  0x006342be in xen_block_handle_requests (dataplane=0x108e600) at 
/home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/hw/block/dataplane/xen-block.c:531
#1  0x0063447c in xen_block_dataplane_event (opaque=0x108e600) at 
/home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/hw/block/dataplane/xen-block.c:626
#2  0x008d005c in xen_device_poll (opaque=0x107a3b0) at 
/home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/hw/xen/xen-bus.c:1077
#3  0x00a4175c in run_poll_handlers_once (ctx=0x1079708, 
timeout=0xb1ba17f8) at 
/home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/util/aio-posix.c:520
#4  0x00a41826 in run_poll_handlers (ctx=0x1079708, max_ns=8000, 
timeout=0xb1ba17f8) at 
/home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/util/aio-posix.c:562
#5  0x00a41956 in try_poll_mode (ctx=0x1079708, timeout=0xb1ba17f8) at 
/home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/util/aio-posix.c:597
#6  0x00a41a2c in aio_poll (ctx=0x1079708, blocking=true) at 
/home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/util/aio-posix.c:639
#7  0x0071dc16 in iothread_run (opaque=0x107d328) at 
/home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/iothread.c:75
#8  0x00a44c80 in qemu_thread_start (args=0x1079538) at 
/home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/util/qemu-thread-posix.c:502
#9  0xb67ae5d8 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

This feels like a race condition between the init/free code and the
handler. Anthony, does it ring any bells?
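
For what it's worth, my reading of frame #0 is that the request loop is
still consulting the shared ring, so any teardown that has already
unmapped the ring would explain the SIGSEGV. Roughly (illustrative
only -- the field and macro names follow the public blkif ring
conventions, the exact code at xen-block.c:531 may differ):

/* Illustrative sketch, not the actual QEMU source: the ring macro reads
 * the producer index through dataplane->rings.common.sring, so if the
 * stop path has already unmapped the ring and NULLed sring, this
 * dereferences a NULL pointer. */
static bool handle_requests_sketch(XenBlockDataPlane *dataplane)
{
    return RING_HAS_UNCONSUMED_REQUESTS(&dataplane->rings.common);
}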

Cheers,

-- 
Julien Grall

* Re: [Xen-devel] [xen-4.13-testing test] 144736: regressions - FAIL
  2019-12-13 15:36       ` Julien Grall
@ 2019-12-13 15:55         ` Durrant, Paul
  2019-12-14  0:34             ` [Xen-devel] xen-block: race condition when stopping the device (WAS: " Julien Grall
  0 siblings, 1 reply; 16+ messages in thread
From: Durrant, Paul @ 2019-12-13 15:55 UTC (permalink / raw)
  To: Julien Grall, Ian Jackson
  Cc: Jürgen Groß,
	xen-devel, Stefano Stabellini, osstest service owner,
	Anthony Perard

> -----Original Message-----
> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of
> Julien Grall
> Sent: 13 December 2019 15:37
> To: Ian Jackson <ian.jackson@citrix.com>
> Cc: Jürgen Groß <jgross@suse.com>; xen-devel@lists.xenproject.org; Stefano
> Stabellini <sstabellini@kernel.org>; osstest service owner <osstest-
> admin@xenproject.org>; Anthony Perard <anthony.perard@citrix.com>
> Subject: Re: [Xen-devel] [xen-4.13-testing test] 144736: regressions -
> FAIL
> 
> +Anthony
> 
> On 13/12/2019 11:40, Ian Jackson wrote:
> > Julien Grall writes ("Re: [Xen-devel] [xen-4.13-testing test] 144736:
> regressions - FAIL"):
> >> AMD Seattle boards (laxton*) are known to fail booting time to time
> >> because of PCI training issue. We have workaround for it (involving
> >> longer power cycle) but this is not 100% reliable.
> >
> > This wasn't a power cycle.  It was a software-initiated reboot.  It
> > does appear to hang in the firmware somewhere.  Do we expect the pci
> > training issue to occur in this case ?
> 
> The PCI training happens at every reset (including software). So I may
> have confused the workaround for firmware corruption with the PCI
> training. We definitely have a workfround for the former.
> 
> For the latter, I can't remember if we did use a new firmware or just
> hope it does not happen often.
> 
> I think we had a thread on infra@ about the workaround some times last
> year. Sadly this was sent on my Arm e-mail address and I didn't archive
> it before leaving :(. Can you have a look if you can find the thread?
> 
> >
> >>>>    test-armhf-armhf-xl-vhd      18 leak-check/check         fail
> REGR.
> >>>> vs. 144673
> >>>
> >>> That one is strange. A qemu process seems to have have died producing
> >>> a core file, but I couldn't find any log containing any other
> indication
> >>> of a crashed program.
> >>
> >> I haven't found anything interesting in the log. @Ian could you set up
> >> a repro for this?
> >
> > There is some heisenbug where qemu crashes with very low probability.
> > (I forget whether only on arm or on x86 too).  This has been around
> > for a little while.  I doubt this particular failure will be
> > reproducible.
> 
> I can't remember such bug been reported on Arm before. Anyway, I managed
> to get the stack trace from gdb:
> 
> Core was generated by `/usr/local/lib/xen/bin/qemu-system-i386
> -xen-domid 1 -chardev socket,id=libxl-c'.
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0x006342be in xen_block_handle_requests (dataplane=0x108e600) at
> /home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-
> dir/hw/block/dataplane/xen-block.c:531
> 531
> /home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-
> dir/hw/block/dataplane/xen-block.c:
> No such file or directory.
> [Current thread is 1 (LWP 1987)]
> (gdb) bt
> #0  0x006342be in xen_block_handle_requests (dataplane=0x108e600) at
> /home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-
> dir/hw/block/dataplane/xen-block.c:531
> #1  0x0063447c in xen_block_dataplane_event (opaque=0x108e600) at
> /home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-
> dir/hw/block/dataplane/xen-block.c:626
> #2  0x008d005c in xen_device_poll (opaque=0x107a3b0) at
> /home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/hw/xen/xen-
> bus.c:1077
> #3  0x00a4175c in run_poll_handlers_once (ctx=0x1079708,
> timeout=0xb1ba17f8) at
> /home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/util/aio-
> posix.c:520
> #4  0x00a41826 in run_poll_handlers (ctx=0x1079708, max_ns=8000,
> timeout=0xb1ba17f8) at
> /home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/util/aio-
> posix.c:562
> #5  0x00a41956 in try_poll_mode (ctx=0x1079708, timeout=0xb1ba17f8) at
> /home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/util/aio-
> posix.c:597
> #6  0x00a41a2c in aio_poll (ctx=0x1079708, blocking=true) at
> /home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/util/aio-
> posix.c:639
> #7  0x0071dc16 in iothread_run (opaque=0x107d328) at
> /home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-
> dir/iothread.c:75
> #8  0x00a44c80 in qemu_thread_start (args=0x1079538) at
> /home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/util/qemu-
> thread-posix.c:502
> #9  0xb67ae5d8 in ?? ()
> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> 
> This feels like a race condition between the init/free code with
> handler. Anthony, does it ring any bell?
> 

From that stack backtrace it looks like an iothread managed to run after the sring was NULLed. This should not be able to happen, as the dataplane should have been moved back onto QEMU's main thread context before the ring is unmapped.
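
To illustrate the ordering I mean (this is only a sketch from memory, not the actual code in hw/block/dataplane/xen-block.c; helper names and signatures may not match this QEMU tree exactly and error handling is simplified):

/* Sketch of the expected shutdown ordering, not a verbatim copy. */
void dataplane_stop_sketch(XenBlockDataPlane *dataplane)
{
    /* First move the block backend back onto QEMU's main thread context... */
    aio_context_acquire(dataplane->ctx);
    blk_set_aio_context(dataplane->blk, qemu_get_aio_context(), &error_abort);
    aio_context_release(dataplane->ctx);

    /* ...so that by the time the event channel is unbound and the shared
     * ring is unmapped, no iothread should still be driving requests. */
    xen_device_unbind_event_channel(dataplane->xendev,
                                    dataplane->event_channel, &error_abort);
    xen_device_unmap_grant_refs(dataplane->xendev, dataplane->sring,
                                dataplane->nr_ring_ref, &error_abort);
    dataplane->sring = NULL;
}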

  Paul

> Cheers,
> 
> --
> Julien Grall
> 

* xen-block: race condition when stopping the device (WAS: Re: [Xen-devel] [xen-4.13-testing test] 144736: regressions - FAIL)
  2019-12-13 15:55         ` Durrant, Paul
@ 2019-12-14  0:34             ` Julien Grall
  0 siblings, 0 replies; 16+ messages in thread
From: Julien Grall @ 2019-12-14  0:34 UTC (permalink / raw)
  To: Durrant, Paul, Ian Jackson
  Cc: Jürgen Groß,
	Stefano Stabellini, qemu-devel, osstest service owner,
	Anthony Perard, xen-devel

Hi Paul,

On 13/12/2019 15:55, Durrant, Paul wrote:
>> -----Original Message-----
>> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of
>> Julien Grall
>> Sent: 13 December 2019 15:37
>> To: Ian Jackson <ian.jackson@citrix.com>
>> Cc: Jürgen Groß <jgross@suse.com>; xen-devel@lists.xenproject.org; Stefano
>> Stabellini <sstabellini@kernel.org>; osstest service owner <osstest-
>> admin@xenproject.org>; Anthony Perard <anthony.perard@citrix.com>
>> Subject: Re: [Xen-devel] [xen-4.13-testing test] 144736: regressions -
>> FAIL
>>
>> +Anthony
>>
>> On 13/12/2019 11:40, Ian Jackson wrote:
>>> Julien Grall writes ("Re: [Xen-devel] [xen-4.13-testing test] 144736:
>> regressions - FAIL"):
>>>> AMD Seattle boards (laxton*) are known to fail booting time to time
>>>> because of PCI training issue. We have workaround for it (involving
>>>> longer power cycle) but this is not 100% reliable.
>>>
>>> This wasn't a power cycle.  It was a software-initiated reboot.  It
>>> does appear to hang in the firmware somewhere.  Do we expect the pci
>>> training issue to occur in this case ?
>>
>> The PCI training happens at every reset (including software). So I may
>> have confused the workaround for firmware corruption with the PCI
>> training. We definitely have a workfround for the former.
>>
>> For the latter, I can't remember if we did use a new firmware or just
>> hope it does not happen often.
>>
>> I think we had a thread on infra@ about the workaround some times last
>> year. Sadly this was sent on my Arm e-mail address and I didn't archive
>> it before leaving :(. Can you have a look if you can find the thread?
>>
>>>
>>>>>>     test-armhf-armhf-xl-vhd      18 leak-check/check         fail
>> REGR.
>>>>>> vs. 144673
>>>>>
>>>>> That one is strange. A qemu process seems to have have died producing
>>>>> a core file, but I couldn't find any log containing any other
>> indication
>>>>> of a crashed program.
>>>>
>>>> I haven't found anything interesting in the log. @Ian could you set up
>>>> a repro for this?
>>>
>>> There is some heisenbug where qemu crashes with very low probability.
>>> (I forget whether only on arm or on x86 too).  This has been around
>>> for a little while.  I doubt this particular failure will be
>>> reproducible.
>>
>> I can't remember such bug been reported on Arm before. Anyway, I managed
>> to get the stack trace from gdb:
>>
>> Core was generated by `/usr/local/lib/xen/bin/qemu-system-i386
>> -xen-domid 1 -chardev socket,id=libxl-c'.
>> Program terminated with signal SIGSEGV, Segmentation fault.
>> #0  0x006342be in xen_block_handle_requests (dataplane=0x108e600) at
>> /home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-
>> dir/hw/block/dataplane/xen-block.c:531
>> 531
>> /home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-
>> dir/hw/block/dataplane/xen-block.c:
>> No such file or directory.
>> [Current thread is 1 (LWP 1987)]
>> (gdb) bt
>> #0  0x006342be in xen_block_handle_requests (dataplane=0x108e600) at
>> /home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-
>> dir/hw/block/dataplane/xen-block.c:531
>> #1  0x0063447c in xen_block_dataplane_event (opaque=0x108e600) at
>> /home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-
>> dir/hw/block/dataplane/xen-block.c:626
>> #2  0x008d005c in xen_device_poll (opaque=0x107a3b0) at
>> /home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/hw/xen/xen-
>> bus.c:1077
>> #3  0x00a4175c in run_poll_handlers_once (ctx=0x1079708,
>> timeout=0xb1ba17f8) at
>> /home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/util/aio-
>> posix.c:520
>> #4  0x00a41826 in run_poll_handlers (ctx=0x1079708, max_ns=8000,
>> timeout=0xb1ba17f8) at
>> /home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/util/aio-
>> posix.c:562
>> #5  0x00a41956 in try_poll_mode (ctx=0x1079708, timeout=0xb1ba17f8) at
>> /home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/util/aio-
>> posix.c:597
>> #6  0x00a41a2c in aio_poll (ctx=0x1079708, blocking=true) at
>> /home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/util/aio-
>> posix.c:639
>> #7  0x0071dc16 in iothread_run (opaque=0x107d328) at
>> /home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-
>> dir/iothread.c:75
>> #8  0x00a44c80 in qemu_thread_start (args=0x1079538) at
>> /home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/util/qemu-
>> thread-posix.c:502
>> #9  0xb67ae5d8 in ?? ()
>> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
>>
>> This feels like a race condition between the init/free code with
>> handler. Anthony, does it ring any bell?
>>
> 
>  From that stack bt it looks like an iothread managed to run after the sring was NULLed. This should not be able happen as the dataplane should have been moved back onto QEMU's main thread context before the ring is unmapped.

My knowledge of this code is fairly limited, so correct me if I am wrong.

blk_set_aio_context() would set the context for the block AIO. AFAICT,
the only AIO callback for the block is xen_block_complete_aio().

In the stack above, we are not dealing with a block AIO but with an AIO
handler tied to the event channel (see the call from xen_device_poll).
So I don't think blk_set_aio_context() would affect that handler.

So it would be possible for the iothread to run because we received a
notification on the event channel while we are stopping the block
device (i.e. xen_block_dataplane_stop()).

If xen_block_dataplane_stop() grabs the context lock first, then the
iothread dealing with the event may wait on the lock until it is
released.

By the time the lock is grabbed, we may have freed all the resources
(including the srings). So the event iothread will end up dereferencing
a NULL pointer.

It feels to me that we need a way to quiesce all the iothreads (blk,
event, ...) before continuing, but I am a bit unsure how to do this in
QEMU.
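
To make the suspected interleaving concrete, here is the ordering I
have in mind (purely hypothetical, reconstructed from the backtrace
rather than traced through the code):

  main thread                          iothread (dataplane->ctx)
  -----------                          --------------------------
  xen_block_dataplane_stop()
    blk_set_aio_context(main context)  event channel fd fires
    acquires the AioContext lock       xen_device_poll() waits on the lock
    unbinds the event channel,
    unmaps the ring, sring = NULL
    releases the lock                  acquires the lock
                                       xen_block_dataplane_event()
                                       xen_block_handle_requests()
                                       -> dereferences the NULL sring
                                       -> SIGSEGV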

Cheers,

-- 
Julien Grall


* RE: xen-block: race condition when stopping the device (WAS: Re: [Xen-devel] [xen-4.13-testing test] 144736: regressions - FAIL)
  2019-12-14  0:34             ` [Xen-devel] xen-block: race condition when stopping the device (WAS: " Julien Grall
@ 2019-12-16  9:34               ` Durrant, Paul
  -1 siblings, 0 replies; 16+ messages in thread
From: Durrant, Paul @ 2019-12-16  9:34 UTC (permalink / raw)
  To: Julien Grall, Ian Jackson
  Cc: Jürgen Groß,
	xen-devel, Stefano Stabellini, osstest service owner,
	Anthony Perard, qemu-devel

> -----Original Message-----
[snip]
> >>
> >> This feels like a race condition between the init/free code with
> >> handler. Anthony, does it ring any bell?
> >>
> >
> >  From that stack bt it looks like an iothread managed to run after the
> sring was NULLed. This should not be able happen as the dataplane should
> have been moved back onto QEMU's main thread context before the ring is
> unmapped.
> 
> My knowledge of this code is fairly limited, so correct me if I am wrong.
> 
> blk_set_aio_context() would set the context for the block aio. AFAICT,
> the only aio for the block is xen_block_complete_aio().

Not quite. xen_block_dataplane_start() calls xen_device_bind_event_channel(), and that adds an event channel fd into the aio context, so the shared ring, as well as block I/O completion, is polled by the iothread.

> 
> In the stack above, we are not dealing with a block aio but an aio tie
> to the event channel (see the call from xen_device_poll). So I don't
> think the blk_set_aio_context() would affect the aio.
> 

For the reason I outlined above, it does.

> So it would be possible to get the iothread running because we received
> a notification on the event channel while we are stopping the block (i.e
> xen_block_dataplane_stop()).
> 

We should assume an iothread can essentially run at any time, as it is a polling entity. It should eventually block polling on the fds assigned to its aio context, but I don't think the abstraction guarantees that it cannot be woken for other reasons (e.g. off a timeout). However, an event from the frontend will certainly cause the evtchn fd poll to wake up.

> If xen_block_dataplane_stop() grab the context lock first, then the
> iothread dealing with the event may wait on the lock until its released.
> 
> By the time the lock is grabbed, we may have free all the resources
> (including srings). So the event iothread will end up to dereference a
> NULL pointer.
> 

I think the problem may actually be that xen_block_dataplane_event() does not acquire the context and thus is not synchronized with xen_block_dataplane_stop(). The documentation in multiple-iothreads.txt is not clear on whether a poll handler called by an iothread needs to acquire the context though; TBH I would not have thought it necessary.
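
Just to make that concrete, something like the following (hypothetical sketch only, not a proposed patch; the sring check is illustrative and the field names are from memory):

/* Hypothetical: serialise the poll handler against the stop path by
 * taking the AioContext lock around it.  This is NOT what the current
 * code does; it is just one way the synchronisation could look. */
static void dataplane_event_sketch(void *opaque)
{
    XenBlockDataPlane *dataplane = opaque;

    aio_context_acquire(dataplane->ctx);
    if (dataplane->sring) {
        /* Skip if the ring has already been torn down by the stop path. */
        xen_block_handle_requests(dataplane);
    }
    aio_context_release(dataplane->ctx);
}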

> It feels to me we need a way to quiesce all the iothreads (blk,
> event,...) before continuing. But I am a bit unsure how to do this in
> QEMU.
> 

Looking at virtio-blk.c I see that it does seem to close off its evtchn equivalent from iothread context via aio_wait_bh_oneshot(). So I wonder whether the 'right' thing to do is to call xen_device_unbind_event_channel() using the same mechanism to ensure xen_block_dataplane_event() can't race.
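
Roughly what I have in mind (untested sketch; the field names are assumed from xen-block.c and the exact placement would need checking):

/* Run the unbind from a one-shot BH on the dataplane's own AioContext,
 * mirroring what virtio-blk does for its ioeventfd, so that the event
 * handler cannot still be in flight once this returns. */
static void dataplane_detach_bh(void *opaque)
{
    XenBlockDataPlane *dataplane = opaque;

    xen_device_unbind_event_channel(dataplane->xendev,
                                    dataplane->event_channel, &error_abort);
    dataplane->event_channel = NULL;
}

/* ...and in xen_block_dataplane_stop(), with dataplane->ctx acquired: */
aio_wait_bh_oneshot(dataplane->ctx, dataplane_detach_bh, dataplane);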

  Paul

> Cheers,
> 
> --
> Julien Grall

* RE: [Xen-devel] xen-block: race condition when stopping the device (WAS: Re: [xen-4.13-testing test] 144736: regressions - FAIL)
  2019-12-16  9:34               ` [Xen-devel] xen-block: race condition when stopping the device (WAS: " Durrant, Paul
@ 2019-12-16  9:50                 ` Durrant, Paul
  -1 siblings, 0 replies; 16+ messages in thread
From: Durrant, Paul @ 2019-12-16  9:50 UTC (permalink / raw)
  To: Durrant, Paul, Julien Grall, Ian Jackson
  Cc: Jürgen Groß,
	Stefano Stabellini, qemu-devel, osstest service owner,
	Anthony Perard, xen-devel

> -----Original Message-----
> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of
> Durrant, Paul
> Sent: 16 December 2019 09:34
> To: Julien Grall <julien@xen.org>; Ian Jackson <ian.jackson@citrix.com>
> Cc: Jürgen Groß <jgross@suse.com>; Stefano Stabellini
> <sstabellini@kernel.org>; qemu-devel@nongnu.org; osstest service owner
> <osstest-admin@xenproject.org>; Anthony Perard
> <anthony.perard@citrix.com>; xen-devel@lists.xenproject.org
> Subject: Re: [Xen-devel] xen-block: race condition when stopping the
> device (WAS: Re: [xen-4.13-testing test] 144736: regressions - FAIL)
> 
> > -----Original Message-----
> [snip]
> > >>
> > >> This feels like a race condition between the init/free code with
> > >> handler. Anthony, does it ring any bell?
> > >>
> > >
> > >  From that stack bt it looks like an iothread managed to run after the
> > sring was NULLed. This should not be able happen as the dataplane should
> > have been moved back onto QEMU's main thread context before the ring is
> > unmapped.
> >
> > My knowledge of this code is fairly limited, so correct me if I am
> wrong.
> >
> > blk_set_aio_context() would set the context for the block aio. AFAICT,
> > the only aio for the block is xen_block_complete_aio().
> 
> Not quite. xen_block_dataplane_start() calls
> xen_device_bind_event_channel() and that will add an event channel fd into
> the aio context, so the shared ring is polled by the iothread as well as
> block i/o completion.
> 
> >
> > In the stack above, we are not dealing with a block aio but an aio tie
> > to the event channel (see the call from xen_device_poll). So I don't
> > think the blk_set_aio_context() would affect the aio.
> >
> 
> For the reason I outline above, it does.
> 
> > So it would be possible to get the iothread running because we received
> > a notification on the event channel while we are stopping the block (i.e
> > xen_block_dataplane_stop()).
> >
> 
> We should assume an iothread can essentially run at any time, as it is a
> polling entity. It should eventually block polling on the fds assigned to
> its aio context, but I don't think the abstraction guarantees that it
> cannot be woken for other reasons (e.g. off a timeout). However, an event
> from the frontend will certainly cause the evtchn fd poll to wake up.
> 
> > If xen_block_dataplane_stop() grab the context lock first, then the
> > iothread dealing with the event may wait on the lock until its released.
> >
> > By the time the lock is grabbed, we may have free all the resources
> > (including srings). So the event iothread will end up to dereference a
> > NULL pointer.
> >
> 
> I think the problem may actually be that xen_block_dataplane_event() does
> not acquire the context and thus is not synchronized with
> xen_block_dataplane_stop(). The documentation in multiple-iothreads.txt is
> not clear whether a poll handler called by an iothread needs to acquire
> the context though; TBH I would not have thought it necessary.
> 
> > It feels to me we need a way to quiesce all the iothreads (blk,
> > event,...) before continuing. But I am a bit unsure how to do this in
> > QEMU.
> >
> 
> Looking at virtio-blk.c I see that it does seem to close off its evtchn
> equivalent from iothread context via aio_wait_bh_oneshot(). So I wonder
> whether the 'right' thing to do is to call
> xen_device_unbind_event_channel() using the same mechanism to ensure
> xen_block_dataplane_event() can't race.

Digging around the virtio-blk history I see:

commit 1010cadf62332017648abee0d7a3dc7f2eef9632
Author: Stefan Hajnoczi <stefanha@redhat.com>
Date:   Wed Mar 7 14:42:03 2018 +0000

    virtio-blk: fix race between .ioeventfd_stop() and vq handler

    If the main loop thread invokes .ioeventfd_stop() just as the vq handler
    function begins in the IOThread then the handler may lose the race for
    the AioContext lock.  By the time the vq handler is able to acquire the
    AioContext lock the ioeventfd has already been removed and the handler
    isn't supposed to run anymore!

    Use the new aio_wait_bh_oneshot() function to perform ioeventfd removal
    from within the IOThread.  This way no races with the vq handler are
    possible.

    Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
    Reviewed-by: Fam Zheng <famz@redhat.com>
    Acked-by: Paolo Bonzini <pbonzini@redhat.com>
    Message-id: 20180307144205.20619-3-stefanha@redhat.com
    Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

...so I think xen-block has exactly the same problem. I think we may also be missing a qemu_bh_cancel() to make sure block aio completions are stopped. I'll prep a patch.
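
(For the completion side, the missing piece would be something as small as the sketch below; 'completion_bh' stands in for wherever the dataplane keeps its completion bottom half, so the real field name may differ:)

#include "qemu/osdep.h"
#include "qemu/main-loop.h"
#include "block/aio.h"

/* Sketch of the extra step for the completion path: before the rings are
 * unmapped, make sure no deferred completion work is still scheduled. */
static void example_stop_completions(QEMUBH *completion_bh)
{
    /* Any completion run already queued on the old context is dropped;
     * in-flight requests still need to be drained separately. */
    qemu_bh_cancel(completion_bh);
}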

  Paul

> 
>   Paul
> 
> > Cheers,
> >
> > --
> > Julien Grall
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xenproject.org
> https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: [Xen-devel] xen-block: race condition when stopping the device (WAS: Re: [xen-4.13-testing test] 144736: regressions - FAIL)
  2019-12-16  9:50                 ` Durrant, Paul
@ 2019-12-16 10:24                   ` Durrant, Paul
  -1 siblings, 0 replies; 16+ messages in thread
From: Durrant, Paul @ 2019-12-16 10:24 UTC (permalink / raw)
  To: Julien Grall, Ian Jackson
  Cc: Jürgen Groß,
	Stefano Stabellini, qemu-devel, osstest service owner,
	Anthony Perard, xen-devel

> -----Original Message-----
> From: Durrant, Paul <pdurrant@amazon.com>
> Sent: 16 December 2019 09:50
> To: Durrant, Paul <pdurrant@amazon.com>; Julien Grall <julien@xen.org>;
> Ian Jackson <ian.jackson@citrix.com>
> Cc: Jürgen Groß <jgross@suse.com>; Stefano Stabellini
> <sstabellini@kernel.org>; qemu-devel@nongnu.org; osstest service owner
> <osstest-admin@xenproject.org>; Anthony Perard
> <anthony.perard@citrix.com>; xen-devel@lists.xenproject.org
> Subject: RE: [Xen-devel] xen-block: race condition when stopping the
> device (WAS: Re: [xen-4.13-testing test] 144736: regressions - FAIL)
> 
> > -----Original Message-----
> > From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of
> > Durrant, Paul
> > Sent: 16 December 2019 09:34
> > To: Julien Grall <julien@xen.org>; Ian Jackson <ian.jackson@citrix.com>
> > Cc: Jürgen Groß <jgross@suse.com>; Stefano Stabellini
> > <sstabellini@kernel.org>; qemu-devel@nongnu.org; osstest service owner
> > <osstest-admin@xenproject.org>; Anthony Perard
> > <anthony.perard@citrix.com>; xen-devel@lists.xenproject.org
> > Subject: Re: [Xen-devel] xen-block: race condition when stopping the
> > device (WAS: Re: [xen-4.13-testing test] 144736: regressions - FAIL)
> >
> > > -----Original Message-----
> > [snip]
> > > >>
> > > >> This feels like a race condition between the init/free code with
> > > >> handler. Anthony, does it ring any bell?
> > > >>
> > > >
> > > >  From that stack bt it looks like an iothread managed to run after
> the
> > > sring was NULLed. This should not be able happen as the dataplane
> should
> > > have been moved back onto QEMU's main thread context before the ring
> is
> > > unmapped.
> > >
> > > My knowledge of this code is fairly limited, so correct me if I am
> > wrong.
> > >
> > > blk_set_aio_context() would set the context for the block aio. AFAICT,
> > > the only aio for the block is xen_block_complete_aio().
> >
> > Not quite. xen_block_dataplane_start() calls
> > xen_device_bind_event_channel() and that will add an event channel fd
> into
> > the aio context, so the shared ring is polled by the iothread as well as
> > block i/o completion.
> >
> > >
> > > In the stack above, we are not dealing with a block aio but an aio tie
> > > to the event channel (see the call from xen_device_poll). So I don't
> > > think the blk_set_aio_context() would affect the aio.
> > >
> >
> > For the reason I outline above, it does.
> >
> > > So it would be possible to get the iothread running because we
> received
> > > a notification on the event channel while we are stopping the block
> (i.e
> > > xen_block_dataplane_stop()).
> > >
> >
> > We should assume an iothread can essentially run at any time, as it is a
> > polling entity. It should eventually block polling on the fds assigned
> > to its aio context, but I don't think the abstraction guarantees that it
> > cannot be woken for other reasons (e.g. off a timeout). However, an
> > event from the frontend will certainly cause the evtchn fd poll to wake
> > up.
> >
> > > If xen_block_dataplane_stop() grab the context lock first, then the
> > > iothread dealing with the event may wait on the lock until its
> released.
> > >
> > > By the time the lock is grabbed, we may have free all the resources
> > > (including srings). So the event iothread will end up to dereference a
> > > NULL pointer.
> > >
> >
> > I think the problem may actually be that xen_block_dataplane_event()
> does
> > not acquire the context and thus is not synchronized with
> > xen_block_dataplane_stop(). The documentation in multiple-iothreads.txt
> is
> > not clear whether a poll handler called by an iothread needs to acquire
> > the context though; TBH I would not have thought it necessary.
> >
> > > It feels to me we need a way to quiesce all the iothreads (blk,
> > > event,...) before continuing. But I am a bit unsure how to do this in
> > > QEMU.
> > >
> >
> > Looking at virtio-blk.c I see that it does seem to close off its evtchn
> > equivalent from iothread context via aio_wait_bh_oneshot(). So I wonder
> > whether the 'right' thing to do is to call
> > xen_device_unbind_event_channel() using the same mechanism to ensure
> > xen_block_dataplane_event() can't race.
> 
> Digging around the virtio-blk history I see:
> 
> commit 1010cadf62332017648abee0d7a3dc7f2eef9632
> Author: Stefan Hajnoczi <stefanha@redhat.com>
> Date:   Wed Mar 7 14:42:03 2018 +0000
> 
>     virtio-blk: fix race between .ioeventfd_stop() and vq handler
> 
>     If the main loop thread invokes .ioeventfd_stop() just as the vq
> handler
>     function begins in the IOThread then the handler may lose the race for
>     the AioContext lock.  By the time the vq handler is able to acquire
> the
>     AioContext lock the ioeventfd has already been removed and the handler
>     isn't supposed to run anymore!
> 
>     Use the new aio_wait_bh_oneshot() function to perform ioeventfd
> removal
>     from within the IOThread.  This way no races with the vq handler are
>     possible.
> 
>     Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
>     Reviewed-by: Fam Zheng <famz@redhat.com>
>     Acked-by: Paolo Bonzini <pbonzini@redhat.com>
>     Message-id: 20180307144205.20619-3-stefanha@redhat.com
>     Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> 
> ...so I think xen-block has exactly the same problem. I think we may also
> be missing a qemu_bh_cancel() to make sure block aio completions are
> stopped. I'll prep a patch.
> 

Having discussed this with Julien off-list, we agreed that the oneshot handler may be overly elaborate for our purposes, and that actually destroying the event channel at that point would still pose problems for pending aio. What we really need is an equivalent of blk_set_aio_context() for event channels.
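
In other words, something with roughly the shape below. The function name and the channel internals are invented here purely to illustrate what is being asked for; the real XenEventChannel keeps an evtchn handle and derives the fd from it, and the read handler would be whatever xen-bus.c already uses:

#include "qemu/osdep.h"
#include "block/aio.h"

/* Illustration only: a helper that moves an event channel's fd handling
 * from one AioContext to another, analogous to what blk_set_aio_context()
 * does for a BlockBackend. */
struct ExampleEventChannel {
    AioContext *ctx;
    int fd;             /* shown directly for brevity */
    void *opaque;
};

static void example_event_read(void *opaque)
{
    /* handle the frontend notification */
}

void example_set_event_channel_context(struct ExampleEventChannel *channel,
                                       AioContext *new_ctx)
{
    /* Stop polling the fd in the old context... */
    aio_set_fd_handler(channel->ctx, channel->fd, false,
                       NULL, NULL, NULL, NULL);
    channel->ctx = new_ctx;
    /* ...and resume in the new one (e.g. the main loop's context when the
     * dataplane is being stopped). */
    aio_set_fd_handler(channel->ctx, channel->fd, false,
                       example_event_read, NULL, NULL, channel->opaque);
}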

  Paul

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2019-12-16 10:25 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-12 22:35 [Xen-devel] [xen-4.13-testing test] 144736: regressions - FAIL osstest service owner
2019-12-13  8:31 ` Jürgen Groß
2019-12-13 11:14   ` Julien Grall
2019-12-13 11:24     ` Jürgen Groß
2019-12-13 11:28       ` Julien Grall
2019-12-13 11:40     ` Ian Jackson
2019-12-13 15:36       ` Julien Grall
2019-12-13 15:55         ` Durrant, Paul
2019-12-14  0:34           ` xen-block: race condition when stopping the device (WAS: Re: [Xen-devel] [xen-4.13-testing test] 144736: regressions - FAIL) Julien Grall
2019-12-14  0:34             ` [Xen-devel] xen-block: race condition when stopping the device (WAS: " Julien Grall
2019-12-16  9:34             ` xen-block: race condition when stopping the device (WAS: Re: [Xen-devel] " Durrant, Paul
2019-12-16  9:34               ` [Xen-devel] xen-block: race condition when stopping the device (WAS: " Durrant, Paul
2019-12-16  9:50               ` Durrant, Paul
2019-12-16  9:50                 ` Durrant, Paul
2019-12-16 10:24                 ` Durrant, Paul
2019-12-16 10:24                   ` Durrant, Paul
