* acceptance-system-fedora failures
@ 2020-10-06 23:07 John Snow
  2020-10-07  5:20 ` Philippe Mathieu-Daudé
  0 siblings, 1 reply; 19+ messages in thread
From: John Snow @ 2020-10-06 23:07 UTC (permalink / raw)
  To: Cleber Rosa, Alex Bennée; +Cc: QEMU Developers

I'm seeing this gitlab test fail quite often in my Python work; I don't 
*think* this has anything to do with my patches, but maybe I need to try 
and bisect this more aggressively.

The very first hint of an error I see is on line 154:

https://gitlab.com/jsnow/qemu/-/jobs/776334918#L154

22:05:36 ERROR|
22:05:36 ERROR| Reproduced traceback from: /builds/jsnow/qemu/build/tests/venv/lib64/python3.8/site-packages/avocado/core/test.py:753
22:05:36 ERROR| Traceback (most recent call last):
22:05:36 ERROR|   File "/builds/jsnow/qemu/build/tests/acceptance/avocado_qemu/__init__.py", line 171, in setUp
22:05:36 ERROR|     self.cancel("No QEMU binary defined or found in the build tree")

Is this a known problem?

--js



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: acceptance-system-fedora failures
  2020-10-06 23:07 acceptance-system-fedora failures John Snow
@ 2020-10-07  5:20 ` Philippe Mathieu-Daudé
  2020-10-07  7:13   ` Philippe Mathieu-Daudé
                     ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Philippe Mathieu-Daudé @ 2020-10-07  5:20 UTC (permalink / raw)
  To: John Snow, Cleber Rosa
  Cc: Thomas Huth, Eduardo Habkost, QEMU Developers,
	Wainer dos Santos Moschetta, Alex Bennée, Richard Henderson

On 10/7/20 1:07 AM, John Snow wrote:
> I'm seeing this gitlab test fail quite often in my Python work; I don't
> *think* this has anything to do with my patches, but maybe I need to try
> and bisect this more aggressively.
> 
> The very first hint of an error I see is on line 154:
> 
> https://gitlab.com/jsnow/qemu/-/jobs/776334918#L154
> 
> 22:05:36 ERROR|
> 22:05:36 ERROR| Reproduced traceback from:
> /builds/jsnow/qemu/build/tests/venv/lib64/python3.8/site-packages/avocado/core/test.py:753
> 
> 22:05:36 ERROR| Traceback (most recent call last):
> 22:05:36 ERROR|   File
> "/builds/jsnow/qemu/build/tests/acceptance/avocado_qemu/__init__.py",
> line 171, in setUp
> 22:05:36 ERROR|     self.cancel("No QEMU binary defined or found in the
> build tree")

Last year the Avocado developers said we could have a clearer
error log report, but meanwhile this verbose output is better
than not having anything ¯\_(ツ)_/¯

> 
> Is this a known problem?

"No QEMU binary defined or found in the build tree" is not a
problem; it means a test was skipped because the qemu-system-$ARCH
binary was not found. In your case this is because your job
(acceptance-system-fedora) is based on build-system-fedora,
which only builds the following targets:

    TARGETS: tricore-softmmu microblaze-softmmu mips-softmmu
      xtensa-softmmu m68k-softmmu riscv32-softmmu ppc-softmmu
      sparc64-softmmu

Now I don't understand what binary the EmptyCPUModel/Migration tests
are expecting. Maybe these tests only work when a single target is
built? IOW, they do not expect the python code that searches for a
binary to return multiple ones? -> Eduardo/Cleber.
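For context, a binary lookup of the kind that cancels with "No QEMU binary
defined or found in the build tree" can be sketched like this (a hypothetical
Python sketch, not the actual avocado_qemu code; `pick_qemu_binary` and the
flat build-tree layout are assumptions made for illustration):

```python
import os

def pick_qemu_binary(build_dir, arch=None):
    """Return the first qemu-system-* binary found under build_dir, or None.

    If arch is given, only qemu-system-<arch> is considered; without it,
    any target binary matches, which is ambiguous when several targets
    were built (the multiple-binaries question above).
    """
    candidates = []
    for name in sorted(os.listdir(build_dir)):
        if name.startswith("qemu-system-"):
            if arch is None or name == "qemu-system-" + arch:
                candidates.append(os.path.join(build_dir, name))
    return candidates[0] if candidates else None

# A test's setUp() would then cancel (skip) when the lookup fails:
#     if pick_qemu_binary(build_dir, self.arch) is None:
#         self.cancel("No QEMU binary defined or found in the build tree")
```

So a job that only builds tricore/microblaze/mips/... targets skips every
test that asks for, say, qemu-system-x86_64.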

w.r.t. the error in your build, I told Thomas about the
test_ppc_mac99/day15/invaders.elf timeout, but he said this is
not his area. Richard looked at it yesterday to see if it is
a TCG regression, and said the test either finished or crashed,
raising SIGCHLD, but the Avocado parent is still waiting for a
timeout, so the child becomes a zombie and the test hangs.

Not sure that helps :)

Regards,

Phil.




* Re: acceptance-system-fedora failures
  2020-10-07  5:20 ` Philippe Mathieu-Daudé
@ 2020-10-07  7:13   ` Philippe Mathieu-Daudé
  2020-10-07  8:23     ` Thomas Huth
  2020-10-07  7:23   ` Thomas Huth
  2020-10-07 14:38   ` Cleber Rosa
  2 siblings, 1 reply; 19+ messages in thread
From: Philippe Mathieu-Daudé @ 2020-10-07  7:13 UTC (permalink / raw)
  To: John Snow, Cleber Rosa, Pavel Dovgalyuk
  Cc: Thomas Huth, Eduardo Habkost, QEMU Developers,
	Wainer dos Santos Moschetta, Alex Bennée, Richard Henderson

On 10/7/20 7:20 AM, Philippe Mathieu-Daudé wrote:
> On 10/7/20 1:07 AM, John Snow wrote:
>> I'm seeing this gitlab test fail quite often in my Python work; I don't
>> *think* this has anything to do with my patches, but maybe I need to try
>> and bisect this more aggressively.
>>
>> The very first hint of an error I see is on line 154:
>>
>> https://gitlab.com/jsnow/qemu/-/jobs/776334918#L154
>>
>> 22:05:36 ERROR|
>> 22:05:36 ERROR| Reproduced traceback from:
>> /builds/jsnow/qemu/build/tests/venv/lib64/python3.8/site-packages/avocado/core/test.py:753
>>
>> 22:05:36 ERROR| Traceback (most recent call last):
>> 22:05:36 ERROR|   File
>> "/builds/jsnow/qemu/build/tests/acceptance/avocado_qemu/__init__.py",
>> line 171, in setUp
>> 22:05:36 ERROR|     self.cancel("No QEMU binary defined or found in the
>> build tree")
> 
> Last year the Avocado developers said we could have a clearer
> log error report, but meanwhile this verbose output is better
> that not having anything ¯\_(ツ)_/¯
> 
>>
>> Is this a known problem?
> 
> "No QEMU binary defined or found in the build tree" is not a
> problem, it means a test is skipped because the qemu-system-$ARCH
> binary is not found. In your case this is because your job
> (acceptance-system-fedora) is based on build-system-fedora
> which only build the following targets:
> 
>     TARGETS: tricore-softmmu microblaze-softmmu mips-softmmu
>       xtensa-softmmu m68k-softmmu riscv32-softmmu ppc-softmmu
>       sparc64-softmmu
> 
> Now I don't understand what binary the EmptyCPUModel/Migration tests
> are expecting. Maybe these tests only work when a single target is
> built? IOW not expecting that the python code searching for a binary
> return multiple ones? -> Eduardo/Cleber.
> 
> w.r.t. the error in your build, I told Thomas about the
> test_ppc_mac99/day15/invaders.elf timeouting but he said this is
> not his area. Richard has been looking yesterday to see if it is
> a TCG regression, and said the test either finished/crashed raising
> SIGCHLD, but Avocado parent is still waiting for a timeout, so the
> children become zombie and the test hang.

Expected output:

Quiescing Open Firmware ...
Booting Linux via __start() @ 0x01000000 ...

But QEMU exits in replay_char_write_event_load():

Quiescing Open Firmware ...
qemu-system-ppc: Missing character write event in the replay log
$ echo $?
1

The last events in the log are CHECKPOINT CHECKPOINT INTERRUPT
INTERRUPT INTERRUPT.

The replay file is ~22MiB. The recording was ended using
"system_powerdown + quit" in HMP.


I guess we have 2 bugs:
- the replay log
- Avocado doesn't catch the child's exit(1)


Quick reproducer:

$ make qemu-system-ppc check-venv
$ tests/venv/bin/python -m \
  avocado --show=app,console,replay \
  run --job-timeout 300 -t machine:mac99 \
  tests/acceptance/replay_kernel.py
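The failure mode can be modelled with a toy record/replay log (a simplified
Python model of the rr idea only, not QEMU's actual replay code; the class
and event names are made up): on replay, every event a device emits must
match the next entry in the recorded log, so an event requested at a point
where the log holds something else aborts with a "missing event" error:

```python
class ReplayLog:
    """Toy record/replay event log (simplified model of the rr idea)."""

    def __init__(self):
        self.log = []          # recorded (kind, payload) events
        self.pos = 0           # replay cursor
        self.replaying = False

    def event(self, kind, payload=None):
        if not self.replaying:
            # record mode: append the event and let it happen
            self.log.append((kind, payload))
            return payload
        # replay mode: the next logged event must match what is requested
        if self.pos >= len(self.log) or self.log[self.pos][0] != kind:
            raise RuntimeError("Missing %s event in the replay log" % kind)
        _, payload = self.log[self.pos]
        self.pos += 1
        return payload

rr = ReplayLog()
rr.event("CHAR_WRITE", b"Quiescing Open Firmware ...\n")  # recorded
rr.replaying = True
rr.pos = 0
# Replay in the same order as the recording succeeds:
assert rr.event("CHAR_WRITE") == b"Quiescing Open Firmware ...\n"
```

This is the shape of the qemu-system-ppc error above: the replay run asks
for a character write event at a position where the log holds something else.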

> 
> Not sure that helps :)
> 
> Regards,
> 
> Phil.
> 




* Re: acceptance-system-fedora failures
  2020-10-07  5:20 ` Philippe Mathieu-Daudé
  2020-10-07  7:13   ` Philippe Mathieu-Daudé
@ 2020-10-07  7:23   ` Thomas Huth
  2020-10-07  8:19     ` Philippe Mathieu-Daudé
  2020-10-07 14:38   ` Cleber Rosa
  2 siblings, 1 reply; 19+ messages in thread
From: Thomas Huth @ 2020-10-07  7:23 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé, John Snow, Cleber Rosa, Pavel Dovgalyuk
  Cc: Peter Maydell, Eduardo Habkost, QEMU Developers,
	Wainer dos Santos Moschetta, Alex Bennée, Richard Henderson

On 07/10/2020 07.20, Philippe Mathieu-Daudé wrote:
> On 10/7/20 1:07 AM, John Snow wrote:
>> I'm seeing this gitlab test fail quite often in my Python work; I don't
>> *think* this has anything to do with my patches, but maybe I need to try
>> and bisect this more aggressively.
[...]
> w.r.t. the error in your build, I told Thomas about the
> test_ppc_mac99/day15/invaders.elf timeouting but he said this is
> not his area. Richard has been looking yesterday to see if it is
> a TCG regression, and said the test either finished/crashed raising
> SIGCHLD, but Avocado parent is still waiting for a timeout, so the
> children become zombie and the test hang.
> 
> Not sure that helps :)

No clue why that invaders.elf is now failing with the replay test, but if
you look at the history of the CI runs:

 https://gitlab.com/qemu-project/qemu/-/pipelines

... you can clearly see that the problem started with John's
ide-pull-request 5 days ago:

 https://gitlab.com/qemu-project/qemu/-/pipelines/197124608

So maybe it's worth the effort to have a look at the patches that got merged
here?

 Thomas




* Re: acceptance-system-fedora failures
  2020-10-07  7:23   ` Thomas Huth
@ 2020-10-07  8:19     ` Philippe Mathieu-Daudé
  0 siblings, 0 replies; 19+ messages in thread
From: Philippe Mathieu-Daudé @ 2020-10-07  8:19 UTC (permalink / raw)
  To: Thomas Huth, John Snow, Cleber Rosa, Pavel Dovgalyuk
  Cc: Peter Maydell, Eduardo Habkost, QEMU Developers,
	Wainer dos Santos Moschetta, Alex Bennée, Richard Henderson

On 10/7/20 9:23 AM, Thomas Huth wrote:
> On 07/10/2020 07.20, Philippe Mathieu-Daudé wrote:
>> On 10/7/20 1:07 AM, John Snow wrote:
>>> I'm seeing this gitlab test fail quite often in my Python work; I don't
>>> *think* this has anything to do with my patches, but maybe I need to try
>>> and bisect this more aggressively.
> [...]
>> w.r.t. the error in your build, I told Thomas about the
>> test_ppc_mac99/day15/invaders.elf timeouting but he said this is
>> not his area. Richard has been looking yesterday to see if it is
>> a TCG regression, and said the test either finished/crashed raising
>> SIGCHLD, but Avocado parent is still waiting for a timeout, so the
>> children become zombie and the test hang.
>>
>> Not sure that helps :)
> 
> No clue why that invaders.elf is now failing with the replay test, but if
> you look at the history of the CI runs:
> 
>  https://gitlab.com/qemu-project/qemu/-/pipelines
> 
> ... you can clearly see that the problem started with John's
> ide-pull-request 5 days ago:
> 
>  https://gitlab.com/qemu-project/qemu/-/pipelines/197124608
> 
> So maybe it's worth the effort to have a look at the patches that got merged
> here?

Great idea!

Bisected using:

$ make qemu-system-ppc check-venv && \
  tests/venv/bin/python -m \
  avocado --show=app,console,replay \
  run -t machine:mac99 \
  tests/acceptance/replay_kernel.py

55adb3c45620c31f29978f209e2a44a08d34e2da is the first bad commit
commit 55adb3c45620c31f29978f209e2a44a08d34e2da
Author: John Snow <jsnow@redhat.com>
Date:   Fri Jul 24 01:23:00 2020 -0400

    ide: cancel pending callbacks on SRST

    The SRST implementation did not keep up with the rest of IDE;
    it is possible to perform a weak reset on an IDE device to
    remove the BSY/DRQ bits, and then issue writes to the
    control/device registers which can cause chaos with the state
    machine.

    Fix that by actually performing a real reset.

    Reported-by: Alexander Bulekov <alxndr@bu.edu>
    Fixes: https://bugs.launchpad.net/qemu/+bug/1878253
    Fixes: https://bugs.launchpad.net/qemu/+bug/1887303
    Fixes: https://bugs.launchpad.net/qemu/+bug/1887309
    Signed-off-by: John Snow <jsnow@redhat.com>

 hw/ide/core.c | 58 ++++++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 40 insertions(+), 18 deletions(-)

Regards,

Phil.




* Re: acceptance-system-fedora failures
  2020-10-07  7:13   ` Philippe Mathieu-Daudé
@ 2020-10-07  8:23     ` Thomas Huth
  2020-10-07  8:51       ` Pavel Dovgalyuk
  2020-10-07 14:03       ` John Snow
  0 siblings, 2 replies; 19+ messages in thread
From: Thomas Huth @ 2020-10-07  8:23 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé, John Snow, Cleber Rosa, Pavel Dovgalyuk
  Cc: Richard Henderson, Alex Bennée, Eduardo Habkost,
	Wainer dos Santos Moschetta, QEMU Developers

On 07/10/2020 09.13, Philippe Mathieu-Daudé wrote:
> On 10/7/20 7:20 AM, Philippe Mathieu-Daudé wrote:
>> On 10/7/20 1:07 AM, John Snow wrote:
>>> I'm seeing this gitlab test fail quite often in my Python work; I don't
>>> *think* this has anything to do with my patches, but maybe I need to try
>>> and bisect this more aggressively.
[...]
>> w.r.t. the error in your build, I told Thomas about the
>> test_ppc_mac99/day15/invaders.elf timeouting but he said this is
>> not his area. Richard has been looking yesterday to see if it is
>> a TCG regression, and said the test either finished/crashed raising
>> SIGCHLD, but Avocado parent is still waiting for a timeout, so the
>> children become zombie and the test hang.
> 
> Expected output:
> 
> Quiescing Open Firmware ...
> Booting Linux via __start() @ 0x01000000 ...
> 
> But QEMU exits in replay_char_write_event_load():
> 
> Quiescing Open Firmware ...
> qemu-system-ppc: Missing character write event in the replay log
> $ echo $?
> 1
> 
> Latest events are CHECKPOINT CHECKPOINT INTERRUPT INTERRUPT INTERRUPT.
> 
> Replay file is ~22MiB. End of record using "system_powerdown + quit"
> in HMP.
> 
> I guess we have 2 bugs:
> - replay log
> - avocado doesn't catch children exit(1)
> 
> Quick reproducer:
> 
> $ make qemu-system-ppc check-venv
> $ tests/venv/bin/python -m \
>   avocado --show=app,console,replay \
>   run --job-timeout 300 -t machine:mac99 \
>   tests/acceptance/replay_kernel.py

Thanks, that was helpful. ... and the winner is:

    commit   55adb3c45620c31f29978f209e2a44a08d34e2da
    Author:  John Snow <jsnow@redhat.com>
    Date:    Fri Jul 24 01:23:00 2020 -0400
    Subject: ide: cancel pending callbacks on SRST

... starting with this commit, the test starts failing. John, any idea what
might be causing this?

 Thomas




* Re: acceptance-system-fedora failures
  2020-10-07  8:23     ` Thomas Huth
@ 2020-10-07  8:51       ` Pavel Dovgalyuk
  2020-10-07  9:57         ` Philippe Mathieu-Daudé
  2020-10-07 14:03       ` John Snow
  1 sibling, 1 reply; 19+ messages in thread
From: Pavel Dovgalyuk @ 2020-10-07  8:51 UTC (permalink / raw)
  To: Thomas Huth, Philippe Mathieu-Daudé,
	John Snow, Cleber Rosa, Pavel Dovgalyuk
  Cc: Richard Henderson, Alex Bennée, Eduardo Habkost,
	Wainer dos Santos Moschetta, QEMU Developers

On 07.10.2020 11:23, Thomas Huth wrote:
> On 07/10/2020 09.13, Philippe Mathieu-Daudé wrote:
>> On 10/7/20 7:20 AM, Philippe Mathieu-Daudé wrote:
>>> On 10/7/20 1:07 AM, John Snow wrote:
>>>> I'm seeing this gitlab test fail quite often in my Python work; I don't
>>>> *think* this has anything to do with my patches, but maybe I need to try
>>>> and bisect this more aggressively.
> [...]
>>> w.r.t. the error in your build, I told Thomas about the
>>> test_ppc_mac99/day15/invaders.elf timeouting but he said this is
>>> not his area. Richard has been looking yesterday to see if it is
>>> a TCG regression, and said the test either finished/crashed raising
>>> SIGCHLD, but Avocado parent is still waiting for a timeout, so the
>>> children become zombie and the test hang.
>>
>> Expected output:
>>
>> Quiescing Open Firmware ...
>> Booting Linux via __start() @ 0x01000000 ...
>>
>> But QEMU exits in replay_char_write_event_load():
>>
>> Quiescing Open Firmware ...
>> qemu-system-ppc: Missing character write event in the replay log
>> $ echo $?
>> 1
>>
>> Latest events are CHECKPOINT CHECKPOINT INTERRUPT INTERRUPT INTERRUPT.
>>
>> Replay file is ~22MiB. End of record using "system_powerdown + quit"
>> in HMP.
>>
>> I guess we have 2 bugs:
>> - replay log
>> - avocado doesn't catch children exit(1)
>>
>> Quick reproducer:
>>
>> $ make qemu-system-ppc check-venv
>> $ tests/venv/bin/python -m \
>>    avocado --show=app,console,replay \
>>    run --job-timeout 300 -t machine:mac99 \
>>    tests/acceptance/replay_kernel.py
> 
> Thanks, that was helpful. ... and the winner is:
> 
>      commit   55adb3c45620c31f29978f209e2a44a08d34e2da
>      Author:  John Snow <jsnow@redhat.com>
>      Date:    Fri Jul 24 01:23:00 2020 -0400
>      Subject: ide: cancel pending callbacks on SRST
> 
> ... starting with this commit, the tests starts failing. John, any idea what
> might be causing this?

This patch includes the following lines:

+        aio_bh_schedule_oneshot(qemu_get_aio_context(),
+                                ide_bus_perform_srst, bus);

replay_bh_schedule_oneshot_event should be used instead of this 
function, because it synchronizes non-deterministic BHs.


> 
>   Thomas
> 




* Re: acceptance-system-fedora failures
  2020-10-07  8:51       ` Pavel Dovgalyuk
@ 2020-10-07  9:57         ` Philippe Mathieu-Daudé
  2020-10-07 11:22           ` Alex Bennée
  2020-10-07 12:17           ` Pavel Dovgalyuk
  0 siblings, 2 replies; 19+ messages in thread
From: Philippe Mathieu-Daudé @ 2020-10-07  9:57 UTC (permalink / raw)
  To: Pavel Dovgalyuk, Thomas Huth, John Snow, Cleber Rosa, Pavel Dovgalyuk
  Cc: Richard Henderson, Alex Bennée, Eduardo Habkost,
	Wainer dos Santos Moschetta, QEMU Developers

On 10/7/20 10:51 AM, Pavel Dovgalyuk wrote:
> On 07.10.2020 11:23, Thomas Huth wrote:
>> On 07/10/2020 09.13, Philippe Mathieu-Daudé wrote:
>>> On 10/7/20 7:20 AM, Philippe Mathieu-Daudé wrote:
>>>> On 10/7/20 1:07 AM, John Snow wrote:
>>>>> I'm seeing this gitlab test fail quite often in my Python work; I
>>>>> don't
>>>>> *think* this has anything to do with my patches, but maybe I need
>>>>> to try
>>>>> and bisect this more aggressively.
>> [...]
>>>> w.r.t. the error in your build, I told Thomas about the
>>>> test_ppc_mac99/day15/invaders.elf timeouting but he said this is
>>>> not his area. Richard has been looking yesterday to see if it is
>>>> a TCG regression, and said the test either finished/crashed raising
>>>> SIGCHLD, but Avocado parent is still waiting for a timeout, so the
>>>> children become zombie and the test hang.
>>>
>>> Expected output:
>>>
>>> Quiescing Open Firmware ...
>>> Booting Linux via __start() @ 0x01000000 ...
>>>
>>> But QEMU exits in replay_char_write_event_load():
>>>
>>> Quiescing Open Firmware ...
>>> qemu-system-ppc: Missing character write event in the replay log
>>> $ echo $?
>>> 1
>>>
>>> Latest events are CHECKPOINT CHECKPOINT INTERRUPT INTERRUPT INTERRUPT.
>>>
>>> Replay file is ~22MiB. End of record using "system_powerdown + quit"
>>> in HMP.
>>>
>>> I guess we have 2 bugs:
>>> - replay log
>>> - avocado doesn't catch children exit(1)
>>>
>>> Quick reproducer:
>>>
>>> $ make qemu-system-ppc check-venv
>>> $ tests/venv/bin/python -m \
>>>    avocado --show=app,console,replay \
>>>    run --job-timeout 300 -t machine:mac99 \
>>>    tests/acceptance/replay_kernel.py
>>
>> Thanks, that was helpful. ... and the winner is:
>>
>>      commit   55adb3c45620c31f29978f209e2a44a08d34e2da
>>      Author:  John Snow <jsnow@redhat.com>
>>      Date:    Fri Jul 24 01:23:00 2020 -0400
>>      Subject: ide: cancel pending callbacks on SRST
>>
>> ... starting with this commit, the tests starts failing. John, any
>> idea what
>> might be causing this?
> 
> This patch includes the following lines:
> 
> +        aio_bh_schedule_oneshot(qemu_get_aio_context(),
> +                                ide_bus_perform_srst, bus);
> 
> replay_bh_schedule_oneshot_event should be used instead of this
> function, because it synchronizes non-deterministic BHs.

Why do we have 2 different functions? BHs are already complex
enough, and now we also need to think about the replay API...

What about the other cases, such as vhost-user (blk/net) and
virtio-blk?

> 
> 
>>
>>   Thomas
>>
> 




* Re: acceptance-system-fedora failures
  2020-10-07  9:57         ` Philippe Mathieu-Daudé
@ 2020-10-07 11:22           ` Alex Bennée
  2020-10-07 12:20             ` Pavel Dovgalyuk
  2020-10-07 12:17           ` Pavel Dovgalyuk
  1 sibling, 1 reply; 19+ messages in thread
From: Alex Bennée @ 2020-10-07 11:22 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé
  Cc: Thomas Huth, Pavel Dovgalyuk, Eduardo Habkost, QEMU Developers,
	Wainer dos Santos Moschetta, Pavel Dovgalyuk, Cleber Rosa,
	John Snow, Richard Henderson


Philippe Mathieu-Daudé <philmd@redhat.com> writes:

> On 10/7/20 10:51 AM, Pavel Dovgalyuk wrote:
>> On 07.10.2020 11:23, Thomas Huth wrote:
>>> On 07/10/2020 09.13, Philippe Mathieu-Daudé wrote:
>>>> On 10/7/20 7:20 AM, Philippe Mathieu-Daudé wrote:
>>>>> On 10/7/20 1:07 AM, John Snow wrote:
>>>>>> I'm seeing this gitlab test fail quite often in my Python work; I
>>>>>> don't
>>>>>> *think* this has anything to do with my patches, but maybe I need
>>>>>> to try
>>>>>> and bisect this more aggressively.
>>> [...]
>>>>> w.r.t. the error in your build, I told Thomas about the
>>>>> test_ppc_mac99/day15/invaders.elf timeouting but he said this is
>>>>> not his area. Richard has been looking yesterday to see if it is
>>>>> a TCG regression, and said the test either finished/crashed raising
>>>>> SIGCHLD, but Avocado parent is still waiting for a timeout, so the
>>>>> children become zombie and the test hang.
>>>>
>>>> Expected output:
>>>>
>>>> Quiescing Open Firmware ...
>>>> Booting Linux via __start() @ 0x01000000 ...
>>>>
>>>> But QEMU exits in replay_char_write_event_load():
>>>>
>>>> Quiescing Open Firmware ...
>>>> qemu-system-ppc: Missing character write event in the replay log
>>>> $ echo $?
>>>> 1
>>>>
>>>> Latest events are CHECKPOINT CHECKPOINT INTERRUPT INTERRUPT INTERRUPT.
>>>>
>>>> Replay file is ~22MiB. End of record using "system_powerdown + quit"
>>>> in HMP.
>>>>
>>>> I guess we have 2 bugs:
>>>> - replay log
>>>> - avocado doesn't catch children exit(1)
>>>>
>>>> Quick reproducer:
>>>>
>>>> $ make qemu-system-ppc check-venv
>>>> $ tests/venv/bin/python -m \
>>>>    avocado --show=app,console,replay \
>>>>    run --job-timeout 300 -t machine:mac99 \
>>>>    tests/acceptance/replay_kernel.py
>>>
>>> Thanks, that was helpful. ... and the winner is:
>>>
>>>      commit   55adb3c45620c31f29978f209e2a44a08d34e2da
>>>      Author:  John Snow <jsnow@redhat.com>
>>>      Date:    Fri Jul 24 01:23:00 2020 -0400
>>>      Subject: ide: cancel pending callbacks on SRST
>>>
>>> ... starting with this commit, the tests starts failing. John, any
>>> idea what
>>> might be causing this?
>> 
>> This patch includes the following lines:
>> 
>> +        aio_bh_schedule_oneshot(qemu_get_aio_context(),
>> +                                ide_bus_perform_srst, bus);
>> 
>> replay_bh_schedule_oneshot_event should be used instead of this
>> function, because it synchronizes non-deterministic BHs.
>
> Why do we have 2 different functions? BH are already complex
> enough, and we need to also think about the replay API...
>
> What about the other cases such vhost-user (blk/net), virtio-blk?

This does seem like something that should be wrapped up inside
aio_bh_schedule_oneshot itself, or maybe we need an
aio_bh_schedule_transaction_oneshot to distinguish it from the
function's other uses.


-- 
Alex Bennée



* Re: acceptance-system-fedora failures
  2020-10-07  9:57         ` Philippe Mathieu-Daudé
  2020-10-07 11:22           ` Alex Bennée
@ 2020-10-07 12:17           ` Pavel Dovgalyuk
  1 sibling, 0 replies; 19+ messages in thread
From: Pavel Dovgalyuk @ 2020-10-07 12:17 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé, Thomas Huth, John Snow, Cleber Rosa
  Cc: Richard Henderson, Alex Bennée, Eduardo Habkost,
	Wainer dos Santos Moschetta, QEMU Developers

On 07.10.2020 12:57, Philippe Mathieu-Daudé wrote:
> On 10/7/20 10:51 AM, Pavel Dovgalyuk wrote:
>> On 07.10.2020 11:23, Thomas Huth wrote:
>>> On 07/10/2020 09.13, Philippe Mathieu-Daudé wrote:
>>>> On 10/7/20 7:20 AM, Philippe Mathieu-Daudé wrote:
>>>>> On 10/7/20 1:07 AM, John Snow wrote:
>>>>>> I'm seeing this gitlab test fail quite often in my Python work; I
>>>>>> don't
>>>>>> *think* this has anything to do with my patches, but maybe I need
>>>>>> to try
>>>>>> and bisect this more aggressively.
>>> [...]
>>>>> w.r.t. the error in your build, I told Thomas about the
>>>>> test_ppc_mac99/day15/invaders.elf timeouting but he said this is
>>>>> not his area. Richard has been looking yesterday to see if it is
>>>>> a TCG regression, and said the test either finished/crashed raising
>>>>> SIGCHLD, but Avocado parent is still waiting for a timeout, so the
>>>>> children become zombie and the test hang.
>>>>
>>>> Expected output:
>>>>
>>>> Quiescing Open Firmware ...
>>>> Booting Linux via __start() @ 0x01000000 ...
>>>>
>>>> But QEMU exits in replay_char_write_event_load():
>>>>
>>>> Quiescing Open Firmware ...
>>>> qemu-system-ppc: Missing character write event in the replay log
>>>> $ echo $?
>>>> 1
>>>>
>>>> Latest events are CHECKPOINT CHECKPOINT INTERRUPT INTERRUPT INTERRUPT.
>>>>
>>>> Replay file is ~22MiB. End of record using "system_powerdown + quit"
>>>> in HMP.
>>>>
>>>> I guess we have 2 bugs:
>>>> - replay log
>>>> - avocado doesn't catch children exit(1)
>>>>
>>>> Quick reproducer:
>>>>
>>>> $ make qemu-system-ppc check-venv
>>>> $ tests/venv/bin/python -m \
>>>>     avocado --show=app,console,replay \
>>>>     run --job-timeout 300 -t machine:mac99 \
>>>>     tests/acceptance/replay_kernel.py
>>>
>>> Thanks, that was helpful. ... and the winner is:
>>>
>>>       commit   55adb3c45620c31f29978f209e2a44a08d34e2da
>>>       Author:  John Snow <jsnow@redhat.com>
>>>       Date:    Fri Jul 24 01:23:00 2020 -0400
>>>       Subject: ide: cancel pending callbacks on SRST
>>>
>>> ... starting with this commit, the tests starts failing. John, any
>>> idea what
>>> might be causing this?
>>
>> This patch includes the following lines:
>>
>> +        aio_bh_schedule_oneshot(qemu_get_aio_context(),
>> +                                ide_bus_perform_srst, bus);
>>
>> replay_bh_schedule_oneshot_event should be used instead of this
>> function, because it synchronizes non-deterministic BHs.
> 
> Why do we have 2 different functions? BH are already complex
> enough, and we need to also think about the replay API...
> 

There is a note about it in docs/devel/replay.txt.
It's hard to protect the guest state from incorrect operations.
There was a similar problem with icount - everyone who modifies
the translation modules needs to take it into account.
But now we have record/replay tests that assert that patches do
not break icount/rr.

> What about the other cases such vhost-user (blk/net), virtio-blk?

I do not know much about these modules.
The main idea is the following: if the code works with the guest
state, it should deal with the icount and rr functions.

Pavel Dovgalyuk



* Re: acceptance-system-fedora failures
  2020-10-07 11:22           ` Alex Bennée
@ 2020-10-07 12:20             ` Pavel Dovgalyuk
  2020-10-07 12:49               ` Philippe Mathieu-Daudé
  0 siblings, 1 reply; 19+ messages in thread
From: Pavel Dovgalyuk @ 2020-10-07 12:20 UTC (permalink / raw)
  To: Alex Bennée, Philippe Mathieu-Daudé
  Cc: Thomas Huth, Eduardo Habkost, QEMU Developers,
	Wainer dos Santos Moschetta, Pavel Dovgalyuk, Cleber Rosa,
	John Snow, Richard Henderson

On 07.10.2020 14:22, Alex Bennée wrote:
> 
> Philippe Mathieu-Daudé <philmd@redhat.com> writes:
> 
>> On 10/7/20 10:51 AM, Pavel Dovgalyuk wrote:
>>> On 07.10.2020 11:23, Thomas Huth wrote:
>>>> On 07/10/2020 09.13, Philippe Mathieu-Daudé wrote:
>>>> Thanks, that was helpful. ... and the winner is:
>>>>
>>>>       commit   55adb3c45620c31f29978f209e2a44a08d34e2da
>>>>       Author:  John Snow <jsnow@redhat.com>
>>>>       Date:    Fri Jul 24 01:23:00 2020 -0400
>>>>       Subject: ide: cancel pending callbacks on SRST
>>>>
>>>> ... starting with this commit, the tests starts failing. John, any
>>>> idea what
>>>> might be causing this?
>>>
>>> This patch includes the following lines:
>>>
>>> +        aio_bh_schedule_oneshot(qemu_get_aio_context(),
>>> +                                ide_bus_perform_srst, bus);
>>>
>>> replay_bh_schedule_oneshot_event should be used instead of this
>>> function, because it synchronizes non-deterministic BHs.
>>
>> Why do we have 2 different functions? BH are already complex
>> enough, and we need to also think about the replay API...
>>
>> What about the other cases such vhost-user (blk/net), virtio-blk?
> 
> This does seem like something that should be wrapped up inside
> aio_bh_schedule_oneshot itself or maybe we need a
> aio_bh_schedule_transaction_oneshot to distinguish it from the other
> uses the function has.
> 

Maybe there should be two functions:
- one for guest modifications
- one for internal QEMU things

The first one may be implemented through rr plus the second one.
Maybe the replay_ prefix is confusing, and the whole BH interface
should look like that?


Pavel Dovgalyuk



* Re: acceptance-system-fedora failures
  2020-10-07 12:20             ` Pavel Dovgalyuk
@ 2020-10-07 12:49               ` Philippe Mathieu-Daudé
  2020-10-07 13:11                 ` Pavel Dovgalyuk
  0 siblings, 1 reply; 19+ messages in thread
From: Philippe Mathieu-Daudé @ 2020-10-07 12:49 UTC (permalink / raw)
  To: Pavel Dovgalyuk, Alex Bennée, Stefan Hajnoczi
  Cc: Thomas Huth, Eduardo Habkost, QEMU Developers,
	Wainer dos Santos Moschetta, Pavel Dovgalyuk, Cleber Rosa,
	Paolo Bonzini, John Snow

On 10/7/20 2:20 PM, Pavel Dovgalyuk wrote:
> On 07.10.2020 14:22, Alex Bennée wrote:
>>
>> Philippe Mathieu-Daudé <philmd@redhat.com> writes:
>>
>>> On 10/7/20 10:51 AM, Pavel Dovgalyuk wrote:
>>>> On 07.10.2020 11:23, Thomas Huth wrote:
>>>>> On 07/10/2020 09.13, Philippe Mathieu-Daudé wrote:
>>>>> Thanks, that was helpful. ... and the winner is:
>>>>>
>>>>>       commit   55adb3c45620c31f29978f209e2a44a08d34e2da
>>>>>       Author:  John Snow <jsnow@redhat.com>
>>>>>       Date:    Fri Jul 24 01:23:00 2020 -0400
>>>>>       Subject: ide: cancel pending callbacks on SRST
>>>>>
>>>>> ... starting with this commit, the tests starts failing. John, any
>>>>> idea what
>>>>> might be causing this?
>>>>
>>>> This patch includes the following lines:
>>>>
>>>> +        aio_bh_schedule_oneshot(qemu_get_aio_context(),
>>>> +                                ide_bus_perform_srst, bus);
>>>>
>>>> replay_bh_schedule_oneshot_event should be used instead of this
>>>> function, because it synchronizes non-deterministic BHs.
>>>
>>> Why do we have 2 different functions? BH are already complex
>>> enough, and we need to also think about the replay API...
>>>
>>> What about the other cases such vhost-user (blk/net), virtio-blk?
>>
>> This does seem like something that should be wrapped up inside
>> aio_bh_schedule_oneshot itself or maybe we need a
>> aio_bh_schedule_transaction_oneshot to distinguish it from the other
>> uses the function has.
>>
> 
> Maybe there should be two functions:
> - one for the guest modification

aio_bh_schedule_oneshot_deterministic()?

> - one for internal qemu things

Not sure why there is a difference; BHs are used to
avoid delaying the guest, so there is always something
related to "guest modification".

> 
> The first one may be implemented though the rr+second one.
> Maybe replay_ prefix is confusing and the whole BH interface should look
> like that?

Yes, but it would be safer/clearer if we didn't need to use
a replay_ API.

Can we embed these functions?

- replay_bh_schedule_event
- replay_bh_schedule_oneshot_event

If compiled without rr, events_enabled is false and the
compiler can optimize:

-- >8 --
diff --git a/util/async.c b/util/async.c
index f758354c6a..376b6d4e27 100644
--- a/util/async.c
+++ b/util/async.c
@@ -109,14 +109,19 @@ static QEMUBH *aio_bh_dequeue(BHList *head, unsigned *flags)

 void aio_bh_schedule_oneshot(AioContext *ctx, QEMUBHFunc *cb, void *opaque)
 {
-    QEMUBH *bh;
-    bh = g_new(QEMUBH, 1);
-    *bh = (QEMUBH){
-        .ctx = ctx,
-        .cb = cb,
-        .opaque = opaque,
-    };
-    aio_bh_enqueue(bh, BH_SCHEDULED | BH_ONESHOT);
+    if (events_enabled) {
+        replay_add_event(REPLAY_ASYNC_EVENT_BH_ONESHOT, cb,
+                         opaque, replay_get_current_icount());
+    } else {
+        QEMUBH *bh;
+        bh = g_new(QEMUBH, 1);
+        *bh = (QEMUBH){
+            .ctx = ctx,
+            .cb = cb,
+            .opaque = opaque,
+        };
+        aio_bh_enqueue(bh, BH_SCHEDULED | BH_ONESHOT);
+    }
 }

 QEMUBH *aio_bh_new(AioContext *ctx, QEMUBHFunc *cb, void *opaque)
@@ -178,7 +183,12 @@ void qemu_bh_schedule_idle(QEMUBH *bh)

 void qemu_bh_schedule(QEMUBH *bh)
 {
-    aio_bh_enqueue(bh, BH_SCHEDULED);
+    if (events_enabled) {
+        replay_add_event(REPLAY_ASYNC_EVENT_BH, bh, NULL,
+                         replay_get_current_icount());
+    } else {
+        aio_bh_enqueue(bh, BH_SCHEDULED);
+    }
 }

---

> 
> 
> Pavel Dovgalyuk
> 



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: acceptance-system-fedora failures
  2020-10-07 12:49               ` Philippe Mathieu-Daudé
@ 2020-10-07 13:11                 ` Pavel Dovgalyuk
  2020-10-08 10:26                   ` Philippe Mathieu-Daudé
  0 siblings, 1 reply; 19+ messages in thread
From: Pavel Dovgalyuk @ 2020-10-07 13:11 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé, Alex Bennée, Stefan Hajnoczi
  Cc: Thomas Huth, Eduardo Habkost, QEMU Developers,
	Wainer dos Santos Moschetta, Cleber Rosa, Paolo Bonzini,
	John Snow

On 07.10.2020 15:49, Philippe Mathieu-Daudé wrote:
> On 10/7/20 2:20 PM, Pavel Dovgalyuk wrote:
>> On 07.10.2020 14:22, Alex Bennée wrote:
>>>
>>> Philippe Mathieu-Daudé <philmd@redhat.com> writes:
>>>
>>>> On 10/7/20 10:51 AM, Pavel Dovgalyuk wrote:
>>>>> On 07.10.2020 11:23, Thomas Huth wrote:
>>>>>> On 07/10/2020 09.13, Philippe Mathieu-Daudé wrote:
>>>>>> Thanks, that was helpful. ... and the winner is:
>>>>>>
>>>>>>        commit   55adb3c45620c31f29978f209e2a44a08d34e2da
>>>>>>        Author:  John Snow <jsnow@redhat.com>
>>>>>>        Date:    Fri Jul 24 01:23:00 2020 -0400
>>>>>>        Subject: ide: cancel pending callbacks on SRST
>>>>>>
>>>>>> ... starting with this commit, the tests start failing. John, any
>>>>>> idea what
>>>>>> might be causing this?
>>>>>
>>>>> This patch includes the following lines:
>>>>>
>>>>> +        aio_bh_schedule_oneshot(qemu_get_aio_context(),
>>>>> +                                ide_bus_perform_srst, bus);
>>>>>
>>>>> replay_bh_schedule_oneshot_event should be used instead of this
>>>>> function, because it synchronizes non-deterministic BHs.
>>>>
>>>> Why do we have 2 different functions? BH are already complex
>>>> enough, and we need to also think about the replay API...
>>>>
>>>> What about the other cases such as vhost-user (blk/net), virtio-blk?
>>>
>>> This does seem like something that should be wrapped up inside
>>> aio_bh_schedule_oneshot itself or maybe we need a
>>> aio_bh_schedule_transaction_oneshot to distinguish it from the other
>>> uses the function has.
>>>
>>
>> Maybe there should be two functions:
>> - one for the guest modification
> 
> aio_bh_schedule_oneshot_deterministic()?
> 
>> - one for internal qemu things
> 
> Not sure why there is a difference; BHs are used to
> avoid delaying the guest, so there is always something
> related to "guest modification".

Not exactly. At least there is one non-related-to-guest case
in monitor_init_qmp:
         /*
          * We can't call qemu_chr_fe_set_handlers() directly here
          * since chardev might be running in the monitor I/O
          * thread.  Schedule a bottom half.
          */
         aio_bh_schedule_oneshot(iothread_get_aio_context(mon_iothread),
                                 monitor_qmp_setup_handlers_bh, mon);


> 
>>
>> The first one may be implemented though the rr+second one.
>> Maybe replay_ prefix is confusing and the whole BH interface should look
>> like that?
> 
> Yes, but it would be safer/clearer if we don't need to use
> a replay_ API.
> 
> Can we embed these functions?
> 
> - replay_bh_schedule_event
> - replay_bh_schedule_oneshot_event
> 
> If compiled without rr, events_enabled=false and
> compiler can optimize:
> 
> -- >8 --
> diff --git a/util/async.c b/util/async.c
> index f758354c6a..376b6d4e27 100644
> --- a/util/async.c
> +++ b/util/async.c
> @@ -109,14 +109,19 @@ static QEMUBH *aio_bh_dequeue(BHList *head,
> unsigned *flags)
> 
>   void aio_bh_schedule_oneshot(AioContext *ctx, QEMUBHFunc *cb, void *opaque)
>   {
> -    QEMUBH *bh;
> -    bh = g_new(QEMUBH, 1);
> -    *bh = (QEMUBH){
> -        .ctx = ctx,
> -        .cb = cb,
> -        .opaque = opaque,
> -    };
> -    aio_bh_enqueue(bh, BH_SCHEDULED | BH_ONESHOT);
> +    if (events_enabled) {
> +        replay_add_event(REPLAY_ASYNC_EVENT_BH_ONESHOT, cb,
> +                         opaque, replay_get_current_icount());
> +    } else {
> +        QEMUBH *bh;
> +        bh = g_new(QEMUBH, 1);
> +        *bh = (QEMUBH){
> +            .ctx = ctx,
> +            .cb = cb,
> +            .opaque = opaque,
> +        };
> +        aio_bh_enqueue(bh, BH_SCHEDULED | BH_ONESHOT);
> +    }
>   }
> 
>   QEMUBH *aio_bh_new(AioContext *ctx, QEMUBHFunc *cb, void *opaque)
> @@ -178,7 +183,12 @@ void qemu_bh_schedule_idle(QEMUBH *bh)
> 
>   void qemu_bh_schedule(QEMUBH *bh)

qemu_bh_schedule is even worse.
Many modules use it (e.g., loadvm_postcopy_handle_run), and there is no 
need to use the replay version there. In such cases QEMU would halt if it 
tried to call replay_bh_schedule_event.

>   {
> -    aio_bh_enqueue(bh, BH_SCHEDULED);
> +    if (events_enabled) {
> +        replay_add_event(REPLAY_ASYNC_EVENT_BH, bh, NULL,
> +                         replay_get_current_icount());
> +    } else {
> +        aio_bh_enqueue(bh, BH_SCHEDULED);
> +    }
>   }
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: acceptance-system-fedora failures
  2020-10-07  8:23     ` Thomas Huth
  2020-10-07  8:51       ` Pavel Dovgalyuk
@ 2020-10-07 14:03       ` John Snow
  1 sibling, 0 replies; 19+ messages in thread
From: John Snow @ 2020-10-07 14:03 UTC (permalink / raw)
  To: Thomas Huth, Philippe Mathieu-Daudé, Cleber Rosa, Pavel Dovgalyuk
  Cc: Richard Henderson, Alex Bennée, Eduardo Habkost,
	Wainer dos Santos Moschetta, QEMU Developers

On 10/7/20 4:23 AM, Thomas Huth wrote:
> 
> ... starting with this commit, the tests start failing. John, any idea what
> might be causing this?
> 
>   Thomas

Ahh, oh no! It *was* my fault, just nothing to do with what I was 
actually working on. Good!

I'll look a little more closely; I see some more discussion below. I'll 
see if the recommended changes to the BH oneshot help alleviate the problem.

Thanks.

--js



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: acceptance-system-fedora failures
  2020-10-07  5:20 ` Philippe Mathieu-Daudé
  2020-10-07  7:13   ` Philippe Mathieu-Daudé
  2020-10-07  7:23   ` Thomas Huth
@ 2020-10-07 14:38   ` Cleber Rosa
  2 siblings, 0 replies; 19+ messages in thread
From: Cleber Rosa @ 2020-10-07 14:38 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé
  Cc: Thomas Huth, Eduardo Habkost, John Snow, QEMU Developers,
	Wainer dos Santos Moschetta, Alex Bennée, Richard Henderson

On Wed, Oct 07, 2020 at 07:20:49AM +0200, Philippe Mathieu-Daudé wrote:
> On 10/7/20 1:07 AM, John Snow wrote:
> > I'm seeing this gitlab test fail quite often in my Python work; I don't
> > *think* this has anything to do with my patches, but maybe I need to try
> > and bisect this more aggressively.
> > 
> > The very first hint of an error I see is on line 154:
> > 
> > https://gitlab.com/jsnow/qemu/-/jobs/776334918#L154
> > 
> > 22:05:36 ERROR|
> > 22:05:36 ERROR| Reproduced traceback from:
> > /builds/jsnow/qemu/build/tests/venv/lib64/python3.8/site-packages/avocado/core/test.py:753
> > 
> > 22:05:36 ERROR| Traceback (most recent call last):
> > 22:05:36 ERROR|   File
> > "/builds/jsnow/qemu/build/tests/acceptance/avocado_qemu/__init__.py",
> > line 171, in setUp
> > 22:05:36 ERROR|     self.cancel("No QEMU binary defined or found in the
> > build tree")
>

One very bad aspect of this output is that the test outcome (a test
cancellation) is reported via an exception handled by the runner, and
the "ERROR" prefix is indeed very misleading.

But yes, in short, it's not an *error*, but the test getting canceled.

> Last year the Avocado developers said we could have a clearer
> log error report, but meanwhile this verbose output is better
> than not having anything ¯\_(ツ)_/¯
>

With the new test runner ("avocado run --test-runner=nrunner") that
just made its way into Avocado 81.0, there's a much better separation
of what happens within the test and what happens within the runner.

The next step is to post the QEMU and "Avocado >= 82.0" integration, so
hopefully this will improve soon.

> > 
> > Is this a known problem?
> 
> "No QEMU binary defined or found in the build tree" is not a
> problem, it means a test is skipped because the qemu-system-$ARCH
> binary is not found. In your case this is because your job
> (acceptance-system-fedora) is based on build-system-fedora
> which only builds the following targets:
> 
>     TARGETS: tricore-softmmu microblaze-softmmu mips-softmmu
>       xtensa-softmmu m68k-softmmu riscv32-softmmu ppc-softmmu
>       sparc64-softmmu
>

Right.

> Now I don't understand what binary the EmptyCPUModel/Migration tests
> are expecting. Maybe these tests only work when a single target is
> built? IOW not expecting that the python code searching for a binary
> returns multiple ones? -> Eduardo/Cleber.
>

`avocado_qemu.pick_default_qemu()`, if not given an arch, will try to
find a target binary that matches the host.  The problem with picking
any (first available?) built target is the non-deterministic aspect.

BTW, with regards to how `avocado_qemu.pick_default_qemu()` will get
an arch, it can come either from a test parameter, or from an "arch"
tag.

Cheers,
- Cleber.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: acceptance-system-fedora failures
  2020-10-07 13:11                 ` Pavel Dovgalyuk
@ 2020-10-08 10:26                   ` Philippe Mathieu-Daudé
  2020-10-08 11:50                     ` Kevin Wolf
  0 siblings, 1 reply; 19+ messages in thread
From: Philippe Mathieu-Daudé @ 2020-10-08 10:26 UTC (permalink / raw)
  To: Pavel Dovgalyuk, Alex Bennée, Stefan Hajnoczi
  Cc: Thomas Huth, Eduardo Habkost, Qemu-block, QEMU Developers,
	Wainer dos Santos Moschetta, Cleber Rosa, Paolo Bonzini,
	John Snow

On 10/7/20 3:11 PM, Pavel Dovgalyuk wrote:
> On 07.10.2020 15:49, Philippe Mathieu-Daudé wrote:
>> On 10/7/20 2:20 PM, Pavel Dovgalyuk wrote:
>>> On 07.10.2020 14:22, Alex Bennée wrote:
>>>>
>>>> Philippe Mathieu-Daudé <philmd@redhat.com> writes:
>>>>
>>>>> On 10/7/20 10:51 AM, Pavel Dovgalyuk wrote:
>>>>>> On 07.10.2020 11:23, Thomas Huth wrote:
>>>>>>> On 07/10/2020 09.13, Philippe Mathieu-Daudé wrote:
>>>>>>> Thanks, that was helpful. ... and the winner is:
>>>>>>>
>>>>>>>        commit   55adb3c45620c31f29978f209e2a44a08d34e2da
>>>>>>>        Author:  John Snow <jsnow@redhat.com>
>>>>>>>        Date:    Fri Jul 24 01:23:00 2020 -0400
>>>>>>>        Subject: ide: cancel pending callbacks on SRST
>>>>>>>
>>>>>>> ... starting with this commit, the tests start failing. John, any
>>>>>>> idea what
>>>>>>> might be causing this?
>>>>>>
>>>>>> This patch includes the following lines:
>>>>>>
>>>>>> +        aio_bh_schedule_oneshot(qemu_get_aio_context(),
>>>>>> +                                ide_bus_perform_srst, bus);
>>>>>>
>>>>>> replay_bh_schedule_oneshot_event should be used instead of this
>>>>>> function, because it synchronizes non-deterministic BHs.
>>>>>
>>>>> Why do we have 2 different functions? BH are already complex
>>>>> enough, and we need to also think about the replay API...
>>>>>
>>>>> What about the other cases such as vhost-user (blk/net), virtio-blk?
>>>>
>>>> This does seem like something that should be wrapped up inside
>>>> aio_bh_schedule_oneshot itself or maybe we need a
>>>> aio_bh_schedule_transaction_oneshot to distinguish it from the other
>>>> uses the function has.
>>>>
>>>
>>> Maybe there should be two functions:
>>> - one for the guest modification
>>
>> aio_bh_schedule_oneshot_deterministic()?
>>
>>> - one for internal qemu things
>>
>> Not sure why there is a difference; BHs are used to
>> avoid delaying the guest, so there is always something
>> related to "guest modification".
> 
> Not exactly. At least there is one non-related-to-guest case
> in monitor_init_qmp:
>         /*
>          * We can't call qemu_chr_fe_set_handlers() directly here
>          * since chardev might be running in the monitor I/O
>          * thread.  Schedule a bottom half.
>          */
>         aio_bh_schedule_oneshot(iothread_get_aio_context(mon_iothread),
>                                 monitor_qmp_setup_handlers_bh, mon);

I don't understand the documentation in docs/devel/replay.txt:

---
Bottom halves
=============

Bottom half callbacks, that affect the guest state, should be invoked
through
replay_bh_schedule_event or replay_bh_schedule_oneshot_event functions.
Their invocations are saved in record mode and synchronized with the
existing
log in replay mode.
---

But then it is only used in block drivers, which are not
related to guest state:

$ git grep replay_bh_schedule_oneshot_event
block/block-backend.c:1385:
replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
block/block-backend.c:1450:
replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
block/io.c:371:
replay_bh_schedule_oneshot_event(bdrv_get_aio_context(bs),
block/iscsi.c:286:
replay_bh_schedule_oneshot_event(iTask->iscsilun->aio_context,
block/nfs.c:262:
replay_bh_schedule_oneshot_event(task->client->aio_context,
block/null.c:183:
replay_bh_schedule_oneshot_event(bdrv_get_aio_context(bs),
block/nvme.c:323:        replay_bh_schedule_oneshot_event(q->aio_context,
block/nvme.c:1075:    replay_bh_schedule_oneshot_event(data->ctx,
nvme_rw_cb_bh, data);
block/rbd.c:865:
replay_bh_schedule_oneshot_event(bdrv_get_aio_context(acb->common.bs),
docs/devel/replay.txt:25:replay_bh_schedule_event or
replay_bh_schedule_oneshot_event functions.
include/sysemu/replay.h:178:void
replay_bh_schedule_oneshot_event(AioContext *ctx,
replay/replay-events.c:141:void
replay_bh_schedule_oneshot_event(AioContext *ctx,
stubs/replay-user.c:5:void replay_bh_schedule_oneshot_event(AioContext *ctx,

Is replay_bh_schedule_oneshot_event ever used by replay?
Maybe we can remove it and use aio_bh_schedule_oneshot()
in its place?

Otherwise the documentation needs to be clarified, please.

> 
> 
>>
>>>
>>> The first one may be implemented though the rr+second one.
>>> Maybe replay_ prefix is confusing and the whole BH interface should look
>>> like that?
>>
>> Yes, but it would be safer/clearer if we don't need to use
>> a replay_ API.
>>
>> Can we embed these functions?
>>
>> - replay_bh_schedule_event
>> - replay_bh_schedule_oneshot_event
>>
>> If compiled without rr, events_enabled=false and
>> compiler can optimize:
>>
>> -- >8 --
>> diff --git a/util/async.c b/util/async.c
>> index f758354c6a..376b6d4e27 100644
>> --- a/util/async.c
>> +++ b/util/async.c
>> @@ -109,14 +109,19 @@ static QEMUBH *aio_bh_dequeue(BHList *head,
>> unsigned *flags)
>>
>>   void aio_bh_schedule_oneshot(AioContext *ctx, QEMUBHFunc *cb, void
>> *opaque)
>>   {
>> -    QEMUBH *bh;
>> -    bh = g_new(QEMUBH, 1);
>> -    *bh = (QEMUBH){
>> -        .ctx = ctx,
>> -        .cb = cb,
>> -        .opaque = opaque,
>> -    };
>> -    aio_bh_enqueue(bh, BH_SCHEDULED | BH_ONESHOT);
>> +    if (events_enabled) {
>> +        replay_add_event(REPLAY_ASYNC_EVENT_BH_ONESHOT, cb,
>> +                         opaque, replay_get_current_icount());
>> +    } else {
>> +        QEMUBH *bh;
>> +        bh = g_new(QEMUBH, 1);
>> +        *bh = (QEMUBH){
>> +            .ctx = ctx,
>> +            .cb = cb,
>> +            .opaque = opaque,
>> +        };
>> +        aio_bh_enqueue(bh, BH_SCHEDULED | BH_ONESHOT);
>> +    }
>>   }
>>
>>   QEMUBH *aio_bh_new(AioContext *ctx, QEMUBHFunc *cb, void *opaque)
>> @@ -178,7 +183,12 @@ void qemu_bh_schedule_idle(QEMUBH *bh)
>>
>>   void qemu_bh_schedule(QEMUBH *bh)
> 
> qemu_bh_schedule is even worse.
> Many modules use it (e.g., loadvm_postcopy_handle_run), and there is no
> need to use the replay version there. In such cases QEMU would halt if it
> tried to call replay_bh_schedule_event.
> 
>>   {
>> -    aio_bh_enqueue(bh, BH_SCHEDULED);
>> +    if (events_enabled) {
>> +        replay_add_event(REPLAY_ASYNC_EVENT_BH, bh, NULL,
>> +                         replay_get_current_icount());
>> +    } else {
>> +        aio_bh_enqueue(bh, BH_SCHEDULED);
>> +    }
>>   }
>>
> 



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: acceptance-system-fedora failures
  2020-10-08 10:26                   ` Philippe Mathieu-Daudé
@ 2020-10-08 11:50                     ` Kevin Wolf
  2020-10-09 10:37                       ` Pavel Dovgalyuk
  0 siblings, 1 reply; 19+ messages in thread
From: Kevin Wolf @ 2020-10-08 11:50 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé
  Cc: Thomas Huth, Pavel Dovgalyuk, Eduardo Habkost, Qemu-block,
	QEMU Developers, Wainer dos Santos Moschetta, Stefan Hajnoczi,
	Cleber Rosa, Paolo Bonzini, Alex Bennée

Am 08.10.2020 um 12:26 hat Philippe Mathieu-Daudé geschrieben:
> On 10/7/20 3:11 PM, Pavel Dovgalyuk wrote:
> > On 07.10.2020 15:49, Philippe Mathieu-Daudé wrote:
> >> On 10/7/20 2:20 PM, Pavel Dovgalyuk wrote:
> >>> On 07.10.2020 14:22, Alex Bennée wrote:
> >>>>
> >>>> Philippe Mathieu-Daudé <philmd@redhat.com> writes:
> >>>>
> >>>>> On 10/7/20 10:51 AM, Pavel Dovgalyuk wrote:
> >>>>>> On 07.10.2020 11:23, Thomas Huth wrote:
> >>>>>>> On 07/10/2020 09.13, Philippe Mathieu-Daudé wrote:
> >>>>>>> Thanks, that was helpful. ... and the winner is:
> >>>>>>>
> >>>>>>>        commit   55adb3c45620c31f29978f209e2a44a08d34e2da
> >>>>>>>        Author:  John Snow <jsnow@redhat.com>
> >>>>>>>        Date:    Fri Jul 24 01:23:00 2020 -0400
> >>>>>>>        Subject: ide: cancel pending callbacks on SRST
> >>>>>>>
> >>>>>>> ... starting with this commit, the tests start failing. John, any
> >>>>>>> idea what
> >>>>>>> might be causing this?
> >>>>>>
> >>>>>> This patch includes the following lines:
> >>>>>>
> >>>>>> +        aio_bh_schedule_oneshot(qemu_get_aio_context(),
> >>>>>> +                                ide_bus_perform_srst, bus);
> >>>>>>
> >>>>>> replay_bh_schedule_oneshot_event should be used instead of this
> >>>>>> function, because it synchronizes non-deterministic BHs.
> >>>>>
> >>>>> Why do we have 2 different functions? BH are already complex
> >>>>> enough, and we need to also think about the replay API...
> >>>>>
> >>>>> What about the other cases such as vhost-user (blk/net), virtio-blk?
> >>>>
> >>>> This does seem like something that should be wrapped up inside
> >>>> aio_bh_schedule_oneshot itself or maybe we need a
> >>>> aio_bh_schedule_transaction_oneshot to distinguish it from the other
> >>>> uses the function has.
> >>>>
> >>>
> >>> Maybe there should be two functions:
> >>> - one for the guest modification
> >>
> >> aio_bh_schedule_oneshot_deterministic()?
> >>
> >>> - one for internal qemu things
> >>
> >> Not sure why there is a difference; BHs are used to
> >> avoid delaying the guest, so there is always something
> >> related to "guest modification".
> > 
> > Not exactly. At least there is one non-related-to-guest case
> > in monitor_init_qmp:
> >         /*
> >          * We can't call qemu_chr_fe_set_handlers() directly here
> >          * since chardev might be running in the monitor I/O
> >          * thread.  Schedule a bottom half.
> >          */
> >         aio_bh_schedule_oneshot(iothread_get_aio_context(mon_iothread),
> >                                 monitor_qmp_setup_handlers_bh, mon);
> 
> I don't understand the documentation in docs/devel/replay.txt:
> 
> ---
> Bottom halves
> =============
> 
> Bottom half callbacks, that affect the guest state, should be invoked
> through
> replay_bh_schedule_event or replay_bh_schedule_oneshot_event functions.
> Their invocations are saved in record mode and synchronized with the
> existing
> log in replay mode.
> ---
> 
> But then it is only used in block drivers, which are not
> related to guest state:

Pavel can tell you the details, but I think the idea was that you need
to use this function not when the code calling it modifies guest state,
but when the BH implementation can do so.

In the case of generic callbacks like provided by the blk_aio_*()
functions, we don't know whether this is the case, but it's generally
device emulation code, so chances are relatively high that they do.

I seem to remember that when reviewing the code that introduced
replay_bh_schedule_event(), I was relatively sure that we didn't catch
all necessary instances, but since it worked for Pavel and I didn't feel
like getting too involved with replay code, we just merged it anyway.

As I said, the details are a question for Pavel.

Kevin

> $ git grep replay_bh_schedule_oneshot_event
> block/block-backend.c:1385:
> replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
> block/block-backend.c:1450:
> replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
> block/io.c:371:
> replay_bh_schedule_oneshot_event(bdrv_get_aio_context(bs),
> block/iscsi.c:286:
> replay_bh_schedule_oneshot_event(iTask->iscsilun->aio_context,
> block/nfs.c:262:
> replay_bh_schedule_oneshot_event(task->client->aio_context,
> block/null.c:183:
> replay_bh_schedule_oneshot_event(bdrv_get_aio_context(bs),
> block/nvme.c:323:        replay_bh_schedule_oneshot_event(q->aio_context,
> block/nvme.c:1075:    replay_bh_schedule_oneshot_event(data->ctx,
> nvme_rw_cb_bh, data);
> block/rbd.c:865:
> replay_bh_schedule_oneshot_event(bdrv_get_aio_context(acb->common.bs),
> docs/devel/replay.txt:25:replay_bh_schedule_event or
> replay_bh_schedule_oneshot_event functions.
> include/sysemu/replay.h:178:void
> replay_bh_schedule_oneshot_event(AioContext *ctx,
> replay/replay-events.c:141:void
> replay_bh_schedule_oneshot_event(AioContext *ctx,
> stubs/replay-user.c:5:void replay_bh_schedule_oneshot_event(AioContext *ctx,
> 
> Is replay_bh_schedule_oneshot_event ever used by replay?
> Maybe we can remove it and use aio_bh_schedule_oneshot()
> in place?
> 
> Else the documentation need to be clarified please.
> 
> > 
> > 
> >>
> >>>
> >>> The first one may be implemented though the rr+second one.
> >>> Maybe replay_ prefix is confusing and the whole BH interface should look
> >>> like that?
> >>
> >> Yes, but it would be safer/clearer if we don't need to use
> >> a replay_ API.
> >>
> >> Can we embed these functions?
> >>
> >> - replay_bh_schedule_event
> >> - replay_bh_schedule_oneshot_event
> >>
> >> If compiled without rr, events_enabled=false and
> >> compiler can optimize:
> >>
> >> -- >8 --
> >> diff --git a/util/async.c b/util/async.c
> >> index f758354c6a..376b6d4e27 100644
> >> --- a/util/async.c
> >> +++ b/util/async.c
> >> @@ -109,14 +109,19 @@ static QEMUBH *aio_bh_dequeue(BHList *head,
> >> unsigned *flags)
> >>
> >>   void aio_bh_schedule_oneshot(AioContext *ctx, QEMUBHFunc *cb, void
> >> *opaque)
> >>   {
> >> -    QEMUBH *bh;
> >> -    bh = g_new(QEMUBH, 1);
> >> -    *bh = (QEMUBH){
> >> -        .ctx = ctx,
> >> -        .cb = cb,
> >> -        .opaque = opaque,
> >> -    };
> >> -    aio_bh_enqueue(bh, BH_SCHEDULED | BH_ONESHOT);
> >> +    if (events_enabled) {
> >> +        replay_add_event(REPLAY_ASYNC_EVENT_BH_ONESHOT, cb,
> >> +                         opaque, replay_get_current_icount());
> >> +    } else {
> >> +        QEMUBH *bh;
> >> +        bh = g_new(QEMUBH, 1);
> >> +        *bh = (QEMUBH){
> >> +            .ctx = ctx,
> >> +            .cb = cb,
> >> +            .opaque = opaque,
> >> +        };
> >> +        aio_bh_enqueue(bh, BH_SCHEDULED | BH_ONESHOT);
> >> +    }
> >>   }
> >>
> >>   QEMUBH *aio_bh_new(AioContext *ctx, QEMUBHFunc *cb, void *opaque)
> >> @@ -178,7 +183,12 @@ void qemu_bh_schedule_idle(QEMUBH *bh)
> >>
> >>   void qemu_bh_schedule(QEMUBH *bh)
> > 
> > qemu_bh_schedule is even worse.
> > Many modules use it (e.g., loadvm_postcopy_handle_run), and there is no
> > need to use the replay version there. In such cases QEMU would halt if it
> > tried to call replay_bh_schedule_event.
> > 
> >>   {
> >> -    aio_bh_enqueue(bh, BH_SCHEDULED);
> >> +    if (events_enabled) {
> >> +        replay_add_event(REPLAY_ASYNC_EVENT_BH, bh, NULL,
> >> +                         replay_get_current_icount());
> >> +    } else {
> >> +        aio_bh_enqueue(bh, BH_SCHEDULED);
> >> +    }
> >>   }
> >>
> > 
> 
> 



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: acceptance-system-fedora failures
  2020-10-08 11:50                     ` Kevin Wolf
@ 2020-10-09 10:37                       ` Pavel Dovgalyuk
  2020-10-13  8:57                         ` Philippe Mathieu-Daudé
  0 siblings, 1 reply; 19+ messages in thread
From: Pavel Dovgalyuk @ 2020-10-09 10:37 UTC (permalink / raw)
  To: Kevin Wolf, Philippe Mathieu-Daudé
  Cc: Thomas Huth, Eduardo Habkost, Qemu-block, QEMU Developers,
	Wainer dos Santos Moschetta, Stefan Hajnoczi, Cleber Rosa,
	Paolo Bonzini, Alex Bennée

On 08.10.2020 14:50, Kevin Wolf wrote:
> Am 08.10.2020 um 12:26 hat Philippe Mathieu-Daudé geschrieben:
>> On 10/7/20 3:11 PM, Pavel Dovgalyuk wrote:
>>> On 07.10.2020 15:49, Philippe Mathieu-Daudé wrote:
>>>> On 10/7/20 2:20 PM, Pavel Dovgalyuk wrote:
>>>>> On 07.10.2020 14:22, Alex Bennée wrote:
>>>>>>
>>>>>> Philippe Mathieu-Daudé <philmd@redhat.com> writes:
>>>>>>
>>>>>>> On 10/7/20 10:51 AM, Pavel Dovgalyuk wrote:
>>>>>>>> On 07.10.2020 11:23, Thomas Huth wrote:
>>>>>>>>> On 07/10/2020 09.13, Philippe Mathieu-Daudé wrote:
>>>>>>>>> Thanks, that was helpful. ... and the winner is:
>>>>>>>>>
>>>>>>>>>        commit   55adb3c45620c31f29978f209e2a44a08d34e2da
>>>>>>>>>        Author:  John Snow <jsnow@redhat.com>
>>>>>>>>>        Date:    Fri Jul 24 01:23:00 2020 -0400
>>>>>>>>>        Subject: ide: cancel pending callbacks on SRST
>>>>>>>>>
>>>>>>>>> ... starting with this commit, the tests start failing. John, any
>>>>>>>>> idea what
>>>>>>>>> might be causing this?
>>>>>>>>
>>>>>>>> This patch includes the following lines:
>>>>>>>>
>>>>>>>> +        aio_bh_schedule_oneshot(qemu_get_aio_context(),
>>>>>>>> +                                ide_bus_perform_srst, bus);
>>>>>>>>
>>>>>>>> replay_bh_schedule_oneshot_event should be used instead of this
>>>>>>>> function, because it synchronizes non-deterministic BHs.
>>>>>>>
>>>>>>> Why do we have 2 different functions? BH are already complex
>>>>>>> enough, and we need to also think about the replay API...
>>>>>>>
>>>>>> What about the other cases such as vhost-user (blk/net), virtio-blk?
>>>>>>
>>>>>> This does seem like something that should be wrapped up inside
>>>>>> aio_bh_schedule_oneshot itself or maybe we need a
>>>>>> aio_bh_schedule_transaction_oneshot to distinguish it from the other
>>>>>> uses the function has.
>>>>>>
>>>>>
>>>>> Maybe there should be two functions:
>>>>> - one for the guest modification
>>>>
>>>> aio_bh_schedule_oneshot_deterministic()?
>>>>
>>>>> - one for internal qemu things
>>>>
>>>> Not sure why there is a difference; BHs are used to
>>>> avoid delaying the guest, so there is always something
>>>> related to "guest modification".
>>>
>>> Not exactly. At least there is one non-related-to-guest case
>>> in monitor_init_qmp:
>>>         /*
>>>          * We can't call qemu_chr_fe_set_handlers() directly here
>>>          * since chardev might be running in the monitor I/O
>>>          * thread.  Schedule a bottom half.
>>>          */
>>>         aio_bh_schedule_oneshot(iothread_get_aio_context(mon_iothread),
>>>                                 monitor_qmp_setup_handlers_bh, mon);
>>
>> I don't understand the documentation in docs/devel/replay.txt:
>>
>> ---
>> Bottom halves
>> =============
>>
>> Bottom half callbacks, that affect the guest state, should be invoked
>> through
>> replay_bh_schedule_event or replay_bh_schedule_oneshot_event functions.
>> Their invocations are saved in record mode and synchronized with the
>> existing
>> log in replay mode.
>> ---
>>
>> But then it is only used in block drivers, which are not
>> related to guest state:
> 
> Pavel can tell you the details, but I think the idea was that you need
> to use this function not when the code calling it modifies guest state,
> but when the BH implementation can do so.
> 
> In the case of generic callbacks like provided by the blk_aio_*()
> functions, we don't know whether this is the case, but it's generally
> device emulation code, so chances are relatively high that they do.
> 
> I seem to remember that when reviewing the code that introduced
> replay_bh_schedule_event(), I was relatively sure that we didn't catch
> all necessary instances, but since it worked for Pavel and I didn't feel
> like getting too involved with replay code, we just merged it anyway.

That's right.
The block layer does not touch the guest by itself.
But it includes callbacks that may raise interrupts when disk 
operations finish. That is why we synchronize these callbacks with the 
vCPU through the replay layer.

Pavel Dovgalyuk


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: acceptance-system-fedora failures
  2020-10-09 10:37                       ` Pavel Dovgalyuk
@ 2020-10-13  8:57                         ` Philippe Mathieu-Daudé
  0 siblings, 0 replies; 19+ messages in thread
From: Philippe Mathieu-Daudé @ 2020-10-13  8:57 UTC (permalink / raw)
  To: Pavel Dovgalyuk, Kevin Wolf
  Cc: Thomas Huth, Eduardo Habkost, Qemu-block, QEMU Developers,
	Wainer dos Santos Moschetta, Stefan Hajnoczi, Cleber Rosa,
	Paolo Bonzini, Alex Bennée

On 10/9/20 12:37 PM, Pavel Dovgalyuk wrote:
> On 08.10.2020 14:50, Kevin Wolf wrote:
>> Am 08.10.2020 um 12:26 hat Philippe Mathieu-Daudé geschrieben:
>>> On 10/7/20 3:11 PM, Pavel Dovgalyuk wrote:
>>>> On 07.10.2020 15:49, Philippe Mathieu-Daudé wrote:
>>>>> On 10/7/20 2:20 PM, Pavel Dovgalyuk wrote:
>>>>>> On 07.10.2020 14:22, Alex Bennée wrote:
>>>>>>>
>>>>>>> Philippe Mathieu-Daudé <philmd@redhat.com> writes:
>>>>>>>
>>>>>>>> On 10/7/20 10:51 AM, Pavel Dovgalyuk wrote:
>>>>>>>>> On 07.10.2020 11:23, Thomas Huth wrote:
>>>>>>>>>> On 07/10/2020 09.13, Philippe Mathieu-Daudé wrote:
>>>>>>>>>> Thanks, that was helpful. ... and the winner is:
>>>>>>>>>>
>>>>>>>>>>        commit   55adb3c45620c31f29978f209e2a44a08d34e2da
>>>>>>>>>>        Author:  John Snow <jsnow@redhat.com>
>>>>>>>>>>        Date:    Fri Jul 24 01:23:00 2020 -0400
>>>>>>>>>>        Subject: ide: cancel pending callbacks on SRST
>>>>>>>>>>
>>>>>>>>>> ... starting with this commit, the tests start failing. John, 
>>>>>>>>>> any
>>>>>>>>>> idea what
>>>>>>>>>> might be causing this?
>>>>>>>>>
>>>>>>>>> This patch includes the following lines:
>>>>>>>>>
>>>>>>>>> +        aio_bh_schedule_oneshot(qemu_get_aio_context(),
>>>>>>>>> +                                
>>>>>>>>> ide_bus_perform_srst, bus);
>>>>>>>>>
>>>>>>>>> replay_bh_schedule_oneshot_event should be used instead of this
>>>>>>>>> function, because it synchronizes non-deterministic BHs.
>>>>>>>>
>>>>>>>> Why do we have 2 different functions? BH are already complex
>>>>>>>> enough, and we need to also think about the replay API...
>>>>>>>>
>>>>>>>> What about the other cases such as vhost-user (blk/net), virtio-blk?
>>>>>>>
>>>>>>> This does seem like something that should be wrapped up inside
>>>>>>> aio_bh_schedule_oneshot itself or maybe we need a
>>>>>>> aio_bh_schedule_transaction_oneshot to distinguish it from the other
>>>>>>> uses the function has.
>>>>>>>
>>>>>>
>>>>>> Maybe there should be two functions:
>>>>>> - one for the guest modification
>>>>>
>>>>> aio_bh_schedule_oneshot_deterministic()?
>>>>>
>>>>>> - one for internal qemu things
>>>>>
>>>>> Not sure why there is a difference, BH are used to
>>>>> avoid delaying the guest, so there always something
>>>>> related to "guest modification".
>>>>
>>>> Not exactly. At least there is one non-related-to-guest case
>>>> in monitor_init_qmp:
>>>>         /*
>>>>          * We can't call qemu_chr_fe_set_handlers() directly here
>>>>          * since chardev might be running in the monitor I/O
>>>>          * thread.  Schedule a bottom half.
>>>>          */
>>>>         aio_bh_schedule_oneshot(iothread_get_aio_context(mon_iothread),
>>>>                                 monitor_qmp_setup_handlers_bh, mon);
>>>
>>> I don't understand the documentation in docs/devel/replay.txt:
>>>
>>> ---
>>> Bottom halves
>>> =============
>>>
>>> Bottom half callbacks, that affect the guest state, should be invoked
>>> through
>>> replay_bh_schedule_event or replay_bh_schedule_oneshot_event functions.
>>> Their invocations are saved in record mode and synchronized with the
>>> existing
>>> log in replay mode.
>>> ---
>>>
>>> But then it is only used in block drivers, which are not
>>> related to guest state:
>>
>> Pavel can tell you the details, but I think the idea was that you need
>> to use this function not when the code calling it modifies guest state,
>> but when the BH implementation can do so.
>>
>> In the case of generic callbacks like provided by the blk_aio_*()
>> functions, we don't know whether this is the case, but it's generally
>> device emulation code, so chances are relatively high that they do.
>>
>> I seem to remember that when reviewing the code that introduced
>> replay_bh_schedule_event(), I was relatively sure that we didn't catch
>> all necessary instances, but since it worked for Pavel and I didn't feel
>> like getting too involved with replay code, we just merged it anyway.
> 
> That's right.
> Block layer does not touch guest by itself.
> But it includes callbacks that may invoke interrupts on finishing disk 
> operations. That is why we synchronize these callbacks with vCPU through 
> the replay layer.

Instead of having to remember to use replay_bh_schedule_event whenever
guest state is modified (or else the code is buggy), what about expecting
replay to be used everywhere, and disabling it only when we know guest
state is not modified?

> 
> Pavel Dovgalyuk
> 




end of thread

Thread overview: 19+ messages
2020-10-06 23:07 acceptance-system-fedora failures John Snow
2020-10-07  5:20 ` Philippe Mathieu-Daudé
2020-10-07  7:13   ` Philippe Mathieu-Daudé
2020-10-07  8:23     ` Thomas Huth
2020-10-07  8:51       ` Pavel Dovgalyuk
2020-10-07  9:57         ` Philippe Mathieu-Daudé
2020-10-07 11:22           ` Alex Bennée
2020-10-07 12:20             ` Pavel Dovgalyuk
2020-10-07 12:49               ` Philippe Mathieu-Daudé
2020-10-07 13:11                 ` Pavel Dovgalyuk
2020-10-08 10:26                   ` Philippe Mathieu-Daudé
2020-10-08 11:50                     ` Kevin Wolf
2020-10-09 10:37                       ` Pavel Dovgalyuk
2020-10-13  8:57                         ` Philippe Mathieu-Daudé
2020-10-07 12:17           ` Pavel Dovgalyuk
2020-10-07 14:03       ` John Snow
2020-10-07  7:23   ` Thomas Huth
2020-10-07  8:19     ` Philippe Mathieu-Daudé
2020-10-07 14:38   ` Cleber Rosa
