QEMU-Devel Archive on lore.kernel.org
 help / color / Atom feed
From: Max Reitz <mreitz@redhat.com>
To: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>,
	qemu-block@nongnu.org
Cc: kwolf@redhat.com, wencongyang2@huawei.com,
	xiechanglong.d@gmail.com, qemu-devel@nongnu.org,
	armbru@redhat.com, den@openvz.org, jsnow@redhat.com
Subject: Re: [PATCH v4 00/23] backup performance: block_status + async
Date: Wed, 20 Jan 2021 15:44:52 +0100
Message-ID: <3db87f48-b628-8000-a46a-6d07cdf1ccc3@redhat.com> (raw)
In-Reply-To: <cfe3b7dd-8a1f-7e49-e576-ebca82ee4d98@redhat.com>

On 20.01.21 15:34, Max Reitz wrote:
> On 20.01.21 14:50, Max Reitz wrote:
>> On 20.01.21 11:39, Max Reitz wrote:
>>> On 19.01.21 20:22, Vladimir Sementsov-Ogievskiy wrote:
>>>> 19.01.2021 21:40, Max Reitz wrote:
>>>>> On 16.01.21 22:46, Vladimir Sementsov-Ogievskiy wrote:
>>>>>> Hi Max!
>>>>>> I applied my series onto yours 129-fixing and found, that 129 
>>>>>> fails for backup.
>>>>>> And setting small max-chunk and even max-workers to 1 doesn't 
>>>>>> help! (setting
>>>>>> speed like in v3 still helps).
>>>>>>
>>>>>> And I found, that the problem is that really, the whole backup job 
>>>>>> goes during
>>>>>> drain, because in new architecture we do just job_yield() during 
>>>>>> the whole
>>>>>> background block-copy.
>>>>>>
>>>>>> This leads to modifying the existing patch in the series, which 
>>>>>> does job_enter()
>>>>>> from job_user_pause: we just need call job_enter() from 
>>>>>> job_pause() to cover
>>>>>> not only user pauses but also drained_begin.
>>>>>>
>>>>>> So, now I don't need any additional fixing of 129.
>>>>>>
>>>>>> Changes in v4:
>>>>>> - add a lot of Max's r-b's, thanks!
>>>>>>
>>>>>> 03: fix over-80 line (in comment), add r-b
>>>>>> 09: was "[PATCH v3 10/25] job: call job_enter from job_user_pause",
>>>>>>      now changed to finally fix 129 iotest, drop r-b
>>>>>>
>>>>>> 10: squash-in additional wording on max-chunk, fix error message, 
>>>>>> keep r-b
>>>>>> 17: drop extra include, assert job_is_cancelled() instead of 
>>>>>> check, add r-b
>>>>>> 18: adjust commit message, add r-b
>>>>>> 23: add comments and assertion, about the fact that test doesn't 
>>>>>> support
>>>>>>      paths with colon inside
>>>>>>      fix s/disable-copy-range/use-copy-range/
>>>>>
>>>>> Hmmm, for me, 129 sometimes fails still, because it completes too 
>>>>> quickly...  (The error then is that 'return[0]' does not exist in 
>>>>> query-block-jobs’s result, because the job is already gone.)
>>>>>
>>>>> When I insert a print(result) after the query-block-jobs, I can see 
>>>>> that the job has always progressed really far, even if its still 
>>>>> running. (Like, generally the offset is just one MB shy of 1G.)
>>>>>
>>>>> I suspect the problem is that block-copy just copies too much from 
>>>>> the start (by default); i.e., it starts 64 workers with, hm, well, 
>>>>> 1 MB of chunk size?  Shouldn’t fill the 128 MB immediately...
>>>>>
>>>>> Anyway, limiting the number of workers (to 1) and the chunk size 
>>>>> (to 64k) with x-perf does ensure that the backup job’s progress is 
>>>>> limited to 1 MB or so, which looks fine to me.
>>>>>
>>>>> I suppose we should do that, then (in 129), before patch 17?
>>>>
>>>> Yes, that sounds reasonable
>>>>
>>>>>
>>>>> (PS: I can also see a MacOS failure in iotest 256.  I suspect it’s 
>>>>> related to this series, because 256 is a backup test (with 
>>>>> iothreads), but I’m not sure yet.  The log is here:
>>>>>
>>>>> https://cirrus-ci.com/task/5276331753603072
>>>>> )
>>>>>
>>>>
>>>> qemu received signal 31 ?
>>>>
>>>> googling for MacOS...
>>>>
>>>>   31    SIGUSR2      terminate process    User defined signal 2
>>>
>>> coroutine-sigaltstack uses SIGUSR2 to set up new coroutines.  Perhaps 
>>> it’s unrelated to backup?  Guess I’ll just run the test one more 
>>> time. O:)
>>
>> I ran it again, got the same error.  There is no error on master, or 
>> before backup uses block_copy.
>>
>> I’m trying to run a test directly on the “move to block-copy” commit, 
>> but so far Cirrus doesn’t seem to want me to do another test run right 
>> now.
>>
>> (Though I’m pretty sure if there is no error before the block-copy 
>> commit, then using block-copy must be the problem.  The remaining 
>> patches in my block branch are just disabling copy_range, some 
>> clean-up, the simplebench patches, the locking code error reporting 
>> change, and a new iotest.)
> 
> I was able to reproduce the signal on Linux, by passing
> --with-coroutine=sigaltstack to configure (sometimes takes like 10 runs 
> to fail, though).  “move to block-copy” is indeed the first failing 
> commiit.
> 
> Letting qemu_coroutine_new() set up an aborting signal handler for 
> SIGUSR2 instead of restoring the default action (i.e. exiting without 
> core dump) yields this back trace:
> 
> Thread 1:
> 
> #0  0x00007f61870ba9d5 in raise () at /lib64/libc.so.6
> #1  0x00007f61870a38a4 in abort () at /lib64/libc.so.6
> #2  0x0000562cd2be7350 in sigusr2_handler (s=<optimized out>) at 
> ../util/coroutine-sigaltstack.c:151
> #3  0x00007f6187cee1e0 in <signal handler called> () at 
> /lib64/libpthread.so.0
> #4  0x00007f61870bad8a in sigsuspend () at /lib64/libc.so.6
> #5  0x0000562cd30d9e48 in qemu_coroutine_new () at 
> ../util/coroutine-sigaltstack.c:220
> #6  0x0000562cd3106206 in qemu_coroutine_create
>      (entry=entry@entry=0x562cd303c950 <block_copy_async_co_entry>, 
> opaque=opaque@entry=0x7f6170002c00)
>      at ../util/qemu-coroutine.c:75
> #7  0x0000562cd303cd0e in block_copy_async
>      (s=0x562cd4f94400, offset=offset@entry=0, bytes=67108864, 
> max_workers=64, max_chunk=0, cb=cb@entry=0x562cd301dfe0 
> <backup_block_copy_callback>, cb_opaque=0x562cd6256940) at 
> ../block/block-copy.c:752
> #8  0x0000562cd301dd4e in backup_loop (job=<optimized out>) at 
> ../block/backup.c:156
> #9  backup_run (job=0x562cd6256940, errp=<optimized out>) at 
> ../block/backup.c:299
> #10 0x0000562cd2fc84d2 in job_co_entry (opaque=0x562cd6256940) at 
> ../job.c:911
> #11 0x0000562cd30da04b in coroutine_bootstrap 
> (self=self@entry=0x562cd6262ef0, co=co@entry=0x562cd6262ef0)
>      at ../util/coroutine-sigaltstack.c:105
> #12 0x0000562cd30da0e1 in coroutine_trampoline (signal=<optimized out>) 
> at ../util/coroutine-sigaltstack.c:146
> #13 0x00007f6187cee1e0 in <signal handler called> () at 
> /lib64/libpthread.so.0
> #14 0x00007f61870bad8a in sigsuspend () at /lib64/libc.so.6
> 
> Thread 2:
> 
> (gdb) bt
> #0  0x00007f6187cee2b6 in __libc_sigaction () at /lib64/libpthread.so.0
> #1  0x0000562cd30d9dc7 in qemu_coroutine_new () at 
> ../util/coroutine-sigaltstack.c:194
> #2  0x0000562cd3106206 in qemu_coroutine_create
>      (entry=entry@entry=0x562cd3016c20 <aio_task_co>, 
> opaque=opaque@entry=0x7f616c0063d0)
>      at ../util/qemu-coroutine.c:75
> #3  0x0000562cd3016dd4 in aio_task_pool_start_task (pool=0x7f616c0067d0, 
> task=0x7f616c0063d0)
>      at ../block/aio_task.c:94
> #4  0x0000562cd303c193 in block_copy_task_run (task=<optimized out>, 
> pool=<optimized out>)
>      at ../block/block-copy.c:330
> #5  block_copy_dirty_clusters (call_state=0x7f616c002c00) at 
> ../block/block-copy.c:646
> #6  block_copy_common (call_state=0x7f616c002c00) at 
> ../block/block-copy.c:696
> #7  0x0000562cd30da04b in coroutine_bootstrap 
> (self=self@entry=0x7f616c003660, co=co@entry=0x7f616c003660)
>      at ../util/coroutine-sigaltstack.c:105
> #8  0x0000562cd30da0e1 in coroutine_trampoline (signal=<optimized out>) 
> at ../util/coroutine-sigaltstack.c:146
> #9  0x00007f6187cee1e0 in <signal handler called> () at 
> /lib64/libpthread.so.0
> #10 0x00007f61870bad8a in sigsuspend () at /lib64/libc.so.6
> #11 0x0000562cd30d9f61 in qemu_coroutine_new () at 
> ../util/coroutine-sigaltstack.c:254
> #12 0x0000562cd61b40d0 in  ()
> #13 0x00007f6163fff930 in  ()
> #14 0x00007f6187cecc5f in annobin_pt_longjmp.c_end () at 
> /lib64/libpthread.so.0
> #15 0x0000562cd30da00d in qemu_coroutine_switch
>      (from_=0x562cd5cc3400, to_=0x7f616c003108, 
> action=action@entry=COROUTINE_YIELD)
>      at ../util/coroutine-sigaltstack.c:284
> #16 0x0000562cd31065f0 in qemu_coroutine_yield () at 
> ../util/qemu-coroutine.c:193
> #17 0x0000562cd2fc751d in job_do_yield (job=0x562cd5cc3400, 
> ns=18446744073709551615) at ../job.c:478
> #18 0x0000562cd2fc855f in job_yield (job=0x562cd5cc3400) at ../job.c:525
> #19 0x0000562cd301dd74 in backup_loop (job=<optimized out>) at 
> ../block/backup.c:164
> #20 backup_run (job=0x562cd5cc3400, errp=<optimized out>) at 
> ../block/backup.c:299
> #21 0x0000562cd2fc84d2 in job_co_entry (opaque=0x562cd5cc3400) at 
> ../job.c:911
> #22 0x0000562cd30da04b in coroutine_bootstrap 
> (self=self@entry=0x562cd61b40d0, co=co@entry=0x562cd61b40d0)
>      at ../util/coroutine-sigaltstack.c:105
> #23 0x0000562cd30da0e1 in coroutine_trampoline (signal=<optimized out>) 
> at ../util/coroutine-sigaltstack.c:146
> #24 0x00007f6187cee1e0 in <signal handler called> () at 
> /lib64/libpthread.so.0
> #25 0x00007f61870bad8a in sigsuspend () at /lib64/libc.so.6
> #26 0x0000000000000001 in  ()
> #27 0x0000000000000000 in  ()
> 
>  From a glance, it looks to me like two coroutines are created 
> simultaneously in two threads, and so one thread sets up a special 
> SIGUSR2 action, then another reverts SIGUSR2 to the default, and then 
> the first one kills itself with SIGUSR2.
> 
> Not sure what this has to do with backup, though it is interesting that 
> backup_loop() runs in two threads.  So perhaps some AioContext problem.

Oh, 256 runs two backups concurrently.  So it isn’t that interesting, 
but perhaps part of the problem still.  (I have no idea, still looking.)

Max



  reply index

Thread overview: 45+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-01-16 21:46 Vladimir Sementsov-Ogievskiy
2021-01-16 21:46 ` [PATCH v4 01/23] qapi: backup: add perf.use-copy-range parameter Vladimir Sementsov-Ogievskiy
2021-01-16 21:46 ` [PATCH v4 02/23] block/block-copy: More explicit call_state Vladimir Sementsov-Ogievskiy
2021-01-16 21:46 ` [PATCH v4 03/23] block/block-copy: implement block_copy_async Vladimir Sementsov-Ogievskiy
2021-01-16 21:46 ` [PATCH v4 04/23] block/block-copy: add max_chunk and max_workers parameters Vladimir Sementsov-Ogievskiy
2021-01-16 21:46 ` [PATCH v4 05/23] block/block-copy: add list of all call-states Vladimir Sementsov-Ogievskiy
2021-01-16 21:46 ` [PATCH v4 06/23] block/block-copy: add ratelimit to block-copy Vladimir Sementsov-Ogievskiy
2021-01-16 21:46 ` [PATCH v4 07/23] block/block-copy: add block_copy_cancel Vladimir Sementsov-Ogievskiy
2021-01-16 21:46 ` [PATCH v4 08/23] blockjob: add set_speed to BlockJobDriver Vladimir Sementsov-Ogievskiy
2021-01-16 21:46 ` [PATCH v4 09/23] job: call job_enter from job_pause Vladimir Sementsov-Ogievskiy
2021-01-18 13:45   ` Max Reitz
2021-01-18 14:20     ` Max Reitz
2021-04-07 11:19   ` Max Reitz
2021-04-07 11:38     ` Vladimir Sementsov-Ogievskiy
2021-04-21  8:31       ` Vladimir Sementsov-Ogievskiy
2021-01-16 21:46 ` [PATCH v4 10/23] qapi: backup: add max-chunk and max-workers to x-perf struct Vladimir Sementsov-Ogievskiy
2021-01-16 21:46 ` [PATCH v4 11/23] iotests: 56: prepare for backup over block-copy Vladimir Sementsov-Ogievskiy
2021-01-16 21:46 ` [PATCH v4 12/23] iotests: 185: " Vladimir Sementsov-Ogievskiy
2021-01-16 21:46 ` [PATCH v4 13/23] iotests: 219: " Vladimir Sementsov-Ogievskiy
2021-01-16 21:46 ` [PATCH v4 14/23] iotests: 257: " Vladimir Sementsov-Ogievskiy
2021-01-16 21:46 ` [PATCH v4 15/23] block/block-copy: make progress_bytes_callback optional Vladimir Sementsov-Ogievskiy
2021-01-16 21:46 ` [PATCH v4 16/23] block/backup: drop extra gotos from backup_run() Vladimir Sementsov-Ogievskiy
2021-01-16 21:46 ` [PATCH v4 17/23] backup: move to block-copy Vladimir Sementsov-Ogievskiy
2021-01-16 21:47 ` [PATCH v4 18/23] qapi: backup: disable copy_range by default Vladimir Sementsov-Ogievskiy
2021-01-16 21:47 ` [PATCH v4 19/23] block/block-copy: drop unused block_copy_set_progress_callback() Vladimir Sementsov-Ogievskiy
2021-01-16 21:47 ` [PATCH v4 20/23] block/block-copy: drop unused argument of block_copy() Vladimir Sementsov-Ogievskiy
2021-01-16 21:47 ` [PATCH v4 21/23] simplebench/bench_block_job: use correct shebang line with python3 Vladimir Sementsov-Ogievskiy
2021-01-16 21:47 ` [PATCH v4 22/23] simplebench: bench_block_job: add cmd_options argument Vladimir Sementsov-Ogievskiy
2021-01-16 21:47 ` [PATCH v4 23/23] simplebench: add bench-backup.py Vladimir Sementsov-Ogievskiy
2021-01-18 14:01   ` Max Reitz
2021-01-18 15:07 ` [PATCH v4 00/23] backup performance: block_status + async Max Reitz
2021-01-18 17:04   ` Vladimir Sementsov-Ogievskiy
2021-01-19 18:40 ` Max Reitz
2021-01-19 19:22   ` Vladimir Sementsov-Ogievskiy
2021-01-19 19:29     ` Vladimir Sementsov-Ogievskiy
2021-01-20 10:39     ` Max Reitz
2021-01-20 13:50       ` Max Reitz
2021-01-20 14:34         ` Max Reitz
2021-01-20 14:44           ` Max Reitz [this message]
2021-01-20 15:53             ` Max Reitz
2021-01-20 16:00               ` Max Reitz
2021-01-20 16:04               ` Daniel P. Berrangé
2021-01-20 16:40                 ` Max Reitz
2021-01-20 10:20 ` [PATCH v4 11.5/23] iotests/129: Limit backup's max-chunk/max-workers Max Reitz
2021-01-20 11:24   ` Vladimir Sementsov-Ogievskiy

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3db87f48-b628-8000-a46a-6d07cdf1ccc3@redhat.com \
    --to=mreitz@redhat.com \
    --cc=armbru@redhat.com \
    --cc=den@openvz.org \
    --cc=jsnow@redhat.com \
    --cc=kwolf@redhat.com \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    --cc=vsementsov@virtuozzo.com \
    --cc=wencongyang2@huawei.com \
    --cc=xiechanglong.d@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

QEMU-Devel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/qemu-devel/0 qemu-devel/git/0.git
	git clone --mirror https://lore.kernel.org/qemu-devel/1 qemu-devel/git/1.git
	git clone --mirror https://lore.kernel.org/qemu-devel/2 qemu-devel/git/2.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 qemu-devel qemu-devel/ https://lore.kernel.org/qemu-devel \
		qemu-devel@nongnu.org
	public-inbox-index qemu-devel

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.nongnu.qemu-devel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git