Re: [Qemu-block] [Qemu-devel] [PATCH] block/backup: install notifier during creation

From: John Snow <jsnow@redhat.com>
To: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>,
	Stefan Hajnoczi <stefanha@gmail.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Kevin Wolf <kwolf@redhat.com>
Cc: Max Reitz <mreitz@redhat.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	"qemu-block@nongnu.org" <qemu-block@nongnu.org>,
	"qemu-stable@nongnu.org" <qemu-stable@nongnu.org>
Subject: Re: [Qemu-block] [Qemu-devel] [PATCH] block/backup: install notifier during creation
Date: Thu, 19 Sep 2019 15:11:13 -0400	[thread overview]
Message-ID: <0453fab7-6b2b-6caa-86d1-8db64b1cc9da@redhat.com> (raw)
In-Reply-To: <6e3c1b53-c104-2b05-418e-d44f45a82be8@virtuozzo.com>

On 9/19/19 3:11 AM, Vladimir Sementsov-Ogievskiy wrote:
> 18.09.2019 23:31, John Snow wrote:
>>
>>
>> On 9/10/19 9:23 AM, John Snow wrote:
>>>
>>>
>>> On 9/10/19 4:19 AM, Stefan Hajnoczi wrote:
>>>> On Wed, Aug 21, 2019 at 04:01:52PM -0400, John Snow wrote:
>>>>>
>>>>>
>>>>> On 8/21/19 10:41 AM, Vladimir Sementsov-Ogievskiy wrote:
>>>>>> 09.08.2019 23:13, John Snow wrote:
>>>>>>> Backup jobs may yield prior to installing their handler, because of the
>>>>>>> job_co_entry shim which guarantees that a job won't begin work until
>>>>>>> we are ready to start an entire transaction.
>>>>>>>
>>>>>>> Unfortunately, this makes proving correctness about transactional
>>>>>>> points-in-time for backup hard to reason about. Make it explicitly clear
>>>>>>> by moving the handler registration to creation time, and changing the
>>>>>>> write notifier to a no-op until the job is started.
>>>>>>>
>>>>>>> Reported-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>>>>> Signed-off-by: John Snow <jsnow@redhat.com>
>>>>>>> ---
>>>>>>>    block/backup.c     | 32 +++++++++++++++++++++++---------
>>>>>>>    include/qemu/job.h |  5 +++++
>>>>>>>    job.c              |  2 +-
>>>>>>>    3 files changed, 29 insertions(+), 10 deletions(-)
>>>>>>>
>>>>>>> diff --git a/block/backup.c b/block/backup.c
>>>>>>> index 07d751aea4..4df5b95415 100644
>>>>>>> --- a/block/backup.c
>>>>>>> +++ b/block/backup.c
>>>>>>> @@ -344,6 +344,13 @@ static int coroutine_fn backup_before_write_notify(
>>>>>>>        assert(QEMU_IS_ALIGNED(req->offset, BDRV_SECTOR_SIZE));
>>>>>>>        assert(QEMU_IS_ALIGNED(req->bytes, BDRV_SECTOR_SIZE));
>>>>>>>    
>>>>>>> +    /* The handler is installed at creation time; the actual point-in-time
>>>>>>> +     * starts at job_start(). Transactions guarantee those two points are
>>>>>>> +     * the same point in time. */
>>>>>>> +    if (!job_started(&job->common.job)) {
>>>>>>> +        return 0;
>>>>>>> +    }
>>>>>>
>>>>>> Hmm, sorry if it is a stupid question, I'm not good in multiprocessing and in
>>>>>> Qemu iothreads..
>>>>>>
>>>>>> job_started just reads job->co. If bs runs in iothread, and therefore write-notifier
>>>>>> is in iothread, when job_start is called from main thread.. Is it guaranteed that
>>>>>> write-notifier will see job->co variable change early enough to not miss guest write?
>>>>>> Should not job->co be volatile for example or something like this?
>>>>>>
>>>>>> If not think about this patch looks good for me.
>>>>>>
>>>>>
>>>>> You know, it's a really good question.
>>>>> So good, in fact, that I have no idea.
>>>>>
>>>>> ¯\_(ツ)_/¯
>>>>>
>>>>> I'm fairly certain that IO will not come in until the .clean phase of a
>>>>> qmp_transaction, because bdrv_drained_begin(bs) is called during
>>>>> .prepare, and we activate the handler (by starting the job) in .commit.
>>>>> We do not end the drained section until .clean.
>>>>>
>>>>> I'm not fully clear on what threading guarantees we have otherwise,
>>>>> though; is it possible that "Thread A" would somehow lift the bdrv_drain
>>>>> on an IO thread ("Thread B") and, after that, "Thread B" would somehow
>>>>> still be able to see an outdated version of job->co that was set by
>>>>> "Thread A"?
>>>>>
>>>>> I doubt it; but I can't prove it.
>>>>
>>>> In the qmp_backup() case (not qmp_transaction()) there is:
>>>>
>>>>    void qmp_drive_backup(DriveBackup *arg, Error **errp)
>>>>    {
>>>>
>>>>        BlockJob *job;
>>>>        job = do_drive_backup(arg, NULL, errp);
>>>>        if (job) {
>>>>            job_start(&job->job);
>>>>        }
>>>>    }
>>>>
>>>> job_start() is called without any thread synchronization, which is
>>>> usually fine because the coroutine doesn't run until job_start() calls
>>>> aio_co_enter().
>>>>
>>>> Now that the before write notifier has been installed early, there is
>>>> indeed a race between job_start() and the write notifier accessing
>>>> job->co from an IOThread.
>>>>
>>>> The write before notifier might see job->co != NULL before job_start()
>>>> has finished.  This could lead to issues if job_*() APIs are invoked by
>>>> the write notifier and access an in-between job state.
>>>>
>>>
>>> I see. I think in this case, as long as it sees != NULL, that the
>>> notifier is actually safe to run. I agree that this might be confusing
>>> to verify and could bite us in the future. The worry we had, too, is
>>> more the opposite: will it see NULL for too long? We want to make sure
>>> that it is registering as true *before the first yield*.
>>>
>>>> A safer approach is to set a BackupBlockJob variable at the beginning of
>>>> backup_run() and check it from the before write notifier.
>>>>
>>>
>>> That's too late, for reasons below.
>>>
>>>> That said, I don't understand the benefit of this patch and IMO it makes
>>>> the code harder to understand because now we need to think about the
>>>> created but not started state too.
>>>>
>>>> Stefan
>>>>
>>>
>>> It's always possible I've hyped myself up into believing there's a
>>> problem where there isn't one, but the fear is this:
>>>
>>> The point in time from a QMP transaction covers the job creation and the
>>> job start, but when we start the job it will actually yield before we
>>> get to backup_run -- and there is no guarantee that the handler will get
>>> installed synchronously, so the point in time ends before the handler
>>> activates.
>>>
>>
>> i.e., the handler might get installed AFTER the critical region of a
>> transaction. We could drop initial writes if we were unlucky.
>>
>> (I think.)
>>
>>> The yield occurs in job_co_entry as an intentional feature of forcing a
>>> yield and pause point at run time -- so it's harder to write a job that
>>> accidentally hogs the thread during initialization.
>>>
>>> This is an attempt to get the handler installed earlier to ensure the
>>> point of time stays synchronized with creation time to provide a
>>> stronger transactional guarantee.
>>>
>>
>> Squeaky wheel gets the grease. Any comment?
>>
> 
> Hmm, this all becomes difficult, I'd prefer to not worry and wait for backup-top
> filter applied.
> 

If it goes into 4.2, then OK, but I'd still like to understand what's
going on here, actually.

--js