Re: [Qemu-devel] [PATCH v5 11/11] block/backup: use backup-top instead of write notifiers - Vladimir Sementsov-Ogievskiy

From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
To: Max Reitz <mreitz@redhat.com>,
	"qemu-block@nongnu.org" <qemu-block@nongnu.org>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>
Cc: "fam@euphon.net" <fam@euphon.net>,
	"stefanha@redhat.com" <stefanha@redhat.com>,
	"jcody@redhat.com" <jcody@redhat.com>,
	"kwolf@redhat.com" <kwolf@redhat.com>,
	Denis Lunev <den@virtuozzo.com>,
	"eblake@redhat.com" <eblake@redhat.com>,
	"jsnow@redhat.com" <jsnow@redhat.com>
Subject: Re: [Qemu-devel] [PATCH v5 11/11] block/backup: use backup-top instead of write notifiers
Date: Mon, 28 Jan 2019 17:14:59 +0000	[thread overview]
Message-ID: <0b5b93d0-4ee9-b1b1-a895-9873a9e6de83@virtuozzo.com> (raw)
In-Reply-To: <0298f3ec-2267-c624-91bb-6136df3b786a@redhat.com>

28.01.2019 19:53, Max Reitz wrote:
> On 28.01.19 17:44, Vladimir Sementsov-Ogievskiy wrote:
>> 28.01.2019 18:59, Max Reitz wrote:
>>> On 28.01.19 12:29, Vladimir Sementsov-Ogievskiy wrote:
>>>> 18.01.2019 17:56, Max Reitz wrote:
>>>>> On 29.12.18 13:20, Vladimir Sementsov-Ogievskiy wrote:
> 
> [...]
> 
>>>>>> @@ -505,8 +474,20 @@ static int coroutine_fn backup_run(Job *job, Error **errp)
>>>>>>                 if (alloced < 0) {
>>>>>>                     ret = alloced;
>>>>>>                 } else {
>>>>>> +                if (!hbitmap_get(s->copy_bitmap, offset)) {
>>>>>> +                    trace_backup_do_cow_skip(job, offset);
>>>>>> +                    continue; /* already copied */
>>>>>> +                }
>>>>>> +                if (!lock) {
>>>>>> +                    lock = bdrv_co_try_lock(s->source, offset, s->cluster_size);
>>>>>> +                    /*
>>>>>> +                     * Dirty bit is set, which means that there are no in-flight
>>>>>> +                     * write requests on this area. We must succeed.
>>>>>> +                     */
>>>>>> +                    assert(lock);
>>>>>
>>>>> What if I have a different parent node for the source that issues
>>>>> concurrent writes?  This used to work fine because the before_write
>>>>> notifier would still work.  After this patch, that would be broken
>>>>> because those writes would not cause a CbW.
>>>>
>>>> But haw could you have this different parent node? After appending filter,
>>>> there should not be such nodes.
>>>
>>> Unless you append them afterwards:
>>>
>>>> And I think, during backup it should be
>>>> forbidden to append new parents to source, ignoring filter, as it definitely
>>>> breaks what filter does.
>>>
>>> Agreed, but then this needs to be implemented.
>>>
>>>> And it applies to other block-job with their filters.
>>>> If we appended a filter, we don't want someone other to write omit our filter.
>>>>
>>>>>
>>>>> That's not so bad because we just have to make sure that all writes go
>>>>> through the backup-top node.  That would make this assertion valid
>>>>> again, too.  But that means we cannot share PERM_WRITE; see [1].
>>>>
>>>> But we don't share PERM_WRITE on source in backup_top, only on target.
>>>
>>> Are you sure?  The job itself shares it, and the filter shares it, too,
>>> as far as I can see.  It uses bdrv_filter_default_perms(), and that does
>>> seem to share PERM_WRITE.
>>
>> And in bdrv_Filter_default_perms it does "*nshared = *nshared | BLK_PERM_WRITE"
>> only for child_file, it is target. Source is child_backing.
> 
> Hm?  bdrv_filter_default_perms() does this, unconditionally:
> 
>>      *nshared = (shared & DEFAULT_PERM_PASSTHROUGH) |
>>                 (c->shared_perm & DEFAULT_PERM_UNCHANGED);
> 
> The backup_top filter does what you describe, but it just leaves
> *nshared untouched after bdrv_filter_default_perms() has done the above.

Oops, yes, I meant my backup_top_child_perm(), not bdrv_filter_default_perms(), which it calls
too. So, OK, will try to fix it. Sorry for that long extra arguing :(

> 
> [...]
> 
>>>>>> @@ -655,25 +656,31 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
>>>>>>     
>>>>>>         copy_bitmap = hbitmap_alloc(len, ctz32(cluster_size));
>>>>>>     
>>>>>> -    /* job->len is fixed, so we can't allow resize */
>>>>>> -    job = block_job_create(job_id, &backup_job_driver, txn, bs,
>>>>>> -                           BLK_PERM_CONSISTENT_READ,
>>>>>> -                           BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE |
>>>>>> -                           BLK_PERM_WRITE_UNCHANGED | BLK_PERM_GRAPH_MOD,
>>>>>> -                           speed, creation_flags, cb, opaque, errp);
>>>>>> -    if (!job) {
>>>>>> +    /*
>>>>>> +     * bdrv_get_device_name will not help to find device name starting from
>>>>>> +     * @bs after backup-top append,
>>>>>
>>>>> Why not?  Since backup-top is appended, shouldn't all parents of @bs be
>>>>> parents of @backup_top then?  (Making bdrv_get_parent_name() return the
>>>>> same result)
>>>>
>>>> bdrv_get_device_name goes finally through role->get_name, and only root role has
>>>> this handler. After append we'll have backing role instead of root.
>>>
>>> Ah, I see, I asked the wrong question.
>>>
>>> Why is block_job_create() called on bs and not on backup_top?  mirror
>>> calls it on mirror_top_bs.
>>
>> Good question. I don't exactly remember, may be there are were more troubles with
>> permissions or somthing. So, I've to try it again..
>>
>> What is more beneficial?
>>
>> My current approach, is that job and filter are two sibling users of source node,
>> they do copying, they are synchronized. And in this way, it is better to read from
>> source directly, to not create extra intersection between job and filter..
>>
>> On the other hand, if we read through the filter, we possible should do the whole
>> copy operation through the filter..
>>
>> What is the difference between guest read and backup-job read, in filter POV? I think:
>>
>> For guest read, filter MUST read (as we must handle guest request), and than, if
>> we don't have too much in-flight requests, ram-cache is not full, etc, we can handle
>> already read data in terms of backup, so, copy it somewhere. Or we can drop it, if
>> we can't handle it at the moment..
>>
>> For job read, we even MAY not read, if our queues are full, postponing job request.
>>
>> So
>>
>> Guest read: MUST read, MAY backup
>> Job read: MAY read and backup
>>
>> So, reading through filter has a possibility of common code path + native prioritization
>> of copy operations. This of course will need more refactoring of backup, and may be done
>> as a separate step, but definitely, I have to at least try to create job above the filter.
> 
> Well, as far as I see it, right now backup_top's read function is just a
> passthrough.  I don't see a functional difference between reading from
> backup_top and source, but the fact that you could save yourself the
> trouble of figuring out the job ID manually.
> 
> As for the RAM cache, I thought it was just a target like any other and
> backup_top wouldn't need to care at all...?

Not sure.. We can have RAM cache, backed by disk cache together with actual backup target
(nbd, for ex.).. And how should they all be finally connected is a question; I think, it
would be simpler to figure it out, when I start implementing it.

I hope that RAM cache should somehow be an alternative to inflight copy requests. RAM cache
is better, as it's data may be clearly reused by guest reads.. Hmm, may be we'll finish with
two jobs, one copying from active disk to RAM cache, and one from RAM cache to target? Or
RAM cache would be inside one backup job?

So again, I'd prefer to return to these questions after backup_top completed, as it's simpler
to discuss something that works.

-- 
Best regards,
Vladimir