* [Qemu-devel] [question] is it possible that big-endian l1 table offset referenced by other I/O while updating l1 table offset in qcow2_update_snapshot_refcount?
@ 2014-10-09 11:17 Zhang Haoyu
  2014-10-09 14:58 ` Eric Blake
  2014-10-10  1:54 ` [Qemu-devel] [question] is it possible " Zhang Haoyu
  0 siblings, 2 replies; 11+ messages in thread
From: Zhang Haoyu @ 2014-10-09 11:17 UTC (permalink / raw)
  To: qemu-devel

Hi,
I have encountered a problem: after deleting a snapshot, the qcow2 image size displayed by the ls command is much larger than it should be,
but the virtual disk size reported by qemu-img info is okay.
I suspect that while the l1 table offset is being updated, some other I/O job references the big-endian l1 table offset (a very large value), so the file is truncated to a very large size.
Any ideas?

Thanks,
Zhang Haoyu


* Re: [Qemu-devel] [question] is it possible that big-endian l1 table offset referenced by other I/O while updating l1 table offset in qcow2_update_snapshot_refcount?
  2014-10-09 11:17 [Qemu-devel] [question] is it possible that big-endian l1 table offset referenced by other I/O while updating l1 table offset in qcow2_update_snapshot_refcount? Zhang Haoyu
@ 2014-10-09 14:58 ` Eric Blake
  2014-10-10  1:54 ` [Qemu-devel] [question] is it possible " Zhang Haoyu
  1 sibling, 0 replies; 11+ messages in thread
From: Eric Blake @ 2014-10-09 14:58 UTC (permalink / raw)
  To: Zhang Haoyu, qemu-devel


On 10/09/2014 05:17 AM, Zhang Haoyu wrote:
> Hi,
> I have encountered a problem: after deleting a snapshot, the qcow2 image size displayed by the ls command is much larger than it should be,
> but the virtual disk size reported by qemu-img info is okay.
> I suspect that while the l1 table offset is being updated, some other I/O job references the big-endian l1 table offset (a very large value), so the file is truncated to a very large size.

Not quite.  Rather, all the data that the snapshot used to occupy is
still consuming holes in the file; the maximum offset of the file is
still unchanged, even if the file is no longer using as many referenced
clusters.  Recent changes have gone in to sparsify the file when
possible (punching holes if your kernel and file system are new enough to
support that), so that it is not consuming the amount of disk space that
a mere ls reports.  But if what you are asking for is a way to compact
the file back down, then you'll need to submit a patch.  The idea of
having an online defragmenter for qcow2 files has been kicked around
before, but it is complex enough that no one has attempted a patch yet.
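
As a quick way to see the difference, here is a minimal sketch (plain POSIX,
nothing qemu-specific; pass the image path as its only argument) comparing a
file's apparent size with the space actually allocated to it:

#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    struct stat st;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }
    if (stat(argv[1], &st) < 0) {
        perror("stat");
        return 1;
    }
    /* st_blocks is counted in 512-byte units, independent of the file
     * system's block size */
    printf("apparent size: %lld bytes\n", (long long)st.st_size);
    printf("allocated:     %lld bytes\n", (long long)st.st_blocks * 512LL);
    return 0;
}

For a hole-punched image, "allocated" can be much smaller than "apparent
size"; ls -l reports the latter, du reports the former.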

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org




* Re: [Qemu-devel] [question] is it possible that big-endian l1 table offset referenced by other I/O while updating l1 table offset in qcow2_update_snapshot_refcount?
  2014-10-09 11:17 [Qemu-devel] [question] is it possible that big-endian l1 table offset referenced by other I/O while updating l1 table offset in qcow2_update_snapshot_refcount? Zhang Haoyu
  2014-10-09 14:58 ` Eric Blake
@ 2014-10-10  1:54 ` Zhang Haoyu
  2014-10-12 13:23   ` Max Reitz
  1 sibling, 1 reply; 11+ messages in thread
From: Zhang Haoyu @ 2014-10-10  1:54 UTC (permalink / raw)
  To: Eric Blake, qemu-devel

>> Hi,
>> I have encountered a problem: after deleting a snapshot, the qcow2 image size displayed by the ls command is much larger than it should be,
>> but the virtual disk size reported by qemu-img info is okay.
>> I suspect that while the l1 table offset is being updated, some other I/O job references the big-endian l1 table offset (a very large value),
>> so the file is truncated to a very large size.
>
>Not quite.  Rather, all the data that the snapshot used to occupy is
>still consuming holes in the file; the maximum offset of the file is
>still unchanged, even if the file is no longer using as many referenced
>clusters.  Recent changes have gone in to sparsify the file when
>possible (punching holes if your kernel and file system are new enough to
>support that), so that it is not consuming the amount of disk space that
>a mere ls reports.  But if what you are asking for is a way to compact
>the file back down, then you'll need to submit a patch.  The idea of
>having an online defragmenter for qcow2 files has been kicked around
>before, but it is complex enough that no one has attempted a patch yet.

Sorry, I didn't state the problem clearly.
In qcow2_update_snapshot_refcount(), in the code below,
    /* Update L1 only if it isn't deleted anyway (addend = -1) */
    if (ret == 0 && addend >= 0 && l1_modified) {
        for (i = 0; i < l1_size; i++) {
            cpu_to_be64s(&l1_table[i]);
        }

        ret = bdrv_pwrite_sync(bs->file, l1_table_offset, l1_table, l1_size2);

        for (i = 0; i < l1_size; i++) {
            be64_to_cpus(&l1_table[i]);
        }
    }
between cpu_to_be64s(&l1_table[i]); and be64_to_cpus(&l1_table[i]);, 
is it possible that some other I/O references this interim l1 table, whose entries contain the be64 l2 table offsets?
A be64 l2 table offset may be a very large value, hundreds of TB is possible,
and then the qcow2 file will be truncated to far larger than its normal size.
So we see a huge qcow2 file size with ls -hl, while the size displayed by qemu-img info is still normal.
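
To get a feel for the magnitude, here is a small sketch (the offset value is
hypothetical; __builtin_bswap64 is a GCC/Clang builtin standing in for what
cpu_to_be64 does on a little-endian host):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t l2_offset = 10ULL << 20;   /* 10 MiB, a plausible l2 table offset */

    /* cpu_to_be64 on a little-endian host is a byte swap; a reader that
     * still expects CPU byte order sees the swapped value. */
    uint64_t as_seen = __builtin_bswap64(l2_offset);

    printf("real offset: 0x%llx (%llu MiB)\n",
           (unsigned long long)l2_offset,
           (unsigned long long)(l2_offset >> 20));
    printf("misread as:  0x%llx (~%llu TiB)\n",
           (unsigned long long)as_seen,
           (unsigned long long)(as_seen >> 40));
    return 0;
}

A 10 MiB offset misread this way comes out at about 160 TiB, which matches
the huge file sizes observed.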

If the possibility mentioned above exists, the rough code below may fix it,
    if (ret == 0 && addend >= 0 && l1_modified) {
        uint64_t *tmp_l1_table = g_malloc0(l1_size * sizeof(uint64_t));
        memcpy(tmp_l1_table, l1_table, l1_size * sizeof(uint64_t));
        for (i = 0; i < l1_size; i++) {
            cpu_to_be64s(&tmp_l1_table[i]);
        }
        ret = bdrv_pwrite_sync(bs->file, l1_table_offset, tmp_l1_table, l1_size2);

        g_free(tmp_l1_table);
    }

Thanks,
Zhang Haoyu


* Re: [Qemu-devel] [question] is it possible that big-endian l1 table offset referenced by other I/O while updating l1 table offset in qcow2_update_snapshot_refcount?
  2014-10-10  1:54 ` [Qemu-devel] [question] is it possible " Zhang Haoyu
@ 2014-10-12 13:23   ` Max Reitz
  2014-10-13  3:17     ` [Qemu-devel] [question] is it possible that big-endian l1 table offset " Zhang Haoyu
  0 siblings, 1 reply; 11+ messages in thread
From: Max Reitz @ 2014-10-12 13:23 UTC (permalink / raw)
  To: Zhang Haoyu, Eric Blake, qemu-devel

On 10.10.2014 at 03:54, Zhang Haoyu wrote:
>>> Hi,
>>> I have encountered a problem: after deleting a snapshot, the qcow2 image size displayed by the ls command is much larger than it should be,
>>> but the virtual disk size reported by qemu-img info is okay.
>>> I suspect that while the l1 table offset is being updated, some other I/O job references the big-endian l1 table offset (a very large value),
>>> so the file is truncated to a very large size.
>> Not quite.  Rather, all the data that the snapshot used to occupy is
>> still consuming holes in the file; the maximum offset of the file is
>> still unchanged, even if the file is no longer using as many referenced
>> clusters.  Recent changes have gone in to sparsify the file when
>> possible (punching holes if your kernel and file system are new enough to
>> support that), so that it is not consuming the amount of disk space that
>> a mere ls reports.  But if what you are asking for is a way to compact
>> the file back down, then you'll need to submit a patch.  The idea of
>> having an online defragmenter for qcow2 files has been kicked around
>> before, but it is complex enough that no one has attempted a patch yet.
> Sorry, I didn't state the problem clearly.
> In qcow2_update_snapshot_refcount(), in the code below,
>      /* Update L1 only if it isn't deleted anyway (addend = -1) */
>      if (ret == 0 && addend >= 0 && l1_modified) {
>          for (i = 0; i < l1_size; i++) {
>              cpu_to_be64s(&l1_table[i]);
>          }
>
>          ret = bdrv_pwrite_sync(bs->file, l1_table_offset, l1_table, l1_size2);
>
>          for (i = 0; i < l1_size; i++) {
>              be64_to_cpus(&l1_table[i]);
>          }
>      }
> between cpu_to_be64s(&l1_table[i]); and be64_to_cpus(&l1_table[i]);,
> is it possible that some other I/O references this interim l1 table, whose entries contain the be64 l2 table offsets?
> A be64 l2 table offset may be a very large value, hundreds of TB is possible,
> and then the qcow2 file will be truncated to far larger than its normal size.
> So we see a huge qcow2 file size with ls -hl, while the size displayed by qemu-img info is still normal.
>
> If the possibility mentioned above exists, the rough code below may fix it,
>     if (ret == 0 && addend >= 0 && l1_modified) {
>         uint64_t *tmp_l1_table = g_malloc0(l1_size * sizeof(uint64_t));
>         memcpy(tmp_l1_table, l1_table, l1_size * sizeof(uint64_t));
>         for (i = 0; i < l1_size; i++) {
>             cpu_to_be64s(&tmp_l1_table[i]);
>         }
>         ret = bdrv_pwrite_sync(bs->file, l1_table_offset, tmp_l1_table, l1_size2);
>
>         g_free(tmp_l1_table);
>     }

l1_table is already a local variable (local to 
qcow2_update_snapshot_refcount()), so I can't really imagine how 
introducing another local buffer should mitigate the problem, if there 
is any.

Max


* Re: [Qemu-devel] [question] is it possible that big-endian l1 table offset referenced by other I/O while updating l1 table offset in qcow2_update_snapshot_refcount?
  2014-10-12 13:23   ` Max Reitz
@ 2014-10-13  3:17     ` Zhang Haoyu
  2014-10-13  6:40       ` Max Reitz
  0 siblings, 1 reply; 11+ messages in thread
From: Zhang Haoyu @ 2014-10-13  3:17 UTC (permalink / raw)
  To: Max Reitz, Eric Blake, qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi

>>>> Hi,
>>>> I have encountered a problem: after deleting a snapshot, the qcow2 image size displayed by the ls command is much larger than it should be,
>>>> but the virtual disk size reported by qemu-img info is okay.
>>>> I suspect that while the l1 table offset is being updated, some other I/O job references the big-endian l1 table offset (a very large value),
>>>> so the file is truncated to a very large size.
>>> Not quite.  Rather, all the data that the snapshot used to occupy is
>>> still consuming holes in the file; the maximum offset of the file is
>>> still unchanged, even if the file is no longer using as many referenced
>>> clusters.  Recent changes have gone in to sparsify the file when
>>> possible (punching holes if your kernel and file system are new enough to
>>> support that), so that it is not consuming the amount of disk space that
>>> a mere ls reports.  But if what you are asking for is a way to compact
>>> the file back down, then you'll need to submit a patch.  The idea of
>>> having an online defragmenter for qcow2 files has been kicked around
>>> before, but it is complex enough that no one has attempted a patch yet.
>> Sorry, I didn't state the problem clearly.
>> In qcow2_update_snapshot_refcount(), in the code below,
>>      /* Update L1 only if it isn't deleted anyway (addend = -1) */
>>      if (ret == 0 && addend >= 0 && l1_modified) {
>>          for (i = 0; i < l1_size; i++) {
>>              cpu_to_be64s(&l1_table[i]);
>>          }
>>
>>          ret = bdrv_pwrite_sync(bs->file, l1_table_offset, l1_table, l1_size2);
>>
>>          for (i = 0; i < l1_size; i++) {
>>              be64_to_cpus(&l1_table[i]);
>>          }
>>      }
>> between cpu_to_be64s(&l1_table[i]); and be64_to_cpus(&l1_table[i]);,
>> is it possible that some other I/O references this interim l1 table, whose entries contain the be64 l2 table offsets?
>> A be64 l2 table offset may be a very large value, hundreds of TB is possible,
>> and then the qcow2 file will be truncated to far larger than its normal size.
>> So we see a huge qcow2 file size with ls -hl, while the size displayed by qemu-img info is still normal.
>>
>> If the possibility mentioned above exists, the rough code below may fix it,
>>     if (ret == 0 && addend >= 0 && l1_modified) {
>>         uint64_t *tmp_l1_table = g_malloc0(l1_size * sizeof(uint64_t));
>>         memcpy(tmp_l1_table, l1_table, l1_size * sizeof(uint64_t));
>>         for (i = 0; i < l1_size; i++) {
>>             cpu_to_be64s(&tmp_l1_table[i]);
>>         }
>>         ret = bdrv_pwrite_sync(bs->file, l1_table_offset, tmp_l1_table, l1_size2);
>>
>>         g_free(tmp_l1_table);
>>     }
>
>l1_table is already a local variable (local to 
>qcow2_update_snapshot_refcount()), so I can't really imagine how 
>introducing another local buffer should mitigate the problem, if there 
>is any.
>
l1_table is not necessarily local to qcow2_update_snapshot_refcount();
that depends on the condition "if (l1_table_offset != s->l1_table_offset)".
If the condition is not true, then l1_table = s->l1_table.
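
For reference, the logic near the top of the function looks roughly like this
(a paraphrased sketch, not a verbatim copy of the source):

    /* Read the target l1 table from disk only if it is not the active one;
     * otherwise l1_table aliases the table cached in BDRVQcowState. */
    if (l1_table_offset != s->l1_table_offset) {
        l1_table = g_malloc0(align_offset(l1_size2, 512));
        l1_allocated = 1;
        ret = bdrv_pread(bs->file, l1_table_offset, l1_table, l1_size2);
        if (ret < 0) {
            goto fail;
        }
        for (i = 0; i < l1_size; i++) {
            be64_to_cpus(&l1_table[i]);
        }
    } else {
        l1_table = s->l1_table;   /* alias of the in-use cached l1 table */
        l1_allocated = 0;
    }

So when the active l1 table is the one being updated, the in-place
cpu_to_be64s() conversion temporarily byte-swaps the cache itself.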

Thanks,
Zhang Haoyu
>Max


* Re: [Qemu-devel] [question] is it possible that big-endian l1 table offset referenced by other I/O while updating l1 table offset in qcow2_update_snapshot_refcount?
  2014-10-13  3:17     ` [Qemu-devel] [question] is it possible that big-endian l1 table offset " Zhang Haoyu
@ 2014-10-13  6:40       ` Max Reitz
  2014-10-13  7:13         ` [Qemu-devel] [question] is it possible that big-endian l1 table offset referenced " Zhang Haoyu
  0 siblings, 1 reply; 11+ messages in thread
From: Max Reitz @ 2014-10-13  6:40 UTC (permalink / raw)
  To: Zhang Haoyu, Eric Blake, qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi

On 13.10.2014 at 05:17, Zhang Haoyu wrote:
>>>>> Hi,
>>>>> I have encountered a problem: after deleting a snapshot, the qcow2 image size displayed by the ls command is much larger than it should be,
>>>>> but the virtual disk size reported by qemu-img info is okay.
>>>>> I suspect that while the l1 table offset is being updated, some other I/O job references the big-endian l1 table offset (a very large value),
>>>>> so the file is truncated to a very large size.
>>>> Not quite.  Rather, all the data that the snapshot used to occupy is
>>>> still consuming holes in the file; the maximum offset of the file is
>>>> still unchanged, even if the file is no longer using as many referenced
>>>> clusters.  Recent changes have gone in to sparsify the file when
>>>> possible (punching holes if your kernel and file system are new enough to
>>>> support that), so that it is not consuming the amount of disk space that
>>>> a mere ls reports.  But if what you are asking for is a way to compact
>>>> the file back down, then you'll need to submit a patch.  The idea of
>>>> having an online defragmenter for qcow2 files has been kicked around
>>>> before, but it is complex enough that no one has attempted a patch yet.
>>> Sorry, I didn't state the problem clearly.
>>> In qcow2_update_snapshot_refcount(), in the code below,
>>>       /* Update L1 only if it isn't deleted anyway (addend = -1) */
>>>       if (ret == 0 && addend >= 0 && l1_modified) {
>>>           for (i = 0; i < l1_size; i++) {
>>>               cpu_to_be64s(&l1_table[i]);
>>>           }
>>>
>>>           ret = bdrv_pwrite_sync(bs->file, l1_table_offset, l1_table, l1_size2);
>>>
>>>           for (i = 0; i < l1_size; i++) {
>>>               be64_to_cpus(&l1_table[i]);
>>>           }
>>>       }
>>> between cpu_to_be64s(&l1_table[i]); and be64_to_cpus(&l1_table[i]);,
>>> is it possible that some other I/O references this interim l1 table, whose entries contain the be64 l2 table offsets?
>>> A be64 l2 table offset may be a very large value, hundreds of TB is possible,
>>> and then the qcow2 file will be truncated to far larger than its normal size.
>>> So we see a huge qcow2 file size with ls -hl, while the size displayed by qemu-img info is still normal.
>>>
>>> If the possibility mentioned above exists, the rough code below may fix it,
>>>     if (ret == 0 && addend >= 0 && l1_modified) {
>>>         uint64_t *tmp_l1_table = g_malloc0(l1_size * sizeof(uint64_t));
>>>         memcpy(tmp_l1_table, l1_table, l1_size * sizeof(uint64_t));
>>>         for (i = 0; i < l1_size; i++) {
>>>             cpu_to_be64s(&tmp_l1_table[i]);
>>>         }
>>>         ret = bdrv_pwrite_sync(bs->file, l1_table_offset, tmp_l1_table, l1_size2);
>>>
>>>         g_free(tmp_l1_table);
>>>     }
>> l1_table is already a local variable (local to
>> qcow2_update_snapshot_refcount()), so I can't really imagine how
>> introducing another local buffer should mitigate the problem, if there
>> is any.
>>
> l1_table is not necessarily local to qcow2_update_snapshot_refcount();
> that depends on the condition "if (l1_table_offset != s->l1_table_offset)".
> If the condition is not true, then l1_table = s->l1_table.

Oh, yes, you're right. Okay, so in theory nothing should happen anyway, 
because qcow2 does not have to be reentrant (so s->l1_table will not be 
accessed while it's big endian and therefore possibly not in CPU order). 
But I find it rather ugly to convert the cached L1 table to big endian, 
so I'd be fine with the patch you proposed.

Max


* Re: [Qemu-devel] [question] is it possible that big-endian l1 table offset referenced by other I/O while updating l1 table offset in qcow2_update_snapshot_refcount?
  2014-10-13  6:40       ` Max Reitz
@ 2014-10-13  7:13         ` Zhang Haoyu
  2014-10-13  8:02           ` Max Reitz
  0 siblings, 1 reply; 11+ messages in thread
From: Zhang Haoyu @ 2014-10-13  7:13 UTC (permalink / raw)
  To: Max Reitz, Eric Blake, qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi

>>>>>> Hi,
>>>>>> I have encountered a problem: after deleting a snapshot, the qcow2 image size displayed by the ls command is much larger than it should be,
>>>>>> but the virtual disk size reported by qemu-img info is okay.
>>>>>> I suspect that while the l1 table offset is being updated, some other I/O job references the big-endian l1 table offset (a very large value),
>>>>>> so the file is truncated to a very large size.
>>>>> Not quite.  Rather, all the data that the snapshot used to occupy is
>>>>> still consuming holes in the file; the maximum offset of the file is
>>>>> still unchanged, even if the file is no longer using as many referenced
>>>>> clusters.  Recent changes have gone in to sparsify the file when
>>>>> possible (punching holes if your kernel and file system are new enough to
>>>>> support that), so that it is not consuming the amount of disk space that
>>>>> a mere ls reports.  But if what you are asking for is a way to compact
>>>>> the file back down, then you'll need to submit a patch.  The idea of
>>>>> having an online defragmenter for qcow2 files has been kicked around
>>>>> before, but it is complex enough that no one has attempted a patch yet.
>>>> Sorry, I didn't state the problem clearly.
>>>> In qcow2_update_snapshot_refcount(), in the code below,
>>>>       /* Update L1 only if it isn't deleted anyway (addend = -1) */
>>>>       if (ret == 0 && addend >= 0 && l1_modified) {
>>>>           for (i = 0; i < l1_size; i++) {
>>>>               cpu_to_be64s(&l1_table[i]);
>>>>           }
>>>>
>>>>           ret = bdrv_pwrite_sync(bs->file, l1_table_offset, l1_table, l1_size2);
>>>>
>>>>           for (i = 0; i < l1_size; i++) {
>>>>               be64_to_cpus(&l1_table[i]);
>>>>           }
>>>>       }
>>>> between cpu_to_be64s(&l1_table[i]); and be64_to_cpus(&l1_table[i]);,
>>>> is it possible that some other I/O references this interim l1 table, whose entries contain the be64 l2 table offsets?
>>>> A be64 l2 table offset may be a very large value, hundreds of TB is possible,
>>>> and then the qcow2 file will be truncated to far larger than its normal size.
>>>> So we see a huge qcow2 file size with ls -hl, while the size displayed by qemu-img info is still normal.
>>>>
>>>> If the possibility mentioned above exists, the rough code below may fix it,
>>>>     if (ret == 0 && addend >= 0 && l1_modified) {
>>>>         uint64_t *tmp_l1_table = g_malloc0(l1_size * sizeof(uint64_t));
>>>>         memcpy(tmp_l1_table, l1_table, l1_size * sizeof(uint64_t));
>>>>         for (i = 0; i < l1_size; i++) {
>>>>             cpu_to_be64s(&tmp_l1_table[i]);
>>>>         }
>>>>         ret = bdrv_pwrite_sync(bs->file, l1_table_offset, tmp_l1_table, l1_size2);
>>>>
>>>>         g_free(tmp_l1_table);
>>>>     }
>>> l1_table is already a local variable (local to
>>> qcow2_update_snapshot_refcount()), so I can't really imagine how
>>> introducing another local buffer should mitigate the problem, if there
>>> is any.
>>>
>> l1_table is not necessarily local to qcow2_update_snapshot_refcount();
>> that depends on the condition "if (l1_table_offset != s->l1_table_offset)".
>> If the condition is not true, then l1_table = s->l1_table.
>
>Oh, yes, you're right. Okay, so in theory nothing should happen anyway, 
>because qcow2 does not have to be reentrant (so s->l1_table will not be 
>accessed while it's big endian and therefore possibly not in CPU order). 
Could you explain in more detail why qcow2 does not have to be reentrant?
In the stack below,
qcow2_update_snapshot_refcount
|- cpu_to_be64s(&l1_table[i])
|- bdrv_pwrite_sync
|-- bdrv_pwrite
|--- bdrv_pwritev
|---- bdrv_prwv_co
|----- aio_poll(aio_context) <== this aio_context is qemu_aio_context
|------ aio_dispatch
|------- bdrv_co_io_em_complete
|-------- qemu_coroutine_enter(co->coroutine, NULL); <== coroutine entry is bdrv_co_do_rw
bdrv_co_do_rw will access l1_table to perform the I/O operation.

Thanks,
Zhang Haoyu
>But I find it rather ugly to convert the cached L1 table to big endian, 
>so I'd be fine with the patch you proposed.
>
>Max


* Re: [Qemu-devel] [question] is it possible that big-endian l1 table offset referenced by other I/O while updating l1 table offset in qcow2_update_snapshot_refcount?
  2014-10-13  7:13         ` [Qemu-devel] [question] is it possible that big-endian l1 table offset referenced " Zhang Haoyu
@ 2014-10-13  8:02           ` Max Reitz
  2014-10-13  8:19             ` [Qemu-devel] [question] is it possible that big-endian l1 table offset referenced by " Zhang Haoyu
  0 siblings, 1 reply; 11+ messages in thread
From: Max Reitz @ 2014-10-13  8:02 UTC (permalink / raw)
  To: Zhang Haoyu, Eric Blake, qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi

On 13.10.2014 at 09:13, Zhang Haoyu wrote:
>>>>>>> Hi,
>>>>>>> I have encountered a problem: after deleting a snapshot, the qcow2 image size displayed by the ls command is much larger than it should be,
>>>>>>> but the virtual disk size reported by qemu-img info is okay.
>>>>>>> I suspect that while the l1 table offset is being updated, some other I/O job references the big-endian l1 table offset (a very large value),
>>>>>>> so the file is truncated to a very large size.
>>>>>> Not quite.  Rather, all the data that the snapshot used to occupy is
>>>>>> still consuming holes in the file; the maximum offset of the file is
>>>>>> still unchanged, even if the file is no longer using as many referenced
>>>>>> clusters.  Recent changes have gone in to sparsify the file when
>>>>>> possible (punching holes if your kernel and file system are new enough to
>>>>>> support that), so that it is not consuming the amount of disk space that
>>>>>> a mere ls reports.  But if what you are asking for is a way to compact
>>>>>> the file back down, then you'll need to submit a patch.  The idea of
>>>>>> having an online defragmenter for qcow2 files has been kicked around
>>>>>> before, but it is complex enough that no one has attempted a patch yet.
>>>>> Sorry, I didn't state the problem clearly.
>>>>> In qcow2_update_snapshot_refcount(), in the code below,
>>>>>        /* Update L1 only if it isn't deleted anyway (addend = -1) */
>>>>>        if (ret == 0 && addend >= 0 && l1_modified) {
>>>>>            for (i = 0; i < l1_size; i++) {
>>>>>                cpu_to_be64s(&l1_table[i]);
>>>>>            }
>>>>>
>>>>>            ret = bdrv_pwrite_sync(bs->file, l1_table_offset, l1_table, l1_size2);
>>>>>
>>>>>            for (i = 0; i < l1_size; i++) {
>>>>>                be64_to_cpus(&l1_table[i]);
>>>>>            }
>>>>>        }
>>>>> between cpu_to_be64s(&l1_table[i]); and be64_to_cpus(&l1_table[i]);,
>>>>> is it possible that some other I/O references this interim l1 table, whose entries contain the be64 l2 table offsets?
>>>>> A be64 l2 table offset may be a very large value, hundreds of TB is possible,
>>>>> and then the qcow2 file will be truncated to far larger than its normal size.
>>>>> So we see a huge qcow2 file size with ls -hl, while the size displayed by qemu-img info is still normal.
>>>>>
>>>>> If the possibility mentioned above exists, the rough code below may fix it,
>>>>>     if (ret == 0 && addend >= 0 && l1_modified) {
>>>>>         uint64_t *tmp_l1_table = g_malloc0(l1_size * sizeof(uint64_t));
>>>>>         memcpy(tmp_l1_table, l1_table, l1_size * sizeof(uint64_t));
>>>>>         for (i = 0; i < l1_size; i++) {
>>>>>             cpu_to_be64s(&tmp_l1_table[i]);
>>>>>         }
>>>>>         ret = bdrv_pwrite_sync(bs->file, l1_table_offset, tmp_l1_table, l1_size2);
>>>>>
>>>>>         g_free(tmp_l1_table);
>>>>>     }
>>>> l1_table is already a local variable (local to
>>>> qcow2_update_snapshot_refcount()), so I can't really imagine how
>>>> introducing another local buffer should mitigate the problem, if there
>>>> is any.
>>>>
>>> l1_table is not necessarily local to qcow2_update_snapshot_refcount();
>>> that depends on the condition "if (l1_table_offset != s->l1_table_offset)".
>>> If the condition is not true, then l1_table = s->l1_table.
>> Oh, yes, you're right. Okay, so in theory nothing should happen anyway,
>> because qcow2 does not have to be reentrant (so s->l1_table will not be
>> accessed while it's big endian and therefore possibly not in CPU order).
> Could you explain in more detail why qcow2 does not have to be reentrant?
> In the stack below,
> qcow2_update_snapshot_refcount
> |- cpu_to_be64s(&l1_table[i])
> |- bdrv_pwrite_sync

This is executed on bs->file, not the qcow2 BDS.

Max

> |-- bdrv_pwrite
> |--- bdrv_pwritev
> |---- bdrv_prwv_co
> |----- aio_poll(aio_context) <== this aio_context is qemu_aio_context
> |------ aio_dispatch
> |------- bdrv_co_io_em_complete
> |-------- qemu_coroutine_enter(co->coroutine, NULL); <== coroutine entry is bdrv_co_do_rw
> bdrv_co_do_rw will access l1_table to perform the I/O operation.
>
> Thanks,
> Zhang Haoyu
>> But I find it rather ugly to convert the cached L1 table to big endian,
>> so I'd be fine with the patch you proposed.
>>
>> Max


* Re: [Qemu-devel] [question] is it possible that big-endian l1 table offset referenced by other I/O while updating l1 table offset in qcow2_update_snapshot_refcount?
  2014-10-13  8:02           ` Max Reitz
@ 2014-10-13  8:19             ` Zhang Haoyu
  2014-10-13  9:00               ` Max Reitz
  0 siblings, 1 reply; 11+ messages in thread
From: Zhang Haoyu @ 2014-10-13  8:19 UTC (permalink / raw)
  To: Max Reitz, Eric Blake, qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi

>>>>>>>> Hi,
>>>>>>>> I have encountered a problem: after deleting a snapshot, the qcow2 image size displayed by the ls command is much larger than it should be,
>>>>>>>> but the virtual disk size reported by qemu-img info is okay.
>>>>>>>> I suspect that while the l1 table offset is being updated, some other I/O job references the big-endian l1 table offset (a very large value),
>>>>>>>> so the file is truncated to a very large size.
>>>>>>> Not quite.  Rather, all the data that the snapshot used to occupy is
>>>>>>> still consuming holes in the file; the maximum offset of the file is
>>>>>>> still unchanged, even if the file is no longer using as many referenced
>>>>>>> clusters.  Recent changes have gone in to sparsify the file when
>>>>>>> possible (punching holes if your kernel and file system are new enough to
>>>>>>> support that), so that it is not consuming the amount of disk space that
>>>>>>> a mere ls reports.  But if what you are asking for is a way to compact
>>>>>>> the file back down, then you'll need to submit a patch.  The idea of
>>>>>>> having an online defragmenter for qcow2 files has been kicked around
>>>>>>> before, but it is complex enough that no one has attempted a patch yet.
>>>>>> Sorry, I didn't state the problem clearly.
>>>>>> In qcow2_update_snapshot_refcount(), in the code below,
>>>>>>        /* Update L1 only if it isn't deleted anyway (addend = -1) */
>>>>>>        if (ret == 0 && addend >= 0 && l1_modified) {
>>>>>>            for (i = 0; i < l1_size; i++) {
>>>>>>                cpu_to_be64s(&l1_table[i]);
>>>>>>            }
>>>>>>
>>>>>>            ret = bdrv_pwrite_sync(bs->file, l1_table_offset, l1_table, l1_size2);
>>>>>>
>>>>>>            for (i = 0; i < l1_size; i++) {
>>>>>>                be64_to_cpus(&l1_table[i]);
>>>>>>            }
>>>>>>        }
>>>>>> between cpu_to_be64s(&l1_table[i]); and be64_to_cpus(&l1_table[i]);,
>>>>>> is it possible that some other I/O references this interim l1 table, whose entries contain the be64 l2 table offsets?
>>>>>> A be64 l2 table offset may be a very large value, hundreds of TB is possible,
>>>>>> and then the qcow2 file will be truncated to far larger than its normal size.
>>>>>> So we see a huge qcow2 file size with ls -hl, while the size displayed by qemu-img info is still normal.
>>>>>>
>>>>>> If the possibility mentioned above exists, the rough code below may fix it,
>>>>>>     if (ret == 0 && addend >= 0 && l1_modified) {
>>>>>>         uint64_t *tmp_l1_table = g_malloc0(l1_size * sizeof(uint64_t));
>>>>>>         memcpy(tmp_l1_table, l1_table, l1_size * sizeof(uint64_t));
>>>>>>         for (i = 0; i < l1_size; i++) {
>>>>>>             cpu_to_be64s(&tmp_l1_table[i]);
>>>>>>         }
>>>>>>         ret = bdrv_pwrite_sync(bs->file, l1_table_offset, tmp_l1_table, l1_size2);
>>>>>>
>>>>>>         g_free(tmp_l1_table);
>>>>>>     }
>>>>> l1_table is already a local variable (local to
>>>>> qcow2_update_snapshot_refcount()), so I can't really imagine how
>>>>> introducing another local buffer should mitigate the problem, if there
>>>>> is any.
>>>>>
>>>> l1_table is not necessarily local to qcow2_update_snapshot_refcount();
>>>> that depends on the condition "if (l1_table_offset != s->l1_table_offset)".
>>>> If the condition is not true, then l1_table = s->l1_table.
>>> Oh, yes, you're right. Okay, so in theory nothing should happen anyway,
>>> because qcow2 does not have to be reentrant (so s->l1_table will not be
>>> accessed while it's big endian and therefore possibly not in CPU order).
>> Could you explain in more detail why qcow2 does not have to be reentrant?
>> In the stack below,
>> qcow2_update_snapshot_refcount
>> |- cpu_to_be64s(&l1_table[i])
>> |- bdrv_pwrite_sync
>
>This is executed on bs->file, not the qcow2 BDS.
>
Yes, bs->file is passed to bdrv_pwrite_sync here, 
but aio_poll(aio_context) will poll the AIO of all BDSes, not only that of bs->file, won't it?
Is it possible that pending AIO belonging to this qcow2 BDS still exists?

Thanks,
Zhang Haoyu
>Max
>
>> |-- bdrv_pwrite
>> |--- bdrv_pwritev
>> |---- bdrv_prwv_co
>> |----- aio_poll(aio_context) <== this aio_context is qemu_aio_context
>> |------ aio_dispatch
>> |------- bdrv_co_io_em_complete
>> |-------- qemu_coroutine_enter(co->coroutine, NULL); <== coroutine entry is bdrv_co_do_rw
>> bdrv_co_do_rw will access l1_table to perform the I/O operation.
>>
>> Thanks,
>> Zhang Haoyu
>>> But I find it rather ugly to convert the cached L1 table to big endian,
>>> so I'd be fine with the patch you proposed.
>>>
>>> Max


* Re: [Qemu-devel] [question] is it possible that big-endian l1 table offset referenced by other I/O while updating l1 table offset in qcow2_update_snapshot_refcount?
  2014-10-13  8:19             ` [Qemu-devel] [question] is it possible that big-endian l1 table offset referenced by " Zhang Haoyu
@ 2014-10-13  9:00               ` Max Reitz
  2014-10-14  1:55                 ` [Qemu-devel] [question] is it possible that big-endian l1 table offset referenced by other I/O while updating l1 table offset in qcow2_update_snapshot_refcount? Zhang Haoyu
  0 siblings, 1 reply; 11+ messages in thread
From: Max Reitz @ 2014-10-13  9:00 UTC (permalink / raw)
  To: Zhang Haoyu, Eric Blake, qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi

On 13.10.2014 at 10:19, Zhang Haoyu wrote:
>>>>>>>>> Hi,
>>>>>>>>> I have encountered a problem: after deleting a snapshot, the qcow2 image size displayed by the ls command is much larger than it should be,
>>>>>>>>> but the virtual disk size reported by qemu-img info is okay.
>>>>>>>>> I suspect that while the l1 table offset is being updated, some other I/O job references the big-endian l1 table offset (a very large value),
>>>>>>>>> so the file is truncated to a very large size.
>>>>>>>> Not quite.  Rather, all the data that the snapshot used to occupy is
>>>>>>>> still consuming holes in the file; the maximum offset of the file is
>>>>>>>> still unchanged, even if the file is no longer using as many referenced
>>>>>>>> clusters.  Recent changes have gone in to sparsify the file when
>>>>>>>> possible (punching holes if your kernel and file system are new enough to
>>>>>>>> support that), so that it is not consuming the amount of disk space that
>>>>>>>> a mere ls reports.  But if what you are asking for is a way to compact
>>>>>>>> the file back down, then you'll need to submit a patch.  The idea of
>>>>>>>> having an online defragmenter for qcow2 files has been kicked around
>>>>>>>> before, but it is complex enough that no one has attempted a patch yet.
>>>>>>> Sorry, I didn't state the problem clearly.
>>>>>>> In qcow2_update_snapshot_refcount(), in the code below,
>>>>>>>         /* Update L1 only if it isn't deleted anyway (addend = -1) */
>>>>>>>         if (ret == 0 && addend >= 0 && l1_modified) {
>>>>>>>             for (i = 0; i < l1_size; i++) {
>>>>>>>                 cpu_to_be64s(&l1_table[i]);
>>>>>>>             }
>>>>>>>
>>>>>>>             ret = bdrv_pwrite_sync(bs->file, l1_table_offset, l1_table, l1_size2);
>>>>>>>
>>>>>>>             for (i = 0; i < l1_size; i++) {
>>>>>>>                 be64_to_cpus(&l1_table[i]);
>>>>>>>             }
>>>>>>>         }
>>>>>>> between cpu_to_be64s(&l1_table[i]); and be64_to_cpus(&l1_table[i]);,
>>>>>>> is it possible that some other I/O references this interim l1 table, whose entries contain the be64 l2 table offsets?
>>>>>>> A be64 l2 table offset may be a very large value, hundreds of TB is possible,
>>>>>>> and then the qcow2 file will be truncated to far larger than its normal size.
>>>>>>> So we see a huge qcow2 file size with ls -hl, while the size displayed by qemu-img info is still normal.
>>>>>>>
>>>>>>> If the possibility mentioned above exists, the rough code below may fix it,
>>>>>>>     if (ret == 0 && addend >= 0 && l1_modified) {
>>>>>>>         uint64_t *tmp_l1_table = g_malloc0(l1_size * sizeof(uint64_t));
>>>>>>>         memcpy(tmp_l1_table, l1_table, l1_size * sizeof(uint64_t));
>>>>>>>         for (i = 0; i < l1_size; i++) {
>>>>>>>             cpu_to_be64s(&tmp_l1_table[i]);
>>>>>>>         }
>>>>>>>         ret = bdrv_pwrite_sync(bs->file, l1_table_offset, tmp_l1_table, l1_size2);
>>>>>>>
>>>>>>>         g_free(tmp_l1_table);
>>>>>>>     }
>>>>>> l1_table is already a local variable (local to
>>>>>> qcow2_update_snapshot_refcount()), so I can't really imagine how
>>>>>> introducing another local buffer should mitigate the problem, if there
>>>>>> is any.
>>>>>>
>>>>> l1_table is not necessarily local to qcow2_update_snapshot_refcount();
>>>>> that depends on the condition "if (l1_table_offset != s->l1_table_offset)".
>>>>> If the condition is not true, then l1_table = s->l1_table.
>>>> Oh, yes, you're right. Okay, so in theory nothing should happen anyway,
>>>> because qcow2 does not have to be reentrant (so s->l1_table will not be
>>>> accessed while it's big endian and therefore possibly not in CPU order).
>>> Could you explain in more detail why qcow2 does not have to be reentrant?
>>> In the stack below,
>>> qcow2_update_snapshot_refcount
>>> |- cpu_to_be64s(&l1_table[i])
>>> |- bdrv_pwrite_sync
>> This is executed on bs->file, not the qcow2 BDS.
>>
> Yes, bs->file is passed to bdrv_pwrite_sync here,
> but aio_poll(aio_context) will poll the AIO of all BDSes, not only that of bs->file, won't it?
> Is it possible that pending AIO belonging to this qcow2 BDS still exists?

qcow2 is generally not reentrant; this is ensured by locking
(BDRVQcowState.lock). As long as one request for a BDS is still running,
it will not be interrupted.
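
In rough terms the pattern is the following (a simplified sketch, not the
literal source; the function name and signature are abbreviated):

    /* Sketch: each qcow2 request coroutine takes s->lock before touching
     * shared metadata such as s->l1_table, so another request can only
     * run after the unlock, never in the middle of the update. */
    static coroutine_fn int qcow2_co_request(BlockDriverState *bs)
    {
        BDRVQcowState *s = bs->opaque;
        int ret = 0;

        qemu_co_mutex_lock(&s->lock);
        /* ... translate guest offsets via s->l1_table and the l2 tables,
         * then issue I/O on bs->file ... */
        qemu_co_mutex_unlock(&s->lock);
        return ret;
    }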

Max

> Thanks,
> Zhang Haoyu
>> Max
>>
>>> |-- bdrv_pwrite
>>> |--- bdrv_pwritev
>>> |---- bdrv_prwv_co
>>> |----- aio_poll(aio_context) <== this aio_context is qemu_aio_context
>>> |------ aio_dispatch
>>> |------- bdrv_co_io_em_complete
>>> |-------- qemu_coroutine_enter(co->coroutine, NULL); <== coroutine entry is bdrv_co_do_rw
>>> bdrv_co_do_rw will access l1_table to perform the I/O operation.
>>>
>>> Thanks,
>>> Zhang Haoyu
>>>> But I find it rather ugly to convert the cached L1 table to big endian,
>>>> so I'd be fine with the patch you proposed.
>>>>
>>>> Max


* Re: [Qemu-devel] [question] is it possible that big-endian l1 table offset referenced by other I/O while updating l1 table offset in qcow2_update_snapshot_refcount?
  2014-10-13  9:00               ` Max Reitz
@ 2014-10-14  1:55                 ` Zhang Haoyu
  0 siblings, 0 replies; 11+ messages in thread
From: Zhang Haoyu @ 2014-10-14  1:55 UTC (permalink / raw)
  To: Max Reitz, Eric Blake, qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi

>>>>>>>>>> Hi,
>>>>>>>>>> I have encountered a problem: after deleting a snapshot, the qcow2 image size displayed by the ls command is much larger than it should be,
>>>>>>>>>> but the virtual disk size reported by qemu-img info is okay.
>>>>>>>>>> I suspect that while the l1 table offset is being updated, some other I/O job references the big-endian l1 table offset (a very large value),
>>>>>>>>>> so the file is truncated to a very large size.
>>>>>>>>> Not quite.  Rather, all the data that the snapshot used to occupy is
>>>>>>>>> still consuming holes in the file; the maximum offset of the file is
>>>>>>>>> still unchanged, even if the file is no longer using as many referenced
>>>>>>>>> clusters.  Recent changes have gone in to sparsify the file when
>>>>>>>>> possible (punching holes if your kernel and file system are new enough to
>>>>>>>>> support that), so that it is not consuming the amount of disk space that
>>>>>>>>> a mere ls reports.  But if what you are asking for is a way to compact
>>>>>>>>> the file back down, then you'll need to submit a patch.  The idea of
>>>>>>>>> having an online defragmenter for qcow2 files has been kicked around
>>>>>>>>> before, but it is complex enough that no one has attempted a patch yet.
>>>>>>>> Sorry, I didn't state the problem clearly.
>>>>>>>> In qcow2_update_snapshot_refcount(), in the code below,
>>>>>>>>         /* Update L1 only if it isn't deleted anyway (addend = -1) */
>>>>>>>>         if (ret == 0 && addend >= 0 && l1_modified) {
>>>>>>>>             for (i = 0; i < l1_size; i++) {
>>>>>>>>                 cpu_to_be64s(&l1_table[i]);
>>>>>>>>             }
>>>>>>>>
>>>>>>>>             ret = bdrv_pwrite_sync(bs->file, l1_table_offset, l1_table, l1_size2);
>>>>>>>>
>>>>>>>>             for (i = 0; i < l1_size; i++) {
>>>>>>>>                 be64_to_cpus(&l1_table[i]);
>>>>>>>>             }
>>>>>>>>         }
>>>>>>>> between cpu_to_be64s(&l1_table[i]); and be64_to_cpus(&l1_table[i]);,
>>>>>>>> is it possible that some other I/O references this interim l1 table, whose entries contain the be64 l2 table offsets?
>>>>>>>> A be64 l2 table offset may be a very large value, hundreds of TB is possible,
>>>>>>>> and then the qcow2 file will be truncated to far larger than its normal size.
>>>>>>>> So we see a huge qcow2 file size with ls -hl, while the size displayed by qemu-img info is still normal.
>>>>>>>>
>>>>>>>> If the possibility mentioned above exists, the rough code below may fix it,
>>>>>>>>     if (ret == 0 && addend >= 0 && l1_modified) {
>>>>>>>>         uint64_t *tmp_l1_table = g_malloc0(l1_size * sizeof(uint64_t));
>>>>>>>>         memcpy(tmp_l1_table, l1_table, l1_size * sizeof(uint64_t));
>>>>>>>>         for (i = 0; i < l1_size; i++) {
>>>>>>>>             cpu_to_be64s(&tmp_l1_table[i]);
>>>>>>>>         }
>>>>>>>>         ret = bdrv_pwrite_sync(bs->file, l1_table_offset, tmp_l1_table, l1_size2);
>>>>>>>>
>>>>>>>>         g_free(tmp_l1_table);
>>>>>>>>     }
>>>>>>> l1_table is already a local variable (local to
>>>>>>> qcow2_update_snapshot_refcount()), so I can't really imagine how
>>>>>>> introducing another local buffer should mitigate the problem, if there
>>>>>>> is any.
>>>>>>>
>>>>>> l1_table is not necessarily local to qcow2_update_snapshot_refcount();
>>>>>> that depends on the condition "if (l1_table_offset != s->l1_table_offset)".
>>>>>> If the condition is not true, then l1_table = s->l1_table.
>>>>> Oh, yes, you're right. Okay, so in theory nothing should happen anyway,
>>>>> because qcow2 does not have to be reentrant (so s->l1_table will not be
>>>>> accessed while it's big endian and therefore possibly not in CPU order).
>>>> Could you explain in more detail why qcow2 does not have to be reentrant?
>>>> In the stack below,
>>>> qcow2_update_snapshot_refcount
>>>> |- cpu_to_be64s(&l1_table[i])
>>>> |- bdrv_pwrite_sync
>>> This is executed on bs->file, not the qcow2 BDS.
>>>
>> Yes, bs->file is passed to bdrv_pwrite_sync here,
>> but aio_poll(aio_context) will poll the AIO of all BDSes, not only that of bs->file, won't it?
>> Is it possible that pending AIO belonging to this qcow2 BDS still exists?
>
>qcow2 is generally not reentrant; this is ensured by locking
>(BDRVQcowState.lock). As long as one request for a BDS is still running,
>it will not be interrupted.
>
This problem can be reproduced within about half an hour with a loop of
savevm -> delvm -> savevm -> delvm ...,
but after applying the above change of using a local buffer to write out the l1 table,
the problem has not occurred for more than 48 hours with the same loop of
savevm -> delvm -> savevm -> delvm ...
Could you help analyse this problem, please?

And, because bdrv_co_do_rw runs in a coroutine context rather than in another thread,
both bdrv_co_do_rw and qcow2_update_snapshot_refcount are performed in the same thread (the main thread),
so how does BDRVQcowState.lock prevent reentrancy?

Thanks,
Zhang Haoyu
>Max
>
>> Thanks,
>> Zhang Haoyu
>>> Max
>>>
>>>> |-- bdrv_pwrite
>>>> |--- bdrv_pwritev
>>>> |---- bdrv_prwv_co
>>>> |----- aio_poll(aio_context) <== this aio_context is qemu_aio_context
>>>> |------ aio_dispatch
>>>> |------- bdrv_co_io_em_complete
>>>> |-------- qemu_coroutine_enter(co->coroutine, NULL); <== coroutine entry is bdrv_co_do_rw
>>>> bdrv_co_do_rw will access l1_table to perform the I/O operation.
>>>>
>>>> Thanks,
>>>> Zhang Haoyu
>>>>> But I find it rather ugly to convert the cached L1 table to big endian,
>>>>> so I'd be fine with the patch you proposed.
>>>>>
>>>>> Max


Thread overview: 11+ messages
2014-10-09 11:17 [Qemu-devel] [question] is it possible that big-endian l1 table offset referenced by other I/O while updating l1 table offset in qcow2_update_snapshot_refcount? Zhang Haoyu
2014-10-09 14:58 ` Eric Blake
2014-10-10  1:54 ` [Qemu-devel] [question] is it possible " Zhang Haoyu
2014-10-12 13:23   ` Max Reitz
2014-10-13  3:17     ` [Qemu-devel] [question] is it possible that big-endian l1 table offset " Zhang Haoyu
2014-10-13  6:40       ` Max Reitz
2014-10-13  7:13         ` [Qemu-devel] [question] is it possible that big-endian l1 table offset referenced " Zhang Haoyu
2014-10-13  8:02           ` Max Reitz
2014-10-13  8:19             ` [Qemu-devel] [question] is it possible that big-endian l1 table offset referenced by " Zhang Haoyu
2014-10-13  9:00               ` Max Reitz
2014-10-14  1:55                 ` [Qemu-devel] [question] is it possible that big-endian l1 table offset referenced by other I/O while updating l1 table offset in qcow2_update_snapshot_refcount? Zhang Haoyu
