* [Qemu-devel] [PATCH] file-posix: Cache lseek result for data regions
@ 2019-01-24 14:17 Kevin Wolf
  2019-01-24 14:40 ` Vladimir Sementsov-Ogievskiy
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Kevin Wolf @ 2019-01-24 14:17 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, vsementsov, mreitz, eblake, qemu-devel

Depending on the exact image layout and the storage backend (tmpfs is
known to have very slow SEEK_HOLE/SEEK_DATA), caching lseek results can
save us a lot of time e.g. during a mirror block job or qemu-img convert
with a fragmented source image (.bdrv_co_block_status on the protocol
layer can be called for every single cluster in the extreme case).

We may only cache data regions because of possible concurrent writers.
This means that we can later treat a recently punched hole as data, but
this is safe. We can't cache holes because then we might treat recently
written data as holes, which can cause corruption.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/file-posix.c | 51 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 49 insertions(+), 2 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 8aee7a3fb8..7272c7c99d 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -168,6 +168,12 @@ typedef struct BDRVRawState {
     bool needs_alignment;
     bool check_cache_dropped;
 
+    struct seek_data_cache {
+        bool        valid;
+        uint64_t    start;
+        uint64_t    end;
+    } seek_data_cache;
+
     PRManager *pr_mgr;
 } BDRVRawState;
 
@@ -1555,8 +1561,17 @@ static int handle_aiocb_write_zeroes_unmap(void *opaque)
 {
     RawPosixAIOData *aiocb = opaque;
     BDRVRawState *s G_GNUC_UNUSED = aiocb->bs->opaque;
+    struct seek_data_cache *sdc;
     int ret;
 
+    /* Invalidate seek_data_cache if it overlaps */
+    sdc = &s->seek_data_cache;
+    if (sdc->valid && !(sdc->end < aiocb->aio_offset ||
+                        sdc->start > aiocb->aio_offset + aiocb->aio_nbytes))
+    {
+        sdc->valid = false;
+    }
+
     /* First try to write zeros and unmap at the same time */
 
 #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
@@ -1634,11 +1649,20 @@ static int handle_aiocb_discard(void *opaque)
     RawPosixAIOData *aiocb = opaque;
     int ret = -EOPNOTSUPP;
     BDRVRawState *s = aiocb->bs->opaque;
+    struct seek_data_cache *sdc;
 
     if (!s->has_discard) {
         return -ENOTSUP;
     }
 
+    /* Invalidate seek_data_cache if it overlaps */
+    sdc = &s->seek_data_cache;
+    if (sdc->valid && !(sdc->end < aiocb->aio_offset ||
+                        sdc->start > aiocb->aio_offset + aiocb->aio_nbytes))
+    {
+        sdc->valid = false;
+    }
+
     if (aiocb->aio_type & QEMU_AIO_BLKDEV) {
 #ifdef BLKDISCARD
         do {
@@ -2424,6 +2448,8 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
                                             int64_t *map,
                                             BlockDriverState **file)
 {
+    BDRVRawState *s = bs->opaque;
+    struct seek_data_cache *sdc;
     off_t data = 0, hole = 0;
     int ret;
 
@@ -2439,6 +2465,14 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
         return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
     }
 
+    sdc = &s->seek_data_cache;
+    if (sdc->valid && sdc->start <= offset && sdc->end > offset) {
+        *pnum = MIN(bytes, sdc->end - offset);
+        *map = offset;
+        *file = bs;
+        return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
+    }
+
     ret = find_allocation(bs, offset, &data, &hole);
     if (ret == -ENXIO) {
         /* Trailing hole */
@@ -2451,14 +2485,27 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
     } else if (data == offset) {
         /* On a data extent, compute bytes to the end of the extent,
          * possibly including a partial sector at EOF. */
-        *pnum = MIN(bytes, hole - offset);
+        *pnum = hole - offset;
         ret = BDRV_BLOCK_DATA;
     } else {
         /* On a hole, compute bytes to the beginning of the next extent.  */
         assert(hole == offset);
-        *pnum = MIN(bytes, data - offset);
+        *pnum = data - offset;
         ret = BDRV_BLOCK_ZERO;
     }
+
+    /* Caching allocated ranges is okay even if another process writes to the
+     * same file because we allow declaring things allocated even if there is a
+     * hole. However, we cannot cache holes without risking corruption. */
+    if (ret == BDRV_BLOCK_DATA) {
+        *sdc = (struct seek_data_cache) {
+            .valid  = true,
+            .start  = offset,
+            .end    = offset + *pnum,
+        };
+    }
+
+    *pnum = MIN(*pnum, bytes);
     *map = offset;
     *file = bs;
     return ret | BDRV_BLOCK_OFFSET_VALID;
-- 
2.20.1


* Re: [Qemu-devel] [PATCH] file-posix: Cache lseek result for data regions
  2019-01-24 14:17 [Qemu-devel] [PATCH] file-posix: Cache lseek result for data regions Kevin Wolf
@ 2019-01-24 14:40 ` Vladimir Sementsov-Ogievskiy
  2019-01-24 15:11   ` Kevin Wolf
  2019-01-24 15:56 ` Eric Blake
  2019-01-24 16:18 ` Vladimir Sementsov-Ogievskiy
  2 siblings, 1 reply; 15+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-01-24 14:40 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: mreitz, eblake, qemu-devel

24.01.2019 17:17, Kevin Wolf wrote:
> Depending on the exact image layout and the storage backend (tmpfs is
> konwn to have very slow SEEK_HOLE/SEEK_DATA), caching lseek results can
> save us a lot of time e.g. during a mirror block job or qemu-img convert
> with a fragmented source image (.bdrv_co_block_status on the protocol
> layer can be called for every single cluster in the extreme case).
> 
> We may only cache data regions because of possible concurrent writers.
> This means that we can later treat a recently punched hole as data, but
> this is safe. We can't cache holes because then we might treat recently
> written data as holes, which can cause corruption.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   block/file-posix.c | 51 ++++++++++++++++++++++++++++++++++++++++++++--
>   1 file changed, 49 insertions(+), 2 deletions(-)
> 
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 8aee7a3fb8..7272c7c99d 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -168,6 +168,12 @@ typedef struct BDRVRawState {
>       bool needs_alignment;
>       bool check_cache_dropped;
>   
> +    struct seek_data_cache {
> +        bool        valid;
> +        uint64_t    start;
> +        uint64_t    end;
> +    } seek_data_cache;

Should we have some mutex-locking to protect it?

-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [PATCH] file-posix: Cache lseek result for data regions
  2019-01-24 14:40 ` Vladimir Sementsov-Ogievskiy
@ 2019-01-24 15:11   ` Kevin Wolf
  2019-01-24 15:22     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 15+ messages in thread
From: Kevin Wolf @ 2019-01-24 15:11 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy; +Cc: qemu-block, mreitz, eblake, qemu-devel

Am 24.01.2019 um 15:40 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 24.01.2019 17:17, Kevin Wolf wrote:
> > Depending on the exact image layout and the storage backend (tmpfs is
> > konwn to have very slow SEEK_HOLE/SEEK_DATA), caching lseek results can
> > save us a lot of time e.g. during a mirror block job or qemu-img convert
> > with a fragmented source image (.bdrv_co_block_status on the protocol
> > layer can be called for every single cluster in the extreme case).
> > 
> > We may only cache data regions because of possible concurrent writers.
> > This means that we can later treat a recently punched hole as data, but
> > this is safe. We can't cache holes because then we might treat recently
> > written data as holes, which can cause corruption.
> > 
> > Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> > ---
> >   block/file-posix.c | 51 ++++++++++++++++++++++++++++++++++++++++++++--
> >   1 file changed, 49 insertions(+), 2 deletions(-)
> > 
> > diff --git a/block/file-posix.c b/block/file-posix.c
> > index 8aee7a3fb8..7272c7c99d 100644
> > --- a/block/file-posix.c
> > +++ b/block/file-posix.c
> > @@ -168,6 +168,12 @@ typedef struct BDRVRawState {
> >       bool needs_alignment;
> >       bool check_cache_dropped;
> >   
> > +    struct seek_data_cache {
> > +        bool        valid;
> > +        uint64_t    start;
> > +        uint64_t    end;
> > +    } seek_data_cache;
> 
> Should we have some mutex-locking to protect it?

It is protected by the AioContext lock, like everything else in
BDRVRawState.

Kevin


* Re: [Qemu-devel] [PATCH] file-posix: Cache lseek result for data regions
  2019-01-24 15:11   ` Kevin Wolf
@ 2019-01-24 15:22     ` Vladimir Sementsov-Ogievskiy
  2019-01-24 15:42       ` Kevin Wolf
  0 siblings, 1 reply; 15+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-01-24 15:22 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-block, mreitz, eblake, qemu-devel, Paolo Bonzini

24.01.2019 18:11, Kevin Wolf wrote:
> Am 24.01.2019 um 15:40 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> 24.01.2019 17:17, Kevin Wolf wrote:
>>> Depending on the exact image layout and the storage backend (tmpfs is
>>> konwn to have very slow SEEK_HOLE/SEEK_DATA), caching lseek results can
>>> save us a lot of time e.g. during a mirror block job or qemu-img convert
>>> with a fragmented source image (.bdrv_co_block_status on the protocol
>>> layer can be called for every single cluster in the extreme case).
>>>
>>> We may only cache data regions because of possible concurrent writers.
>>> This means that we can later treat a recently punched hole as data, but
>>> this is safe. We can't cache holes because then we might treat recently
>>> written data as holes, which can cause corruption.
>>>
>>> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
>>> ---
>>>    block/file-posix.c | 51 ++++++++++++++++++++++++++++++++++++++++++++--
>>>    1 file changed, 49 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/block/file-posix.c b/block/file-posix.c
>>> index 8aee7a3fb8..7272c7c99d 100644
>>> --- a/block/file-posix.c
>>> +++ b/block/file-posix.c
>>> @@ -168,6 +168,12 @@ typedef struct BDRVRawState {
>>>        bool needs_alignment;
>>>        bool check_cache_dropped;
>>>    
>>> +    struct seek_data_cache {
>>> +        bool        valid;
>>> +        uint64_t    start;
>>> +        uint64_t    end;
>>> +    } seek_data_cache;
>>
>> Should we have some mutex-locking to protect it?
> 
> It is protected by the AioContext lock, like everything else in
> BDRVRawState.
> 

Recently Paolo asked me not to add more users of AioContext lock. Unfortunately
I don't understand the whole picture around it.. Doesn't this apply here?
https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg03410.html



-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [PATCH] file-posix: Cache lseek result for data regions
  2019-01-24 15:22     ` Vladimir Sementsov-Ogievskiy
@ 2019-01-24 15:42       ` Kevin Wolf
  2019-01-25 10:10         ` Paolo Bonzini
  0 siblings, 1 reply; 15+ messages in thread
From: Kevin Wolf @ 2019-01-24 15:42 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: qemu-block, mreitz, eblake, qemu-devel, Paolo Bonzini

Am 24.01.2019 um 16:22 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 24.01.2019 18:11, Kevin Wolf wrote:
> > Am 24.01.2019 um 15:40 hat Vladimir Sementsov-Ogievskiy geschrieben:
> >> 24.01.2019 17:17, Kevin Wolf wrote:
> >>> Depending on the exact image layout and the storage backend (tmpfs is
> >>> konwn to have very slow SEEK_HOLE/SEEK_DATA), caching lseek results can
> >>> save us a lot of time e.g. during a mirror block job or qemu-img convert
> >>> with a fragmented source image (.bdrv_co_block_status on the protocol
> >>> layer can be called for every single cluster in the extreme case).
> >>>
> >>> We may only cache data regions because of possible concurrent writers.
> >>> This means that we can later treat a recently punched hole as data, but
> >>> this is safe. We can't cache holes because then we might treat recently
> >>> written data as holes, which can cause corruption.
> >>>
> >>> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> >>> ---
> >>>    block/file-posix.c | 51 ++++++++++++++++++++++++++++++++++++++++++++--
> >>>    1 file changed, 49 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/block/file-posix.c b/block/file-posix.c
> >>> index 8aee7a3fb8..7272c7c99d 100644
> >>> --- a/block/file-posix.c
> >>> +++ b/block/file-posix.c
> >>> @@ -168,6 +168,12 @@ typedef struct BDRVRawState {
> >>>        bool needs_alignment;
> >>>        bool check_cache_dropped;
> >>>    
> >>> +    struct seek_data_cache {
> >>> +        bool        valid;
> >>> +        uint64_t    start;
> >>> +        uint64_t    end;
> >>> +    } seek_data_cache;
> >>
> >> Should we have some mutex-locking to protect it?
> > 
> > It is protected by the AioContext lock, like everything else in
> > BDRVRawState.
> 
> Recently Paolo asked me not to add more users of AioContext lock. Unfortunately
> I don't understand the whole picture around it.. Doesn't this apply here?
> https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg03410.html

I don't know. Honestly I feel nobody except Paolo knows, because we
don't know his patches yet. But raw doesn't have an s->lock yet, so I
think removing the AioContext lock involves some work on it anyway and
adding this doesn't really change the amount of work.

Kevin


* Re: [Qemu-devel] [PATCH] file-posix: Cache lseek result for data regions
  2019-01-24 14:17 [Qemu-devel] [PATCH] file-posix: Cache lseek result for data regions Kevin Wolf
  2019-01-24 14:40 ` Vladimir Sementsov-Ogievskiy
@ 2019-01-24 15:56 ` Eric Blake
  2019-01-29 10:56   ` Kevin Wolf
  2019-01-24 16:18 ` Vladimir Sementsov-Ogievskiy
  2 siblings, 1 reply; 15+ messages in thread
From: Eric Blake @ 2019-01-24 15:56 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: vsementsov, mreitz, qemu-devel


On 1/24/19 8:17 AM, Kevin Wolf wrote:
> Depending on the exact image layout and the storage backend (tmpfs is
> konwn to have very slow SEEK_HOLE/SEEK_DATA), caching lseek results can
> save us a lot of time e.g. during a mirror block job or qemu-img convert
> with a fragmented source image (.bdrv_co_block_status on the protocol
> layer can be called for every single cluster in the extreme case).
> 
> We may only cache data regions because of possible concurrent writers.
> This means that we can later treat a recently punched hole as data, but
> this is safe. We can't cache holes because then we might treat recently
> written data as holes, which can cause corruption.

gluster copies heavily from file-posix's implementation; should it also
copy this cache of known-data?  Should NBD also cache known-data when
NBD_CMD_BLOCK_STATUS is available?

> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>  block/file-posix.c | 51 ++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 49 insertions(+), 2 deletions(-)
> 

>  
> +    /* Invalidate seek_data_cache if it overlaps */
> +    sdc = &s->seek_data_cache;
> +    if (sdc->valid && !(sdc->end < aiocb->aio_offset ||
> +                        sdc->start > aiocb->aio_offset + aiocb->aio_nbytes))
> +    {
> +        sdc->valid = false;
> +    }

Worth a helper function for this repeated action?
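
As a rough sketch of what such a helper could look like (illustration
only, not part of the posted patch; the name raw_invalidate_seek_cache
is invented):

static void raw_invalidate_seek_cache(BDRVRawState *s, uint64_t offset,
                                      uint64_t bytes)
{
    struct seek_data_cache *sdc = &s->seek_data_cache;

    /* Conservative overlap test as in the posted patch; a stricter
     * half-open check is discussed later in the thread. */
    if (sdc->valid && !(sdc->end < offset || sdc->start > offset + bytes)) {
        sdc->valid = false;
    }
}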

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




* Re: [Qemu-devel] [PATCH] file-posix: Cache lseek result for data regions
  2019-01-24 14:17 [Qemu-devel] [PATCH] file-posix: Cache lseek result for data regions Kevin Wolf
  2019-01-24 14:40 ` Vladimir Sementsov-Ogievskiy
  2019-01-24 15:56 ` Eric Blake
@ 2019-01-24 16:18 ` Vladimir Sementsov-Ogievskiy
  2019-01-24 16:36   ` Kevin Wolf
  2 siblings, 1 reply; 15+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-01-24 16:18 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: mreitz, eblake, qemu-devel

24.01.2019 17:17, Kevin Wolf wrote:
> Depending on the exact image layout and the storage backend (tmpfs is
> konwn to have very slow SEEK_HOLE/SEEK_DATA), caching lseek results can
> save us a lot of time e.g. during a mirror block job or qemu-img convert
> with a fragmented source image (.bdrv_co_block_status on the protocol
> layer can be called for every single cluster in the extreme case).
> 
> We may only cache data regions because of possible concurrent writers.
> This means that we can later treat a recently punched hole as data, but
> this is safe. We can't cache holes because then we might treat recently
> written data as holes, which can cause corruption.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   block/file-posix.c | 51 ++++++++++++++++++++++++++++++++++++++++++++--
>   1 file changed, 49 insertions(+), 2 deletions(-)
> 
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 8aee7a3fb8..7272c7c99d 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -168,6 +168,12 @@ typedef struct BDRVRawState {
>       bool needs_alignment;
>       bool check_cache_dropped;
>   
> +    struct seek_data_cache {
> +        bool        valid;
> +        uint64_t    start;
> +        uint64_t    end;
> +    } seek_data_cache;
> +
>       PRManager *pr_mgr;
>   } BDRVRawState;
>   
> @@ -1555,8 +1561,17 @@ static int handle_aiocb_write_zeroes_unmap(void *opaque)
>   {
>       RawPosixAIOData *aiocb = opaque;
>       BDRVRawState *s G_GNUC_UNUSED = aiocb->bs->opaque;
> +    struct seek_data_cache *sdc;
>       int ret;
>   
> +    /* Invalidate seek_data_cache if it overlaps */
> +    sdc = &s->seek_data_cache;
> +    if (sdc->valid && !(sdc->end < aiocb->aio_offset ||
> +                        sdc->start > aiocb->aio_offset + aiocb->aio_nbytes))

to be precise: <= and >=
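
I.e., treating the cached range as half-open [start, end), the test would
become (sketch only):

    if (sdc->valid && !(sdc->end <= aiocb->aio_offset ||
                        sdc->start >= aiocb->aio_offset + aiocb->aio_nbytes))
    {
        sdc->valid = false;
    }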

> +    {
> +        sdc->valid = false;
> +    }
> +
>       /* First try to write zeros and unmap at the same time */
>   


Why not drop the cache in handle_aiocb_write_zeroes() as well? Otherwise, we'll return DATA
for these regions, which may be unallocated and read as zero, if I'm not mistaken.


>   #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> @@ -1634,11 +1649,20 @@ static int handle_aiocb_discard(void *opaque)
>       RawPosixAIOData *aiocb = opaque;
>       int ret = -EOPNOTSUPP;
>       BDRVRawState *s = aiocb->bs->opaque;
> +    struct seek_data_cache *sdc;
>   
>       if (!s->has_discard) {
>           return -ENOTSUP;
>       }
>   
> +    /* Invalidate seek_data_cache if it overlaps */
> +    sdc = &s->seek_data_cache;
> +    if (sdc->valid && !(sdc->end < aiocb->aio_offset ||
> +                        sdc->start > aiocb->aio_offset + aiocb->aio_nbytes))

and <= and >=

and if we add the same to handle_aiocb_write_zeroes(), then it is worth creating a helper
function to invalidate the cache.

> +    {
> +        sdc->valid = false;
> +    }
> +
>       if (aiocb->aio_type & QEMU_AIO_BLKDEV) {
>   #ifdef BLKDISCARD
>           do {
> @@ -2424,6 +2448,8 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
>                                               int64_t *map,
>                                               BlockDriverState **file)
>   {
> +    BDRVRawState *s = bs->opaque;
> +    struct seek_data_cache *sdc;
>       off_t data = 0, hole = 0;
>       int ret;
>   
> @@ -2439,6 +2465,14 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
>           return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
>       }
>   
> +    sdc = &s->seek_data_cache;
> +    if (sdc->valid && sdc->start <= offset && sdc->end > offset) {
> +        *pnum = MIN(bytes, sdc->end - offset);
> +        *map = offset;
> +        *file = bs;
> +        return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
> +    }
> +
>       ret = find_allocation(bs, offset, &data, &hole);
>       if (ret == -ENXIO) {
>           /* Trailing hole */
> @@ -2451,14 +2485,27 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
>       } else if (data == offset) {
>           /* On a data extent, compute bytes to the end of the extent,
>            * possibly including a partial sector at EOF. */
> -        *pnum = MIN(bytes, hole - offset);
> +        *pnum = hole - offset;

hmm, why? At least you didn't mention it in the commit message...

>           ret = BDRV_BLOCK_DATA;
>       } else {
>           /* On a hole, compute bytes to the beginning of the next extent.  */
>           assert(hole == offset);
> -        *pnum = MIN(bytes, data - offset);
> +        *pnum = data - offset;
>           ret = BDRV_BLOCK_ZERO;
>       }
> +
> +    /* Caching allocated ranges is okay even if another process writes to the
> +     * same file because we allow declaring things allocated even if there is a
> +     * hole. However, we cannot cache holes without risking corruption. */
> +    if (ret == BDRV_BLOCK_DATA) {
> +        *sdc = (struct seek_data_cache) {
> +            .valid  = true,
> +            .start  = offset,
> +            .end    = offset + *pnum,
> +        };
> +    }
> +
> +    *pnum = MIN(*pnum, bytes);
>       *map = offset;
>       *file = bs;
>       return ret | BDRV_BLOCK_OFFSET_VALID;
> 


-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [PATCH] file-posix: Cache lseek result for data regions
  2019-01-24 16:18 ` Vladimir Sementsov-Ogievskiy
@ 2019-01-24 16:36   ` Kevin Wolf
  2019-01-25  9:13     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 15+ messages in thread
From: Kevin Wolf @ 2019-01-24 16:36 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy; +Cc: qemu-block, mreitz, eblake, qemu-devel

Am 24.01.2019 um 17:18 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 24.01.2019 17:17, Kevin Wolf wrote:
> > Depending on the exact image layout and the storage backend (tmpfs is
> > konwn to have very slow SEEK_HOLE/SEEK_DATA), caching lseek results can
> > save us a lot of time e.g. during a mirror block job or qemu-img convert
> > with a fragmented source image (.bdrv_co_block_status on the protocol
> > layer can be called for every single cluster in the extreme case).
> > 
> > We may only cache data regions because of possible concurrent writers.
> > This means that we can later treat a recently punched hole as data, but
> > this is safe. We can't cache holes because then we might treat recently
> > written data as holes, which can cause corruption.
> > 
> > Signed-off-by: Kevin Wolf <kwolf@redhat.com>

> > @@ -1555,8 +1561,17 @@ static int handle_aiocb_write_zeroes_unmap(void *opaque)
> >   {
> >       RawPosixAIOData *aiocb = opaque;
> >       BDRVRawState *s G_GNUC_UNUSED = aiocb->bs->opaque;
> > +    struct seek_data_cache *sdc;
> >       int ret;
> >   
> > +    /* Invalidate seek_data_cache if it overlaps */
> > +    sdc = &s->seek_data_cache;
> > +    if (sdc->valid && !(sdc->end < aiocb->aio_offset ||
> > +                        sdc->start > aiocb->aio_offset + aiocb->aio_nbytes))
> 
> to be presize: <= and >=

Yes, you're right.

> > +    {
> > +        sdc->valid = false;
> > +    }
> > +
> >       /* First try to write zeros and unmap at the same time */
> >   
> 
> 
> Why not to drop cache on handle_aiocb_write_zeroes()? Otherwise, we'll return DATA
> for these regions which may unallocated read-as-zero, if I'm not mistaken.

handle_aiocb_write_zeroes() is not allowed to unmap things, so we don't
need to invalidate the cache there.

> >   #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> > @@ -1634,11 +1649,20 @@ static int handle_aiocb_discard(void *opaque)
> >       RawPosixAIOData *aiocb = opaque;
> >       int ret = -EOPNOTSUPP;
> >       BDRVRawState *s = aiocb->bs->opaque;
> > +    struct seek_data_cache *sdc;
> >   
> >       if (!s->has_discard) {
> >           return -ENOTSUP;
> >       }
> >   
> > +    /* Invalidate seek_data_cache if it overlaps */
> > +    sdc = &s->seek_data_cache;
> > +    if (sdc->valid && !(sdc->end < aiocb->aio_offset ||
> > +                        sdc->start > aiocb->aio_offset + aiocb->aio_nbytes))
> 
> and <= and >=
> 
> and if add same to handle_aiocb_write_zeroes(), then it worth to
> create helper function to invalidate cache.

Ok.

> > +    {
> > +        sdc->valid = false;
> > +    }
> > +
> >       if (aiocb->aio_type & QEMU_AIO_BLKDEV) {
> >   #ifdef BLKDISCARD
> >           do {
> > @@ -2424,6 +2448,8 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
> >                                               int64_t *map,
> >                                               BlockDriverState **file)
> >   {
> > +    BDRVRawState *s = bs->opaque;
> > +    struct seek_data_cache *sdc;
> >       off_t data = 0, hole = 0;
> >       int ret;
> >   
> > @@ -2439,6 +2465,14 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
> >           return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
> >       }
> >   
> > +    sdc = &s->seek_data_cache;
> > +    if (sdc->valid && sdc->start <= offset && sdc->end > offset) {
> > +        *pnum = MIN(bytes, sdc->end - offset);
> > +        *map = offset;
> > +        *file = bs;
> > +        return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
> > +    }
> > +
> >       ret = find_allocation(bs, offset, &data, &hole);
> >       if (ret == -ENXIO) {
> >           /* Trailing hole */
> > @@ -2451,14 +2485,27 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
> >       } else if (data == offset) {
> >           /* On a data extent, compute bytes to the end of the extent,
> >            * possibly including a partial sector at EOF. */
> > -        *pnum = MIN(bytes, hole - offset);
> > +        *pnum = hole - offset;
> 
> hmm, why? At least you didn't mention it in commit-message..

We want to cache the whole range returned by lseek(), not just whatever
the raw_co_block_status() caller wanted to know.

For the returned value, *pnum is adjusted to MIN(bytes, *pnum) below...

> >           ret = BDRV_BLOCK_DATA;
> >       } else {
> >           /* On a hole, compute bytes to the beginning of the next extent.  */
> >           assert(hole == offset);
> > -        *pnum = MIN(bytes, data - offset);
> > +        *pnum = data - offset;
> >           ret = BDRV_BLOCK_ZERO;
> >       }
> > +
> > +    /* Caching allocated ranges is okay even if another process writes to the
> > +     * same file because we allow declaring things allocated even if there is a
> > +     * hole. However, we cannot cache holes without risking corruption. */
> > +    if (ret == BDRV_BLOCK_DATA) {
> > +        *sdc = (struct seek_data_cache) {
> > +            .valid  = true,
> > +            .start  = offset,
> > +            .end    = offset + *pnum,
> > +        };
> > +    }
> > +
> > +    *pnum = MIN(*pnum, bytes);

...here.

So what we return doesn't change.

> >       *map = offset;
> >       *file = bs;
> >       return ret | BDRV_BLOCK_OFFSET_VALID;

Kevin


* Re: [Qemu-devel] [PATCH] file-posix: Cache lseek result for data regions
  2019-01-24 16:36   ` Kevin Wolf
@ 2019-01-25  9:13     ` Vladimir Sementsov-Ogievskiy
  2019-01-25 13:26       ` Eric Blake
  0 siblings, 1 reply; 15+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-01-25  9:13 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-block, mreitz, eblake, qemu-devel


24.01.2019 19:36, Kevin Wolf wrote:
> Am 24.01.2019 um 17:18 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> 24.01.2019 17:17, Kevin Wolf wrote:
>>> Depending on the exact image layout and the storage backend (tmpfs is
>>> konwn to have very slow SEEK_HOLE/SEEK_DATA), caching lseek results can
>>> save us a lot of time e.g. during a mirror block job or qemu-img convert
>>> with a fragmented source image (.bdrv_co_block_status on the protocol
>>> layer can be called for every single cluster in the extreme case).
>>>
>>> We may only cache data regions because of possible concurrent writers.
>>> This means that we can later treat a recently punched hole as data, but
>>> this is safe. We can't cache holes because then we might treat recently
>>> written data as holes, which can cause corruption.
>>>
>>> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> 
>>> @@ -1555,8 +1561,17 @@ static int handle_aiocb_write_zeroes_unmap(void *opaque)
>>>    {
>>>        RawPosixAIOData *aiocb = opaque;
>>>        BDRVRawState *s G_GNUC_UNUSED = aiocb->bs->opaque;
>>> +    struct seek_data_cache *sdc;
>>>        int ret;
>>>    
>>> +    /* Invalidate seek_data_cache if it overlaps */
>>> +    sdc = &s->seek_data_cache;
>>> +    if (sdc->valid && !(sdc->end < aiocb->aio_offset ||
>>> +                        sdc->start > aiocb->aio_offset + aiocb->aio_nbytes))
>>
>> to be presize: <= and >=
> 
> Yes, you're right.
> 
>>> +    {
>>> +        sdc->valid = false;
>>> +    }
>>> +
>>>        /* First try to write zeros and unmap at the same time */
>>>    
>>
>>
>> Why not to drop cache on handle_aiocb_write_zeroes()? Otherwise, we'll return DATA
>> for these regions which may unallocated read-as-zero, if I'm not mistaken.
> 
> handle_aiocb_write_zeroes() is not allowed to unmap things, so we don't
> need to invalidate the cache there.

So, you are saying that for fallocated regions we always return just _DATA, without _ZERO?
If so, that's of course bad: it means that convert will have to copy (or at least read
and detect zeroes by hand, if that is enabled) write-zeroed-without-unmap areas.

Let's check (hmm, I had to make qemu-img map usable inside qemu-io, a patch for that is
attached; I also added printf("%s\n", __func__) to handle_aiocb_write_zeroes_unmap and
handle_aiocb_write_zeroes):

Let's test:
]# cat test
./qemu-img create -f raw x 1M

./qemu-io -f raw x <<CMDS
write 0 1M
map
write -z 100K 100K
map
write -z -u 500K 100K
map
CMDS

rm -rf x



before your patch:
]# ./test
Formatting 'x', fmt=raw size=1048576
qemu-io> wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0523 sec (19.093 MiB/sec and 19.0927 ops/sec)
qemu-io> [{ "start": 0, "length": 1048576, "depth": 0, "zero": false, "data": true, "offset": 0}]
qemu-io> handle_aiocb_write_zeroes
wrote 102400/102400 bytes at offset 102400
100 KiB, 1 ops; 0.0165 sec (5.898 MiB/sec and 60.3974 ops/sec)
qemu-io> [{ "start": 0, "length": 102400, "depth": 0, "zero": false, "data": true, "offset": 0},
{ "start": 102400, "length": 102400, "depth": 0, "zero": true, "data": false, "offset": 102400},
{ "start": 204800, "length": 843776, "depth": 0, "zero": false, "data": true, "offset": 204800}]
qemu-io> handle_aiocb_write_zeroes_unmap
wrote 102400/102400 bytes at offset 512000
100 KiB, 1 ops; 0.0001 sec (545.566 MiB/sec and 5586.5922 ops/sec)
qemu-io> [{ "start": 0, "length": 102400, "depth": 0, "zero": false, "data": true, "offset": 0},
{ "start": 102400, "length": 102400, "depth": 0, "zero": true, "data": false, "offset": 102400},
{ "start": 204800, "length": 307200, "depth": 0, "zero": false, "data": true, "offset": 204800},
{ "start": 512000, "length": 102400, "depth": 0, "zero": true, "data": false, "offset": 512000},
{ "start": 614400, "length": 434176, "depth": 0, "zero": false, "data": true, "offset": 614400}]



after your patch:
# ./test
Formatting 'x', fmt=raw size=1048576
qemu-io> wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0768 sec (13.019 MiB/sec and 13.0195 ops/sec)
qemu-io> [{ "start": 0, "length": 1048576, "depth": 0, "zero": false, "data": true, "offset": 0}]
qemu-io> handle_aiocb_write_zeroes
wrote 102400/102400 bytes at offset 102400
100 KiB, 1 ops; 0.0166 sec (5.883 MiB/sec and 60.2410 ops/sec)
qemu-io> [{ "start": 0, "length": 1048576, "depth": 0, "zero": false, "data": true, "offset": 0}]
qemu-io> handle_aiocb_write_zeroes_unmap
wrote 102400/102400 bytes at offset 512000
100 KiB, 1 ops; 0.0002 sec (469.501 MiB/sec and 4807.6923 ops/sec)
qemu-io> [{ "start": 0, "length": 102400, "depth": 0, "zero": false, "data": true, "offset": 0},
{ "start": 102400, "length": 102400, "depth": 0, "zero": true, "data": false, "offset": 102400},
{ "start": 204800, "length": 307200, "depth": 0, "zero": false, "data": true, "offset": 204800},
{ "start": 512000, "length": 102400, "depth": 0, "zero": true, "data": false, "offset": 512000},
{ "start": 614400, "length": 434176, "depth": 0, "zero": false, "data": true, "offset": 614400}]


So, you've changed the behavior of block_status after write_zeroes without UNMAP for the worse.

Hmm, should I prepare a patch for qemu-io? qemu-img map is definitely better.

> 
>>>    #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
>>> @@ -1634,11 +1649,20 @@ static int handle_aiocb_discard(void *opaque)
>>>        RawPosixAIOData *aiocb = opaque;
>>>        int ret = -EOPNOTSUPP;
>>>        BDRVRawState *s = aiocb->bs->opaque;
>>> +    struct seek_data_cache *sdc;
>>>    
>>>        if (!s->has_discard) {
>>>            return -ENOTSUP;
>>>        }
>>>    
>>> +    /* Invalidate seek_data_cache if it overlaps */
>>> +    sdc = &s->seek_data_cache;
>>> +    if (sdc->valid && !(sdc->end < aiocb->aio_offset ||
>>> +                        sdc->start > aiocb->aio_offset + aiocb->aio_nbytes))
>>
>> and <= and >=
>>
>> and if add same to handle_aiocb_write_zeroes(), then it worth to
>> create helper function to invalidate cache.
> 
> Ok.
> 
>>> +    {
>>> +        sdc->valid = false;
>>> +    }
>>> +
>>>        if (aiocb->aio_type & QEMU_AIO_BLKDEV) {
>>>    #ifdef BLKDISCARD
>>>            do {
>>> @@ -2424,6 +2448,8 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
>>>                                                int64_t *map,
>>>                                                BlockDriverState **file)
>>>    {
>>> +    BDRVRawState *s = bs->opaque;
>>> +    struct seek_data_cache *sdc;
>>>        off_t data = 0, hole = 0;
>>>        int ret;
>>>    
>>> @@ -2439,6 +2465,14 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
>>>            return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
>>>        }
>>>    
>>> +    sdc = &s->seek_data_cache;
>>> +    if (sdc->valid && sdc->start <= offset && sdc->end > offset) {
>>> +        *pnum = MIN(bytes, sdc->end - offset);
>>> +        *map = offset;
>>> +        *file = bs;
>>> +        return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
>>> +    }
>>> +
>>>        ret = find_allocation(bs, offset, &data, &hole);
>>>        if (ret == -ENXIO) {
>>>            /* Trailing hole */
>>> @@ -2451,14 +2485,27 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
>>>        } else if (data == offset) {
>>>            /* On a data extent, compute bytes to the end of the extent,
>>>             * possibly including a partial sector at EOF. */
>>> -        *pnum = MIN(bytes, hole - offset);
>>> +        *pnum = hole - offset;
>>
>> hmm, why? At least you didn't mention it in commit-message..
> 
> We want to cache the whole range returned by lseek(), not just whatever
> the raw_co_block_status() caller wanted to know.
> 
> For the returned value, *pnum is adjusted to MIN(bytes, *pnum) below...

Oops, that was a stupid question, sorry :(

> 
>>>            ret = BDRV_BLOCK_DATA;
>>>        } else {
>>>            /* On a hole, compute bytes to the beginning of the next extent.  */
>>>            assert(hole == offset);
>>> -        *pnum = MIN(bytes, data - offset);
>>> +        *pnum = data - offset;
>>>            ret = BDRV_BLOCK_ZERO;
>>>        }
>>> +
>>> +    /* Caching allocated ranges is okay even if another process writes to the
>>> +     * same file because we allow declaring things allocated even if there is a
>>> +     * hole. However, we cannot cache holes without risking corruption. */
>>> +    if (ret == BDRV_BLOCK_DATA) {
>>> +        *sdc = (struct seek_data_cache) {
>>> +            .valid  = true,
>>> +            .start  = offset,
>>> +            .end    = offset + *pnum,
>>> +        };
>>> +    }
>>> +
>>> +    *pnum = MIN(*pnum, bytes);
> 
> ...here.
> 
> So what we return doesn't change.
> 
>>>        *map = offset;
>>>        *file = bs;
>>>        return ret | BDRV_BLOCK_OFFSET_VALID;
> 
> Kevin
> 


-- 
Best regards,
Vladimir

[-- Attachment #2: 0001-my.patch --]
[-- Type: text/x-patch; name="0001-my.patch", Size: 13440 bytes --]

From 8f52aa40514a290d7770ef2eba3ee3c8b93c1cab Mon Sep 17 00:00:00 2001
From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Date: Fri, 25 Jan 2019 11:48:05 +0300
Subject: [PATCH 1/2] my

---
 include/qemu-io.h |   7 ++
 qemu-img.c        | 157 +--------------------------------------
 qemu-io-cmds.c    | 182 ++++++++++++++++++++++++++++++++++------------
 3 files changed, 145 insertions(+), 201 deletions(-)

diff --git a/include/qemu-io.h b/include/qemu-io.h
index 7433239372..2abbbb4acf 100644
--- a/include/qemu-io.h
+++ b/include/qemu-io.h
@@ -54,4 +54,11 @@ void qemuio_complete_command(const char *input,
                              void (*fn)(const char *cmd, void *opaque),
                              void *opaque);
 
+typedef enum OutputFormat {
+    OFORMAT_JSON,
+    OFORMAT_HUMAN,
+} OutputFormat;
+
+int qemu_img_do_map(BlockBackend *blk, OutputFormat output_format);
+
 #endif /* QEMU_IO_H */
diff --git a/qemu-img.c b/qemu-img.c
index ad04f59565..ecd708af2d 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -45,6 +45,7 @@
 #include "block/qapi.h"
 #include "crypto/init.h"
 #include "trace/control.h"
+#include "qemu-io.h"
 
 #define QEMU_IMG_VERSION "qemu-img version " QEMU_FULL_VERSION \
                           "\n" QEMU_COPYRIGHT "\n"
@@ -68,11 +69,6 @@ enum {
     OPTION_SHRINK = 266,
 };
 
-typedef enum OutputFormat {
-    OFORMAT_JSON,
-    OFORMAT_HUMAN,
-} OutputFormat;
-
 /* Default to cache=writeback as data integrity is not important for qemu-img */
 #define BDRV_DEFAULT_CACHE "writeback"
 
@@ -2740,128 +2736,12 @@ static int img_info(int argc, char **argv)
     return 0;
 }
 
-static void dump_map_entry(OutputFormat output_format, MapEntry *e,
-                           MapEntry *next)
-{
-    switch (output_format) {
-    case OFORMAT_HUMAN:
-        if (e->data && !e->has_offset) {
-            error_report("File contains external, encrypted or compressed clusters.");
-            exit(1);
-        }
-        if (e->data && !e->zero) {
-            printf("%#-16"PRIx64"%#-16"PRIx64"%#-16"PRIx64"%s\n",
-                   e->start, e->length,
-                   e->has_offset ? e->offset : 0,
-                   e->has_filename ? e->filename : "");
-        }
-        /* This format ignores the distinction between 0, ZERO and ZERO|DATA.
-         * Modify the flags here to allow more coalescing.
-         */
-        if (next && (!next->data || next->zero)) {
-            next->data = false;
-            next->zero = true;
-        }
-        break;
-    case OFORMAT_JSON:
-        printf("%s{ \"start\": %"PRId64", \"length\": %"PRId64","
-               " \"depth\": %"PRId64", \"zero\": %s, \"data\": %s",
-               (e->start == 0 ? "[" : ",\n"),
-               e->start, e->length, e->depth,
-               e->zero ? "true" : "false",
-               e->data ? "true" : "false");
-        if (e->has_offset) {
-            printf(", \"offset\": %"PRId64"", e->offset);
-        }
-        putchar('}');
-
-        if (!next) {
-            printf("]\n");
-        }
-        break;
-    }
-}
-
-static int get_block_status(BlockDriverState *bs, int64_t offset,
-                            int64_t bytes, MapEntry *e)
-{
-    int ret;
-    int depth;
-    BlockDriverState *file;
-    bool has_offset;
-    int64_t map;
-
-    /* As an optimization, we could cache the current range of unallocated
-     * clusters in each file of the chain, and avoid querying the same
-     * range repeatedly.
-     */
-
-    depth = 0;
-    for (;;) {
-        ret = bdrv_block_status(bs, offset, bytes, &bytes, &map, &file);
-        if (ret < 0) {
-            return ret;
-        }
-        assert(bytes);
-        if (ret & (BDRV_BLOCK_ZERO|BDRV_BLOCK_DATA)) {
-            break;
-        }
-        bs = backing_bs(bs);
-        if (bs == NULL) {
-            ret = 0;
-            break;
-        }
-
-        depth++;
-    }
-
-    has_offset = !!(ret & BDRV_BLOCK_OFFSET_VALID);
-
-    *e = (MapEntry) {
-        .start = offset,
-        .length = bytes,
-        .data = !!(ret & BDRV_BLOCK_DATA),
-        .zero = !!(ret & BDRV_BLOCK_ZERO),
-        .offset = map,
-        .has_offset = has_offset,
-        .depth = depth,
-        .has_filename = file && has_offset,
-        .filename = file && has_offset ? file->filename : NULL,
-    };
-
-    return 0;
-}
-
-static inline bool entry_mergeable(const MapEntry *curr, const MapEntry *next)
-{
-    if (curr->length == 0) {
-        return false;
-    }
-    if (curr->zero != next->zero ||
-        curr->data != next->data ||
-        curr->depth != next->depth ||
-        curr->has_filename != next->has_filename ||
-        curr->has_offset != next->has_offset) {
-        return false;
-    }
-    if (curr->has_filename && strcmp(curr->filename, next->filename)) {
-        return false;
-    }
-    if (curr->has_offset && curr->offset + curr->length != next->offset) {
-        return false;
-    }
-    return true;
-}
-
 static int img_map(int argc, char **argv)
 {
     int c;
     OutputFormat output_format = OFORMAT_HUMAN;
     BlockBackend *blk;
-    BlockDriverState *bs;
     const char *filename, *fmt, *output;
-    int64_t length;
-    MapEntry curr = { .length = 0 }, next;
     int ret = 0;
     bool image_opts = false;
     bool force_share = false;
@@ -2940,41 +2820,10 @@ static int img_map(int argc, char **argv)
     if (!blk) {
         return 1;
     }
-    bs = blk_bs(blk);
-
-    if (output_format == OFORMAT_HUMAN) {
-        printf("%-16s%-16s%-16s%s\n", "Offset", "Length", "Mapped to", "File");
-    }
 
-    length = blk_getlength(blk);
-    while (curr.start + curr.length < length) {
-        int64_t offset = curr.start + curr.length;
-        int64_t n;
-
-        /* Probe up to 1 GiB at a time.  */
-        n = MIN(1 << 30, length - offset);
-        ret = get_block_status(bs, offset, n, &next);
-
-        if (ret < 0) {
-            error_report("Could not read file metadata: %s", strerror(-ret));
-            goto out;
-        }
-
-        if (entry_mergeable(&curr, &next)) {
-            curr.length += next.length;
-            continue;
-        }
-
-        if (curr.length > 0) {
-            dump_map_entry(output_format, &curr, &next);
-        }
-        curr = next;
-    }
-
-    dump_map_entry(output_format, &curr, NULL);
-
-out:
+    ret = qemu_img_do_map(blk, output_format);
     blk_unref(blk);
+
     return ret < 0;
 }
 
diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
index 2c39124036..791cedd96d 100644
--- a/qemu-io-cmds.c
+++ b/qemu-io-cmds.c
@@ -1852,77 +1852,165 @@ static const cmdinfo_t alloc_cmd = {
     .oneline    = "checks if offset is allocated in the file",
 };
 
+static void dump_map_entry(OutputFormat output_format, MapEntry *e,
+                           MapEntry *next)
+{
+    switch (output_format) {
+    case OFORMAT_HUMAN:
+        if (e->data && !e->has_offset) {
+            error_report("File contains external, encrypted or compressed clusters.");
+            exit(1);
+        }
+        if (e->data && !e->zero) {
+            printf("%#-16"PRIx64"%#-16"PRIx64"%#-16"PRIx64"%s\n",
+                   e->start, e->length,
+                   e->has_offset ? e->offset : 0,
+                   e->has_filename ? e->filename : "");
+        }
+        /* This format ignores the distinction between 0, ZERO and ZERO|DATA.
+         * Modify the flags here to allow more coalescing.
+         */
+        if (next && (!next->data || next->zero)) {
+            next->data = false;
+            next->zero = true;
+        }
+        break;
+    case OFORMAT_JSON:
+        printf("%s{ \"start\": %"PRId64", \"length\": %"PRId64","
+               " \"depth\": %"PRId64", \"zero\": %s, \"data\": %s",
+               (e->start == 0 ? "[" : ",\n"),
+               e->start, e->length, e->depth,
+               e->zero ? "true" : "false",
+               e->data ? "true" : "false");
+        if (e->has_offset) {
+            printf(", \"offset\": %"PRId64"", e->offset);
+        }
+        putchar('}');
 
-static int map_is_allocated(BlockDriverState *bs, int64_t offset,
-                            int64_t bytes, int64_t *pnum)
-{
-    int64_t num;
-    int num_checked;
-    int ret, firstret;
-
-    num_checked = MIN(bytes, BDRV_REQUEST_MAX_BYTES);
-    ret = bdrv_is_allocated(bs, offset, num_checked, &num);
-    if (ret < 0) {
-        return ret;
+        if (!next) {
+            printf("]\n");
+        }
+        break;
     }
+}
 
-    firstret = ret;
-    *pnum = num;
-
-    while (bytes > 0 && ret == firstret) {
-        offset += num;
-        bytes -= num;
+static int get_block_status(BlockDriverState *bs, int64_t offset,
+                            int64_t bytes, MapEntry *e)
+{
+    int ret;
+    int depth;
+    BlockDriverState *file;
+    bool has_offset;
+    int64_t map;
+
+    /* As an optimization, we could cache the current range of unallocated
+     * clusters in each file of the chain, and avoid querying the same
+     * range repeatedly.
+     */
 
-        num_checked = MIN(bytes, BDRV_REQUEST_MAX_BYTES);
-        ret = bdrv_is_allocated(bs, offset, num_checked, &num);
-        if (ret == firstret && num) {
-            *pnum += num;
-        } else {
+    depth = 0;
+    for (;;) {
+        ret = bdrv_block_status(bs, offset, bytes, &bytes, &map, &file);
+        if (ret < 0) {
+            return ret;
+        }
+        assert(bytes);
+        if (ret & (BDRV_BLOCK_ZERO|BDRV_BLOCK_DATA)) {
             break;
         }
+        bs = backing_bs(bs);
+        if (bs == NULL) {
+            ret = 0;
+            break;
+        }
+
+        depth++;
     }
 
-    return firstret;
+    has_offset = !!(ret & BDRV_BLOCK_OFFSET_VALID);
+
+    *e = (MapEntry) {
+        .start = offset,
+        .length = bytes,
+        .data = !!(ret & BDRV_BLOCK_DATA),
+        .zero = !!(ret & BDRV_BLOCK_ZERO),
+        .offset = map,
+        .has_offset = has_offset,
+        .depth = depth,
+        .has_filename = file && has_offset,
+        .filename = file && has_offset ? file->filename : NULL,
+    };
+
+    return 0;
 }
 
-static int map_f(BlockBackend *blk, int argc, char **argv)
+static inline bool entry_mergeable(const MapEntry *curr, const MapEntry *next)
 {
-    int64_t offset, bytes;
-    char s1[64], s2[64];
-    int64_t num;
-    int ret;
-    const char *retstr;
+    if (curr->length == 0) {
+        return false;
+    }
+    if (curr->zero != next->zero ||
+        curr->data != next->data ||
+        curr->depth != next->depth ||
+        curr->has_filename != next->has_filename ||
+        curr->has_offset != next->has_offset) {
+        return false;
+    }
+    if (curr->has_filename && strcmp(curr->filename, next->filename)) {
+        return false;
+    }
+    if (curr->has_offset && curr->offset + curr->length != next->offset) {
+        return false;
+    }
+    return true;
+}
 
-    offset = 0;
-    bytes = blk_getlength(blk);
-    if (bytes < 0) {
-        error_report("Failed to query image length: %s", strerror(-bytes));
-        return bytes;
+int qemu_img_do_map(BlockBackend *blk, OutputFormat output_format)
+{
+    BlockDriverState *bs = blk_bs(blk);
+    MapEntry curr = { .length = 0 }, next;
+    int64_t length;
+
+    if (output_format == OFORMAT_HUMAN) {
+        printf("%-16s%-16s%-16s%s\n", "Offset", "Length", "Mapped to", "File");
     }
 
-    while (bytes) {
-        ret = map_is_allocated(blk_bs(blk), offset, bytes, &num);
+    length = blk_getlength(blk);
+    while (curr.start + curr.length < length) {
+        int64_t offset = curr.start + curr.length;
+        int64_t n;
+        int ret;
+
+        /* Probe up to 1 GiB at a time.  */
+        n = MIN(1 << 30, length - offset);
+        ret = get_block_status(bs, offset, n, &next);
+
         if (ret < 0) {
-            error_report("Failed to get allocation status: %s", strerror(-ret));
+            error_report("Could not read file metadata: %s", strerror(-ret));
             return ret;
-        } else if (!num) {
-            error_report("Unexpected end of image");
-            return -EIO;
         }
 
-        retstr = ret ? "    allocated" : "not allocated";
-        cvtstr(num, s1, sizeof(s1));
-        cvtstr(offset, s2, sizeof(s2));
-        printf("%s (0x%" PRIx64 ") bytes %s at offset %s (0x%" PRIx64 ")\n",
-               s1, num, retstr, s2, offset);
+        if (entry_mergeable(&curr, &next)) {
+            curr.length += next.length;
+            continue;
+        }
 
-        offset += num;
-        bytes -= num;
+        if (curr.length > 0) {
+            dump_map_entry(output_format, &curr, &next);
+        }
+        curr = next;
     }
 
+    dump_map_entry(output_format, &curr, NULL);
+
     return 0;
 }
 
+static int map_f(BlockBackend *blk, int argc, char **argv)
+{
+    return qemu_img_do_map(blk, OFORMAT_JSON);
+}
+
 static const cmdinfo_t map_cmd = {
        .name           = "map",
        .argmin         = 0,
-- 
2.18.0



* Re: [Qemu-devel] [PATCH] file-posix: Cache lseek result for data regions
  2019-01-24 15:42       ` Kevin Wolf
@ 2019-01-25 10:10         ` Paolo Bonzini
  2019-01-25 10:30           ` Kevin Wolf
  0 siblings, 1 reply; 15+ messages in thread
From: Paolo Bonzini @ 2019-01-25 10:10 UTC (permalink / raw)
  To: Kevin Wolf, Vladimir Sementsov-Ogievskiy
  Cc: qemu-block, mreitz, eblake, qemu-devel

On 24/01/19 16:42, Kevin Wolf wrote:
>> Recently Paolo asked me not to add more users of AioContext lock. Unfortunately
>> I don't understand the whole picture around it.. Doesn't this apply here?
>> https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg03410.html
> I don't know. Honestly I feel nobody except Paolo knows, because we
> don't know his patches yet.

This is true.  On the other hand, the AioContext lock is only used in
some special cases around block jobs and blk_set_aio_context, and in
general the block devices already should not have any dependencies
(unless they crept in without me noticing).

In particular...

> But raw doesn't have an s->lock yet, so I
> think removing the AioContext lock involves some work on it anyway and
> adding this doesn't really change the amount of work.

... BDRVRawState doesn't have any data that changes after open, does it?
 This is why it doesn't have an s->lock.

Paolo


* Re: [Qemu-devel] [PATCH] file-posix: Cache lseek result for data regions
  2019-01-25 10:10         ` Paolo Bonzini
@ 2019-01-25 10:30           ` Kevin Wolf
  2019-02-04 10:17             ` Paolo Bonzini
  0 siblings, 1 reply; 15+ messages in thread
From: Kevin Wolf @ 2019-01-25 10:30 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Vladimir Sementsov-Ogievskiy, qemu-block, mreitz, eblake, qemu-devel

Am 25.01.2019 um 11:10 hat Paolo Bonzini geschrieben:
> On 24/01/19 16:42, Kevin Wolf wrote:
> >> Recently Paolo asked me not to add more users of AioContext lock. Unfortunately
> >> I don't understand the whole picture around it.. Doesn't this apply here?
> >> https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg03410.html
> > I don't know. Honestly I feel nobody except Paolo knows, because we
> > don't know his patches yet.
> 
> This is true.  On the other hand, the AioContext lock is only used in
> some special cases around block jobs and blk_set_aio_context, and in
> general the block devices already should not have any dependencies
> (unless they crept in without me noticing).

It's also used in those cases where coroutines don't need locking, but
threads would. Did you audit all of the drivers for such cases?

> In particular...
> 
> > But raw doesn't have an s->lock yet, so I
> > think removing the AioContext lock involves some work on it anyway and
> > adding this doesn't really change the amount of work.
> 
> ... BDRVRawState doesn't have any data that changes after open, does it?
>  This is why it doesn't have an s->lock.

No important data anyway. We do things like setting s->has_write_zeroes
= false after failure, but if we get a race and end up trying twice
before disabling it, it doesn't really hurt either.

Then there is reopen, but that involves a drain anyway. And that's it
probably.

So do you think I should introduce a CoMutex for raw here? Or QemuMutex?
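
For illustration, a QemuMutex-based variant could look roughly like the
sketch below. It is not part of the patch: the field and helper names are
invented, the mutex would need a qemu_mutex_init() in raw_open(), and it
assumes the handle_aiocb_* callbacks still run in thread-pool workers
(where a CoMutex could not be taken):

/* Hypothetical: s->seek_cache_lock is a QemuMutex protecting s->seek_data_cache */
static void raw_seek_cache_invalidate(BDRVRawState *s, uint64_t offset,
                                      uint64_t bytes)
{
    struct seek_data_cache *sdc = &s->seek_data_cache;

    qemu_mutex_lock(&s->seek_cache_lock);
    if (sdc->valid && !(sdc->end <= offset || sdc->start >= offset + bytes)) {
        sdc->valid = false;
    }
    qemu_mutex_unlock(&s->seek_cache_lock);
}

static bool raw_seek_cache_lookup(BDRVRawState *s, uint64_t offset, uint64_t *end)
{
    struct seek_data_cache *sdc = &s->seek_data_cache;
    bool hit;

    qemu_mutex_lock(&s->seek_cache_lock);
    hit = sdc->valid && sdc->start <= offset && sdc->end > offset;
    if (hit) {
        *end = sdc->end;
    }
    qemu_mutex_unlock(&s->seek_cache_lock);
    return hit;
}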

Kevin


* Re: [Qemu-devel] [PATCH] file-posix: Cache lseek result for data regions
  2019-01-25  9:13     ` Vladimir Sementsov-Ogievskiy
@ 2019-01-25 13:26       ` Eric Blake
  0 siblings, 0 replies; 15+ messages in thread
From: Eric Blake @ 2019-01-25 13:26 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, Kevin Wolf; +Cc: qemu-block, mreitz, qemu-devel


On 1/25/19 3:13 AM, Vladimir Sementsov-Ogievskiy wrote:

> 
> before your patch:
> ]# ./test
> Formatting 'x', fmt=raw size=1048576
> qemu-io> wrote 1048576/1048576 bytes at offset 0
> 1 MiB, 1 ops; 0.0523 sec (19.093 MiB/sec and 19.0927 ops/sec)
> qemu-io> [{ "start": 0, "length": 1048576, "depth": 0, "zero": false, "data": true, "offset": 0}]
> qemu-io> handle_aiocb_write_zeroes
> wrote 102400/102400 bytes at offset 102400
> 100 KiB, 1 ops; 0.0165 sec (5.898 MiB/sec and 60.3974 ops/sec)
> qemu-io> [{ "start": 0, "length": 102400, "depth": 0, "zero": false, "data": true, "offset": 0},
> { "start": 102400, "length": 102400, "depth": 0, "zero": true, "data": false, "offset": 102400},
> { "start": 204800, "length": 843776, "depth": 0, "zero": false, "data": true, "offset": 204800}]
> qemu-io> handle_aiocb_write_zeroes_unmap
> wrote 102400/102400 bytes at offset 512000
> 100 KiB, 1 ops; 0.0001 sec (545.566 MiB/sec and 5586.5922 ops/sec)
> qemu-io> [{ "start": 0, "length": 102400, "depth": 0, "zero": false, "data": true, "offset": 0},
> { "start": 102400, "length": 102400, "depth": 0, "zero": true, "data": false, "offset": 102400},
> { "start": 204800, "length": 307200, "depth": 0, "zero": false, "data": true, "offset": 204800},
> { "start": 512000, "length": 102400, "depth": 0, "zero": true, "data": false, "offset": 512000},
> { "start": 614400, "length": 434176, "depth": 0, "zero": false, "data": true, "offset": 614400}]

Your pre-patch demonstration shows that both 'write -z' and 'write -z
-u' produced areas of the raw image that were marked as known zeroes
and claim 'data':false, meaning that those two areas are sparse (that
is, it appears 'write -z' managed to unmap after all?)

> 
> 
> 
> after your patch:
> # ./test
> Formatting 'x', fmt=raw size=1048576
> qemu-io> wrote 1048576/1048576 bytes at offset 0
> 1 MiB, 1 ops; 0.0768 sec (13.019 MiB/sec and 13.0195 ops/sec)
> qemu-io> [{ "start": 0, "length": 1048576, "depth": 0, "zero": false, "data": true, "offset": 0}]
> qemu-io> handle_aiocb_write_zeroes
> wrote 102400/102400 bytes at offset 102400
> 100 KiB, 1 ops; 0.0166 sec (5.883 MiB/sec and 60.2410 ops/sec)
> qemu-io> [{ "start": 0, "length": 1048576, "depth": 0, "zero": false, "data": true, "offset": 0}]

So here, the cache was not invalidated, so the 'write -z' area was
temporarily reported as data...

> qemu-io> handle_aiocb_write_zeroes_unmap
> wrote 102400/102400 bytes at offset 512000
> 100 KiB, 1 ops; 0.0002 sec (469.501 MiB/sec and 4807.6923 ops/sec)
> qemu-io> [{ "start": 0, "length": 102400, "depth": 0, "zero": false, "data": true, "offset": 0},
> { "start": 102400, "length": 102400, "depth": 0, "zero": true, "data": false, "offset": 102400},
> { "start": 204800, "length": 307200, "depth": 0, "zero": false, "data": true, "offset": 204800},
> { "start": 512000, "length": 102400, "depth": 0, "zero": true, "data": false, "offset": 512000},
> { "start": 614400, "length": 434176, "depth": 0, "zero": false, "data": true, "offset": 614400}]

but after the 'write -z -u' invalidated the cache, we once again see
that the 'write -z' worked.
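
For illustration, the invalidation that seems to be missing could look
roughly like this (the helper name is made up; the patch already
open-codes the same overlap check in its unmap and discard hunks):

/* Hypothetical helper; the patch open-codes this check elsewhere. */
static void raw_invalidate_seek_cache(BDRVRawState *s,
                                      uint64_t offset, uint64_t bytes)
{
    struct seek_data_cache *sdc = &s->seek_data_cache;

    /* Drop the cached data region if the written/zeroed range overlaps it */
    if (sdc->valid && !(sdc->end < offset || sdc->start > offset + bytes)) {
        sdc->valid = false;
    }
}

static int handle_aiocb_write_zeroes(void *opaque)
{
    RawPosixAIOData *aiocb = opaque;
    BDRVRawState *s = aiocb->bs->opaque;

    /* Same invalidation as the unmap path, so a plain 'write -z' that
     * happens to punch a hole is not later reported as data from the
     * cache. */
    raw_invalidate_seek_cache(s, aiocb->aio_offset, aiocb->aio_nbytes);

    /* ... existing zero-writing logic unchanged ... */
    return 0;
}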

> 
> 
> So, you've changed behavior of block_status after write_zeroes without UNMAP for the worse.
> 
> Hmm, should I prepare patch for qemu-io? qemu-img map is definitely better.

qemu-io map is NOT asking for the same information as qemu-img map (it's
annoying - but BOTH pieces of information are useful, and there are some
iotests that check both outputs to get a full picture of things).
qemu-img asks for as much information as possible about all layers, while
qemu-io asks only about the top layer.

That said, qemu-io is NOT set in stone; if you want to patch it AND
fix the iotest fallout, I'm not opposed to that.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



* Re: [Qemu-devel] [PATCH] file-posix: Cache lseek result for data regions
  2019-01-24 15:56 ` Eric Blake
@ 2019-01-29 10:56   ` Kevin Wolf
  2019-01-29 21:03     ` Eric Blake
  0 siblings, 1 reply; 15+ messages in thread
From: Kevin Wolf @ 2019-01-29 10:56 UTC (permalink / raw)
  To: Eric Blake; +Cc: qemu-block, vsementsov, mreitz, qemu-devel

Am 24.01.2019 um 16:56 hat Eric Blake geschrieben:
> On 1/24/19 8:17 AM, Kevin Wolf wrote:
> > Depending on the exact image layout and the storage backend (tmpfs is
> > known to have very slow SEEK_HOLE/SEEK_DATA), caching lseek results can
> > save us a lot of time e.g. during a mirror block job or qemu-img convert
> > with a fragmented source image (.bdrv_co_block_status on the protocol
> > layer can be called for every single cluster in the extreme case).
> > 
> > We may only cache data regions because of possible concurrent writers.
> > This means that we can later treat a recently punched hole as data, but
> > this is safe. We can't cache holes because then we might treat recently
> > written data as holes, which can cause corruption.
> 
> gluster copies heavily from file-posix's implementation; should it also
> copy this cache of known-data?  Should NBD also cache known-data when
> NBD_CMD_BLOCK_STATUS is available?

This almost suggests that we should do the caching in generic block
layer code.

It would require that we can return a *pnum from the block driver that
is larger than the requested bytes, but it looks like
raw_co_block_status() already handles this? We just don't seem to do
this yet in the block drivers.

If we want to cache for all drivers, however, the question is whether
there are drivers that can transition a block from data to hole without
a discard operation, so that we would have to invalidate the cache in
more places. One thing that comes to mind is loading an internal
snapshot for qcow2.

Or maybe we need to make this opt-in for drivers, with a bool flag in
BlockDriver?
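
For what it's worth, such an opt-in could look roughly like this (the
field name is invented; nothing like it exists in BlockDriver today):

struct BlockDriver {
    /* ... existing fields ... */

    /*
     * If true, the generic block layer may cache data regions returned
     * by .bdrv_co_block_status for this driver.  Drivers that can turn
     * data into holes without going through a discard or write-zeroes
     * request (e.g. by loading an internal snapshot) would either leave
     * this false or invalidate the cache explicitly in those code paths.
     */
    bool supports_block_status_cache;
};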

Kevin


* Re: [Qemu-devel] [PATCH] file-posix: Cache lseek result for data regions
  2019-01-29 10:56   ` Kevin Wolf
@ 2019-01-29 21:03     ` Eric Blake
  0 siblings, 0 replies; 15+ messages in thread
From: Eric Blake @ 2019-01-29 21:03 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-block, vsementsov, mreitz, qemu-devel

On 1/29/19 4:56 AM, Kevin Wolf wrote:

>> gluster copies heavily from file-posix's implementation; should it also
>> copy this cache of known-data?  Should NBD also cache known-data when
>> NBD_CMD_BLOCK_STATUS is available?
> 
> This almost suggests that we should do the caching in generic block
> layer code.
> 
> It would require that we can return a *pnum from the block driver that
> is larger than the requested bytes, but it looks like
> raw_co_block_status() already handles this? We just don't seem to do
> this yet in the block drivers.

The code in io.c bdrv_co_block_status() currently does one final
clamp-down to limit the answer to the caller's maximum request, but I
don't know whether any drivers actually take advantage of passing back
more than was requested. I _do_ know that the NBD protocol took pains to
ensure that NBD_CMD_BLOCK_STATUS is permitted to return a value beyond
the caller's request when that information is easily obtained, precisely
because letting the caller cache knowledge of a data section that extends
beyond the current query can minimize the need for future block status
calls.
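
To sketch the shape of that idea (all names here are invented; this is
not the actual io.c code):

typedef struct DataRegionCache {    /* invented, for illustration only */
    bool valid;
    int64_t start;
    int64_t end;
} DataRegionCache;

static void remember_then_clamp(DataRegionCache *cache, int64_t offset,
                                int64_t requested, int64_t *pnum, int ret)
{
    if (ret & BDRV_BLOCK_DATA) {
        /* Keep the full extent the driver reported, even past the request */
        cache->valid = true;
        cache->start = offset;
        cache->end = offset + *pnum;
    }
    /* Only now clamp the answer handed back to the caller */
    *pnum = MIN(*pnum, requested);
}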

> 
> If we want to cache for all drivers, however, the question is whether
> there are drivers that can transition a block from data to hole without
> a discard operation, so that we would have to invalidate the cache in
> more places. One thing that comes to mind is loading an internal
> snapshot for qcow2.

Oh, good point - switching to a different L1 table (when loading an
internal snapshot) can indeed turn a range that used to read as data into
a hole, so if the block layer caches data ranges, it also needs to provide
a hook for drivers to invalidate the cache when performing such actions.
Still, I can't think of any place where a hole spontaneously appears
unless a specific driver action is taken (so the driver should have the
opportunity to invalidate the cache during that action), or if an image
is in active use by more than just the qemu process.  And if the driver
knows that an image might be shared with external processes modifying
the image, then yes, maybe having a way to opt out of caching altogether
is also appropriate.
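
As a rough sketch of where such a hook would have to be called (the hook
itself is hypothetical; qcow2_snapshot_goto() is the existing
snapshot-load entry point):

/* Hypothetical hook; nothing like this exists in the tree today. */
void bdrv_invalidate_block_status_cache(BlockDriverState *bs);

static int qcow2_snapshot_goto(BlockDriverState *bs, const char *snapshot_id)
{
    /* ... switch the active L1 table to the snapshot's L1 table ... */

    /*
     * Ranges that used to read as data may now be holes, so any cached
     * data regions for this BDS are stale and must be dropped.
     */
    bdrv_invalidate_block_status_cache(bs);

    return 0;
}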

> 
> Or maybe we need to make this opt-in for drivers, with a bool flag in
> BlockDriver?
> 
> Kevin
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



* Re: [Qemu-devel] [PATCH] file-posix: Cache lseek result for data regions
  2019-01-25 10:30           ` Kevin Wolf
@ 2019-02-04 10:17             ` Paolo Bonzini
  0 siblings, 0 replies; 15+ messages in thread
From: Paolo Bonzini @ 2019-02-04 10:17 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Vladimir Sementsov-Ogievskiy, qemu-block, mreitz, eblake, qemu-devel

On 25/01/19 11:30, Kevin Wolf wrote:
>> On the other hand, the AioContext lock is only used in
>> some special cases around block jobs and blk_set_aio_context, and in
>> general the block devices already should not have any dependencies
>> (unless they crept in without me noticing).
> It's also used in those cases where coroutines don't need locking, but
> threads would. Did you audit all of the drivers for such cases?

I did and the drivers already have a QemuMutex if that's the case (e.g.
curl, iscsi).

>> In particular...
>>
>>> But raw doesn't have an s->lock yet, so I
>>> think removing the AioContext lock involves some work on it anyway and
>>> adding this doesn't really change the amount of work.
>> ... BDRVRawState doesn't have any data that changes after open, does it?
>>  This is why it doesn't have an s->lock.
> No important data anyway. We do things like setting s->has_write_zeroes
> = false after failure, but if we got a race and end up trying twice
> before disabling it, it doesn't really hurt either.
> 
> Then there is reopen, but that involves a drain anyway. And that's it
> probably.
> 
> So do you think I should introduce a CoMutex for raw here? Or QemuMutex?

For the cache you can introduce either a CoMutex or a QemuMutex; it's the
same.
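
A minimal sketch of what that could look like in file-posix (the lock
field is hypothetical; seek_data_cache is the structure added by the
patch):

/* Look up a cached data region; returns true and sets *pnum on a hit. */
static bool coroutine_fn seek_cache_lookup(BDRVRawState *s,
                                           uint64_t offset, uint64_t bytes,
                                           int64_t *pnum)
{
    struct seek_data_cache *sdc = &s->seek_data_cache;
    bool hit = false;

    qemu_co_mutex_lock(&s->lock);   /* qemu_co_mutex_init() in raw_open() */
    if (sdc->valid && sdc->start <= offset && sdc->end > offset) {
        *pnum = MIN(bytes, sdc->end - offset);
        hit = true;
    }
    qemu_co_mutex_unlock(&s->lock);

    return hit;
}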

Paolo


end of thread, other threads:[~2019-02-04 10:18 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-24 14:17 [Qemu-devel] [PATCH] file-posix: Cache lseek result for data regions Kevin Wolf
2019-01-24 14:40 ` Vladimir Sementsov-Ogievskiy
2019-01-24 15:11   ` Kevin Wolf
2019-01-24 15:22     ` Vladimir Sementsov-Ogievskiy
2019-01-24 15:42       ` Kevin Wolf
2019-01-25 10:10         ` Paolo Bonzini
2019-01-25 10:30           ` Kevin Wolf
2019-02-04 10:17             ` Paolo Bonzini
2019-01-24 15:56 ` Eric Blake
2019-01-29 10:56   ` Kevin Wolf
2019-01-29 21:03     ` Eric Blake
2019-01-24 16:18 ` Vladimir Sementsov-Ogievskiy
2019-01-24 16:36   ` Kevin Wolf
2019-01-25  9:13     ` Vladimir Sementsov-Ogievskiy
2019-01-25 13:26       ` Eric Blake
