All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: [Qemu-devel] [RFC PATCH] hw/arm/virt: use variable size of flash device to save memory
       [not found]             ` <8736n9eb4s.fsf@dusky.pond.sub.org>
@ 2019-04-03 14:12               ` Xiang Zheng
  2019-04-03 15:35                 ` Laszlo Ersek
  0 siblings, 1 reply; 25+ messages in thread
From: Xiang Zheng @ 2019-04-03 14:12 UTC (permalink / raw)
  To: Markus Armbruster, Laszlo Ersek
  Cc: Peter Maydell, Ard Biesheuvel, QEMU Developers, qemu-arm,
	Heyi Guo, wanghaibin.wang

Hi Laszlo and Markus,

Thanks for your useful suggestions and comments! :)

On 2019/3/27 2:36, Markus Armbruster wrote:
> Laszlo Ersek <lersek@redhat.com> writes:
> 
>> On 03/26/19 17:39, Markus Armbruster wrote:
>>> Laszlo Ersek <lersek@redhat.com> writes:
>>
>>>> With the dynamic sizing in QEMU (which, IIRC, I had originally
>>>> introduced still in the 1MB times, due to the split between the
>>>> executable and varstore parts), both the 1MB->2MB switch, and the
>>>> 2MB->4MB switch in the firmware caused zero pain in QEMU. And right now,
>>>> 4MB looks like a "sweet spot", with some elbow room left.
>>>
>>> Explicit configuration would've been exactly as painless.  Even with
>>> pflash sizes restricted to powers of two.
>>
>> I wrote the patch that ended up as commit 637a5acb46b3 -- with your R-b
>> -- in 2013. I'm unsure if machine type properties existed back then, but
>> even if they did, do you think I knew about them? :)
>>
>> You are right, of course; it's just that we can't tell the future.
> 
> True!  All we can do is continue to design as well as we can given the
> information, experience and resources we have, and when the inevitable
> design mistakes become apparent, limit their impact.
> 
> Some of the things we now consider mistakes we just didn't see.  Others
> we saw (e.g. multiple pflash devices, unlike physical hardware), but
> underestimated their impact.
> 

I thought about your comments and wrote the following patch (just for test)
which uses a file mapping to replace the anonymous mapping. UEFI seems to work
fine. So why not use a file mapping to read or write a pflash device?

Except this way, I don't know how to share the pflash memory among VMs or
reduce the consumption for resource. :(

---8>---

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index ce2664a..12c78f2 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -898,6 +898,7 @@ static void create_one_flash(const char *name, hwaddr flashbase,
     qdev_prop_set_uint16(dev, "id2", 0x00);
     qdev_prop_set_uint16(dev, "id3", 0x00);
     qdev_prop_set_string(dev, "name", name);
+    qdev_prop_set_bit(dev, "prealloc", false);
     qdev_init_nofail(dev);

     memory_region_add_subregion(sysmem, flashbase,
diff --git a/hw/block/pflash_cfi01.c b/hw/block/pflash_cfi01.c
index 16dfae1..23a85bc 100644
--- a/hw/block/pflash_cfi01.c
+++ b/hw/block/pflash_cfi01.c
@@ -92,6 +92,7 @@ struct PFlashCFI01 {
     void *storage;
     VMChangeStateEntry *vmstate;
     bool old_multiple_chip_handling;
+    bool prealloc;
 };

 static int pflash_post_load(void *opaque, int version_id);
@@ -731,11 +732,21 @@ static void pflash_cfi01_realize(DeviceState *dev, Error **errp)
     }
     device_len = sector_len_per_device * blocks_per_device;

-    memory_region_init_rom_device(
-        &pfl->mem, OBJECT(dev),
-        &pflash_cfi01_ops,
-        pfl,
-        pfl->name, total_len, &local_err);
+    if (pfl->blk && !pfl->prealloc) {
+        memory_region_init_rom_device_from_file(
+            &pfl->mem, OBJECT(dev),
+            &pflash_cfi01_ops,
+            pfl,
+            pfl->name, total_len,
+            blk_is_read_only(pfl->blk) ? RAM_SHARED : RAM_PMEM,
+            blk_bs(pfl->blk)->filename, &local_err);
+    } else {
+        memory_region_init_rom_device(
+            &pfl->mem, OBJECT(dev),
+            &pflash_cfi01_ops,
+            pfl,
+            pfl->name, total_len, &local_err);
+    }
     if (local_err) {
         error_propagate(errp, local_err);
         return;
@@ -899,6 +910,7 @@ static Property pflash_cfi01_properties[] = {
     DEFINE_PROP_STRING("name", PFlashCFI01, name),
     DEFINE_PROP_BOOL("old-multiple-chip-handling", PFlashCFI01,
                      old_multiple_chip_handling, false),
+    DEFINE_PROP_BOOL("prealloc", PFlashCFI01, prealloc, true),
     DEFINE_PROP_END_OF_LIST(),
 };

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 1625913..1b16d3b 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -804,6 +804,16 @@ void memory_region_init_rom_device_nomigrate(MemoryRegion *mr,
                                              uint64_t size,
                                              Error **errp);

+void memory_region_init_rom_device_from_file(MemoryRegion *mr,
+                                             struct Object *owner,
+                                             const MemoryRegionOps *ops,
+                                             void *opaque,
+                                             const char *name,
+                                             uint64_t size,
+                                             uint32_t ram_flags,
+                                             const char *path,
+                                             Error **errp);
+
 /**
  * memory_region_init_iommu: Initialize a memory region of a custom type
  * that translates addresses
diff --git a/memory.c b/memory.c
index 9fbca52..905956b 100644
--- a/memory.c
+++ b/memory.c
@@ -1719,6 +1719,36 @@ void memory_region_init_rom_device_nomigrate(MemoryRegion *mr,
     }
 }

+void memory_region_init_rom_device_from_file(MemoryRegion *mr,
+                                             Object *owner,
+                                             const MemoryRegionOps *ops,
+                                             void *opaque,
+                                             const char *name,
+                                             uint64_t size,
+                                             uint32_t ram_flags,
+                                             const char *path,
+                                             Error **errp)
+{
+    DeviceState *owner_dev;
+    Error *err = NULL;
+    assert(ops);
+    memory_region_init(mr, owner, name, size);
+    mr->ops = ops;
+    mr->opaque = opaque;
+    mr->terminates = true;
+    mr->rom_device = true;
+    mr->destructor = memory_region_destructor_ram;
+    mr->ram_block = qemu_ram_alloc_from_file(size, mr, ram_flags, path, &err);
+    if (err) {
+        mr->size = int128_zero();
+        object_unparent(OBJECT(mr));
+        error_propagate(errp, err);
+        return ;
+    }
+    owner_dev = DEVICE(owner);
+    vmstate_register_ram(mr, owner_dev);
+}
+
 void memory_region_init_iommu(void *_iommu_mr,
                               size_t instance_size,
                               const char *mrtypename,
-- 
1.8.3.1




-- 

Thanks,
Xiang

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] hw/arm/virt: use variable size of flash device to save memory
  2019-04-03 14:12               ` [Qemu-devel] [RFC PATCH] hw/arm/virt: use variable size of flash device to save memory Xiang Zheng
@ 2019-04-03 15:35                 ` Laszlo Ersek
  2019-04-08 13:43                   ` Xiang Zheng
  0 siblings, 1 reply; 25+ messages in thread
From: Laszlo Ersek @ 2019-04-03 15:35 UTC (permalink / raw)
  To: Xiang Zheng, Markus Armbruster
  Cc: Peter Maydell, Ard Biesheuvel, QEMU Developers, qemu-arm,
	Heyi Guo, wanghaibin.wang

On 04/03/19 16:12, Xiang Zheng wrote:
> Hi Laszlo and Markus,
> 
> Thanks for your useful suggestions and comments! :)
> 
> On 2019/3/27 2:36, Markus Armbruster wrote:
>> Laszlo Ersek <lersek@redhat.com> writes:
>>
>>> On 03/26/19 17:39, Markus Armbruster wrote:
>>>> Laszlo Ersek <lersek@redhat.com> writes:
>>>
>>>>> With the dynamic sizing in QEMU (which, IIRC, I had originally
>>>>> introduced still in the 1MB times, due to the split between the
>>>>> executable and varstore parts), both the 1MB->2MB switch, and the
>>>>> 2MB->4MB switch in the firmware caused zero pain in QEMU. And right now,
>>>>> 4MB looks like a "sweet spot", with some elbow room left.
>>>>
>>>> Explicit configuration would've been exactly as painless.  Even with
>>>> pflash sizes restricted to powers of two.
>>>
>>> I wrote the patch that ended up as commit 637a5acb46b3 -- with your R-b
>>> -- in 2013. I'm unsure if machine type properties existed back then, but
>>> even if they did, do you think I knew about them? :)
>>>
>>> You are right, of course; it's just that we can't tell the future.
>>
>> True!  All we can do is continue to design as well as we can given the
>> information, experience and resources we have, and when the inevitable
>> design mistakes become apparent, limit their impact.
>>
>> Some of the things we now consider mistakes we just didn't see.  Others
>> we saw (e.g. multiple pflash devices, unlike physical hardware), but
>> underestimated their impact.
>>
> 
> I thought about your comments and wrote the following patch (just for test)
> which uses a file mapping to replace the anonymous mapping. UEFI seems to work
> fine. So why not use a file mapping to read or write a pflash device?

Honestly I can't answer "why not do this". Maybe we should.

Regarding "why not do this *exactly as shown below*" -- probably because
then we'd have updates to the same underlying regular file via both
mmap() and write(), and the interactions between those are really tricky
(= best avoided).

One of my favorite questions over the years used to be posting a matrix
of possible mmap() and file descriptor r/w/sync interactions, with the
outcomes as they were "implied" by POSIX, and then asking at the bottom,
"is my understanding correct?" I've never ever received an answer to
that. :)

Also... although we don't really use them in practice, some QCOW2
features for pflash block backends are -- or would be -- nice. (Not the
internal snapshots or the live RAM dumps, of course.)

Thanks
Laszlo

> Except this way, I don't know how to share the pflash memory among VMs or
> reduce the consumption for resource. :(
> 
> ---8>---
> 
> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> index ce2664a..12c78f2 100644
> --- a/hw/arm/virt.c
> +++ b/hw/arm/virt.c
> @@ -898,6 +898,7 @@ static void create_one_flash(const char *name, hwaddr flashbase,
>      qdev_prop_set_uint16(dev, "id2", 0x00);
>      qdev_prop_set_uint16(dev, "id3", 0x00);
>      qdev_prop_set_string(dev, "name", name);
> +    qdev_prop_set_bit(dev, "prealloc", false);
>      qdev_init_nofail(dev);
> 
>      memory_region_add_subregion(sysmem, flashbase,
> diff --git a/hw/block/pflash_cfi01.c b/hw/block/pflash_cfi01.c
> index 16dfae1..23a85bc 100644
> --- a/hw/block/pflash_cfi01.c
> +++ b/hw/block/pflash_cfi01.c
> @@ -92,6 +92,7 @@ struct PFlashCFI01 {
>      void *storage;
>      VMChangeStateEntry *vmstate;
>      bool old_multiple_chip_handling;
> +    bool prealloc;
>  };
> 
>  static int pflash_post_load(void *opaque, int version_id);
> @@ -731,11 +732,21 @@ static void pflash_cfi01_realize(DeviceState *dev, Error **errp)
>      }
>      device_len = sector_len_per_device * blocks_per_device;
> 
> -    memory_region_init_rom_device(
> -        &pfl->mem, OBJECT(dev),
> -        &pflash_cfi01_ops,
> -        pfl,
> -        pfl->name, total_len, &local_err);
> +    if (pfl->blk && !pfl->prealloc) {
> +        memory_region_init_rom_device_from_file(
> +            &pfl->mem, OBJECT(dev),
> +            &pflash_cfi01_ops,
> +            pfl,
> +            pfl->name, total_len,
> +            blk_is_read_only(pfl->blk) ? RAM_SHARED : RAM_PMEM,
> +            blk_bs(pfl->blk)->filename, &local_err);
> +    } else {
> +        memory_region_init_rom_device(
> +            &pfl->mem, OBJECT(dev),
> +            &pflash_cfi01_ops,
> +            pfl,
> +            pfl->name, total_len, &local_err);
> +    }
>      if (local_err) {
>          error_propagate(errp, local_err);
>          return;
> @@ -899,6 +910,7 @@ static Property pflash_cfi01_properties[] = {
>      DEFINE_PROP_STRING("name", PFlashCFI01, name),
>      DEFINE_PROP_BOOL("old-multiple-chip-handling", PFlashCFI01,
>                       old_multiple_chip_handling, false),
> +    DEFINE_PROP_BOOL("prealloc", PFlashCFI01, prealloc, true),
>      DEFINE_PROP_END_OF_LIST(),
>  };
> 
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 1625913..1b16d3b 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -804,6 +804,16 @@ void memory_region_init_rom_device_nomigrate(MemoryRegion *mr,
>                                               uint64_t size,
>                                               Error **errp);
> 
> +void memory_region_init_rom_device_from_file(MemoryRegion *mr,
> +                                             struct Object *owner,
> +                                             const MemoryRegionOps *ops,
> +                                             void *opaque,
> +                                             const char *name,
> +                                             uint64_t size,
> +                                             uint32_t ram_flags,
> +                                             const char *path,
> +                                             Error **errp);
> +
>  /**
>   * memory_region_init_iommu: Initialize a memory region of a custom type
>   * that translates addresses
> diff --git a/memory.c b/memory.c
> index 9fbca52..905956b 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1719,6 +1719,36 @@ void memory_region_init_rom_device_nomigrate(MemoryRegion *mr,
>      }
>  }
> 
> +void memory_region_init_rom_device_from_file(MemoryRegion *mr,
> +                                             Object *owner,
> +                                             const MemoryRegionOps *ops,
> +                                             void *opaque,
> +                                             const char *name,
> +                                             uint64_t size,
> +                                             uint32_t ram_flags,
> +                                             const char *path,
> +                                             Error **errp)
> +{
> +    DeviceState *owner_dev;
> +    Error *err = NULL;
> +    assert(ops);
> +    memory_region_init(mr, owner, name, size);
> +    mr->ops = ops;
> +    mr->opaque = opaque;
> +    mr->terminates = true;
> +    mr->rom_device = true;
> +    mr->destructor = memory_region_destructor_ram;
> +    mr->ram_block = qemu_ram_alloc_from_file(size, mr, ram_flags, path, &err);
> +    if (err) {
> +        mr->size = int128_zero();
> +        object_unparent(OBJECT(mr));
> +        error_propagate(errp, err);
> +        return ;
> +    }
> +    owner_dev = DEVICE(owner);
> +    vmstate_register_ram(mr, owner_dev);
> +}
> +
>  void memory_region_init_iommu(void *_iommu_mr,
>                                size_t instance_size,
>                                const char *mrtypename,
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] hw/arm/virt: use variable size of flash device to save memory
  2019-04-03 15:35                 ` Laszlo Ersek
@ 2019-04-08 13:43                   ` Xiang Zheng
  2019-04-08 16:14                     ` Laszlo Ersek
  0 siblings, 1 reply; 25+ messages in thread
From: Xiang Zheng @ 2019-04-08 13:43 UTC (permalink / raw)
  To: Laszlo Ersek, Markus Armbruster
  Cc: Peter Maydell, Ard Biesheuvel, QEMU Developers, qemu-arm,
	Heyi Guo, wanghaibin.wang


On 2019/4/3 23:35, Laszlo Ersek wrote:
>> I thought about your comments and wrote the following patch (just for test)
>> which uses a file mapping to replace the anonymous mapping. UEFI seems to work
>> fine. So why not use a file mapping to read or write a pflash device?
> Honestly I can't answer "why not do this". Maybe we should.
> 
> Regarding "why not do this *exactly as shown below*" -- probably because
> then we'd have updates to the same underlying regular file via both
> mmap() and write(), and the interactions between those are really tricky
> (= best avoided).
> 
> One of my favorite questions over the years used to be posting a matrix
> of possible mmap() and file descriptor r/w/sync interactions, with the
> outcomes as they were "implied" by POSIX, and then asking at the bottom,
> "is my understanding correct?" I've never ever received an answer to
> that. :)

In my opinion, it's feasible to r/w/sync the memory devices which use a block
backend via mmap() and write().

> 
> Also... although we don't really use them in practice, some QCOW2
> features for pflash block backends are -- or would be -- nice. (Not the
> internal snapshots or the live RAM dumps, of course.)

Regarding sparse files, can we read actual backend size data for the read-only
pflash memory as the following steps shown?

1) Create a sparse file -- QEMU_EFI-test.raw:
   dd if=/dev/zero of=QEMU_EFI-test.raw bs=1M seek=64 count=0

2) Read from QEMU_EFI.fd and write to QEMU_EFI-test.raw:
   dd of="QEMU_EFI-test.raw" if="QEMU_EFI.fd" conv=notrunc

3) Use QEMU_EFI-test.raw as the read-only pflash and start qemu with the below
patch applied:

---8>---

diff --git a/hw/block/block.c b/hw/block/block.c
index bf56c76..ad18d5e 100644
--- a/hw/block/block.c
+++ b/hw/block/block.c
@@ -30,7 +30,7 @@
 bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
                                  Error **errp)
 {
-    int64_t blk_len;
+    int64_t blk_len, actual_len;
     int ret;

     blk_len = blk_getlength(blk);
@@ -53,7 +53,9 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
      * block device and read only on demand.
      */
     assert(size <= BDRV_REQUEST_MAX_BYTES);
-    ret = blk_pread(blk, 0, buf, size);
+    actual_len = bdrv_get_allocated_file_size(blk_bs(blk));
+    ret = blk_pread(blk, 0, buf,
+            (!blk_is_read_only(blk) || actual_len < 0) ? size : actual_len);
     if (ret < 0) {
         error_setg_errno(errp, -ret, "can't read block backend");
         return false;
-- 
1.8.3.1


-- 

Thanks,
Xiang

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] hw/arm/virt: use variable size of flash device to save memory
  2019-04-08 13:43                   ` Xiang Zheng
@ 2019-04-08 16:14                     ` Laszlo Ersek
  2019-04-09  3:39                       ` Xiang Zheng
  2019-04-09  6:01                         ` Markus Armbruster
  0 siblings, 2 replies; 25+ messages in thread
From: Laszlo Ersek @ 2019-04-08 16:14 UTC (permalink / raw)
  To: Xiang Zheng, Markus Armbruster
  Cc: Peter Maydell, Ard Biesheuvel, QEMU Developers, qemu-arm,
	Heyi Guo, wanghaibin.wang

On 04/08/19 15:43, Xiang Zheng wrote:
> 
> On 2019/4/3 23:35, Laszlo Ersek wrote:
>>> I thought about your comments and wrote the following patch (just for test)
>>> which uses a file mapping to replace the anonymous mapping. UEFI seems to work
>>> fine. So why not use a file mapping to read or write a pflash device?
>> Honestly I can't answer "why not do this". Maybe we should.
>>
>> Regarding "why not do this *exactly as shown below*" -- probably because
>> then we'd have updates to the same underlying regular file via both
>> mmap() and write(), and the interactions between those are really tricky
>> (= best avoided).
>>
>> One of my favorite questions over the years used to be posting a matrix
>> of possible mmap() and file descriptor r/w/sync interactions, with the
>> outcomes as they were "implied" by POSIX, and then asking at the bottom,
>> "is my understanding correct?" I've never ever received an answer to
>> that. :)
> 
> In my opinion, it's feasible to r/w/sync the memory devices which use a block
> backend via mmap() and write().

Maybe. I think that would be a first in QEMU, and you'd likely have to
describe and follow a semi-formal model between fd actions and mmap()
actions.

Here's the (unconfirmed) table I referred to earlier.

+-------------+-----------------------------------------------------+
| change made | change visible via                                  |
| through     +-----------------+-------------+---------------------+
|             | MAP_SHARED      | MAP_PRIVATE | read()              |
+-------------+-----------------+-------------+---------------------+
| MAP_SHARED  | yes             | unspecified | depends on MS_SYNC, |
|             |                 |             | MS_ASYNC, or normal |
|             |                 |             | system activity     |
+-------------+-----------------+-------------+---------------------+
| MAP_PRIVATE | no              | no          | no                  |
+-------------+-----------------+-------------+---------------------+
| write()     | depends on      | unspecified | yes                 |
|             | MS_INVALIDATE,  |             |                     |
|             | or the system's |             |                     |
|             | read/write      |             |                     |
|             | consistency     |             |                     |
+-------------+-----------------+-------------+---------------------+

Presumably:

- a write() to some offset range of a regular file should be visible in
a MAP_SHARED mapping of that range after a matching msync(...,
MS_INVALIDATE) call;

- changes through a MAP_SHARED mapping to a regular file should be
visible via read() after a matching msync(..., MS_SYNC) call returns.

I sent this table first in 2010 to the Austin Group mailing list, and
received no comments. Then another person (on the same list) asked
basically the same questions in 2013, to which I responded with the
above assumptions / interpretations -- and received no comments from
third parties again.

But, I'm really out of my depth here -- we're not even discussing
generic read()/write()/mmap()/munmap()/msync() interactions, but how
they would fit into the qemu block layer. And I have no idea.

> 
>>
>> Also... although we don't really use them in practice, some QCOW2
>> features for pflash block backends are -- or would be -- nice. (Not the
>> internal snapshots or the live RAM dumps, of course.)
> 
> Regarding sparse files, can we read actual backend size data for the read-only
> pflash memory as the following steps shown?
> 
> 1) Create a sparse file -- QEMU_EFI-test.raw:
>    dd if=/dev/zero of=QEMU_EFI-test.raw bs=1M seek=64 count=0
> 
> 2) Read from QEMU_EFI.fd and write to QEMU_EFI-test.raw:
>    dd of="QEMU_EFI-test.raw" if="QEMU_EFI.fd" conv=notrunc
> 
> 3) Use QEMU_EFI-test.raw as the read-only pflash and start qemu with the below
> patch applied:
> 
> ---8>---
> 
> diff --git a/hw/block/block.c b/hw/block/block.c
> index bf56c76..ad18d5e 100644
> --- a/hw/block/block.c
> +++ b/hw/block/block.c
> @@ -30,7 +30,7 @@
>  bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
>                                   Error **errp)
>  {
> -    int64_t blk_len;
> +    int64_t blk_len, actual_len;
>      int ret;
> 
>      blk_len = blk_getlength(blk);
> @@ -53,7 +53,9 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
>       * block device and read only on demand.
>       */
>      assert(size <= BDRV_REQUEST_MAX_BYTES);
> -    ret = blk_pread(blk, 0, buf, size);
> +    actual_len = bdrv_get_allocated_file_size(blk_bs(blk));
> +    ret = blk_pread(blk, 0, buf,
> +            (!blk_is_read_only(blk) || actual_len < 0) ? size : actual_len);
>      if (ret < 0) {
>          error_setg_errno(errp, -ret, "can't read block backend");
>          return false;
> 

This hunk looks dubious for a general helper function. It seems to
assume that the holes are punched at the end of the file.

Sorry, this discussion is making me uncomfortable. I don't want to
ignore your questions, but I have no idea about the right answers. This
really needs the attention of the block people.

Laszlo

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] hw/arm/virt: use variable size of flash device to save memory
  2019-04-08 16:14                     ` Laszlo Ersek
@ 2019-04-09  3:39                       ` Xiang Zheng
  2019-04-09  6:01                         ` Markus Armbruster
  1 sibling, 0 replies; 25+ messages in thread
From: Xiang Zheng @ 2019-04-09  3:39 UTC (permalink / raw)
  To: Laszlo Ersek, Markus Armbruster
  Cc: Peter Maydell, Ard Biesheuvel, QEMU Developers, qemu-arm,
	Heyi Guo, wanghaibin.wang


On 2019/4/9 0:14, Laszlo Ersek wrote:
> On 04/08/19 15:43, Xiang Zheng wrote:
>>
>> On 2019/4/3 23:35, Laszlo Ersek wrote:
>>>> I thought about your comments and wrote the following patch (just for test)
>>>> which uses a file mapping to replace the anonymous mapping. UEFI seems to work
>>>> fine. So why not use a file mapping to read or write a pflash device?
>>> Honestly I can't answer "why not do this". Maybe we should.
>>>
>>> Regarding "why not do this *exactly as shown below*" -- probably because
>>> then we'd have updates to the same underlying regular file via both
>>> mmap() and write(), and the interactions between those are really tricky
>>> (= best avoided).
>>>
>>> One of my favorite questions over the years used to be posting a matrix
>>> of possible mmap() and file descriptor r/w/sync interactions, with the
>>> outcomes as they were "implied" by POSIX, and then asking at the bottom,
>>> "is my understanding correct?" I've never ever received an answer to
>>> that. :)
>>
>> In my opinion, it's feasible to r/w/sync the memory devices which use a block
>> backend via mmap() and write().
> 
> Maybe. I think that would be a first in QEMU, and you'd likely have to
> describe and follow a semi-formal model between fd actions and mmap()
> actions.
> 
> Here's the (unconfirmed) table I referred to earlier.
> 
> +-------------+-----------------------------------------------------+
> | change made | change visible via                                  |
> | through     +-----------------+-------------+---------------------+
> |             | MAP_SHARED      | MAP_PRIVATE | read()              |
> +-------------+-----------------+-------------+---------------------+
> | MAP_SHARED  | yes             | unspecified | depends on MS_SYNC, |
> |             |                 |             | MS_ASYNC, or normal |
> |             |                 |             | system activity     |
> +-------------+-----------------+-------------+---------------------+
> | MAP_PRIVATE | no              | no          | no                  |
> +-------------+-----------------+-------------+---------------------+
> | write()     | depends on      | unspecified | yes                 |
> |             | MS_INVALIDATE,  |             |                     |
> |             | or the system's |             |                     |
> |             | read/write      |             |                     |
> |             | consistency     |             |                     |
> +-------------+-----------------+-------------+---------------------+
> 
> Presumably:
> 
> - a write() to some offset range of a regular file should be visible in
> a MAP_SHARED mapping of that range after a matching msync(...,
> MS_INVALIDATE) call;
> 
> - changes through a MAP_SHARED mapping to a regular file should be
> visible via read() after a matching msync(..., MS_SYNC) call returns.
> 
> I sent this table first in 2010 to the Austin Group mailing list, and
> received no comments. Then another person (on the same list) asked
> basically the same questions in 2013, to which I responded with the
> above assumptions / interpretations -- and received no comments from
> third parties again.
> 
> But, I'm really out of my depth here -- we're not even discussing
> generic read()/write()/mmap()/munmap()/msync() interactions, but how
> they would fit into the qemu block layer. And I have no idea.
> 
>>
>>>
>>> Also... although we don't really use them in practice, some QCOW2
>>> features for pflash block backends are -- or would be -- nice. (Not the
>>> internal snapshots or the live RAM dumps, of course.)
>>
>> Regarding sparse files, can we read actual backend size data for the read-only
>> pflash memory as the following steps shown?
>>
>> 1) Create a sparse file -- QEMU_EFI-test.raw:
>>    dd if=/dev/zero of=QEMU_EFI-test.raw bs=1M seek=64 count=0
>>
>> 2) Read from QEMU_EFI.fd and write to QEMU_EFI-test.raw:
>>    dd of="QEMU_EFI-test.raw" if="QEMU_EFI.fd" conv=notrunc
>>
>> 3) Use QEMU_EFI-test.raw as the read-only pflash and start qemu with the below
>> patch applied:
>>
>> ---8>---
>>
>> diff --git a/hw/block/block.c b/hw/block/block.c
>> index bf56c76..ad18d5e 100644
>> --- a/hw/block/block.c
>> +++ b/hw/block/block.c
>> @@ -30,7 +30,7 @@
>>  bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
>>                                   Error **errp)
>>  {
>> -    int64_t blk_len;
>> +    int64_t blk_len, actual_len;
>>      int ret;
>>
>>      blk_len = blk_getlength(blk);
>> @@ -53,7 +53,9 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
>>       * block device and read only on demand.
>>       */
>>      assert(size <= BDRV_REQUEST_MAX_BYTES);
>> -    ret = blk_pread(blk, 0, buf, size);
>> +    actual_len = bdrv_get_allocated_file_size(blk_bs(blk));
>> +    ret = blk_pread(blk, 0, buf,
>> +            (!blk_is_read_only(blk) || actual_len < 0) ? size : actual_len);
>>      if (ret < 0) {
>>          error_setg_errno(errp, -ret, "can't read block backend");
>>          return false;
>>
> 
> This hunk looks dubious for a general helper function. It seems to
> assume that the holes are punched at the end of the file.
> 
> Sorry, this discussion is making me uncomfortable. I don't want to
> ignore your questions, but I have no idea about the right answers. This
> really needs the attention of the block people.

I feel deeply sorry for making you uncomfortable. It's generous of you to give me so much
of your time.

-- 

Thanks,
Xiang

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] hw/arm/virt: use variable size of flash device to save memory
@ 2019-04-09  6:01                         ` Markus Armbruster
  0 siblings, 0 replies; 25+ messages in thread
From: Markus Armbruster @ 2019-04-09  6:01 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: Xiang Zheng, Peter Maydell, Ard Biesheuvel, QEMU Developers,
	qemu-arm, Heyi Guo, wanghaibin.wang, qemu-block, Kevin Wolf,
	Max Reitz, Stefan Hajnoczi

László's last sentence below is "This really needs the attention of the
block people."  Cc'ing some.

Laszlo Ersek <lersek@redhat.com> writes:

> On 04/08/19 15:43, Xiang Zheng wrote:
>> 
>> On 2019/4/3 23:35, Laszlo Ersek wrote:
>>>> I thought about your comments and wrote the following patch (just for test)
>>>> which uses a file mapping to replace the anonymous mapping. UEFI seems to work
>>>> fine. So why not use a file mapping to read or write a pflash device?
>>> Honestly I can't answer "why not do this". Maybe we should.
>>>
>>> Regarding "why not do this *exactly as shown below*" -- probably because
>>> then we'd have updates to the same underlying regular file via both
>>> mmap() and write(), and the interactions between those are really tricky
>>> (= best avoided).
>>>
>>> One of my favorite questions over the years used to be posting a matrix
>>> of possible mmap() and file descriptor r/w/sync interactions, with the
>>> outcomes as they were "implied" by POSIX, and then asking at the bottom,
>>> "is my understanding correct?" I've never ever received an answer to
>>> that. :)
>> 
>> In my opinion, it's feasible to r/w/sync the memory devices which use a block
>> backend via mmap() and write().
>
> Maybe. I think that would be a first in QEMU, and you'd likely have to
> describe and follow a semi-formal model between fd actions and mmap()
> actions.
>
> Here's the (unconfirmed) table I referred to earlier.
>
> +-------------+-----------------------------------------------------+
> | change made | change visible via                                  |
> | through     +-----------------+-------------+---------------------+
> |             | MAP_SHARED      | MAP_PRIVATE | read()              |
> +-------------+-----------------+-------------+---------------------+
> | MAP_SHARED  | yes             | unspecified | depends on MS_SYNC, |
> |             |                 |             | MS_ASYNC, or normal |
> |             |                 |             | system activity     |
> +-------------+-----------------+-------------+---------------------+
> | MAP_PRIVATE | no              | no          | no                  |
> +-------------+-----------------+-------------+---------------------+
> | write()     | depends on      | unspecified | yes                 |
> |             | MS_INVALIDATE,  |             |                     |
> |             | or the system's |             |                     |
> |             | read/write      |             |                     |
> |             | consistency     |             |                     |
> +-------------+-----------------+-------------+---------------------+
>
> Presumably:
>
> - a write() to some offset range of a regular file should be visible in
> a MAP_SHARED mapping of that range after a matching msync(...,
> MS_INVALIDATE) call;
>
> - changes through a MAP_SHARED mapping to a regular file should be
> visible via read() after a matching msync(..., MS_SYNC) call returns.
>
> I sent this table first in 2010 to the Austin Group mailing list, and
> received no comments. Then another person (on the same list) asked
> basically the same questions in 2013, to which I responded with the
> above assumptions / interpretations -- and received no comments from
> third parties again.
>
> But, I'm really out of my depth here -- we're not even discussing
> generic read()/write()/mmap()/munmap()/msync() interactions, but how
> they would fit into the qemu block layer. And I have no idea.
>
>> 
>>>
>>> Also... although we don't really use them in practice, some QCOW2
>>> features for pflash block backends are -- or would be -- nice. (Not the
>>> internal snapshots or the live RAM dumps, of course.)
>> 
>> Regarding sparse files, can we read actual backend size data for the read-only
>> pflash memory as the following steps shown?
>> 
>> 1) Create a sparse file -- QEMU_EFI-test.raw:
>>    dd if=/dev/zero of=QEMU_EFI-test.raw bs=1M seek=64 count=0
>> 
>> 2) Read from QEMU_EFI.fd and write to QEMU_EFI-test.raw:
>>    dd of="QEMU_EFI-test.raw" if="QEMU_EFI.fd" conv=notrunc
>> 
>> 3) Use QEMU_EFI-test.raw as the read-only pflash and start qemu with the below
>> patch applied:
>> 
>> ---8>---
>> 
>> diff --git a/hw/block/block.c b/hw/block/block.c
>> index bf56c76..ad18d5e 100644
>> --- a/hw/block/block.c
>> +++ b/hw/block/block.c
>> @@ -30,7 +30,7 @@
>>  bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
>>                                   Error **errp)
>>  {
>> -    int64_t blk_len;
>> +    int64_t blk_len, actual_len;
>>      int ret;
>> 
>>      blk_len = blk_getlength(blk);
>> @@ -53,7 +53,9 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
>>       * block device and read only on demand.
>>       */
>>      assert(size <= BDRV_REQUEST_MAX_BYTES);
>> -    ret = blk_pread(blk, 0, buf, size);
>> +    actual_len = bdrv_get_allocated_file_size(blk_bs(blk));
>> +    ret = blk_pread(blk, 0, buf,
>> +            (!blk_is_read_only(blk) || actual_len < 0) ? size : actual_len);
>>      if (ret < 0) {
>>          error_setg_errno(errp, -ret, "can't read block backend");
>>          return false;
>> 
>
> This hunk looks dubious for a general helper function. It seems to
> assume that the holes are punched at the end of the file.
>
> Sorry, this discussion is making me uncomfortable. I don't want to
> ignore your questions, but I have no idea about the right answers. This
> really needs the attention of the block people.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] hw/arm/virt: use variable size of flash device to save memory
@ 2019-04-09  6:01                         ` Markus Armbruster
  0 siblings, 0 replies; 25+ messages in thread
From: Markus Armbruster @ 2019-04-09  6:01 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: Kevin Wolf, Peter Maydell, qemu-block, Ard Biesheuvel,
	QEMU Developers, Max Reitz, Xiang Zheng, qemu-arm,
	Stefan Hajnoczi, Heyi Guo, wanghaibin.wang

László's last sentence below is "This really needs the attention of the
block people."  Cc'ing some.

Laszlo Ersek <lersek@redhat.com> writes:

> On 04/08/19 15:43, Xiang Zheng wrote:
>> 
>> On 2019/4/3 23:35, Laszlo Ersek wrote:
>>>> I thought about your comments and wrote the following patch (just for test)
>>>> which uses a file mapping to replace the anonymous mapping. UEFI seems to work
>>>> fine. So why not use a file mapping to read or write a pflash device?
>>> Honestly I can't answer "why not do this". Maybe we should.
>>>
>>> Regarding "why not do this *exactly as shown below*" -- probably because
>>> then we'd have updates to the same underlying regular file via both
>>> mmap() and write(), and the interactions between those are really tricky
>>> (= best avoided).
>>>
>>> One of my favorite questions over the years used to be posting a matrix
>>> of possible mmap() and file descriptor r/w/sync interactions, with the
>>> outcomes as they were "implied" by POSIX, and then asking at the bottom,
>>> "is my understanding correct?" I've never ever received an answer to
>>> that. :)
>> 
>> In my opinion, it's feasible to r/w/sync the memory devices which use a block
>> backend via mmap() and write().
>
> Maybe. I think that would be a first in QEMU, and you'd likely have to
> describe and follow a semi-formal model between fd actions and mmap()
> actions.
>
> Here's the (unconfirmed) table I referred to earlier.
>
> +-------------+-----------------------------------------------------+
> | change made | change visible via                                  |
> | through     +-----------------+-------------+---------------------+
> |             | MAP_SHARED      | MAP_PRIVATE | read()              |
> +-------------+-----------------+-------------+---------------------+
> | MAP_SHARED  | yes             | unspecified | depends on MS_SYNC, |
> |             |                 |             | MS_ASYNC, or normal |
> |             |                 |             | system activity     |
> +-------------+-----------------+-------------+---------------------+
> | MAP_PRIVATE | no              | no          | no                  |
> +-------------+-----------------+-------------+---------------------+
> | write()     | depends on      | unspecified | yes                 |
> |             | MS_INVALIDATE,  |             |                     |
> |             | or the system's |             |                     |
> |             | read/write      |             |                     |
> |             | consistency     |             |                     |
> +-------------+-----------------+-------------+---------------------+
>
> Presumably:
>
> - a write() to some offset range of a regular file should be visible in
> a MAP_SHARED mapping of that range after a matching msync(...,
> MS_INVALIDATE) call;
>
> - changes through a MAP_SHARED mapping to a regular file should be
> visible via read() after a matching msync(..., MS_SYNC) call returns.
>
> I sent this table first in 2010 to the Austin Group mailing list, and
> received no comments. Then another person (on the same list) asked
> basically the same questions in 2013, to which I responded with the
> above assumptions / interpretations -- and received no comments from
> third parties again.
>
> But, I'm really out of my depth here -- we're not even discussing
> generic read()/write()/mmap()/munmap()/msync() interactions, but how
> they would fit into the qemu block layer. And I have no idea.
>
>> 
>>>
>>> Also... although we don't really use them in practice, some QCOW2
>>> features for pflash block backends are -- or would be -- nice. (Not the
>>> internal snapshots or the live RAM dumps, of course.)
>> 
>> Regarding sparse files, can we read actual backend size data for the read-only
>> pflash memory as the following steps shown?
>> 
>> 1) Create a sparse file -- QEMU_EFI-test.raw:
>>    dd if=/dev/zero of=QEMU_EFI-test.raw bs=1M seek=64 count=0
>> 
>> 2) Read from QEMU_EFI.fd and write to QEMU_EFI-test.raw:
>>    dd of="QEMU_EFI-test.raw" if="QEMU_EFI.fd" conv=notrunc
>> 
>> 3) Use QEMU_EFI-test.raw as the read-only pflash and start qemu with the below
>> patch applied:
>> 
>> ---8>---
>> 
>> diff --git a/hw/block/block.c b/hw/block/block.c
>> index bf56c76..ad18d5e 100644
>> --- a/hw/block/block.c
>> +++ b/hw/block/block.c
>> @@ -30,7 +30,7 @@
>>  bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
>>                                   Error **errp)
>>  {
>> -    int64_t blk_len;
>> +    int64_t blk_len, actual_len;
>>      int ret;
>> 
>>      blk_len = blk_getlength(blk);
>> @@ -53,7 +53,9 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
>>       * block device and read only on demand.
>>       */
>>      assert(size <= BDRV_REQUEST_MAX_BYTES);
>> -    ret = blk_pread(blk, 0, buf, size);
>> +    actual_len = bdrv_get_allocated_file_size(blk_bs(blk));
>> +    ret = blk_pread(blk, 0, buf,
>> +            (!blk_is_read_only(blk) || actual_len < 0) ? size : actual_len);
>>      if (ret < 0) {
>>          error_setg_errno(errp, -ret, "can't read block backend");
>>          return false;
>> 
>
> This hunk looks dubious for a general helper function. It seems to
> assume that the holes are punched at the end of the file.
>
> Sorry, this discussion is making me uncomfortable. I don't want to
> ignore your questions, but I have no idea about the right answers. This
> really needs the attention of the block people.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] hw/arm/virt: use variable size of flash device to save memory
@ 2019-04-09  8:28                           ` Kevin Wolf
  0 siblings, 0 replies; 25+ messages in thread
From: Kevin Wolf @ 2019-04-09  8:28 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Laszlo Ersek, Xiang Zheng, Peter Maydell, Ard Biesheuvel,
	QEMU Developers, qemu-arm, Heyi Guo, wanghaibin.wang, qemu-block,
	Max Reitz, Stefan Hajnoczi

Am 09.04.2019 um 08:01 hat Markus Armbruster geschrieben:
> László's last sentence below is "This really needs the attention of the
> block people."  Cc'ing some.
> 
> Laszlo Ersek <lersek@redhat.com> writes:
> 
> > On 04/08/19 15:43, Xiang Zheng wrote:
> >> 
> >> On 2019/4/3 23:35, Laszlo Ersek wrote:
> >>>> I thought about your comments and wrote the following patch (just for test)
> >>>> which uses a file mapping to replace the anonymous mapping. UEFI seems to work
> >>>> fine. So why not use a file mapping to read or write a pflash device?
> >>> Honestly I can't answer "why not do this". Maybe we should.
> >>>
> >>> Regarding "why not do this *exactly as shown below*" -- probably because
> >>> then we'd have updates to the same underlying regular file via both
> >>> mmap() and write(), and the interactions between those are really tricky
> >>> (= best avoided).
> >>>
> >>> One of my favorite questions over the years used to be posting a matrix
> >>> of possible mmap() and file descriptor r/w/sync interactions, with the
> >>> outcomes as they were "implied" by POSIX, and then asking at the bottom,
> >>> "is my understanding correct?" I've never ever received an answer to
> >>> that. :)
> >> 
> >> In my opinion, it's feasible to r/w/sync the memory devices which use a block
> >> backend via mmap() and write().
> >
> > Maybe. I think that would be a first in QEMU, and you'd likely have to
> > describe and follow a semi-formal model between fd actions and mmap()
> > actions.
> >
> > Here's the (unconfirmed) table I referred to earlier.
> >
> > +-------------+-----------------------------------------------------+
> > | change made | change visible via                                  |
> > | through     +-----------------+-------------+---------------------+
> > |             | MAP_SHARED      | MAP_PRIVATE | read()              |
> > +-------------+-----------------+-------------+---------------------+
> > | MAP_SHARED  | yes             | unspecified | depends on MS_SYNC, |
> > |             |                 |             | MS_ASYNC, or normal |
> > |             |                 |             | system activity     |
> > +-------------+-----------------+-------------+---------------------+
> > | MAP_PRIVATE | no              | no          | no                  |
> > +-------------+-----------------+-------------+---------------------+
> > | write()     | depends on      | unspecified | yes                 |
> > |             | MS_INVALIDATE,  |             |                     |
> > |             | or the system's |             |                     |
> > |             | read/write      |             |                     |
> > |             | consistency     |             |                     |
> > +-------------+-----------------+-------------+---------------------+
> >
> > Presumably:
> >
> > - a write() to some offset range of a regular file should be visible in
> > a MAP_SHARED mapping of that range after a matching msync(...,
> > MS_INVALIDATE) call;
> >
> > - changes through a MAP_SHARED mapping to a regular file should be
> > visible via read() after a matching msync(..., MS_SYNC) call returns.
> >
> > I sent this table first in 2010 to the Austin Group mailing list, and
> > received no comments. Then another person (on the same list) asked
> > basically the same questions in 2013, to which I responded with the
> > above assumptions / interpretations -- and received no comments from
> > third parties again.
> >
> > But, I'm really out of my depth here -- we're not even discussing
> > generic read()/write()/mmap()/munmap()/msync() interactions, but how
> > they would fit into the qemu block layer. And I have no idea.

There is no infrastructure in the block layer for mmapping image files.

> >>> Also... although we don't really use them in practice, some QCOW2
> >>> features for pflash block backends are -- or would be -- nice. (Not the
> >>> internal snapshots or the live RAM dumps, of course.)
> >> 
> >> Regarding sparse files, can we read actual backend size data for the read-only
> >> pflash memory as the following steps shown?
> >> 
> >> 1) Create a sparse file -- QEMU_EFI-test.raw:
> >>    dd if=/dev/zero of=QEMU_EFI-test.raw bs=1M seek=64 count=0
> >> 
> >> 2) Read from QEMU_EFI.fd and write to QEMU_EFI-test.raw:
> >>    dd of="QEMU_EFI-test.raw" if="QEMU_EFI.fd" conv=notrunc
> >> 
> >> 3) Use QEMU_EFI-test.raw as the read-only pflash and start qemu with the below
> >> patch applied:
> >> 
> >> ---8>---
> >> 
> >> diff --git a/hw/block/block.c b/hw/block/block.c
> >> index bf56c76..ad18d5e 100644
> >> --- a/hw/block/block.c
> >> +++ b/hw/block/block.c
> >> @@ -30,7 +30,7 @@
> >>  bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
> >>                                   Error **errp)
> >>  {
> >> -    int64_t blk_len;
> >> +    int64_t blk_len, actual_len;
> >>      int ret;
> >> 
> >>      blk_len = blk_getlength(blk);
> >> @@ -53,7 +53,9 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
> >>       * block device and read only on demand.
> >>       */
> >>      assert(size <= BDRV_REQUEST_MAX_BYTES);
> >> -    ret = blk_pread(blk, 0, buf, size);
> >> +    actual_len = bdrv_get_allocated_file_size(blk_bs(blk));
> >> +    ret = blk_pread(blk, 0, buf,
> >> +            (!blk_is_read_only(blk) || actual_len < 0) ? size : actual_len);
> >>      if (ret < 0) {
> >>          error_setg_errno(errp, -ret, "can't read block backend");
> >>          return false;
> >> 
> >
> > This hunk looks dubious for a general helper function. It seems to
> > assume that the holes are punched at the end of the file.
> >
> > Sorry, this discussion is making me uncomfortable. I don't want to
> > ignore your questions, but I have no idea about the right answers. This
> > really needs the attention of the block people.

Yes, this code is wrong. bdrv_get_allocated_file_size() isn't even
implemented in all block drivers, and when it is implemented it isn't
necessarily reliable (e.g. return 0 for block devices). It's fine for
'qemu-img info', but that's it.

But even if you actually get the correct allocated file size, that's the
disk space used on the host filesystem, so it doesn't include any holes,
but it does include image format metadata, the data could possibly be
compressed etc. In other words, for a guest device it's completely
meaningless.

I'm not even sure why you would want to do this even in your special
case of a raw image that is sparse only at the end. Is it an
optimisation to avoid reading zeros? This ends up basically being
memset() for sparse blocks, so very quick. And I think you do want to
zero out the remaining part of the buffer anyway. So it looks to me as
if it were complicating the code for no use.

Kevin

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] hw/arm/virt: use variable size of flash device to save memory
@ 2019-04-09  8:28                           ` Kevin Wolf
  0 siblings, 0 replies; 25+ messages in thread
From: Kevin Wolf @ 2019-04-09  8:28 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Peter Maydell, qemu-block, Ard Biesheuvel, QEMU Developers,
	Max Reitz, Xiang Zheng, qemu-arm, Stefan Hajnoczi, Heyi Guo,
	wanghaibin.wang, Laszlo Ersek

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="UTF-8", Size: 6945 bytes --]

Am 09.04.2019 um 08:01 hat Markus Armbruster geschrieben:
> László's last sentence below is "This really needs the attention of the
> block people."  Cc'ing some.
> 
> Laszlo Ersek <lersek@redhat.com> writes:
> 
> > On 04/08/19 15:43, Xiang Zheng wrote:
> >> 
> >> On 2019/4/3 23:35, Laszlo Ersek wrote:
> >>>> I thought about your comments and wrote the following patch (just for test)
> >>>> which uses a file mapping to replace the anonymous mapping. UEFI seems to work
> >>>> fine. So why not use a file mapping to read or write a pflash device?
> >>> Honestly I can't answer "why not do this". Maybe we should.
> >>>
> >>> Regarding "why not do this *exactly as shown below*" -- probably because
> >>> then we'd have updates to the same underlying regular file via both
> >>> mmap() and write(), and the interactions between those are really tricky
> >>> (= best avoided).
> >>>
> >>> One of my favorite questions over the years used to be posting a matrix
> >>> of possible mmap() and file descriptor r/w/sync interactions, with the
> >>> outcomes as they were "implied" by POSIX, and then asking at the bottom,
> >>> "is my understanding correct?" I've never ever received an answer to
> >>> that. :)
> >> 
> >> In my opinion, it's feasible to r/w/sync the memory devices which use a block
> >> backend via mmap() and write().
> >
> > Maybe. I think that would be a first in QEMU, and you'd likely have to
> > describe and follow a semi-formal model between fd actions and mmap()
> > actions.
> >
> > Here's the (unconfirmed) table I referred to earlier.
> >
> > +-------------+-----------------------------------------------------+
> > | change made | change visible via                                  |
> > | through     +-----------------+-------------+---------------------+
> > |             | MAP_SHARED      | MAP_PRIVATE | read()              |
> > +-------------+-----------------+-------------+---------------------+
> > | MAP_SHARED  | yes             | unspecified | depends on MS_SYNC, |
> > |             |                 |             | MS_ASYNC, or normal |
> > |             |                 |             | system activity     |
> > +-------------+-----------------+-------------+---------------------+
> > | MAP_PRIVATE | no              | no          | no                  |
> > +-------------+-----------------+-------------+---------------------+
> > | write()     | depends on      | unspecified | yes                 |
> > |             | MS_INVALIDATE,  |             |                     |
> > |             | or the system's |             |                     |
> > |             | read/write      |             |                     |
> > |             | consistency     |             |                     |
> > +-------------+-----------------+-------------+---------------------+
> >
> > Presumably:
> >
> > - a write() to some offset range of a regular file should be visible in
> > a MAP_SHARED mapping of that range after a matching msync(...,
> > MS_INVALIDATE) call;
> >
> > - changes through a MAP_SHARED mapping to a regular file should be
> > visible via read() after a matching msync(..., MS_SYNC) call returns.
> >
> > I sent this table first in 2010 to the Austin Group mailing list, and
> > received no comments. Then another person (on the same list) asked
> > basically the same questions in 2013, to which I responded with the
> > above assumptions / interpretations -- and received no comments from
> > third parties again.
> >
> > But, I'm really out of my depth here -- we're not even discussing
> > generic read()/write()/mmap()/munmap()/msync() interactions, but how
> > they would fit into the qemu block layer. And I have no idea.

There is no infrastructure in the block layer for mmapping image files.

> >>> Also... although we don't really use them in practice, some QCOW2
> >>> features for pflash block backends are -- or would be -- nice. (Not the
> >>> internal snapshots or the live RAM dumps, of course.)
> >> 
> >> Regarding sparse files, can we read actual backend size data for the read-only
> >> pflash memory as the following steps shown?
> >> 
> >> 1) Create a sparse file -- QEMU_EFI-test.raw:
> >>    dd if=/dev/zero of=QEMU_EFI-test.raw bs=1M seek=64 count=0
> >> 
> >> 2) Read from QEMU_EFI.fd and write to QEMU_EFI-test.raw:
> >>    dd of="QEMU_EFI-test.raw" if="QEMU_EFI.fd" conv=notrunc
> >> 
> >> 3) Use QEMU_EFI-test.raw as the read-only pflash and start qemu with the below
> >> patch applied:
> >> 
> >> ---8>---
> >> 
> >> diff --git a/hw/block/block.c b/hw/block/block.c
> >> index bf56c76..ad18d5e 100644
> >> --- a/hw/block/block.c
> >> +++ b/hw/block/block.c
> >> @@ -30,7 +30,7 @@
> >>  bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
> >>                                   Error **errp)
> >>  {
> >> -    int64_t blk_len;
> >> +    int64_t blk_len, actual_len;
> >>      int ret;
> >> 
> >>      blk_len = blk_getlength(blk);
> >> @@ -53,7 +53,9 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
> >>       * block device and read only on demand.
> >>       */
> >>      assert(size <= BDRV_REQUEST_MAX_BYTES);
> >> -    ret = blk_pread(blk, 0, buf, size);
> >> +    actual_len = bdrv_get_allocated_file_size(blk_bs(blk));
> >> +    ret = blk_pread(blk, 0, buf,
> >> +            (!blk_is_read_only(blk) || actual_len < 0) ? size : actual_len);
> >>      if (ret < 0) {
> >>          error_setg_errno(errp, -ret, "can't read block backend");
> >>          return false;
> >> 
> >
> > This hunk looks dubious for a general helper function. It seems to
> > assume that the holes are punched at the end of the file.
> >
> > Sorry, this discussion is making me uncomfortable. I don't want to
> > ignore your questions, but I have no idea about the right answers. This
> > really needs the attention of the block people.

Yes, this code is wrong. bdrv_get_allocated_file_size() isn't even
implemented in all block drivers, and when it is implemented it isn't
necessarily reliable (e.g. return 0 for block devices). It's fine for
'qemu-img info', but that's it.

But even if you actually get the correct allocated file size, that's the
disk space used on the host filesystem, so it doesn't include any holes,
but it does include image format metadata, the data could possibly be
compressed etc. In other words, for a guest device it's completely
meaningless.

I'm not even sure why you would want to do this even in your special
case of a raw image that is sparse only at the end. Is it an
optimisation to avoid reading zeros? This ends up basically being
memset() for sparse blocks, so very quick. And I think you do want to
zero out the remaining part of the buffer anyway. So it looks to me as
if it were complicating the code for no use.

Kevin


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] hw/arm/virt: use variable size of flash device to save memory
@ 2019-04-10  8:36                             ` Xiang Zheng
  0 siblings, 0 replies; 25+ messages in thread
From: Xiang Zheng @ 2019-04-10  8:36 UTC (permalink / raw)
  To: Kevin Wolf, Markus Armbruster
  Cc: Laszlo Ersek, Peter Maydell, Ard Biesheuvel, QEMU Developers,
	qemu-arm, Heyi Guo, wanghaibin.wang, qemu-block, Max Reitz,
	Stefan Hajnoczi

Hi Kevin,

On 2019/4/9 16:28, Kevin Wolf wrote:
> Am 09.04.2019 um 08:01 hat Markus Armbruster geschrieben:
>> László's last sentence below is "This really needs the attention of the
>> block people."  Cc'ing some.
>>
>> Laszlo Ersek <lersek@redhat.com> writes:
>>
>>> On 04/08/19 15:43, Xiang Zheng wrote:
>>>>
>>>> On 2019/4/3 23:35, Laszlo Ersek wrote:
>>>>>> I thought about your comments and wrote the following patch (just for test)
>>>>>> which uses a file mapping to replace the anonymous mapping. UEFI seems to work
>>>>>> fine. So why not use a file mapping to read or write a pflash device?
>>>>> Honestly I can't answer "why not do this". Maybe we should.
>>>>>
>>>>> Regarding "why not do this *exactly as shown below*" -- probably because
>>>>> then we'd have updates to the same underlying regular file via both
>>>>> mmap() and write(), and the interactions between those are really tricky
>>>>> (= best avoided).
>>>>>
>>>>> One of my favorite questions over the years used to be posting a matrix
>>>>> of possible mmap() and file descriptor r/w/sync interactions, with the
>>>>> outcomes as they were "implied" by POSIX, and then asking at the bottom,
>>>>> "is my understanding correct?" I've never ever received an answer to
>>>>> that. :)
>>>>
>>>> In my opinion, it's feasible to r/w/sync the memory devices which use a block
>>>> backend via mmap() and write().
>>>
>>> Maybe. I think that would be a first in QEMU, and you'd likely have to
>>> describe and follow a semi-formal model between fd actions and mmap()
>>> actions.
>>>
>>> Here's the (unconfirmed) table I referred to earlier.
>>>
>>> +-------------+-----------------------------------------------------+
>>> | change made | change visible via                                  |
>>> | through     +-----------------+-------------+---------------------+
>>> |             | MAP_SHARED      | MAP_PRIVATE | read()              |
>>> +-------------+-----------------+-------------+---------------------+
>>> | MAP_SHARED  | yes             | unspecified | depends on MS_SYNC, |
>>> |             |                 |             | MS_ASYNC, or normal |
>>> |             |                 |             | system activity     |
>>> +-------------+-----------------+-------------+---------------------+
>>> | MAP_PRIVATE | no              | no          | no                  |
>>> +-------------+-----------------+-------------+---------------------+
>>> | write()     | depends on      | unspecified | yes                 |
>>> |             | MS_INVALIDATE,  |             |                     |
>>> |             | or the system's |             |                     |
>>> |             | read/write      |             |                     |
>>> |             | consistency     |             |                     |
>>> +-------------+-----------------+-------------+---------------------+
>>>
>>> Presumably:
>>>
>>> - a write() to some offset range of a regular file should be visible in
>>> a MAP_SHARED mapping of that range after a matching msync(...,
>>> MS_INVALIDATE) call;
>>>
>>> - changes through a MAP_SHARED mapping to a regular file should be
>>> visible via read() after a matching msync(..., MS_SYNC) call returns.
>>>
>>> I sent this table first in 2010 to the Austin Group mailing list, and
>>> received no comments. Then another person (on the same list) asked
>>> basically the same questions in 2013, to which I responded with the
>>> above assumptions / interpretations -- and received no comments from
>>> third parties again.
>>>
>>> But, I'm really out of my depth here -- we're not even discussing
>>> generic read()/write()/mmap()/munmap()/msync() interactions, but how
>>> they would fit into the qemu block layer. And I have no idea.
> 
> There is no infrastructure in the block layer for mmapping image files.
> 
>>>>> Also... although we don't really use them in practice, some QCOW2
>>>>> features for pflash block backends are -- or would be -- nice. (Not the
>>>>> internal snapshots or the live RAM dumps, of course.)
>>>>
>>>> Regarding sparse files, can we read actual backend size data for the read-only
>>>> pflash memory as the following steps shown?
>>>>
>>>> 1) Create a sparse file -- QEMU_EFI-test.raw:
>>>>    dd if=/dev/zero of=QEMU_EFI-test.raw bs=1M seek=64 count=0
>>>>
>>>> 2) Read from QEMU_EFI.fd and write to QEMU_EFI-test.raw:
>>>>    dd of="QEMU_EFI-test.raw" if="QEMU_EFI.fd" conv=notrunc
>>>>
>>>> 3) Use QEMU_EFI-test.raw as the read-only pflash and start qemu with the below
>>>> patch applied:
>>>>
>>>> ---8>---
>>>>
>>>> diff --git a/hw/block/block.c b/hw/block/block.c
>>>> index bf56c76..ad18d5e 100644
>>>> --- a/hw/block/block.c
>>>> +++ b/hw/block/block.c
>>>> @@ -30,7 +30,7 @@
>>>>  bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
>>>>                                   Error **errp)
>>>>  {
>>>> -    int64_t blk_len;
>>>> +    int64_t blk_len, actual_len;
>>>>      int ret;
>>>>
>>>>      blk_len = blk_getlength(blk);
>>>> @@ -53,7 +53,9 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
>>>>       * block device and read only on demand.
>>>>       */
>>>>      assert(size <= BDRV_REQUEST_MAX_BYTES);
>>>> -    ret = blk_pread(blk, 0, buf, size);
>>>> +    actual_len = bdrv_get_allocated_file_size(blk_bs(blk));
>>>> +    ret = blk_pread(blk, 0, buf,
>>>> +            (!blk_is_read_only(blk) || actual_len < 0) ? size : actual_len);
>>>>      if (ret < 0) {
>>>>          error_setg_errno(errp, -ret, "can't read block backend");
>>>>          return false;
>>>>
>>>
>>> This hunk looks dubious for a general helper function. It seems to
>>> assume that the holes are punched at the end of the file.
>>>
>>> Sorry, this discussion is making me uncomfortable. I don't want to
>>> ignore your questions, but I have no idea about the right answers. This
>>> really needs the attention of the block people.
> 
> Yes, this code is wrong. bdrv_get_allocated_file_size() isn't even
> implemented in all block drivers, and when it is implemented it isn't
> necessarily reliable (e.g. return 0 for block devices). It's fine for
> 'qemu-img info', but that's it.
> 
> But even if you actually get the correct allocated file size, that's the
> disk space used on the host filesystem, so it doesn't include any holes,
> but it does include image format metadata, the data could possibly be
> compressed etc. In other words, for a guest device it's completely
> meaningless.
> 
> I'm not even sure why you would want to do this even in your special
> case of a raw image that is sparse only at the end. Is it an
> optimisation to avoid reading zeros? This ends up basically being
> memset() for sparse blocks, so very quick. And I think you do want to
> zero out the remaining part of the buffer anyway. So it looks to me as
> if it were complicating the code for no use.

The primary target of this discussion is to save the memory allocated
for UEFI firmware device. On virtual machine, we split it into two flash
devices[1] -- one is read-only in 64MB size and another is read-write in
64MB size. The qemu commandline is:

   -drive file=QEMU_EFI-pflash.raw,if=pflash,format=raw,unit=0,readonly=on \
   -drive file=empty_VARS.fd,if=pflash,format=raw,unit=1 \

Regarding the read-only one which are created from QEMU_EFI.fd, the original
QEMU_EFI.fd is only 2MB in size and we need to stuff it to 64MB with 62MB
'zeros':

   dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M count=64
   dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc

For some historical reasons such as compatibility and extensibility[2], we
restrict both their size to 64MB -- both UEFI and qemu have cold hard
constants.

This will consume a large amount of memory when running multiple VM
simultaneously (each guest needs 128MB).

There are accepted ideas proposed by Markus and Laszlo from prior discussion[3]:

1) Getting flash memory size from a machine type or property.

2) Map the read-only part so it is shared among guests; Map the read-write
part normally.

The first idea called explicit configuration which may break migration.

For the second idea, the only way I can imagine is using a file mapping to
read or write pflash devices so that we can allocate memory on demand. So I
tried to fit mmap() actions into the block layer[4]. However I am not sure
that maping image file whether can work fine for block layer.

On the other hand, Laszlo had mentioned the features of QCOW2 and sparse
files which may be nice for pflash block backends. But I am not familiar with
it.

Till now I still have no idea to get rid of this problem (more properly it's
a optimization).


[1]https://lists.nongnu.org/archive/html/qemu-devel/2019-01/msg06606.html
[2]https://lists.nongnu.org/archive/html/qemu-devel/2019-03/msg07009.html
[3]https://lists.nongnu.org/archive/html/qemu-devel/2019-03/msg07107.html
[4]https://lists.nongnu.org/archive/html/qemu-devel/2019-04/msg00719.html

-- 

Thanks,
Xiang

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] hw/arm/virt: use variable size of flash device to save memory
@ 2019-04-10  8:36                             ` Xiang Zheng
  0 siblings, 0 replies; 25+ messages in thread
From: Xiang Zheng @ 2019-04-10  8:36 UTC (permalink / raw)
  To: Kevin Wolf, Markus Armbruster
  Cc: Peter Maydell, qemu-block, Ard Biesheuvel, QEMU Developers,
	Max Reitz, qemu-arm, Stefan Hajnoczi, Heyi Guo, wanghaibin.wang,
	Laszlo Ersek

Hi Kevin,

On 2019/4/9 16:28, Kevin Wolf wrote:
> Am 09.04.2019 um 08:01 hat Markus Armbruster geschrieben:
>> László's last sentence below is "This really needs the attention of the
>> block people."  Cc'ing some.
>>
>> Laszlo Ersek <lersek@redhat.com> writes:
>>
>>> On 04/08/19 15:43, Xiang Zheng wrote:
>>>>
>>>> On 2019/4/3 23:35, Laszlo Ersek wrote:
>>>>>> I thought about your comments and wrote the following patch (just for test)
>>>>>> which uses a file mapping to replace the anonymous mapping. UEFI seems to work
>>>>>> fine. So why not use a file mapping to read or write a pflash device?
>>>>> Honestly I can't answer "why not do this". Maybe we should.
>>>>>
>>>>> Regarding "why not do this *exactly as shown below*" -- probably because
>>>>> then we'd have updates to the same underlying regular file via both
>>>>> mmap() and write(), and the interactions between those are really tricky
>>>>> (= best avoided).
>>>>>
>>>>> One of my favorite questions over the years used to be posting a matrix
>>>>> of possible mmap() and file descriptor r/w/sync interactions, with the
>>>>> outcomes as they were "implied" by POSIX, and then asking at the bottom,
>>>>> "is my understanding correct?" I've never ever received an answer to
>>>>> that. :)
>>>>
>>>> In my opinion, it's feasible to r/w/sync the memory devices which use a block
>>>> backend via mmap() and write().
>>>
>>> Maybe. I think that would be a first in QEMU, and you'd likely have to
>>> describe and follow a semi-formal model between fd actions and mmap()
>>> actions.
>>>
>>> Here's the (unconfirmed) table I referred to earlier.
>>>
>>> +-------------+-----------------------------------------------------+
>>> | change made | change visible via                                  |
>>> | through     +-----------------+-------------+---------------------+
>>> |             | MAP_SHARED      | MAP_PRIVATE | read()              |
>>> +-------------+-----------------+-------------+---------------------+
>>> | MAP_SHARED  | yes             | unspecified | depends on MS_SYNC, |
>>> |             |                 |             | MS_ASYNC, or normal |
>>> |             |                 |             | system activity     |
>>> +-------------+-----------------+-------------+---------------------+
>>> | MAP_PRIVATE | no              | no          | no                  |
>>> +-------------+-----------------+-------------+---------------------+
>>> | write()     | depends on      | unspecified | yes                 |
>>> |             | MS_INVALIDATE,  |             |                     |
>>> |             | or the system's |             |                     |
>>> |             | read/write      |             |                     |
>>> |             | consistency     |             |                     |
>>> +-------------+-----------------+-------------+---------------------+
>>>
>>> Presumably:
>>>
>>> - a write() to some offset range of a regular file should be visible in
>>> a MAP_SHARED mapping of that range after a matching msync(...,
>>> MS_INVALIDATE) call;
>>>
>>> - changes through a MAP_SHARED mapping to a regular file should be
>>> visible via read() after a matching msync(..., MS_SYNC) call returns.
>>>
>>> I sent this table first in 2010 to the Austin Group mailing list, and
>>> received no comments. Then another person (on the same list) asked
>>> basically the same questions in 2013, to which I responded with the
>>> above assumptions / interpretations -- and received no comments from
>>> third parties again.
>>>
>>> But, I'm really out of my depth here -- we're not even discussing
>>> generic read()/write()/mmap()/munmap()/msync() interactions, but how
>>> they would fit into the qemu block layer. And I have no idea.
> 
> There is no infrastructure in the block layer for mmapping image files.
> 
>>>>> Also... although we don't really use them in practice, some QCOW2
>>>>> features for pflash block backends are -- or would be -- nice. (Not the
>>>>> internal snapshots or the live RAM dumps, of course.)
>>>>
>>>> Regarding sparse files, can we read actual backend size data for the read-only
>>>> pflash memory as the following steps shown?
>>>>
>>>> 1) Create a sparse file -- QEMU_EFI-test.raw:
>>>>    dd if=/dev/zero of=QEMU_EFI-test.raw bs=1M seek=64 count=0
>>>>
>>>> 2) Read from QEMU_EFI.fd and write to QEMU_EFI-test.raw:
>>>>    dd of="QEMU_EFI-test.raw" if="QEMU_EFI.fd" conv=notrunc
>>>>
>>>> 3) Use QEMU_EFI-test.raw as the read-only pflash and start qemu with the below
>>>> patch applied:
>>>>
>>>> ---8>---
>>>>
>>>> diff --git a/hw/block/block.c b/hw/block/block.c
>>>> index bf56c76..ad18d5e 100644
>>>> --- a/hw/block/block.c
>>>> +++ b/hw/block/block.c
>>>> @@ -30,7 +30,7 @@
>>>>  bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
>>>>                                   Error **errp)
>>>>  {
>>>> -    int64_t blk_len;
>>>> +    int64_t blk_len, actual_len;
>>>>      int ret;
>>>>
>>>>      blk_len = blk_getlength(blk);
>>>> @@ -53,7 +53,9 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
>>>>       * block device and read only on demand.
>>>>       */
>>>>      assert(size <= BDRV_REQUEST_MAX_BYTES);
>>>> -    ret = blk_pread(blk, 0, buf, size);
>>>> +    actual_len = bdrv_get_allocated_file_size(blk_bs(blk));
>>>> +    ret = blk_pread(blk, 0, buf,
>>>> +            (!blk_is_read_only(blk) || actual_len < 0) ? size : actual_len);
>>>>      if (ret < 0) {
>>>>          error_setg_errno(errp, -ret, "can't read block backend");
>>>>          return false;
>>>>
>>>
>>> This hunk looks dubious for a general helper function. It seems to
>>> assume that the holes are punched at the end of the file.
>>>
>>> Sorry, this discussion is making me uncomfortable. I don't want to
>>> ignore your questions, but I have no idea about the right answers. This
>>> really needs the attention of the block people.
> 
> Yes, this code is wrong. bdrv_get_allocated_file_size() isn't even
> implemented in all block drivers, and when it is implemented it isn't
> necessarily reliable (e.g. return 0 for block devices). It's fine for
> 'qemu-img info', but that's it.
> 
> But even if you actually get the correct allocated file size, that's the
> disk space used on the host filesystem, so it doesn't include any holes,
> but it does include image format metadata, the data could possibly be
> compressed etc. In other words, for a guest device it's completely
> meaningless.
> 
> I'm not even sure why you would want to do this even in your special
> case of a raw image that is sparse only at the end. Is it an
> optimisation to avoid reading zeros? This ends up basically being
> memset() for sparse blocks, so very quick. And I think you do want to
> zero out the remaining part of the buffer anyway. So it looks to me as
> if it were complicating the code for no use.

The primary target of this discussion is to save the memory allocated
for UEFI firmware device. On virtual machine, we split it into two flash
devices[1] -- one is read-only in 64MB size and another is read-write in
64MB size. The qemu commandline is:

   -drive file=QEMU_EFI-pflash.raw,if=pflash,format=raw,unit=0,readonly=on \
   -drive file=empty_VARS.fd,if=pflash,format=raw,unit=1 \

Regarding the read-only one which are created from QEMU_EFI.fd, the original
QEMU_EFI.fd is only 2MB in size and we need to stuff it to 64MB with 62MB
'zeros':

   dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M count=64
   dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc

For some historical reasons such as compatibility and extensibility[2], we
restrict both their size to 64MB -- both UEFI and qemu have cold hard
constants.

This will consume a large amount of memory when running multiple VM
simultaneously (each guest needs 128MB).

There are accepted ideas proposed by Markus and Laszlo from prior discussion[3]:

1) Getting flash memory size from a machine type or property.

2) Map the read-only part so it is shared among guests; Map the read-write
part normally.

The first idea called explicit configuration which may break migration.

For the second idea, the only way I can imagine is using a file mapping to
read or write pflash devices so that we can allocate memory on demand. So I
tried to fit mmap() actions into the block layer[4]. However I am not sure
that maping image file whether can work fine for block layer.

On the other hand, Laszlo had mentioned the features of QCOW2 and sparse
files which may be nice for pflash block backends. But I am not familiar with
it.

Till now I still have no idea to get rid of this problem (more properly it's
a optimization).


[1]https://lists.nongnu.org/archive/html/qemu-devel/2019-01/msg06606.html
[2]https://lists.nongnu.org/archive/html/qemu-devel/2019-03/msg07009.html
[3]https://lists.nongnu.org/archive/html/qemu-devel/2019-03/msg07107.html
[4]https://lists.nongnu.org/archive/html/qemu-devel/2019-04/msg00719.html

-- 

Thanks,
Xiang




^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] hw/arm/virt: use variable size of flash device to save memory
  2019-04-10  8:36                             ` Xiang Zheng
  (?)
@ 2019-04-11  7:15                             ` Markus Armbruster
  2019-04-12  9:26                               ` Xiang Zheng
  -1 siblings, 1 reply; 25+ messages in thread
From: Markus Armbruster @ 2019-04-11  7:15 UTC (permalink / raw)
  To: Xiang Zheng
  Cc: Kevin Wolf, Peter Maydell, qemu-block, Ard Biesheuvel,
	QEMU Developers, Max Reitz, qemu-arm, Stefan Hajnoczi, Heyi Guo,
	wanghaibin.wang, Laszlo Ersek

Xiang Zheng <zhengxiang9@huawei.com> writes:

> Hi Kevin,
>
> On 2019/4/9 16:28, Kevin Wolf wrote:
>> Am 09.04.2019 um 08:01 hat Markus Armbruster geschrieben:
>>> László's last sentence below is "This really needs the attention of the
>>> block people."  Cc'ing some.
>>>
>>> Laszlo Ersek <lersek@redhat.com> writes:
>>>
>>>> On 04/08/19 15:43, Xiang Zheng wrote:
>>>>>
>>>>> On 2019/4/3 23:35, Laszlo Ersek wrote:
>>>>>>> I thought about your comments and wrote the following patch (just for test)
>>>>>>> which uses a file mapping to replace the anonymous mapping. UEFI seems to work
>>>>>>> fine. So why not use a file mapping to read or write a pflash device?
>>>>>> Honestly I can't answer "why not do this". Maybe we should.
>>>>>>
>>>>>> Regarding "why not do this *exactly as shown below*" -- probably because
>>>>>> then we'd have updates to the same underlying regular file via both
>>>>>> mmap() and write(), and the interactions between those are really tricky
>>>>>> (= best avoided).
>>>>>>
>>>>>> One of my favorite questions over the years used to be posting a matrix
>>>>>> of possible mmap() and file descriptor r/w/sync interactions, with the
>>>>>> outcomes as they were "implied" by POSIX, and then asking at the bottom,
>>>>>> "is my understanding correct?" I've never ever received an answer to
>>>>>> that. :)
>>>>>
>>>>> In my opinion, it's feasible to r/w/sync the memory devices which use a block
>>>>> backend via mmap() and write().
>>>>
>>>> Maybe. I think that would be a first in QEMU, and you'd likely have to
>>>> describe and follow a semi-formal model between fd actions and mmap()
>>>> actions.
>>>>
>>>> Here's the (unconfirmed) table I referred to earlier.
>>>>
>>>> +-------------+-----------------------------------------------------+
>>>> | change made | change visible via                                  |
>>>> | through     +-----------------+-------------+---------------------+
>>>> |             | MAP_SHARED      | MAP_PRIVATE | read()              |
>>>> +-------------+-----------------+-------------+---------------------+
>>>> | MAP_SHARED  | yes             | unspecified | depends on MS_SYNC, |
>>>> |             |                 |             | MS_ASYNC, or normal |
>>>> |             |                 |             | system activity     |
>>>> +-------------+-----------------+-------------+---------------------+
>>>> | MAP_PRIVATE | no              | no          | no                  |
>>>> +-------------+-----------------+-------------+---------------------+
>>>> | write()     | depends on      | unspecified | yes                 |
>>>> |             | MS_INVALIDATE,  |             |                     |
>>>> |             | or the system's |             |                     |
>>>> |             | read/write      |             |                     |
>>>> |             | consistency     |             |                     |
>>>> +-------------+-----------------+-------------+---------------------+
>>>>
>>>> Presumably:
>>>>
>>>> - a write() to some offset range of a regular file should be visible in
>>>> a MAP_SHARED mapping of that range after a matching msync(...,
>>>> MS_INVALIDATE) call;
>>>>
>>>> - changes through a MAP_SHARED mapping to a regular file should be
>>>> visible via read() after a matching msync(..., MS_SYNC) call returns.
>>>>
>>>> I sent this table first in 2010 to the Austin Group mailing list, and
>>>> received no comments. Then another person (on the same list) asked
>>>> basically the same questions in 2013, to which I responded with the
>>>> above assumptions / interpretations -- and received no comments from
>>>> third parties again.
>>>>
>>>> But, I'm really out of my depth here -- we're not even discussing
>>>> generic read()/write()/mmap()/munmap()/msync() interactions, but how
>>>> they would fit into the qemu block layer. And I have no idea.
>> 
>> There is no infrastructure in the block layer for mmapping image files.
>> 
>>>>>> Also... although we don't really use them in practice, some QCOW2
>>>>>> features for pflash block backends are -- or would be -- nice. (Not the
>>>>>> internal snapshots or the live RAM dumps, of course.)
>>>>>
>>>>> Regarding sparse files, can we read actual backend size data for the read-only
>>>>> pflash memory as the following steps shown?
>>>>>
>>>>> 1) Create a sparse file -- QEMU_EFI-test.raw:
>>>>>    dd if=/dev/zero of=QEMU_EFI-test.raw bs=1M seek=64 count=0
>>>>>
>>>>> 2) Read from QEMU_EFI.fd and write to QEMU_EFI-test.raw:
>>>>>    dd of="QEMU_EFI-test.raw" if="QEMU_EFI.fd" conv=notrunc
>>>>>
>>>>> 3) Use QEMU_EFI-test.raw as the read-only pflash and start qemu with the below
>>>>> patch applied:
>>>>>
>>>>> ---8>---
>>>>>
>>>>> diff --git a/hw/block/block.c b/hw/block/block.c
>>>>> index bf56c76..ad18d5e 100644
>>>>> --- a/hw/block/block.c
>>>>> +++ b/hw/block/block.c
>>>>> @@ -30,7 +30,7 @@
>>>>>  bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
>>>>>                                   Error **errp)
>>>>>  {
>>>>> -    int64_t blk_len;
>>>>> +    int64_t blk_len, actual_len;
>>>>>      int ret;
>>>>>
>>>>>      blk_len = blk_getlength(blk);
>>>>> @@ -53,7 +53,9 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
>>>>>       * block device and read only on demand.
>>>>>       */
>>>>>      assert(size <= BDRV_REQUEST_MAX_BYTES);
>>>>> -    ret = blk_pread(blk, 0, buf, size);
>>>>> +    actual_len = bdrv_get_allocated_file_size(blk_bs(blk));
>>>>> +    ret = blk_pread(blk, 0, buf,
>>>>> +            (!blk_is_read_only(blk) || actual_len < 0) ? size : actual_len);
>>>>>      if (ret < 0) {
>>>>>          error_setg_errno(errp, -ret, "can't read block backend");
>>>>>          return false;
>>>>>
>>>>
>>>> This hunk looks dubious for a general helper function. It seems to
>>>> assume that the holes are punched at the end of the file.
>>>>
>>>> Sorry, this discussion is making me uncomfortable. I don't want to
>>>> ignore your questions, but I have no idea about the right answers. This
>>>> really needs the attention of the block people.
>> 
>> Yes, this code is wrong. bdrv_get_allocated_file_size() isn't even
>> implemented in all block drivers, and when it is implemented it isn't
>> necessarily reliable (e.g. return 0 for block devices). It's fine for
>> 'qemu-img info', but that's it.
>> 
>> But even if you actually get the correct allocated file size, that's the
>> disk space used on the host filesystem, so it doesn't include any holes,
>> but it does include image format metadata, the data could possibly be
>> compressed etc. In other words, for a guest device it's completely
>> meaningless.
>> 
>> I'm not even sure why you would want to do this even in your special
>> case of a raw image that is sparse only at the end. Is it an
>> optimisation to avoid reading zeros? This ends up basically being
>> memset() for sparse blocks, so very quick. And I think you do want to
>> zero out the remaining part of the buffer anyway. So it looks to me as
>> if it were complicating the code for no use.
>
> The primary target of this discussion is to save the memory allocated
> for UEFI firmware device. On virtual machine, we split it into two flash
> devices[1] -- one is read-only in 64MB size and another is read-write in
> 64MB size. The qemu commandline is:
>
>    -drive file=QEMU_EFI-pflash.raw,if=pflash,format=raw,unit=0,readonly=on \
>    -drive file=empty_VARS.fd,if=pflash,format=raw,unit=1 \
>
> Regarding the read-only one which are created from QEMU_EFI.fd, the original
> QEMU_EFI.fd is only 2MB in size and we need to stuff it to 64MB with 62MB
> 'zeros':
>
>    dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M count=64
>    dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc
>
> For some historical reasons such as compatibility and extensibility[2], we
> restrict both their size to 64MB -- both UEFI and qemu have cold hard
> constants.

These reasons aren't historical.  But they're valid, and that's all that
matters :)

> This will consume a large amount of memory when running multiple VM
> simultaneously (each guest needs 128MB).

Understood.

> There are accepted ideas proposed by Markus and Laszlo from prior discussion[3]:
>
> 1) Getting flash memory size from a machine type or property.
>
> 2) Map the read-only part so it is shared among guests; Map the read-write
> part normally.
>
> The first idea called explicit configuration which may break migration.

For any configuration that can break migration, the solution is to have
identical configuration on source and destination.

> For the second idea, the only way I can imagine is using a file mapping to
> read or write pflash devices so that we can allocate memory on demand. So I
> tried to fit mmap() actions into the block layer[4]. However I am not sure
> that maping image file whether can work fine for block layer.

It adds an odd special case to the block layer.  But then flash devices
are an odd user of the block layer.

Normal users use the block layer as a *block* layer: to read and write
blocks.

Our flash devices don't do that.  They are *memory* devices, not *block*
devices.  Pretty much the only thing the two have in common is
persistence of data.  The block layer provides persistence, and I figure
that's why it got pressed into service here.

What the flash devices do is keep the complete image contents in memory,
with changes written through to disk.

This is pretty much exactly what mmap() does, so using it suggests
itself.

It's very much *not* what the block layer does.  mmap() is not even
possible for anything but a file backend.

With mmap(), memory gets faulted in on demand, and sharing read-only
memory with other VMs should be feasible.  The question is how to use it
here.  Is there a sane way to use it without rewriting the flash devices
not to use the block layer?

> On the other hand, Laszlo had mentioned the features of QCOW2 and sparse
> files which may be nice for pflash block backends. But I am not familiar with
> it.

Offhand, two QCOW2 features come to mind:

1. Sparse storage even when the storage substrate can't do sparse.  I
consider this one largely irrelevant.  Any decent file system can do
sparse.  Archival and network transfer can compress instead.

2. Snapshots to go with disk snapshots.  *Live* snapshots (the ones that
include memory) already have the flash contents in the memory part, so
this sems only relevant for "dead" snapshots.

Anything else?

> Till now I still have no idea to get rid of this problem (more properly it's
> a optimization).
>
>
> [1]https://lists.nongnu.org/archive/html/qemu-devel/2019-01/msg06606.html
> [2]https://lists.nongnu.org/archive/html/qemu-devel/2019-03/msg07009.html
> [3]https://lists.nongnu.org/archive/html/qemu-devel/2019-03/msg07107.html
> [4]https://lists.nongnu.org/archive/html/qemu-devel/2019-04/msg00719.html

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] hw/arm/virt: use variable size of flash device to save memory
@ 2019-04-11 12:22                               ` Kevin Wolf
  0 siblings, 0 replies; 25+ messages in thread
From: Kevin Wolf @ 2019-04-11 12:22 UTC (permalink / raw)
  To: Xiang Zheng
  Cc: Markus Armbruster, Laszlo Ersek, Peter Maydell, Ard Biesheuvel,
	QEMU Developers, qemu-arm, Heyi Guo, wanghaibin.wang, qemu-block,
	Max Reitz, Stefan Hajnoczi

Am 10.04.2019 um 10:36 hat Xiang Zheng geschrieben:
> Hi Kevin,
> 
> On 2019/4/9 16:28, Kevin Wolf wrote:
> > Am 09.04.2019 um 08:01 hat Markus Armbruster geschrieben:
> >> László's last sentence below is "This really needs the attention of the
> >> block people."  Cc'ing some.
> >>
> >> Laszlo Ersek <lersek@redhat.com> writes:
> >>
> >>> On 04/08/19 15:43, Xiang Zheng wrote:
> >>>>
> >>>> On 2019/4/3 23:35, Laszlo Ersek wrote:
> >>>>>> I thought about your comments and wrote the following patch (just for test)
> >>>>>> which uses a file mapping to replace the anonymous mapping. UEFI seems to work
> >>>>>> fine. So why not use a file mapping to read or write a pflash device?
> >>>>> Honestly I can't answer "why not do this". Maybe we should.
> >>>>>
> >>>>> Regarding "why not do this *exactly as shown below*" -- probably because
> >>>>> then we'd have updates to the same underlying regular file via both
> >>>>> mmap() and write(), and the interactions between those are really tricky
> >>>>> (= best avoided).
> >>>>>
> >>>>> One of my favorite questions over the years used to be posting a matrix
> >>>>> of possible mmap() and file descriptor r/w/sync interactions, with the
> >>>>> outcomes as they were "implied" by POSIX, and then asking at the bottom,
> >>>>> "is my understanding correct?" I've never ever received an answer to
> >>>>> that. :)
> >>>>
> >>>> In my opinion, it's feasible to r/w/sync the memory devices which use a block
> >>>> backend via mmap() and write().
> >>>
> >>> Maybe. I think that would be a first in QEMU, and you'd likely have to
> >>> describe and follow a semi-formal model between fd actions and mmap()
> >>> actions.
> >>>
> >>> Here's the (unconfirmed) table I referred to earlier.
> >>>
> >>> +-------------+-----------------------------------------------------+
> >>> | change made | change visible via                                  |
> >>> | through     +-----------------+-------------+---------------------+
> >>> |             | MAP_SHARED      | MAP_PRIVATE | read()              |
> >>> +-------------+-----------------+-------------+---------------------+
> >>> | MAP_SHARED  | yes             | unspecified | depends on MS_SYNC, |
> >>> |             |                 |             | MS_ASYNC, or normal |
> >>> |             |                 |             | system activity     |
> >>> +-------------+-----------------+-------------+---------------------+
> >>> | MAP_PRIVATE | no              | no          | no                  |
> >>> +-------------+-----------------+-------------+---------------------+
> >>> | write()     | depends on      | unspecified | yes                 |
> >>> |             | MS_INVALIDATE,  |             |                     |
> >>> |             | or the system's |             |                     |
> >>> |             | read/write      |             |                     |
> >>> |             | consistency     |             |                     |
> >>> +-------------+-----------------+-------------+---------------------+
> >>>
> >>> Presumably:
> >>>
> >>> - a write() to some offset range of a regular file should be visible in
> >>> a MAP_SHARED mapping of that range after a matching msync(...,
> >>> MS_INVALIDATE) call;
> >>>
> >>> - changes through a MAP_SHARED mapping to a regular file should be
> >>> visible via read() after a matching msync(..., MS_SYNC) call returns.
> >>>
> >>> I sent this table first in 2010 to the Austin Group mailing list, and
> >>> received no comments. Then another person (on the same list) asked
> >>> basically the same questions in 2013, to which I responded with the
> >>> above assumptions / interpretations -- and received no comments from
> >>> third parties again.
> >>>
> >>> But, I'm really out of my depth here -- we're not even discussing
> >>> generic read()/write()/mmap()/munmap()/msync() interactions, but how
> >>> they would fit into the qemu block layer. And I have no idea.
> > 
> > There is no infrastructure in the block layer for mmapping image files.
> > 
> >>>>> Also... although we don't really use them in practice, some QCOW2
> >>>>> features for pflash block backends are -- or would be -- nice. (Not the
> >>>>> internal snapshots or the live RAM dumps, of course.)
> >>>>
> >>>> Regarding sparse files, can we read actual backend size data for the read-only
> >>>> pflash memory as the following steps shown?
> >>>>
> >>>> 1) Create a sparse file -- QEMU_EFI-test.raw:
> >>>>    dd if=/dev/zero of=QEMU_EFI-test.raw bs=1M seek=64 count=0
> >>>>
> >>>> 2) Read from QEMU_EFI.fd and write to QEMU_EFI-test.raw:
> >>>>    dd of="QEMU_EFI-test.raw" if="QEMU_EFI.fd" conv=notrunc
> >>>>
> >>>> 3) Use QEMU_EFI-test.raw as the read-only pflash and start qemu with the below
> >>>> patch applied:
> >>>>
> >>>> ---8>---
> >>>>
> >>>> diff --git a/hw/block/block.c b/hw/block/block.c
> >>>> index bf56c76..ad18d5e 100644
> >>>> --- a/hw/block/block.c
> >>>> +++ b/hw/block/block.c
> >>>> @@ -30,7 +30,7 @@
> >>>>  bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
> >>>>                                   Error **errp)
> >>>>  {
> >>>> -    int64_t blk_len;
> >>>> +    int64_t blk_len, actual_len;
> >>>>      int ret;
> >>>>
> >>>>      blk_len = blk_getlength(blk);
> >>>> @@ -53,7 +53,9 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
> >>>>       * block device and read only on demand.
> >>>>       */
> >>>>      assert(size <= BDRV_REQUEST_MAX_BYTES);
> >>>> -    ret = blk_pread(blk, 0, buf, size);
> >>>> +    actual_len = bdrv_get_allocated_file_size(blk_bs(blk));
> >>>> +    ret = blk_pread(blk, 0, buf,
> >>>> +            (!blk_is_read_only(blk) || actual_len < 0) ? size : actual_len);
> >>>>      if (ret < 0) {
> >>>>          error_setg_errno(errp, -ret, "can't read block backend");
> >>>>          return false;
> >>>>
> >>>
> >>> This hunk looks dubious for a general helper function. It seems to
> >>> assume that the holes are punched at the end of the file.
> >>>
> >>> Sorry, this discussion is making me uncomfortable. I don't want to
> >>> ignore your questions, but I have no idea about the right answers. This
> >>> really needs the attention of the block people.
> > 
> > Yes, this code is wrong. bdrv_get_allocated_file_size() isn't even
> > implemented in all block drivers, and when it is implemented it isn't
> > necessarily reliable (e.g. return 0 for block devices). It's fine for
> > 'qemu-img info', but that's it.
> > 
> > But even if you actually get the correct allocated file size, that's the
> > disk space used on the host filesystem, so it doesn't include any holes,
> > but it does include image format metadata, the data could possibly be
> > compressed etc. In other words, for a guest device it's completely
> > meaningless.
> > 
> > I'm not even sure why you would want to do this even in your special
> > case of a raw image that is sparse only at the end. Is it an
> > optimisation to avoid reading zeros? This ends up basically being
> > memset() for sparse blocks, so very quick. And I think you do want to
> > zero out the remaining part of the buffer anyway. So it looks to me as
> > if it were complicating the code for no use.
> 
> The primary target of this discussion is to save the memory allocated
> for UEFI firmware device. On virtual machine, we split it into two flash
> devices[1] -- one is read-only in 64MB size and another is read-write in
> 64MB size. The qemu commandline is:
> 
>    -drive file=QEMU_EFI-pflash.raw,if=pflash,format=raw,unit=0,readonly=on \
>    -drive file=empty_VARS.fd,if=pflash,format=raw,unit=1 \
> 
> Regarding the read-only one which are created from QEMU_EFI.fd, the original
> QEMU_EFI.fd is only 2MB in size and we need to stuff it to 64MB with 62MB
> 'zeros':
> 
>    dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M count=64
>    dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc
> 
> For some historical reasons such as compatibility and extensibility[2], we
> restrict both their size to 64MB -- both UEFI and qemu have cold hard
> constants.
> 
> This will consume a large amount of memory when running multiple VM
> simultaneously (each guest needs 128MB).

Okay, so your problem is that blk_pread() writes to the whole buffer,
writing explicit zeroes for unallocated parts of the image, while you
would like to leave those parts of the buffer untouched so that we don't
actually allocate the memory, but can just use the shared zero page.

If you just want to read the non-zero parts of the image, that can be
done by using a loop that calls bdrv_block_status() and only reads from
the image if the BDRV_BLOCK_ZERO bit is clear.

Would this solve your problem?

Kevin

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] hw/arm/virt: use variable size of flash device to save memory
@ 2019-04-11 12:22                               ` Kevin Wolf
  0 siblings, 0 replies; 25+ messages in thread
From: Kevin Wolf @ 2019-04-11 12:22 UTC (permalink / raw)
  To: Xiang Zheng
  Cc: Peter Maydell, qemu-block, Ard Biesheuvel, QEMU Developers,
	Markus Armbruster, qemu-arm, Stefan Hajnoczi, Heyi Guo,
	wanghaibin.wang, Max Reitz, Laszlo Ersek

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="UTF-8", Size: 8897 bytes --]

Am 10.04.2019 um 10:36 hat Xiang Zheng geschrieben:
> Hi Kevin,
> 
> On 2019/4/9 16:28, Kevin Wolf wrote:
> > Am 09.04.2019 um 08:01 hat Markus Armbruster geschrieben:
> >> László's last sentence below is "This really needs the attention of the
> >> block people."  Cc'ing some.
> >>
> >> Laszlo Ersek <lersek@redhat.com> writes:
> >>
> >>> On 04/08/19 15:43, Xiang Zheng wrote:
> >>>>
> >>>> On 2019/4/3 23:35, Laszlo Ersek wrote:
> >>>>>> I thought about your comments and wrote the following patch (just for test)
> >>>>>> which uses a file mapping to replace the anonymous mapping. UEFI seems to work
> >>>>>> fine. So why not use a file mapping to read or write a pflash device?
> >>>>> Honestly I can't answer "why not do this". Maybe we should.
> >>>>>
> >>>>> Regarding "why not do this *exactly as shown below*" -- probably because
> >>>>> then we'd have updates to the same underlying regular file via both
> >>>>> mmap() and write(), and the interactions between those are really tricky
> >>>>> (= best avoided).
> >>>>>
> >>>>> One of my favorite questions over the years used to be posting a matrix
> >>>>> of possible mmap() and file descriptor r/w/sync interactions, with the
> >>>>> outcomes as they were "implied" by POSIX, and then asking at the bottom,
> >>>>> "is my understanding correct?" I've never ever received an answer to
> >>>>> that. :)
> >>>>
> >>>> In my opinion, it's feasible to r/w/sync the memory devices which use a block
> >>>> backend via mmap() and write().
> >>>
> >>> Maybe. I think that would be a first in QEMU, and you'd likely have to
> >>> describe and follow a semi-formal model between fd actions and mmap()
> >>> actions.
> >>>
> >>> Here's the (unconfirmed) table I referred to earlier.
> >>>
> >>> +-------------+-----------------------------------------------------+
> >>> | change made | change visible via                                  |
> >>> | through     +-----------------+-------------+---------------------+
> >>> |             | MAP_SHARED      | MAP_PRIVATE | read()              |
> >>> +-------------+-----------------+-------------+---------------------+
> >>> | MAP_SHARED  | yes             | unspecified | depends on MS_SYNC, |
> >>> |             |                 |             | MS_ASYNC, or normal |
> >>> |             |                 |             | system activity     |
> >>> +-------------+-----------------+-------------+---------------------+
> >>> | MAP_PRIVATE | no              | no          | no                  |
> >>> +-------------+-----------------+-------------+---------------------+
> >>> | write()     | depends on      | unspecified | yes                 |
> >>> |             | MS_INVALIDATE,  |             |                     |
> >>> |             | or the system's |             |                     |
> >>> |             | read/write      |             |                     |
> >>> |             | consistency     |             |                     |
> >>> +-------------+-----------------+-------------+---------------------+
> >>>
> >>> Presumably:
> >>>
> >>> - a write() to some offset range of a regular file should be visible in
> >>> a MAP_SHARED mapping of that range after a matching msync(...,
> >>> MS_INVALIDATE) call;
> >>>
> >>> - changes through a MAP_SHARED mapping to a regular file should be
> >>> visible via read() after a matching msync(..., MS_SYNC) call returns.
> >>>
> >>> I sent this table first in 2010 to the Austin Group mailing list, and
> >>> received no comments. Then another person (on the same list) asked
> >>> basically the same questions in 2013, to which I responded with the
> >>> above assumptions / interpretations -- and received no comments from
> >>> third parties again.
> >>>
> >>> But, I'm really out of my depth here -- we're not even discussing
> >>> generic read()/write()/mmap()/munmap()/msync() interactions, but how
> >>> they would fit into the qemu block layer. And I have no idea.
> > 
> > There is no infrastructure in the block layer for mmapping image files.
> > 
> >>>>> Also... although we don't really use them in practice, some QCOW2
> >>>>> features for pflash block backends are -- or would be -- nice. (Not the
> >>>>> internal snapshots or the live RAM dumps, of course.)
> >>>>
> >>>> Regarding sparse files, can we read actual backend size data for the read-only
> >>>> pflash memory as the following steps shown?
> >>>>
> >>>> 1) Create a sparse file -- QEMU_EFI-test.raw:
> >>>>    dd if=/dev/zero of=QEMU_EFI-test.raw bs=1M seek=64 count=0
> >>>>
> >>>> 2) Read from QEMU_EFI.fd and write to QEMU_EFI-test.raw:
> >>>>    dd of="QEMU_EFI-test.raw" if="QEMU_EFI.fd" conv=notrunc
> >>>>
> >>>> 3) Use QEMU_EFI-test.raw as the read-only pflash and start qemu with the below
> >>>> patch applied:
> >>>>
> >>>> ---8>---
> >>>>
> >>>> diff --git a/hw/block/block.c b/hw/block/block.c
> >>>> index bf56c76..ad18d5e 100644
> >>>> --- a/hw/block/block.c
> >>>> +++ b/hw/block/block.c
> >>>> @@ -30,7 +30,7 @@
> >>>>  bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
> >>>>                                   Error **errp)
> >>>>  {
> >>>> -    int64_t blk_len;
> >>>> +    int64_t blk_len, actual_len;
> >>>>      int ret;
> >>>>
> >>>>      blk_len = blk_getlength(blk);
> >>>> @@ -53,7 +53,9 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
> >>>>       * block device and read only on demand.
> >>>>       */
> >>>>      assert(size <= BDRV_REQUEST_MAX_BYTES);
> >>>> -    ret = blk_pread(blk, 0, buf, size);
> >>>> +    actual_len = bdrv_get_allocated_file_size(blk_bs(blk));
> >>>> +    ret = blk_pread(blk, 0, buf,
> >>>> +            (!blk_is_read_only(blk) || actual_len < 0) ? size : actual_len);
> >>>>      if (ret < 0) {
> >>>>          error_setg_errno(errp, -ret, "can't read block backend");
> >>>>          return false;
> >>>>
> >>>
> >>> This hunk looks dubious for a general helper function. It seems to
> >>> assume that the holes are punched at the end of the file.
> >>>
> >>> Sorry, this discussion is making me uncomfortable. I don't want to
> >>> ignore your questions, but I have no idea about the right answers. This
> >>> really needs the attention of the block people.
> > 
> > Yes, this code is wrong. bdrv_get_allocated_file_size() isn't even
> > implemented in all block drivers, and when it is implemented it isn't
> > necessarily reliable (e.g. return 0 for block devices). It's fine for
> > 'qemu-img info', but that's it.
> > 
> > But even if you actually get the correct allocated file size, that's the
> > disk space used on the host filesystem, so it doesn't include any holes,
> > but it does include image format metadata, the data could possibly be
> > compressed etc. In other words, for a guest device it's completely
> > meaningless.
> > 
> > I'm not even sure why you would want to do this even in your special
> > case of a raw image that is sparse only at the end. Is it an
> > optimisation to avoid reading zeros? This ends up basically being
> > memset() for sparse blocks, so very quick. And I think you do want to
> > zero out the remaining part of the buffer anyway. So it looks to me as
> > if it were complicating the code for no use.
> 
> The primary target of this discussion is to save the memory allocated
> for UEFI firmware device. On virtual machine, we split it into two flash
> devices[1] -- one is read-only in 64MB size and another is read-write in
> 64MB size. The qemu commandline is:
> 
>    -drive file=QEMU_EFI-pflash.raw,if=pflash,format=raw,unit=0,readonly=on \
>    -drive file=empty_VARS.fd,if=pflash,format=raw,unit=1 \
> 
> Regarding the read-only one which are created from QEMU_EFI.fd, the original
> QEMU_EFI.fd is only 2MB in size and we need to stuff it to 64MB with 62MB
> 'zeros':
> 
>    dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M count=64
>    dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc
> 
> For some historical reasons such as compatibility and extensibility[2], we
> restrict both their size to 64MB -- both UEFI and qemu have cold hard
> constants.
> 
> This will consume a large amount of memory when running multiple VM
> simultaneously (each guest needs 128MB).

Okay, so your problem is that blk_pread() writes to the whole buffer,
writing explicit zeroes for unallocated parts of the image, while you
would like to leave those parts of the buffer untouched so that we don't
actually allocate the memory, but can just use the shared zero page.

If you just want to read the non-zero parts of the image, that can be
done by using a loop that calls bdrv_block_status() and only reads from
the image if the BDRV_BLOCK_ZERO bit is clear.

Would this solve your problem?

Kevin


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] hw/arm/virt: use variable size of flash device to save memory
@ 2019-04-12  1:52                                 ` Xiang Zheng
  0 siblings, 0 replies; 25+ messages in thread
From: Xiang Zheng @ 2019-04-12  1:52 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Markus Armbruster, Laszlo Ersek, Peter Maydell, Ard Biesheuvel,
	QEMU Developers, qemu-arm, Heyi Guo, wanghaibin.wang, qemu-block,
	Max Reitz, Stefan Hajnoczi

On 2019/4/11 20:22, Kevin Wolf wrote:
> Okay, so your problem is that blk_pread() writes to the whole buffer,
> writing explicit zeroes for unallocated parts of the image, while you
> would like to leave those parts of the buffer untouched so that we don't
> actually allocate the memory, but can just use the shared zero page.
> 
> If you just want to read the non-zero parts of the image, that can be
> done by using a loop that calls bdrv_block_status() and only reads from
> the image if the BDRV_BLOCK_ZERO bit is clear.
> 
> Would this solve your problem?

Sounds good! What if guest tried to read/write the zero parts?

-- 

Thanks,
Xiang

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] hw/arm/virt: use variable size of flash device to save memory
@ 2019-04-12  1:52                                 ` Xiang Zheng
  0 siblings, 0 replies; 25+ messages in thread
From: Xiang Zheng @ 2019-04-12  1:52 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Peter Maydell, qemu-block, Ard Biesheuvel, QEMU Developers,
	Markus Armbruster, qemu-arm, Stefan Hajnoczi, Heyi Guo,
	wanghaibin.wang, Max Reitz, Laszlo Ersek

On 2019/4/11 20:22, Kevin Wolf wrote:
> Okay, so your problem is that blk_pread() writes to the whole buffer,
> writing explicit zeroes for unallocated parts of the image, while you
> would like to leave those parts of the buffer untouched so that we don't
> actually allocate the memory, but can just use the shared zero page.
> 
> If you just want to read the non-zero parts of the image, that can be
> done by using a loop that calls bdrv_block_status() and only reads from
> the image if the BDRV_BLOCK_ZERO bit is clear.
> 
> Would this solve your problem?

Sounds good! What if guest tried to read/write the zero parts?

-- 

Thanks,
Xiang




^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] hw/arm/virt: use variable size of flash device to save memory
  2019-04-11  7:15                             ` Markus Armbruster
@ 2019-04-12  9:26                               ` Xiang Zheng
  0 siblings, 0 replies; 25+ messages in thread
From: Xiang Zheng @ 2019-04-12  9:26 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Kevin Wolf, Peter Maydell, qemu-block, Ard Biesheuvel,
	QEMU Developers, Max Reitz, qemu-arm, Stefan Hajnoczi, Heyi Guo,
	wanghaibin.wang, Laszlo Ersek



On 2019/4/11 15:15, Markus Armbruster wrote:
>> For some historical reasons such as compatibility and extensibility[2], we
>> restrict both their size to 64MB -- both UEFI and qemu have cold hard
>> constants.
> These reasons aren't historical.  But they're valid, and that's all that
> matters :)
> 
>> This will consume a large amount of memory when running multiple VM
>> simultaneously (each guest needs 128MB).
> Understood.
> 
>> There are accepted ideas proposed by Markus and Laszlo from prior discussion[3]:
>>
>> 1) Getting flash memory size from a machine type or property.
>>
>> 2) Map the read-only part so it is shared among guests; Map the read-write
>> part normally.
>>
>> The first idea called explicit configuration which may break migration.
> For any configuration that can break migration, the solution is to have
> identical configuration on source and destination.
> 
>> For the second idea, the only way I can imagine is using a file mapping to
>> read or write pflash devices so that we can allocate memory on demand. So I
>> tried to fit mmap() actions into the block layer[4]. However I am not sure
>> that maping image file whether can work fine for block layer.
> It adds an odd special case to the block layer.  But then flash devices
> are an odd user of the block layer.
> 
> Normal users use the block layer as a *block* layer: to read and write
> blocks.
> 
> Our flash devices don't do that.  They are *memory* devices, not *block*
> devices.  Pretty much the only thing the two have in common is
> persistence of data.  The block layer provides persistence, and I figure
> that's why it got pressed into service here.
> 
> What the flash devices do is keep the complete image contents in memory,
> with changes written through to disk.
> 
> This is pretty much exactly what mmap() does, so using it suggests
> itself.
> 
> It's very much *not* what the block layer does.  mmap() is not even
> possible for anything but a file backend.
> 
> With mmap(), memory gets faulted in on demand, and sharing read-only
> memory with other VMs should be feasible.  The question is how to use it
> here.  Is there a sane way to use it without rewriting the flash devices
> not to use the block layer?

There are two important changes we may need to pay more attention to if we
use mmap():

1) Sync/Update flash content from guest memory to a file backend. With mmap()
we might not have to do the extra sync/update as the block layer do in
pflash_write().
2) Live migration.

> 
>> On the other hand, Laszlo had mentioned the features of QCOW2 and sparse
>> files which may be nice for pflash block backends. But I am not familiar with
>> it.
> Offhand, two QCOW2 features come to mind:
> 
> 1. Sparse storage even when the storage substrate can't do sparse.  I
> consider this one largely irrelevant.  Any decent file system can do
> sparse.  Archival and network transfer can compress instead.
> 
> 2. Snapshots to go with disk snapshots.  *Live* snapshots (the ones that
> include memory) already have the flash contents in the memory part, so
> this sems only relevant for "dead" snapshots.
> 
> Anything else?
> 

No more, thanks.:D

Kevin came up with a good idea which might solve my problem.
See https://lists.nongnu.org/archive/html/qemu-devel/2019-04/msg01865.html

-- 

Thanks,
Xiang

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] hw/arm/virt: use variable size of flash device to save memory
@ 2019-04-12  9:50                                   ` Xiang Zheng
  0 siblings, 0 replies; 25+ messages in thread
From: Xiang Zheng @ 2019-04-12  9:50 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Markus Armbruster, Laszlo Ersek, Peter Maydell, Ard Biesheuvel,
	QEMU Developers, qemu-arm, Heyi Guo, wanghaibin.wang, qemu-block,
	Max Reitz, Stefan Hajnoczi


On 2019/4/12 9:52, Xiang Zheng wrote:
> On 2019/4/11 20:22, Kevin Wolf wrote:
>> Okay, so your problem is that blk_pread() writes to the whole buffer,
>> writing explicit zeroes for unallocated parts of the image, while you
>> would like to leave those parts of the buffer untouched so that we don't
>> actually allocate the memory, but can just use the shared zero page.
>>
>> If you just want to read the non-zero parts of the image, that can be
>> done by using a loop that calls bdrv_block_status() and only reads from
>> the image if the BDRV_BLOCK_ZERO bit is clear.
>>
>> Would this solve your problem?
> 
> Sounds good! What if guest tried to read/write the zero parts?
> 

I wrote the below patch (refer to bdrv_make_zero()) for test, it seems
that everything is OK and the memory is also exactly allocated on demand.

This requires pflash devices to use sparse files backend. Thus I have to
create images like:

   dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M seek=64 count=0
   dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc

   dd of="empty_VARS.fd" if="/dev/zero" bs=1M seek=64 count=0


---8>---

diff --git a/block/block-backend.c b/block/block-backend.c
index f78e82a..ed8ca87 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1379,6 +1379,12 @@ BlockAIOCB *blk_aio_pwrite_zeroes(BlockBackend *blk, int64_t offset,
                         flags | BDRV_REQ_ZERO_WRITE, cb, opaque);
 }

+int blk_pread_nonzeroes(BlockBackend *blk, void *buf)
+{
+    int ret = bdrv_pread_nonzeroes(blk->root, buf);
+    return ret;
+}
+
 int blk_pread(BlockBackend *blk, int64_t offset, void *buf, int count)
 {
     int ret = blk_prw(blk, offset, buf, count, blk_read_entry, 0);
diff --git a/block/io.c b/block/io.c
index dfc153b..83e5ea7 100644
--- a/block/io.c
+++ b/block/io.c
@@ -882,6 +882,38 @@ int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset,
                         BDRV_REQ_ZERO_WRITE | flags);
 }

+int bdrv_pread_nonzeroes(BdrvChild *child, void *buf)
+{
+    int ret;
+    int64_t target_size, bytes, offset = 0;
+    BlockDriverState *bs = child->bs;
+
+    target_size = bdrv_getlength(bs);
+    if (target_size < 0) {
+        return target_size;
+    }
+
+    for (;;) {
+        bytes = MIN(target_size - offset, BDRV_REQUEST_MAX_BYTES);
+        if (bytes <= 0) {
+            return 0;
+        }
+        ret = bdrv_block_status(bs, offset, bytes, &bytes, NULL, NULL);
+        if (ret < 0) {
+            return ret;
+        }
+        if (ret & BDRV_BLOCK_ZERO) {
+            offset += bytes;
+            continue;
+        }
+        ret = bdrv_pread(child, offset, buf, bytes);
+        if (ret < 0) {
+            return ret;
+        }
+        offset += bytes;
+    }
+}
+
 /*
  * Completely zero out a block device with the help of bdrv_pwrite_zeroes.
  * The operation is sped up by checking the block status and only writing
diff --git a/hw/block/block.c b/hw/block/block.c
index bf56c76..e3c67f8 100644
--- a/hw/block/block.c
+++ b/hw/block/block.c
@@ -53,7 +53,7 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
      * block device and read only on demand.
      */
     assert(size <= BDRV_REQUEST_MAX_BYTES);
-    ret = blk_pread(blk, 0, buf, size);
+    ret = blk_pread_nonzeroes(blk, buf);
     if (ret < 0) {
         error_setg_errno(errp, -ret, "can't read block backend");
         return false;
diff --git a/include/block/block.h b/include/block/block.h
index c7a2619..d0e06cf 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -322,6 +322,7 @@ int bdrv_write(BdrvChild *child, int64_t sector_num,
                const uint8_t *buf, int nb_sectors);
 int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset,
                        int bytes, BdrvRequestFlags flags);
+int bdrv_pread_nonzeroes(BdrvChild *child, void *buf);
 int bdrv_make_zero(BdrvChild *child, BdrvRequestFlags flags);
 int bdrv_pread(BdrvChild *child, int64_t offset, void *buf, int bytes);
 int bdrv_preadv(BdrvChild *child, int64_t offset, QEMUIOVector *qiov);
diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h
index 3be05c2..5d349d2 100644
--- a/include/sysemu/block-backend.h
+++ b/include/sysemu/block-backend.h
@@ -129,6 +129,7 @@ int blk_pwrite_zeroes(BlockBackend *blk, int64_t offset,
 BlockAIOCB *blk_aio_pwrite_zeroes(BlockBackend *blk, int64_t offset,
                                   int bytes, BdrvRequestFlags flags,
                                   BlockCompletionFunc *cb, void *opaque);
+int blk_pread_nonzeroes(BlockBackend *blk, void *buf);
 int blk_make_zero(BlockBackend *blk, BdrvRequestFlags flags);
 int blk_pread(BlockBackend *blk, int64_t offset, void *buf, int bytes);
 int blk_pwrite(BlockBackend *blk, int64_t offset, const void *buf, int bytes,
-- 
1.8.3.1



-- 

Thanks,
Xiang

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] hw/arm/virt: use variable size of flash device to save memory
@ 2019-04-12  9:50                                   ` Xiang Zheng
  0 siblings, 0 replies; 25+ messages in thread
From: Xiang Zheng @ 2019-04-12  9:50 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Peter Maydell, qemu-block, Ard Biesheuvel, QEMU Developers,
	Markus Armbruster, qemu-arm, Stefan Hajnoczi, Heyi Guo,
	wanghaibin.wang, Max Reitz, Laszlo Ersek


On 2019/4/12 9:52, Xiang Zheng wrote:
> On 2019/4/11 20:22, Kevin Wolf wrote:
>> Okay, so your problem is that blk_pread() writes to the whole buffer,
>> writing explicit zeroes for unallocated parts of the image, while you
>> would like to leave those parts of the buffer untouched so that we don't
>> actually allocate the memory, but can just use the shared zero page.
>>
>> If you just want to read the non-zero parts of the image, that can be
>> done by using a loop that calls bdrv_block_status() and only reads from
>> the image if the BDRV_BLOCK_ZERO bit is clear.
>>
>> Would this solve your problem?
> 
> Sounds good! What if guest tried to read/write the zero parts?
> 

I wrote the below patch (refer to bdrv_make_zero()) for test, it seems
that everything is OK and the memory is also exactly allocated on demand.

This requires pflash devices to use sparse files backend. Thus I have to
create images like:

   dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M seek=64 count=0
   dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc

   dd of="empty_VARS.fd" if="/dev/zero" bs=1M seek=64 count=0


---8>---

diff --git a/block/block-backend.c b/block/block-backend.c
index f78e82a..ed8ca87 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1379,6 +1379,12 @@ BlockAIOCB *blk_aio_pwrite_zeroes(BlockBackend *blk, int64_t offset,
                         flags | BDRV_REQ_ZERO_WRITE, cb, opaque);
 }

+int blk_pread_nonzeroes(BlockBackend *blk, void *buf)
+{
+    int ret = bdrv_pread_nonzeroes(blk->root, buf);
+    return ret;
+}
+
 int blk_pread(BlockBackend *blk, int64_t offset, void *buf, int count)
 {
     int ret = blk_prw(blk, offset, buf, count, blk_read_entry, 0);
diff --git a/block/io.c b/block/io.c
index dfc153b..83e5ea7 100644
--- a/block/io.c
+++ b/block/io.c
@@ -882,6 +882,38 @@ int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset,
                         BDRV_REQ_ZERO_WRITE | flags);
 }

+int bdrv_pread_nonzeroes(BdrvChild *child, void *buf)
+{
+    int ret;
+    int64_t target_size, bytes, offset = 0;
+    BlockDriverState *bs = child->bs;
+
+    target_size = bdrv_getlength(bs);
+    if (target_size < 0) {
+        return target_size;
+    }
+
+    for (;;) {
+        bytes = MIN(target_size - offset, BDRV_REQUEST_MAX_BYTES);
+        if (bytes <= 0) {
+            return 0;
+        }
+        ret = bdrv_block_status(bs, offset, bytes, &bytes, NULL, NULL);
+        if (ret < 0) {
+            return ret;
+        }
+        if (ret & BDRV_BLOCK_ZERO) {
+            offset += bytes;
+            continue;
+        }
+        ret = bdrv_pread(child, offset, buf, bytes);
+        if (ret < 0) {
+            return ret;
+        }
+        offset += bytes;
+    }
+}
+
 /*
  * Completely zero out a block device with the help of bdrv_pwrite_zeroes.
  * The operation is sped up by checking the block status and only writing
diff --git a/hw/block/block.c b/hw/block/block.c
index bf56c76..e3c67f8 100644
--- a/hw/block/block.c
+++ b/hw/block/block.c
@@ -53,7 +53,7 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
      * block device and read only on demand.
      */
     assert(size <= BDRV_REQUEST_MAX_BYTES);
-    ret = blk_pread(blk, 0, buf, size);
+    ret = blk_pread_nonzeroes(blk, buf);
     if (ret < 0) {
         error_setg_errno(errp, -ret, "can't read block backend");
         return false;
diff --git a/include/block/block.h b/include/block/block.h
index c7a2619..d0e06cf 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -322,6 +322,7 @@ int bdrv_write(BdrvChild *child, int64_t sector_num,
                const uint8_t *buf, int nb_sectors);
 int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset,
                        int bytes, BdrvRequestFlags flags);
+int bdrv_pread_nonzeroes(BdrvChild *child, void *buf);
 int bdrv_make_zero(BdrvChild *child, BdrvRequestFlags flags);
 int bdrv_pread(BdrvChild *child, int64_t offset, void *buf, int bytes);
 int bdrv_preadv(BdrvChild *child, int64_t offset, QEMUIOVector *qiov);
diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h
index 3be05c2..5d349d2 100644
--- a/include/sysemu/block-backend.h
+++ b/include/sysemu/block-backend.h
@@ -129,6 +129,7 @@ int blk_pwrite_zeroes(BlockBackend *blk, int64_t offset,
 BlockAIOCB *blk_aio_pwrite_zeroes(BlockBackend *blk, int64_t offset,
                                   int bytes, BdrvRequestFlags flags,
                                   BlockCompletionFunc *cb, void *opaque);
+int blk_pread_nonzeroes(BlockBackend *blk, void *buf);
 int blk_make_zero(BlockBackend *blk, BdrvRequestFlags flags);
 int blk_pread(BlockBackend *blk, int64_t offset, void *buf, int bytes);
 int blk_pwrite(BlockBackend *blk, int64_t offset, const void *buf, int bytes,
-- 
1.8.3.1



-- 

Thanks,
Xiang




^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] hw/arm/virt: use variable size of flash device to save memory
@ 2019-04-12 10:57                                     ` Kevin Wolf
  0 siblings, 0 replies; 25+ messages in thread
From: Kevin Wolf @ 2019-04-12 10:57 UTC (permalink / raw)
  To: Xiang Zheng
  Cc: Markus Armbruster, Laszlo Ersek, Peter Maydell, Ard Biesheuvel,
	QEMU Developers, qemu-arm, Heyi Guo, wanghaibin.wang, qemu-block,
	Max Reitz, Stefan Hajnoczi

Am 12.04.2019 um 11:50 hat Xiang Zheng geschrieben:
> 
> On 2019/4/12 9:52, Xiang Zheng wrote:
> > On 2019/4/11 20:22, Kevin Wolf wrote:
> >> Okay, so your problem is that blk_pread() writes to the whole buffer,
> >> writing explicit zeroes for unallocated parts of the image, while you
> >> would like to leave those parts of the buffer untouched so that we don't
> >> actually allocate the memory, but can just use the shared zero page.
> >>
> >> If you just want to read the non-zero parts of the image, that can be
> >> done by using a loop that calls bdrv_block_status() and only reads from
> >> the image if the BDRV_BLOCK_ZERO bit is clear.
> >>
> >> Would this solve your problem?
> > 
> > Sounds good! What if guest tried to read/write the zero parts?
> > 
> 
> I wrote the below patch (refer to bdrv_make_zero()) for test, it seems
> that everything is OK and the memory is also exactly allocated on demand.
> 
> This requires pflash devices to use sparse files backend. Thus I have to
> create images like:
> 
>    dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M seek=64 count=0
>    dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc
> 
>    dd of="empty_VARS.fd" if="/dev/zero" bs=1M seek=64 count=0
> 
> 
> ---8>---
> 
> diff --git a/block/block-backend.c b/block/block-backend.c
> index f78e82a..ed8ca87 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c
> @@ -1379,6 +1379,12 @@ BlockAIOCB *blk_aio_pwrite_zeroes(BlockBackend *blk, int64_t offset,
>                          flags | BDRV_REQ_ZERO_WRITE, cb, opaque);
>  }
> 
> +int blk_pread_nonzeroes(BlockBackend *blk, void *buf)
> +{
> +    int ret = bdrv_pread_nonzeroes(blk->root, buf);
> +    return ret;
> +}

I don't think this deserves a place in the public block layer interface,
as it's only a single device that makes use of it.

Maybe you wrote things this way because there is no blk_block_status(),
but you can get the BlockDriverState with blk_bs(blk) and then implement
everything inside hw/block/block.c.

>  int blk_pread(BlockBackend *blk, int64_t offset, void *buf, int count)
>  {
>      int ret = blk_prw(blk, offset, buf, count, blk_read_entry, 0);
> diff --git a/block/io.c b/block/io.c
> index dfc153b..83e5ea7 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -882,6 +882,38 @@ int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset,
>                          BDRV_REQ_ZERO_WRITE | flags);
>  }
> 
> +int bdrv_pread_nonzeroes(BdrvChild *child, void *buf)
> +{
> +    int ret;
> +    int64_t target_size, bytes, offset = 0;
> +    BlockDriverState *bs = child->bs;
> +
> +    target_size = bdrv_getlength(bs);
> +    if (target_size < 0) {
> +        return target_size;
> +    }
> +
> +    for (;;) {
> +        bytes = MIN(target_size - offset, BDRV_REQUEST_MAX_BYTES);
> +        if (bytes <= 0) {
> +            return 0;
> +        }
> +        ret = bdrv_block_status(bs, offset, bytes, &bytes, NULL, NULL);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +        if (ret & BDRV_BLOCK_ZERO) {
> +            offset += bytes;
> +            continue;
> +        }
> +        ret = bdrv_pread(child, offset, buf, bytes);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +        offset += bytes;

I think the code becomes simpler the other way round:

    if (!(ret & BDRV_BLOCK_ZERO)) {
        ret = bdrv_pread(child, offset, buf, bytes);
        if (ret < 0) {
            return ret;
        }
    }
    offset += bytes;

You don't increment buf, so if you have a hole in the file, this will
corrupt the buffer. You need to either increment buf, too, or use
(uint8_t*) buf + offset for the bdrv_pread() call.

> +    }
> +}
> +
>  /*
>   * Completely zero out a block device with the help of bdrv_pwrite_zeroes.
>   * The operation is sped up by checking the block status and only writing

Kevin

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] hw/arm/virt: use variable size of flash device to save memory
@ 2019-04-12 10:57                                     ` Kevin Wolf
  0 siblings, 0 replies; 25+ messages in thread
From: Kevin Wolf @ 2019-04-12 10:57 UTC (permalink / raw)
  To: Xiang Zheng
  Cc: Peter Maydell, qemu-block, Ard Biesheuvel, QEMU Developers,
	Markus Armbruster, qemu-arm, Stefan Hajnoczi, Heyi Guo,
	wanghaibin.wang, Max Reitz, Laszlo Ersek

Am 12.04.2019 um 11:50 hat Xiang Zheng geschrieben:
> 
> On 2019/4/12 9:52, Xiang Zheng wrote:
> > On 2019/4/11 20:22, Kevin Wolf wrote:
> >> Okay, so your problem is that blk_pread() writes to the whole buffer,
> >> writing explicit zeroes for unallocated parts of the image, while you
> >> would like to leave those parts of the buffer untouched so that we don't
> >> actually allocate the memory, but can just use the shared zero page.
> >>
> >> If you just want to read the non-zero parts of the image, that can be
> >> done by using a loop that calls bdrv_block_status() and only reads from
> >> the image if the BDRV_BLOCK_ZERO bit is clear.
> >>
> >> Would this solve your problem?
> > 
> > Sounds good! What if guest tried to read/write the zero parts?
> > 
> 
> I wrote the below patch (refer to bdrv_make_zero()) for test, it seems
> that everything is OK and the memory is also exactly allocated on demand.
> 
> This requires pflash devices to use sparse files backend. Thus I have to
> create images like:
> 
>    dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M seek=64 count=0
>    dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc
> 
>    dd of="empty_VARS.fd" if="/dev/zero" bs=1M seek=64 count=0
> 
> 
> ---8>---
> 
> diff --git a/block/block-backend.c b/block/block-backend.c
> index f78e82a..ed8ca87 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c
> @@ -1379,6 +1379,12 @@ BlockAIOCB *blk_aio_pwrite_zeroes(BlockBackend *blk, int64_t offset,
>                          flags | BDRV_REQ_ZERO_WRITE, cb, opaque);
>  }
> 
> +int blk_pread_nonzeroes(BlockBackend *blk, void *buf)
> +{
> +    int ret = bdrv_pread_nonzeroes(blk->root, buf);
> +    return ret;
> +}

I don't think this deserves a place in the public block layer interface,
as it's only a single device that makes use of it.

Maybe you wrote things this way because there is no blk_block_status(),
but you can get the BlockDriverState with blk_bs(blk) and then implement
everything inside hw/block/block.c.

>  int blk_pread(BlockBackend *blk, int64_t offset, void *buf, int count)
>  {
>      int ret = blk_prw(blk, offset, buf, count, blk_read_entry, 0);
> diff --git a/block/io.c b/block/io.c
> index dfc153b..83e5ea7 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -882,6 +882,38 @@ int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset,
>                          BDRV_REQ_ZERO_WRITE | flags);
>  }
> 
> +int bdrv_pread_nonzeroes(BdrvChild *child, void *buf)
> +{
> +    int ret;
> +    int64_t target_size, bytes, offset = 0;
> +    BlockDriverState *bs = child->bs;
> +
> +    target_size = bdrv_getlength(bs);
> +    if (target_size < 0) {
> +        return target_size;
> +    }
> +
> +    for (;;) {
> +        bytes = MIN(target_size - offset, BDRV_REQUEST_MAX_BYTES);
> +        if (bytes <= 0) {
> +            return 0;
> +        }
> +        ret = bdrv_block_status(bs, offset, bytes, &bytes, NULL, NULL);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +        if (ret & BDRV_BLOCK_ZERO) {
> +            offset += bytes;
> +            continue;
> +        }
> +        ret = bdrv_pread(child, offset, buf, bytes);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +        offset += bytes;

I think the code becomes simpler the other way round:

    if (!(ret & BDRV_BLOCK_ZERO)) {
        ret = bdrv_pread(child, offset, buf, bytes);
        if (ret < 0) {
            return ret;
        }
    }
    offset += bytes;

You don't increment buf, so if you have a hole in the file, this will
corrupt the buffer. You need to either increment buf, too, or use
(uint8_t*) buf + offset for the bdrv_pread() call.

> +    }
> +}
> +
>  /*
>   * Completely zero out a block device with the help of bdrv_pwrite_zeroes.
>   * The operation is sped up by checking the block status and only writing

Kevin


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] hw/arm/virt: use variable size of flash device to save memory
@ 2019-04-15  2:39                                       ` Xiang Zheng
  0 siblings, 0 replies; 25+ messages in thread
From: Xiang Zheng @ 2019-04-15  2:39 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Markus Armbruster, Laszlo Ersek, Peter Maydell, Ard Biesheuvel,
	QEMU Developers, qemu-arm, Heyi Guo, wanghaibin.wang, qemu-block,
	Max Reitz, Stefan Hajnoczi

On 2019/4/12 18:57, Kevin Wolf wrote:
> Am 12.04.2019 um 11:50 hat Xiang Zheng geschrieben:
>>
>> On 2019/4/12 9:52, Xiang Zheng wrote:
>>> On 2019/4/11 20:22, Kevin Wolf wrote:
>>>> Okay, so your problem is that blk_pread() writes to the whole buffer,
>>>> writing explicit zeroes for unallocated parts of the image, while you
>>>> would like to leave those parts of the buffer untouched so that we don't
>>>> actually allocate the memory, but can just use the shared zero page.
>>>>
>>>> If you just want to read the non-zero parts of the image, that can be
>>>> done by using a loop that calls bdrv_block_status() and only reads from
>>>> the image if the BDRV_BLOCK_ZERO bit is clear.
>>>>
>>>> Would this solve your problem?
>>>
>>> Sounds good! What if guest tried to read/write the zero parts?
>>>
>>
>> I wrote the below patch (refer to bdrv_make_zero()) for test, it seems
>> that everything is OK and the memory is also exactly allocated on demand.
>>
>> This requires pflash devices to use sparse files backend. Thus I have to
>> create images like:
>>
>>    dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M seek=64 count=0
>>    dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc
>>
>>    dd of="empty_VARS.fd" if="/dev/zero" bs=1M seek=64 count=0
>>
>>
>> ---8>---
>>
>> diff --git a/block/block-backend.c b/block/block-backend.c
>> index f78e82a..ed8ca87 100644
>> --- a/block/block-backend.c
>> +++ b/block/block-backend.c
>> @@ -1379,6 +1379,12 @@ BlockAIOCB *blk_aio_pwrite_zeroes(BlockBackend *blk, int64_t offset,
>>                          flags | BDRV_REQ_ZERO_WRITE, cb, opaque);
>>  }
>>
>> +int blk_pread_nonzeroes(BlockBackend *blk, void *buf)
>> +{
>> +    int ret = bdrv_pread_nonzeroes(blk->root, buf);
>> +    return ret;
>> +}
> 
> I don't think this deserves a place in the public block layer interface,
> as it's only a single device that makes use of it.
> 
> Maybe you wrote things this way because there is no blk_block_status(),
> but you can get the BlockDriverState with blk_bs(blk) and then implement
> everything inside hw/block/block.c.

Yes, you are right.

> 
>>  int blk_pread(BlockBackend *blk, int64_t offset, void *buf, int count)
>>  {
>>      int ret = blk_prw(blk, offset, buf, count, blk_read_entry, 0);
>> diff --git a/block/io.c b/block/io.c
>> index dfc153b..83e5ea7 100644
>> --- a/block/io.c
>> +++ b/block/io.c
>> @@ -882,6 +882,38 @@ int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset,
>>                          BDRV_REQ_ZERO_WRITE | flags);
>>  }
>>
>> +int bdrv_pread_nonzeroes(BdrvChild *child, void *buf)
>> +{
>> +    int ret;
>> +    int64_t target_size, bytes, offset = 0;
>> +    BlockDriverState *bs = child->bs;
>> +
>> +    target_size = bdrv_getlength(bs);
>> +    if (target_size < 0) {
>> +        return target_size;
>> +    }
>> +
>> +    for (;;) {
>> +        bytes = MIN(target_size - offset, BDRV_REQUEST_MAX_BYTES);
>> +        if (bytes <= 0) {
>> +            return 0;
>> +        }
>> +        ret = bdrv_block_status(bs, offset, bytes, &bytes, NULL, NULL);
>> +        if (ret < 0) {
>> +            return ret;
>> +        }
>> +        if (ret & BDRV_BLOCK_ZERO) {
>> +            offset += bytes;
>> +            continue;
>> +        }
>> +        ret = bdrv_pread(child, offset, buf, bytes);
>> +        if (ret < 0) {
>> +            return ret;
>> +        }
>> +        offset += bytes;
> 
> I think the code becomes simpler the other way round:
> 
>     if (!(ret & BDRV_BLOCK_ZERO)) {
>         ret = bdrv_pread(child, offset, buf, bytes);
>         if (ret < 0) {
>             return ret;
>         }
>     }
>     offset += bytes;
> 
> You don't increment buf, so if you have a hole in the file, this will
> corrupt the buffer. You need to either increment buf, too, or use
> (uint8_t*) buf + offset for the bdrv_pread() call.
> 

Yes, I didn't notice it. I think the latter is better. Does *BDRV_BLOCK_ZERO*
mean that there are all-zeroes data or a hole in the sector? But if I use an
image filled with zeroes, it will not set BDRV_BLOCK_ZERO bit on return.

Should I resend a patch?

---8>---

>From 4dbfe4955aa9fe23404cbe1890fbe148be2ff10e Mon Sep 17 00:00:00 2001
From: Xiang Zheng <zhengxiang9@huawei.com>
Date: Sat, 13 Apr 2019 02:27:03 +0800
Subject: [PATCH] pflash: Only read non-zero parts of backend image

Currently we fill the VIRT_FLASH memory space with two 64MB NOR images
when using persistent UEFI variables on virt board. Actually we only use
a very small(non-zero) part of the memory while the rest significant
large(zero) part of memory is wasted.

So this patch checks the block status and only writes the non-zero part
into memory. This requires pflash devices to use sparse files for
backends.

Signed-off-by: Xiang Zheng <zhengxiang9@huawei.com>
---
 hw/block/block.c | 40 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/hw/block/block.c b/hw/block/block.c
index bf56c76..3cb9d4c 100644
--- a/hw/block/block.c
+++ b/hw/block/block.c
@@ -15,6 +15,44 @@
 #include "qapi/qapi-types-block.h"

 /*
+ * Read the non-zeroes parts of @blk into @buf
+ * Reading all of the @blk is expensive if the zeroes parts of @blk
+ * is large enough. Therefore check the block status and only write
+ * the non-zeroes block into @buf.
+ *
+ * Return 0 on success, non-zero on error.
+ */
+static int blk_pread_nonzeroes(BlockBackend *blk, void *buf)
+{
+    int ret;
+    int64_t target_size, bytes, offset = 0;
+    BlockDriverState *bs = blk_bs(blk);
+
+    target_size = bdrv_getlength(bs);
+    if (target_size < 0) {
+        return target_size;
+    }
+
+    for (;;) {
+        bytes = MIN(target_size - offset, BDRV_REQUEST_MAX_SECTORS);
+        if (bytes <= 0) {
+            return 0;
+        }
+        ret = bdrv_block_status(bs, offset, bytes, &bytes, NULL, NULL);
+        if (ret < 0) {
+            return ret;
+        }
+        if (!(ret & BDRV_BLOCK_ZERO)) {
+            ret = bdrv_pread(bs->file, offset, (uint8_t *) buf + offset, bytes);
+            if (ret < 0) {
+                return ret;
+            }
+        }
+        offset += bytes;
+    }
+}
+
+/*
  * Read the entire contents of @blk into @buf.
  * @blk's contents must be @size bytes, and @size must be at most
  * BDRV_REQUEST_MAX_BYTES.
@@ -53,7 +91,7 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
      * block device and read only on demand.
      */
     assert(size <= BDRV_REQUEST_MAX_BYTES);
-    ret = blk_pread(blk, 0, buf, size);
+    ret = blk_pread_nonzeroes(blk, buf);
     if (ret < 0) {
         error_setg_errno(errp, -ret, "can't read block backend");
         return false;
-- 
1.8.3.1



-- 

Thanks,
Xiang

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] hw/arm/virt: use variable size of flash device to save memory
@ 2019-04-15  2:39                                       ` Xiang Zheng
  0 siblings, 0 replies; 25+ messages in thread
From: Xiang Zheng @ 2019-04-15  2:39 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Peter Maydell, qemu-block, Ard Biesheuvel, QEMU Developers,
	Markus Armbruster, qemu-arm, Stefan Hajnoczi, Heyi Guo,
	wanghaibin.wang, Max Reitz, Laszlo Ersek

On 2019/4/12 18:57, Kevin Wolf wrote:
> Am 12.04.2019 um 11:50 hat Xiang Zheng geschrieben:
>>
>> On 2019/4/12 9:52, Xiang Zheng wrote:
>>> On 2019/4/11 20:22, Kevin Wolf wrote:
>>>> Okay, so your problem is that blk_pread() writes to the whole buffer,
>>>> writing explicit zeroes for unallocated parts of the image, while you
>>>> would like to leave those parts of the buffer untouched so that we don't
>>>> actually allocate the memory, but can just use the shared zero page.
>>>>
>>>> If you just want to read the non-zero parts of the image, that can be
>>>> done by using a loop that calls bdrv_block_status() and only reads from
>>>> the image if the BDRV_BLOCK_ZERO bit is clear.
>>>>
>>>> Would this solve your problem?
>>>
>>> Sounds good! What if guest tried to read/write the zero parts?
>>>
>>
>> I wrote the below patch (refer to bdrv_make_zero()) for test, it seems
>> that everything is OK and the memory is also exactly allocated on demand.
>>
>> This requires pflash devices to use sparse files backend. Thus I have to
>> create images like:
>>
>>    dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M seek=64 count=0
>>    dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc
>>
>>    dd of="empty_VARS.fd" if="/dev/zero" bs=1M seek=64 count=0
>>
>>
>> ---8>---
>>
>> diff --git a/block/block-backend.c b/block/block-backend.c
>> index f78e82a..ed8ca87 100644
>> --- a/block/block-backend.c
>> +++ b/block/block-backend.c
>> @@ -1379,6 +1379,12 @@ BlockAIOCB *blk_aio_pwrite_zeroes(BlockBackend *blk, int64_t offset,
>>                          flags | BDRV_REQ_ZERO_WRITE, cb, opaque);
>>  }
>>
>> +int blk_pread_nonzeroes(BlockBackend *blk, void *buf)
>> +{
>> +    int ret = bdrv_pread_nonzeroes(blk->root, buf);
>> +    return ret;
>> +}
> 
> I don't think this deserves a place in the public block layer interface,
> as it's only a single device that makes use of it.
> 
> Maybe you wrote things this way because there is no blk_block_status(),
> but you can get the BlockDriverState with blk_bs(blk) and then implement
> everything inside hw/block/block.c.

Yes, you are right.

> 
>>  int blk_pread(BlockBackend *blk, int64_t offset, void *buf, int count)
>>  {
>>      int ret = blk_prw(blk, offset, buf, count, blk_read_entry, 0);
>> diff --git a/block/io.c b/block/io.c
>> index dfc153b..83e5ea7 100644
>> --- a/block/io.c
>> +++ b/block/io.c
>> @@ -882,6 +882,38 @@ int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset,
>>                          BDRV_REQ_ZERO_WRITE | flags);
>>  }
>>
>> +int bdrv_pread_nonzeroes(BdrvChild *child, void *buf)
>> +{
>> +    int ret;
>> +    int64_t target_size, bytes, offset = 0;
>> +    BlockDriverState *bs = child->bs;
>> +
>> +    target_size = bdrv_getlength(bs);
>> +    if (target_size < 0) {
>> +        return target_size;
>> +    }
>> +
>> +    for (;;) {
>> +        bytes = MIN(target_size - offset, BDRV_REQUEST_MAX_BYTES);
>> +        if (bytes <= 0) {
>> +            return 0;
>> +        }
>> +        ret = bdrv_block_status(bs, offset, bytes, &bytes, NULL, NULL);
>> +        if (ret < 0) {
>> +            return ret;
>> +        }
>> +        if (ret & BDRV_BLOCK_ZERO) {
>> +            offset += bytes;
>> +            continue;
>> +        }
>> +        ret = bdrv_pread(child, offset, buf, bytes);
>> +        if (ret < 0) {
>> +            return ret;
>> +        }
>> +        offset += bytes;
> 
> I think the code becomes simpler the other way round:
> 
>     if (!(ret & BDRV_BLOCK_ZERO)) {
>         ret = bdrv_pread(child, offset, buf, bytes);
>         if (ret < 0) {
>             return ret;
>         }
>     }
>     offset += bytes;
> 
> You don't increment buf, so if you have a hole in the file, this will
> corrupt the buffer. You need to either increment buf, too, or use
> (uint8_t*) buf + offset for the bdrv_pread() call.
> 

Yes, I didn't notice it. I think the latter is better. Does *BDRV_BLOCK_ZERO*
mean that there are all-zeroes data or a hole in the sector? But if I use an
image filled with zeroes, it will not set BDRV_BLOCK_ZERO bit on return.

Should I resend a patch?

---8>---

From 4dbfe4955aa9fe23404cbe1890fbe148be2ff10e Mon Sep 17 00:00:00 2001
From: Xiang Zheng <zhengxiang9@huawei.com>
Date: Sat, 13 Apr 2019 02:27:03 +0800
Subject: [PATCH] pflash: Only read non-zero parts of backend image

Currently we fill the VIRT_FLASH memory space with two 64MB NOR images
when using persistent UEFI variables on virt board. Actually we only use
a very small(non-zero) part of the memory while the rest significant
large(zero) part of memory is wasted.

So this patch checks the block status and only writes the non-zero part
into memory. This requires pflash devices to use sparse files for
backends.

Signed-off-by: Xiang Zheng <zhengxiang9@huawei.com>
---
 hw/block/block.c | 40 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/hw/block/block.c b/hw/block/block.c
index bf56c76..3cb9d4c 100644
--- a/hw/block/block.c
+++ b/hw/block/block.c
@@ -15,6 +15,44 @@
 #include "qapi/qapi-types-block.h"

 /*
+ * Read the non-zeroes parts of @blk into @buf
+ * Reading all of the @blk is expensive if the zeroes parts of @blk
+ * is large enough. Therefore check the block status and only write
+ * the non-zeroes block into @buf.
+ *
+ * Return 0 on success, non-zero on error.
+ */
+static int blk_pread_nonzeroes(BlockBackend *blk, void *buf)
+{
+    int ret;
+    int64_t target_size, bytes, offset = 0;
+    BlockDriverState *bs = blk_bs(blk);
+
+    target_size = bdrv_getlength(bs);
+    if (target_size < 0) {
+        return target_size;
+    }
+
+    for (;;) {
+        bytes = MIN(target_size - offset, BDRV_REQUEST_MAX_SECTORS);
+        if (bytes <= 0) {
+            return 0;
+        }
+        ret = bdrv_block_status(bs, offset, bytes, &bytes, NULL, NULL);
+        if (ret < 0) {
+            return ret;
+        }
+        if (!(ret & BDRV_BLOCK_ZERO)) {
+            ret = bdrv_pread(bs->file, offset, (uint8_t *) buf + offset, bytes);
+            if (ret < 0) {
+                return ret;
+            }
+        }
+        offset += bytes;
+    }
+}
+
+/*
  * Read the entire contents of @blk into @buf.
  * @blk's contents must be @size bytes, and @size must be at most
  * BDRV_REQUEST_MAX_BYTES.
@@ -53,7 +91,7 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
      * block device and read only on demand.
      */
     assert(size <= BDRV_REQUEST_MAX_BYTES);
-    ret = blk_pread(blk, 0, buf, size);
+    ret = blk_pread_nonzeroes(blk, buf);
     if (ret < 0) {
         error_setg_errno(errp, -ret, "can't read block backend");
         return false;
-- 
1.8.3.1



-- 

Thanks,
Xiang




^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] hw/arm/virt: use variable size of flash device to save memory
@ 2019-04-22  1:37                                         ` Xiang Zheng
  0 siblings, 0 replies; 25+ messages in thread
From: Xiang Zheng @ 2019-04-22  1:37 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Markus Armbruster, Laszlo Ersek, Peter Maydell, Ard Biesheuvel,
	QEMU Developers, qemu-arm, Heyi Guo, wanghaibin.wang, qemu-block,
	Max Reitz, Stefan Hajnoczi

Ping?

On 2019/4/15 10:39, Xiang Zheng wrote:
> On 2019/4/12 18:57, Kevin Wolf wrote:
>> Am 12.04.2019 um 11:50 hat Xiang Zheng geschrieben:
>>>
>>> On 2019/4/12 9:52, Xiang Zheng wrote:
>>>> On 2019/4/11 20:22, Kevin Wolf wrote:
>>>>> Okay, so your problem is that blk_pread() writes to the whole buffer,
>>>>> writing explicit zeroes for unallocated parts of the image, while you
>>>>> would like to leave those parts of the buffer untouched so that we don't
>>>>> actually allocate the memory, but can just use the shared zero page.
>>>>>
>>>>> If you just want to read the non-zero parts of the image, that can be
>>>>> done by using a loop that calls bdrv_block_status() and only reads from
>>>>> the image if the BDRV_BLOCK_ZERO bit is clear.
>>>>>
>>>>> Would this solve your problem?
>>>>
>>>> Sounds good! What if guest tried to read/write the zero parts?
>>>>
>>>
>>> I wrote the below patch (refer to bdrv_make_zero()) for test, it seems
>>> that everything is OK and the memory is also exactly allocated on demand.
>>>
>>> This requires pflash devices to use sparse files backend. Thus I have to
>>> create images like:
>>>
>>>    dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M seek=64 count=0
>>>    dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc
>>>
>>>    dd of="empty_VARS.fd" if="/dev/zero" bs=1M seek=64 count=0
>>>
>>>
>>> ---8>---
>>>
>>> diff --git a/block/block-backend.c b/block/block-backend.c
>>> index f78e82a..ed8ca87 100644
>>> --- a/block/block-backend.c
>>> +++ b/block/block-backend.c
>>> @@ -1379,6 +1379,12 @@ BlockAIOCB *blk_aio_pwrite_zeroes(BlockBackend *blk, int64_t offset,
>>>                          flags | BDRV_REQ_ZERO_WRITE, cb, opaque);
>>>  }
>>>
>>> +int blk_pread_nonzeroes(BlockBackend *blk, void *buf)
>>> +{
>>> +    int ret = bdrv_pread_nonzeroes(blk->root, buf);
>>> +    return ret;
>>> +}
>>
>> I don't think this deserves a place in the public block layer interface,
>> as it's only a single device that makes use of it.
>>
>> Maybe you wrote things this way because there is no blk_block_status(),
>> but you can get the BlockDriverState with blk_bs(blk) and then implement
>> everything inside hw/block/block.c.
> 
> Yes, you are right.
> 
>>
>>>  int blk_pread(BlockBackend *blk, int64_t offset, void *buf, int count)
>>>  {
>>>      int ret = blk_prw(blk, offset, buf, count, blk_read_entry, 0);
>>> diff --git a/block/io.c b/block/io.c
>>> index dfc153b..83e5ea7 100644
>>> --- a/block/io.c
>>> +++ b/block/io.c
>>> @@ -882,6 +882,38 @@ int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset,
>>>                          BDRV_REQ_ZERO_WRITE | flags);
>>>  }
>>>
>>> +int bdrv_pread_nonzeroes(BdrvChild *child, void *buf)
>>> +{
>>> +    int ret;
>>> +    int64_t target_size, bytes, offset = 0;
>>> +    BlockDriverState *bs = child->bs;
>>> +
>>> +    target_size = bdrv_getlength(bs);
>>> +    if (target_size < 0) {
>>> +        return target_size;
>>> +    }
>>> +
>>> +    for (;;) {
>>> +        bytes = MIN(target_size - offset, BDRV_REQUEST_MAX_BYTES);
>>> +        if (bytes <= 0) {
>>> +            return 0;
>>> +        }
>>> +        ret = bdrv_block_status(bs, offset, bytes, &bytes, NULL, NULL);
>>> +        if (ret < 0) {
>>> +            return ret;
>>> +        }
>>> +        if (ret & BDRV_BLOCK_ZERO) {
>>> +            offset += bytes;
>>> +            continue;
>>> +        }
>>> +        ret = bdrv_pread(child, offset, buf, bytes);
>>> +        if (ret < 0) {
>>> +            return ret;
>>> +        }
>>> +        offset += bytes;
>>
>> I think the code becomes simpler the other way round:
>>
>>     if (!(ret & BDRV_BLOCK_ZERO)) {
>>         ret = bdrv_pread(child, offset, buf, bytes);
>>         if (ret < 0) {
>>             return ret;
>>         }
>>     }
>>     offset += bytes;
>>
>> You don't increment buf, so if you have a hole in the file, this will
>> corrupt the buffer. You need to either increment buf, too, or use
>> (uint8_t*) buf + offset for the bdrv_pread() call.
>>
> 
> Yes, I didn't notice it. I think the latter is better. Does *BDRV_BLOCK_ZERO*
> mean that there are all-zeroes data or a hole in the sector? But if I use an
> image filled with zeroes, it will not set BDRV_BLOCK_ZERO bit on return.
> 
> Should I resend a patch?
> 
> ---8>---
> 
>>From 4dbfe4955aa9fe23404cbe1890fbe148be2ff10e Mon Sep 17 00:00:00 2001
> From: Xiang Zheng <zhengxiang9@huawei.com>
> Date: Sat, 13 Apr 2019 02:27:03 +0800
> Subject: [PATCH] pflash: Only read non-zero parts of backend image
> 
> Currently we fill the VIRT_FLASH memory space with two 64MB NOR images
> when using persistent UEFI variables on virt board. Actually we only use
> a very small(non-zero) part of the memory while the rest significant
> large(zero) part of memory is wasted.
> 
> So this patch checks the block status and only writes the non-zero part
> into memory. This requires pflash devices to use sparse files for
> backends.
> 
> Signed-off-by: Xiang Zheng <zhengxiang9@huawei.com>
> ---
>  hw/block/block.c | 40 +++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 39 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/block/block.c b/hw/block/block.c
> index bf56c76..3cb9d4c 100644
> --- a/hw/block/block.c
> +++ b/hw/block/block.c
> @@ -15,6 +15,44 @@
>  #include "qapi/qapi-types-block.h"
> 
>  /*
> + * Read the non-zeroes parts of @blk into @buf
> + * Reading all of the @blk is expensive if the zeroes parts of @blk
> + * is large enough. Therefore check the block status and only write
> + * the non-zeroes block into @buf.
> + *
> + * Return 0 on success, non-zero on error.
> + */
> +static int blk_pread_nonzeroes(BlockBackend *blk, void *buf)
> +{
> +    int ret;
> +    int64_t target_size, bytes, offset = 0;
> +    BlockDriverState *bs = blk_bs(blk);
> +
> +    target_size = bdrv_getlength(bs);
> +    if (target_size < 0) {
> +        return target_size;
> +    }
> +
> +    for (;;) {
> +        bytes = MIN(target_size - offset, BDRV_REQUEST_MAX_SECTORS);
> +        if (bytes <= 0) {
> +            return 0;
> +        }
> +        ret = bdrv_block_status(bs, offset, bytes, &bytes, NULL, NULL);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +        if (!(ret & BDRV_BLOCK_ZERO)) {
> +            ret = bdrv_pread(bs->file, offset, (uint8_t *) buf + offset, bytes);
> +            if (ret < 0) {
> +                return ret;
> +            }
> +        }
> +        offset += bytes;
> +    }
> +}
> +
> +/*
>   * Read the entire contents of @blk into @buf.
>   * @blk's contents must be @size bytes, and @size must be at most
>   * BDRV_REQUEST_MAX_BYTES.
> @@ -53,7 +91,7 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
>       * block device and read only on demand.
>       */
>      assert(size <= BDRV_REQUEST_MAX_BYTES);
> -    ret = blk_pread(blk, 0, buf, size);
> +    ret = blk_pread_nonzeroes(blk, buf);
>      if (ret < 0) {
>          error_setg_errno(errp, -ret, "can't read block backend");
>          return false;
> 
-- 

Thanks,
Xiang

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] hw/arm/virt: use variable size of flash device to save memory
@ 2019-04-22  1:37                                         ` Xiang Zheng
  0 siblings, 0 replies; 25+ messages in thread
From: Xiang Zheng @ 2019-04-22  1:37 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Peter Maydell, qemu-block, Ard Biesheuvel, QEMU Developers,
	Markus Armbruster, qemu-arm, Stefan Hajnoczi, Heyi Guo,
	wanghaibin.wang, Max Reitz, Laszlo Ersek

Ping?

On 2019/4/15 10:39, Xiang Zheng wrote:
> On 2019/4/12 18:57, Kevin Wolf wrote:
>> Am 12.04.2019 um 11:50 hat Xiang Zheng geschrieben:
>>>
>>> On 2019/4/12 9:52, Xiang Zheng wrote:
>>>> On 2019/4/11 20:22, Kevin Wolf wrote:
>>>>> Okay, so your problem is that blk_pread() writes to the whole buffer,
>>>>> writing explicit zeroes for unallocated parts of the image, while you
>>>>> would like to leave those parts of the buffer untouched so that we don't
>>>>> actually allocate the memory, but can just use the shared zero page.
>>>>>
>>>>> If you just want to read the non-zero parts of the image, that can be
>>>>> done by using a loop that calls bdrv_block_status() and only reads from
>>>>> the image if the BDRV_BLOCK_ZERO bit is clear.
>>>>>
>>>>> Would this solve your problem?
>>>>
>>>> Sounds good! What if guest tried to read/write the zero parts?
>>>>
>>>
>>> I wrote the below patch (refer to bdrv_make_zero()) for test, it seems
>>> that everything is OK and the memory is also exactly allocated on demand.
>>>
>>> This requires pflash devices to use sparse files backend. Thus I have to
>>> create images like:
>>>
>>>    dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M seek=64 count=0
>>>    dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc
>>>
>>>    dd of="empty_VARS.fd" if="/dev/zero" bs=1M seek=64 count=0
>>>
>>>
>>> ---8>---
>>>
>>> diff --git a/block/block-backend.c b/block/block-backend.c
>>> index f78e82a..ed8ca87 100644
>>> --- a/block/block-backend.c
>>> +++ b/block/block-backend.c
>>> @@ -1379,6 +1379,12 @@ BlockAIOCB *blk_aio_pwrite_zeroes(BlockBackend *blk, int64_t offset,
>>>                          flags | BDRV_REQ_ZERO_WRITE, cb, opaque);
>>>  }
>>>
>>> +int blk_pread_nonzeroes(BlockBackend *blk, void *buf)
>>> +{
>>> +    int ret = bdrv_pread_nonzeroes(blk->root, buf);
>>> +    return ret;
>>> +}
>>
>> I don't think this deserves a place in the public block layer interface,
>> as it's only a single device that makes use of it.
>>
>> Maybe you wrote things this way because there is no blk_block_status(),
>> but you can get the BlockDriverState with blk_bs(blk) and then implement
>> everything inside hw/block/block.c.
> 
> Yes, you are right.
> 
>>
>>>  int blk_pread(BlockBackend *blk, int64_t offset, void *buf, int count)
>>>  {
>>>      int ret = blk_prw(blk, offset, buf, count, blk_read_entry, 0);
>>> diff --git a/block/io.c b/block/io.c
>>> index dfc153b..83e5ea7 100644
>>> --- a/block/io.c
>>> +++ b/block/io.c
>>> @@ -882,6 +882,38 @@ int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset,
>>>                          BDRV_REQ_ZERO_WRITE | flags);
>>>  }
>>>
>>> +int bdrv_pread_nonzeroes(BdrvChild *child, void *buf)
>>> +{
>>> +    int ret;
>>> +    int64_t target_size, bytes, offset = 0;
>>> +    BlockDriverState *bs = child->bs;
>>> +
>>> +    target_size = bdrv_getlength(bs);
>>> +    if (target_size < 0) {
>>> +        return target_size;
>>> +    }
>>> +
>>> +    for (;;) {
>>> +        bytes = MIN(target_size - offset, BDRV_REQUEST_MAX_BYTES);
>>> +        if (bytes <= 0) {
>>> +            return 0;
>>> +        }
>>> +        ret = bdrv_block_status(bs, offset, bytes, &bytes, NULL, NULL);
>>> +        if (ret < 0) {
>>> +            return ret;
>>> +        }
>>> +        if (ret & BDRV_BLOCK_ZERO) {
>>> +            offset += bytes;
>>> +            continue;
>>> +        }
>>> +        ret = bdrv_pread(child, offset, buf, bytes);
>>> +        if (ret < 0) {
>>> +            return ret;
>>> +        }
>>> +        offset += bytes;
>>
>> I think the code becomes simpler the other way round:
>>
>>     if (!(ret & BDRV_BLOCK_ZERO)) {
>>         ret = bdrv_pread(child, offset, buf, bytes);
>>         if (ret < 0) {
>>             return ret;
>>         }
>>     }
>>     offset += bytes;
>>
>> You don't increment buf, so if you have a hole in the file, this will
>> corrupt the buffer. You need to either increment buf, too, or use
>> (uint8_t*) buf + offset for the bdrv_pread() call.
>>
> 
> Yes, I didn't notice it. I think the latter is better. Does *BDRV_BLOCK_ZERO*
> mean that there are all-zeroes data or a hole in the sector? But if I use an
> image filled with zeroes, it will not set BDRV_BLOCK_ZERO bit on return.
> 
> Should I resend a patch?
> 
> ---8>---
> 
>>From 4dbfe4955aa9fe23404cbe1890fbe148be2ff10e Mon Sep 17 00:00:00 2001
> From: Xiang Zheng <zhengxiang9@huawei.com>
> Date: Sat, 13 Apr 2019 02:27:03 +0800
> Subject: [PATCH] pflash: Only read non-zero parts of backend image
> 
> Currently we fill the VIRT_FLASH memory space with two 64MB NOR images
> when using persistent UEFI variables on virt board. Actually we only use
> a very small(non-zero) part of the memory while the rest significant
> large(zero) part of memory is wasted.
> 
> So this patch checks the block status and only writes the non-zero part
> into memory. This requires pflash devices to use sparse files for
> backends.
> 
> Signed-off-by: Xiang Zheng <zhengxiang9@huawei.com>
> ---
>  hw/block/block.c | 40 +++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 39 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/block/block.c b/hw/block/block.c
> index bf56c76..3cb9d4c 100644
> --- a/hw/block/block.c
> +++ b/hw/block/block.c
> @@ -15,6 +15,44 @@
>  #include "qapi/qapi-types-block.h"
> 
>  /*
> + * Read the non-zeroes parts of @blk into @buf
> + * Reading all of the @blk is expensive if the zeroes parts of @blk
> + * is large enough. Therefore check the block status and only write
> + * the non-zeroes block into @buf.
> + *
> + * Return 0 on success, non-zero on error.
> + */
> +static int blk_pread_nonzeroes(BlockBackend *blk, void *buf)
> +{
> +    int ret;
> +    int64_t target_size, bytes, offset = 0;
> +    BlockDriverState *bs = blk_bs(blk);
> +
> +    target_size = bdrv_getlength(bs);
> +    if (target_size < 0) {
> +        return target_size;
> +    }
> +
> +    for (;;) {
> +        bytes = MIN(target_size - offset, BDRV_REQUEST_MAX_SECTORS);
> +        if (bytes <= 0) {
> +            return 0;
> +        }
> +        ret = bdrv_block_status(bs, offset, bytes, &bytes, NULL, NULL);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +        if (!(ret & BDRV_BLOCK_ZERO)) {
> +            ret = bdrv_pread(bs->file, offset, (uint8_t *) buf + offset, bytes);
> +            if (ret < 0) {
> +                return ret;
> +            }
> +        }
> +        offset += bytes;
> +    }
> +}
> +
> +/*
>   * Read the entire contents of @blk into @buf.
>   * @blk's contents must be @size bytes, and @size must be at most
>   * BDRV_REQUEST_MAX_BYTES.
> @@ -53,7 +91,7 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
>       * block device and read only on demand.
>       */
>      assert(size <= BDRV_REQUEST_MAX_BYTES);
> -    ret = blk_pread(blk, 0, buf, size);
> +    ret = blk_pread_nonzeroes(blk, buf);
>      if (ret < 0) {
>          error_setg_errno(errp, -ret, "can't read block backend");
>          return false;
> 
-- 

Thanks,
Xiang




^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2019-04-22  1:48 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20190325125142.11628-1-zhengxiang9@huawei.com>
     [not found] ` <CAFEAcA-O5+A0Q=14k2vrBg3KNM7MmqbcyhBoZZOe=AsyE-UDbw@mail.gmail.com>
     [not found]   ` <19929558-9f2d-148a-4357-d48b46f8b62b@huawei.com>
     [not found]     ` <87va06kvm6.fsf@dusky.pond.sub.org>
     [not found]       ` <59f100b4-f6e9-973d-532b-58fb172a7009@redhat.com>
     [not found]         ` <87imw5fv45.fsf@dusky.pond.sub.org>
     [not found]           ` <abe1d9cd-85bb-2ef8-630f-af28cae6378c@redhat.com>
     [not found]             ` <8736n9eb4s.fsf@dusky.pond.sub.org>
2019-04-03 14:12               ` [Qemu-devel] [RFC PATCH] hw/arm/virt: use variable size of flash device to save memory Xiang Zheng
2019-04-03 15:35                 ` Laszlo Ersek
2019-04-08 13:43                   ` Xiang Zheng
2019-04-08 16:14                     ` Laszlo Ersek
2019-04-09  3:39                       ` Xiang Zheng
2019-04-09  6:01                       ` Markus Armbruster
2019-04-09  6:01                         ` Markus Armbruster
2019-04-09  8:28                         ` Kevin Wolf
2019-04-09  8:28                           ` Kevin Wolf
2019-04-10  8:36                           ` Xiang Zheng
2019-04-10  8:36                             ` Xiang Zheng
2019-04-11  7:15                             ` Markus Armbruster
2019-04-12  9:26                               ` Xiang Zheng
2019-04-11 12:22                             ` Kevin Wolf
2019-04-11 12:22                               ` Kevin Wolf
2019-04-12  1:52                               ` Xiang Zheng
2019-04-12  1:52                                 ` Xiang Zheng
2019-04-12  9:50                                 ` Xiang Zheng
2019-04-12  9:50                                   ` Xiang Zheng
2019-04-12 10:57                                   ` Kevin Wolf
2019-04-12 10:57                                     ` Kevin Wolf
2019-04-15  2:39                                     ` Xiang Zheng
2019-04-15  2:39                                       ` Xiang Zheng
2019-04-22  1:37                                       ` Xiang Zheng
2019-04-22  1:37                                         ` Xiang Zheng

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.