* [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
@ 2021-10-04 19:38 ` Christian Schoenebeck
  0 siblings, 0 replies; 97+ messages in thread
From: Christian Schoenebeck @ 2021-10-04 19:38 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, Greg Kurz, Raphael Norwitz, Kevin Wolf,
	Hanna Reitz, Stefan Hajnoczi, Laurent Vivier, Amit Shah,
	Marc-André Lureau, Paolo Bonzini, Gerd Hoffmann, Jason Wang,
	Fam Zheng, Dr. David Alan Gilbert, David Hildenbrand,
	Gonglei (Arei),
	Eric Auger, qemu-block, virtio-fs

At the moment the maximum transfer size with virtio is limited to 4M
(1024 * PAGE_SIZE). This series raises this limit to the theoretical
maximum transfer size of 128M (32k pages) allowed by the virtio
specification:

https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-240006

Maintainers: if you don't care about allowing users to go beyond 4M, then no
action is required on your side for now. This series preserves the old value
of 1k by using VIRTQUEUE_LEGACY_MAX_SIZE on your end.

If you do want to support 128M, however, then replace
VIRTQUEUE_LEGACY_MAX_SIZE with VIRTQUEUE_MAX_SIZE on your end (see patch 3,
which makes 9pfs the first virtio user to support it) and make sure that
your device actually handles the new transfer size limit.
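
As a minimal sketch (using a hypothetical "virtio-foo" device rather than
one from this series), the opt-in is a single argument change in the
device's realize function:

    /* before: conservative legacy limit of 1k descriptors per request */
    virtio_init(vdev, "virtio-foo", VIRTIO_ID_FOO, config_size,
                VIRTQUEUE_LEGACY_MAX_SIZE);

    /* after: opt in to the full 32k limit (verify your device copes) */
    virtio_init(vdev, "virtio-foo", VIRTIO_ID_FOO, config_size,
                VIRTQUEUE_MAX_SIZE);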

Changes v1 -> v2:

  * Instead of simply raising VIRTQUEUE_MAX_SIZE to 32k for all virtio
    users, preserve the old value of 1k for all virtio users unless they
    explicitly opt in:
    https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00056.html

Christian Schoenebeck (3):
  virtio: turn VIRTQUEUE_MAX_SIZE into a variable
  virtio: increase VIRTQUEUE_MAX_SIZE to 32k
  virtio-9p-device: switch to 32k max. transfer size

 hw/9pfs/virtio-9p-device.c     |  3 ++-
 hw/block/vhost-user-blk.c      |  6 +++---
 hw/block/virtio-blk.c          |  7 ++++---
 hw/char/virtio-serial-bus.c    |  2 +-
 hw/display/virtio-gpu-base.c   |  2 +-
 hw/input/virtio-input.c        |  2 +-
 hw/net/virtio-net.c            | 25 ++++++++++++------------
 hw/scsi/virtio-scsi.c          |  2 +-
 hw/virtio/vhost-user-fs.c      |  6 +++---
 hw/virtio/vhost-user-i2c.c     |  3 ++-
 hw/virtio/vhost-vsock-common.c |  2 +-
 hw/virtio/virtio-balloon.c     |  4 ++--
 hw/virtio/virtio-crypto.c      |  3 ++-
 hw/virtio/virtio-iommu.c       |  2 +-
 hw/virtio/virtio-mem.c         |  2 +-
 hw/virtio/virtio-mmio.c        |  4 ++--
 hw/virtio/virtio-pmem.c        |  2 +-
 hw/virtio/virtio-rng.c         |  3 ++-
 hw/virtio/virtio.c             | 35 +++++++++++++++++++++++-----------
 include/hw/virtio/virtio.h     | 25 ++++++++++++++++++++++--
 20 files changed, 90 insertions(+), 50 deletions(-)

-- 
2.20.1



* [PATCH v2 1/3] virtio: turn VIRTQUEUE_MAX_SIZE into a variable
  2021-10-04 19:38 ` [Virtio-fs] " Christian Schoenebeck
@ 2021-10-04 19:38   ` Christian Schoenebeck
  -1 siblings, 0 replies; 97+ messages in thread
From: Christian Schoenebeck @ 2021-10-04 19:38 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, Greg Kurz, Raphael Norwitz, Kevin Wolf,
	Hanna Reitz, Stefan Hajnoczi, Laurent Vivier, Amit Shah,
	Marc-André Lureau, Paolo Bonzini, Gerd Hoffmann, Jason Wang,
	Fam Zheng, Dr. David Alan Gilbert, David Hildenbrand,
	Gonglei (Arei),
	Eric Auger, qemu-block, virtio-fs

Refactor VIRTQUEUE_MAX_SIZE to effectively become a runtime
variable per virtio user.

Reasons:

(1) VIRTQUEUE_MAX_SIZE should reflect the absolute theoretical
    maximum queue size possible, which is the maximum queue size
    allowed by the virtio protocol. The appropriate value for
    VIRTQUEUE_MAX_SIZE would therefore be 32768:

    https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-240006

    Apparently VIRTQUEUE_MAX_SIZE was instead defined with a
    more or less arbitrary value of 1024 in the past, which
    limits the maximum transfer size with virtio to 4M
    (more precisely: 1024 * PAGE_SIZE, with the latter typically
    being 4k).

(2) Additionally the current value of 1024 poses a hidden limit,
    invisible to the guest, which causes a system hang with the
    following QEMU error if the guest tries to exceed it:

    virtio: too many write descriptors in indirect table

(3) Unfortunately not all virtio users in QEMU would currently
    work correctly with the new value of 32768.

So let's turn this hard-coded global value into a runtime
variable as a first step in this commit, configurable for each
virtio user by passing a corresponding value to the virtio_init()
call.
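
For reference, the resulting call contract, summarized from the hunks
below (the virtio-9p call from this series is shown as the example):

    /* new prototype after this patch */
    void virtio_init(VirtIODevice *vdev, const char *name,
                     uint16_t device_id, size_t config_size,
                     uint16_t queue_max_size);

    /* queue_max_size must be a power of 2 and <= VIRTQUEUE_MAX_SIZE,
     * otherwise virtio_init() reports an error and aborts; e.g.: */
    virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
                VIRTQUEUE_MAX_SIZE);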

Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
---
 hw/9pfs/virtio-9p-device.c     |  3 ++-
 hw/block/vhost-user-blk.c      |  2 +-
 hw/block/virtio-blk.c          |  3 ++-
 hw/char/virtio-serial-bus.c    |  2 +-
 hw/display/virtio-gpu-base.c   |  2 +-
 hw/input/virtio-input.c        |  2 +-
 hw/net/virtio-net.c            | 15 ++++++++-------
 hw/scsi/virtio-scsi.c          |  2 +-
 hw/virtio/vhost-user-fs.c      |  2 +-
 hw/virtio/vhost-user-i2c.c     |  3 ++-
 hw/virtio/vhost-vsock-common.c |  2 +-
 hw/virtio/virtio-balloon.c     |  4 ++--
 hw/virtio/virtio-crypto.c      |  3 ++-
 hw/virtio/virtio-iommu.c       |  2 +-
 hw/virtio/virtio-mem.c         |  2 +-
 hw/virtio/virtio-pmem.c        |  2 +-
 hw/virtio/virtio-rng.c         |  2 +-
 hw/virtio/virtio.c             | 35 +++++++++++++++++++++++-----------
 include/hw/virtio/virtio.h     |  5 ++++-
 19 files changed, 57 insertions(+), 36 deletions(-)

diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
index 54ee93b71f..cd5d95dd51 100644
--- a/hw/9pfs/virtio-9p-device.c
+++ b/hw/9pfs/virtio-9p-device.c
@@ -216,7 +216,8 @@ static void virtio_9p_device_realize(DeviceState *dev, Error **errp)
     }
 
     v->config_size = sizeof(struct virtio_9p_config) + strlen(s->fsconf.tag);
-    virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size);
+    virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
+                VIRTQUEUE_MAX_SIZE);
     v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
 }
 
diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
index ba13cb87e5..336f56705c 100644
--- a/hw/block/vhost-user-blk.c
+++ b/hw/block/vhost-user-blk.c
@@ -491,7 +491,7 @@ static void vhost_user_blk_device_realize(DeviceState *dev, Error **errp)
     }
 
     virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK,
-                sizeof(struct virtio_blk_config));
+                sizeof(struct virtio_blk_config), VIRTQUEUE_MAX_SIZE);
 
     s->virtqs = g_new(VirtQueue *, s->num_queues);
     for (i = 0; i < s->num_queues; i++) {
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index f139cd7cc9..9c0f46815c 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -1213,7 +1213,8 @@ static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
 
     virtio_blk_set_config_size(s, s->host_features);
 
-    virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK, s->config_size);
+    virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK, s->config_size,
+                VIRTQUEUE_MAX_SIZE);
 
     s->blk = conf->conf.blk;
     s->rq = NULL;
diff --git a/hw/char/virtio-serial-bus.c b/hw/char/virtio-serial-bus.c
index f01ec2137c..9ad9111115 100644
--- a/hw/char/virtio-serial-bus.c
+++ b/hw/char/virtio-serial-bus.c
@@ -1045,7 +1045,7 @@ static void virtio_serial_device_realize(DeviceState *dev, Error **errp)
         config_size = offsetof(struct virtio_console_config, emerg_wr);
     }
     virtio_init(vdev, "virtio-serial", VIRTIO_ID_CONSOLE,
-                config_size);
+                config_size, VIRTQUEUE_MAX_SIZE);
 
     /* Spawn a new virtio-serial bus on which the ports will ride as devices */
     qbus_init(&vser->bus, sizeof(vser->bus), TYPE_VIRTIO_SERIAL_BUS,
diff --git a/hw/display/virtio-gpu-base.c b/hw/display/virtio-gpu-base.c
index c8da4806e0..20b06a7adf 100644
--- a/hw/display/virtio-gpu-base.c
+++ b/hw/display/virtio-gpu-base.c
@@ -171,7 +171,7 @@ virtio_gpu_base_device_realize(DeviceState *qdev,
 
     g->virtio_config.num_scanouts = cpu_to_le32(g->conf.max_outputs);
     virtio_init(VIRTIO_DEVICE(g), "virtio-gpu", VIRTIO_ID_GPU,
-                sizeof(struct virtio_gpu_config));
+                sizeof(struct virtio_gpu_config), VIRTQUEUE_MAX_SIZE);
 
     if (virtio_gpu_virgl_enabled(g->conf)) {
         /* use larger control queue in 3d mode */
diff --git a/hw/input/virtio-input.c b/hw/input/virtio-input.c
index 54bcb46c74..345eb2cce7 100644
--- a/hw/input/virtio-input.c
+++ b/hw/input/virtio-input.c
@@ -258,7 +258,7 @@ static void virtio_input_device_realize(DeviceState *dev, Error **errp)
     assert(vinput->cfg_size <= sizeof(virtio_input_config));
 
     virtio_init(vdev, "virtio-input", VIRTIO_ID_INPUT,
-                vinput->cfg_size);
+                vinput->cfg_size, VIRTQUEUE_MAX_SIZE);
     vinput->evt = virtio_add_queue(vdev, 64, virtio_input_handle_evt);
     vinput->sts = virtio_add_queue(vdev, 64, virtio_input_handle_sts);
 }
diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index f205331dcf..f74b5f6268 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -1746,9 +1746,9 @@ static ssize_t virtio_net_receive_rcu(NetClientState *nc, const uint8_t *buf,
     VirtIONet *n = qemu_get_nic_opaque(nc);
     VirtIONetQueue *q = virtio_net_get_subqueue(nc);
     VirtIODevice *vdev = VIRTIO_DEVICE(n);
-    VirtQueueElement *elems[VIRTQUEUE_MAX_SIZE];
-    size_t lens[VIRTQUEUE_MAX_SIZE];
-    struct iovec mhdr_sg[VIRTQUEUE_MAX_SIZE];
+    VirtQueueElement *elems[vdev->queue_max_size];
+    size_t lens[vdev->queue_max_size];
+    struct iovec mhdr_sg[vdev->queue_max_size];
     struct virtio_net_hdr_mrg_rxbuf mhdr;
     unsigned mhdr_cnt = 0;
     size_t offset, i, guest_offset, j;
@@ -1783,7 +1783,7 @@ static ssize_t virtio_net_receive_rcu(NetClientState *nc, const uint8_t *buf,
 
         total = 0;
 
-        if (i == VIRTQUEUE_MAX_SIZE) {
+        if (i == vdev->queue_max_size) {
             virtio_error(vdev, "virtio-net unexpected long buffer chain");
             err = size;
             goto err;
@@ -2532,7 +2532,7 @@ static int32_t virtio_net_flush_tx(VirtIONetQueue *q)
     for (;;) {
         ssize_t ret;
         unsigned int out_num;
-        struct iovec sg[VIRTQUEUE_MAX_SIZE], sg2[VIRTQUEUE_MAX_SIZE + 1], *out_sg;
+        struct iovec sg[vdev->queue_max_size], sg2[vdev->queue_max_size + 1], *out_sg;
         struct virtio_net_hdr_mrg_rxbuf mhdr;
 
         elem = virtqueue_pop(q->tx_vq, sizeof(VirtQueueElement));
@@ -2564,7 +2564,7 @@ static int32_t virtio_net_flush_tx(VirtIONetQueue *q)
                 out_num = iov_copy(&sg2[1], ARRAY_SIZE(sg2) - 1,
                                    out_sg, out_num,
                                    n->guest_hdr_len, -1);
-                if (out_num == VIRTQUEUE_MAX_SIZE) {
+                if (out_num == vdev->queue_max_size) {
                     goto drop;
                 }
                 out_num += 1;
@@ -3364,7 +3364,8 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
     }
 
     virtio_net_set_config_size(n, n->host_features);
-    virtio_init(vdev, "virtio-net", VIRTIO_ID_NET, n->config_size);
+    virtio_init(vdev, "virtio-net", VIRTIO_ID_NET, n->config_size,
+                VIRTQUEUE_MAX_SIZE);
 
     /*
      * We set a lower limit on RX queue size to what it always was.
diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index 51fd09522a..5e5e657e1d 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -973,7 +973,7 @@ void virtio_scsi_common_realize(DeviceState *dev,
     int i;
 
     virtio_init(vdev, "virtio-scsi", VIRTIO_ID_SCSI,
-                sizeof(VirtIOSCSIConfig));
+                sizeof(VirtIOSCSIConfig), VIRTQUEUE_MAX_SIZE);
 
     if (s->conf.num_queues == VIRTIO_SCSI_AUTO_NUM_QUEUES) {
         s->conf.num_queues = 1;
diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
index c595957983..ae1672d667 100644
--- a/hw/virtio/vhost-user-fs.c
+++ b/hw/virtio/vhost-user-fs.c
@@ -220,7 +220,7 @@ static void vuf_device_realize(DeviceState *dev, Error **errp)
     }
 
     virtio_init(vdev, "vhost-user-fs", VIRTIO_ID_FS,
-                sizeof(struct virtio_fs_config));
+                sizeof(struct virtio_fs_config), VIRTQUEUE_MAX_SIZE);
 
     /* Hiprio queue */
     fs->hiprio_vq = virtio_add_queue(vdev, fs->conf.queue_size, vuf_handle_output);
diff --git a/hw/virtio/vhost-user-i2c.c b/hw/virtio/vhost-user-i2c.c
index d172632bb0..eeb1d8853a 100644
--- a/hw/virtio/vhost-user-i2c.c
+++ b/hw/virtio/vhost-user-i2c.c
@@ -220,7 +220,8 @@ static void vu_i2c_device_realize(DeviceState *dev, Error **errp)
         return;
     }
 
-    virtio_init(vdev, "vhost-user-i2c", VIRTIO_ID_I2C_ADAPTER, 0);
+    virtio_init(vdev, "vhost-user-i2c", VIRTIO_ID_I2C_ADAPTER, 0,
+                VIRTQUEUE_MAX_SIZE);
 
     i2c->vhost_dev.nvqs = 1;
     i2c->vq = virtio_add_queue(vdev, 4, vu_i2c_handle_output);
diff --git a/hw/virtio/vhost-vsock-common.c b/hw/virtio/vhost-vsock-common.c
index 4ad6e234ad..a81fa884a8 100644
--- a/hw/virtio/vhost-vsock-common.c
+++ b/hw/virtio/vhost-vsock-common.c
@@ -201,7 +201,7 @@ void vhost_vsock_common_realize(VirtIODevice *vdev, const char *name)
     VHostVSockCommon *vvc = VHOST_VSOCK_COMMON(vdev);
 
     virtio_init(vdev, name, VIRTIO_ID_VSOCK,
-                sizeof(struct virtio_vsock_config));
+                sizeof(struct virtio_vsock_config), VIRTQUEUE_MAX_SIZE);
 
     /* Receive and transmit queues belong to vhost */
     vvc->recv_vq = virtio_add_queue(vdev, VHOST_VSOCK_QUEUE_SIZE,
diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index 5a69dce35d..067c73223d 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -886,7 +886,7 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
     int ret;
 
     virtio_init(vdev, "virtio-balloon", VIRTIO_ID_BALLOON,
-                virtio_balloon_config_size(s));
+                virtio_balloon_config_size(s), VIRTQUEUE_MAX_SIZE);
 
     ret = qemu_add_balloon_handler(virtio_balloon_to_target,
                                    virtio_balloon_stat, s);
@@ -909,7 +909,7 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
     s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats);
 
     if (virtio_has_feature(s->host_features, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
-        s->free_page_vq = virtio_add_queue(vdev, VIRTQUEUE_MAX_SIZE,
+        s->free_page_vq = virtio_add_queue(vdev, vdev->queue_max_size,
                                            virtio_balloon_handle_free_page_vq);
         precopy_add_notifier(&s->free_page_hint_notify);
 
diff --git a/hw/virtio/virtio-crypto.c b/hw/virtio/virtio-crypto.c
index 54f9bbb789..1e70d4d2a8 100644
--- a/hw/virtio/virtio-crypto.c
+++ b/hw/virtio/virtio-crypto.c
@@ -810,7 +810,8 @@ static void virtio_crypto_device_realize(DeviceState *dev, Error **errp)
         return;
     }
 
-    virtio_init(vdev, "virtio-crypto", VIRTIO_ID_CRYPTO, vcrypto->config_size);
+    virtio_init(vdev, "virtio-crypto", VIRTIO_ID_CRYPTO, vcrypto->config_size,
+                VIRTQUEUE_MAX_SIZE);
     vcrypto->curr_queues = 1;
     vcrypto->vqs = g_malloc0(sizeof(VirtIOCryptoQueue) * vcrypto->max_queues);
     for (i = 0; i < vcrypto->max_queues; i++) {
diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
index 1b23e8e18c..ca360e74eb 100644
--- a/hw/virtio/virtio-iommu.c
+++ b/hw/virtio/virtio-iommu.c
@@ -974,7 +974,7 @@ static void virtio_iommu_device_realize(DeviceState *dev, Error **errp)
     VirtIOIOMMU *s = VIRTIO_IOMMU(dev);
 
     virtio_init(vdev, "virtio-iommu", VIRTIO_ID_IOMMU,
-                sizeof(struct virtio_iommu_config));
+                sizeof(struct virtio_iommu_config), VIRTQUEUE_MAX_SIZE);
 
     memset(s->iommu_pcibus_by_bus_num, 0, sizeof(s->iommu_pcibus_by_bus_num));
 
diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index df91e454b2..1d9d01b871 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -738,7 +738,7 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
     vmem->bitmap = bitmap_new(vmem->bitmap_size);
 
     virtio_init(vdev, TYPE_VIRTIO_MEM, VIRTIO_ID_MEM,
-                sizeof(struct virtio_mem_config));
+                sizeof(struct virtio_mem_config), VIRTQUEUE_MAX_SIZE);
     vmem->vq = virtio_add_queue(vdev, 128, virtio_mem_handle_request);
 
     host_memory_backend_set_mapped(vmem->memdev, true);
diff --git a/hw/virtio/virtio-pmem.c b/hw/virtio/virtio-pmem.c
index d1aeb90a31..82b54b00c5 100644
--- a/hw/virtio/virtio-pmem.c
+++ b/hw/virtio/virtio-pmem.c
@@ -124,7 +124,7 @@ static void virtio_pmem_realize(DeviceState *dev, Error **errp)
 
     host_memory_backend_set_mapped(pmem->memdev, true);
     virtio_init(vdev, TYPE_VIRTIO_PMEM, VIRTIO_ID_PMEM,
-                sizeof(struct virtio_pmem_config));
+                sizeof(struct virtio_pmem_config), VIRTQUEUE_MAX_SIZE);
     pmem->rq_vq = virtio_add_queue(vdev, 128, virtio_pmem_flush);
 }
 
diff --git a/hw/virtio/virtio-rng.c b/hw/virtio/virtio-rng.c
index cc8e9f775d..0e91d60106 100644
--- a/hw/virtio/virtio-rng.c
+++ b/hw/virtio/virtio-rng.c
@@ -215,7 +215,7 @@ static void virtio_rng_device_realize(DeviceState *dev, Error **errp)
         return;
     }
 
-    virtio_init(vdev, "virtio-rng", VIRTIO_ID_RNG, 0);
+    virtio_init(vdev, "virtio-rng", VIRTIO_ID_RNG, 0, VIRTQUEUE_MAX_SIZE);
 
     vrng->vq = virtio_add_queue(vdev, 8, handle_input);
     vrng->quota_remaining = vrng->conf.max_bytes;
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 240759ff0b..60e094d96a 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -1419,8 +1419,8 @@ static void *virtqueue_split_pop(VirtQueue *vq, size_t sz)
     VirtIODevice *vdev = vq->vdev;
     VirtQueueElement *elem = NULL;
     unsigned out_num, in_num, elem_entries;
-    hwaddr addr[VIRTQUEUE_MAX_SIZE];
-    struct iovec iov[VIRTQUEUE_MAX_SIZE];
+    hwaddr addr[vdev->queue_max_size];
+    struct iovec iov[vdev->queue_max_size];
     VRingDesc desc;
     int rc;
 
@@ -1492,7 +1492,7 @@ static void *virtqueue_split_pop(VirtQueue *vq, size_t sz)
         if (desc.flags & VRING_DESC_F_WRITE) {
             map_ok = virtqueue_map_desc(vdev, &in_num, addr + out_num,
                                         iov + out_num,
-                                        VIRTQUEUE_MAX_SIZE - out_num, true,
+                                        vdev->queue_max_size - out_num, true,
                                         desc.addr, desc.len);
         } else {
             if (in_num) {
@@ -1500,7 +1500,7 @@ static void *virtqueue_split_pop(VirtQueue *vq, size_t sz)
                 goto err_undo_map;
             }
             map_ok = virtqueue_map_desc(vdev, &out_num, addr, iov,
-                                        VIRTQUEUE_MAX_SIZE, false,
+                                        vdev->queue_max_size, false,
                                         desc.addr, desc.len);
         }
         if (!map_ok) {
@@ -1556,8 +1556,8 @@ static void *virtqueue_packed_pop(VirtQueue *vq, size_t sz)
     VirtIODevice *vdev = vq->vdev;
     VirtQueueElement *elem = NULL;
     unsigned out_num, in_num, elem_entries;
-    hwaddr addr[VIRTQUEUE_MAX_SIZE];
-    struct iovec iov[VIRTQUEUE_MAX_SIZE];
+    hwaddr addr[vdev->queue_max_size];
+    struct iovec iov[vdev->queue_max_size];
     VRingPackedDesc desc;
     uint16_t id;
     int rc;
@@ -1620,7 +1620,7 @@ static void *virtqueue_packed_pop(VirtQueue *vq, size_t sz)
         if (desc.flags & VRING_DESC_F_WRITE) {
             map_ok = virtqueue_map_desc(vdev, &in_num, addr + out_num,
                                         iov + out_num,
-                                        VIRTQUEUE_MAX_SIZE - out_num, true,
+                                        vdev->queue_max_size - out_num, true,
                                         desc.addr, desc.len);
         } else {
             if (in_num) {
@@ -1628,7 +1628,7 @@ static void *virtqueue_packed_pop(VirtQueue *vq, size_t sz)
                 goto err_undo_map;
             }
             map_ok = virtqueue_map_desc(vdev, &out_num, addr, iov,
-                                        VIRTQUEUE_MAX_SIZE, false,
+                                        vdev->queue_max_size, false,
                                         desc.addr, desc.len);
         }
         if (!map_ok) {
@@ -2249,7 +2249,7 @@ void virtio_queue_set_num(VirtIODevice *vdev, int n, int num)
      * nonexistent states, or to set it to an invalid size.
      */
     if (!!num != !!vdev->vq[n].vring.num ||
-        num > VIRTQUEUE_MAX_SIZE ||
+        num > vdev->queue_max_size ||
         num < 0) {
         return;
     }
@@ -2400,7 +2400,7 @@ VirtQueue *virtio_add_queue(VirtIODevice *vdev, int queue_size,
             break;
     }
 
-    if (i == VIRTIO_QUEUE_MAX || queue_size > VIRTQUEUE_MAX_SIZE)
+    if (i == VIRTIO_QUEUE_MAX || queue_size > vdev->queue_max_size)
         abort();
 
     vdev->vq[i].vring.num = queue_size;
@@ -3239,13 +3239,25 @@ void virtio_instance_init_common(Object *proxy_obj, void *data,
 }
 
 void virtio_init(VirtIODevice *vdev, const char *name,
-                 uint16_t device_id, size_t config_size)
+                 uint16_t device_id, size_t config_size,
+                 uint16_t queue_max_size)
 {
     BusState *qbus = qdev_get_parent_bus(DEVICE(vdev));
     VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
     int i;
     int nvectors = k->query_nvectors ? k->query_nvectors(qbus->parent) : 0;
 
+    if (queue_max_size > VIRTQUEUE_MAX_SIZE ||
+        !is_power_of_2(queue_max_size))
+    {
+        error_report(
+            "virtio: invalid queue_max_size (= %" PRIu16 "), must be a "
+            "power of 2 and between 1 ... %d.",
+            queue_max_size, VIRTQUEUE_MAX_SIZE
+        );
+        abort();
+    }
+
     if (nvectors) {
         vdev->vector_queues =
             g_malloc0(sizeof(*vdev->vector_queues) * nvectors);
@@ -3258,6 +3270,7 @@ void virtio_init(VirtIODevice *vdev, const char *name,
     qatomic_set(&vdev->isr, 0);
     vdev->queue_sel = 0;
     vdev->config_vector = VIRTIO_NO_VECTOR;
+    vdev->queue_max_size = queue_max_size;
     vdev->vq = g_malloc0(sizeof(VirtQueue) * VIRTIO_QUEUE_MAX);
     vdev->vm_running = runstate_is_running();
     vdev->broken = false;
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index 8bab9cfb75..a37d1f7d52 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -89,6 +89,7 @@ struct VirtIODevice
     size_t config_len;
     void *config;
     uint16_t config_vector;
+    uint16_t queue_max_size;
     uint32_t generation;
     int nvectors;
     VirtQueue *vq;
@@ -166,7 +167,9 @@ void virtio_instance_init_common(Object *proxy_obj, void *data,
                                  size_t vdev_size, const char *vdev_name);
 
 void virtio_init(VirtIODevice *vdev, const char *name,
-                         uint16_t device_id, size_t config_size);
+                 uint16_t device_id, size_t config_size,
+                 uint16_t queue_max_size);
+
 void virtio_cleanup(VirtIODevice *vdev);
 
 void virtio_error(VirtIODevice *vdev, const char *fmt, ...) GCC_FMT_ATTR(2, 3);
-- 
2.20.1



* [PATCH v2 2/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
  2021-10-04 19:38 ` [Virtio-fs] " Christian Schoenebeck
@ 2021-10-04 19:38   ` Christian Schoenebeck
  -1 siblings, 0 replies; 97+ messages in thread
From: Christian Schoenebeck @ 2021-10-04 19:38 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, Greg Kurz, Raphael Norwitz, Kevin Wolf,
	Hanna Reitz, Stefan Hajnoczi, Laurent Vivier, Amit Shah,
	Marc-André Lureau, Paolo Bonzini, Gerd Hoffmann, Jason Wang,
	Fam Zheng, Dr. David Alan Gilbert, David Hildenbrand,
	Gonglei (Arei),
	Eric Auger, qemu-block, virtio-fs

Raise the maximum possible virtio transfer size to 128M
(more precisely: 32k * PAGE_SIZE). See the previous commit for a
more detailed explanation of the reasons for this change.

To avoid breaking any virtio user, all virtio users transition
to the new macro VIRTQUEUE_LEGACY_MAX_SIZE instead of
VIRTQUEUE_MAX_SIZE, so with this commit they all still use the
old value of 1k.

In the long term, each virtio user should subsequently either
switch from VIRTQUEUE_LEGACY_MAX_SIZE to VIRTQUEUE_MAX_SIZE
after verifying that it supports the new value of 32k, or
replace the VIRTQUEUE_LEGACY_MAX_SIZE macro with an appropriate
value that it does support.
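
For illustration, the two limits boil down to something like the
following (a sketch only: the values are the ones described above,
but the exact form of the definitions is an assumption):

    #define VIRTQUEUE_MAX_SIZE        32768  /* 32k * 4k pages = 128M */
    #define VIRTQUEUE_LEGACY_MAX_SIZE 1024   /* 1k  * 4k pages = 4M   */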

Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
---
 hw/9pfs/virtio-9p-device.c     |  2 +-
 hw/block/vhost-user-blk.c      |  6 +++---
 hw/block/virtio-blk.c          |  6 +++---
 hw/char/virtio-serial-bus.c    |  2 +-
 hw/input/virtio-input.c        |  2 +-
 hw/net/virtio-net.c            | 12 ++++++------
 hw/scsi/virtio-scsi.c          |  2 +-
 hw/virtio/vhost-user-fs.c      |  6 +++---
 hw/virtio/vhost-user-i2c.c     |  2 +-
 hw/virtio/vhost-vsock-common.c |  2 +-
 hw/virtio/virtio-balloon.c     |  2 +-
 hw/virtio/virtio-crypto.c      |  2 +-
 hw/virtio/virtio-iommu.c       |  2 +-
 hw/virtio/virtio-mem.c         |  2 +-
 hw/virtio/virtio-mmio.c        |  4 ++--
 hw/virtio/virtio-pmem.c        |  2 +-
 hw/virtio/virtio-rng.c         |  3 ++-
 include/hw/virtio/virtio.h     | 20 +++++++++++++++++++-
 18 files changed, 49 insertions(+), 30 deletions(-)

diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
index cd5d95dd51..9013e7df6e 100644
--- a/hw/9pfs/virtio-9p-device.c
+++ b/hw/9pfs/virtio-9p-device.c
@@ -217,7 +217,7 @@ static void virtio_9p_device_realize(DeviceState *dev, Error **errp)
 
     v->config_size = sizeof(struct virtio_9p_config) + strlen(s->fsconf.tag);
     virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
-                VIRTQUEUE_MAX_SIZE);
+                VIRTQUEUE_LEGACY_MAX_SIZE);
     v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
 }
 
diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
index 336f56705c..e5e45262ab 100644
--- a/hw/block/vhost-user-blk.c
+++ b/hw/block/vhost-user-blk.c
@@ -480,9 +480,9 @@ static void vhost_user_blk_device_realize(DeviceState *dev, Error **errp)
         error_setg(errp, "queue size must be non-zero");
         return;
     }
-    if (s->queue_size > VIRTQUEUE_MAX_SIZE) {
+    if (s->queue_size > VIRTQUEUE_LEGACY_MAX_SIZE) {
         error_setg(errp, "queue size must not exceed %d",
-                   VIRTQUEUE_MAX_SIZE);
+                   VIRTQUEUE_LEGACY_MAX_SIZE);
         return;
     }
 
@@ -491,7 +491,7 @@ static void vhost_user_blk_device_realize(DeviceState *dev, Error **errp)
     }
 
     virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK,
-                sizeof(struct virtio_blk_config), VIRTQUEUE_MAX_SIZE);
+                sizeof(struct virtio_blk_config), VIRTQUEUE_LEGACY_MAX_SIZE);
 
     s->virtqs = g_new(VirtQueue *, s->num_queues);
     for (i = 0; i < s->num_queues; i++) {
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 9c0f46815c..5883e3e7db 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -1171,10 +1171,10 @@ static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
         return;
     }
     if (!is_power_of_2(conf->queue_size) ||
-        conf->queue_size > VIRTQUEUE_MAX_SIZE) {
+        conf->queue_size > VIRTQUEUE_LEGACY_MAX_SIZE) {
         error_setg(errp, "invalid queue-size property (%" PRIu16 "), "
                    "must be a power of 2 (max %d)",
-                   conf->queue_size, VIRTQUEUE_MAX_SIZE);
+                   conf->queue_size, VIRTQUEUE_LEGACY_MAX_SIZE);
         return;
     }
 
@@ -1214,7 +1214,7 @@ static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
     virtio_blk_set_config_size(s, s->host_features);
 
     virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK, s->config_size,
-                VIRTQUEUE_MAX_SIZE);
+                VIRTQUEUE_LEGACY_MAX_SIZE);
 
     s->blk = conf->conf.blk;
     s->rq = NULL;
diff --git a/hw/char/virtio-serial-bus.c b/hw/char/virtio-serial-bus.c
index 9ad9111115..2d4285ab53 100644
--- a/hw/char/virtio-serial-bus.c
+++ b/hw/char/virtio-serial-bus.c
@@ -1045,7 +1045,7 @@ static void virtio_serial_device_realize(DeviceState *dev, Error **errp)
         config_size = offsetof(struct virtio_console_config, emerg_wr);
     }
     virtio_init(vdev, "virtio-serial", VIRTIO_ID_CONSOLE,
-                config_size, VIRTQUEUE_MAX_SIZE);
+                config_size, VIRTQUEUE_LEGACY_MAX_SIZE);
 
     /* Spawn a new virtio-serial bus on which the ports will ride as devices */
     qbus_init(&vser->bus, sizeof(vser->bus), TYPE_VIRTIO_SERIAL_BUS,
diff --git a/hw/input/virtio-input.c b/hw/input/virtio-input.c
index 345eb2cce7..b6b77488f2 100644
--- a/hw/input/virtio-input.c
+++ b/hw/input/virtio-input.c
@@ -258,7 +258,7 @@ static void virtio_input_device_realize(DeviceState *dev, Error **errp)
     assert(vinput->cfg_size <= sizeof(virtio_input_config));
 
     virtio_init(vdev, "virtio-input", VIRTIO_ID_INPUT,
-                vinput->cfg_size, VIRTQUEUE_MAX_SIZE);
+                vinput->cfg_size, VIRTQUEUE_LEGACY_MAX_SIZE);
     vinput->evt = virtio_add_queue(vdev, 64, virtio_input_handle_evt);
     vinput->sts = virtio_add_queue(vdev, 64, virtio_input_handle_sts);
 }
diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index f74b5f6268..5100978b07 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -636,7 +636,7 @@ static int virtio_net_max_tx_queue_size(VirtIONet *n)
         return VIRTIO_NET_TX_QUEUE_DEFAULT_SIZE;
     }
 
-    return VIRTQUEUE_MAX_SIZE;
+    return VIRTQUEUE_LEGACY_MAX_SIZE;
 }
 
 static int peer_attach(VirtIONet *n, int index)
@@ -3365,7 +3365,7 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
 
     virtio_net_set_config_size(n, n->host_features);
     virtio_init(vdev, "virtio-net", VIRTIO_ID_NET, n->config_size,
-                VIRTQUEUE_MAX_SIZE);
+                VIRTQUEUE_LEGACY_MAX_SIZE);
 
     /*
      * We set a lower limit on RX queue size to what it always was.
@@ -3373,23 +3373,23 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
      * help from us (using virtio 1 and up).
      */
     if (n->net_conf.rx_queue_size < VIRTIO_NET_RX_QUEUE_MIN_SIZE ||
-        n->net_conf.rx_queue_size > VIRTQUEUE_MAX_SIZE ||
+        n->net_conf.rx_queue_size > VIRTQUEUE_LEGACY_MAX_SIZE ||
         !is_power_of_2(n->net_conf.rx_queue_size)) {
         error_setg(errp, "Invalid rx_queue_size (= %" PRIu16 "), "
                    "must be a power of 2 between %d and %d.",
                    n->net_conf.rx_queue_size, VIRTIO_NET_RX_QUEUE_MIN_SIZE,
-                   VIRTQUEUE_MAX_SIZE);
+                   VIRTQUEUE_LEGACY_MAX_SIZE);
         virtio_cleanup(vdev);
         return;
     }
 
     if (n->net_conf.tx_queue_size < VIRTIO_NET_TX_QUEUE_MIN_SIZE ||
-        n->net_conf.tx_queue_size > VIRTQUEUE_MAX_SIZE ||
+        n->net_conf.tx_queue_size > VIRTQUEUE_LEGACY_MAX_SIZE ||
         !is_power_of_2(n->net_conf.tx_queue_size)) {
         error_setg(errp, "Invalid tx_queue_size (= %" PRIu16 "), "
                    "must be a power of 2 between %d and %d",
                    n->net_conf.tx_queue_size, VIRTIO_NET_TX_QUEUE_MIN_SIZE,
-                   VIRTQUEUE_MAX_SIZE);
+                   VIRTQUEUE_LEGACY_MAX_SIZE);
         virtio_cleanup(vdev);
         return;
     }
diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index 5e5e657e1d..f204e8878a 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -973,7 +973,7 @@ void virtio_scsi_common_realize(DeviceState *dev,
     int i;
 
     virtio_init(vdev, "virtio-scsi", VIRTIO_ID_SCSI,
-                sizeof(VirtIOSCSIConfig), VIRTQUEUE_MAX_SIZE);
+                sizeof(VirtIOSCSIConfig), VIRTQUEUE_LEGACY_MAX_SIZE);
 
     if (s->conf.num_queues == VIRTIO_SCSI_AUTO_NUM_QUEUES) {
         s->conf.num_queues = 1;
diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
index ae1672d667..decc5def39 100644
--- a/hw/virtio/vhost-user-fs.c
+++ b/hw/virtio/vhost-user-fs.c
@@ -209,9 +209,9 @@ static void vuf_device_realize(DeviceState *dev, Error **errp)
         return;
     }
 
-    if (fs->conf.queue_size > VIRTQUEUE_MAX_SIZE) {
+    if (fs->conf.queue_size > VIRTQUEUE_LEGACY_MAX_SIZE) {
         error_setg(errp, "queue-size property must be %u or smaller",
-                   VIRTQUEUE_MAX_SIZE);
+                   VIRTQUEUE_LEGACY_MAX_SIZE);
         return;
     }
 
@@ -220,7 +220,7 @@ static void vuf_device_realize(DeviceState *dev, Error **errp)
     }
 
     virtio_init(vdev, "vhost-user-fs", VIRTIO_ID_FS,
-                sizeof(struct virtio_fs_config), VIRTQUEUE_MAX_SIZE);
+                sizeof(struct virtio_fs_config), VIRTQUEUE_LEGACY_MAX_SIZE);
 
     /* Hiprio queue */
     fs->hiprio_vq = virtio_add_queue(vdev, fs->conf.queue_size, vuf_handle_output);
diff --git a/hw/virtio/vhost-user-i2c.c b/hw/virtio/vhost-user-i2c.c
index eeb1d8853a..b248ddbe93 100644
--- a/hw/virtio/vhost-user-i2c.c
+++ b/hw/virtio/vhost-user-i2c.c
@@ -221,7 +221,7 @@ static void vu_i2c_device_realize(DeviceState *dev, Error **errp)
     }
 
     virtio_init(vdev, "vhost-user-i2c", VIRTIO_ID_I2C_ADAPTER, 0,
-                VIRTQUEUE_MAX_SIZE);
+                VIRTQUEUE_LEGACY_MAX_SIZE);
 
     i2c->vhost_dev.nvqs = 1;
     i2c->vq = virtio_add_queue(vdev, 4, vu_i2c_handle_output);
diff --git a/hw/virtio/vhost-vsock-common.c b/hw/virtio/vhost-vsock-common.c
index a81fa884a8..73e6b72bba 100644
--- a/hw/virtio/vhost-vsock-common.c
+++ b/hw/virtio/vhost-vsock-common.c
@@ -201,7 +201,7 @@ void vhost_vsock_common_realize(VirtIODevice *vdev, const char *name)
     VHostVSockCommon *vvc = VHOST_VSOCK_COMMON(vdev);
 
     virtio_init(vdev, name, VIRTIO_ID_VSOCK,
-                sizeof(struct virtio_vsock_config), VIRTQUEUE_MAX_SIZE);
+                sizeof(struct virtio_vsock_config), VIRTQUEUE_LEGACY_MAX_SIZE);
 
     /* Receive and transmit queues belong to vhost */
     vvc->recv_vq = virtio_add_queue(vdev, VHOST_VSOCK_QUEUE_SIZE,
diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index 067c73223d..890fb15ed3 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -886,7 +886,7 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
     int ret;
 
     virtio_init(vdev, "virtio-balloon", VIRTIO_ID_BALLOON,
-                virtio_balloon_config_size(s), VIRTQUEUE_MAX_SIZE);
+                virtio_balloon_config_size(s), VIRTQUEUE_LEGACY_MAX_SIZE);
 
     ret = qemu_add_balloon_handler(virtio_balloon_to_target,
                                    virtio_balloon_stat, s);
diff --git a/hw/virtio/virtio-crypto.c b/hw/virtio/virtio-crypto.c
index 1e70d4d2a8..e13b6091d6 100644
--- a/hw/virtio/virtio-crypto.c
+++ b/hw/virtio/virtio-crypto.c
@@ -811,7 +811,7 @@ static void virtio_crypto_device_realize(DeviceState *dev, Error **errp)
     }
 
     virtio_init(vdev, "virtio-crypto", VIRTIO_ID_CRYPTO, vcrypto->config_size,
-                VIRTQUEUE_MAX_SIZE);
+                VIRTQUEUE_LEGACY_MAX_SIZE);
     vcrypto->curr_queues = 1;
     vcrypto->vqs = g_malloc0(sizeof(VirtIOCryptoQueue) * vcrypto->max_queues);
     for (i = 0; i < vcrypto->max_queues; i++) {
diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
index ca360e74eb..845df78842 100644
--- a/hw/virtio/virtio-iommu.c
+++ b/hw/virtio/virtio-iommu.c
@@ -974,7 +974,7 @@ static void virtio_iommu_device_realize(DeviceState *dev, Error **errp)
     VirtIOIOMMU *s = VIRTIO_IOMMU(dev);
 
     virtio_init(vdev, "virtio-iommu", VIRTIO_ID_IOMMU,
-                sizeof(struct virtio_iommu_config), VIRTQUEUE_MAX_SIZE);
+                sizeof(struct virtio_iommu_config), VIRTQUEUE_LEGACY_MAX_SIZE);
 
     memset(s->iommu_pcibus_by_bus_num, 0, sizeof(s->iommu_pcibus_by_bus_num));
 
diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index 1d9d01b871..7a39550cde 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -738,7 +738,7 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
     vmem->bitmap = bitmap_new(vmem->bitmap_size);
 
     virtio_init(vdev, TYPE_VIRTIO_MEM, VIRTIO_ID_MEM,
-                sizeof(struct virtio_mem_config), VIRTQUEUE_MAX_SIZE);
+                sizeof(struct virtio_mem_config), VIRTQUEUE_LEGACY_MAX_SIZE);
     vmem->vq = virtio_add_queue(vdev, 128, virtio_mem_handle_request);
 
     host_memory_backend_set_mapped(vmem->memdev, true);
diff --git a/hw/virtio/virtio-mmio.c b/hw/virtio/virtio-mmio.c
index 7b3ebca178..ae0cc223e9 100644
--- a/hw/virtio/virtio-mmio.c
+++ b/hw/virtio/virtio-mmio.c
@@ -174,7 +174,7 @@ static uint64_t virtio_mmio_read(void *opaque, hwaddr offset, unsigned size)
         if (!virtio_queue_get_num(vdev, vdev->queue_sel)) {
             return 0;
         }
-        return VIRTQUEUE_MAX_SIZE;
+        return VIRTQUEUE_LEGACY_MAX_SIZE;
     case VIRTIO_MMIO_QUEUE_PFN:
         if (!proxy->legacy) {
             qemu_log_mask(LOG_GUEST_ERROR,
@@ -348,7 +348,7 @@ static void virtio_mmio_write(void *opaque, hwaddr offset, uint64_t value,
         }
         break;
     case VIRTIO_MMIO_QUEUE_NUM:
-        trace_virtio_mmio_queue_write(value, VIRTQUEUE_MAX_SIZE);
+        trace_virtio_mmio_queue_write(value, VIRTQUEUE_LEGACY_MAX_SIZE);
         virtio_queue_set_num(vdev, vdev->queue_sel, value);
 
         if (proxy->legacy) {
diff --git a/hw/virtio/virtio-pmem.c b/hw/virtio/virtio-pmem.c
index 82b54b00c5..5f4d375b58 100644
--- a/hw/virtio/virtio-pmem.c
+++ b/hw/virtio/virtio-pmem.c
@@ -124,7 +124,7 @@ static void virtio_pmem_realize(DeviceState *dev, Error **errp)
 
     host_memory_backend_set_mapped(pmem->memdev, true);
     virtio_init(vdev, TYPE_VIRTIO_PMEM, VIRTIO_ID_PMEM,
-                sizeof(struct virtio_pmem_config), VIRTQUEUE_MAX_SIZE);
+                sizeof(struct virtio_pmem_config), VIRTQUEUE_LEGACY_MAX_SIZE);
     pmem->rq_vq = virtio_add_queue(vdev, 128, virtio_pmem_flush);
 }
 
diff --git a/hw/virtio/virtio-rng.c b/hw/virtio/virtio-rng.c
index 0e91d60106..ab075b22b6 100644
--- a/hw/virtio/virtio-rng.c
+++ b/hw/virtio/virtio-rng.c
@@ -215,7 +215,8 @@ static void virtio_rng_device_realize(DeviceState *dev, Error **errp)
         return;
     }
 
-    virtio_init(vdev, "virtio-rng", VIRTIO_ID_RNG, 0, VIRTQUEUE_MAX_SIZE);
+    virtio_init(vdev, "virtio-rng", VIRTIO_ID_RNG, 0,
+                VIRTQUEUE_LEGACY_MAX_SIZE);
 
     vrng->vq = virtio_add_queue(vdev, 8, handle_input);
     vrng->quota_remaining = vrng->conf.max_bytes;
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index a37d1f7d52..fe0f13266b 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -48,7 +48,25 @@ size_t virtio_feature_get_config_size(const VirtIOFeature *features,
 
 typedef struct VirtQueue VirtQueue;
 
-#define VIRTQUEUE_MAX_SIZE 1024
+/*
+ * This is meant as transitional measure for VIRTQUEUE_MAX_SIZE's old value
+ * of 1024 to its new value of 32768. On the long-term virtio users should
+ * either switch to VIRTQUEUE_MAX_SIZE, provided they support 32768,
+ * otherwise they should replace this macro on their side with an
+ * appropriate value actually supported by them.
+ *
+ * Once all virtio users switched, this macro will be removed.
+ */
+#define VIRTQUEUE_LEGACY_MAX_SIZE 1024
+
+/*
+ * Reflects the absolute theoretical maximum queue size (in amount of pages)
+ * ever possible, which is actually the maximum queue size allowed by the
+ * virtio protocol. This value therefore construes the maximum transfer size
+ * possible with virtio (multiplied by system dependent PAGE_SIZE); assuming
+ * a typical page size of 4k this would be a maximum transfer size of 128M.
+ */
+#define VIRTQUEUE_MAX_SIZE 32768
 
 typedef struct VirtQueueElement
 {
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 97+ messages in thread

* [Virtio-fs] [PATCH v2 2/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
@ 2021-10-04 19:38   ` Christian Schoenebeck
  0 siblings, 0 replies; 97+ messages in thread
From: Christian Schoenebeck @ 2021-10-04 19:38 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	Jason Wang, Amit Shah, David Hildenbrand, Raphael Norwitz,
	virtio-fs, Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Paolo Bonzini, Marc-André Lureau, Fam Zheng

Raise the maximum possible virtio transfer size to 128M
(more precisely: 32k * PAGE_SIZE). See previous commit for a
more detailed explanation for the reasons of this change.

For not breaking any virtio user, all virtio users transition
to using the new macro VIRTQUEUE_LEGACY_MAX_SIZE instead of
VIRTQUEUE_MAX_SIZE, so they are all still using the old value
of 1k with this commit.

On the long-term, each virtio user should subsequently either
switch from VIRTQUEUE_LEGACY_MAX_SIZE to VIRTQUEUE_MAX_SIZE
after checking that they support the new value of 32k, or
otherwise they should replace the VIRTQUEUE_LEGACY_MAX_SIZE
macro by an appropriate value supported by them.

Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
---
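(Illustrative sketch only, not part of the patch: the device name and ID
below are made-up placeholders. Per the description above, a device that
has not yet been audited for 32k descriptors keeps the historical limit
via the transitional macro, while an audited device opts in to the new
maximum:)

    /* not yet audited for 32k descriptors: keep the old 1k limit */
    virtio_init(vdev, "virtio-foo", VIRTIO_ID_FOO, config_size,
                VIRTQUEUE_LEGACY_MAX_SIZE);

    /* audited and known to support 32k descriptors: opt in */
    virtio_init(vdev, "virtio-foo", VIRTIO_ID_FOO, config_size,
                VIRTQUEUE_MAX_SIZE);
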
 hw/9pfs/virtio-9p-device.c     |  2 +-
 hw/block/vhost-user-blk.c      |  6 +++---
 hw/block/virtio-blk.c          |  6 +++---
 hw/char/virtio-serial-bus.c    |  2 +-
 hw/input/virtio-input.c        |  2 +-
 hw/net/virtio-net.c            | 12 ++++++------
 hw/scsi/virtio-scsi.c          |  2 +-
 hw/virtio/vhost-user-fs.c      |  6 +++---
 hw/virtio/vhost-user-i2c.c     |  2 +-
 hw/virtio/vhost-vsock-common.c |  2 +-
 hw/virtio/virtio-balloon.c     |  2 +-
 hw/virtio/virtio-crypto.c      |  2 +-
 hw/virtio/virtio-iommu.c       |  2 +-
 hw/virtio/virtio-mem.c         |  2 +-
 hw/virtio/virtio-mmio.c        |  4 ++--
 hw/virtio/virtio-pmem.c        |  2 +-
 hw/virtio/virtio-rng.c         |  3 ++-
 include/hw/virtio/virtio.h     | 20 +++++++++++++++++++-
 18 files changed, 49 insertions(+), 30 deletions(-)

diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
index cd5d95dd51..9013e7df6e 100644
--- a/hw/9pfs/virtio-9p-device.c
+++ b/hw/9pfs/virtio-9p-device.c
@@ -217,7 +217,7 @@ static void virtio_9p_device_realize(DeviceState *dev, Error **errp)
 
     v->config_size = sizeof(struct virtio_9p_config) + strlen(s->fsconf.tag);
     virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
-                VIRTQUEUE_MAX_SIZE);
+                VIRTQUEUE_LEGACY_MAX_SIZE);
     v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
 }
 
diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
index 336f56705c..e5e45262ab 100644
--- a/hw/block/vhost-user-blk.c
+++ b/hw/block/vhost-user-blk.c
@@ -480,9 +480,9 @@ static void vhost_user_blk_device_realize(DeviceState *dev, Error **errp)
         error_setg(errp, "queue size must be non-zero");
         return;
     }
-    if (s->queue_size > VIRTQUEUE_MAX_SIZE) {
+    if (s->queue_size > VIRTQUEUE_LEGACY_MAX_SIZE) {
         error_setg(errp, "queue size must not exceed %d",
-                   VIRTQUEUE_MAX_SIZE);
+                   VIRTQUEUE_LEGACY_MAX_SIZE);
         return;
     }
 
@@ -491,7 +491,7 @@ static void vhost_user_blk_device_realize(DeviceState *dev, Error **errp)
     }
 
     virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK,
-                sizeof(struct virtio_blk_config), VIRTQUEUE_MAX_SIZE);
+                sizeof(struct virtio_blk_config), VIRTQUEUE_LEGACY_MAX_SIZE);
 
     s->virtqs = g_new(VirtQueue *, s->num_queues);
     for (i = 0; i < s->num_queues; i++) {
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 9c0f46815c..5883e3e7db 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -1171,10 +1171,10 @@ static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
         return;
     }
     if (!is_power_of_2(conf->queue_size) ||
-        conf->queue_size > VIRTQUEUE_MAX_SIZE) {
+        conf->queue_size > VIRTQUEUE_LEGACY_MAX_SIZE) {
         error_setg(errp, "invalid queue-size property (%" PRIu16 "), "
                    "must be a power of 2 (max %d)",
-                   conf->queue_size, VIRTQUEUE_MAX_SIZE);
+                   conf->queue_size, VIRTQUEUE_LEGACY_MAX_SIZE);
         return;
     }
 
@@ -1214,7 +1214,7 @@ static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
     virtio_blk_set_config_size(s, s->host_features);
 
     virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK, s->config_size,
-                VIRTQUEUE_MAX_SIZE);
+                VIRTQUEUE_LEGACY_MAX_SIZE);
 
     s->blk = conf->conf.blk;
     s->rq = NULL;
diff --git a/hw/char/virtio-serial-bus.c b/hw/char/virtio-serial-bus.c
index 9ad9111115..2d4285ab53 100644
--- a/hw/char/virtio-serial-bus.c
+++ b/hw/char/virtio-serial-bus.c
@@ -1045,7 +1045,7 @@ static void virtio_serial_device_realize(DeviceState *dev, Error **errp)
         config_size = offsetof(struct virtio_console_config, emerg_wr);
     }
     virtio_init(vdev, "virtio-serial", VIRTIO_ID_CONSOLE,
-                config_size, VIRTQUEUE_MAX_SIZE);
+                config_size, VIRTQUEUE_LEGACY_MAX_SIZE);
 
     /* Spawn a new virtio-serial bus on which the ports will ride as devices */
     qbus_init(&vser->bus, sizeof(vser->bus), TYPE_VIRTIO_SERIAL_BUS,
diff --git a/hw/input/virtio-input.c b/hw/input/virtio-input.c
index 345eb2cce7..b6b77488f2 100644
--- a/hw/input/virtio-input.c
+++ b/hw/input/virtio-input.c
@@ -258,7 +258,7 @@ static void virtio_input_device_realize(DeviceState *dev, Error **errp)
     assert(vinput->cfg_size <= sizeof(virtio_input_config));
 
     virtio_init(vdev, "virtio-input", VIRTIO_ID_INPUT,
-                vinput->cfg_size, VIRTQUEUE_MAX_SIZE);
+                vinput->cfg_size, VIRTQUEUE_LEGACY_MAX_SIZE);
     vinput->evt = virtio_add_queue(vdev, 64, virtio_input_handle_evt);
     vinput->sts = virtio_add_queue(vdev, 64, virtio_input_handle_sts);
 }
diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index f74b5f6268..5100978b07 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -636,7 +636,7 @@ static int virtio_net_max_tx_queue_size(VirtIONet *n)
         return VIRTIO_NET_TX_QUEUE_DEFAULT_SIZE;
     }
 
-    return VIRTQUEUE_MAX_SIZE;
+    return VIRTQUEUE_LEGACY_MAX_SIZE;
 }
 
 static int peer_attach(VirtIONet *n, int index)
@@ -3365,7 +3365,7 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
 
     virtio_net_set_config_size(n, n->host_features);
     virtio_init(vdev, "virtio-net", VIRTIO_ID_NET, n->config_size,
-                VIRTQUEUE_MAX_SIZE);
+                VIRTQUEUE_LEGACY_MAX_SIZE);
 
     /*
      * We set a lower limit on RX queue size to what it always was.
@@ -3373,23 +3373,23 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
      * help from us (using virtio 1 and up).
      */
     if (n->net_conf.rx_queue_size < VIRTIO_NET_RX_QUEUE_MIN_SIZE ||
-        n->net_conf.rx_queue_size > VIRTQUEUE_MAX_SIZE ||
+        n->net_conf.rx_queue_size > VIRTQUEUE_LEGACY_MAX_SIZE ||
         !is_power_of_2(n->net_conf.rx_queue_size)) {
         error_setg(errp, "Invalid rx_queue_size (= %" PRIu16 "), "
                    "must be a power of 2 between %d and %d.",
                    n->net_conf.rx_queue_size, VIRTIO_NET_RX_QUEUE_MIN_SIZE,
-                   VIRTQUEUE_MAX_SIZE);
+                   VIRTQUEUE_LEGACY_MAX_SIZE);
         virtio_cleanup(vdev);
         return;
     }
 
     if (n->net_conf.tx_queue_size < VIRTIO_NET_TX_QUEUE_MIN_SIZE ||
-        n->net_conf.tx_queue_size > VIRTQUEUE_MAX_SIZE ||
+        n->net_conf.tx_queue_size > VIRTQUEUE_LEGACY_MAX_SIZE ||
         !is_power_of_2(n->net_conf.tx_queue_size)) {
         error_setg(errp, "Invalid tx_queue_size (= %" PRIu16 "), "
                    "must be a power of 2 between %d and %d",
                    n->net_conf.tx_queue_size, VIRTIO_NET_TX_QUEUE_MIN_SIZE,
-                   VIRTQUEUE_MAX_SIZE);
+                   VIRTQUEUE_LEGACY_MAX_SIZE);
         virtio_cleanup(vdev);
         return;
     }
diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index 5e5e657e1d..f204e8878a 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -973,7 +973,7 @@ void virtio_scsi_common_realize(DeviceState *dev,
     int i;
 
     virtio_init(vdev, "virtio-scsi", VIRTIO_ID_SCSI,
-                sizeof(VirtIOSCSIConfig), VIRTQUEUE_MAX_SIZE);
+                sizeof(VirtIOSCSIConfig), VIRTQUEUE_LEGACY_MAX_SIZE);
 
     if (s->conf.num_queues == VIRTIO_SCSI_AUTO_NUM_QUEUES) {
         s->conf.num_queues = 1;
diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
index ae1672d667..decc5def39 100644
--- a/hw/virtio/vhost-user-fs.c
+++ b/hw/virtio/vhost-user-fs.c
@@ -209,9 +209,9 @@ static void vuf_device_realize(DeviceState *dev, Error **errp)
         return;
     }
 
-    if (fs->conf.queue_size > VIRTQUEUE_MAX_SIZE) {
+    if (fs->conf.queue_size > VIRTQUEUE_LEGACY_MAX_SIZE) {
         error_setg(errp, "queue-size property must be %u or smaller",
-                   VIRTQUEUE_MAX_SIZE);
+                   VIRTQUEUE_LEGACY_MAX_SIZE);
         return;
     }
 
@@ -220,7 +220,7 @@ static void vuf_device_realize(DeviceState *dev, Error **errp)
     }
 
     virtio_init(vdev, "vhost-user-fs", VIRTIO_ID_FS,
-                sizeof(struct virtio_fs_config), VIRTQUEUE_MAX_SIZE);
+                sizeof(struct virtio_fs_config), VIRTQUEUE_LEGACY_MAX_SIZE);
 
     /* Hiprio queue */
     fs->hiprio_vq = virtio_add_queue(vdev, fs->conf.queue_size, vuf_handle_output);
diff --git a/hw/virtio/vhost-user-i2c.c b/hw/virtio/vhost-user-i2c.c
index eeb1d8853a..b248ddbe93 100644
--- a/hw/virtio/vhost-user-i2c.c
+++ b/hw/virtio/vhost-user-i2c.c
@@ -221,7 +221,7 @@ static void vu_i2c_device_realize(DeviceState *dev, Error **errp)
     }
 
     virtio_init(vdev, "vhost-user-i2c", VIRTIO_ID_I2C_ADAPTER, 0,
-                VIRTQUEUE_MAX_SIZE);
+                VIRTQUEUE_LEGACY_MAX_SIZE);
 
     i2c->vhost_dev.nvqs = 1;
     i2c->vq = virtio_add_queue(vdev, 4, vu_i2c_handle_output);
diff --git a/hw/virtio/vhost-vsock-common.c b/hw/virtio/vhost-vsock-common.c
index a81fa884a8..73e6b72bba 100644
--- a/hw/virtio/vhost-vsock-common.c
+++ b/hw/virtio/vhost-vsock-common.c
@@ -201,7 +201,7 @@ void vhost_vsock_common_realize(VirtIODevice *vdev, const char *name)
     VHostVSockCommon *vvc = VHOST_VSOCK_COMMON(vdev);
 
     virtio_init(vdev, name, VIRTIO_ID_VSOCK,
-                sizeof(struct virtio_vsock_config), VIRTQUEUE_MAX_SIZE);
+                sizeof(struct virtio_vsock_config), VIRTQUEUE_LEGACY_MAX_SIZE);
 
     /* Receive and transmit queues belong to vhost */
     vvc->recv_vq = virtio_add_queue(vdev, VHOST_VSOCK_QUEUE_SIZE,
diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index 067c73223d..890fb15ed3 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -886,7 +886,7 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
     int ret;
 
     virtio_init(vdev, "virtio-balloon", VIRTIO_ID_BALLOON,
-                virtio_balloon_config_size(s), VIRTQUEUE_MAX_SIZE);
+                virtio_balloon_config_size(s), VIRTQUEUE_LEGACY_MAX_SIZE);
 
     ret = qemu_add_balloon_handler(virtio_balloon_to_target,
                                    virtio_balloon_stat, s);
diff --git a/hw/virtio/virtio-crypto.c b/hw/virtio/virtio-crypto.c
index 1e70d4d2a8..e13b6091d6 100644
--- a/hw/virtio/virtio-crypto.c
+++ b/hw/virtio/virtio-crypto.c
@@ -811,7 +811,7 @@ static void virtio_crypto_device_realize(DeviceState *dev, Error **errp)
     }
 
     virtio_init(vdev, "virtio-crypto", VIRTIO_ID_CRYPTO, vcrypto->config_size,
-                VIRTQUEUE_MAX_SIZE);
+                VIRTQUEUE_LEGACY_MAX_SIZE);
     vcrypto->curr_queues = 1;
     vcrypto->vqs = g_malloc0(sizeof(VirtIOCryptoQueue) * vcrypto->max_queues);
     for (i = 0; i < vcrypto->max_queues; i++) {
diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
index ca360e74eb..845df78842 100644
--- a/hw/virtio/virtio-iommu.c
+++ b/hw/virtio/virtio-iommu.c
@@ -974,7 +974,7 @@ static void virtio_iommu_device_realize(DeviceState *dev, Error **errp)
     VirtIOIOMMU *s = VIRTIO_IOMMU(dev);
 
     virtio_init(vdev, "virtio-iommu", VIRTIO_ID_IOMMU,
-                sizeof(struct virtio_iommu_config), VIRTQUEUE_MAX_SIZE);
+                sizeof(struct virtio_iommu_config), VIRTQUEUE_LEGACY_MAX_SIZE);
 
     memset(s->iommu_pcibus_by_bus_num, 0, sizeof(s->iommu_pcibus_by_bus_num));
 
diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index 1d9d01b871..7a39550cde 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -738,7 +738,7 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
     vmem->bitmap = bitmap_new(vmem->bitmap_size);
 
     virtio_init(vdev, TYPE_VIRTIO_MEM, VIRTIO_ID_MEM,
-                sizeof(struct virtio_mem_config), VIRTQUEUE_MAX_SIZE);
+                sizeof(struct virtio_mem_config), VIRTQUEUE_LEGACY_MAX_SIZE);
     vmem->vq = virtio_add_queue(vdev, 128, virtio_mem_handle_request);
 
     host_memory_backend_set_mapped(vmem->memdev, true);
diff --git a/hw/virtio/virtio-mmio.c b/hw/virtio/virtio-mmio.c
index 7b3ebca178..ae0cc223e9 100644
--- a/hw/virtio/virtio-mmio.c
+++ b/hw/virtio/virtio-mmio.c
@@ -174,7 +174,7 @@ static uint64_t virtio_mmio_read(void *opaque, hwaddr offset, unsigned size)
         if (!virtio_queue_get_num(vdev, vdev->queue_sel)) {
             return 0;
         }
-        return VIRTQUEUE_MAX_SIZE;
+        return VIRTQUEUE_LEGACY_MAX_SIZE;
     case VIRTIO_MMIO_QUEUE_PFN:
         if (!proxy->legacy) {
             qemu_log_mask(LOG_GUEST_ERROR,
@@ -348,7 +348,7 @@ static void virtio_mmio_write(void *opaque, hwaddr offset, uint64_t value,
         }
         break;
     case VIRTIO_MMIO_QUEUE_NUM:
-        trace_virtio_mmio_queue_write(value, VIRTQUEUE_MAX_SIZE);
+        trace_virtio_mmio_queue_write(value, VIRTQUEUE_LEGACY_MAX_SIZE);
         virtio_queue_set_num(vdev, vdev->queue_sel, value);
 
         if (proxy->legacy) {
diff --git a/hw/virtio/virtio-pmem.c b/hw/virtio/virtio-pmem.c
index 82b54b00c5..5f4d375b58 100644
--- a/hw/virtio/virtio-pmem.c
+++ b/hw/virtio/virtio-pmem.c
@@ -124,7 +124,7 @@ static void virtio_pmem_realize(DeviceState *dev, Error **errp)
 
     host_memory_backend_set_mapped(pmem->memdev, true);
     virtio_init(vdev, TYPE_VIRTIO_PMEM, VIRTIO_ID_PMEM,
-                sizeof(struct virtio_pmem_config), VIRTQUEUE_MAX_SIZE);
+                sizeof(struct virtio_pmem_config), VIRTQUEUE_LEGACY_MAX_SIZE);
     pmem->rq_vq = virtio_add_queue(vdev, 128, virtio_pmem_flush);
 }
 
diff --git a/hw/virtio/virtio-rng.c b/hw/virtio/virtio-rng.c
index 0e91d60106..ab075b22b6 100644
--- a/hw/virtio/virtio-rng.c
+++ b/hw/virtio/virtio-rng.c
@@ -215,7 +215,8 @@ static void virtio_rng_device_realize(DeviceState *dev, Error **errp)
         return;
     }
 
-    virtio_init(vdev, "virtio-rng", VIRTIO_ID_RNG, 0, VIRTQUEUE_MAX_SIZE);
+    virtio_init(vdev, "virtio-rng", VIRTIO_ID_RNG, 0,
+                VIRTQUEUE_LEGACY_MAX_SIZE);
 
     vrng->vq = virtio_add_queue(vdev, 8, handle_input);
     vrng->quota_remaining = vrng->conf.max_bytes;
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index a37d1f7d52..fe0f13266b 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -48,7 +48,25 @@ size_t virtio_feature_get_config_size(const VirtIOFeature *features,
 
 typedef struct VirtQueue VirtQueue;
 
-#define VIRTQUEUE_MAX_SIZE 1024
+/*
+ * This is meant as transitional measure for VIRTQUEUE_MAX_SIZE's old value
+ * of 1024 to its new value of 32768. On the long-term virtio users should
+ * either switch to VIRTQUEUE_MAX_SIZE, provided they support 32768,
+ * otherwise they should replace this macro on their side with an
+ * appropriate value actually supported by them.
+ *
+ * Once all virtio users switched, this macro will be removed.
+ */
+#define VIRTQUEUE_LEGACY_MAX_SIZE 1024
+
+/*
+ * Reflects the absolute theoretical maximum queue size (in amount of pages)
+ * ever possible, which is actually the maximum queue size allowed by the
+ * virtio protocol. This value therefore construes the maximum transfer size
+ * possible with virtio (multiplied by system dependent PAGE_SIZE); assuming
+ * a typical page size of 4k this would be a maximum transfer size of 128M.
+ */
+#define VIRTQUEUE_MAX_SIZE 32768
 
 typedef struct VirtQueueElement
 {
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 97+ messages in thread

* [PATCH v2 3/3] virtio-9p-device: switch to 32k max. transfer size
  2021-10-04 19:38 ` [Virtio-fs] " Christian Schoenebeck
@ 2021-10-04 19:38   ` Christian Schoenebeck
  -1 siblings, 0 replies; 97+ messages in thread
From: Christian Schoenebeck @ 2021-10-04 19:38 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, Greg Kurz, Raphael Norwitz, Kevin Wolf,
	Hanna Reitz, Stefan Hajnoczi, Laurent Vivier, Amit Shah,
	Marc-André Lureau, Paolo Bonzini, Gerd Hoffmann, Jason Wang,
	Fam Zheng, Dr. David Alan Gilbert, David Hildenbrand,
	Gonglei (Arei),
	Eric Auger, qemu-block, virtio-fs

9pfs supports the new maximum virtio queue size of 32k, so let's
switch the 9pfs virtio transport from 1k to 32k.

This will allow the 9p client to use a maximum 'msize' option
(maximum message size) of approximately 128M (assuming a 4k page
size; in practice slightly smaller, e.g. two pages less with the
Linux client).

Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
---
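(Illustrative arithmetic, not part of the patch: with 4k pages, 32768
descriptors of one page each give 32768 * 4096 = 134217728 bytes = 128 MiB
per transfer; if the Linux client indeed reserves 2 pages for transport
headers as noted above, the usable msize is roughly
(32768 - 2) * 4096 = 134209536 bytes, i.e. just under 128 MiB.)
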
 hw/9pfs/virtio-9p-device.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
index 9013e7df6e..cd5d95dd51 100644
--- a/hw/9pfs/virtio-9p-device.c
+++ b/hw/9pfs/virtio-9p-device.c
@@ -217,7 +217,7 @@ static void virtio_9p_device_realize(DeviceState *dev, Error **errp)
 
     v->config_size = sizeof(struct virtio_9p_config) + strlen(s->fsconf.tag);
     virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
-                VIRTQUEUE_LEGACY_MAX_SIZE);
+                VIRTQUEUE_MAX_SIZE);
     v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
 }
 
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 97+ messages in thread

* [Virtio-fs] [PATCH v2 3/3] virtio-9p-device: switch to 32k max. transfer size
@ 2021-10-04 19:38   ` Christian Schoenebeck
  0 siblings, 0 replies; 97+ messages in thread
From: Christian Schoenebeck @ 2021-10-04 19:38 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	Jason Wang, Amit Shah, David Hildenbrand, Raphael Norwitz,
	virtio-fs, Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Paolo Bonzini, Marc-André Lureau, Fam Zheng

9pfs supports the new maximum virtio queue size of 32k, so let's
switch the 9pfs virtio transport from 1k to 32k.

This will allow the 9p client to use a maximum 'msize' option
(maximum message size) of approximately 128M (assuming a 4k page
size; in practice slightly smaller, e.g. two pages less with the
Linux client).

Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
---
 hw/9pfs/virtio-9p-device.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
index 9013e7df6e..cd5d95dd51 100644
--- a/hw/9pfs/virtio-9p-device.c
+++ b/hw/9pfs/virtio-9p-device.c
@@ -217,7 +217,7 @@ static void virtio_9p_device_realize(DeviceState *dev, Error **errp)
 
     v->config_size = sizeof(struct virtio_9p_config) + strlen(s->fsconf.tag);
     virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
-                VIRTQUEUE_LEGACY_MAX_SIZE);
+                VIRTQUEUE_MAX_SIZE);
     v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
 }
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 97+ messages in thread

* Re: [PATCH v2 2/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
  2021-10-04 19:38   ` [Virtio-fs] " Christian Schoenebeck
@ 2021-10-05  7:16     ` Michael S. Tsirkin
  -1 siblings, 0 replies; 97+ messages in thread
From: Michael S. Tsirkin @ 2021-10-05  7:16 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Amit Shah, Jason Wang,
	David Hildenbrand, qemu-devel, Raphael Norwitz, virtio-fs,
	Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Stefan Hajnoczi, Paolo Bonzini,
	Marc-André Lureau, Fam Zheng, Dr. David Alan Gilbert,
	Greg Kurz

On Mon, Oct 04, 2021 at 09:38:08PM +0200, Christian Schoenebeck wrote:
> Raise the maximum possible virtio transfer size to 128M
> (more precisely: 32k * PAGE_SIZE). See previous commit for a
> more detailed explanation for the reasons of this change.
> 
> For not breaking any virtio user, all virtio users transition
> to using the new macro VIRTQUEUE_LEGACY_MAX_SIZE instead of
> VIRTQUEUE_MAX_SIZE, so they are all still using the old value
> of 1k with this commit.
> 
> On the long-term, each virtio user should subsequently either
> switch from VIRTQUEUE_LEGACY_MAX_SIZE to VIRTQUEUE_MAX_SIZE
> after checking that they support the new value of 32k, or
> otherwise they should replace the VIRTQUEUE_LEGACY_MAX_SIZE
> macro by an appropriate value supported by them.
> 
> Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>


I don't think we need this. Legacy isn't descriptive either.  Just leave
VIRTQUEUE_MAX_SIZE alone, and come up with a new name for 32k.
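
(Purely illustrative sketch of that alternative; the second macro name is
hypothetical, not something proposed in this thread:)

    #define VIRTQUEUE_MAX_SIZE      1024   /* unchanged historical limit */
    #define VIRTQUEUE_SPEC_MAX_SIZE 32768  /* hypothetical name for the spec
                                            * maximum, used only by devices
                                            * that opt in */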

> ---
>  hw/9pfs/virtio-9p-device.c     |  2 +-
>  hw/block/vhost-user-blk.c      |  6 +++---
>  hw/block/virtio-blk.c          |  6 +++---
>  hw/char/virtio-serial-bus.c    |  2 +-
>  hw/input/virtio-input.c        |  2 +-
>  hw/net/virtio-net.c            | 12 ++++++------
>  hw/scsi/virtio-scsi.c          |  2 +-
>  hw/virtio/vhost-user-fs.c      |  6 +++---
>  hw/virtio/vhost-user-i2c.c     |  2 +-
>  hw/virtio/vhost-vsock-common.c |  2 +-
>  hw/virtio/virtio-balloon.c     |  2 +-
>  hw/virtio/virtio-crypto.c      |  2 +-
>  hw/virtio/virtio-iommu.c       |  2 +-
>  hw/virtio/virtio-mem.c         |  2 +-
>  hw/virtio/virtio-mmio.c        |  4 ++--
>  hw/virtio/virtio-pmem.c        |  2 +-
>  hw/virtio/virtio-rng.c         |  3 ++-
>  include/hw/virtio/virtio.h     | 20 +++++++++++++++++++-
>  18 files changed, 49 insertions(+), 30 deletions(-)
> 
> diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
> index cd5d95dd51..9013e7df6e 100644
> --- a/hw/9pfs/virtio-9p-device.c
> +++ b/hw/9pfs/virtio-9p-device.c
> @@ -217,7 +217,7 @@ static void virtio_9p_device_realize(DeviceState *dev, Error **errp)
>  
>      v->config_size = sizeof(struct virtio_9p_config) + strlen(s->fsconf.tag);
>      virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
> -                VIRTQUEUE_MAX_SIZE);
> +                VIRTQUEUE_LEGACY_MAX_SIZE);
>      v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
>  }
>  
> diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
> index 336f56705c..e5e45262ab 100644
> --- a/hw/block/vhost-user-blk.c
> +++ b/hw/block/vhost-user-blk.c
> @@ -480,9 +480,9 @@ static void vhost_user_blk_device_realize(DeviceState *dev, Error **errp)
>          error_setg(errp, "queue size must be non-zero");
>          return;
>      }
> -    if (s->queue_size > VIRTQUEUE_MAX_SIZE) {
> +    if (s->queue_size > VIRTQUEUE_LEGACY_MAX_SIZE) {
>          error_setg(errp, "queue size must not exceed %d",
> -                   VIRTQUEUE_MAX_SIZE);
> +                   VIRTQUEUE_LEGACY_MAX_SIZE);
>          return;
>      }
>  
> @@ -491,7 +491,7 @@ static void vhost_user_blk_device_realize(DeviceState *dev, Error **errp)
>      }
>  
>      virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK,
> -                sizeof(struct virtio_blk_config), VIRTQUEUE_MAX_SIZE);
> +                sizeof(struct virtio_blk_config), VIRTQUEUE_LEGACY_MAX_SIZE);
>  
>      s->virtqs = g_new(VirtQueue *, s->num_queues);
>      for (i = 0; i < s->num_queues; i++) {
> diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
> index 9c0f46815c..5883e3e7db 100644
> --- a/hw/block/virtio-blk.c
> +++ b/hw/block/virtio-blk.c
> @@ -1171,10 +1171,10 @@ static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
>          return;
>      }
>      if (!is_power_of_2(conf->queue_size) ||
> -        conf->queue_size > VIRTQUEUE_MAX_SIZE) {
> +        conf->queue_size > VIRTQUEUE_LEGACY_MAX_SIZE) {
>          error_setg(errp, "invalid queue-size property (%" PRIu16 "), "
>                     "must be a power of 2 (max %d)",
> -                   conf->queue_size, VIRTQUEUE_MAX_SIZE);
> +                   conf->queue_size, VIRTQUEUE_LEGACY_MAX_SIZE);
>          return;
>      }
>  
> @@ -1214,7 +1214,7 @@ static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
>      virtio_blk_set_config_size(s, s->host_features);
>  
>      virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK, s->config_size,
> -                VIRTQUEUE_MAX_SIZE);
> +                VIRTQUEUE_LEGACY_MAX_SIZE);
>  
>      s->blk = conf->conf.blk;
>      s->rq = NULL;
> diff --git a/hw/char/virtio-serial-bus.c b/hw/char/virtio-serial-bus.c
> index 9ad9111115..2d4285ab53 100644
> --- a/hw/char/virtio-serial-bus.c
> +++ b/hw/char/virtio-serial-bus.c
> @@ -1045,7 +1045,7 @@ static void virtio_serial_device_realize(DeviceState *dev, Error **errp)
>          config_size = offsetof(struct virtio_console_config, emerg_wr);
>      }
>      virtio_init(vdev, "virtio-serial", VIRTIO_ID_CONSOLE,
> -                config_size, VIRTQUEUE_MAX_SIZE);
> +                config_size, VIRTQUEUE_LEGACY_MAX_SIZE);
>  
>      /* Spawn a new virtio-serial bus on which the ports will ride as devices */
>      qbus_init(&vser->bus, sizeof(vser->bus), TYPE_VIRTIO_SERIAL_BUS,
> diff --git a/hw/input/virtio-input.c b/hw/input/virtio-input.c
> index 345eb2cce7..b6b77488f2 100644
> --- a/hw/input/virtio-input.c
> +++ b/hw/input/virtio-input.c
> @@ -258,7 +258,7 @@ static void virtio_input_device_realize(DeviceState *dev, Error **errp)
>      assert(vinput->cfg_size <= sizeof(virtio_input_config));
>  
>      virtio_init(vdev, "virtio-input", VIRTIO_ID_INPUT,
> -                vinput->cfg_size, VIRTQUEUE_MAX_SIZE);
> +                vinput->cfg_size, VIRTQUEUE_LEGACY_MAX_SIZE);
>      vinput->evt = virtio_add_queue(vdev, 64, virtio_input_handle_evt);
>      vinput->sts = virtio_add_queue(vdev, 64, virtio_input_handle_sts);
>  }
> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> index f74b5f6268..5100978b07 100644
> --- a/hw/net/virtio-net.c
> +++ b/hw/net/virtio-net.c
> @@ -636,7 +636,7 @@ static int virtio_net_max_tx_queue_size(VirtIONet *n)
>          return VIRTIO_NET_TX_QUEUE_DEFAULT_SIZE;
>      }
>  
> -    return VIRTQUEUE_MAX_SIZE;
> +    return VIRTQUEUE_LEGACY_MAX_SIZE;
>  }
>  
>  static int peer_attach(VirtIONet *n, int index)
> @@ -3365,7 +3365,7 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
>  
>      virtio_net_set_config_size(n, n->host_features);
>      virtio_init(vdev, "virtio-net", VIRTIO_ID_NET, n->config_size,
> -                VIRTQUEUE_MAX_SIZE);
> +                VIRTQUEUE_LEGACY_MAX_SIZE);
>  
>      /*
>       * We set a lower limit on RX queue size to what it always was.
> @@ -3373,23 +3373,23 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
>       * help from us (using virtio 1 and up).
>       */
>      if (n->net_conf.rx_queue_size < VIRTIO_NET_RX_QUEUE_MIN_SIZE ||
> -        n->net_conf.rx_queue_size > VIRTQUEUE_MAX_SIZE ||
> +        n->net_conf.rx_queue_size > VIRTQUEUE_LEGACY_MAX_SIZE ||
>          !is_power_of_2(n->net_conf.rx_queue_size)) {
>          error_setg(errp, "Invalid rx_queue_size (= %" PRIu16 "), "
>                     "must be a power of 2 between %d and %d.",
>                     n->net_conf.rx_queue_size, VIRTIO_NET_RX_QUEUE_MIN_SIZE,
> -                   VIRTQUEUE_MAX_SIZE);
> +                   VIRTQUEUE_LEGACY_MAX_SIZE);
>          virtio_cleanup(vdev);
>          return;
>      }
>  
>      if (n->net_conf.tx_queue_size < VIRTIO_NET_TX_QUEUE_MIN_SIZE ||
> -        n->net_conf.tx_queue_size > VIRTQUEUE_MAX_SIZE ||
> +        n->net_conf.tx_queue_size > VIRTQUEUE_LEGACY_MAX_SIZE ||
>          !is_power_of_2(n->net_conf.tx_queue_size)) {
>          error_setg(errp, "Invalid tx_queue_size (= %" PRIu16 "), "
>                     "must be a power of 2 between %d and %d",
>                     n->net_conf.tx_queue_size, VIRTIO_NET_TX_QUEUE_MIN_SIZE,
> -                   VIRTQUEUE_MAX_SIZE);
> +                   VIRTQUEUE_LEGACY_MAX_SIZE);
>          virtio_cleanup(vdev);
>          return;
>      }
> diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
> index 5e5e657e1d..f204e8878a 100644
> --- a/hw/scsi/virtio-scsi.c
> +++ b/hw/scsi/virtio-scsi.c
> @@ -973,7 +973,7 @@ void virtio_scsi_common_realize(DeviceState *dev,
>      int i;
>  
>      virtio_init(vdev, "virtio-scsi", VIRTIO_ID_SCSI,
> -                sizeof(VirtIOSCSIConfig), VIRTQUEUE_MAX_SIZE);
> +                sizeof(VirtIOSCSIConfig), VIRTQUEUE_LEGACY_MAX_SIZE);
>  
>      if (s->conf.num_queues == VIRTIO_SCSI_AUTO_NUM_QUEUES) {
>          s->conf.num_queues = 1;
> diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
> index ae1672d667..decc5def39 100644
> --- a/hw/virtio/vhost-user-fs.c
> +++ b/hw/virtio/vhost-user-fs.c
> @@ -209,9 +209,9 @@ static void vuf_device_realize(DeviceState *dev, Error **errp)
>          return;
>      }
>  
> -    if (fs->conf.queue_size > VIRTQUEUE_MAX_SIZE) {
> +    if (fs->conf.queue_size > VIRTQUEUE_LEGACY_MAX_SIZE) {
>          error_setg(errp, "queue-size property must be %u or smaller",
> -                   VIRTQUEUE_MAX_SIZE);
> +                   VIRTQUEUE_LEGACY_MAX_SIZE);
>          return;
>      }
>  
> @@ -220,7 +220,7 @@ static void vuf_device_realize(DeviceState *dev, Error **errp)
>      }
>  
>      virtio_init(vdev, "vhost-user-fs", VIRTIO_ID_FS,
> -                sizeof(struct virtio_fs_config), VIRTQUEUE_MAX_SIZE);
> +                sizeof(struct virtio_fs_config), VIRTQUEUE_LEGACY_MAX_SIZE);
>  
>      /* Hiprio queue */
>      fs->hiprio_vq = virtio_add_queue(vdev, fs->conf.queue_size, vuf_handle_output);
> diff --git a/hw/virtio/vhost-user-i2c.c b/hw/virtio/vhost-user-i2c.c
> index eeb1d8853a..b248ddbe93 100644
> --- a/hw/virtio/vhost-user-i2c.c
> +++ b/hw/virtio/vhost-user-i2c.c
> @@ -221,7 +221,7 @@ static void vu_i2c_device_realize(DeviceState *dev, Error **errp)
>      }
>  
>      virtio_init(vdev, "vhost-user-i2c", VIRTIO_ID_I2C_ADAPTER, 0,
> -                VIRTQUEUE_MAX_SIZE);
> +                VIRTQUEUE_LEGACY_MAX_SIZE);
>  
>      i2c->vhost_dev.nvqs = 1;
>      i2c->vq = virtio_add_queue(vdev, 4, vu_i2c_handle_output);
> diff --git a/hw/virtio/vhost-vsock-common.c b/hw/virtio/vhost-vsock-common.c
> index a81fa884a8..73e6b72bba 100644
> --- a/hw/virtio/vhost-vsock-common.c
> +++ b/hw/virtio/vhost-vsock-common.c
> @@ -201,7 +201,7 @@ void vhost_vsock_common_realize(VirtIODevice *vdev, const char *name)
>      VHostVSockCommon *vvc = VHOST_VSOCK_COMMON(vdev);
>  
>      virtio_init(vdev, name, VIRTIO_ID_VSOCK,
> -                sizeof(struct virtio_vsock_config), VIRTQUEUE_MAX_SIZE);
> +                sizeof(struct virtio_vsock_config), VIRTQUEUE_LEGACY_MAX_SIZE);
>  
>      /* Receive and transmit queues belong to vhost */
>      vvc->recv_vq = virtio_add_queue(vdev, VHOST_VSOCK_QUEUE_SIZE,
> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> index 067c73223d..890fb15ed3 100644
> --- a/hw/virtio/virtio-balloon.c
> +++ b/hw/virtio/virtio-balloon.c
> @@ -886,7 +886,7 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
>      int ret;
>  
>      virtio_init(vdev, "virtio-balloon", VIRTIO_ID_BALLOON,
> -                virtio_balloon_config_size(s), VIRTQUEUE_MAX_SIZE);
> +                virtio_balloon_config_size(s), VIRTQUEUE_LEGACY_MAX_SIZE);
>  
>      ret = qemu_add_balloon_handler(virtio_balloon_to_target,
>                                     virtio_balloon_stat, s);
> diff --git a/hw/virtio/virtio-crypto.c b/hw/virtio/virtio-crypto.c
> index 1e70d4d2a8..e13b6091d6 100644
> --- a/hw/virtio/virtio-crypto.c
> +++ b/hw/virtio/virtio-crypto.c
> @@ -811,7 +811,7 @@ static void virtio_crypto_device_realize(DeviceState *dev, Error **errp)
>      }
>  
>      virtio_init(vdev, "virtio-crypto", VIRTIO_ID_CRYPTO, vcrypto->config_size,
> -                VIRTQUEUE_MAX_SIZE);
> +                VIRTQUEUE_LEGACY_MAX_SIZE);
>      vcrypto->curr_queues = 1;
>      vcrypto->vqs = g_malloc0(sizeof(VirtIOCryptoQueue) * vcrypto->max_queues);
>      for (i = 0; i < vcrypto->max_queues; i++) {
> diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
> index ca360e74eb..845df78842 100644
> --- a/hw/virtio/virtio-iommu.c
> +++ b/hw/virtio/virtio-iommu.c
> @@ -974,7 +974,7 @@ static void virtio_iommu_device_realize(DeviceState *dev, Error **errp)
>      VirtIOIOMMU *s = VIRTIO_IOMMU(dev);
>  
>      virtio_init(vdev, "virtio-iommu", VIRTIO_ID_IOMMU,
> -                sizeof(struct virtio_iommu_config), VIRTQUEUE_MAX_SIZE);
> +                sizeof(struct virtio_iommu_config), VIRTQUEUE_LEGACY_MAX_SIZE);
>  
>      memset(s->iommu_pcibus_by_bus_num, 0, sizeof(s->iommu_pcibus_by_bus_num));
>  
> diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
> index 1d9d01b871..7a39550cde 100644
> --- a/hw/virtio/virtio-mem.c
> +++ b/hw/virtio/virtio-mem.c
> @@ -738,7 +738,7 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
>      vmem->bitmap = bitmap_new(vmem->bitmap_size);
>  
>      virtio_init(vdev, TYPE_VIRTIO_MEM, VIRTIO_ID_MEM,
> -                sizeof(struct virtio_mem_config), VIRTQUEUE_MAX_SIZE);
> +                sizeof(struct virtio_mem_config), VIRTQUEUE_LEGACY_MAX_SIZE);
>      vmem->vq = virtio_add_queue(vdev, 128, virtio_mem_handle_request);
>  
>      host_memory_backend_set_mapped(vmem->memdev, true);
> diff --git a/hw/virtio/virtio-mmio.c b/hw/virtio/virtio-mmio.c
> index 7b3ebca178..ae0cc223e9 100644
> --- a/hw/virtio/virtio-mmio.c
> +++ b/hw/virtio/virtio-mmio.c
> @@ -174,7 +174,7 @@ static uint64_t virtio_mmio_read(void *opaque, hwaddr offset, unsigned size)
>          if (!virtio_queue_get_num(vdev, vdev->queue_sel)) {
>              return 0;
>          }
> -        return VIRTQUEUE_MAX_SIZE;
> +        return VIRTQUEUE_LEGACY_MAX_SIZE;
>      case VIRTIO_MMIO_QUEUE_PFN:
>          if (!proxy->legacy) {
>              qemu_log_mask(LOG_GUEST_ERROR,
> @@ -348,7 +348,7 @@ static void virtio_mmio_write(void *opaque, hwaddr offset, uint64_t value,
>          }
>          break;
>      case VIRTIO_MMIO_QUEUE_NUM:
> -        trace_virtio_mmio_queue_write(value, VIRTQUEUE_MAX_SIZE);
> +        trace_virtio_mmio_queue_write(value, VIRTQUEUE_LEGACY_MAX_SIZE);
>          virtio_queue_set_num(vdev, vdev->queue_sel, value);
>  
>          if (proxy->legacy) {
> diff --git a/hw/virtio/virtio-pmem.c b/hw/virtio/virtio-pmem.c
> index 82b54b00c5..5f4d375b58 100644
> --- a/hw/virtio/virtio-pmem.c
> +++ b/hw/virtio/virtio-pmem.c
> @@ -124,7 +124,7 @@ static void virtio_pmem_realize(DeviceState *dev, Error **errp)
>  
>      host_memory_backend_set_mapped(pmem->memdev, true);
>      virtio_init(vdev, TYPE_VIRTIO_PMEM, VIRTIO_ID_PMEM,
> -                sizeof(struct virtio_pmem_config), VIRTQUEUE_MAX_SIZE);
> +                sizeof(struct virtio_pmem_config), VIRTQUEUE_LEGACY_MAX_SIZE);
>      pmem->rq_vq = virtio_add_queue(vdev, 128, virtio_pmem_flush);
>  }
>  
> diff --git a/hw/virtio/virtio-rng.c b/hw/virtio/virtio-rng.c
> index 0e91d60106..ab075b22b6 100644
> --- a/hw/virtio/virtio-rng.c
> +++ b/hw/virtio/virtio-rng.c
> @@ -215,7 +215,8 @@ static void virtio_rng_device_realize(DeviceState *dev, Error **errp)
>          return;
>      }
>  
> -    virtio_init(vdev, "virtio-rng", VIRTIO_ID_RNG, 0, VIRTQUEUE_MAX_SIZE);
> +    virtio_init(vdev, "virtio-rng", VIRTIO_ID_RNG, 0,
> +                VIRTQUEUE_LEGACY_MAX_SIZE);
>  
>      vrng->vq = virtio_add_queue(vdev, 8, handle_input);
>      vrng->quota_remaining = vrng->conf.max_bytes;
> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> index a37d1f7d52..fe0f13266b 100644
> --- a/include/hw/virtio/virtio.h
> +++ b/include/hw/virtio/virtio.h
> @@ -48,7 +48,25 @@ size_t virtio_feature_get_config_size(const VirtIOFeature *features,
>  
>  typedef struct VirtQueue VirtQueue;
>  
> -#define VIRTQUEUE_MAX_SIZE 1024
> +/*
> + * This is meant as transitional measure for VIRTQUEUE_MAX_SIZE's old value
> + * of 1024 to its new value of 32768. On the long-term virtio users should
> + * either switch to VIRTQUEUE_MAX_SIZE, provided they support 32768,
> + * otherwise they should replace this macro on their side with an
> + * appropriate value actually supported by them.
> + *
> + * Once all virtio users switched, this macro will be removed.
> + */
> +#define VIRTQUEUE_LEGACY_MAX_SIZE 1024
> +
> +/*
> + * Reflects the absolute theoretical maximum queue size (in amount of pages)
> + * ever possible, which is actually the maximum queue size allowed by the
> + * virtio protocol. This value therefore construes the maximum transfer size
> + * possible with virtio (multiplied by system dependent PAGE_SIZE); assuming
> + * a typical page size of 4k this would be a maximum transfer size of 128M.
> + */
> +#define VIRTQUEUE_MAX_SIZE 32768
>  
>  typedef struct VirtQueueElement
>  {
> -- 
> 2.20.1



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Virtio-fs] [PATCH v2 2/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
@ 2021-10-05  7:16     ` Michael S. Tsirkin
  0 siblings, 0 replies; 97+ messages in thread
From: Michael S. Tsirkin @ 2021-10-05  7:16 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Amit Shah, Jason Wang,
	David Hildenbrand, qemu-devel, Raphael Norwitz, virtio-fs,
	Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Paolo Bonzini, Marc-André Lureau, Fam Zheng

On Mon, Oct 04, 2021 at 09:38:08PM +0200, Christian Schoenebeck wrote:
> Raise the maximum possible virtio transfer size to 128M
> (more precisely: 32k * PAGE_SIZE). See previous commit for a
> more detailed explanation for the reasons of this change.
> 
> For not breaking any virtio user, all virtio users transition
> to using the new macro VIRTQUEUE_LEGACY_MAX_SIZE instead of
> VIRTQUEUE_MAX_SIZE, so they are all still using the old value
> of 1k with this commit.
> 
> On the long-term, each virtio user should subsequently either
> switch from VIRTQUEUE_LEGACY_MAX_SIZE to VIRTQUEUE_MAX_SIZE
> after checking that they support the new value of 32k, or
> otherwise they should replace the VIRTQUEUE_LEGACY_MAX_SIZE
> macro by an appropriate value supported by them.
> 
> Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>


I don't think we need this. Legacy isn't descriptive either.  Just leave
VIRTQUEUE_MAX_SIZE alone, and come up with a new name for 32k.

> ---
>  hw/9pfs/virtio-9p-device.c     |  2 +-
>  hw/block/vhost-user-blk.c      |  6 +++---
>  hw/block/virtio-blk.c          |  6 +++---
>  hw/char/virtio-serial-bus.c    |  2 +-
>  hw/input/virtio-input.c        |  2 +-
>  hw/net/virtio-net.c            | 12 ++++++------
>  hw/scsi/virtio-scsi.c          |  2 +-
>  hw/virtio/vhost-user-fs.c      |  6 +++---
>  hw/virtio/vhost-user-i2c.c     |  2 +-
>  hw/virtio/vhost-vsock-common.c |  2 +-
>  hw/virtio/virtio-balloon.c     |  2 +-
>  hw/virtio/virtio-crypto.c      |  2 +-
>  hw/virtio/virtio-iommu.c       |  2 +-
>  hw/virtio/virtio-mem.c         |  2 +-
>  hw/virtio/virtio-mmio.c        |  4 ++--
>  hw/virtio/virtio-pmem.c        |  2 +-
>  hw/virtio/virtio-rng.c         |  3 ++-
>  include/hw/virtio/virtio.h     | 20 +++++++++++++++++++-
>  18 files changed, 49 insertions(+), 30 deletions(-)
> 
> diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
> index cd5d95dd51..9013e7df6e 100644
> --- a/hw/9pfs/virtio-9p-device.c
> +++ b/hw/9pfs/virtio-9p-device.c
> @@ -217,7 +217,7 @@ static void virtio_9p_device_realize(DeviceState *dev, Error **errp)
>  
>      v->config_size = sizeof(struct virtio_9p_config) + strlen(s->fsconf.tag);
>      virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
> -                VIRTQUEUE_MAX_SIZE);
> +                VIRTQUEUE_LEGACY_MAX_SIZE);
>      v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
>  }
>  
> diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
> index 336f56705c..e5e45262ab 100644
> --- a/hw/block/vhost-user-blk.c
> +++ b/hw/block/vhost-user-blk.c
> @@ -480,9 +480,9 @@ static void vhost_user_blk_device_realize(DeviceState *dev, Error **errp)
>          error_setg(errp, "queue size must be non-zero");
>          return;
>      }
> -    if (s->queue_size > VIRTQUEUE_MAX_SIZE) {
> +    if (s->queue_size > VIRTQUEUE_LEGACY_MAX_SIZE) {
>          error_setg(errp, "queue size must not exceed %d",
> -                   VIRTQUEUE_MAX_SIZE);
> +                   VIRTQUEUE_LEGACY_MAX_SIZE);
>          return;
>      }
>  
> @@ -491,7 +491,7 @@ static void vhost_user_blk_device_realize(DeviceState *dev, Error **errp)
>      }
>  
>      virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK,
> -                sizeof(struct virtio_blk_config), VIRTQUEUE_MAX_SIZE);
> +                sizeof(struct virtio_blk_config), VIRTQUEUE_LEGACY_MAX_SIZE);
>  
>      s->virtqs = g_new(VirtQueue *, s->num_queues);
>      for (i = 0; i < s->num_queues; i++) {
> diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
> index 9c0f46815c..5883e3e7db 100644
> --- a/hw/block/virtio-blk.c
> +++ b/hw/block/virtio-blk.c
> @@ -1171,10 +1171,10 @@ static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
>          return;
>      }
>      if (!is_power_of_2(conf->queue_size) ||
> -        conf->queue_size > VIRTQUEUE_MAX_SIZE) {
> +        conf->queue_size > VIRTQUEUE_LEGACY_MAX_SIZE) {
>          error_setg(errp, "invalid queue-size property (%" PRIu16 "), "
>                     "must be a power of 2 (max %d)",
> -                   conf->queue_size, VIRTQUEUE_MAX_SIZE);
> +                   conf->queue_size, VIRTQUEUE_LEGACY_MAX_SIZE);
>          return;
>      }
>  
> @@ -1214,7 +1214,7 @@ static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
>      virtio_blk_set_config_size(s, s->host_features);
>  
>      virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK, s->config_size,
> -                VIRTQUEUE_MAX_SIZE);
> +                VIRTQUEUE_LEGACY_MAX_SIZE);
>  
>      s->blk = conf->conf.blk;
>      s->rq = NULL;
> diff --git a/hw/char/virtio-serial-bus.c b/hw/char/virtio-serial-bus.c
> index 9ad9111115..2d4285ab53 100644
> --- a/hw/char/virtio-serial-bus.c
> +++ b/hw/char/virtio-serial-bus.c
> @@ -1045,7 +1045,7 @@ static void virtio_serial_device_realize(DeviceState *dev, Error **errp)
>          config_size = offsetof(struct virtio_console_config, emerg_wr);
>      }
>      virtio_init(vdev, "virtio-serial", VIRTIO_ID_CONSOLE,
> -                config_size, VIRTQUEUE_MAX_SIZE);
> +                config_size, VIRTQUEUE_LEGACY_MAX_SIZE);
>  
>      /* Spawn a new virtio-serial bus on which the ports will ride as devices */
>      qbus_init(&vser->bus, sizeof(vser->bus), TYPE_VIRTIO_SERIAL_BUS,
> diff --git a/hw/input/virtio-input.c b/hw/input/virtio-input.c
> index 345eb2cce7..b6b77488f2 100644
> --- a/hw/input/virtio-input.c
> +++ b/hw/input/virtio-input.c
> @@ -258,7 +258,7 @@ static void virtio_input_device_realize(DeviceState *dev, Error **errp)
>      assert(vinput->cfg_size <= sizeof(virtio_input_config));
>  
>      virtio_init(vdev, "virtio-input", VIRTIO_ID_INPUT,
> -                vinput->cfg_size, VIRTQUEUE_MAX_SIZE);
> +                vinput->cfg_size, VIRTQUEUE_LEGACY_MAX_SIZE);
>      vinput->evt = virtio_add_queue(vdev, 64, virtio_input_handle_evt);
>      vinput->sts = virtio_add_queue(vdev, 64, virtio_input_handle_sts);
>  }
> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> index f74b5f6268..5100978b07 100644
> --- a/hw/net/virtio-net.c
> +++ b/hw/net/virtio-net.c
> @@ -636,7 +636,7 @@ static int virtio_net_max_tx_queue_size(VirtIONet *n)
>          return VIRTIO_NET_TX_QUEUE_DEFAULT_SIZE;
>      }
>  
> -    return VIRTQUEUE_MAX_SIZE;
> +    return VIRTQUEUE_LEGACY_MAX_SIZE;
>  }
>  
>  static int peer_attach(VirtIONet *n, int index)
> @@ -3365,7 +3365,7 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
>  
>      virtio_net_set_config_size(n, n->host_features);
>      virtio_init(vdev, "virtio-net", VIRTIO_ID_NET, n->config_size,
> -                VIRTQUEUE_MAX_SIZE);
> +                VIRTQUEUE_LEGACY_MAX_SIZE);
>  
>      /*
>       * We set a lower limit on RX queue size to what it always was.
> @@ -3373,23 +3373,23 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
>       * help from us (using virtio 1 and up).
>       */
>      if (n->net_conf.rx_queue_size < VIRTIO_NET_RX_QUEUE_MIN_SIZE ||
> -        n->net_conf.rx_queue_size > VIRTQUEUE_MAX_SIZE ||
> +        n->net_conf.rx_queue_size > VIRTQUEUE_LEGACY_MAX_SIZE ||
>          !is_power_of_2(n->net_conf.rx_queue_size)) {
>          error_setg(errp, "Invalid rx_queue_size (= %" PRIu16 "), "
>                     "must be a power of 2 between %d and %d.",
>                     n->net_conf.rx_queue_size, VIRTIO_NET_RX_QUEUE_MIN_SIZE,
> -                   VIRTQUEUE_MAX_SIZE);
> +                   VIRTQUEUE_LEGACY_MAX_SIZE);
>          virtio_cleanup(vdev);
>          return;
>      }
>  
>      if (n->net_conf.tx_queue_size < VIRTIO_NET_TX_QUEUE_MIN_SIZE ||
> -        n->net_conf.tx_queue_size > VIRTQUEUE_MAX_SIZE ||
> +        n->net_conf.tx_queue_size > VIRTQUEUE_LEGACY_MAX_SIZE ||
>          !is_power_of_2(n->net_conf.tx_queue_size)) {
>          error_setg(errp, "Invalid tx_queue_size (= %" PRIu16 "), "
>                     "must be a power of 2 between %d and %d",
>                     n->net_conf.tx_queue_size, VIRTIO_NET_TX_QUEUE_MIN_SIZE,
> -                   VIRTQUEUE_MAX_SIZE);
> +                   VIRTQUEUE_LEGACY_MAX_SIZE);
>          virtio_cleanup(vdev);
>          return;
>      }
> diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
> index 5e5e657e1d..f204e8878a 100644
> --- a/hw/scsi/virtio-scsi.c
> +++ b/hw/scsi/virtio-scsi.c
> @@ -973,7 +973,7 @@ void virtio_scsi_common_realize(DeviceState *dev,
>      int i;
>  
>      virtio_init(vdev, "virtio-scsi", VIRTIO_ID_SCSI,
> -                sizeof(VirtIOSCSIConfig), VIRTQUEUE_MAX_SIZE);
> +                sizeof(VirtIOSCSIConfig), VIRTQUEUE_LEGACY_MAX_SIZE);
>  
>      if (s->conf.num_queues == VIRTIO_SCSI_AUTO_NUM_QUEUES) {
>          s->conf.num_queues = 1;
> diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
> index ae1672d667..decc5def39 100644
> --- a/hw/virtio/vhost-user-fs.c
> +++ b/hw/virtio/vhost-user-fs.c
> @@ -209,9 +209,9 @@ static void vuf_device_realize(DeviceState *dev, Error **errp)
>          return;
>      }
>  
> -    if (fs->conf.queue_size > VIRTQUEUE_MAX_SIZE) {
> +    if (fs->conf.queue_size > VIRTQUEUE_LEGACY_MAX_SIZE) {
>          error_setg(errp, "queue-size property must be %u or smaller",
> -                   VIRTQUEUE_MAX_SIZE);
> +                   VIRTQUEUE_LEGACY_MAX_SIZE);
>          return;
>      }
>  
> @@ -220,7 +220,7 @@ static void vuf_device_realize(DeviceState *dev, Error **errp)
>      }
>  
>      virtio_init(vdev, "vhost-user-fs", VIRTIO_ID_FS,
> -                sizeof(struct virtio_fs_config), VIRTQUEUE_MAX_SIZE);
> +                sizeof(struct virtio_fs_config), VIRTQUEUE_LEGACY_MAX_SIZE);
>  
>      /* Hiprio queue */
>      fs->hiprio_vq = virtio_add_queue(vdev, fs->conf.queue_size, vuf_handle_output);
> diff --git a/hw/virtio/vhost-user-i2c.c b/hw/virtio/vhost-user-i2c.c
> index eeb1d8853a..b248ddbe93 100644
> --- a/hw/virtio/vhost-user-i2c.c
> +++ b/hw/virtio/vhost-user-i2c.c
> @@ -221,7 +221,7 @@ static void vu_i2c_device_realize(DeviceState *dev, Error **errp)
>      }
>  
>      virtio_init(vdev, "vhost-user-i2c", VIRTIO_ID_I2C_ADAPTER, 0,
> -                VIRTQUEUE_MAX_SIZE);
> +                VIRTQUEUE_LEGACY_MAX_SIZE);
>  
>      i2c->vhost_dev.nvqs = 1;
>      i2c->vq = virtio_add_queue(vdev, 4, vu_i2c_handle_output);
> diff --git a/hw/virtio/vhost-vsock-common.c b/hw/virtio/vhost-vsock-common.c
> index a81fa884a8..73e6b72bba 100644
> --- a/hw/virtio/vhost-vsock-common.c
> +++ b/hw/virtio/vhost-vsock-common.c
> @@ -201,7 +201,7 @@ void vhost_vsock_common_realize(VirtIODevice *vdev, const char *name)
>      VHostVSockCommon *vvc = VHOST_VSOCK_COMMON(vdev);
>  
>      virtio_init(vdev, name, VIRTIO_ID_VSOCK,
> -                sizeof(struct virtio_vsock_config), VIRTQUEUE_MAX_SIZE);
> +                sizeof(struct virtio_vsock_config), VIRTQUEUE_LEGACY_MAX_SIZE);
>  
>      /* Receive and transmit queues belong to vhost */
>      vvc->recv_vq = virtio_add_queue(vdev, VHOST_VSOCK_QUEUE_SIZE,
> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> index 067c73223d..890fb15ed3 100644
> --- a/hw/virtio/virtio-balloon.c
> +++ b/hw/virtio/virtio-balloon.c
> @@ -886,7 +886,7 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
>      int ret;
>  
>      virtio_init(vdev, "virtio-balloon", VIRTIO_ID_BALLOON,
> -                virtio_balloon_config_size(s), VIRTQUEUE_MAX_SIZE);
> +                virtio_balloon_config_size(s), VIRTQUEUE_LEGACY_MAX_SIZE);
>  
>      ret = qemu_add_balloon_handler(virtio_balloon_to_target,
>                                     virtio_balloon_stat, s);
> diff --git a/hw/virtio/virtio-crypto.c b/hw/virtio/virtio-crypto.c
> index 1e70d4d2a8..e13b6091d6 100644
> --- a/hw/virtio/virtio-crypto.c
> +++ b/hw/virtio/virtio-crypto.c
> @@ -811,7 +811,7 @@ static void virtio_crypto_device_realize(DeviceState *dev, Error **errp)
>      }
>  
>      virtio_init(vdev, "virtio-crypto", VIRTIO_ID_CRYPTO, vcrypto->config_size,
> -                VIRTQUEUE_MAX_SIZE);
> +                VIRTQUEUE_LEGACY_MAX_SIZE);
>      vcrypto->curr_queues = 1;
>      vcrypto->vqs = g_malloc0(sizeof(VirtIOCryptoQueue) * vcrypto->max_queues);
>      for (i = 0; i < vcrypto->max_queues; i++) {
> diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
> index ca360e74eb..845df78842 100644
> --- a/hw/virtio/virtio-iommu.c
> +++ b/hw/virtio/virtio-iommu.c
> @@ -974,7 +974,7 @@ static void virtio_iommu_device_realize(DeviceState *dev, Error **errp)
>      VirtIOIOMMU *s = VIRTIO_IOMMU(dev);
>  
>      virtio_init(vdev, "virtio-iommu", VIRTIO_ID_IOMMU,
> -                sizeof(struct virtio_iommu_config), VIRTQUEUE_MAX_SIZE);
> +                sizeof(struct virtio_iommu_config), VIRTQUEUE_LEGACY_MAX_SIZE);
>  
>      memset(s->iommu_pcibus_by_bus_num, 0, sizeof(s->iommu_pcibus_by_bus_num));
>  
> diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
> index 1d9d01b871..7a39550cde 100644
> --- a/hw/virtio/virtio-mem.c
> +++ b/hw/virtio/virtio-mem.c
> @@ -738,7 +738,7 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
>      vmem->bitmap = bitmap_new(vmem->bitmap_size);
>  
>      virtio_init(vdev, TYPE_VIRTIO_MEM, VIRTIO_ID_MEM,
> -                sizeof(struct virtio_mem_config), VIRTQUEUE_MAX_SIZE);
> +                sizeof(struct virtio_mem_config), VIRTQUEUE_LEGACY_MAX_SIZE);
>      vmem->vq = virtio_add_queue(vdev, 128, virtio_mem_handle_request);
>  
>      host_memory_backend_set_mapped(vmem->memdev, true);
> diff --git a/hw/virtio/virtio-mmio.c b/hw/virtio/virtio-mmio.c
> index 7b3ebca178..ae0cc223e9 100644
> --- a/hw/virtio/virtio-mmio.c
> +++ b/hw/virtio/virtio-mmio.c
> @@ -174,7 +174,7 @@ static uint64_t virtio_mmio_read(void *opaque, hwaddr offset, unsigned size)
>          if (!virtio_queue_get_num(vdev, vdev->queue_sel)) {
>              return 0;
>          }
> -        return VIRTQUEUE_MAX_SIZE;
> +        return VIRTQUEUE_LEGACY_MAX_SIZE;
>      case VIRTIO_MMIO_QUEUE_PFN:
>          if (!proxy->legacy) {
>              qemu_log_mask(LOG_GUEST_ERROR,
> @@ -348,7 +348,7 @@ static void virtio_mmio_write(void *opaque, hwaddr offset, uint64_t value,
>          }
>          break;
>      case VIRTIO_MMIO_QUEUE_NUM:
> -        trace_virtio_mmio_queue_write(value, VIRTQUEUE_MAX_SIZE);
> +        trace_virtio_mmio_queue_write(value, VIRTQUEUE_LEGACY_MAX_SIZE);
>          virtio_queue_set_num(vdev, vdev->queue_sel, value);
>  
>          if (proxy->legacy) {
> diff --git a/hw/virtio/virtio-pmem.c b/hw/virtio/virtio-pmem.c
> index 82b54b00c5..5f4d375b58 100644
> --- a/hw/virtio/virtio-pmem.c
> +++ b/hw/virtio/virtio-pmem.c
> @@ -124,7 +124,7 @@ static void virtio_pmem_realize(DeviceState *dev, Error **errp)
>  
>      host_memory_backend_set_mapped(pmem->memdev, true);
>      virtio_init(vdev, TYPE_VIRTIO_PMEM, VIRTIO_ID_PMEM,
> -                sizeof(struct virtio_pmem_config), VIRTQUEUE_MAX_SIZE);
> +                sizeof(struct virtio_pmem_config), VIRTQUEUE_LEGACY_MAX_SIZE);
>      pmem->rq_vq = virtio_add_queue(vdev, 128, virtio_pmem_flush);
>  }
>  
> diff --git a/hw/virtio/virtio-rng.c b/hw/virtio/virtio-rng.c
> index 0e91d60106..ab075b22b6 100644
> --- a/hw/virtio/virtio-rng.c
> +++ b/hw/virtio/virtio-rng.c
> @@ -215,7 +215,8 @@ static void virtio_rng_device_realize(DeviceState *dev, Error **errp)
>          return;
>      }
>  
> -    virtio_init(vdev, "virtio-rng", VIRTIO_ID_RNG, 0, VIRTQUEUE_MAX_SIZE);
> +    virtio_init(vdev, "virtio-rng", VIRTIO_ID_RNG, 0,
> +                VIRTQUEUE_LEGACY_MAX_SIZE);
>  
>      vrng->vq = virtio_add_queue(vdev, 8, handle_input);
>      vrng->quota_remaining = vrng->conf.max_bytes;
> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> index a37d1f7d52..fe0f13266b 100644
> --- a/include/hw/virtio/virtio.h
> +++ b/include/hw/virtio/virtio.h
> @@ -48,7 +48,25 @@ size_t virtio_feature_get_config_size(const VirtIOFeature *features,
>  
>  typedef struct VirtQueue VirtQueue;
>  
> -#define VIRTQUEUE_MAX_SIZE 1024
> +/*
> + * This is meant as a transitional measure from VIRTQUEUE_MAX_SIZE's old
> + * value of 1024 to its new value of 32768. In the long term, virtio users
> + * should either switch to VIRTQUEUE_MAX_SIZE, provided they support 32768,
> + * or replace this macro on their side with an appropriate value that they
> + * actually support.
> + *
> + * Once all virtio users have switched, this macro will be removed.
> + */
> +#define VIRTQUEUE_LEGACY_MAX_SIZE 1024
> +
> +/*
> + * Reflects the absolute theoretical maximum queue size (in number of
> + * pages) ever possible, which is the maximum queue size allowed by the
> + * virtio protocol. This value therefore defines the maximum transfer size
> + * possible with virtio (multiplied by the system-dependent PAGE_SIZE);
> + * assuming a typical page size of 4k this yields a maximum transfer size
> + * of 128M.
> + */
> +#define VIRTQUEUE_MAX_SIZE 32768
>  
>  typedef struct VirtQueueElement
>  {
> -- 
> 2.20.1
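
For orientation, the two macros introduced in the hunk above differ only in
the ceiling they encode. A minimal sketch of how they relate, with the
transfer-size arithmetic spelled out (the QEMU_BUILD_BUG_ON() line is an
illustration assuming that macro is usable here; it is not part of the
posted patch):

    /* sketch only: the two ceilings introduced above, with the math written out */
    #define VIRTQUEUE_LEGACY_MAX_SIZE 1024   /* transitional:  1024 * 4k pages =   4M */
    #define VIRTQUEUE_MAX_SIZE        32768  /* spec maximum: 32768 * 4k pages = 128M */

    /* compile-time sanity check (illustrative only) */
    QEMU_BUILD_BUG_ON(VIRTQUEUE_LEGACY_MAX_SIZE > VIRTQUEUE_MAX_SIZE);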


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v2 2/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
  2021-10-05  7:16     ` [Virtio-fs] " Michael S. Tsirkin
@ 2021-10-05  7:35       ` Greg Kurz
  -1 siblings, 0 replies; 97+ messages in thread
From: Greg Kurz @ 2021-10-05  7:35 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Amit Shah, Jason Wang,
	David Hildenbrand, Christian Schoenebeck, qemu-devel,
	Raphael Norwitz, virtio-fs, Eric Auger, Hanna Reitz,
	Gonglei (Arei),
	Gerd Hoffmann, Stefan Hajnoczi, Paolo Bonzini,
	Marc-André Lureau, Fam Zheng, Dr. David Alan Gilbert

On Tue, 5 Oct 2021 03:16:07 -0400
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Mon, Oct 04, 2021 at 09:38:08PM +0200, Christian Schoenebeck wrote:
> > Raise the maximum possible virtio transfer size to 128M
> > (more precisely: 32k * PAGE_SIZE). See the previous commit for a
> > more detailed explanation of the reasons for this change.
> > 
> > To avoid breaking any virtio user, all virtio users transition
> > to using the new macro VIRTQUEUE_LEGACY_MAX_SIZE instead of
> > VIRTQUEUE_MAX_SIZE, so they all still use the old value
> > of 1k with this commit.
> > 
> > In the long term, each virtio user should subsequently either
> > switch from VIRTQUEUE_LEGACY_MAX_SIZE to VIRTQUEUE_MAX_SIZE
> > after checking that it supports the new value of 32k, or
> > replace the VIRTQUEUE_LEGACY_MAX_SIZE macro with an
> > appropriate value that it actually supports.
> > 
> > Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
> 
> 
> I don't think we need this. Legacy isn't descriptive either.  Just leave
> VIRTQUEUE_MAX_SIZE alone, and come up with a new name for 32k.
> 

Yes, I agree. Only virtio-9p is going to benefit from the new
size in the short/medium term, so it looks a bit excessive to
patch all devices. Also, in the end you end up reverting the name
change in the last patch for virtio-9p... which is an indication
that this patch does too much.

Introduce the new macro in virtio-9p and use it only there.
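
As a rough sketch of that suggestion (the macro name VIRTQUEUE_MAX_SIZE_SPEC
and its placement in the 9p device file are assumptions made here, not posted
code; VIRTQUEUE_MAX_SIZE itself would stay at 1024):

    /* hw/9pfs/virtio-9p-device.c (sketch) */

    /* maximum queue size allowed by the virtio spec; only virtio-9p opts in for now */
    #define VIRTQUEUE_MAX_SIZE_SPEC 32768

    static void virtio_9p_device_realize(DeviceState *dev, Error **errp)
    {
        /* ... unchanged setup ... */
        virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
                    VIRTQUEUE_MAX_SIZE_SPEC);
        v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
    }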

> > ---
> >  hw/9pfs/virtio-9p-device.c     |  2 +-
> >  hw/block/vhost-user-blk.c      |  6 +++---
> >  hw/block/virtio-blk.c          |  6 +++---
> >  hw/char/virtio-serial-bus.c    |  2 +-
> >  hw/input/virtio-input.c        |  2 +-
> >  hw/net/virtio-net.c            | 12 ++++++------
> >  hw/scsi/virtio-scsi.c          |  2 +-
> >  hw/virtio/vhost-user-fs.c      |  6 +++---
> >  hw/virtio/vhost-user-i2c.c     |  2 +-
> >  hw/virtio/vhost-vsock-common.c |  2 +-
> >  hw/virtio/virtio-balloon.c     |  2 +-
> >  hw/virtio/virtio-crypto.c      |  2 +-
> >  hw/virtio/virtio-iommu.c       |  2 +-
> >  hw/virtio/virtio-mem.c         |  2 +-
> >  hw/virtio/virtio-mmio.c        |  4 ++--
> >  hw/virtio/virtio-pmem.c        |  2 +-
> >  hw/virtio/virtio-rng.c         |  3 ++-
> >  include/hw/virtio/virtio.h     | 20 +++++++++++++++++++-
> >  18 files changed, 49 insertions(+), 30 deletions(-)
> > 
> > diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
> > index cd5d95dd51..9013e7df6e 100644
> > --- a/hw/9pfs/virtio-9p-device.c
> > +++ b/hw/9pfs/virtio-9p-device.c
> > @@ -217,7 +217,7 @@ static void virtio_9p_device_realize(DeviceState *dev, Error **errp)
> >  
> >      v->config_size = sizeof(struct virtio_9p_config) + strlen(s->fsconf.tag);
> >      virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
> > -                VIRTQUEUE_MAX_SIZE);
> > +                VIRTQUEUE_LEGACY_MAX_SIZE);
> >      v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
> >  }
> >  
> > diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
> > index 336f56705c..e5e45262ab 100644
> > --- a/hw/block/vhost-user-blk.c
> > +++ b/hw/block/vhost-user-blk.c
> > @@ -480,9 +480,9 @@ static void vhost_user_blk_device_realize(DeviceState *dev, Error **errp)
> >          error_setg(errp, "queue size must be non-zero");
> >          return;
> >      }
> > -    if (s->queue_size > VIRTQUEUE_MAX_SIZE) {
> > +    if (s->queue_size > VIRTQUEUE_LEGACY_MAX_SIZE) {
> >          error_setg(errp, "queue size must not exceed %d",
> > -                   VIRTQUEUE_MAX_SIZE);
> > +                   VIRTQUEUE_LEGACY_MAX_SIZE);
> >          return;
> >      }
> >  
> > @@ -491,7 +491,7 @@ static void vhost_user_blk_device_realize(DeviceState *dev, Error **errp)
> >      }
> >  
> >      virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK,
> > -                sizeof(struct virtio_blk_config), VIRTQUEUE_MAX_SIZE);
> > +                sizeof(struct virtio_blk_config), VIRTQUEUE_LEGACY_MAX_SIZE);
> >  
> >      s->virtqs = g_new(VirtQueue *, s->num_queues);
> >      for (i = 0; i < s->num_queues; i++) {
> > diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
> > index 9c0f46815c..5883e3e7db 100644
> > --- a/hw/block/virtio-blk.c
> > +++ b/hw/block/virtio-blk.c
> > @@ -1171,10 +1171,10 @@ static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
> >          return;
> >      }
> >      if (!is_power_of_2(conf->queue_size) ||
> > -        conf->queue_size > VIRTQUEUE_MAX_SIZE) {
> > +        conf->queue_size > VIRTQUEUE_LEGACY_MAX_SIZE) {
> >          error_setg(errp, "invalid queue-size property (%" PRIu16 "), "
> >                     "must be a power of 2 (max %d)",
> > -                   conf->queue_size, VIRTQUEUE_MAX_SIZE);
> > +                   conf->queue_size, VIRTQUEUE_LEGACY_MAX_SIZE);
> >          return;
> >      }
> >  
> > @@ -1214,7 +1214,7 @@ static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
> >      virtio_blk_set_config_size(s, s->host_features);
> >  
> >      virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK, s->config_size,
> > -                VIRTQUEUE_MAX_SIZE);
> > +                VIRTQUEUE_LEGACY_MAX_SIZE);
> >  
> >      s->blk = conf->conf.blk;
> >      s->rq = NULL;
> > diff --git a/hw/char/virtio-serial-bus.c b/hw/char/virtio-serial-bus.c
> > index 9ad9111115..2d4285ab53 100644
> > --- a/hw/char/virtio-serial-bus.c
> > +++ b/hw/char/virtio-serial-bus.c
> > @@ -1045,7 +1045,7 @@ static void virtio_serial_device_realize(DeviceState *dev, Error **errp)
> >          config_size = offsetof(struct virtio_console_config, emerg_wr);
> >      }
> >      virtio_init(vdev, "virtio-serial", VIRTIO_ID_CONSOLE,
> > -                config_size, VIRTQUEUE_MAX_SIZE);
> > +                config_size, VIRTQUEUE_LEGACY_MAX_SIZE);
> >  
> >      /* Spawn a new virtio-serial bus on which the ports will ride as devices */
> >      qbus_init(&vser->bus, sizeof(vser->bus), TYPE_VIRTIO_SERIAL_BUS,
> > diff --git a/hw/input/virtio-input.c b/hw/input/virtio-input.c
> > index 345eb2cce7..b6b77488f2 100644
> > --- a/hw/input/virtio-input.c
> > +++ b/hw/input/virtio-input.c
> > @@ -258,7 +258,7 @@ static void virtio_input_device_realize(DeviceState *dev, Error **errp)
> >      assert(vinput->cfg_size <= sizeof(virtio_input_config));
> >  
> >      virtio_init(vdev, "virtio-input", VIRTIO_ID_INPUT,
> > -                vinput->cfg_size, VIRTQUEUE_MAX_SIZE);
> > +                vinput->cfg_size, VIRTQUEUE_LEGACY_MAX_SIZE);
> >      vinput->evt = virtio_add_queue(vdev, 64, virtio_input_handle_evt);
> >      vinput->sts = virtio_add_queue(vdev, 64, virtio_input_handle_sts);
> >  }
> > diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> > index f74b5f6268..5100978b07 100644
> > --- a/hw/net/virtio-net.c
> > +++ b/hw/net/virtio-net.c
> > @@ -636,7 +636,7 @@ static int virtio_net_max_tx_queue_size(VirtIONet *n)
> >          return VIRTIO_NET_TX_QUEUE_DEFAULT_SIZE;
> >      }
> >  
> > -    return VIRTQUEUE_MAX_SIZE;
> > +    return VIRTQUEUE_LEGACY_MAX_SIZE;
> >  }
> >  
> >  static int peer_attach(VirtIONet *n, int index)
> > @@ -3365,7 +3365,7 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
> >  
> >      virtio_net_set_config_size(n, n->host_features);
> >      virtio_init(vdev, "virtio-net", VIRTIO_ID_NET, n->config_size,
> > -                VIRTQUEUE_MAX_SIZE);
> > +                VIRTQUEUE_LEGACY_MAX_SIZE);
> >  
> >      /*
> >       * We set a lower limit on RX queue size to what it always was.
> > @@ -3373,23 +3373,23 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
> >       * help from us (using virtio 1 and up).
> >       */
> >      if (n->net_conf.rx_queue_size < VIRTIO_NET_RX_QUEUE_MIN_SIZE ||
> > -        n->net_conf.rx_queue_size > VIRTQUEUE_MAX_SIZE ||
> > +        n->net_conf.rx_queue_size > VIRTQUEUE_LEGACY_MAX_SIZE ||
> >          !is_power_of_2(n->net_conf.rx_queue_size)) {
> >          error_setg(errp, "Invalid rx_queue_size (= %" PRIu16 "), "
> >                     "must be a power of 2 between %d and %d.",
> >                     n->net_conf.rx_queue_size, VIRTIO_NET_RX_QUEUE_MIN_SIZE,
> > -                   VIRTQUEUE_MAX_SIZE);
> > +                   VIRTQUEUE_LEGACY_MAX_SIZE);
> >          virtio_cleanup(vdev);
> >          return;
> >      }
> >  
> >      if (n->net_conf.tx_queue_size < VIRTIO_NET_TX_QUEUE_MIN_SIZE ||
> > -        n->net_conf.tx_queue_size > VIRTQUEUE_MAX_SIZE ||
> > +        n->net_conf.tx_queue_size > VIRTQUEUE_LEGACY_MAX_SIZE ||
> >          !is_power_of_2(n->net_conf.tx_queue_size)) {
> >          error_setg(errp, "Invalid tx_queue_size (= %" PRIu16 "), "
> >                     "must be a power of 2 between %d and %d",
> >                     n->net_conf.tx_queue_size, VIRTIO_NET_TX_QUEUE_MIN_SIZE,
> > -                   VIRTQUEUE_MAX_SIZE);
> > +                   VIRTQUEUE_LEGACY_MAX_SIZE);
> >          virtio_cleanup(vdev);
> >          return;
> >      }
> > diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
> > index 5e5e657e1d..f204e8878a 100644
> > --- a/hw/scsi/virtio-scsi.c
> > +++ b/hw/scsi/virtio-scsi.c
> > @@ -973,7 +973,7 @@ void virtio_scsi_common_realize(DeviceState *dev,
> >      int i;
> >  
> >      virtio_init(vdev, "virtio-scsi", VIRTIO_ID_SCSI,
> > -                sizeof(VirtIOSCSIConfig), VIRTQUEUE_MAX_SIZE);
> > +                sizeof(VirtIOSCSIConfig), VIRTQUEUE_LEGACY_MAX_SIZE);
> >  
> >      if (s->conf.num_queues == VIRTIO_SCSI_AUTO_NUM_QUEUES) {
> >          s->conf.num_queues = 1;
> > diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
> > index ae1672d667..decc5def39 100644
> > --- a/hw/virtio/vhost-user-fs.c
> > +++ b/hw/virtio/vhost-user-fs.c
> > @@ -209,9 +209,9 @@ static void vuf_device_realize(DeviceState *dev, Error **errp)
> >          return;
> >      }
> >  
> > -    if (fs->conf.queue_size > VIRTQUEUE_MAX_SIZE) {
> > +    if (fs->conf.queue_size > VIRTQUEUE_LEGACY_MAX_SIZE) {
> >          error_setg(errp, "queue-size property must be %u or smaller",
> > -                   VIRTQUEUE_MAX_SIZE);
> > +                   VIRTQUEUE_LEGACY_MAX_SIZE);
> >          return;
> >      }
> >  
> > @@ -220,7 +220,7 @@ static void vuf_device_realize(DeviceState *dev, Error **errp)
> >      }
> >  
> >      virtio_init(vdev, "vhost-user-fs", VIRTIO_ID_FS,
> > -                sizeof(struct virtio_fs_config), VIRTQUEUE_MAX_SIZE);
> > +                sizeof(struct virtio_fs_config), VIRTQUEUE_LEGACY_MAX_SIZE);
> >  
> >      /* Hiprio queue */
> >      fs->hiprio_vq = virtio_add_queue(vdev, fs->conf.queue_size, vuf_handle_output);
> > diff --git a/hw/virtio/vhost-user-i2c.c b/hw/virtio/vhost-user-i2c.c
> > index eeb1d8853a..b248ddbe93 100644
> > --- a/hw/virtio/vhost-user-i2c.c
> > +++ b/hw/virtio/vhost-user-i2c.c
> > @@ -221,7 +221,7 @@ static void vu_i2c_device_realize(DeviceState *dev, Error **errp)
> >      }
> >  
> >      virtio_init(vdev, "vhost-user-i2c", VIRTIO_ID_I2C_ADAPTER, 0,
> > -                VIRTQUEUE_MAX_SIZE);
> > +                VIRTQUEUE_LEGACY_MAX_SIZE);
> >  
> >      i2c->vhost_dev.nvqs = 1;
> >      i2c->vq = virtio_add_queue(vdev, 4, vu_i2c_handle_output);
> > diff --git a/hw/virtio/vhost-vsock-common.c b/hw/virtio/vhost-vsock-common.c
> > index a81fa884a8..73e6b72bba 100644
> > --- a/hw/virtio/vhost-vsock-common.c
> > +++ b/hw/virtio/vhost-vsock-common.c
> > @@ -201,7 +201,7 @@ void vhost_vsock_common_realize(VirtIODevice *vdev, const char *name)
> >      VHostVSockCommon *vvc = VHOST_VSOCK_COMMON(vdev);
> >  
> >      virtio_init(vdev, name, VIRTIO_ID_VSOCK,
> > -                sizeof(struct virtio_vsock_config), VIRTQUEUE_MAX_SIZE);
> > +                sizeof(struct virtio_vsock_config), VIRTQUEUE_LEGACY_MAX_SIZE);
> >  
> >      /* Receive and transmit queues belong to vhost */
> >      vvc->recv_vq = virtio_add_queue(vdev, VHOST_VSOCK_QUEUE_SIZE,
> > diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> > index 067c73223d..890fb15ed3 100644
> > --- a/hw/virtio/virtio-balloon.c
> > +++ b/hw/virtio/virtio-balloon.c
> > @@ -886,7 +886,7 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
> >      int ret;
> >  
> >      virtio_init(vdev, "virtio-balloon", VIRTIO_ID_BALLOON,
> > -                virtio_balloon_config_size(s), VIRTQUEUE_MAX_SIZE);
> > +                virtio_balloon_config_size(s), VIRTQUEUE_LEGACY_MAX_SIZE);
> >  
> >      ret = qemu_add_balloon_handler(virtio_balloon_to_target,
> >                                     virtio_balloon_stat, s);
> > diff --git a/hw/virtio/virtio-crypto.c b/hw/virtio/virtio-crypto.c
> > index 1e70d4d2a8..e13b6091d6 100644
> > --- a/hw/virtio/virtio-crypto.c
> > +++ b/hw/virtio/virtio-crypto.c
> > @@ -811,7 +811,7 @@ static void virtio_crypto_device_realize(DeviceState *dev, Error **errp)
> >      }
> >  
> >      virtio_init(vdev, "virtio-crypto", VIRTIO_ID_CRYPTO, vcrypto->config_size,
> > -                VIRTQUEUE_MAX_SIZE);
> > +                VIRTQUEUE_LEGACY_MAX_SIZE);
> >      vcrypto->curr_queues = 1;
> >      vcrypto->vqs = g_malloc0(sizeof(VirtIOCryptoQueue) * vcrypto->max_queues);
> >      for (i = 0; i < vcrypto->max_queues; i++) {
> > diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
> > index ca360e74eb..845df78842 100644
> > --- a/hw/virtio/virtio-iommu.c
> > +++ b/hw/virtio/virtio-iommu.c
> > @@ -974,7 +974,7 @@ static void virtio_iommu_device_realize(DeviceState *dev, Error **errp)
> >      VirtIOIOMMU *s = VIRTIO_IOMMU(dev);
> >  
> >      virtio_init(vdev, "virtio-iommu", VIRTIO_ID_IOMMU,
> > -                sizeof(struct virtio_iommu_config), VIRTQUEUE_MAX_SIZE);
> > +                sizeof(struct virtio_iommu_config), VIRTQUEUE_LEGACY_MAX_SIZE);
> >  
> >      memset(s->iommu_pcibus_by_bus_num, 0, sizeof(s->iommu_pcibus_by_bus_num));
> >  
> > diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
> > index 1d9d01b871..7a39550cde 100644
> > --- a/hw/virtio/virtio-mem.c
> > +++ b/hw/virtio/virtio-mem.c
> > @@ -738,7 +738,7 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
> >      vmem->bitmap = bitmap_new(vmem->bitmap_size);
> >  
> >      virtio_init(vdev, TYPE_VIRTIO_MEM, VIRTIO_ID_MEM,
> > -                sizeof(struct virtio_mem_config), VIRTQUEUE_MAX_SIZE);
> > +                sizeof(struct virtio_mem_config), VIRTQUEUE_LEGACY_MAX_SIZE);
> >      vmem->vq = virtio_add_queue(vdev, 128, virtio_mem_handle_request);
> >  
> >      host_memory_backend_set_mapped(vmem->memdev, true);
> > diff --git a/hw/virtio/virtio-mmio.c b/hw/virtio/virtio-mmio.c
> > index 7b3ebca178..ae0cc223e9 100644
> > --- a/hw/virtio/virtio-mmio.c
> > +++ b/hw/virtio/virtio-mmio.c
> > @@ -174,7 +174,7 @@ static uint64_t virtio_mmio_read(void *opaque, hwaddr offset, unsigned size)
> >          if (!virtio_queue_get_num(vdev, vdev->queue_sel)) {
> >              return 0;
> >          }
> > -        return VIRTQUEUE_MAX_SIZE;
> > +        return VIRTQUEUE_LEGACY_MAX_SIZE;
> >      case VIRTIO_MMIO_QUEUE_PFN:
> >          if (!proxy->legacy) {
> >              qemu_log_mask(LOG_GUEST_ERROR,
> > @@ -348,7 +348,7 @@ static void virtio_mmio_write(void *opaque, hwaddr offset, uint64_t value,
> >          }
> >          break;
> >      case VIRTIO_MMIO_QUEUE_NUM:
> > -        trace_virtio_mmio_queue_write(value, VIRTQUEUE_MAX_SIZE);
> > +        trace_virtio_mmio_queue_write(value, VIRTQUEUE_LEGACY_MAX_SIZE);
> >          virtio_queue_set_num(vdev, vdev->queue_sel, value);
> >  
> >          if (proxy->legacy) {
> > diff --git a/hw/virtio/virtio-pmem.c b/hw/virtio/virtio-pmem.c
> > index 82b54b00c5..5f4d375b58 100644
> > --- a/hw/virtio/virtio-pmem.c
> > +++ b/hw/virtio/virtio-pmem.c
> > @@ -124,7 +124,7 @@ static void virtio_pmem_realize(DeviceState *dev, Error **errp)
> >  
> >      host_memory_backend_set_mapped(pmem->memdev, true);
> >      virtio_init(vdev, TYPE_VIRTIO_PMEM, VIRTIO_ID_PMEM,
> > -                sizeof(struct virtio_pmem_config), VIRTQUEUE_MAX_SIZE);
> > +                sizeof(struct virtio_pmem_config), VIRTQUEUE_LEGACY_MAX_SIZE);
> >      pmem->rq_vq = virtio_add_queue(vdev, 128, virtio_pmem_flush);
> >  }
> >  
> > diff --git a/hw/virtio/virtio-rng.c b/hw/virtio/virtio-rng.c
> > index 0e91d60106..ab075b22b6 100644
> > --- a/hw/virtio/virtio-rng.c
> > +++ b/hw/virtio/virtio-rng.c
> > @@ -215,7 +215,8 @@ static void virtio_rng_device_realize(DeviceState *dev, Error **errp)
> >          return;
> >      }
> >  
> > -    virtio_init(vdev, "virtio-rng", VIRTIO_ID_RNG, 0, VIRTQUEUE_MAX_SIZE);
> > +    virtio_init(vdev, "virtio-rng", VIRTIO_ID_RNG, 0,
> > +                VIRTQUEUE_LEGACY_MAX_SIZE);
> >  
> >      vrng->vq = virtio_add_queue(vdev, 8, handle_input);
> >      vrng->quota_remaining = vrng->conf.max_bytes;
> > diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> > index a37d1f7d52..fe0f13266b 100644
> > --- a/include/hw/virtio/virtio.h
> > +++ b/include/hw/virtio/virtio.h
> > @@ -48,7 +48,25 @@ size_t virtio_feature_get_config_size(const VirtIOFeature *features,
> >  
> >  typedef struct VirtQueue VirtQueue;
> >  
> > -#define VIRTQUEUE_MAX_SIZE 1024
> > +/*
> > + * This is meant as a transitional measure from VIRTQUEUE_MAX_SIZE's old
> > + * value of 1024 to its new value of 32768. In the long term, virtio users
> > + * should either switch to VIRTQUEUE_MAX_SIZE, provided they support 32768,
> > + * or replace this macro on their side with an appropriate value that they
> > + * actually support.
> > + *
> > + * Once all virtio users have switched, this macro will be removed.
> > + */
> > +#define VIRTQUEUE_LEGACY_MAX_SIZE 1024
> > +
> > +/*
> > + * Reflects the absolute theoretical maximum queue size (in number of
> > + * pages) ever possible, which is the maximum queue size allowed by the
> > + * virtio protocol. This value therefore defines the maximum transfer size
> > + * possible with virtio (multiplied by the system-dependent PAGE_SIZE);
> > + * assuming a typical page size of 4k this yields a maximum transfer size
> > + * of 128M.
> > + */
> > +#define VIRTQUEUE_MAX_SIZE 32768
> >  
> >  typedef struct VirtQueueElement
> >  {
> > -- 
> > 2.20.1
> 



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v2 1/3] virtio: turn VIRTQUEUE_MAX_SIZE into a variable
  2021-10-04 19:38   ` [Virtio-fs] " Christian Schoenebeck
@ 2021-10-05  7:36     ` Greg Kurz
  -1 siblings, 0 replies; 97+ messages in thread
From: Greg Kurz @ 2021-10-05  7:36 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	Jason Wang, Amit Shah, David Hildenbrand, qemu-devel,
	Raphael Norwitz, virtio-fs, Eric Auger, Hanna Reitz,
	Gonglei (Arei),
	Gerd Hoffmann, Stefan Hajnoczi, Paolo Bonzini,
	Marc-André Lureau, Fam Zheng, Dr. David Alan Gilbert

On Mon, 4 Oct 2021 21:38:04 +0200
Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:

> Refactor VIRTQUEUE_MAX_SIZE to effectively become a runtime
> variable per virtio user.
> 
> Reasons:
> 
> (1) VIRTQUEUE_MAX_SIZE should reflect the absolute theoretical
>     maximum queue size possible, which is the maximum queue size
>     allowed by the virtio protocol. The appropriate value for
>     VIRTQUEUE_MAX_SIZE would therefore be 32768:
> 
>     https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-240006
> 
>     Apparently VIRTQUEUE_MAX_SIZE was instead defined with a
>     more or less arbitrary value of 1024 in the past, which
>     limits the maximum transfer size with virtio to 4M
>     (more precisely: 1024 * PAGE_SIZE, with the latter typically
>     being 4k).
> 
> (2) Additionally, the current value of 1024 poses a hidden limit,
>     invisible to the guest, which causes a system hang with the
>     following QEMU error if the guest tries to exceed it:
> 
>     virtio: too many write descriptors in indirect table
> 
> (3) Unfortunately, not all virtio users in QEMU would currently
>     work correctly with the new value of 32768.
> 
> So let's turn this hard-coded global value into a runtime
> variable as a first step in this commit, configurable for each
> virtio user by passing a corresponding value with the
> virtio_init() call.
> 
> Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
> ---

Reviewed-by: Greg Kurz <groug@kaod.org>
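
The refactor described in the commit message above makes virtio_init() take a
per-device queue size limit, stored as vdev->queue_max_size and used in place
of the old compile-time constant. A sketch of what the resulting prototype
plausibly looks like (parameter names and types are inferred from the hunks
below, not copied from the posted header change, which is not shown in this
excerpt):

    /* include/hw/virtio/virtio.h (sketch, not the literal posted prototype) */
    void virtio_init(VirtIODevice *vdev, const char *name,
                     uint16_t device_id, size_t config_size,
                     uint16_t queue_max_size);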

>  hw/9pfs/virtio-9p-device.c     |  3 ++-
>  hw/block/vhost-user-blk.c      |  2 +-
>  hw/block/virtio-blk.c          |  3 ++-
>  hw/char/virtio-serial-bus.c    |  2 +-
>  hw/display/virtio-gpu-base.c   |  2 +-
>  hw/input/virtio-input.c        |  2 +-
>  hw/net/virtio-net.c            | 15 ++++++++-------
>  hw/scsi/virtio-scsi.c          |  2 +-
>  hw/virtio/vhost-user-fs.c      |  2 +-
>  hw/virtio/vhost-user-i2c.c     |  3 ++-
>  hw/virtio/vhost-vsock-common.c |  2 +-
>  hw/virtio/virtio-balloon.c     |  4 ++--
>  hw/virtio/virtio-crypto.c      |  3 ++-
>  hw/virtio/virtio-iommu.c       |  2 +-
>  hw/virtio/virtio-mem.c         |  2 +-
>  hw/virtio/virtio-pmem.c        |  2 +-
>  hw/virtio/virtio-rng.c         |  2 +-
>  hw/virtio/virtio.c             | 35 +++++++++++++++++++++++-----------
>  include/hw/virtio/virtio.h     |  5 ++++-
>  19 files changed, 57 insertions(+), 36 deletions(-)
> 
> diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
> index 54ee93b71f..cd5d95dd51 100644
> --- a/hw/9pfs/virtio-9p-device.c
> +++ b/hw/9pfs/virtio-9p-device.c
> @@ -216,7 +216,8 @@ static void virtio_9p_device_realize(DeviceState *dev, Error **errp)
>      }
>  
>      v->config_size = sizeof(struct virtio_9p_config) + strlen(s->fsconf.tag);
> -    virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size);
> +    virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
> +                VIRTQUEUE_MAX_SIZE);
>      v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
>  }
>  
> diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
> index ba13cb87e5..336f56705c 100644
> --- a/hw/block/vhost-user-blk.c
> +++ b/hw/block/vhost-user-blk.c
> @@ -491,7 +491,7 @@ static void vhost_user_blk_device_realize(DeviceState *dev, Error **errp)
>      }
>  
>      virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK,
> -                sizeof(struct virtio_blk_config));
> +                sizeof(struct virtio_blk_config), VIRTQUEUE_MAX_SIZE);
>  
>      s->virtqs = g_new(VirtQueue *, s->num_queues);
>      for (i = 0; i < s->num_queues; i++) {
> diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
> index f139cd7cc9..9c0f46815c 100644
> --- a/hw/block/virtio-blk.c
> +++ b/hw/block/virtio-blk.c
> @@ -1213,7 +1213,8 @@ static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
>  
>      virtio_blk_set_config_size(s, s->host_features);
>  
> -    virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK, s->config_size);
> +    virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK, s->config_size,
> +                VIRTQUEUE_MAX_SIZE);
>  
>      s->blk = conf->conf.blk;
>      s->rq = NULL;
> diff --git a/hw/char/virtio-serial-bus.c b/hw/char/virtio-serial-bus.c
> index f01ec2137c..9ad9111115 100644
> --- a/hw/char/virtio-serial-bus.c
> +++ b/hw/char/virtio-serial-bus.c
> @@ -1045,7 +1045,7 @@ static void virtio_serial_device_realize(DeviceState *dev, Error **errp)
>          config_size = offsetof(struct virtio_console_config, emerg_wr);
>      }
>      virtio_init(vdev, "virtio-serial", VIRTIO_ID_CONSOLE,
> -                config_size);
> +                config_size, VIRTQUEUE_MAX_SIZE);
>  
>      /* Spawn a new virtio-serial bus on which the ports will ride as devices */
>      qbus_init(&vser->bus, sizeof(vser->bus), TYPE_VIRTIO_SERIAL_BUS,
> diff --git a/hw/display/virtio-gpu-base.c b/hw/display/virtio-gpu-base.c
> index c8da4806e0..20b06a7adf 100644
> --- a/hw/display/virtio-gpu-base.c
> +++ b/hw/display/virtio-gpu-base.c
> @@ -171,7 +171,7 @@ virtio_gpu_base_device_realize(DeviceState *qdev,
>  
>      g->virtio_config.num_scanouts = cpu_to_le32(g->conf.max_outputs);
>      virtio_init(VIRTIO_DEVICE(g), "virtio-gpu", VIRTIO_ID_GPU,
> -                sizeof(struct virtio_gpu_config));
> +                sizeof(struct virtio_gpu_config), VIRTQUEUE_MAX_SIZE);
>  
>      if (virtio_gpu_virgl_enabled(g->conf)) {
>          /* use larger control queue in 3d mode */
> diff --git a/hw/input/virtio-input.c b/hw/input/virtio-input.c
> index 54bcb46c74..345eb2cce7 100644
> --- a/hw/input/virtio-input.c
> +++ b/hw/input/virtio-input.c
> @@ -258,7 +258,7 @@ static void virtio_input_device_realize(DeviceState *dev, Error **errp)
>      assert(vinput->cfg_size <= sizeof(virtio_input_config));
>  
>      virtio_init(vdev, "virtio-input", VIRTIO_ID_INPUT,
> -                vinput->cfg_size);
> +                vinput->cfg_size, VIRTQUEUE_MAX_SIZE);
>      vinput->evt = virtio_add_queue(vdev, 64, virtio_input_handle_evt);
>      vinput->sts = virtio_add_queue(vdev, 64, virtio_input_handle_sts);
>  }
> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> index f205331dcf..f74b5f6268 100644
> --- a/hw/net/virtio-net.c
> +++ b/hw/net/virtio-net.c
> @@ -1746,9 +1746,9 @@ static ssize_t virtio_net_receive_rcu(NetClientState *nc, const uint8_t *buf,
>      VirtIONet *n = qemu_get_nic_opaque(nc);
>      VirtIONetQueue *q = virtio_net_get_subqueue(nc);
>      VirtIODevice *vdev = VIRTIO_DEVICE(n);
> -    VirtQueueElement *elems[VIRTQUEUE_MAX_SIZE];
> -    size_t lens[VIRTQUEUE_MAX_SIZE];
> -    struct iovec mhdr_sg[VIRTQUEUE_MAX_SIZE];
> +    VirtQueueElement *elems[vdev->queue_max_size];
> +    size_t lens[vdev->queue_max_size];
> +    struct iovec mhdr_sg[vdev->queue_max_size];
>      struct virtio_net_hdr_mrg_rxbuf mhdr;
>      unsigned mhdr_cnt = 0;
>      size_t offset, i, guest_offset, j;
> @@ -1783,7 +1783,7 @@ static ssize_t virtio_net_receive_rcu(NetClientState *nc, const uint8_t *buf,
>  
>          total = 0;
>  
> -        if (i == VIRTQUEUE_MAX_SIZE) {
> +        if (i == vdev->queue_max_size) {
>              virtio_error(vdev, "virtio-net unexpected long buffer chain");
>              err = size;
>              goto err;
> @@ -2532,7 +2532,7 @@ static int32_t virtio_net_flush_tx(VirtIONetQueue *q)
>      for (;;) {
>          ssize_t ret;
>          unsigned int out_num;
> -        struct iovec sg[VIRTQUEUE_MAX_SIZE], sg2[VIRTQUEUE_MAX_SIZE + 1], *out_sg;
> +        struct iovec sg[vdev->queue_max_size], sg2[vdev->queue_max_size + 1], *out_sg;
>          struct virtio_net_hdr_mrg_rxbuf mhdr;
>  
>          elem = virtqueue_pop(q->tx_vq, sizeof(VirtQueueElement));
> @@ -2564,7 +2564,7 @@ static int32_t virtio_net_flush_tx(VirtIONetQueue *q)
>                  out_num = iov_copy(&sg2[1], ARRAY_SIZE(sg2) - 1,
>                                     out_sg, out_num,
>                                     n->guest_hdr_len, -1);
> -                if (out_num == VIRTQUEUE_MAX_SIZE) {
> +                if (out_num == vdev->queue_max_size) {
>                      goto drop;
>                  }
>                  out_num += 1;
> @@ -3364,7 +3364,8 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
>      }
>  
>      virtio_net_set_config_size(n, n->host_features);
> -    virtio_init(vdev, "virtio-net", VIRTIO_ID_NET, n->config_size);
> +    virtio_init(vdev, "virtio-net", VIRTIO_ID_NET, n->config_size,
> +                VIRTQUEUE_MAX_SIZE);
>  
>      /*
>       * We set a lower limit on RX queue size to what it always was.
> diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
> index 51fd09522a..5e5e657e1d 100644
> --- a/hw/scsi/virtio-scsi.c
> +++ b/hw/scsi/virtio-scsi.c
> @@ -973,7 +973,7 @@ void virtio_scsi_common_realize(DeviceState *dev,
>      int i;
>  
>      virtio_init(vdev, "virtio-scsi", VIRTIO_ID_SCSI,
> -                sizeof(VirtIOSCSIConfig));
> +                sizeof(VirtIOSCSIConfig), VIRTQUEUE_MAX_SIZE);
>  
>      if (s->conf.num_queues == VIRTIO_SCSI_AUTO_NUM_QUEUES) {
>          s->conf.num_queues = 1;
> diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
> index c595957983..ae1672d667 100644
> --- a/hw/virtio/vhost-user-fs.c
> +++ b/hw/virtio/vhost-user-fs.c
> @@ -220,7 +220,7 @@ static void vuf_device_realize(DeviceState *dev, Error **errp)
>      }
>  
>      virtio_init(vdev, "vhost-user-fs", VIRTIO_ID_FS,
> -                sizeof(struct virtio_fs_config));
> +                sizeof(struct virtio_fs_config), VIRTQUEUE_MAX_SIZE);
>  
>      /* Hiprio queue */
>      fs->hiprio_vq = virtio_add_queue(vdev, fs->conf.queue_size, vuf_handle_output);
> diff --git a/hw/virtio/vhost-user-i2c.c b/hw/virtio/vhost-user-i2c.c
> index d172632bb0..eeb1d8853a 100644
> --- a/hw/virtio/vhost-user-i2c.c
> +++ b/hw/virtio/vhost-user-i2c.c
> @@ -220,7 +220,8 @@ static void vu_i2c_device_realize(DeviceState *dev, Error **errp)
>          return;
>      }
>  
> -    virtio_init(vdev, "vhost-user-i2c", VIRTIO_ID_I2C_ADAPTER, 0);
> +    virtio_init(vdev, "vhost-user-i2c", VIRTIO_ID_I2C_ADAPTER, 0,
> +                VIRTQUEUE_MAX_SIZE);
>  
>      i2c->vhost_dev.nvqs = 1;
>      i2c->vq = virtio_add_queue(vdev, 4, vu_i2c_handle_output);
> diff --git a/hw/virtio/vhost-vsock-common.c b/hw/virtio/vhost-vsock-common.c
> index 4ad6e234ad..a81fa884a8 100644
> --- a/hw/virtio/vhost-vsock-common.c
> +++ b/hw/virtio/vhost-vsock-common.c
> @@ -201,7 +201,7 @@ void vhost_vsock_common_realize(VirtIODevice *vdev, const char *name)
>      VHostVSockCommon *vvc = VHOST_VSOCK_COMMON(vdev);
>  
>      virtio_init(vdev, name, VIRTIO_ID_VSOCK,
> -                sizeof(struct virtio_vsock_config));
> +                sizeof(struct virtio_vsock_config), VIRTQUEUE_MAX_SIZE);
>  
>      /* Receive and transmit queues belong to vhost */
>      vvc->recv_vq = virtio_add_queue(vdev, VHOST_VSOCK_QUEUE_SIZE,
> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> index 5a69dce35d..067c73223d 100644
> --- a/hw/virtio/virtio-balloon.c
> +++ b/hw/virtio/virtio-balloon.c
> @@ -886,7 +886,7 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
>      int ret;
>  
>      virtio_init(vdev, "virtio-balloon", VIRTIO_ID_BALLOON,
> -                virtio_balloon_config_size(s));
> +                virtio_balloon_config_size(s), VIRTQUEUE_MAX_SIZE);
>  
>      ret = qemu_add_balloon_handler(virtio_balloon_to_target,
>                                     virtio_balloon_stat, s);
> @@ -909,7 +909,7 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
>      s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats);
>  
>      if (virtio_has_feature(s->host_features, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
> -        s->free_page_vq = virtio_add_queue(vdev, VIRTQUEUE_MAX_SIZE,
> +        s->free_page_vq = virtio_add_queue(vdev, vdev->queue_max_size,
>                                             virtio_balloon_handle_free_page_vq);
>          precopy_add_notifier(&s->free_page_hint_notify);
>  
> diff --git a/hw/virtio/virtio-crypto.c b/hw/virtio/virtio-crypto.c
> index 54f9bbb789..1e70d4d2a8 100644
> --- a/hw/virtio/virtio-crypto.c
> +++ b/hw/virtio/virtio-crypto.c
> @@ -810,7 +810,8 @@ static void virtio_crypto_device_realize(DeviceState *dev, Error **errp)
>          return;
>      }
>  
> -    virtio_init(vdev, "virtio-crypto", VIRTIO_ID_CRYPTO, vcrypto->config_size);
> +    virtio_init(vdev, "virtio-crypto", VIRTIO_ID_CRYPTO, vcrypto->config_size,
> +                VIRTQUEUE_MAX_SIZE);
>      vcrypto->curr_queues = 1;
>      vcrypto->vqs = g_malloc0(sizeof(VirtIOCryptoQueue) * vcrypto->max_queues);
>      for (i = 0; i < vcrypto->max_queues; i++) {
> diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
> index 1b23e8e18c..ca360e74eb 100644
> --- a/hw/virtio/virtio-iommu.c
> +++ b/hw/virtio/virtio-iommu.c
> @@ -974,7 +974,7 @@ static void virtio_iommu_device_realize(DeviceState *dev, Error **errp)
>      VirtIOIOMMU *s = VIRTIO_IOMMU(dev);
>  
>      virtio_init(vdev, "virtio-iommu", VIRTIO_ID_IOMMU,
> -                sizeof(struct virtio_iommu_config));
> +                sizeof(struct virtio_iommu_config), VIRTQUEUE_MAX_SIZE);
>  
>      memset(s->iommu_pcibus_by_bus_num, 0, sizeof(s->iommu_pcibus_by_bus_num));
>  
> diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
> index df91e454b2..1d9d01b871 100644
> --- a/hw/virtio/virtio-mem.c
> +++ b/hw/virtio/virtio-mem.c
> @@ -738,7 +738,7 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
>      vmem->bitmap = bitmap_new(vmem->bitmap_size);
>  
>      virtio_init(vdev, TYPE_VIRTIO_MEM, VIRTIO_ID_MEM,
> -                sizeof(struct virtio_mem_config));
> +                sizeof(struct virtio_mem_config), VIRTQUEUE_MAX_SIZE);
>      vmem->vq = virtio_add_queue(vdev, 128, virtio_mem_handle_request);
>  
>      host_memory_backend_set_mapped(vmem->memdev, true);
> diff --git a/hw/virtio/virtio-pmem.c b/hw/virtio/virtio-pmem.c
> index d1aeb90a31..82b54b00c5 100644
> --- a/hw/virtio/virtio-pmem.c
> +++ b/hw/virtio/virtio-pmem.c
> @@ -124,7 +124,7 @@ static void virtio_pmem_realize(DeviceState *dev, Error **errp)
>  
>      host_memory_backend_set_mapped(pmem->memdev, true);
>      virtio_init(vdev, TYPE_VIRTIO_PMEM, VIRTIO_ID_PMEM,
> -                sizeof(struct virtio_pmem_config));
> +                sizeof(struct virtio_pmem_config), VIRTQUEUE_MAX_SIZE);
>      pmem->rq_vq = virtio_add_queue(vdev, 128, virtio_pmem_flush);
>  }
>  
> diff --git a/hw/virtio/virtio-rng.c b/hw/virtio/virtio-rng.c
> index cc8e9f775d..0e91d60106 100644
> --- a/hw/virtio/virtio-rng.c
> +++ b/hw/virtio/virtio-rng.c
> @@ -215,7 +215,7 @@ static void virtio_rng_device_realize(DeviceState *dev, Error **errp)
>          return;
>      }
>  
> -    virtio_init(vdev, "virtio-rng", VIRTIO_ID_RNG, 0);
> +    virtio_init(vdev, "virtio-rng", VIRTIO_ID_RNG, 0, VIRTQUEUE_MAX_SIZE);
>  
>      vrng->vq = virtio_add_queue(vdev, 8, handle_input);
>      vrng->quota_remaining = vrng->conf.max_bytes;
> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> index 240759ff0b..60e094d96a 100644
> --- a/hw/virtio/virtio.c
> +++ b/hw/virtio/virtio.c
> @@ -1419,8 +1419,8 @@ static void *virtqueue_split_pop(VirtQueue *vq, size_t sz)
>      VirtIODevice *vdev = vq->vdev;
>      VirtQueueElement *elem = NULL;
>      unsigned out_num, in_num, elem_entries;
> -    hwaddr addr[VIRTQUEUE_MAX_SIZE];
> -    struct iovec iov[VIRTQUEUE_MAX_SIZE];
> +    hwaddr addr[vdev->queue_max_size];
> +    struct iovec iov[vdev->queue_max_size];
>      VRingDesc desc;
>      int rc;
>  
> @@ -1492,7 +1492,7 @@ static void *virtqueue_split_pop(VirtQueue *vq, size_t sz)
>          if (desc.flags & VRING_DESC_F_WRITE) {
>              map_ok = virtqueue_map_desc(vdev, &in_num, addr + out_num,
>                                          iov + out_num,
> -                                        VIRTQUEUE_MAX_SIZE - out_num, true,
> +                                        vdev->queue_max_size - out_num, true,
>                                          desc.addr, desc.len);
>          } else {
>              if (in_num) {
> @@ -1500,7 +1500,7 @@ static void *virtqueue_split_pop(VirtQueue *vq, size_t sz)
>                  goto err_undo_map;
>              }
>              map_ok = virtqueue_map_desc(vdev, &out_num, addr, iov,
> -                                        VIRTQUEUE_MAX_SIZE, false,
> +                                        vdev->queue_max_size, false,
>                                          desc.addr, desc.len);
>          }
>          if (!map_ok) {
> @@ -1556,8 +1556,8 @@ static void *virtqueue_packed_pop(VirtQueue *vq, size_t sz)
>      VirtIODevice *vdev = vq->vdev;
>      VirtQueueElement *elem = NULL;
>      unsigned out_num, in_num, elem_entries;
> -    hwaddr addr[VIRTQUEUE_MAX_SIZE];
> -    struct iovec iov[VIRTQUEUE_MAX_SIZE];
> +    hwaddr addr[vdev->queue_max_size];
> +    struct iovec iov[vdev->queue_max_size];
>      VRingPackedDesc desc;
>      uint16_t id;
>      int rc;
> @@ -1620,7 +1620,7 @@ static void *virtqueue_packed_pop(VirtQueue *vq, size_t sz)
>          if (desc.flags & VRING_DESC_F_WRITE) {
>              map_ok = virtqueue_map_desc(vdev, &in_num, addr + out_num,
>                                          iov + out_num,
> -                                        VIRTQUEUE_MAX_SIZE - out_num, true,
> +                                        vdev->queue_max_size - out_num, true,
>                                          desc.addr, desc.len);
>          } else {
>              if (in_num) {
> @@ -1628,7 +1628,7 @@ static void *virtqueue_packed_pop(VirtQueue *vq, size_t sz)
>                  goto err_undo_map;
>              }
>              map_ok = virtqueue_map_desc(vdev, &out_num, addr, iov,
> -                                        VIRTQUEUE_MAX_SIZE, false,
> +                                        vdev->queue_max_size, false,
>                                          desc.addr, desc.len);
>          }
>          if (!map_ok) {
> @@ -2249,7 +2249,7 @@ void virtio_queue_set_num(VirtIODevice *vdev, int n, int num)
>       * nonexistent states, or to set it to an invalid size.
>       */
>      if (!!num != !!vdev->vq[n].vring.num ||
> -        num > VIRTQUEUE_MAX_SIZE ||
> +        num > vdev->queue_max_size ||
>          num < 0) {
>          return;
>      }
> @@ -2400,7 +2400,7 @@ VirtQueue *virtio_add_queue(VirtIODevice *vdev, int queue_size,
>              break;
>      }
>  
> -    if (i == VIRTIO_QUEUE_MAX || queue_size > VIRTQUEUE_MAX_SIZE)
> +    if (i == VIRTIO_QUEUE_MAX || queue_size > vdev->queue_max_size)
>          abort();
>  
>      vdev->vq[i].vring.num = queue_size;
> @@ -3239,13 +3239,25 @@ void virtio_instance_init_common(Object *proxy_obj, void *data,
>  }
>  
>  void virtio_init(VirtIODevice *vdev, const char *name,
> -                 uint16_t device_id, size_t config_size)
> +                 uint16_t device_id, size_t config_size,
> +                 uint16_t queue_max_size)
>  {
>      BusState *qbus = qdev_get_parent_bus(DEVICE(vdev));
>      VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
>      int i;
>      int nvectors = k->query_nvectors ? k->query_nvectors(qbus->parent) : 0;
>  
> +    if (queue_max_size > VIRTQUEUE_MAX_SIZE ||
> +        !is_power_of_2(queue_max_size))
> +    {
> +        error_report(
> +            "virtio: invalid queue_max_size (= %" PRIu16 "), must be a "
> +            "power of 2 and between 1 ... %d.",
> +            queue_max_size, VIRTQUEUE_MAX_SIZE
> +        );
> +        abort();
> +    }
> +
>      if (nvectors) {
>          vdev->vector_queues =
>              g_malloc0(sizeof(*vdev->vector_queues) * nvectors);
> @@ -3258,6 +3270,7 @@ void virtio_init(VirtIODevice *vdev, const char *name,
>      qatomic_set(&vdev->isr, 0);
>      vdev->queue_sel = 0;
>      vdev->config_vector = VIRTIO_NO_VECTOR;
> +    vdev->queue_max_size = queue_max_size;
>      vdev->vq = g_malloc0(sizeof(VirtQueue) * VIRTIO_QUEUE_MAX);
>      vdev->vm_running = runstate_is_running();
>      vdev->broken = false;
> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> index 8bab9cfb75..a37d1f7d52 100644
> --- a/include/hw/virtio/virtio.h
> +++ b/include/hw/virtio/virtio.h
> @@ -89,6 +89,7 @@ struct VirtIODevice
>      size_t config_len;
>      void *config;
>      uint16_t config_vector;
> +    uint16_t queue_max_size;
>      uint32_t generation;
>      int nvectors;
>      VirtQueue *vq;
> @@ -166,7 +167,9 @@ void virtio_instance_init_common(Object *proxy_obj, void *data,
>                                   size_t vdev_size, const char *vdev_name);
>  
>  void virtio_init(VirtIODevice *vdev, const char *name,
> -                         uint16_t device_id, size_t config_size);
> +                 uint16_t device_id, size_t config_size,
> +                 uint16_t queue_max_size);
> +
>  void virtio_cleanup(VirtIODevice *vdev);
>  
>  void virtio_error(VirtIODevice *vdev, const char *fmt, ...) GCC_FMT_ATTR(2, 3);



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Virtio-fs] [PATCH v2 1/3] virtio: turn VIRTQUEUE_MAX_SIZE into a variable
@ 2021-10-05  7:36     ` Greg Kurz
  0 siblings, 0 replies; 97+ messages in thread
From: Greg Kurz @ 2021-10-05  7:36 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Fam Zheng, Michael S. Tsirkin, Jason Wang, qemu-devel, Gerd Hoffmann,
	virtio-fs, qemu-block, David Hildenbrand, Gonglei (Arei),
	Marc-André Lureau, Laurent Vivier, Amit Shah, Eric Auger,
	Kevin Wolf, Raphael Norwitz, Hanna Reitz, Paolo Bonzini

On Mon, 4 Oct 2021 21:38:04 +0200
Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:

> Refactor VIRTQUEUE_MAX_SIZE to effectively become a runtime
> variable per virtio user.
> 
> Reasons:
> 
> (1) VIRTQUEUE_MAX_SIZE should reflect the absolute theoretical
>     maximum queue size possible. Which is actually the maximum
>     queue size allowed by the virtio protocol. The appropriate
>     value for VIRTQUEUE_MAX_SIZE would therefore be 32768:
> 
>     https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-240006
> 
>     Apparently VIRTQUEUE_MAX_SIZE was instead defined with a
>     more or less arbitrary value of 1024 in the past, which
>     limits the maximum transfer size with virtio to 4M
>     (more precise: 1024 * PAGE_SIZE, with the latter typically
>     being 4k).
> 
> (2) Additionally the current value of 1024 poses a hidden limit,
>     invisible to guest, which causes a system hang with the
>     following QEMU error if guest tries to exceed it:
> 
>     virtio: too many write descriptors in indirect table
> 
> (3) Unfortunately not all virtio users in QEMU would currently
>     work correctly with the new value of 32768.
> 
> So let's turn this hard coded global value into a runtime
> variable as a first step in this commit, configurable for each
> virtio user by passing a corresponding value with virtio_init()
> call.
> 
> Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
> ---

Reviewed-by: Greg Kurz <groug@kaod.org>

>  hw/9pfs/virtio-9p-device.c     |  3 ++-
>  hw/block/vhost-user-blk.c      |  2 +-
>  hw/block/virtio-blk.c          |  3 ++-
>  hw/char/virtio-serial-bus.c    |  2 +-
>  hw/display/virtio-gpu-base.c   |  2 +-
>  hw/input/virtio-input.c        |  2 +-
>  hw/net/virtio-net.c            | 15 ++++++++-------
>  hw/scsi/virtio-scsi.c          |  2 +-
>  hw/virtio/vhost-user-fs.c      |  2 +-
>  hw/virtio/vhost-user-i2c.c     |  3 ++-
>  hw/virtio/vhost-vsock-common.c |  2 +-
>  hw/virtio/virtio-balloon.c     |  4 ++--
>  hw/virtio/virtio-crypto.c      |  3 ++-
>  hw/virtio/virtio-iommu.c       |  2 +-
>  hw/virtio/virtio-mem.c         |  2 +-
>  hw/virtio/virtio-pmem.c        |  2 +-
>  hw/virtio/virtio-rng.c         |  2 +-
>  hw/virtio/virtio.c             | 35 +++++++++++++++++++++++-----------
>  include/hw/virtio/virtio.h     |  5 ++++-
>  19 files changed, 57 insertions(+), 36 deletions(-)
> 
> diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
> index 54ee93b71f..cd5d95dd51 100644
> --- a/hw/9pfs/virtio-9p-device.c
> +++ b/hw/9pfs/virtio-9p-device.c
> @@ -216,7 +216,8 @@ static void virtio_9p_device_realize(DeviceState *dev, Error **errp)
>      }
>  
>      v->config_size = sizeof(struct virtio_9p_config) + strlen(s->fsconf.tag);
> -    virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size);
> +    virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
> +                VIRTQUEUE_MAX_SIZE);
>      v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
>  }
>  
> diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
> index ba13cb87e5..336f56705c 100644
> --- a/hw/block/vhost-user-blk.c
> +++ b/hw/block/vhost-user-blk.c
> @@ -491,7 +491,7 @@ static void vhost_user_blk_device_realize(DeviceState *dev, Error **errp)
>      }
>  
>      virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK,
> -                sizeof(struct virtio_blk_config));
> +                sizeof(struct virtio_blk_config), VIRTQUEUE_MAX_SIZE);
>  
>      s->virtqs = g_new(VirtQueue *, s->num_queues);
>      for (i = 0; i < s->num_queues; i++) {
> diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
> index f139cd7cc9..9c0f46815c 100644
> --- a/hw/block/virtio-blk.c
> +++ b/hw/block/virtio-blk.c
> @@ -1213,7 +1213,8 @@ static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
>  
>      virtio_blk_set_config_size(s, s->host_features);
>  
> -    virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK, s->config_size);
> +    virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK, s->config_size,
> +                VIRTQUEUE_MAX_SIZE);
>  
>      s->blk = conf->conf.blk;
>      s->rq = NULL;
> diff --git a/hw/char/virtio-serial-bus.c b/hw/char/virtio-serial-bus.c
> index f01ec2137c..9ad9111115 100644
> --- a/hw/char/virtio-serial-bus.c
> +++ b/hw/char/virtio-serial-bus.c
> @@ -1045,7 +1045,7 @@ static void virtio_serial_device_realize(DeviceState *dev, Error **errp)
>          config_size = offsetof(struct virtio_console_config, emerg_wr);
>      }
>      virtio_init(vdev, "virtio-serial", VIRTIO_ID_CONSOLE,
> -                config_size);
> +                config_size, VIRTQUEUE_MAX_SIZE);
>  
>      /* Spawn a new virtio-serial bus on which the ports will ride as devices */
>      qbus_init(&vser->bus, sizeof(vser->bus), TYPE_VIRTIO_SERIAL_BUS,
> diff --git a/hw/display/virtio-gpu-base.c b/hw/display/virtio-gpu-base.c
> index c8da4806e0..20b06a7adf 100644
> --- a/hw/display/virtio-gpu-base.c
> +++ b/hw/display/virtio-gpu-base.c
> @@ -171,7 +171,7 @@ virtio_gpu_base_device_realize(DeviceState *qdev,
>  
>      g->virtio_config.num_scanouts = cpu_to_le32(g->conf.max_outputs);
>      virtio_init(VIRTIO_DEVICE(g), "virtio-gpu", VIRTIO_ID_GPU,
> -                sizeof(struct virtio_gpu_config));
> +                sizeof(struct virtio_gpu_config), VIRTQUEUE_MAX_SIZE);
>  
>      if (virtio_gpu_virgl_enabled(g->conf)) {
>          /* use larger control queue in 3d mode */
> diff --git a/hw/input/virtio-input.c b/hw/input/virtio-input.c
> index 54bcb46c74..345eb2cce7 100644
> --- a/hw/input/virtio-input.c
> +++ b/hw/input/virtio-input.c
> @@ -258,7 +258,7 @@ static void virtio_input_device_realize(DeviceState *dev, Error **errp)
>      assert(vinput->cfg_size <= sizeof(virtio_input_config));
>  
>      virtio_init(vdev, "virtio-input", VIRTIO_ID_INPUT,
> -                vinput->cfg_size);
> +                vinput->cfg_size, VIRTQUEUE_MAX_SIZE);
>      vinput->evt = virtio_add_queue(vdev, 64, virtio_input_handle_evt);
>      vinput->sts = virtio_add_queue(vdev, 64, virtio_input_handle_sts);
>  }
> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> index f205331dcf..f74b5f6268 100644
> --- a/hw/net/virtio-net.c
> +++ b/hw/net/virtio-net.c
> @@ -1746,9 +1746,9 @@ static ssize_t virtio_net_receive_rcu(NetClientState *nc, const uint8_t *buf,
>      VirtIONet *n = qemu_get_nic_opaque(nc);
>      VirtIONetQueue *q = virtio_net_get_subqueue(nc);
>      VirtIODevice *vdev = VIRTIO_DEVICE(n);
> -    VirtQueueElement *elems[VIRTQUEUE_MAX_SIZE];
> -    size_t lens[VIRTQUEUE_MAX_SIZE];
> -    struct iovec mhdr_sg[VIRTQUEUE_MAX_SIZE];
> +    VirtQueueElement *elems[vdev->queue_max_size];
> +    size_t lens[vdev->queue_max_size];
> +    struct iovec mhdr_sg[vdev->queue_max_size];
>      struct virtio_net_hdr_mrg_rxbuf mhdr;
>      unsigned mhdr_cnt = 0;
>      size_t offset, i, guest_offset, j;
> @@ -1783,7 +1783,7 @@ static ssize_t virtio_net_receive_rcu(NetClientState *nc, const uint8_t *buf,
>  
>          total = 0;
>  
> -        if (i == VIRTQUEUE_MAX_SIZE) {
> +        if (i == vdev->queue_max_size) {
>              virtio_error(vdev, "virtio-net unexpected long buffer chain");
>              err = size;
>              goto err;
> @@ -2532,7 +2532,7 @@ static int32_t virtio_net_flush_tx(VirtIONetQueue *q)
>      for (;;) {
>          ssize_t ret;
>          unsigned int out_num;
> -        struct iovec sg[VIRTQUEUE_MAX_SIZE], sg2[VIRTQUEUE_MAX_SIZE + 1], *out_sg;
> +        struct iovec sg[vdev->queue_max_size], sg2[vdev->queue_max_size + 1], *out_sg;
>          struct virtio_net_hdr_mrg_rxbuf mhdr;
>  
>          elem = virtqueue_pop(q->tx_vq, sizeof(VirtQueueElement));
> @@ -2564,7 +2564,7 @@ static int32_t virtio_net_flush_tx(VirtIONetQueue *q)
>                  out_num = iov_copy(&sg2[1], ARRAY_SIZE(sg2) - 1,
>                                     out_sg, out_num,
>                                     n->guest_hdr_len, -1);
> -                if (out_num == VIRTQUEUE_MAX_SIZE) {
> +                if (out_num == vdev->queue_max_size) {
>                      goto drop;
>                  }
>                  out_num += 1;
> @@ -3364,7 +3364,8 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
>      }
>  
>      virtio_net_set_config_size(n, n->host_features);
> -    virtio_init(vdev, "virtio-net", VIRTIO_ID_NET, n->config_size);
> +    virtio_init(vdev, "virtio-net", VIRTIO_ID_NET, n->config_size,
> +                VIRTQUEUE_MAX_SIZE);
>  
>      /*
>       * We set a lower limit on RX queue size to what it always was.
> diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
> index 51fd09522a..5e5e657e1d 100644
> --- a/hw/scsi/virtio-scsi.c
> +++ b/hw/scsi/virtio-scsi.c
> @@ -973,7 +973,7 @@ void virtio_scsi_common_realize(DeviceState *dev,
>      int i;
>  
>      virtio_init(vdev, "virtio-scsi", VIRTIO_ID_SCSI,
> -                sizeof(VirtIOSCSIConfig));
> +                sizeof(VirtIOSCSIConfig), VIRTQUEUE_MAX_SIZE);
>  
>      if (s->conf.num_queues == VIRTIO_SCSI_AUTO_NUM_QUEUES) {
>          s->conf.num_queues = 1;
> diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
> index c595957983..ae1672d667 100644
> --- a/hw/virtio/vhost-user-fs.c
> +++ b/hw/virtio/vhost-user-fs.c
> @@ -220,7 +220,7 @@ static void vuf_device_realize(DeviceState *dev, Error **errp)
>      }
>  
>      virtio_init(vdev, "vhost-user-fs", VIRTIO_ID_FS,
> -                sizeof(struct virtio_fs_config));
> +                sizeof(struct virtio_fs_config), VIRTQUEUE_MAX_SIZE);
>  
>      /* Hiprio queue */
>      fs->hiprio_vq = virtio_add_queue(vdev, fs->conf.queue_size, vuf_handle_output);
> diff --git a/hw/virtio/vhost-user-i2c.c b/hw/virtio/vhost-user-i2c.c
> index d172632bb0..eeb1d8853a 100644
> --- a/hw/virtio/vhost-user-i2c.c
> +++ b/hw/virtio/vhost-user-i2c.c
> @@ -220,7 +220,8 @@ static void vu_i2c_device_realize(DeviceState *dev, Error **errp)
>          return;
>      }
>  
> -    virtio_init(vdev, "vhost-user-i2c", VIRTIO_ID_I2C_ADAPTER, 0);
> +    virtio_init(vdev, "vhost-user-i2c", VIRTIO_ID_I2C_ADAPTER, 0,
> +                VIRTQUEUE_MAX_SIZE);
>  
>      i2c->vhost_dev.nvqs = 1;
>      i2c->vq = virtio_add_queue(vdev, 4, vu_i2c_handle_output);
> diff --git a/hw/virtio/vhost-vsock-common.c b/hw/virtio/vhost-vsock-common.c
> index 4ad6e234ad..a81fa884a8 100644
> --- a/hw/virtio/vhost-vsock-common.c
> +++ b/hw/virtio/vhost-vsock-common.c
> @@ -201,7 +201,7 @@ void vhost_vsock_common_realize(VirtIODevice *vdev, const char *name)
>      VHostVSockCommon *vvc = VHOST_VSOCK_COMMON(vdev);
>  
>      virtio_init(vdev, name, VIRTIO_ID_VSOCK,
> -                sizeof(struct virtio_vsock_config));
> +                sizeof(struct virtio_vsock_config), VIRTQUEUE_MAX_SIZE);
>  
>      /* Receive and transmit queues belong to vhost */
>      vvc->recv_vq = virtio_add_queue(vdev, VHOST_VSOCK_QUEUE_SIZE,
> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> index 5a69dce35d..067c73223d 100644
> --- a/hw/virtio/virtio-balloon.c
> +++ b/hw/virtio/virtio-balloon.c
> @@ -886,7 +886,7 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
>      int ret;
>  
>      virtio_init(vdev, "virtio-balloon", VIRTIO_ID_BALLOON,
> -                virtio_balloon_config_size(s));
> +                virtio_balloon_config_size(s), VIRTQUEUE_MAX_SIZE);
>  
>      ret = qemu_add_balloon_handler(virtio_balloon_to_target,
>                                     virtio_balloon_stat, s);
> @@ -909,7 +909,7 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
>      s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats);
>  
>      if (virtio_has_feature(s->host_features, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
> -        s->free_page_vq = virtio_add_queue(vdev, VIRTQUEUE_MAX_SIZE,
> +        s->free_page_vq = virtio_add_queue(vdev, vdev->queue_max_size,
>                                             virtio_balloon_handle_free_page_vq);
>          precopy_add_notifier(&s->free_page_hint_notify);
>  
> diff --git a/hw/virtio/virtio-crypto.c b/hw/virtio/virtio-crypto.c
> index 54f9bbb789..1e70d4d2a8 100644
> --- a/hw/virtio/virtio-crypto.c
> +++ b/hw/virtio/virtio-crypto.c
> @@ -810,7 +810,8 @@ static void virtio_crypto_device_realize(DeviceState *dev, Error **errp)
>          return;
>      }
>  
> -    virtio_init(vdev, "virtio-crypto", VIRTIO_ID_CRYPTO, vcrypto->config_size);
> +    virtio_init(vdev, "virtio-crypto", VIRTIO_ID_CRYPTO, vcrypto->config_size,
> +                VIRTQUEUE_MAX_SIZE);
>      vcrypto->curr_queues = 1;
>      vcrypto->vqs = g_malloc0(sizeof(VirtIOCryptoQueue) * vcrypto->max_queues);
>      for (i = 0; i < vcrypto->max_queues; i++) {
> diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
> index 1b23e8e18c..ca360e74eb 100644
> --- a/hw/virtio/virtio-iommu.c
> +++ b/hw/virtio/virtio-iommu.c
> @@ -974,7 +974,7 @@ static void virtio_iommu_device_realize(DeviceState *dev, Error **errp)
>      VirtIOIOMMU *s = VIRTIO_IOMMU(dev);
>  
>      virtio_init(vdev, "virtio-iommu", VIRTIO_ID_IOMMU,
> -                sizeof(struct virtio_iommu_config));
> +                sizeof(struct virtio_iommu_config), VIRTQUEUE_MAX_SIZE);
>  
>      memset(s->iommu_pcibus_by_bus_num, 0, sizeof(s->iommu_pcibus_by_bus_num));
>  
> diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
> index df91e454b2..1d9d01b871 100644
> --- a/hw/virtio/virtio-mem.c
> +++ b/hw/virtio/virtio-mem.c
> @@ -738,7 +738,7 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
>      vmem->bitmap = bitmap_new(vmem->bitmap_size);
>  
>      virtio_init(vdev, TYPE_VIRTIO_MEM, VIRTIO_ID_MEM,
> -                sizeof(struct virtio_mem_config));
> +                sizeof(struct virtio_mem_config), VIRTQUEUE_MAX_SIZE);
>      vmem->vq = virtio_add_queue(vdev, 128, virtio_mem_handle_request);
>  
>      host_memory_backend_set_mapped(vmem->memdev, true);
> diff --git a/hw/virtio/virtio-pmem.c b/hw/virtio/virtio-pmem.c
> index d1aeb90a31..82b54b00c5 100644
> --- a/hw/virtio/virtio-pmem.c
> +++ b/hw/virtio/virtio-pmem.c
> @@ -124,7 +124,7 @@ static void virtio_pmem_realize(DeviceState *dev, Error **errp)
>  
>      host_memory_backend_set_mapped(pmem->memdev, true);
>      virtio_init(vdev, TYPE_VIRTIO_PMEM, VIRTIO_ID_PMEM,
> -                sizeof(struct virtio_pmem_config));
> +                sizeof(struct virtio_pmem_config), VIRTQUEUE_MAX_SIZE);
>      pmem->rq_vq = virtio_add_queue(vdev, 128, virtio_pmem_flush);
>  }
>  
> diff --git a/hw/virtio/virtio-rng.c b/hw/virtio/virtio-rng.c
> index cc8e9f775d..0e91d60106 100644
> --- a/hw/virtio/virtio-rng.c
> +++ b/hw/virtio/virtio-rng.c
> @@ -215,7 +215,7 @@ static void virtio_rng_device_realize(DeviceState *dev, Error **errp)
>          return;
>      }
>  
> -    virtio_init(vdev, "virtio-rng", VIRTIO_ID_RNG, 0);
> +    virtio_init(vdev, "virtio-rng", VIRTIO_ID_RNG, 0, VIRTQUEUE_MAX_SIZE);
>  
>      vrng->vq = virtio_add_queue(vdev, 8, handle_input);
>      vrng->quota_remaining = vrng->conf.max_bytes;
> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> index 240759ff0b..60e094d96a 100644
> --- a/hw/virtio/virtio.c
> +++ b/hw/virtio/virtio.c
> @@ -1419,8 +1419,8 @@ static void *virtqueue_split_pop(VirtQueue *vq, size_t sz)
>      VirtIODevice *vdev = vq->vdev;
>      VirtQueueElement *elem = NULL;
>      unsigned out_num, in_num, elem_entries;
> -    hwaddr addr[VIRTQUEUE_MAX_SIZE];
> -    struct iovec iov[VIRTQUEUE_MAX_SIZE];
> +    hwaddr addr[vdev->queue_max_size];
> +    struct iovec iov[vdev->queue_max_size];
>      VRingDesc desc;
>      int rc;
>  
> @@ -1492,7 +1492,7 @@ static void *virtqueue_split_pop(VirtQueue *vq, size_t sz)
>          if (desc.flags & VRING_DESC_F_WRITE) {
>              map_ok = virtqueue_map_desc(vdev, &in_num, addr + out_num,
>                                          iov + out_num,
> -                                        VIRTQUEUE_MAX_SIZE - out_num, true,
> +                                        vdev->queue_max_size - out_num, true,
>                                          desc.addr, desc.len);
>          } else {
>              if (in_num) {
> @@ -1500,7 +1500,7 @@ static void *virtqueue_split_pop(VirtQueue *vq, size_t sz)
>                  goto err_undo_map;
>              }
>              map_ok = virtqueue_map_desc(vdev, &out_num, addr, iov,
> -                                        VIRTQUEUE_MAX_SIZE, false,
> +                                        vdev->queue_max_size, false,
>                                          desc.addr, desc.len);
>          }
>          if (!map_ok) {
> @@ -1556,8 +1556,8 @@ static void *virtqueue_packed_pop(VirtQueue *vq, size_t sz)
>      VirtIODevice *vdev = vq->vdev;
>      VirtQueueElement *elem = NULL;
>      unsigned out_num, in_num, elem_entries;
> -    hwaddr addr[VIRTQUEUE_MAX_SIZE];
> -    struct iovec iov[VIRTQUEUE_MAX_SIZE];
> +    hwaddr addr[vdev->queue_max_size];
> +    struct iovec iov[vdev->queue_max_size];
>      VRingPackedDesc desc;
>      uint16_t id;
>      int rc;
> @@ -1620,7 +1620,7 @@ static void *virtqueue_packed_pop(VirtQueue *vq, size_t sz)
>          if (desc.flags & VRING_DESC_F_WRITE) {
>              map_ok = virtqueue_map_desc(vdev, &in_num, addr + out_num,
>                                          iov + out_num,
> -                                        VIRTQUEUE_MAX_SIZE - out_num, true,
> +                                        vdev->queue_max_size - out_num, true,
>                                          desc.addr, desc.len);
>          } else {
>              if (in_num) {
> @@ -1628,7 +1628,7 @@ static void *virtqueue_packed_pop(VirtQueue *vq, size_t sz)
>                  goto err_undo_map;
>              }
>              map_ok = virtqueue_map_desc(vdev, &out_num, addr, iov,
> -                                        VIRTQUEUE_MAX_SIZE, false,
> +                                        vdev->queue_max_size, false,
>                                          desc.addr, desc.len);
>          }
>          if (!map_ok) {
> @@ -2249,7 +2249,7 @@ void virtio_queue_set_num(VirtIODevice *vdev, int n, int num)
>       * nonexistent states, or to set it to an invalid size.
>       */
>      if (!!num != !!vdev->vq[n].vring.num ||
> -        num > VIRTQUEUE_MAX_SIZE ||
> +        num > vdev->queue_max_size ||
>          num < 0) {
>          return;
>      }
> @@ -2400,7 +2400,7 @@ VirtQueue *virtio_add_queue(VirtIODevice *vdev, int queue_size,
>              break;
>      }
>  
> -    if (i == VIRTIO_QUEUE_MAX || queue_size > VIRTQUEUE_MAX_SIZE)
> +    if (i == VIRTIO_QUEUE_MAX || queue_size > vdev->queue_max_size)
>          abort();
>  
>      vdev->vq[i].vring.num = queue_size;
> @@ -3239,13 +3239,25 @@ void virtio_instance_init_common(Object *proxy_obj, void *data,
>  }
>  
>  void virtio_init(VirtIODevice *vdev, const char *name,
> -                 uint16_t device_id, size_t config_size)
> +                 uint16_t device_id, size_t config_size,
> +                 uint16_t queue_max_size)
>  {
>      BusState *qbus = qdev_get_parent_bus(DEVICE(vdev));
>      VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
>      int i;
>      int nvectors = k->query_nvectors ? k->query_nvectors(qbus->parent) : 0;
>  
> +    if (queue_max_size > VIRTQUEUE_MAX_SIZE ||
> +        !is_power_of_2(queue_max_size))
> +    {
> +        error_report(
> +            "virtio: invalid queue_max_size (= %" PRIu16 "), must be a "
> +            "power of 2 and between 1 ... %d.",
> +            queue_max_size, VIRTQUEUE_MAX_SIZE
> +        );
> +        abort();
> +    }
> +
>      if (nvectors) {
>          vdev->vector_queues =
>              g_malloc0(sizeof(*vdev->vector_queues) * nvectors);
> @@ -3258,6 +3270,7 @@ void virtio_init(VirtIODevice *vdev, const char *name,
>      qatomic_set(&vdev->isr, 0);
>      vdev->queue_sel = 0;
>      vdev->config_vector = VIRTIO_NO_VECTOR;
> +    vdev->queue_max_size = queue_max_size;
>      vdev->vq = g_malloc0(sizeof(VirtQueue) * VIRTIO_QUEUE_MAX);
>      vdev->vm_running = runstate_is_running();
>      vdev->broken = false;
> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> index 8bab9cfb75..a37d1f7d52 100644
> --- a/include/hw/virtio/virtio.h
> +++ b/include/hw/virtio/virtio.h
> @@ -89,6 +89,7 @@ struct VirtIODevice
>      size_t config_len;
>      void *config;
>      uint16_t config_vector;
> +    uint16_t queue_max_size;
>      uint32_t generation;
>      int nvectors;
>      VirtQueue *vq;
> @@ -166,7 +167,9 @@ void virtio_instance_init_common(Object *proxy_obj, void *data,
>                                   size_t vdev_size, const char *vdev_name);
>  
>  void virtio_init(VirtIODevice *vdev, const char *name,
> -                         uint16_t device_id, size_t config_size);
> +                 uint16_t device_id, size_t config_size,
> +                 uint16_t queue_max_size);
> +
>  void virtio_cleanup(VirtIODevice *vdev);
>  
>  void virtio_error(VirtIODevice *vdev, const char *fmt, ...) GCC_FMT_ATTR(2, 3);


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
  2021-10-04 19:38 ` [Virtio-fs] " Christian Schoenebeck
@ 2021-10-05  7:38   ` David Hildenbrand
  -1 siblings, 0 replies; 97+ messages in thread
From: David Hildenbrand @ 2021-10-05  7:38 UTC (permalink / raw)
  To: Christian Schoenebeck, qemu-devel
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	Jason Wang, Amit Shah, Greg Kurz, Raphael Norwitz, virtio-fs,
	Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Stefan Hajnoczi, Paolo Bonzini,
	Marc-André Lureau, Fam Zheng, Dr. David Alan Gilbert

On 04.10.21 21:38, Christian Schoenebeck wrote:
> At the moment the maximum transfer size with virtio is limited to 4M
> (1024 * PAGE_SIZE). This series raises this limit to its maximum
> theoretical possible transfer size of 128M (32k pages) according to the
> virtio specs:
> 
> https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-240006
> 

I'm missing the "why do we care". Can you comment on that?


-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Virtio-fs] [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
@ 2021-10-05  7:38   ` David Hildenbrand
  0 siblings, 0 replies; 97+ messages in thread
From: David Hildenbrand @ 2021-10-05  7:38 UTC (permalink / raw)
  To: Christian Schoenebeck, qemu-devel
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	Jason Wang, Amit Shah, Raphael Norwitz, virtio-fs, Eric Auger,
	Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Paolo Bonzini, Marc-André Lureau, Fam Zheng

On 04.10.21 21:38, Christian Schoenebeck wrote:
> At the moment the maximum transfer size with virtio is limited to 4M
> (1024 * PAGE_SIZE). This series raises this limit to its maximum
> theoretical possible transfer size of 128M (32k pages) according to the
> virtio specs:
> 
> https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-240006
> 

I'm missing the "why do we care". Can you comment on that?


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
  2021-10-05  7:38   ` [Virtio-fs] " David Hildenbrand
@ 2021-10-05 11:10     ` Christian Schoenebeck
  -1 siblings, 0 replies; 97+ messages in thread
From: Christian Schoenebeck @ 2021-10-05 11:10 UTC (permalink / raw)
  To: qemu-devel
  Cc: David Hildenbrand, Kevin Wolf, Laurent Vivier, qemu-block,
	Michael S. Tsirkin, Jason Wang, Amit Shah, Greg Kurz,
	Raphael Norwitz, virtio-fs, Eric Auger, Hanna Reitz,
	Gonglei (Arei),
	Gerd Hoffmann, Stefan Hajnoczi, Paolo Bonzini,
	Marc-André Lureau, Fam Zheng, Dr. David Alan Gilbert

On Dienstag, 5. Oktober 2021 09:38:53 CEST David Hildenbrand wrote:
> On 04.10.21 21:38, Christian Schoenebeck wrote:
> > At the moment the maximum transfer size with virtio is limited to 4M
> > (1024 * PAGE_SIZE). This series raises this limit to its maximum
> > theoretical possible transfer size of 128M (32k pages) according to the
> > virtio specs:
> > 
> > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#
> > x1-240006
> I'm missing the "why do we care". Can you comment on that?

The primary motivation is the possibility of improved performance, e.g. in the 
case of 9pfs, people can raise the maximum transfer size with the Linux 9p 
client's 'msize' option on the guest side (and only on the guest side, 
actually). If the guest performs large chunk I/O, e.g. consider something 
"useful" like this one on the guest side:

  time cat large_file_on_9pfs.dat > /dev/null

Then there is a noticeable performance increase with higher transfer size 
values. The performance gain is continuous as the transfer size rises, but the 
incremental improvement obviously shrinks with larger transfer sizes as well, 
as with similar concepts in general like cache sizes, etc.
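
To put those transfer size values into numbers, here is a minimal standalone 
sketch (illustrative only, not code from this series) of the relation between 
the virtqueue size and the resulting maximum transfer size, assuming the 
typical page size of 4k:

  #include <stdio.h>

  int main(void)
  {
      const unsigned long page_size     = 4096;  /* typical PAGE_SIZE (4k)  */
      const unsigned long old_queue_max = 1024;  /* current QEMU limit      */
      const unsigned long new_queue_max = 32768; /* virtio protocol maximum */

      /* maximum transfer size = queue size (descriptors) * PAGE_SIZE */
      printf("old: %lu MiB\n", old_queue_max * page_size / (1024 * 1024)); /*   4 */
      printf("new: %lu MiB\n", new_queue_max * page_size / (1024 * 1024)); /* 128 */
      return 0;
  }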

A secondary motivation is described in reason (2) of patch 2: if the transfer 
size is configurable on the guest side (as is the case with the 9pfs 'msize' 
option), then there is the unpleasant side effect that the current virtio 
limit of 4M is invisible to the guest; this value of 4M is simply an arbitrary 
limit set on the QEMU side in the past (probably just implementation motivated 
on the QEMU side at that point), i.e. it is not a limit specified by the 
virtio protocol, nor is this limit made known to the guest via the virtio 
protocol at all. The consequence with 9pfs is that if the user tries to go 
higher than 4M, the system simply hangs with this QEMU error:

  virtio: too many write descriptors in indirect table

Whether this is an issue for an individual virtio user depends on whether that 
virtio user already enforced its own limit <= 4M on its side.
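
For a virtio user that wants to keep enforcing its own limit, such a guard is 
conceptually nothing more than the following standalone sketch (the device 
maximum of 1024 here is hypothetical; the actual devices perform comparable 
realize-time checks with error_setg(), as can be seen in the patches):

  #include <stdbool.h>
  #include <stdio.h>

  #define DEVICE_QUEUE_MAX 1024   /* hypothetical per-device maximum */

  /* Reject user-configured queue sizes the device cannot handle. */
  static bool queue_size_valid(unsigned queue_size)
  {
      if (queue_size == 0 || queue_size > DEVICE_QUEUE_MAX ||
          (queue_size & (queue_size - 1)) != 0) { /* must be a power of 2 */
          fprintf(stderr, "invalid queue size %u (power of 2, max %d)\n",
                  queue_size, DEVICE_QUEUE_MAX);
          return false;
      }
      return true;
  }

  int main(void)
  {
      printf("%d\n", queue_size_valid(1024)); /* 1: accepted              */
      printf("%d\n", queue_size_valid(2048)); /* 0: beyond device's limit */
      return 0;
  }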

Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Virtio-fs] [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
@ 2021-10-05 11:10     ` Christian Schoenebeck
  0 siblings, 0 replies; 97+ messages in thread
From: Christian Schoenebeck @ 2021-10-05 11:10 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	Jason Wang, David Hildenbrand, Amit Shah, Raphael Norwitz,
	virtio-fs, Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Marc-André Lureau, Paolo Bonzini, Fam Zheng

On Dienstag, 5. Oktober 2021 09:38:53 CEST David Hildenbrand wrote:
> On 04.10.21 21:38, Christian Schoenebeck wrote:
> > At the moment the maximum transfer size with virtio is limited to 4M
> > (1024 * PAGE_SIZE). This series raises this limit to its maximum
> > theoretical possible transfer size of 128M (32k pages) according to the
> > virtio specs:
> > 
> > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#
> > x1-240006
> I'm missing the "why do we care". Can you comment on that?

The primary motivation is the possibility of improved performance, e.g. in the 
case of 9pfs, people can raise the maximum transfer size with the Linux 9p 
client's 'msize' option on the guest side (and only on the guest side, 
actually). If the guest performs large chunk I/O, e.g. consider something 
"useful" like this one on the guest side:

  time cat large_file_on_9pfs.dat > /dev/null

Then there is a noticeable performance increase with higher transfer size 
values. The performance gain is continuous as the transfer size rises, but the 
incremental improvement obviously shrinks with larger transfer sizes as well, 
as with similar concepts in general like cache sizes, etc.

A secondary motivation is described in reason (2) of patch 2: if the transfer 
size is configurable on the guest side (as is the case with the 9pfs 'msize' 
option), then there is the unpleasant side effect that the current virtio 
limit of 4M is invisible to the guest; this value of 4M is simply an arbitrary 
limit set on the QEMU side in the past (probably just implementation motivated 
on the QEMU side at that point), i.e. it is not a limit specified by the 
virtio protocol, nor is this limit made known to the guest via the virtio 
protocol at all. The consequence with 9pfs is that if the user tries to go 
higher than 4M, the system simply hangs with this QEMU error:

  virtio: too many write descriptors in indirect table

Whether this is an issue for an individual virtio user depends on whether that 
virtio user already enforced its own limit <= 4M on its side.

Best regards,
Christian Schoenebeck



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v2 2/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
  2021-10-05  7:16     ` [Virtio-fs] " Michael S. Tsirkin
@ 2021-10-05 11:17       ` Christian Schoenebeck
  -1 siblings, 0 replies; 97+ messages in thread
From: Christian Schoenebeck @ 2021-10-05 11:17 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, Greg Kurz, Raphael Norwitz, Kevin Wolf, Hanna Reitz,
	Stefan Hajnoczi, Laurent Vivier, Amit Shah,
	Marc-André Lureau, Paolo Bonzini, Gerd Hoffmann, Jason Wang,
	Fam Zheng, Dr. David Alan Gilbert, David Hildenbrand,
	Gonglei (Arei),
	Eric Auger, qemu-block, virtio-fs

On Dienstag, 5. Oktober 2021 09:16:07 CEST Michael S. Tsirkin wrote:
> On Mon, Oct 04, 2021 at 09:38:08PM +0200, Christian Schoenebeck wrote:
> > Raise the maximum possible virtio transfer size to 128M
> > (more precisely: 32k * PAGE_SIZE). See previous commit for a
> > more detailed explanation for the reasons of this change.
> > 
> > For not breaking any virtio user, all virtio users transition
> > to using the new macro VIRTQUEUE_LEGACY_MAX_SIZE instead of
> > VIRTQUEUE_MAX_SIZE, so they are all still using the old value
> > of 1k with this commit.
> > 
> > On the long-term, each virtio user should subsequently either
> > switch from VIRTQUEUE_LEGACY_MAX_SIZE to VIRTQUEUE_MAX_SIZE
> > after checking that they support the new value of 32k, or
> > otherwise they should replace the VIRTQUEUE_LEGACY_MAX_SIZE
> > macro by an appropriate value supported by them.
> > 
> > Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
> 
> I don't think we need this. Legacy isn't descriptive either.  Just leave
> VIRTQUEUE_MAX_SIZE alone, and come up with a new name for 32k.

Does this mean you disagree that, in the long term, all virtio users should 
either transition to the new upper limit of 32k max queue size or introduce 
their own limit on their end?

Independent of the name (and I would appreciate suggestions for an adequate 
macro name here), I still think this new limit should be placed in the shared 
virtio.h file, because this value is not something invented on the virtio user 
side. It rather reflects the theoretical upper limit possible with the virtio 
protocol, which is and will remain common to all virtio users.
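
To illustrate the intended split with a standalone sketch (the "virtio-foo" 
device and its choice are hypothetical; the macro names and values are the 
ones proposed by this series):

  #include <assert.h>
  #include <stdio.h>

  #define VIRTQUEUE_MAX_SIZE        32768 /* virtio protocol maximum        */
  #define VIRTQUEUE_LEGACY_MAX_SIZE 1024  /* transitional old QEMU-wide cap */

  /*
   * A device that verified it copes with 32k descriptors would pass
   * VIRTQUEUE_MAX_SIZE to virtio_init(); the hypothetical "virtio-foo"
   * below has not, so it sticks to the legacy value (or its own limit).
   */
  #define VIRTIO_FOO_QUEUE_MAX VIRTQUEUE_LEGACY_MAX_SIZE

  int main(void)
  {
      unsigned queue_max = VIRTIO_FOO_QUEUE_MAX;

      /* whatever a device picks must never exceed the protocol maximum */
      assert(queue_max <= VIRTQUEUE_MAX_SIZE);
      printf("virtio-foo: max %u descriptors -> %u MiB at 4k pages\n",
             queue_max, queue_max * 4096u / (1024u * 1024u));
      return 0;
  }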

> > ---
> > 
> >  hw/9pfs/virtio-9p-device.c     |  2 +-
> >  hw/block/vhost-user-blk.c      |  6 +++---
> >  hw/block/virtio-blk.c          |  6 +++---
> >  hw/char/virtio-serial-bus.c    |  2 +-
> >  hw/input/virtio-input.c        |  2 +-
> >  hw/net/virtio-net.c            | 12 ++++++------
> >  hw/scsi/virtio-scsi.c          |  2 +-
> >  hw/virtio/vhost-user-fs.c      |  6 +++---
> >  hw/virtio/vhost-user-i2c.c     |  2 +-
> >  hw/virtio/vhost-vsock-common.c |  2 +-
> >  hw/virtio/virtio-balloon.c     |  2 +-
> >  hw/virtio/virtio-crypto.c      |  2 +-
> >  hw/virtio/virtio-iommu.c       |  2 +-
> >  hw/virtio/virtio-mem.c         |  2 +-
> >  hw/virtio/virtio-mmio.c        |  4 ++--
> >  hw/virtio/virtio-pmem.c        |  2 +-
> >  hw/virtio/virtio-rng.c         |  3 ++-
> >  include/hw/virtio/virtio.h     | 20 +++++++++++++++++++-
> >  18 files changed, 49 insertions(+), 30 deletions(-)
> > 
> > diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
> > index cd5d95dd51..9013e7df6e 100644
> > --- a/hw/9pfs/virtio-9p-device.c
> > +++ b/hw/9pfs/virtio-9p-device.c
> > @@ -217,7 +217,7 @@ static void virtio_9p_device_realize(DeviceState *dev,
> > Error **errp)> 
> >      v->config_size = sizeof(struct virtio_9p_config) +
> >      strlen(s->fsconf.tag);
> >      virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
> > 
> > -                VIRTQUEUE_MAX_SIZE);
> > +                VIRTQUEUE_LEGACY_MAX_SIZE);
> > 
> >      v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
> >  
> >  }
> > 
> > diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
> > index 336f56705c..e5e45262ab 100644
> > --- a/hw/block/vhost-user-blk.c
> > +++ b/hw/block/vhost-user-blk.c
> > @@ -480,9 +480,9 @@ static void vhost_user_blk_device_realize(DeviceState
> > *dev, Error **errp)> 
> >          error_setg(errp, "queue size must be non-zero");
> >          return;
> >      
> >      }
> > 
> > -    if (s->queue_size > VIRTQUEUE_MAX_SIZE) {
> > +    if (s->queue_size > VIRTQUEUE_LEGACY_MAX_SIZE) {
> > 
> >          error_setg(errp, "queue size must not exceed %d",
> > 
> > -                   VIRTQUEUE_MAX_SIZE);
> > +                   VIRTQUEUE_LEGACY_MAX_SIZE);
> > 
> >          return;
> >      
> >      }
> > 
> > @@ -491,7 +491,7 @@ static void vhost_user_blk_device_realize(DeviceState
> > *dev, Error **errp)> 
> >      }
> >      
> >      virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK,
> > 
> > -                sizeof(struct virtio_blk_config), VIRTQUEUE_MAX_SIZE);
> > +                sizeof(struct virtio_blk_config),
> > VIRTQUEUE_LEGACY_MAX_SIZE);> 
> >      s->virtqs = g_new(VirtQueue *, s->num_queues);
> >      for (i = 0; i < s->num_queues; i++) {
> > 
> > diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
> > index 9c0f46815c..5883e3e7db 100644
> > --- a/hw/block/virtio-blk.c
> > +++ b/hw/block/virtio-blk.c
> > @@ -1171,10 +1171,10 @@ static void virtio_blk_device_realize(DeviceState
> > *dev, Error **errp)> 
> >          return;
> >      
> >      }
> >      if (!is_power_of_2(conf->queue_size) ||
> > 
> > -        conf->queue_size > VIRTQUEUE_MAX_SIZE) {
> > +        conf->queue_size > VIRTQUEUE_LEGACY_MAX_SIZE) {
> > 
> >          error_setg(errp, "invalid queue-size property (%" PRIu16 "), "
> >          
> >                     "must be a power of 2 (max %d)",
> > 
> > -                   conf->queue_size, VIRTQUEUE_MAX_SIZE);
> > +                   conf->queue_size, VIRTQUEUE_LEGACY_MAX_SIZE);
> > 
> >          return;
> >      
> >      }
> > 
> > @@ -1214,7 +1214,7 @@ static void virtio_blk_device_realize(DeviceState
> > *dev, Error **errp)> 
> >      virtio_blk_set_config_size(s, s->host_features);
> >      
> >      virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK, s->config_size,
> > 
> > -                VIRTQUEUE_MAX_SIZE);
> > +                VIRTQUEUE_LEGACY_MAX_SIZE);
> > 
> >      s->blk = conf->conf.blk;
> >      s->rq = NULL;
> > 
> > diff --git a/hw/char/virtio-serial-bus.c b/hw/char/virtio-serial-bus.c
> > index 9ad9111115..2d4285ab53 100644
> > --- a/hw/char/virtio-serial-bus.c
> > +++ b/hw/char/virtio-serial-bus.c
> > @@ -1045,7 +1045,7 @@ static void virtio_serial_device_realize(DeviceState
> > *dev, Error **errp)> 
> >          config_size = offsetof(struct virtio_console_config, emerg_wr);
> >      
> >      }
> >      virtio_init(vdev, "virtio-serial", VIRTIO_ID_CONSOLE,
> > 
> > -                config_size, VIRTQUEUE_MAX_SIZE);
> > +                config_size, VIRTQUEUE_LEGACY_MAX_SIZE);
> > 
> >      /* Spawn a new virtio-serial bus on which the ports will ride as
> >      devices */
> >      qbus_init(&vser->bus, sizeof(vser->bus), TYPE_VIRTIO_SERIAL_BUS,
> > 
> > diff --git a/hw/input/virtio-input.c b/hw/input/virtio-input.c
> > index 345eb2cce7..b6b77488f2 100644
> > --- a/hw/input/virtio-input.c
> > +++ b/hw/input/virtio-input.c
> > @@ -258,7 +258,7 @@ static void virtio_input_device_realize(DeviceState
> > *dev, Error **errp)> 
> >      assert(vinput->cfg_size <= sizeof(virtio_input_config));
> >      
> >      virtio_init(vdev, "virtio-input", VIRTIO_ID_INPUT,
> > 
> > -                vinput->cfg_size, VIRTQUEUE_MAX_SIZE);
> > +                vinput->cfg_size, VIRTQUEUE_LEGACY_MAX_SIZE);
> > 
> >      vinput->evt = virtio_add_queue(vdev, 64, virtio_input_handle_evt);
> >      vinput->sts = virtio_add_queue(vdev, 64, virtio_input_handle_sts);
> >  
> >  }
> > 
> > diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> > index f74b5f6268..5100978b07 100644
> > --- a/hw/net/virtio-net.c
> > +++ b/hw/net/virtio-net.c
> > @@ -636,7 +636,7 @@ static int virtio_net_max_tx_queue_size(VirtIONet *n)
> > 
> >          return VIRTIO_NET_TX_QUEUE_DEFAULT_SIZE;
> >      
> >      }
> > 
> > -    return VIRTQUEUE_MAX_SIZE;
> > +    return VIRTQUEUE_LEGACY_MAX_SIZE;
> > 
> >  }
> >  
> >  static int peer_attach(VirtIONet *n, int index)
> > 
> > @@ -3365,7 +3365,7 @@ static void virtio_net_device_realize(DeviceState
> > *dev, Error **errp)> 
> >      virtio_net_set_config_size(n, n->host_features);
> >      virtio_init(vdev, "virtio-net", VIRTIO_ID_NET, n->config_size,
> > 
> > -                VIRTQUEUE_MAX_SIZE);
> > +                VIRTQUEUE_LEGACY_MAX_SIZE);
> > 
> >      /*
> >      
> >       * We set a lower limit on RX queue size to what it always was.
> > 
> > @@ -3373,23 +3373,23 @@ static void virtio_net_device_realize(DeviceState
> > *dev, Error **errp)> 
> >       * help from us (using virtio 1 and up).
> >       */
> >      
> >      if (n->net_conf.rx_queue_size < VIRTIO_NET_RX_QUEUE_MIN_SIZE ||
> > 
> > -        n->net_conf.rx_queue_size > VIRTQUEUE_MAX_SIZE ||
> > +        n->net_conf.rx_queue_size > VIRTQUEUE_LEGACY_MAX_SIZE ||
> > 
> >          !is_power_of_2(n->net_conf.rx_queue_size)) {
> >          error_setg(errp, "Invalid rx_queue_size (= %" PRIu16 "), "
> >          
> >                     "must be a power of 2 between %d and %d.",
> >                     n->net_conf.rx_queue_size,
> >                     VIRTIO_NET_RX_QUEUE_MIN_SIZE,
> > 
> > -                   VIRTQUEUE_MAX_SIZE);
> > +                   VIRTQUEUE_LEGACY_MAX_SIZE);
> > 
> >          virtio_cleanup(vdev);
> >          return;
> >      
> >      }
> >      
> >      if (n->net_conf.tx_queue_size < VIRTIO_NET_TX_QUEUE_MIN_SIZE ||
> > 
> > -        n->net_conf.tx_queue_size > VIRTQUEUE_MAX_SIZE ||
> > +        n->net_conf.tx_queue_size > VIRTQUEUE_LEGACY_MAX_SIZE ||
> > 
> >          !is_power_of_2(n->net_conf.tx_queue_size)) {
> >          error_setg(errp, "Invalid tx_queue_size (= %" PRIu16 "), "
> >          
> >                     "must be a power of 2 between %d and %d",
> >                     n->net_conf.tx_queue_size,
> >                     VIRTIO_NET_TX_QUEUE_MIN_SIZE,
> > 
> > -                   VIRTQUEUE_MAX_SIZE);
> > +                   VIRTQUEUE_LEGACY_MAX_SIZE);
> > 
> >          virtio_cleanup(vdev);
> >          return;
> >      
> >      }
> > 
> > diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
> > index 5e5e657e1d..f204e8878a 100644
> > --- a/hw/scsi/virtio-scsi.c
> > +++ b/hw/scsi/virtio-scsi.c
> > @@ -973,7 +973,7 @@ void virtio_scsi_common_realize(DeviceState *dev,
> > 
> >      int i;
> >      
> >      virtio_init(vdev, "virtio-scsi", VIRTIO_ID_SCSI,
> > 
> > -                sizeof(VirtIOSCSIConfig), VIRTQUEUE_MAX_SIZE);
> > +                sizeof(VirtIOSCSIConfig), VIRTQUEUE_LEGACY_MAX_SIZE);
> > 
> >      if (s->conf.num_queues == VIRTIO_SCSI_AUTO_NUM_QUEUES) {
> >      
> >          s->conf.num_queues = 1;
> > 
> > diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
> > index ae1672d667..decc5def39 100644
> > --- a/hw/virtio/vhost-user-fs.c
> > +++ b/hw/virtio/vhost-user-fs.c
> > @@ -209,9 +209,9 @@ static void vuf_device_realize(DeviceState *dev, Error
> > **errp)> 
> >          return;
> >      
> >      }
> > 
> > -    if (fs->conf.queue_size > VIRTQUEUE_MAX_SIZE) {
> > +    if (fs->conf.queue_size > VIRTQUEUE_LEGACY_MAX_SIZE) {
> > 
> >          error_setg(errp, "queue-size property must be %u or smaller",
> > 
> > -                   VIRTQUEUE_MAX_SIZE);
> > +                   VIRTQUEUE_LEGACY_MAX_SIZE);
> > 
> >          return;
> >      
> >      }
> > 
> > @@ -220,7 +220,7 @@ static void vuf_device_realize(DeviceState *dev, Error
> > **errp)> 
> >      }
> >      
> >      virtio_init(vdev, "vhost-user-fs", VIRTIO_ID_FS,
> > 
> > -                sizeof(struct virtio_fs_config), VIRTQUEUE_MAX_SIZE);
> > +                sizeof(struct virtio_fs_config),
> > VIRTQUEUE_LEGACY_MAX_SIZE);> 
> >      /* Hiprio queue */
> >      fs->hiprio_vq = virtio_add_queue(vdev, fs->conf.queue_size,
> >      vuf_handle_output);> 
> > diff --git a/hw/virtio/vhost-user-i2c.c b/hw/virtio/vhost-user-i2c.c
> > index eeb1d8853a..b248ddbe93 100644
> > --- a/hw/virtio/vhost-user-i2c.c
> > +++ b/hw/virtio/vhost-user-i2c.c
> > @@ -221,7 +221,7 @@ static void vu_i2c_device_realize(DeviceState *dev,
> > Error **errp)> 
> >      }
> >      
> >      virtio_init(vdev, "vhost-user-i2c", VIRTIO_ID_I2C_ADAPTER, 0,
> > 
> > -                VIRTQUEUE_MAX_SIZE);
> > +                VIRTQUEUE_LEGACY_MAX_SIZE);
> > 
> >      i2c->vhost_dev.nvqs = 1;
> >      i2c->vq = virtio_add_queue(vdev, 4, vu_i2c_handle_output);
> > 
> > diff --git a/hw/virtio/vhost-vsock-common.c
> > b/hw/virtio/vhost-vsock-common.c index a81fa884a8..73e6b72bba 100644
> > --- a/hw/virtio/vhost-vsock-common.c
> > +++ b/hw/virtio/vhost-vsock-common.c
> > @@ -201,7 +201,7 @@ void vhost_vsock_common_realize(VirtIODevice *vdev,
> > const char *name)> 
> >      VHostVSockCommon *vvc = VHOST_VSOCK_COMMON(vdev);
> >      
> >      virtio_init(vdev, name, VIRTIO_ID_VSOCK,
> > 
> > -                sizeof(struct virtio_vsock_config), VIRTQUEUE_MAX_SIZE);
> > +                sizeof(struct virtio_vsock_config),
> > VIRTQUEUE_LEGACY_MAX_SIZE);> 
> >      /* Receive and transmit queues belong to vhost */
> >      vvc->recv_vq = virtio_add_queue(vdev, VHOST_VSOCK_QUEUE_SIZE,
> > 
> > diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> > index 067c73223d..890fb15ed3 100644
> > --- a/hw/virtio/virtio-balloon.c
> > +++ b/hw/virtio/virtio-balloon.c
> > @@ -886,7 +886,7 @@ static void virtio_balloon_device_realize(DeviceState
> > *dev, Error **errp)> 
> >      int ret;
> >      
> >      virtio_init(vdev, "virtio-balloon", VIRTIO_ID_BALLOON,
> > 
> > -                virtio_balloon_config_size(s), VIRTQUEUE_MAX_SIZE);
> > +                virtio_balloon_config_size(s),
> > VIRTQUEUE_LEGACY_MAX_SIZE);
> > 
> >      ret = qemu_add_balloon_handler(virtio_balloon_to_target,
> >      
> >                                     virtio_balloon_stat, s);
> > 
> > diff --git a/hw/virtio/virtio-crypto.c b/hw/virtio/virtio-crypto.c
> > index 1e70d4d2a8..e13b6091d6 100644
> > --- a/hw/virtio/virtio-crypto.c
> > +++ b/hw/virtio/virtio-crypto.c
> > @@ -811,7 +811,7 @@ static void virtio_crypto_device_realize(DeviceState
> > *dev, Error **errp)> 
> >      }
> >      
> >      virtio_init(vdev, "virtio-crypto", VIRTIO_ID_CRYPTO,
> >      vcrypto->config_size,
> > 
> > -                VIRTQUEUE_MAX_SIZE);
> > +                VIRTQUEUE_LEGACY_MAX_SIZE);
> > 
> >      vcrypto->curr_queues = 1;
> >      vcrypto->vqs = g_malloc0(sizeof(VirtIOCryptoQueue) *
> >      vcrypto->max_queues);
> >      for (i = 0; i < vcrypto->max_queues; i++) {
> > 
> > diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
> > index ca360e74eb..845df78842 100644
> > --- a/hw/virtio/virtio-iommu.c
> > +++ b/hw/virtio/virtio-iommu.c
> > @@ -974,7 +974,7 @@ static void virtio_iommu_device_realize(DeviceState
> > *dev, Error **errp)> 
> >      VirtIOIOMMU *s = VIRTIO_IOMMU(dev);
> >      
> >      virtio_init(vdev, "virtio-iommu", VIRTIO_ID_IOMMU,
> > 
> > -                sizeof(struct virtio_iommu_config), VIRTQUEUE_MAX_SIZE);
> > +                sizeof(struct virtio_iommu_config),
> > VIRTQUEUE_LEGACY_MAX_SIZE);> 
> >      memset(s->iommu_pcibus_by_bus_num, 0,
> >      sizeof(s->iommu_pcibus_by_bus_num));
> > 
> > diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
> > index 1d9d01b871..7a39550cde 100644
> > --- a/hw/virtio/virtio-mem.c
> > +++ b/hw/virtio/virtio-mem.c
> > @@ -738,7 +738,7 @@ static void virtio_mem_device_realize(DeviceState
> > *dev, Error **errp)> 
> >      vmem->bitmap = bitmap_new(vmem->bitmap_size);
> >      
> >      virtio_init(vdev, TYPE_VIRTIO_MEM, VIRTIO_ID_MEM,
> > 
> > -                sizeof(struct virtio_mem_config), VIRTQUEUE_MAX_SIZE);
> > +                sizeof(struct virtio_mem_config),
> > VIRTQUEUE_LEGACY_MAX_SIZE);> 
> >      vmem->vq = virtio_add_queue(vdev, 128, virtio_mem_handle_request);
> >      
> >      host_memory_backend_set_mapped(vmem->memdev, true);
> > 
> > diff --git a/hw/virtio/virtio-mmio.c b/hw/virtio/virtio-mmio.c
> > index 7b3ebca178..ae0cc223e9 100644
> > --- a/hw/virtio/virtio-mmio.c
> > +++ b/hw/virtio/virtio-mmio.c
> > @@ -174,7 +174,7 @@ static uint64_t virtio_mmio_read(void *opaque, hwaddr
> > offset, unsigned size)> 
> >          if (!virtio_queue_get_num(vdev, vdev->queue_sel)) {
> >          
> >              return 0;
> >          
> >          }
> > 
> > -        return VIRTQUEUE_MAX_SIZE;
> > +        return VIRTQUEUE_LEGACY_MAX_SIZE;
> > 
> >      case VIRTIO_MMIO_QUEUE_PFN:
> >          if (!proxy->legacy) {
> >          
> >              qemu_log_mask(LOG_GUEST_ERROR,
> > 
> > @@ -348,7 +348,7 @@ static void virtio_mmio_write(void *opaque, hwaddr
> > offset, uint64_t value,> 
> >          }
> >          break;
> >      
> >      case VIRTIO_MMIO_QUEUE_NUM:
> > -        trace_virtio_mmio_queue_write(value, VIRTQUEUE_MAX_SIZE);
> > +        trace_virtio_mmio_queue_write(value, VIRTQUEUE_LEGACY_MAX_SIZE);
> > 
> >          virtio_queue_set_num(vdev, vdev->queue_sel, value);
> >          
> >          if (proxy->legacy) {
> > 
> > diff --git a/hw/virtio/virtio-pmem.c b/hw/virtio/virtio-pmem.c
> > index 82b54b00c5..5f4d375b58 100644
> > --- a/hw/virtio/virtio-pmem.c
> > +++ b/hw/virtio/virtio-pmem.c
> > @@ -124,7 +124,7 @@ static void virtio_pmem_realize(DeviceState *dev,
> > Error **errp)> 
> >      host_memory_backend_set_mapped(pmem->memdev, true);
> >      virtio_init(vdev, TYPE_VIRTIO_PMEM, VIRTIO_ID_PMEM,
> > 
> > -                sizeof(struct virtio_pmem_config), VIRTQUEUE_MAX_SIZE);
> > +                sizeof(struct virtio_pmem_config),
> > VIRTQUEUE_LEGACY_MAX_SIZE);> 
> >      pmem->rq_vq = virtio_add_queue(vdev, 128, virtio_pmem_flush);
> >  
> >  }
> > 
> > diff --git a/hw/virtio/virtio-rng.c b/hw/virtio/virtio-rng.c
> > index 0e91d60106..ab075b22b6 100644
> > --- a/hw/virtio/virtio-rng.c
> > +++ b/hw/virtio/virtio-rng.c
> > @@ -215,7 +215,8 @@ static void virtio_rng_device_realize(DeviceState
> > *dev, Error **errp)> 
> >          return;
> >      
> >      }
> > 
> > -    virtio_init(vdev, "virtio-rng", VIRTIO_ID_RNG, 0,
> > VIRTQUEUE_MAX_SIZE);
> > +    virtio_init(vdev, "virtio-rng", VIRTIO_ID_RNG, 0,
> > +                VIRTQUEUE_LEGACY_MAX_SIZE);
> > 
> >      vrng->vq = virtio_add_queue(vdev, 8, handle_input);
> >      vrng->quota_remaining = vrng->conf.max_bytes;
> > 
> > diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> > index a37d1f7d52..fe0f13266b 100644
> > --- a/include/hw/virtio/virtio.h
> > +++ b/include/hw/virtio/virtio.h
> > @@ -48,7 +48,25 @@ size_t virtio_feature_get_config_size(const VirtIOFeature *features,
> > 
> >  typedef struct VirtQueue VirtQueue;
> > 
> > -#define VIRTQUEUE_MAX_SIZE 1024
> > +/*
> > + * This is meant as transitional measure for VIRTQUEUE_MAX_SIZE's old value
> > + * of 1024 to its new value of 32768. On the long-term virtio users should
> > + * either switch to VIRTQUEUE_MAX_SIZE, provided they support 32768,
> > + * otherwise they should replace this macro on their side with an
> > + * appropriate value actually supported by them.
> > + *
> > + * Once all virtio users switched, this macro will be removed.
> > + */
> > +#define VIRTQUEUE_LEGACY_MAX_SIZE 1024
> > +
> > +/*
> > + * Reflects the absolute theoretical maximum queue size (in amount of pages)
> > + * ever possible, which is actually the maximum queue size allowed by the
> > + * virtio protocol. This value therefore construes the maximum transfer size
> > + * possible with virtio (multiplied by system dependent PAGE_SIZE); assuming
> > + * a typical page size of 4k this would be a maximum transfer size of 128M.
> > + */
> > +#define VIRTQUEUE_MAX_SIZE 32768
> > 
> >  typedef struct VirtQueueElement
> >  {




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Virtio-fs] [PATCH v2 2/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
@ 2021-10-05 11:17       ` Christian Schoenebeck
  0 siblings, 0 replies; 97+ messages in thread
From: Christian Schoenebeck @ 2021-10-05 11:17 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Amit Shah, Jason Wang,
	David Hildenbrand, qemu-devel, Raphael Norwitz, virtio-fs,
	Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Paolo Bonzini, Marc-André Lureau, Fam Zheng

On Tuesday, 5 October 2021 09:16:07 CEST Michael S. Tsirkin wrote:
> On Mon, Oct 04, 2021 at 09:38:08PM +0200, Christian Schoenebeck wrote:
> > Raise the maximum possible virtio transfer size to 128M
> > (more precisely: 32k * PAGE_SIZE). See previous commit for a
> > more detailed explanation of the reasons for this change.
> > 
> > To avoid breaking any virtio user, all virtio users transition
> > to using the new macro VIRTQUEUE_LEGACY_MAX_SIZE instead of
> > VIRTQUEUE_MAX_SIZE, so they are all still using the old value
> > of 1k with this commit.
> > 
> > In the long term, each virtio user should subsequently either
> > switch from VIRTQUEUE_LEGACY_MAX_SIZE to VIRTQUEUE_MAX_SIZE
> > after checking that they support the new value of 32k, or
> > replace the VIRTQUEUE_LEGACY_MAX_SIZE macro with an
> > appropriate value supported by them.
> > 
> > Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
> 
> I don't think we need this. Legacy isn't descriptive either.  Just leave
> VIRTQUEUE_MAX_SIZE alone, and come up with a new name for 32k.

Does this mean you disagree that in the long term all virtio users should
either transition to the new upper limit of 32k max queue size or introduce
their own limit on their end?

Independent of the name (and I would appreciate suggestions for an adequate
macro name here), I still think this new limit should be placed in the shared
virtio.h file, because this value is not something invented on the virtio user
side. It rather reflects the theoretical upper limit possible with the virtio
protocol, which is and will be common for all virtio users.
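
Just to illustrate the intended transition path (a minimal sketch only, based
on the 5-argument virtio_init() signature introduced by patch 1 of this
series; 9p serves purely as an example of a device that has verified it can
handle 32k descriptors):

    /* transitional: keep the conservative old limit of 1024 descriptors */
    virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
                VIRTQUEUE_LEGACY_MAX_SIZE);

    /* opt-in: use the full 32768 descriptors allowed by the virtio spec */
    virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
                VIRTQUEUE_MAX_SIZE);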

> > ---
> > 
> >  hw/9pfs/virtio-9p-device.c     |  2 +-
> >  hw/block/vhost-user-blk.c      |  6 +++---
> >  hw/block/virtio-blk.c          |  6 +++---
> >  hw/char/virtio-serial-bus.c    |  2 +-
> >  hw/input/virtio-input.c        |  2 +-
> >  hw/net/virtio-net.c            | 12 ++++++------
> >  hw/scsi/virtio-scsi.c          |  2 +-
> >  hw/virtio/vhost-user-fs.c      |  6 +++---
> >  hw/virtio/vhost-user-i2c.c     |  2 +-
> >  hw/virtio/vhost-vsock-common.c |  2 +-
> >  hw/virtio/virtio-balloon.c     |  2 +-
> >  hw/virtio/virtio-crypto.c      |  2 +-
> >  hw/virtio/virtio-iommu.c       |  2 +-
> >  hw/virtio/virtio-mem.c         |  2 +-
> >  hw/virtio/virtio-mmio.c        |  4 ++--
> >  hw/virtio/virtio-pmem.c        |  2 +-
> >  hw/virtio/virtio-rng.c         |  3 ++-
> >  include/hw/virtio/virtio.h     | 20 +++++++++++++++++++-
> >  18 files changed, 49 insertions(+), 30 deletions(-)
> > 
> > diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
> > index cd5d95dd51..9013e7df6e 100644
> > --- a/hw/9pfs/virtio-9p-device.c
> > +++ b/hw/9pfs/virtio-9p-device.c
> > @@ -217,7 +217,7 @@ static void virtio_9p_device_realize(DeviceState *dev,
> > Error **errp)> 
> >      v->config_size = sizeof(struct virtio_9p_config) +
> >      strlen(s->fsconf.tag);
> >      virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
> > 
> > -                VIRTQUEUE_MAX_SIZE);
> > +                VIRTQUEUE_LEGACY_MAX_SIZE);
> > 
> >      v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
> >  
> >  }
> > 
> > diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
> > index 336f56705c..e5e45262ab 100644
> > --- a/hw/block/vhost-user-blk.c
> > +++ b/hw/block/vhost-user-blk.c
> > @@ -480,9 +480,9 @@ static void vhost_user_blk_device_realize(DeviceState
> > *dev, Error **errp)> 
> >          error_setg(errp, "queue size must be non-zero");
> >          return;
> >      
> >      }
> > 
> > -    if (s->queue_size > VIRTQUEUE_MAX_SIZE) {
> > +    if (s->queue_size > VIRTQUEUE_LEGACY_MAX_SIZE) {
> > 
> >          error_setg(errp, "queue size must not exceed %d",
> > 
> > -                   VIRTQUEUE_MAX_SIZE);
> > +                   VIRTQUEUE_LEGACY_MAX_SIZE);
> > 
> >          return;
> >      
> >      }
> > 
> > @@ -491,7 +491,7 @@ static void vhost_user_blk_device_realize(DeviceState
> > *dev, Error **errp)> 
> >      }
> >      
> >      virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK,
> > 
> > -                sizeof(struct virtio_blk_config), VIRTQUEUE_MAX_SIZE);
> > +                sizeof(struct virtio_blk_config),
> > VIRTQUEUE_LEGACY_MAX_SIZE);> 
> >      s->virtqs = g_new(VirtQueue *, s->num_queues);
> >      for (i = 0; i < s->num_queues; i++) {
> > 
> > diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
> > index 9c0f46815c..5883e3e7db 100644
> > --- a/hw/block/virtio-blk.c
> > +++ b/hw/block/virtio-blk.c
> > @@ -1171,10 +1171,10 @@ static void virtio_blk_device_realize(DeviceState
> > *dev, Error **errp)> 
> >          return;
> >      
> >      }
> >      if (!is_power_of_2(conf->queue_size) ||
> > 
> > -        conf->queue_size > VIRTQUEUE_MAX_SIZE) {
> > +        conf->queue_size > VIRTQUEUE_LEGACY_MAX_SIZE) {
> > 
> >          error_setg(errp, "invalid queue-size property (%" PRIu16 "), "
> >          
> >                     "must be a power of 2 (max %d)",
> > 
> > -                   conf->queue_size, VIRTQUEUE_MAX_SIZE);
> > +                   conf->queue_size, VIRTQUEUE_LEGACY_MAX_SIZE);
> > 
> >          return;
> >      
> >      }
> > 
> > @@ -1214,7 +1214,7 @@ static void virtio_blk_device_realize(DeviceState
> > *dev, Error **errp)> 
> >      virtio_blk_set_config_size(s, s->host_features);
> >      
> >      virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK, s->config_size,
> > 
> > -                VIRTQUEUE_MAX_SIZE);
> > +                VIRTQUEUE_LEGACY_MAX_SIZE);
> > 
> >      s->blk = conf->conf.blk;
> >      s->rq = NULL;
> > 
> > diff --git a/hw/char/virtio-serial-bus.c b/hw/char/virtio-serial-bus.c
> > index 9ad9111115..2d4285ab53 100644
> > --- a/hw/char/virtio-serial-bus.c
> > +++ b/hw/char/virtio-serial-bus.c
> > @@ -1045,7 +1045,7 @@ static void virtio_serial_device_realize(DeviceState
> > *dev, Error **errp)> 
> >          config_size = offsetof(struct virtio_console_config, emerg_wr);
> >      
> >      }
> >      virtio_init(vdev, "virtio-serial", VIRTIO_ID_CONSOLE,
> > 
> > -                config_size, VIRTQUEUE_MAX_SIZE);
> > +                config_size, VIRTQUEUE_LEGACY_MAX_SIZE);
> > 
> >      /* Spawn a new virtio-serial bus on which the ports will ride as
> >      devices */
> >      qbus_init(&vser->bus, sizeof(vser->bus), TYPE_VIRTIO_SERIAL_BUS,
> > 
> > diff --git a/hw/input/virtio-input.c b/hw/input/virtio-input.c
> > index 345eb2cce7..b6b77488f2 100644
> > --- a/hw/input/virtio-input.c
> > +++ b/hw/input/virtio-input.c
> > @@ -258,7 +258,7 @@ static void virtio_input_device_realize(DeviceState
> > *dev, Error **errp)> 
> >      assert(vinput->cfg_size <= sizeof(virtio_input_config));
> >      
> >      virtio_init(vdev, "virtio-input", VIRTIO_ID_INPUT,
> > 
> > -                vinput->cfg_size, VIRTQUEUE_MAX_SIZE);
> > +                vinput->cfg_size, VIRTQUEUE_LEGACY_MAX_SIZE);
> > 
> >      vinput->evt = virtio_add_queue(vdev, 64, virtio_input_handle_evt);
> >      vinput->sts = virtio_add_queue(vdev, 64, virtio_input_handle_sts);
> >  
> >  }
> > 
> > diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> > index f74b5f6268..5100978b07 100644
> > --- a/hw/net/virtio-net.c
> > +++ b/hw/net/virtio-net.c
> > @@ -636,7 +636,7 @@ static int virtio_net_max_tx_queue_size(VirtIONet *n)
> > 
> >          return VIRTIO_NET_TX_QUEUE_DEFAULT_SIZE;
> >      
> >      }
> > 
> > -    return VIRTQUEUE_MAX_SIZE;
> > +    return VIRTQUEUE_LEGACY_MAX_SIZE;
> > 
> >  }
> >  
> >  static int peer_attach(VirtIONet *n, int index)
> > 
> > @@ -3365,7 +3365,7 @@ static void virtio_net_device_realize(DeviceState
> > *dev, Error **errp)> 
> >      virtio_net_set_config_size(n, n->host_features);
> >      virtio_init(vdev, "virtio-net", VIRTIO_ID_NET, n->config_size,
> > 
> > -                VIRTQUEUE_MAX_SIZE);
> > +                VIRTQUEUE_LEGACY_MAX_SIZE);
> > 
> >      /*
> >      
> >       * We set a lower limit on RX queue size to what it always was.
> > 
> > @@ -3373,23 +3373,23 @@ static void virtio_net_device_realize(DeviceState
> > *dev, Error **errp)> 
> >       * help from us (using virtio 1 and up).
> >       */
> >      
> >      if (n->net_conf.rx_queue_size < VIRTIO_NET_RX_QUEUE_MIN_SIZE ||
> > 
> > -        n->net_conf.rx_queue_size > VIRTQUEUE_MAX_SIZE ||
> > +        n->net_conf.rx_queue_size > VIRTQUEUE_LEGACY_MAX_SIZE ||
> > 
> >          !is_power_of_2(n->net_conf.rx_queue_size)) {
> >          error_setg(errp, "Invalid rx_queue_size (= %" PRIu16 "), "
> >          
> >                     "must be a power of 2 between %d and %d.",
> >                     n->net_conf.rx_queue_size,
> >                     VIRTIO_NET_RX_QUEUE_MIN_SIZE,
> > 
> > -                   VIRTQUEUE_MAX_SIZE);
> > +                   VIRTQUEUE_LEGACY_MAX_SIZE);
> > 
> >          virtio_cleanup(vdev);
> >          return;
> >      
> >      }
> >      
> >      if (n->net_conf.tx_queue_size < VIRTIO_NET_TX_QUEUE_MIN_SIZE ||
> > 
> > -        n->net_conf.tx_queue_size > VIRTQUEUE_MAX_SIZE ||
> > +        n->net_conf.tx_queue_size > VIRTQUEUE_LEGACY_MAX_SIZE ||
> > 
> >          !is_power_of_2(n->net_conf.tx_queue_size)) {
> >          error_setg(errp, "Invalid tx_queue_size (= %" PRIu16 "), "
> >          
> >                     "must be a power of 2 between %d and %d",
> >                     n->net_conf.tx_queue_size,
> >                     VIRTIO_NET_TX_QUEUE_MIN_SIZE,
> > 
> > -                   VIRTQUEUE_MAX_SIZE);
> > +                   VIRTQUEUE_LEGACY_MAX_SIZE);
> > 
> >          virtio_cleanup(vdev);
> >          return;
> >      
> >      }
> > 
> > diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
> > index 5e5e657e1d..f204e8878a 100644
> > --- a/hw/scsi/virtio-scsi.c
> > +++ b/hw/scsi/virtio-scsi.c
> > @@ -973,7 +973,7 @@ void virtio_scsi_common_realize(DeviceState *dev,
> > 
> >      int i;
> >      
> >      virtio_init(vdev, "virtio-scsi", VIRTIO_ID_SCSI,
> > 
> > -                sizeof(VirtIOSCSIConfig), VIRTQUEUE_MAX_SIZE);
> > +                sizeof(VirtIOSCSIConfig), VIRTQUEUE_LEGACY_MAX_SIZE);
> > 
> >      if (s->conf.num_queues == VIRTIO_SCSI_AUTO_NUM_QUEUES) {
> >      
> >          s->conf.num_queues = 1;
> > 
> > diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
> > index ae1672d667..decc5def39 100644
> > --- a/hw/virtio/vhost-user-fs.c
> > +++ b/hw/virtio/vhost-user-fs.c
> > @@ -209,9 +209,9 @@ static void vuf_device_realize(DeviceState *dev, Error
> > **errp)> 
> >          return;
> >      
> >      }
> > 
> > -    if (fs->conf.queue_size > VIRTQUEUE_MAX_SIZE) {
> > +    if (fs->conf.queue_size > VIRTQUEUE_LEGACY_MAX_SIZE) {
> > 
> >          error_setg(errp, "queue-size property must be %u or smaller",
> > 
> > -                   VIRTQUEUE_MAX_SIZE);
> > +                   VIRTQUEUE_LEGACY_MAX_SIZE);
> > 
> >          return;
> >      
> >      }
> > 
> > @@ -220,7 +220,7 @@ static void vuf_device_realize(DeviceState *dev, Error
> > **errp)> 
> >      }
> >      
> >      virtio_init(vdev, "vhost-user-fs", VIRTIO_ID_FS,
> > 
> > -                sizeof(struct virtio_fs_config), VIRTQUEUE_MAX_SIZE);
> > +                sizeof(struct virtio_fs_config),
> > VIRTQUEUE_LEGACY_MAX_SIZE);> 
> >      /* Hiprio queue */
> >      fs->hiprio_vq = virtio_add_queue(vdev, fs->conf.queue_size,
> >      vuf_handle_output);> 
> > diff --git a/hw/virtio/vhost-user-i2c.c b/hw/virtio/vhost-user-i2c.c
> > index eeb1d8853a..b248ddbe93 100644
> > --- a/hw/virtio/vhost-user-i2c.c
> > +++ b/hw/virtio/vhost-user-i2c.c
> > @@ -221,7 +221,7 @@ static void vu_i2c_device_realize(DeviceState *dev,
> > Error **errp)> 
> >      }
> >      
> >      virtio_init(vdev, "vhost-user-i2c", VIRTIO_ID_I2C_ADAPTER, 0,
> > 
> > -                VIRTQUEUE_MAX_SIZE);
> > +                VIRTQUEUE_LEGACY_MAX_SIZE);
> > 
> >      i2c->vhost_dev.nvqs = 1;
> >      i2c->vq = virtio_add_queue(vdev, 4, vu_i2c_handle_output);
> > 
> > diff --git a/hw/virtio/vhost-vsock-common.c
> > b/hw/virtio/vhost-vsock-common.c index a81fa884a8..73e6b72bba 100644
> > --- a/hw/virtio/vhost-vsock-common.c
> > +++ b/hw/virtio/vhost-vsock-common.c
> > @@ -201,7 +201,7 @@ void vhost_vsock_common_realize(VirtIODevice *vdev,
> > const char *name)> 
> >      VHostVSockCommon *vvc = VHOST_VSOCK_COMMON(vdev);
> >      
> >      virtio_init(vdev, name, VIRTIO_ID_VSOCK,
> > 
> > -                sizeof(struct virtio_vsock_config), VIRTQUEUE_MAX_SIZE);
> > +                sizeof(struct virtio_vsock_config),
> > VIRTQUEUE_LEGACY_MAX_SIZE);> 
> >      /* Receive and transmit queues belong to vhost */
> >      vvc->recv_vq = virtio_add_queue(vdev, VHOST_VSOCK_QUEUE_SIZE,
> > 
> > diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> > index 067c73223d..890fb15ed3 100644
> > --- a/hw/virtio/virtio-balloon.c
> > +++ b/hw/virtio/virtio-balloon.c
> > @@ -886,7 +886,7 @@ static void virtio_balloon_device_realize(DeviceState
> > *dev, Error **errp)> 
> >      int ret;
> >      
> >      virtio_init(vdev, "virtio-balloon", VIRTIO_ID_BALLOON,
> > 
> > -                virtio_balloon_config_size(s), VIRTQUEUE_MAX_SIZE);
> > +                virtio_balloon_config_size(s),
> > VIRTQUEUE_LEGACY_MAX_SIZE);
> > 
> >      ret = qemu_add_balloon_handler(virtio_balloon_to_target,
> >      
> >                                     virtio_balloon_stat, s);
> > 
> > diff --git a/hw/virtio/virtio-crypto.c b/hw/virtio/virtio-crypto.c
> > index 1e70d4d2a8..e13b6091d6 100644
> > --- a/hw/virtio/virtio-crypto.c
> > +++ b/hw/virtio/virtio-crypto.c
> > @@ -811,7 +811,7 @@ static void virtio_crypto_device_realize(DeviceState
> > *dev, Error **errp)> 
> >      }
> >      
> >      virtio_init(vdev, "virtio-crypto", VIRTIO_ID_CRYPTO,
> >      vcrypto->config_size,
> > 
> > -                VIRTQUEUE_MAX_SIZE);
> > +                VIRTQUEUE_LEGACY_MAX_SIZE);
> > 
> >      vcrypto->curr_queues = 1;
> >      vcrypto->vqs = g_malloc0(sizeof(VirtIOCryptoQueue) *
> >      vcrypto->max_queues);
> >      for (i = 0; i < vcrypto->max_queues; i++) {
> > 
> > diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
> > index ca360e74eb..845df78842 100644
> > --- a/hw/virtio/virtio-iommu.c
> > +++ b/hw/virtio/virtio-iommu.c
> > @@ -974,7 +974,7 @@ static void virtio_iommu_device_realize(DeviceState
> > *dev, Error **errp)> 
> >      VirtIOIOMMU *s = VIRTIO_IOMMU(dev);
> >      
> >      virtio_init(vdev, "virtio-iommu", VIRTIO_ID_IOMMU,
> > 
> > -                sizeof(struct virtio_iommu_config), VIRTQUEUE_MAX_SIZE);
> > +                sizeof(struct virtio_iommu_config),
> > VIRTQUEUE_LEGACY_MAX_SIZE);> 
> >      memset(s->iommu_pcibus_by_bus_num, 0,
> >      sizeof(s->iommu_pcibus_by_bus_num));
> > 
> > diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
> > index 1d9d01b871..7a39550cde 100644
> > --- a/hw/virtio/virtio-mem.c
> > +++ b/hw/virtio/virtio-mem.c
> > @@ -738,7 +738,7 @@ static void virtio_mem_device_realize(DeviceState
> > *dev, Error **errp)> 
> >      vmem->bitmap = bitmap_new(vmem->bitmap_size);
> >      
> >      virtio_init(vdev, TYPE_VIRTIO_MEM, VIRTIO_ID_MEM,
> > 
> > -                sizeof(struct virtio_mem_config), VIRTQUEUE_MAX_SIZE);
> > +                sizeof(struct virtio_mem_config),
> > VIRTQUEUE_LEGACY_MAX_SIZE);> 
> >      vmem->vq = virtio_add_queue(vdev, 128, virtio_mem_handle_request);
> >      
> >      host_memory_backend_set_mapped(vmem->memdev, true);
> > 
> > diff --git a/hw/virtio/virtio-mmio.c b/hw/virtio/virtio-mmio.c
> > index 7b3ebca178..ae0cc223e9 100644
> > --- a/hw/virtio/virtio-mmio.c
> > +++ b/hw/virtio/virtio-mmio.c
> > @@ -174,7 +174,7 @@ static uint64_t virtio_mmio_read(void *opaque, hwaddr
> > offset, unsigned size)> 
> >          if (!virtio_queue_get_num(vdev, vdev->queue_sel)) {
> >          
> >              return 0;
> >          
> >          }
> > 
> > -        return VIRTQUEUE_MAX_SIZE;
> > +        return VIRTQUEUE_LEGACY_MAX_SIZE;
> > 
> >      case VIRTIO_MMIO_QUEUE_PFN:
> >          if (!proxy->legacy) {
> >          
> >              qemu_log_mask(LOG_GUEST_ERROR,
> > 
> > @@ -348,7 +348,7 @@ static void virtio_mmio_write(void *opaque, hwaddr
> > offset, uint64_t value,> 
> >          }
> >          break;
> >      
> >      case VIRTIO_MMIO_QUEUE_NUM:
> > -        trace_virtio_mmio_queue_write(value, VIRTQUEUE_MAX_SIZE);
> > +        trace_virtio_mmio_queue_write(value, VIRTQUEUE_LEGACY_MAX_SIZE);
> > 
> >          virtio_queue_set_num(vdev, vdev->queue_sel, value);
> >          
> >          if (proxy->legacy) {
> > 
> > diff --git a/hw/virtio/virtio-pmem.c b/hw/virtio/virtio-pmem.c
> > index 82b54b00c5..5f4d375b58 100644
> > --- a/hw/virtio/virtio-pmem.c
> > +++ b/hw/virtio/virtio-pmem.c
> > @@ -124,7 +124,7 @@ static void virtio_pmem_realize(DeviceState *dev,
> > Error **errp)> 
> >      host_memory_backend_set_mapped(pmem->memdev, true);
> >      virtio_init(vdev, TYPE_VIRTIO_PMEM, VIRTIO_ID_PMEM,
> > 
> > -                sizeof(struct virtio_pmem_config), VIRTQUEUE_MAX_SIZE);
> > +                sizeof(struct virtio_pmem_config),
> > VIRTQUEUE_LEGACY_MAX_SIZE);> 
> >      pmem->rq_vq = virtio_add_queue(vdev, 128, virtio_pmem_flush);
> >  
> >  }
> > 
> > diff --git a/hw/virtio/virtio-rng.c b/hw/virtio/virtio-rng.c
> > index 0e91d60106..ab075b22b6 100644
> > --- a/hw/virtio/virtio-rng.c
> > +++ b/hw/virtio/virtio-rng.c
> > @@ -215,7 +215,8 @@ static void virtio_rng_device_realize(DeviceState
> > *dev, Error **errp)> 
> >          return;
> >      
> >      }
> > 
> > -    virtio_init(vdev, "virtio-rng", VIRTIO_ID_RNG, 0,
> > VIRTQUEUE_MAX_SIZE);
> > +    virtio_init(vdev, "virtio-rng", VIRTIO_ID_RNG, 0,
> > +                VIRTQUEUE_LEGACY_MAX_SIZE);
> > 
> >      vrng->vq = virtio_add_queue(vdev, 8, handle_input);
> >      vrng->quota_remaining = vrng->conf.max_bytes;
> > 
> > diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> > index a37d1f7d52..fe0f13266b 100644
> > --- a/include/hw/virtio/virtio.h
> > +++ b/include/hw/virtio/virtio.h
> > @@ -48,7 +48,25 @@ size_t virtio_feature_get_config_size(const
> > VirtIOFeature *features,> 
> >  typedef struct VirtQueue VirtQueue;
> > 
> > -#define VIRTQUEUE_MAX_SIZE 1024
> > +/*
> > + * This is meant as transitional measure for VIRTQUEUE_MAX_SIZE's old value
> > + * of 1024 to its new value of 32768. On the long-term virtio users should
> > + * either switch to VIRTQUEUE_MAX_SIZE, provided they support 32768,
> > + * otherwise they should replace this macro on their side with an
> > + * appropriate value actually supported by them.
> > + *
> > + * Once all virtio users switched, this macro will be removed.
> > + */
> > +#define VIRTQUEUE_LEGACY_MAX_SIZE 1024
> > +
> > +/*
> > + * Reflects the absolute theoretical maximum queue size (in amount of pages)
> > + * ever possible, which is actually the maximum queue size allowed by the
> > + * virtio protocol. This value therefore construes the maximum transfer size
> > + * possible with virtio (multiplied by system dependent PAGE_SIZE); assuming
> > + * a typical page size of 4k this would be a maximum transfer size of 128M.
> > + */
> > +#define VIRTQUEUE_MAX_SIZE 32768
> > 
> >  typedef struct VirtQueueElement
> >  {



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
  2021-10-05 11:10     ` [Virtio-fs] " Christian Schoenebeck
@ 2021-10-05 11:19       ` Michael S. Tsirkin
  -1 siblings, 0 replies; 97+ messages in thread
From: Michael S. Tsirkin @ 2021-10-05 11:19 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, David Hildenbrand,
	Jason Wang, Amit Shah, qemu-devel, Greg Kurz, virtio-fs,
	Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Fam Zheng, Raphael Norwitz,
	Dr. David Alan Gilbert

On Tue, Oct 05, 2021 at 01:10:56PM +0200, Christian Schoenebeck wrote:
> On Tuesday, 5 October 2021 09:38:53 CEST David Hildenbrand wrote:
> > On 04.10.21 21:38, Christian Schoenebeck wrote:
> > > At the moment the maximum transfer size with virtio is limited to 4M
> > > (1024 * PAGE_SIZE). This series raises this limit to its maximum
> > > theoretical possible transfer size of 128M (32k pages) according to the
> > > virtio specs:
> > > 
> > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-240006
> > I'm missing the "why do we care". Can you comment on that?
> 
> The primary motivation is the possibility of improved performance: e.g. in the
> case of 9pfs, people can raise the maximum transfer size with the Linux 9p
> client's 'msize' option on the guest side (and only on the guest side,
> actually). If the guest performs large chunk I/O, e.g. consider something
> "useful" like this one on the guest side:
> 
>   time cat large_file_on_9pfs.dat > /dev/null
> 
> Then there is a noticeable performance increase with higher transfer size
> values. That performance gain is continuous with rising transfer size values,
> but the increase obviously shrinks with rising transfer sizes as well, as with
> similar concepts in general like cache sizes, etc.
> 
> A secondary motivation is described in reason (2) of patch 2: if the transfer
> size is configurable on the guest side (as is the case with the 9pfs 'msize'
> option), then there is the unpleasant side effect that the current virtio
> limit of 4M is invisible to the guest; this value of 4M is simply an arbitrary
> limit set on the QEMU side in the past (probably just implementation motivated
> on the QEMU side at that point), i.e. it is not a limit specified by the
> virtio protocol,

According to the spec it's specified, sure enough: vq size limits the
size of indirect descriptors too.
However, ever since commit 44ed8089e991a60d614abe0ee4b9057a28b364e4 we
do not enforce it in the driver ...

> nor is this limit made known to the guest via the virtio protocol at all. The
> consequence with 9pfs would be that if the user tries to go higher than 4M,
> the system simply hangs with this QEMU error:
> 
>   virtio: too many write descriptors in indirect table
> 
> Now whether this is an issue or not for individual virtio users depends on
> whether the individual virtio user already had its own limitation <= 4M
> enforced on its side.
> 
> Best regards,
> Christian Schoenebeck
> 
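
For reference, the size arithmetic behind the 4M and 128M figures quoted above,
as a minimal sketch assuming the typical 4 KiB page size (PAGE_SIZE is system
dependent, and the macro names below are purely illustrative):

    #define PAGE_SIZE             4096                  /* assumed, 4 KiB */
    #define MAX_TRANSFER_SIZE_OLD (1024  * PAGE_SIZE)   /* =   4 MiB      */
    #define MAX_TRANSFER_SIZE_NEW (32768 * PAGE_SIZE)   /* = 128 MiB      */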



^ permalink raw reply	[flat|nested] 97+ messages in thread


* Re: [PATCH v2 2/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
  2021-10-05 11:17       ` [Virtio-fs] " Christian Schoenebeck
@ 2021-10-05 11:24         ` Michael S. Tsirkin
  -1 siblings, 0 replies; 97+ messages in thread
From: Michael S. Tsirkin @ 2021-10-05 11:24 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Amit Shah, Jason Wang,
	David Hildenbrand, qemu-devel, Raphael Norwitz, virtio-fs,
	Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Stefan Hajnoczi, Paolo Bonzini,
	Marc-André Lureau, Fam Zheng, Dr. David Alan Gilbert,
	Greg Kurz

On Tue, Oct 05, 2021 at 01:17:59PM +0200, Christian Schoenebeck wrote:
On Tuesday, 5 October 2021 09:16:07 CEST Michael S. Tsirkin wrote:
> > On Mon, Oct 04, 2021 at 09:38:08PM +0200, Christian Schoenebeck wrote:
> > > Raise the maximum possible virtio transfer size to 128M
> > > (more precisely: 32k * PAGE_SIZE). See previous commit for a
> > > more detailed explanation of the reasons for this change.
> > > 
> > > To avoid breaking any virtio user, all virtio users transition
> > > to using the new macro VIRTQUEUE_LEGACY_MAX_SIZE instead of
> > > VIRTQUEUE_MAX_SIZE, so they are all still using the old value
> > > of 1k with this commit.
> > > 
> > > In the long term, each virtio user should subsequently either
> > > switch from VIRTQUEUE_LEGACY_MAX_SIZE to VIRTQUEUE_MAX_SIZE
> > > after checking that they support the new value of 32k, or
> > > replace the VIRTQUEUE_LEGACY_MAX_SIZE macro with an
> > > appropriate value supported by them.
> > > 
> > > Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
> > 
> > I don't think we need this. Legacy isn't descriptive either.  Just leave
> > VIRTQUEUE_MAX_SIZE alone, and come up with a new name for 32k.
> 
> Does this mean you disagree that in the long term all virtio users should
> either transition to the new upper limit of 32k max queue size or introduce
> their own limit on their end?


depends. if 9pfs is the only one unhappy, we can keep 4k as
the default. it's sure a safe one.

> Independent of the name (and I would appreciate suggestions for an adequate
> macro name here), I still think this new limit should be placed in the shared
> virtio.h file, because this value is not something invented on the virtio user
> side. It rather reflects the theoretical upper limit possible with the virtio
> protocol, which is and will be common for all virtio users.


We can add this to the linux uapi headers, sure.
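
A minimal sketch of what such a shared definition could look like (the exact
header and the macro name below are assumptions for illustration, nothing has
been agreed on in this thread):

    /* e.g. in a shared/uapi virtio header (exact location to be decided) */
    /* Absolute maximum queue size permitted by the virtio specification. */
    #define VIRTIO_QUEUE_MAX_SIZE 32768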

> > > ---
> > > 
> > >  hw/9pfs/virtio-9p-device.c     |  2 +-
> > >  hw/block/vhost-user-blk.c      |  6 +++---
> > >  hw/block/virtio-blk.c          |  6 +++---
> > >  hw/char/virtio-serial-bus.c    |  2 +-
> > >  hw/input/virtio-input.c        |  2 +-
> > >  hw/net/virtio-net.c            | 12 ++++++------
> > >  hw/scsi/virtio-scsi.c          |  2 +-
> > >  hw/virtio/vhost-user-fs.c      |  6 +++---
> > >  hw/virtio/vhost-user-i2c.c     |  2 +-
> > >  hw/virtio/vhost-vsock-common.c |  2 +-
> > >  hw/virtio/virtio-balloon.c     |  2 +-
> > >  hw/virtio/virtio-crypto.c      |  2 +-
> > >  hw/virtio/virtio-iommu.c       |  2 +-
> > >  hw/virtio/virtio-mem.c         |  2 +-
> > >  hw/virtio/virtio-mmio.c        |  4 ++--
> > >  hw/virtio/virtio-pmem.c        |  2 +-
> > >  hw/virtio/virtio-rng.c         |  3 ++-
> > >  include/hw/virtio/virtio.h     | 20 +++++++++++++++++++-
> > >  18 files changed, 49 insertions(+), 30 deletions(-)
> > > 
> > > diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
> > > index cd5d95dd51..9013e7df6e 100644
> > > --- a/hw/9pfs/virtio-9p-device.c
> > > +++ b/hw/9pfs/virtio-9p-device.c
> > > @@ -217,7 +217,7 @@ static void virtio_9p_device_realize(DeviceState *dev,
> > > Error **errp)> 
> > >      v->config_size = sizeof(struct virtio_9p_config) +
> > >      strlen(s->fsconf.tag);
> > >      virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
> > > 
> > > -                VIRTQUEUE_MAX_SIZE);
> > > +                VIRTQUEUE_LEGACY_MAX_SIZE);
> > > 
> > >      v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
> > >  
> > >  }
> > > 
> > > diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
> > > index 336f56705c..e5e45262ab 100644
> > > --- a/hw/block/vhost-user-blk.c
> > > +++ b/hw/block/vhost-user-blk.c
> > > @@ -480,9 +480,9 @@ static void vhost_user_blk_device_realize(DeviceState
> > > *dev, Error **errp)> 
> > >          error_setg(errp, "queue size must be non-zero");
> > >          return;
> > >      
> > >      }
> > > 
> > > -    if (s->queue_size > VIRTQUEUE_MAX_SIZE) {
> > > +    if (s->queue_size > VIRTQUEUE_LEGACY_MAX_SIZE) {
> > > 
> > >          error_setg(errp, "queue size must not exceed %d",
> > > 
> > > -                   VIRTQUEUE_MAX_SIZE);
> > > +                   VIRTQUEUE_LEGACY_MAX_SIZE);
> > > 
> > >          return;
> > >      
> > >      }
> > > 
> > > @@ -491,7 +491,7 @@ static void vhost_user_blk_device_realize(DeviceState
> > > *dev, Error **errp)> 
> > >      }
> > >      
> > >      virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK,
> > > 
> > > -                sizeof(struct virtio_blk_config), VIRTQUEUE_MAX_SIZE);
> > > +                sizeof(struct virtio_blk_config),
> > > VIRTQUEUE_LEGACY_MAX_SIZE);> 
> > >      s->virtqs = g_new(VirtQueue *, s->num_queues);
> > >      for (i = 0; i < s->num_queues; i++) {
> > > 
> > > diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
> > > index 9c0f46815c..5883e3e7db 100644
> > > --- a/hw/block/virtio-blk.c
> > > +++ b/hw/block/virtio-blk.c
> > > @@ -1171,10 +1171,10 @@ static void virtio_blk_device_realize(DeviceState
> > > *dev, Error **errp)> 
> > >          return;
> > >      
> > >      }
> > >      if (!is_power_of_2(conf->queue_size) ||
> > > 
> > > -        conf->queue_size > VIRTQUEUE_MAX_SIZE) {
> > > +        conf->queue_size > VIRTQUEUE_LEGACY_MAX_SIZE) {
> > > 
> > >          error_setg(errp, "invalid queue-size property (%" PRIu16 "), "
> > >          
> > >                     "must be a power of 2 (max %d)",
> > > 
> > > -                   conf->queue_size, VIRTQUEUE_MAX_SIZE);
> > > +                   conf->queue_size, VIRTQUEUE_LEGACY_MAX_SIZE);
> > > 
> > >          return;
> > >      
> > >      }
> > > 
> > > @@ -1214,7 +1214,7 @@ static void virtio_blk_device_realize(DeviceState
> > > *dev, Error **errp)> 
> > >      virtio_blk_set_config_size(s, s->host_features);
> > >      
> > >      virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK, s->config_size,
> > > 
> > > -                VIRTQUEUE_MAX_SIZE);
> > > +                VIRTQUEUE_LEGACY_MAX_SIZE);
> > > 
> > >      s->blk = conf->conf.blk;
> > >      s->rq = NULL;
> > > 
> > > diff --git a/hw/char/virtio-serial-bus.c b/hw/char/virtio-serial-bus.c
> > > index 9ad9111115..2d4285ab53 100644
> > > --- a/hw/char/virtio-serial-bus.c
> > > +++ b/hw/char/virtio-serial-bus.c
> > > @@ -1045,7 +1045,7 @@ static void virtio_serial_device_realize(DeviceState
> > > *dev, Error **errp)> 
> > >          config_size = offsetof(struct virtio_console_config, emerg_wr);
> > >      
> > >      }
> > >      virtio_init(vdev, "virtio-serial", VIRTIO_ID_CONSOLE,
> > > 
> > > -                config_size, VIRTQUEUE_MAX_SIZE);
> > > +                config_size, VIRTQUEUE_LEGACY_MAX_SIZE);
> > > 
> > >      /* Spawn a new virtio-serial bus on which the ports will ride as
> > >      devices */
> > >      qbus_init(&vser->bus, sizeof(vser->bus), TYPE_VIRTIO_SERIAL_BUS,
> > > 
> > > diff --git a/hw/input/virtio-input.c b/hw/input/virtio-input.c
> > > index 345eb2cce7..b6b77488f2 100644
> > > --- a/hw/input/virtio-input.c
> > > +++ b/hw/input/virtio-input.c
> > > @@ -258,7 +258,7 @@ static void virtio_input_device_realize(DeviceState
> > > *dev, Error **errp)> 
> > >      assert(vinput->cfg_size <= sizeof(virtio_input_config));
> > >      
> > >      virtio_init(vdev, "virtio-input", VIRTIO_ID_INPUT,
> > > 
> > > -                vinput->cfg_size, VIRTQUEUE_MAX_SIZE);
> > > +                vinput->cfg_size, VIRTQUEUE_LEGACY_MAX_SIZE);
> > > 
> > >      vinput->evt = virtio_add_queue(vdev, 64, virtio_input_handle_evt);
> > >      vinput->sts = virtio_add_queue(vdev, 64, virtio_input_handle_sts);
> > >  
> > >  }
> > > 
> > > diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> > > index f74b5f6268..5100978b07 100644
> > > --- a/hw/net/virtio-net.c
> > > +++ b/hw/net/virtio-net.c
> > > @@ -636,7 +636,7 @@ static int virtio_net_max_tx_queue_size(VirtIONet *n)
> > > 
> > >          return VIRTIO_NET_TX_QUEUE_DEFAULT_SIZE;
> > >      
> > >      }
> > > 
> > > -    return VIRTQUEUE_MAX_SIZE;
> > > +    return VIRTQUEUE_LEGACY_MAX_SIZE;
> > > 
> > >  }
> > >  
> > >  static int peer_attach(VirtIONet *n, int index)
> > > 
> > > @@ -3365,7 +3365,7 @@ static void virtio_net_device_realize(DeviceState
> > > *dev, Error **errp)> 
> > >      virtio_net_set_config_size(n, n->host_features);
> > >      virtio_init(vdev, "virtio-net", VIRTIO_ID_NET, n->config_size,
> > > 
> > > -                VIRTQUEUE_MAX_SIZE);
> > > +                VIRTQUEUE_LEGACY_MAX_SIZE);
> > > 
> > >      /*
> > >      
> > >       * We set a lower limit on RX queue size to what it always was.
> > > 
> > > @@ -3373,23 +3373,23 @@ static void virtio_net_device_realize(DeviceState
> > > *dev, Error **errp)> 
> > >       * help from us (using virtio 1 and up).
> > >       */
> > >      
> > >      if (n->net_conf.rx_queue_size < VIRTIO_NET_RX_QUEUE_MIN_SIZE ||
> > > 
> > > -        n->net_conf.rx_queue_size > VIRTQUEUE_MAX_SIZE ||
> > > +        n->net_conf.rx_queue_size > VIRTQUEUE_LEGACY_MAX_SIZE ||
> > > 
> > >          !is_power_of_2(n->net_conf.rx_queue_size)) {
> > >          error_setg(errp, "Invalid rx_queue_size (= %" PRIu16 "), "
> > >          
> > >                     "must be a power of 2 between %d and %d.",
> > >                     n->net_conf.rx_queue_size,
> > >                     VIRTIO_NET_RX_QUEUE_MIN_SIZE,
> > > 
> > > -                   VIRTQUEUE_MAX_SIZE);
> > > +                   VIRTQUEUE_LEGACY_MAX_SIZE);
> > > 
> > >          virtio_cleanup(vdev);
> > >          return;
> > >      
> > >      }
> > >      
> > >      if (n->net_conf.tx_queue_size < VIRTIO_NET_TX_QUEUE_MIN_SIZE ||
> > > 
> > > -        n->net_conf.tx_queue_size > VIRTQUEUE_MAX_SIZE ||
> > > +        n->net_conf.tx_queue_size > VIRTQUEUE_LEGACY_MAX_SIZE ||
> > > 
> > >          !is_power_of_2(n->net_conf.tx_queue_size)) {
> > >          error_setg(errp, "Invalid tx_queue_size (= %" PRIu16 "), "
> > >          
> > >                     "must be a power of 2 between %d and %d",
> > >                     n->net_conf.tx_queue_size,
> > >                     VIRTIO_NET_TX_QUEUE_MIN_SIZE,
> > > 
> > > -                   VIRTQUEUE_MAX_SIZE);
> > > +                   VIRTQUEUE_LEGACY_MAX_SIZE);
> > > 
> > >          virtio_cleanup(vdev);
> > >          return;
> > >      
> > >      }
> > > 
> > > diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
> > > index 5e5e657e1d..f204e8878a 100644
> > > --- a/hw/scsi/virtio-scsi.c
> > > +++ b/hw/scsi/virtio-scsi.c
> > > @@ -973,7 +973,7 @@ void virtio_scsi_common_realize(DeviceState *dev,
> > > 
> > >      int i;
> > >      
> > >      virtio_init(vdev, "virtio-scsi", VIRTIO_ID_SCSI,
> > > 
> > > -                sizeof(VirtIOSCSIConfig), VIRTQUEUE_MAX_SIZE);
> > > +                sizeof(VirtIOSCSIConfig), VIRTQUEUE_LEGACY_MAX_SIZE);
> > > 
> > >      if (s->conf.num_queues == VIRTIO_SCSI_AUTO_NUM_QUEUES) {
> > >      
> > >          s->conf.num_queues = 1;
> > > 
> > > diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
> > > index ae1672d667..decc5def39 100644
> > > --- a/hw/virtio/vhost-user-fs.c
> > > +++ b/hw/virtio/vhost-user-fs.c
> > > @@ -209,9 +209,9 @@ static void vuf_device_realize(DeviceState *dev, Error
> > > **errp)> 
> > >          return;
> > >      
> > >      }
> > > 
> > > -    if (fs->conf.queue_size > VIRTQUEUE_MAX_SIZE) {
> > > +    if (fs->conf.queue_size > VIRTQUEUE_LEGACY_MAX_SIZE) {
> > > 
> > >          error_setg(errp, "queue-size property must be %u or smaller",
> > > 
> > > -                   VIRTQUEUE_MAX_SIZE);
> > > +                   VIRTQUEUE_LEGACY_MAX_SIZE);
> > > 
> > >          return;
> > >      
> > >      }
> > > 
> > > @@ -220,7 +220,7 @@ static void vuf_device_realize(DeviceState *dev, Error
> > > **errp)> 
> > >      }
> > >      
> > >      virtio_init(vdev, "vhost-user-fs", VIRTIO_ID_FS,
> > > 
> > > -                sizeof(struct virtio_fs_config), VIRTQUEUE_MAX_SIZE);
> > > +                sizeof(struct virtio_fs_config),
> > > VIRTQUEUE_LEGACY_MAX_SIZE);> 
> > >      /* Hiprio queue */
> > >      fs->hiprio_vq = virtio_add_queue(vdev, fs->conf.queue_size,
> > >      vuf_handle_output);> 
> > > diff --git a/hw/virtio/vhost-user-i2c.c b/hw/virtio/vhost-user-i2c.c
> > > index eeb1d8853a..b248ddbe93 100644
> > > --- a/hw/virtio/vhost-user-i2c.c
> > > +++ b/hw/virtio/vhost-user-i2c.c
> > > @@ -221,7 +221,7 @@ static void vu_i2c_device_realize(DeviceState *dev,
> > > Error **errp)> 
> > >      }
> > >      
> > >      virtio_init(vdev, "vhost-user-i2c", VIRTIO_ID_I2C_ADAPTER, 0,
> > > 
> > > -                VIRTQUEUE_MAX_SIZE);
> > > +                VIRTQUEUE_LEGACY_MAX_SIZE);
> > > 
> > >      i2c->vhost_dev.nvqs = 1;
> > >      i2c->vq = virtio_add_queue(vdev, 4, vu_i2c_handle_output);
> > > 
> > > diff --git a/hw/virtio/vhost-vsock-common.c
> > > b/hw/virtio/vhost-vsock-common.c index a81fa884a8..73e6b72bba 100644
> > > --- a/hw/virtio/vhost-vsock-common.c
> > > +++ b/hw/virtio/vhost-vsock-common.c
> > > @@ -201,7 +201,7 @@ void vhost_vsock_common_realize(VirtIODevice *vdev,
> > > const char *name)> 
> > >      VHostVSockCommon *vvc = VHOST_VSOCK_COMMON(vdev);
> > >      
> > >      virtio_init(vdev, name, VIRTIO_ID_VSOCK,
> > > 
> > > -                sizeof(struct virtio_vsock_config), VIRTQUEUE_MAX_SIZE);
> > > +                sizeof(struct virtio_vsock_config),
> > > VIRTQUEUE_LEGACY_MAX_SIZE);> 
> > >      /* Receive and transmit queues belong to vhost */
> > >      vvc->recv_vq = virtio_add_queue(vdev, VHOST_VSOCK_QUEUE_SIZE,
> > > 
> > > diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> > > index 067c73223d..890fb15ed3 100644
> > > --- a/hw/virtio/virtio-balloon.c
> > > +++ b/hw/virtio/virtio-balloon.c
> > > @@ -886,7 +886,7 @@ static void virtio_balloon_device_realize(DeviceState
> > > *dev, Error **errp)> 
> > >      int ret;
> > >      
> > >      virtio_init(vdev, "virtio-balloon", VIRTIO_ID_BALLOON,
> > > 
> > > -                virtio_balloon_config_size(s), VIRTQUEUE_MAX_SIZE);
> > > +                virtio_balloon_config_size(s),
> > > VIRTQUEUE_LEGACY_MAX_SIZE);
> > > 
> > >      ret = qemu_add_balloon_handler(virtio_balloon_to_target,
> > >      
> > >                                     virtio_balloon_stat, s);
> > > 
> > > diff --git a/hw/virtio/virtio-crypto.c b/hw/virtio/virtio-crypto.c
> > > index 1e70d4d2a8..e13b6091d6 100644
> > > --- a/hw/virtio/virtio-crypto.c
> > > +++ b/hw/virtio/virtio-crypto.c
> > > @@ -811,7 +811,7 @@ static void virtio_crypto_device_realize(DeviceState
> > > *dev, Error **errp)> 
> > >      }
> > >      
> > >      virtio_init(vdev, "virtio-crypto", VIRTIO_ID_CRYPTO,
> > >      vcrypto->config_size,
> > > 
> > > -                VIRTQUEUE_MAX_SIZE);
> > > +                VIRTQUEUE_LEGACY_MAX_SIZE);
> > > 
> > >      vcrypto->curr_queues = 1;
> > >      vcrypto->vqs = g_malloc0(sizeof(VirtIOCryptoQueue) *
> > >      vcrypto->max_queues);
> > >      for (i = 0; i < vcrypto->max_queues; i++) {
> > > 
> > > diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
> > > index ca360e74eb..845df78842 100644
> > > --- a/hw/virtio/virtio-iommu.c
> > > +++ b/hw/virtio/virtio-iommu.c
> > > @@ -974,7 +974,7 @@ static void virtio_iommu_device_realize(DeviceState
> > > *dev, Error **errp)> 
> > >      VirtIOIOMMU *s = VIRTIO_IOMMU(dev);
> > >      
> > >      virtio_init(vdev, "virtio-iommu", VIRTIO_ID_IOMMU,
> > > 
> > > -                sizeof(struct virtio_iommu_config), VIRTQUEUE_MAX_SIZE);
> > > +                sizeof(struct virtio_iommu_config),
> > > VIRTQUEUE_LEGACY_MAX_SIZE);> 
> > >      memset(s->iommu_pcibus_by_bus_num, 0,
> > >      sizeof(s->iommu_pcibus_by_bus_num));
> > > 
> > > diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
> > > index 1d9d01b871..7a39550cde 100644
> > > --- a/hw/virtio/virtio-mem.c
> > > +++ b/hw/virtio/virtio-mem.c
> > > @@ -738,7 +738,7 @@ static void virtio_mem_device_realize(DeviceState
> > > *dev, Error **errp)> 
> > >      vmem->bitmap = bitmap_new(vmem->bitmap_size);
> > >      
> > >      virtio_init(vdev, TYPE_VIRTIO_MEM, VIRTIO_ID_MEM,
> > > 
> > > -                sizeof(struct virtio_mem_config), VIRTQUEUE_MAX_SIZE);
> > > +                sizeof(struct virtio_mem_config),
> > > VIRTQUEUE_LEGACY_MAX_SIZE);> 
> > >      vmem->vq = virtio_add_queue(vdev, 128, virtio_mem_handle_request);
> > >      
> > >      host_memory_backend_set_mapped(vmem->memdev, true);
> > > 
> > > diff --git a/hw/virtio/virtio-mmio.c b/hw/virtio/virtio-mmio.c
> > > index 7b3ebca178..ae0cc223e9 100644
> > > --- a/hw/virtio/virtio-mmio.c
> > > +++ b/hw/virtio/virtio-mmio.c
> > > @@ -174,7 +174,7 @@ static uint64_t virtio_mmio_read(void *opaque, hwaddr
> > > offset, unsigned size)> 
> > >          if (!virtio_queue_get_num(vdev, vdev->queue_sel)) {
> > >          
> > >              return 0;
> > >          
> > >          }
> > > 
> > > -        return VIRTQUEUE_MAX_SIZE;
> > > +        return VIRTQUEUE_LEGACY_MAX_SIZE;
> > > 
> > >      case VIRTIO_MMIO_QUEUE_PFN:
> > >          if (!proxy->legacy) {
> > >          
> > >              qemu_log_mask(LOG_GUEST_ERROR,
> > > 
> > > @@ -348,7 +348,7 @@ static void virtio_mmio_write(void *opaque, hwaddr
> > > offset, uint64_t value,> 
> > >          }
> > >          break;
> > >      
> > >      case VIRTIO_MMIO_QUEUE_NUM:
> > > -        trace_virtio_mmio_queue_write(value, VIRTQUEUE_MAX_SIZE);
> > > +        trace_virtio_mmio_queue_write(value, VIRTQUEUE_LEGACY_MAX_SIZE);
> > > 
> > >          virtio_queue_set_num(vdev, vdev->queue_sel, value);
> > >          
> > >          if (proxy->legacy) {
> > > 
> > > diff --git a/hw/virtio/virtio-pmem.c b/hw/virtio/virtio-pmem.c
> > > index 82b54b00c5..5f4d375b58 100644
> > > --- a/hw/virtio/virtio-pmem.c
> > > +++ b/hw/virtio/virtio-pmem.c
> > > @@ -124,7 +124,7 @@ static void virtio_pmem_realize(DeviceState *dev,
> > > Error **errp)> 
> > >      host_memory_backend_set_mapped(pmem->memdev, true);
> > >      virtio_init(vdev, TYPE_VIRTIO_PMEM, VIRTIO_ID_PMEM,
> > > 
> > > -                sizeof(struct virtio_pmem_config), VIRTQUEUE_MAX_SIZE);
> > > +                sizeof(struct virtio_pmem_config),
> > > VIRTQUEUE_LEGACY_MAX_SIZE);> 
> > >      pmem->rq_vq = virtio_add_queue(vdev, 128, virtio_pmem_flush);
> > >  
> > >  }
> > > 
> > > diff --git a/hw/virtio/virtio-rng.c b/hw/virtio/virtio-rng.c
> > > index 0e91d60106..ab075b22b6 100644
> > > --- a/hw/virtio/virtio-rng.c
> > > +++ b/hw/virtio/virtio-rng.c
> > > @@ -215,7 +215,8 @@ static void virtio_rng_device_realize(DeviceState
> > > *dev, Error **errp)> 
> > >          return;
> > >      
> > >      }
> > > 
> > > -    virtio_init(vdev, "virtio-rng", VIRTIO_ID_RNG, 0,
> > > VIRTQUEUE_MAX_SIZE);
> > > +    virtio_init(vdev, "virtio-rng", VIRTIO_ID_RNG, 0,
> > > +                VIRTQUEUE_LEGACY_MAX_SIZE);
> > > 
> > >      vrng->vq = virtio_add_queue(vdev, 8, handle_input);
> > >      vrng->quota_remaining = vrng->conf.max_bytes;
> > > 
> > > diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> > > index a37d1f7d52..fe0f13266b 100644
> > > --- a/include/hw/virtio/virtio.h
> > > +++ b/include/hw/virtio/virtio.h
> > > @@ -48,7 +48,25 @@ size_t virtio_feature_get_config_size(const
> > > VirtIOFeature *features,> 
> > >  typedef struct VirtQueue VirtQueue;
> > > 
> > > -#define VIRTQUEUE_MAX_SIZE 1024
> > > +/*
> > > + * This is meant as transitional measure for VIRTQUEUE_MAX_SIZE's old value
> > > + * of 1024 to its new value of 32768. On the long-term virtio users should
> > > + * either switch to VIRTQUEUE_MAX_SIZE, provided they support 32768,
> > > + * otherwise they should replace this macro on their side with an
> > > + * appropriate value actually supported by them.
> > > + *
> > > + * Once all virtio users switched, this macro will be removed.
> > > + */
> > > +#define VIRTQUEUE_LEGACY_MAX_SIZE 1024
> > > +
> > > +/*
> > > + * Reflects the absolute theoretical maximum queue size (in amount of pages)
> > > + * ever possible, which is actually the maximum queue size allowed by the
> > > + * virtio protocol. This value therefore construes the maximum transfer size
> > > + * possible with virtio (multiplied by system dependent PAGE_SIZE); assuming
> > > + * a typical page size of 4k this would be a maximum transfer size of 128M.
> > > + */
> > > +#define VIRTQUEUE_MAX_SIZE 32768
> > > 
> > >  typedef struct VirtQueueElement
> > >  {
> 



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Virtio-fs] [PATCH v2 2/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
@ 2021-10-05 11:24         ` Michael S. Tsirkin
  0 siblings, 0 replies; 97+ messages in thread
From: Michael S. Tsirkin @ 2021-10-05 11:24 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Amit Shah, Jason Wang,
	David Hildenbrand, qemu-devel, Raphael Norwitz, virtio-fs,
	Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Paolo Bonzini, Marc-André Lureau, Fam Zheng

On Tue, Oct 05, 2021 at 01:17:59PM +0200, Christian Schoenebeck wrote:
> On Tuesday, 5 October 2021 09:16:07 CEST Michael S. Tsirkin wrote:
> > On Mon, Oct 04, 2021 at 09:38:08PM +0200, Christian Schoenebeck wrote:
> > > Raise the maximum possible virtio transfer size to 128M
> > > (more precisely: 32k * PAGE_SIZE). See previous commit for a
> > > more detailed explanation of the reasons for this change.
> > > 
> > > To avoid breaking any virtio user, all virtio users transition
> > > to using the new macro VIRTQUEUE_LEGACY_MAX_SIZE instead of
> > > VIRTQUEUE_MAX_SIZE, so they are all still using the old value
> > > of 1k with this commit.
> > > 
> > > In the long term, each virtio user should subsequently either
> > > switch from VIRTQUEUE_LEGACY_MAX_SIZE to VIRTQUEUE_MAX_SIZE
> > > after checking that they support the new value of 32k, or
> > > replace the VIRTQUEUE_LEGACY_MAX_SIZE macro with an
> > > appropriate value supported by them.
> > > 
> > > Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
> > 
> > I don't think we need this. Legacy isn't descriptive either.  Just leave
> > VIRTQUEUE_MAX_SIZE alone, and come up with a new name for 32k.
> 
> Does this mean you disagree that in the long term all virtio users should
> either transition to the new upper limit of 32k max queue size or introduce
> their own limit on their end?


depends. if 9pfs is the only one unhappy, we can keep 4k as
the default. it's sure a safe one.

> Independent of the name (and I would appreciate suggestions for an adequate
> macro name here), I still think this new limit should be placed in the shared
> virtio.h file, because this value is not something invented on the virtio user
> side. It rather reflects the theoretical upper limit possible with the virtio
> protocol, which is and will be common for all virtio users.


We can add this to the linux uapi headers, sure.

> > > ---
> > > 
> > >  hw/9pfs/virtio-9p-device.c     |  2 +-
> > >  hw/block/vhost-user-blk.c      |  6 +++---
> > >  hw/block/virtio-blk.c          |  6 +++---
> > >  hw/char/virtio-serial-bus.c    |  2 +-
> > >  hw/input/virtio-input.c        |  2 +-
> > >  hw/net/virtio-net.c            | 12 ++++++------
> > >  hw/scsi/virtio-scsi.c          |  2 +-
> > >  hw/virtio/vhost-user-fs.c      |  6 +++---
> > >  hw/virtio/vhost-user-i2c.c     |  2 +-
> > >  hw/virtio/vhost-vsock-common.c |  2 +-
> > >  hw/virtio/virtio-balloon.c     |  2 +-
> > >  hw/virtio/virtio-crypto.c      |  2 +-
> > >  hw/virtio/virtio-iommu.c       |  2 +-
> > >  hw/virtio/virtio-mem.c         |  2 +-
> > >  hw/virtio/virtio-mmio.c        |  4 ++--
> > >  hw/virtio/virtio-pmem.c        |  2 +-
> > >  hw/virtio/virtio-rng.c         |  3 ++-
> > >  include/hw/virtio/virtio.h     | 20 +++++++++++++++++++-
> > >  18 files changed, 49 insertions(+), 30 deletions(-)
> > > 
> > > diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
> > > index cd5d95dd51..9013e7df6e 100644
> > > --- a/hw/9pfs/virtio-9p-device.c
> > > +++ b/hw/9pfs/virtio-9p-device.c
> > > @@ -217,7 +217,7 @@ static void virtio_9p_device_realize(DeviceState *dev,
> > > Error **errp)> 
> > >      v->config_size = sizeof(struct virtio_9p_config) +
> > >      strlen(s->fsconf.tag);
> > >      virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
> > > 
> > > -                VIRTQUEUE_MAX_SIZE);
> > > +                VIRTQUEUE_LEGACY_MAX_SIZE);
> > > 
> > >      v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
> > >  
> > >  }
> > > 
> > > diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
> > > index 336f56705c..e5e45262ab 100644
> > > --- a/hw/block/vhost-user-blk.c
> > > +++ b/hw/block/vhost-user-blk.c
> > > @@ -480,9 +480,9 @@ static void vhost_user_blk_device_realize(DeviceState
> > > *dev, Error **errp)> 
> > >          error_setg(errp, "queue size must be non-zero");
> > >          return;
> > >      
> > >      }
> > > 
> > > -    if (s->queue_size > VIRTQUEUE_MAX_SIZE) {
> > > +    if (s->queue_size > VIRTQUEUE_LEGACY_MAX_SIZE) {
> > > 
> > >          error_setg(errp, "queue size must not exceed %d",
> > > 
> > > -                   VIRTQUEUE_MAX_SIZE);
> > > +                   VIRTQUEUE_LEGACY_MAX_SIZE);
> > > 
> > >          return;
> > >      
> > >      }
> > > 
> > > @@ -491,7 +491,7 @@ static void vhost_user_blk_device_realize(DeviceState
> > > *dev, Error **errp)> 
> > >      }
> > >      
> > >      virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK,
> > > 
> > > -                sizeof(struct virtio_blk_config), VIRTQUEUE_MAX_SIZE);
> > > +                sizeof(struct virtio_blk_config),
> > > VIRTQUEUE_LEGACY_MAX_SIZE);> 
> > >      s->virtqs = g_new(VirtQueue *, s->num_queues);
> > >      for (i = 0; i < s->num_queues; i++) {
> > > 
> > > diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
> > > index 9c0f46815c..5883e3e7db 100644
> > > --- a/hw/block/virtio-blk.c
> > > +++ b/hw/block/virtio-blk.c
> > > @@ -1171,10 +1171,10 @@ static void virtio_blk_device_realize(DeviceState
> > > *dev, Error **errp)> 
> > >          return;
> > >      
> > >      }
> > >      if (!is_power_of_2(conf->queue_size) ||
> > > 
> > > -        conf->queue_size > VIRTQUEUE_MAX_SIZE) {
> > > +        conf->queue_size > VIRTQUEUE_LEGACY_MAX_SIZE) {
> > > 
> > >          error_setg(errp, "invalid queue-size property (%" PRIu16 "), "
> > >          
> > >                     "must be a power of 2 (max %d)",
> > > 
> > > -                   conf->queue_size, VIRTQUEUE_MAX_SIZE);
> > > +                   conf->queue_size, VIRTQUEUE_LEGACY_MAX_SIZE);
> > > 
> > >          return;
> > >      
> > >      }
> > > 
> > > @@ -1214,7 +1214,7 @@ static void virtio_blk_device_realize(DeviceState
> > > *dev, Error **errp)> 
> > >      virtio_blk_set_config_size(s, s->host_features);
> > >      
> > >      virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK, s->config_size,
> > > 
> > > -                VIRTQUEUE_MAX_SIZE);
> > > +                VIRTQUEUE_LEGACY_MAX_SIZE);
> > > 
> > >      s->blk = conf->conf.blk;
> > >      s->rq = NULL;
> > > 
> > > diff --git a/hw/char/virtio-serial-bus.c b/hw/char/virtio-serial-bus.c
> > > index 9ad9111115..2d4285ab53 100644
> > > --- a/hw/char/virtio-serial-bus.c
> > > +++ b/hw/char/virtio-serial-bus.c
> > > @@ -1045,7 +1045,7 @@ static void virtio_serial_device_realize(DeviceState
> > > *dev, Error **errp)> 
> > >          config_size = offsetof(struct virtio_console_config, emerg_wr);
> > >      
> > >      }
> > >      virtio_init(vdev, "virtio-serial", VIRTIO_ID_CONSOLE,
> > > 
> > > -                config_size, VIRTQUEUE_MAX_SIZE);
> > > +                config_size, VIRTQUEUE_LEGACY_MAX_SIZE);
> > > 
> > >      /* Spawn a new virtio-serial bus on which the ports will ride as
> > >      devices */
> > >      qbus_init(&vser->bus, sizeof(vser->bus), TYPE_VIRTIO_SERIAL_BUS,
> > > 
> > > diff --git a/hw/input/virtio-input.c b/hw/input/virtio-input.c
> > > index 345eb2cce7..b6b77488f2 100644
> > > --- a/hw/input/virtio-input.c
> > > +++ b/hw/input/virtio-input.c
> > > @@ -258,7 +258,7 @@ static void virtio_input_device_realize(DeviceState
> > > *dev, Error **errp)> 
> > >      assert(vinput->cfg_size <= sizeof(virtio_input_config));
> > >      
> > >      virtio_init(vdev, "virtio-input", VIRTIO_ID_INPUT,
> > > 
> > > -                vinput->cfg_size, VIRTQUEUE_MAX_SIZE);
> > > +                vinput->cfg_size, VIRTQUEUE_LEGACY_MAX_SIZE);
> > > 
> > >      vinput->evt = virtio_add_queue(vdev, 64, virtio_input_handle_evt);
> > >      vinput->sts = virtio_add_queue(vdev, 64, virtio_input_handle_sts);
> > >  
> > >  }
> > > 
> > > diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> > > index f74b5f6268..5100978b07 100644
> > > --- a/hw/net/virtio-net.c
> > > +++ b/hw/net/virtio-net.c
> > > @@ -636,7 +636,7 @@ static int virtio_net_max_tx_queue_size(VirtIONet *n)
> > > 
> > >          return VIRTIO_NET_TX_QUEUE_DEFAULT_SIZE;
> > >      
> > >      }
> > > 
> > > -    return VIRTQUEUE_MAX_SIZE;
> > > +    return VIRTQUEUE_LEGACY_MAX_SIZE;
> > > 
> > >  }
> > >  
> > >  static int peer_attach(VirtIONet *n, int index)
> > > 
> > > @@ -3365,7 +3365,7 @@ static void virtio_net_device_realize(DeviceState
> > > *dev, Error **errp)> 
> > >      virtio_net_set_config_size(n, n->host_features);
> > >      virtio_init(vdev, "virtio-net", VIRTIO_ID_NET, n->config_size,
> > > 
> > > -                VIRTQUEUE_MAX_SIZE);
> > > +                VIRTQUEUE_LEGACY_MAX_SIZE);
> > > 
> > >      /*
> > >      
> > >       * We set a lower limit on RX queue size to what it always was.
> > > 
> > > @@ -3373,23 +3373,23 @@ static void virtio_net_device_realize(DeviceState
> > > *dev, Error **errp)> 
> > >       * help from us (using virtio 1 and up).
> > >       */
> > >      
> > >      if (n->net_conf.rx_queue_size < VIRTIO_NET_RX_QUEUE_MIN_SIZE ||
> > > 
> > > -        n->net_conf.rx_queue_size > VIRTQUEUE_MAX_SIZE ||
> > > +        n->net_conf.rx_queue_size > VIRTQUEUE_LEGACY_MAX_SIZE ||
> > > 
> > >          !is_power_of_2(n->net_conf.rx_queue_size)) {
> > >          error_setg(errp, "Invalid rx_queue_size (= %" PRIu16 "), "
> > >          
> > >                     "must be a power of 2 between %d and %d.",
> > >                     n->net_conf.rx_queue_size,
> > >                     VIRTIO_NET_RX_QUEUE_MIN_SIZE,
> > > 
> > > -                   VIRTQUEUE_MAX_SIZE);
> > > +                   VIRTQUEUE_LEGACY_MAX_SIZE);
> > > 
> > >          virtio_cleanup(vdev);
> > >          return;
> > >      
> > >      }
> > >      
> > >      if (n->net_conf.tx_queue_size < VIRTIO_NET_TX_QUEUE_MIN_SIZE ||
> > > 
> > > -        n->net_conf.tx_queue_size > VIRTQUEUE_MAX_SIZE ||
> > > +        n->net_conf.tx_queue_size > VIRTQUEUE_LEGACY_MAX_SIZE ||
> > > 
> > >          !is_power_of_2(n->net_conf.tx_queue_size)) {
> > >          error_setg(errp, "Invalid tx_queue_size (= %" PRIu16 "), "
> > >          
> > >                     "must be a power of 2 between %d and %d",
> > >                     n->net_conf.tx_queue_size,
> > >                     VIRTIO_NET_TX_QUEUE_MIN_SIZE,
> > > 
> > > -                   VIRTQUEUE_MAX_SIZE);
> > > +                   VIRTQUEUE_LEGACY_MAX_SIZE);
> > > 
> > >          virtio_cleanup(vdev);
> > >          return;
> > >      
> > >      }
> > > 
> > > diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
> > > index 5e5e657e1d..f204e8878a 100644
> > > --- a/hw/scsi/virtio-scsi.c
> > > +++ b/hw/scsi/virtio-scsi.c
> > > @@ -973,7 +973,7 @@ void virtio_scsi_common_realize(DeviceState *dev,
> > > 
> > >      int i;
> > >      
> > >      virtio_init(vdev, "virtio-scsi", VIRTIO_ID_SCSI,
> > > 
> > > -                sizeof(VirtIOSCSIConfig), VIRTQUEUE_MAX_SIZE);
> > > +                sizeof(VirtIOSCSIConfig), VIRTQUEUE_LEGACY_MAX_SIZE);
> > > 
> > >      if (s->conf.num_queues == VIRTIO_SCSI_AUTO_NUM_QUEUES) {
> > >      
> > >          s->conf.num_queues = 1;
> > > 
> > > diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
> > > index ae1672d667..decc5def39 100644
> > > --- a/hw/virtio/vhost-user-fs.c
> > > +++ b/hw/virtio/vhost-user-fs.c
> > > @@ -209,9 +209,9 @@ static void vuf_device_realize(DeviceState *dev, Error **errp)
> > > 
> > >          return;
> > >      
> > >      }
> > > 
> > > -    if (fs->conf.queue_size > VIRTQUEUE_MAX_SIZE) {
> > > +    if (fs->conf.queue_size > VIRTQUEUE_LEGACY_MAX_SIZE) {
> > > 
> > >          error_setg(errp, "queue-size property must be %u or smaller",
> > > 
> > > -                   VIRTQUEUE_MAX_SIZE);
> > > +                   VIRTQUEUE_LEGACY_MAX_SIZE);
> > > 
> > >          return;
> > >      
> > >      }
> > > 
> > > @@ -220,7 +220,7 @@ static void vuf_device_realize(DeviceState *dev, Error **errp)
> > > 
> > >      }
> > >      
> > >      virtio_init(vdev, "vhost-user-fs", VIRTIO_ID_FS,
> > > 
> > > -                sizeof(struct virtio_fs_config), VIRTQUEUE_MAX_SIZE);
> > > +                sizeof(struct virtio_fs_config), VIRTQUEUE_LEGACY_MAX_SIZE);
> > > 
> > >      /* Hiprio queue */
> > >      fs->hiprio_vq = virtio_add_queue(vdev, fs->conf.queue_size,
> > >      vuf_handle_output);> 
> > > diff --git a/hw/virtio/vhost-user-i2c.c b/hw/virtio/vhost-user-i2c.c
> > > index eeb1d8853a..b248ddbe93 100644
> > > --- a/hw/virtio/vhost-user-i2c.c
> > > +++ b/hw/virtio/vhost-user-i2c.c
> > > @@ -221,7 +221,7 @@ static void vu_i2c_device_realize(DeviceState *dev, Error **errp)
> > > 
> > >      }
> > >      
> > >      virtio_init(vdev, "vhost-user-i2c", VIRTIO_ID_I2C_ADAPTER, 0,
> > > 
> > > -                VIRTQUEUE_MAX_SIZE);
> > > +                VIRTQUEUE_LEGACY_MAX_SIZE);
> > > 
> > >      i2c->vhost_dev.nvqs = 1;
> > >      i2c->vq = virtio_add_queue(vdev, 4, vu_i2c_handle_output);
> > > 
> > > diff --git a/hw/virtio/vhost-vsock-common.c b/hw/virtio/vhost-vsock-common.c
> > > index a81fa884a8..73e6b72bba 100644
> > > --- a/hw/virtio/vhost-vsock-common.c
> > > +++ b/hw/virtio/vhost-vsock-common.c
> > > @@ -201,7 +201,7 @@ void vhost_vsock_common_realize(VirtIODevice *vdev, const char *name)
> > > 
> > >      VHostVSockCommon *vvc = VHOST_VSOCK_COMMON(vdev);
> > >      
> > >      virtio_init(vdev, name, VIRTIO_ID_VSOCK,
> > > 
> > > -                sizeof(struct virtio_vsock_config), VIRTQUEUE_MAX_SIZE);
> > > +                sizeof(struct virtio_vsock_config), VIRTQUEUE_LEGACY_MAX_SIZE);
> > > 
> > >      /* Receive and transmit queues belong to vhost */
> > >      vvc->recv_vq = virtio_add_queue(vdev, VHOST_VSOCK_QUEUE_SIZE,
> > > 
> > > diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> > > index 067c73223d..890fb15ed3 100644
> > > --- a/hw/virtio/virtio-balloon.c
> > > +++ b/hw/virtio/virtio-balloon.c
> > > @@ -886,7 +886,7 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
> > > 
> > >      int ret;
> > >      
> > >      virtio_init(vdev, "virtio-balloon", VIRTIO_ID_BALLOON,
> > > 
> > > -                virtio_balloon_config_size(s), VIRTQUEUE_MAX_SIZE);
> > > +                virtio_balloon_config_size(s), VIRTQUEUE_LEGACY_MAX_SIZE);
> > > 
> > >      ret = qemu_add_balloon_handler(virtio_balloon_to_target,
> > >      
> > >                                     virtio_balloon_stat, s);
> > > 
> > > diff --git a/hw/virtio/virtio-crypto.c b/hw/virtio/virtio-crypto.c
> > > index 1e70d4d2a8..e13b6091d6 100644
> > > --- a/hw/virtio/virtio-crypto.c
> > > +++ b/hw/virtio/virtio-crypto.c
> > > @@ -811,7 +811,7 @@ static void virtio_crypto_device_realize(DeviceState *dev, Error **errp)
> > > 
> > >      }
> > >      
> > >      virtio_init(vdev, "virtio-crypto", VIRTIO_ID_CRYPTO,
> > >      vcrypto->config_size,
> > > 
> > > -                VIRTQUEUE_MAX_SIZE);
> > > +                VIRTQUEUE_LEGACY_MAX_SIZE);
> > > 
> > >      vcrypto->curr_queues = 1;
> > >      vcrypto->vqs = g_malloc0(sizeof(VirtIOCryptoQueue) *
> > >      vcrypto->max_queues);
> > >      for (i = 0; i < vcrypto->max_queues; i++) {
> > > 
> > > diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
> > > index ca360e74eb..845df78842 100644
> > > --- a/hw/virtio/virtio-iommu.c
> > > +++ b/hw/virtio/virtio-iommu.c
> > > @@ -974,7 +974,7 @@ static void virtio_iommu_device_realize(DeviceState *dev, Error **errp)
> > > 
> > >      VirtIOIOMMU *s = VIRTIO_IOMMU(dev);
> > >      
> > >      virtio_init(vdev, "virtio-iommu", VIRTIO_ID_IOMMU,
> > > 
> > > -                sizeof(struct virtio_iommu_config), VIRTQUEUE_MAX_SIZE);
> > > +                sizeof(struct virtio_iommu_config), VIRTQUEUE_LEGACY_MAX_SIZE);
> > > 
> > >      memset(s->iommu_pcibus_by_bus_num, 0,
> > >      sizeof(s->iommu_pcibus_by_bus_num));
> > > 
> > > diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
> > > index 1d9d01b871..7a39550cde 100644
> > > --- a/hw/virtio/virtio-mem.c
> > > +++ b/hw/virtio/virtio-mem.c
> > > @@ -738,7 +738,7 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
> > > 
> > >      vmem->bitmap = bitmap_new(vmem->bitmap_size);
> > >      
> > >      virtio_init(vdev, TYPE_VIRTIO_MEM, VIRTIO_ID_MEM,
> > > 
> > > -                sizeof(struct virtio_mem_config), VIRTQUEUE_MAX_SIZE);
> > > +                sizeof(struct virtio_mem_config), VIRTQUEUE_LEGACY_MAX_SIZE);
> > > 
> > >      vmem->vq = virtio_add_queue(vdev, 128, virtio_mem_handle_request);
> > >      
> > >      host_memory_backend_set_mapped(vmem->memdev, true);
> > > 
> > > diff --git a/hw/virtio/virtio-mmio.c b/hw/virtio/virtio-mmio.c
> > > index 7b3ebca178..ae0cc223e9 100644
> > > --- a/hw/virtio/virtio-mmio.c
> > > +++ b/hw/virtio/virtio-mmio.c
> > > @@ -174,7 +174,7 @@ static uint64_t virtio_mmio_read(void *opaque, hwaddr offset, unsigned size)
> > > 
> > >          if (!virtio_queue_get_num(vdev, vdev->queue_sel)) {
> > >          
> > >              return 0;
> > >          
> > >          }
> > > 
> > > -        return VIRTQUEUE_MAX_SIZE;
> > > +        return VIRTQUEUE_LEGACY_MAX_SIZE;
> > > 
> > >      case VIRTIO_MMIO_QUEUE_PFN:
> > >          if (!proxy->legacy) {
> > >          
> > >              qemu_log_mask(LOG_GUEST_ERROR,
> > > 
> > > @@ -348,7 +348,7 @@ static void virtio_mmio_write(void *opaque, hwaddr offset, uint64_t value,
> > > 
> > >          }
> > >          break;
> > >      
> > >      case VIRTIO_MMIO_QUEUE_NUM:
> > > -        trace_virtio_mmio_queue_write(value, VIRTQUEUE_MAX_SIZE);
> > > +        trace_virtio_mmio_queue_write(value, VIRTQUEUE_LEGACY_MAX_SIZE);
> > > 
> > >          virtio_queue_set_num(vdev, vdev->queue_sel, value);
> > >          
> > >          if (proxy->legacy) {
> > > 
> > > diff --git a/hw/virtio/virtio-pmem.c b/hw/virtio/virtio-pmem.c
> > > index 82b54b00c5..5f4d375b58 100644
> > > --- a/hw/virtio/virtio-pmem.c
> > > +++ b/hw/virtio/virtio-pmem.c
> > > @@ -124,7 +124,7 @@ static void virtio_pmem_realize(DeviceState *dev, Error **errp)
> > > 
> > >      host_memory_backend_set_mapped(pmem->memdev, true);
> > >      virtio_init(vdev, TYPE_VIRTIO_PMEM, VIRTIO_ID_PMEM,
> > > 
> > > -                sizeof(struct virtio_pmem_config), VIRTQUEUE_MAX_SIZE);
> > > +                sizeof(struct virtio_pmem_config), VIRTQUEUE_LEGACY_MAX_SIZE);
> > > 
> > >      pmem->rq_vq = virtio_add_queue(vdev, 128, virtio_pmem_flush);
> > >  
> > >  }
> > > 
> > > diff --git a/hw/virtio/virtio-rng.c b/hw/virtio/virtio-rng.c
> > > index 0e91d60106..ab075b22b6 100644
> > > --- a/hw/virtio/virtio-rng.c
> > > +++ b/hw/virtio/virtio-rng.c
> > > @@ -215,7 +215,8 @@ static void virtio_rng_device_realize(DeviceState *dev, Error **errp)
> > > 
> > >          return;
> > >      
> > >      }
> > > 
> > > -    virtio_init(vdev, "virtio-rng", VIRTIO_ID_RNG, 0, VIRTQUEUE_MAX_SIZE);
> > > +    virtio_init(vdev, "virtio-rng", VIRTIO_ID_RNG, 0,
> > > +                VIRTQUEUE_LEGACY_MAX_SIZE);
> > > 
> > >      vrng->vq = virtio_add_queue(vdev, 8, handle_input);
> > >      vrng->quota_remaining = vrng->conf.max_bytes;
> > > 
> > > diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> > > index a37d1f7d52..fe0f13266b 100644
> > > --- a/include/hw/virtio/virtio.h
> > > +++ b/include/hw/virtio/virtio.h
> > > @@ -48,7 +48,25 @@ size_t virtio_feature_get_config_size(const VirtIOFeature *features,
> > > 
> > >  typedef struct VirtQueue VirtQueue;
> > > 
> > > -#define VIRTQUEUE_MAX_SIZE 1024
> > > +/*
> > > + * This is meant as transitional measure for VIRTQUEUE_MAX_SIZE's old value
> > > + * of 1024 to its new value of 32768. On the long-term virtio users should
> > > + * either switch to VIRTQUEUE_MAX_SIZE, provided they support 32768,
> > > + * otherwise they should replace this macro on their side with an
> > > + * appropriate value actually supported by them.
> > > + *
> > > + * Once all virtio users switched, this macro will be removed.
> > > + */
> > > +#define VIRTQUEUE_LEGACY_MAX_SIZE 1024
> > > +
> > > +/*
> > > + * Reflects the absolute theoretical maximum queue size (in amount of pages)
> > > + * ever possible, which is actually the maximum queue size allowed by the
> > > + * virtio protocol. This value therefore construes the maximum transfer size
> > > + * possible with virtio (multiplied by system dependent PAGE_SIZE); assuming
> > > + * a typical page size of 4k this would be a maximum transfer size of 128M.
> > > + */
> > > +#define VIRTQUEUE_MAX_SIZE 32768
> > > 
> > >  typedef struct VirtQueueElement
> > >  {
> 


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
  2021-10-05 11:19       ` [Virtio-fs] " Michael S. Tsirkin
@ 2021-10-05 11:43         ` Christian Schoenebeck
  -1 siblings, 0 replies; 97+ messages in thread
From: Christian Schoenebeck @ 2021-10-05 11:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, Kevin Wolf, Laurent Vivier, qemu-block,
	David Hildenbrand, Jason Wang, Amit Shah, Greg Kurz, virtio-fs,
	Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Fam Zheng, Raphael Norwitz,
	Dr. David Alan Gilbert

On Dienstag, 5. Oktober 2021 13:19:43 CEST Michael S. Tsirkin wrote:
> On Tue, Oct 05, 2021 at 01:10:56PM +0200, Christian Schoenebeck wrote:
> > On Dienstag, 5. Oktober 2021 09:38:53 CEST David Hildenbrand wrote:
> > > On 04.10.21 21:38, Christian Schoenebeck wrote:
> > > > At the moment the maximum transfer size with virtio is limited to 4M
> > > > (1024 * PAGE_SIZE). This series raises this limit to its maximum
> > > > theoretical possible transfer size of 128M (32k pages) according to
> > > > the
> > > > virtio specs:
> > > > 
> > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.h
> > > > tml#
> > > > x1-240006
> > > 
> > > I'm missing the "why do we care". Can you comment on that?
> > 
> > Primary motivation is the possibility of improved performance, e.g. in
> > case of 9pfs, people can raise the maximum transfer size with the Linux
> > 9p client's 'msize' option on guest side (and only on guest side
> > actually). If guest performs large chunk I/O, e.g. consider something
> > "useful" like this one on> 
> > guest side:
> >   time cat large_file_on_9pfs.dat > /dev/null
> > 
> > Then there is a noticeable performance increase with higher transfer size
> > values. That performance gain is continuous with rising transfer size
> > values, but the performance increase obviously shrinks with rising
> > transfer sizes as well, as with similar concepts in general like cache
> > sizes, etc.
> > 
> > Then a secondary motivation is described in reason (2) of patch 2: if the
> > transfer size is configurable on guest side (like it is the case with the
> > 9pfs 'msize' option), then there is the unpleasant side effect that the
> > current virtio limit of 4M is invisible to guest; as this value of 4M is
> > simply an arbitrary limit set on QEMU side in the past (probably just
> > implementation motivated on QEMU side at that point), i.e. it is not a
> > limit specified by the virtio protocol,
> 
> According to the spec it's specified, sure enough: vq size limits the
> size of indirect descriptors too.

In the virtio specs the only hard limit that I see is the aforementioned 32k:

"Queue Size corresponds to the maximum number of buffers in the virtqueue. 
Queue Size value is always a power of 2. The maximum Queue Size value is 
32768. This value is specified in a bus-specific way."

> However, ever since commit 44ed8089e991a60d614abe0ee4b9057a28b364e4 we
> do not enforce it in the driver ...

Then there is the current queue size (which is probably what you mean), which
is transmitted to guest with whatever value virtio was initialized with.

In case of the 9p client however, the virtio queue size is first initialized
with some hard coded value when the 9p driver is loaded on Linux kernel guest
side. When some 9pfs share is mounted later on by the guest, the mount may
include the 'msize' option to raise the transfer size, and that's the problem:
I don't see any way for the guest to detect that it cannot go above that 4M
transfer size right now.

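Just to illustrate the arithmetic behind those numbers (the macro names below
are only placeholders for this mail, not code from the series):

    /* maximum transfer size = queue size * PAGE_SIZE */
    #define PAGE_SIZE              4096    /* typical page size */
    #define QUEUE_SIZE_QEMU_TODAY  1024    /*  1024 * 4096 =   4 MiB */
    #define QUEUE_SIZE_VIRTIO_MAX  32768   /* 32768 * 4096 = 128 MiB */

So any 'msize' above 4M currently runs into that invisible ceiling.
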
> > nor is this limit made known to guest via the virtio protocol
> > at all. The consequence with 9pfs would be that if user tries to go higher
> > than 4M, then the system would simply hang with this QEMU error:
> >   virtio: too many write descriptors in indirect table
> > 
> > Now whether this is an issue or not for individual virtio users depends on
> > whether the individual virtio user already had its own limitation <= 4M
> > enforced on its side.
> > 
> > Best regards,
> > Christian Schoenebeck




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v2 2/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
  2021-10-05 11:24         ` [Virtio-fs] " Michael S. Tsirkin
@ 2021-10-05 12:01           ` Christian Schoenebeck
  -1 siblings, 0 replies; 97+ messages in thread
From: Christian Schoenebeck @ 2021-10-05 12:01 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, Kevin Wolf, Laurent Vivier, qemu-block,
	Amit Shah, Jason Wang, David Hildenbrand, Raphael Norwitz,
	virtio-fs, Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Stefan Hajnoczi, Paolo Bonzini,
	Marc-André Lureau, Fam Zheng, Dr. David Alan Gilbert,
	Greg Kurz

On Dienstag, 5. Oktober 2021 13:24:36 CEST Michael S. Tsirkin wrote:
> On Tue, Oct 05, 2021 at 01:17:59PM +0200, Christian Schoenebeck wrote:
> > On Dienstag, 5. Oktober 2021 09:16:07 CEST Michael S. Tsirkin wrote:
> > > On Mon, Oct 04, 2021 at 09:38:08PM +0200, Christian Schoenebeck wrote:
> > > > Raise the maximum possible virtio transfer size to 128M
> > > > (more precisely: 32k * PAGE_SIZE). See previous commit for a
> > > > more detailed explanation for the reasons of this change.
> > > > 
> > > > For not breaking any virtio user, all virtio users transition
> > > > to using the new macro VIRTQUEUE_LEGACY_MAX_SIZE instead of
> > > > VIRTQUEUE_MAX_SIZE, so they are all still using the old value
> > > > of 1k with this commit.
> > > > 
> > > > On the long-term, each virtio user should subsequently either
> > > > switch from VIRTQUEUE_LEGACY_MAX_SIZE to VIRTQUEUE_MAX_SIZE
> > > > after checking that they support the new value of 32k, or
> > > > otherwise they should replace the VIRTQUEUE_LEGACY_MAX_SIZE
> > > > macro by an appropriate value supported by them.
> > > > 
> > > > Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
> > > 
> > > I don't think we need this. Legacy isn't descriptive either.  Just leave
> > > VIRTQUEUE_MAX_SIZE alone, and come up with a new name for 32k.
> > 
> > Does this mean you disagree that in the long term all virtio users should
> > transition either to the new upper limit of 32k max queue size or
> > introduce their own limit at their end?
> 
> depends. if 9pfs is the only one unhappy, we can keep 4k as
> the default. it's sure a safe one.
> 
> > Independent of the name, and I would appreciate suggestions for an
> > adequate macro name here, I still think this new limit should be placed in
> > the shared virtio.h file, because this value is not something invented on
> > the virtio user side. It rather reflects the theoretical upper limit
> > possible with the virtio protocol, which is and will be common for all
> > virtio users.
> We can add this to the linux uapi headers, sure.

Well, then I'll wait a few days, and if nobody else cares about this issue,
I'll just hard-code 32k exclusively on the 9pfs side in v3 for now, and that's
it.

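Roughly, what I have in mind for that v3 fallback in
hw/9pfs/virtio-9p-device.c, based on the virtio_init() signature from patch 1
(just a sketch of the idea, not a tested change):

    v->config_size = sizeof(struct virtio_9p_config) + strlen(s->fsconf.tag);
    /* opt in to the spec maximum for virtio-9p only, no shared macro */
    virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size, 32768);
    v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
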
Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v2 1/3] virtio: turn VIRTQUEUE_MAX_SIZE into a variable
  2021-10-04 19:38   ` [Virtio-fs] " Christian Schoenebeck
@ 2021-10-05 12:45     ` Stefan Hajnoczi
  -1 siblings, 0 replies; 97+ messages in thread
From: Stefan Hajnoczi @ 2021-10-05 12:45 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	qemu-devel, Jason Wang, Amit Shah, David Hildenbrand, Greg Kurz,
	Raphael Norwitz, virtio-fs, Eric Auger, Hanna Reitz,
	Gonglei (Arei),
	Gerd Hoffmann, Paolo Bonzini, Marc-André Lureau, Fam Zheng,
	Dr. David Alan Gilbert

On Mon, Oct 04, 2021 at 09:38:04PM +0200, Christian Schoenebeck wrote:
> Refactor VIRTQUEUE_MAX_SIZE to effectively become a runtime
> variable per virtio user.

virtio user == virtio device model?

> 
> Reasons:
> 
> (1) VIRTQUEUE_MAX_SIZE should reflect the absolute theoretical
>     maximum queue size possible. Which is actually the maximum
>     queue size allowed by the virtio protocol. The appropriate
>     value for VIRTQUEUE_MAX_SIZE would therefore be 32768:
> 
>     https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-240006
> 
>     Apparently VIRTQUEUE_MAX_SIZE was instead defined with a
>     more or less arbitrary value of 1024 in the past, which
>     limits the maximum transfer size with virtio to 4M
>     (more precise: 1024 * PAGE_SIZE, with the latter typically
>     being 4k).

Being equal to IOV_MAX is a likely reason. Buffers with more iovecs than
that cannot be passed to host system calls (sendmsg(2), pwritev(2),
etc).
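
(For illustration, the kind of capping a device implementation needs once
requests may carry more than IOV_MAX segments -- a generic sketch, not
existing QEMU code, and it assumes each writev() slice is written in full:)

    #include <limits.h>
    #include <sys/uio.h>

    /* Write a long iovec array in IOV_MAX-sized slices. */
    static ssize_t writev_all(int fd, struct iovec *iov, int iovcnt)
    {
        ssize_t total = 0;
        while (iovcnt > 0) {
            int n = iovcnt > IOV_MAX ? IOV_MAX : iovcnt;
            ssize_t r = writev(fd, iov, n);
            if (r < 0) {
                return -1;
            }
            total += r;
            iov += n;
            iovcnt -= n;
        }
        return total;
    }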

> (2) Additionally the current value of 1024 poses a hidden limit,
>     invisible to guest, which causes a system hang with the
>     following QEMU error if guest tries to exceed it:
> 
>     virtio: too many write descriptors in indirect table

I don't understand this point. 2.6.5 The Virtqueue Descriptor Table says:

  The number of descriptors in the table is defined by the queue size for this virtqueue: this is the maximum possible descriptor chain length.

and 2.6.5.3.1 Driver Requirements: Indirect Descriptors says:

  A driver MUST NOT create a descriptor chain longer than the Queue Size of the device.

Do you mean a broken/malicious guest driver that is violating the spec?
That's not a hidden limit, it's defined by the spec.

> (3) Unfortunately not all virtio users in QEMU would currently
>     work correctly with the new value of 32768.
> 
> So let's turn this hard coded global value into a runtime
> variable as a first step in this commit, configurable for each
> virtio user by passing a corresponding value with virtio_init()
> call.

virtio_add_queue() already has an int queue_size argument, why isn't
that enough to deal with the maximum queue size? There's probably a good
reason for it, but please include it in the commit description.
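
For reference, the prototype I mean (from include/hw/virtio/virtio.h; quoting
from memory, so double-check the exact signature):

    VirtQueue *virtio_add_queue(VirtIODevice *vdev, int queue_size,
                                VirtIOHandleOutput handle_output);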

> 
> Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
> ---
>  hw/9pfs/virtio-9p-device.c     |  3 ++-
>  hw/block/vhost-user-blk.c      |  2 +-
>  hw/block/virtio-blk.c          |  3 ++-
>  hw/char/virtio-serial-bus.c    |  2 +-
>  hw/display/virtio-gpu-base.c   |  2 +-
>  hw/input/virtio-input.c        |  2 +-
>  hw/net/virtio-net.c            | 15 ++++++++-------
>  hw/scsi/virtio-scsi.c          |  2 +-
>  hw/virtio/vhost-user-fs.c      |  2 +-
>  hw/virtio/vhost-user-i2c.c     |  3 ++-
>  hw/virtio/vhost-vsock-common.c |  2 +-
>  hw/virtio/virtio-balloon.c     |  4 ++--
>  hw/virtio/virtio-crypto.c      |  3 ++-
>  hw/virtio/virtio-iommu.c       |  2 +-
>  hw/virtio/virtio-mem.c         |  2 +-
>  hw/virtio/virtio-pmem.c        |  2 +-
>  hw/virtio/virtio-rng.c         |  2 +-
>  hw/virtio/virtio.c             | 35 +++++++++++++++++++++++-----------
>  include/hw/virtio/virtio.h     |  5 ++++-
>  19 files changed, 57 insertions(+), 36 deletions(-)
> 
> diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
> index 54ee93b71f..cd5d95dd51 100644
> --- a/hw/9pfs/virtio-9p-device.c
> +++ b/hw/9pfs/virtio-9p-device.c
> @@ -216,7 +216,8 @@ static void virtio_9p_device_realize(DeviceState *dev, Error **errp)
>      }
>  
>      v->config_size = sizeof(struct virtio_9p_config) + strlen(s->fsconf.tag);
> -    virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size);
> +    virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
> +                VIRTQUEUE_MAX_SIZE);
>      v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
>  }
>  
> diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
> index ba13cb87e5..336f56705c 100644
> --- a/hw/block/vhost-user-blk.c
> +++ b/hw/block/vhost-user-blk.c
> @@ -491,7 +491,7 @@ static void vhost_user_blk_device_realize(DeviceState *dev, Error **errp)
>      }
>  
>      virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK,
> -                sizeof(struct virtio_blk_config));
> +                sizeof(struct virtio_blk_config), VIRTQUEUE_MAX_SIZE);
>  
>      s->virtqs = g_new(VirtQueue *, s->num_queues);
>      for (i = 0; i < s->num_queues; i++) {
> diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
> index f139cd7cc9..9c0f46815c 100644
> --- a/hw/block/virtio-blk.c
> +++ b/hw/block/virtio-blk.c
> @@ -1213,7 +1213,8 @@ static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
>  
>      virtio_blk_set_config_size(s, s->host_features);
>  
> -    virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK, s->config_size);
> +    virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK, s->config_size,
> +                VIRTQUEUE_MAX_SIZE);
>  
>      s->blk = conf->conf.blk;
>      s->rq = NULL;
> diff --git a/hw/char/virtio-serial-bus.c b/hw/char/virtio-serial-bus.c
> index f01ec2137c..9ad9111115 100644
> --- a/hw/char/virtio-serial-bus.c
> +++ b/hw/char/virtio-serial-bus.c
> @@ -1045,7 +1045,7 @@ static void virtio_serial_device_realize(DeviceState *dev, Error **errp)
>          config_size = offsetof(struct virtio_console_config, emerg_wr);
>      }
>      virtio_init(vdev, "virtio-serial", VIRTIO_ID_CONSOLE,
> -                config_size);
> +                config_size, VIRTQUEUE_MAX_SIZE);
>  
>      /* Spawn a new virtio-serial bus on which the ports will ride as devices */
>      qbus_init(&vser->bus, sizeof(vser->bus), TYPE_VIRTIO_SERIAL_BUS,
> diff --git a/hw/display/virtio-gpu-base.c b/hw/display/virtio-gpu-base.c
> index c8da4806e0..20b06a7adf 100644
> --- a/hw/display/virtio-gpu-base.c
> +++ b/hw/display/virtio-gpu-base.c
> @@ -171,7 +171,7 @@ virtio_gpu_base_device_realize(DeviceState *qdev,
>  
>      g->virtio_config.num_scanouts = cpu_to_le32(g->conf.max_outputs);
>      virtio_init(VIRTIO_DEVICE(g), "virtio-gpu", VIRTIO_ID_GPU,
> -                sizeof(struct virtio_gpu_config));
> +                sizeof(struct virtio_gpu_config), VIRTQUEUE_MAX_SIZE);
>  
>      if (virtio_gpu_virgl_enabled(g->conf)) {
>          /* use larger control queue in 3d mode */
> diff --git a/hw/input/virtio-input.c b/hw/input/virtio-input.c
> index 54bcb46c74..345eb2cce7 100644
> --- a/hw/input/virtio-input.c
> +++ b/hw/input/virtio-input.c
> @@ -258,7 +258,7 @@ static void virtio_input_device_realize(DeviceState *dev, Error **errp)
>      assert(vinput->cfg_size <= sizeof(virtio_input_config));
>  
>      virtio_init(vdev, "virtio-input", VIRTIO_ID_INPUT,
> -                vinput->cfg_size);
> +                vinput->cfg_size, VIRTQUEUE_MAX_SIZE);
>      vinput->evt = virtio_add_queue(vdev, 64, virtio_input_handle_evt);
>      vinput->sts = virtio_add_queue(vdev, 64, virtio_input_handle_sts);
>  }
> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> index f205331dcf..f74b5f6268 100644
> --- a/hw/net/virtio-net.c
> +++ b/hw/net/virtio-net.c
> @@ -1746,9 +1746,9 @@ static ssize_t virtio_net_receive_rcu(NetClientState *nc, const uint8_t *buf,
>      VirtIONet *n = qemu_get_nic_opaque(nc);
>      VirtIONetQueue *q = virtio_net_get_subqueue(nc);
>      VirtIODevice *vdev = VIRTIO_DEVICE(n);
> -    VirtQueueElement *elems[VIRTQUEUE_MAX_SIZE];
> -    size_t lens[VIRTQUEUE_MAX_SIZE];
> -    struct iovec mhdr_sg[VIRTQUEUE_MAX_SIZE];
> +    VirtQueueElement *elems[vdev->queue_max_size];
> +    size_t lens[vdev->queue_max_size];
> +    struct iovec mhdr_sg[vdev->queue_max_size];

Can you make this value per-vq instead of per-vdev since virtqueues can
have different queue sizes?

The same applies to the rest of this patch. Anything using
vdev->queue_max_size should probably use vq->vring.num instead.
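
For example (a rough sketch of the direction, not a tested patch),
virtio_net_receive_rcu() could size its temporaries from the queue the
buffers actually come from:

    VirtQueue *vq = q->rx_vq;
    int num = virtio_queue_get_num(vdev, virtio_get_queue_index(vq));
    VirtQueueElement *elems[num];
    size_t lens[num];
    struct iovec mhdr_sg[num];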


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v2 1/3] virtio: turn VIRTQUEUE_MAX_SIZE into a variable
  2021-10-05 12:45     ` [Virtio-fs] " Stefan Hajnoczi
@ 2021-10-05 13:15       ` Christian Schoenebeck
  -1 siblings, 0 replies; 97+ messages in thread
From: Christian Schoenebeck @ 2021-10-05 13:15 UTC (permalink / raw)
  To: qemu-devel
  Cc: Stefan Hajnoczi, Kevin Wolf, Laurent Vivier, qemu-block,
	Michael S. Tsirkin, Jason Wang, Amit Shah, David Hildenbrand,
	Greg Kurz, Raphael Norwitz, virtio-fs, Eric Auger, Hanna Reitz,
	Gonglei (Arei),
	Gerd Hoffmann, Paolo Bonzini, Marc-André Lureau, Fam Zheng,
	Dr. David Alan Gilbert

On Dienstag, 5. Oktober 2021 14:45:56 CEST Stefan Hajnoczi wrote:
> On Mon, Oct 04, 2021 at 09:38:04PM +0200, Christian Schoenebeck wrote:
> > Refactor VIRTQUEUE_MAX_SIZE to effectively become a runtime
> > variable per virtio user.
> 
> virtio user == virtio device model?

Yes

> > Reasons:
> > 
> > (1) VIRTQUEUE_MAX_SIZE should reflect the absolute theoretical
> > 
> >     maximum queue size possible. Which is actually the maximum
> >     queue size allowed by the virtio protocol. The appropriate
> >     value for VIRTQUEUE_MAX_SIZE would therefore be 32768:
> >     
> >     https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.h
> >     tml#x1-240006
> >     
> >     Apparently VIRTQUEUE_MAX_SIZE was instead defined with a
> >     more or less arbitrary value of 1024 in the past, which
> >     limits the maximum transfer size with virtio to 4M
> >     (more precise: 1024 * PAGE_SIZE, with the latter typically
> >     being 4k).
> 
> Being equal to IOV_MAX is a likely reason. Buffers with more iovecs than
> that cannot be passed to host system calls (sendmsg(2), pwritev(2),
> etc).

Yes, that's use case dependent. Hence the solution to opt-in if it is desired 
and feasible.

> > (2) Additionally the current value of 1024 poses a hidden limit,
> > 
> >     invisible to guest, which causes a system hang with the
> >     following QEMU error if guest tries to exceed it:
> >     
> >     virtio: too many write descriptors in indirect table
> 
> I don't understand this point. 2.6.5 The Virtqueue Descriptor Table says:
> 
>   The number of descriptors in the table is defined by the queue size for
> this virtqueue: this is the maximum possible descriptor chain length.
> 
> and 2.6.5.3.1 Driver Requirements: Indirect Descriptors says:
> 
>   A driver MUST NOT create a descriptor chain longer than the Queue Size of
> the device.
> 
> Do you mean a broken/malicious guest driver that is violating the spec?
> That's not a hidden limit, it's defined by the spec.

https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00781.html
https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00788.html

You can already go beyond that queue size at runtime with the indirection 
table. The only actual limit is the currently hard coded value of 1k pages. 
Hence the suggestion to turn that into a variable.

> > (3) Unfortunately not all virtio users in QEMU would currently
> > 
> >     work correctly with the new value of 32768.
> > 
> > So let's turn this hard coded global value into a runtime
> > variable as a first step in this commit, configurable for each
> > virtio user by passing a corresponding value with virtio_init()
> > call.
> 
> virtio_add_queue() already has an int queue_size argument, why isn't
> that enough to deal with the maximum queue size? There's probably a good
> reason for it, but please include it in the commit description.
[...]
> Can you make this value per-vq instead of per-vdev since virtqueues can
> have different queue sizes?
> 
> The same applies to the rest of this patch. Anything using
> vdev->queue_max_size should probably use vq->vring.num instead.

I would like to avoid that and keep it per device. The maximum size stored 
there is the maximum size supported by the virtio user (or virtio device model,
however you want to call it). So that's really a limit per device, not per 
queue, as no queue of the device would ever exceed that limit.

Plus a lot more code would need to be refactored, which I think is 
unnecessary.

Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v2 1/3] virtio: turn VIRTQUEUE_MAX_SIZE into a variable
  2021-10-05 13:15       ` [Virtio-fs] " Christian Schoenebeck
@ 2021-10-05 15:10         ` Stefan Hajnoczi
  -1 siblings, 0 replies; 97+ messages in thread
From: Stefan Hajnoczi @ 2021-10-05 15:10 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	Jason Wang, Amit Shah, David Hildenbrand, qemu-devel, Greg Kurz,
	virtio-fs, Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Marc-André Lureau, Paolo Bonzini, Fam Zheng,
	Raphael Norwitz, Dr. David Alan Gilbert

On Tue, Oct 05, 2021 at 03:15:26PM +0200, Christian Schoenebeck wrote:
> On Dienstag, 5. Oktober 2021 14:45:56 CEST Stefan Hajnoczi wrote:
> > On Mon, Oct 04, 2021 at 09:38:04PM +0200, Christian Schoenebeck wrote:
> > > Refactor VIRTQUEUE_MAX_SIZE to effectively become a runtime
> > > variable per virtio user.
> > 
> > virtio user == virtio device model?
> 
> Yes
> 
> > > Reasons:
> > > 
> > > (1) VIRTQUEUE_MAX_SIZE should reflect the absolute theoretical
> > > 
> > >     maximum queue size possible. Which is actually the maximum
> > >     queue size allowed by the virtio protocol. The appropriate
> > >     value for VIRTQUEUE_MAX_SIZE would therefore be 32768:
> > >     
> > >     https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.h
> > >     tml#x1-240006
> > >     
> > >     Apparently VIRTQUEUE_MAX_SIZE was instead defined with a
> > >     more or less arbitrary value of 1024 in the past, which
> > >     limits the maximum transfer size with virtio to 4M
> > >     (more precise: 1024 * PAGE_SIZE, with the latter typically
> > >     being 4k).
> > 
> > Being equal to IOV_MAX is a likely reason. Buffers with more iovecs than
> > that cannot be passed to host system calls (sendmsg(2), pwritev(2),
> > etc).
> 
> Yes, that's use case dependent. Hence the solution to opt-in if it is desired 
> and feasible.
> 
> > > (2) Additionally the current value of 1024 poses a hidden limit,
> > > 
> > >     invisible to guest, which causes a system hang with the
> > >     following QEMU error if guest tries to exceed it:
> > >     
> > >     virtio: too many write descriptors in indirect table
> > 
> > I don't understand this point. 2.6.5 The Virtqueue Descriptor Table says:
> > 
> >   The number of descriptors in the table is defined by the queue size for
> > this virtqueue: this is the maximum possible descriptor chain length.
> > 
> > and 2.6.5.3.1 Driver Requirements: Indirect Descriptors says:
> > 
> >   A driver MUST NOT create a descriptor chain longer than the Queue Size of
> > the device.
> > 
> > Do you mean a broken/malicious guest driver that is violating the spec?
> > That's not a hidden limit, it's defined by the spec.
> 
> https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00781.html
> https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00788.html
> 
> You can already go beyond that queue size at runtime with the indirection 
> table. The only actual limit is the currently hard coded value of 1k pages. 
> Hence the suggestion to turn that into a variable.

Exceeding Queue Size is a VIRTIO spec violation. Drivers that operate
outside the spec do so at their own risk. They may not be compatible
with all device implementations.

The limit is not hidden; it's Queue Size as defined by the spec :).

If you have a driver that is exceeding the limit, then please fix the
driver.

> > > (3) Unfortunately not all virtio users in QEMU would currently
> > > 
> > >     work correctly with the new value of 32768.
> > > 
> > > So let's turn this hard coded global value into a runtime
> > > variable as a first step in this commit, configurable for each
> > > virtio user by passing a corresponding value with virtio_init()
> > > call.
> > 
> > virtio_add_queue() already has an int queue_size argument, why isn't
> > that enough to deal with the maximum queue size? There's probably a good
> > reason for it, but please include it in the commit description.
> [...]
> > Can you make this value per-vq instead of per-vdev since virtqueues can
> > have different queue sizes?
> > 
> > The same applies to the rest of this patch. Anything using
> > vdev->queue_max_size should probably use vq->vring.num instead.
> 
> I would like to avoid that and keep it per device. The maximum size stored 
> there is the maximum size supported by the virtio user (or virtio device model,
> however you want to call it). So that's really a limit per device, not per 
> queue, as no queue of the device would ever exceed that limit.
>
> Plus a lot more code would need to be refactored, which I think is 
> unnecessary.

I'm against a per-device limit because it's a concept that cannot
accurately describe reality. Some devices have multiple classes of
virtqueues and they are sized differently, so a per-device limit is
insufficient. virtio-net has separate rx_queue_size and tx_queue_size
parameters (plus a control vq hardcoded to 64 descriptors).

The specification already gives us Queue Size (vring.num in QEMU). The
variable exists in QEMU and just needs to be used.

If per-vq limits require a lot of work, please describe why. I think
replacing the variable from this patch with virtio_queue_get_num()
should be fairly straightforward, but maybe I'm missing something? (If
you prefer VirtQueue *vq instead of the index-based
virtio_queue_get_num() API, you can introduce a virtqueue_get_num()
API.)

Stefan


^ permalink raw reply	[flat|nested] 97+ messages in thread


* Re: [PATCH v2 1/3] virtio: turn VIRTQUEUE_MAX_SIZE into a variable
  2021-10-05 15:10         ` [Virtio-fs] " Stefan Hajnoczi
@ 2021-10-05 16:32           ` Christian Schoenebeck
  -1 siblings, 0 replies; 97+ messages in thread
From: Christian Schoenebeck @ 2021-10-05 16:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Stefan Hajnoczi, Kevin Wolf, Laurent Vivier, qemu-block,
	Michael S. Tsirkin, Jason Wang, Amit Shah, David Hildenbrand,
	Greg Kurz, virtio-fs, Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Marc-André Lureau, Paolo Bonzini, Fam Zheng,
	Raphael Norwitz, Dr. David Alan Gilbert

On Tuesday, 5 October 2021 17:10:40 CEST Stefan Hajnoczi wrote:
> On Tue, Oct 05, 2021 at 03:15:26PM +0200, Christian Schoenebeck wrote:
> > On Tuesday, 5 October 2021 14:45:56 CEST Stefan Hajnoczi wrote:
> > > On Mon, Oct 04, 2021 at 09:38:04PM +0200, Christian Schoenebeck wrote:
> > > > Refactor VIRTQUEUE_MAX_SIZE to effectively become a runtime
> > > > variable per virtio user.
> > > 
> > > virtio user == virtio device model?
> > 
> > Yes
> > 
> > > > Reasons:
> > > > 
> > > > (1) VIRTQUEUE_MAX_SIZE should reflect the absolute theoretical
> > > > 
> > > >     maximum queue size possible. Which is actually the maximum
> > > >     queue size allowed by the virtio protocol. The appropriate
> > > >     value for VIRTQUEUE_MAX_SIZE would therefore be 32768:
> > > >     
> > > >     https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs
> > > >     01.h
> > > >     tml#x1-240006
> > > >     
> > > >     Apparently VIRTQUEUE_MAX_SIZE was instead defined with a
> > > >     more or less arbitrary value of 1024 in the past, which
> > > >     limits the maximum transfer size with virtio to 4M
> > > >     (more precise: 1024 * PAGE_SIZE, with the latter typically
> > > >     being 4k).
> > > 
> > > Being equal to IOV_MAX is a likely reason. Buffers with more iovecs than
> > > that cannot be passed to host system calls (sendmsg(2), pwritev(2),
> > > etc).
> > 
> > Yes, that's use case dependent. Hence the solution to opt-in if it is
> > desired and feasible.
> > 
> > > > (2) Additionally the current value of 1024 poses a hidden limit,
> > > > 
> > > >     invisible to guest, which causes a system hang with the
> > > >     following QEMU error if guest tries to exceed it:
> > > >     
> > > >     virtio: too many write descriptors in indirect table
> > > 
> > > I don't understand this point. 2.6.5 The Virtqueue Descriptor Table says:
> > >   The number of descriptors in the table is defined by the queue size
> > >   for
> > > 
> > > this virtqueue: this is the maximum possible descriptor chain length.
> > > 
> > > and 2.6.5.3.1 Driver Requirements: Indirect Descriptors says:
> > >   A driver MUST NOT create a descriptor chain longer than the Queue Size
> > >   of
> > > 
> > > the device.
> > > 
> > > Do you mean a broken/malicious guest driver that is violating the spec?
> > > That's not a hidden limit, it's defined by the spec.
> > 
> > https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00781.html
> > https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00788.html
> > 
> > You can already go beyond that queue size at runtime with the indirection
> > table. The only actual limit is the currently hard coded value of 1k
> > pages.
> > Hence the suggestion to turn that into a variable.
> 
> Exceeding Queue Size is a VIRTIO spec violation. Drivers that operate
> outside the spec do so at their own risk. They may not be compatible
> with all device implementations.

Yes, I am aware of that. And still, this is already being done in practice, 
and apparently not only by 9pfs.

> The limit is not hidden, it's Queue Size as defined by the spec :).
> 
> If you have a driver that is exceeding the limit, then please fix the
> driver.

I absolutely understand your position, but I hope you also understand that 
this violation of the spec is a theoretical issue; it is not a real-life 
problem right now, and due to lack of manpower I unfortunately have to 
prioritize real-life problems over theoretical ones ATM. Keep in mind that 
right now I am the only person actively working on 9pfs, I do this voluntarily 
whenever I find a free time slice, and I am not paid for it either.

I don't see any reasonable way with reasonable effort to do what you are 
asking for here in 9pfs, and Greg may correct me here if I am saying anything 
wrong. If you are seeing any specific real-life issue here, then please tell 
me which one; otherwise I have to postpone that "spec violation" issue.

There is still a long list of real problems that I need to hunt down in 9pfs; 
afterwards I can continue with theoretical ones if you want, but right now I 
simply can't, sorry.

> > > > (3) Unfortunately not all virtio users in QEMU would currently
> > > > 
> > > >     work correctly with the new value of 32768.
> > > > 
> > > > So let's turn this hard coded global value into a runtime
> > > > variable as a first step in this commit, configurable for each
> > > > virtio user by passing a corresponding value with virtio_init()
> > > > call.
> > > 
> > > virtio_add_queue() already has an int queue_size argument, why isn't
> > > that enough to deal with the maximum queue size? There's probably a good
> > > reason for it, but please include it in the commit description.
> > 
> > [...]
> > 
> > > Can you make this value per-vq instead of per-vdev since virtqueues can
> > > have different queue sizes?
> > > 
> > > The same applies to the rest of this patch. Anything using
> > > vdev->queue_max_size should probably use vq->vring.num instead.
> > 
> > I would like to avoid that and keep it per device. The maximum size stored
> > there is the maximum size supported by virtio user (or virtio device
> > model,
> > however you want to call it). So that's really a limit per device, not per
> > queue, as no queue of the device would ever exceed that limit.
> > 
> > Plus a lot more code would need to be refactored, which I think is
> > unnecessary.
> 
> I'm against a per-device limit because it's a concept that cannot
> accurately describe reality. Some devices have multiple classes of

It describes the current reality, because VIRTQUEUE_MAX_SIZE obviously is not 
per queue either ATM, and nobody ever cared.
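
For reference, the status quo is simply this constant (in QEMU's virtio
headers, if I remember the exact location correctly):

    #define VIRTQUEUE_MAX_SIZE 1024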

All this series does is allow overriding that currently project-wide 
compile-time constant with a per-driver-model compile-time constant. Which 
makes sense, because that's what it is: some driver models could cope with any 
transfer size, while others are constrained to a certain maximum, 
application-specific transfer size (e.g. IOV_MAX).
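
To make the opt-in concrete, usage in a device model would roughly look like
this (just a sketch; the exact virtio_init() signature is whatever patch 1/3
ends up with):

    /* virtio-9p opting in to the full virtio maximum (cf. patch 3/3): */
    virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
                VIRTQUEUE_MAX_SIZE);

    /* any other device model simply keeps the previous limit for now: */
    virtio_init(vdev, name, device_id, config_size,
                VIRTQUEUE_LEGACY_MAX_SIZE);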

> virtqueues and they are sized differently, so a per-device limit is
> insufficient. virtio-net has separate rx_queue_size and tx_queue_size
> parameters (plus a control vq hardcoded to 64 descriptors).

I simply find this overkill. This value semantically means "my driver model 
supports, at any time and under any circumstances, at the very most 
x * PAGE_SIZE = max_transfer_size". Do you see any driver that might want more 
fine-grained control over this?

As far as I can see, no other driver maintainer even seems to care to 
transition to 32k. So I simply doubt that anybody would even want more 
fine-grained control over this in practice, but anyway ...

> The specification already gives us Queue Size (vring.num in QEMU). The
> variable exists in QEMU and just needs to be used.
> 
> If per-vq limits require a lot of work, please describe why. I think
> replacing the variable from this patch with virtio_queue_get_num()
> should be fairly straightforward, but maybe I'm missing something? (If
> you prefer VirtQueue *vq instead of the index-based
> virtio_queue_get_num() API, you can introduce a virtqueue_get_num()
> API.)
> 
> Stefan

... I leave that up to Michael or whoever might be in charge to decide. I 
still find this overkill, but I will adapt this to whatever the decision 
eventually will be in v3.

But then please tell me the precise representation that you find appropriate, 
i.e. whether you want a new function for that, or rather an additional 
argument to virtio_add_queue(). Your call.
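
Just so we are talking about the same thing, the two options as I understand
them (prototypes are made up here, only meant as sketches):

    /* option A: a separate function (name invented here): */
    void virtio_queue_set_max_size(VirtQueue *vq, uint16_t max_size);

    /* option B: an additional argument to virtio_add_queue(): */
    VirtQueue *virtio_add_queue(VirtIODevice *vdev, int queue_size,
                                uint16_t queue_max_size,
                                VirtIOHandleOutput handle_output);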

Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 97+ messages in thread


* Re: [PATCH v2 1/3] virtio: turn VIRTQUEUE_MAX_SIZE into a variable
  2021-10-05 16:32           ` [Virtio-fs] " Christian Schoenebeck
@ 2021-10-06 11:06             ` Stefan Hajnoczi
  -1 siblings, 0 replies; 97+ messages in thread
From: Stefan Hajnoczi @ 2021-10-06 11:06 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	Jason Wang, Amit Shah, David Hildenbrand, qemu-devel, Greg Kurz,
	virtio-fs, Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Paolo Bonzini, Marc-André Lureau, Fam Zheng,
	Raphael Norwitz, Dr. David Alan Gilbert

[-- Attachment #1: Type: text/plain, Size: 9875 bytes --]

On Tue, Oct 05, 2021 at 06:32:46PM +0200, Christian Schoenebeck wrote:
> On Tuesday, 5 October 2021 17:10:40 CEST Stefan Hajnoczi wrote:
> > On Tue, Oct 05, 2021 at 03:15:26PM +0200, Christian Schoenebeck wrote:
> > > On Tuesday, 5 October 2021 14:45:56 CEST Stefan Hajnoczi wrote:
> > > > On Mon, Oct 04, 2021 at 09:38:04PM +0200, Christian Schoenebeck wrote:
> > > > > Refactor VIRTQUEUE_MAX_SIZE to effectively become a runtime
> > > > > variable per virtio user.
> > > > 
> > > > virtio user == virtio device model?
> > > 
> > > Yes
> > > 
> > > > > Reasons:
> > > > > 
> > > > > (1) VIRTQUEUE_MAX_SIZE should reflect the absolute theoretical
> > > > > 
> > > > >     maximum queue size possible. Which is actually the maximum
> > > > >     queue size allowed by the virtio protocol. The appropriate
> > > > >     value for VIRTQUEUE_MAX_SIZE would therefore be 32768:
> > > > >     
> > > > >     https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs
> > > > >     01.h
> > > > >     tml#x1-240006
> > > > >     
> > > > >     Apparently VIRTQUEUE_MAX_SIZE was instead defined with a
> > > > >     more or less arbitrary value of 1024 in the past, which
> > > > >     limits the maximum transfer size with virtio to 4M
> > > > >     (more precise: 1024 * PAGE_SIZE, with the latter typically
> > > > >     being 4k).
> > > > 
> > > > Being equal to IOV_MAX is a likely reason. Buffers with more iovecs than
> > > > that cannot be passed to host system calls (sendmsg(2), pwritev(2),
> > > > etc).
> > > 
> > > Yes, that's use case dependent. Hence the solution to opt-in if it is
> > > desired and feasible.
> > > 
> > > > > (2) Additionally the current value of 1024 poses a hidden limit,
> > > > > 
> > > > >     invisible to guest, which causes a system hang with the
> > > > >     following QEMU error if guest tries to exceed it:
> > > > >     
> > > > >     virtio: too many write descriptors in indirect table
> > > > 
> > > > I don't understand this point. 2.6.5 The Virtqueue Descriptor Table says:
> > > >   The number of descriptors in the table is defined by the queue size
> > > >   for
> > > > 
> > > > this virtqueue: this is the maximum possible descriptor chain length.
> > > > 
> > > > and 2.6.5.3.1 Driver Requirements: Indirect Descriptors says:
> > > >   A driver MUST NOT create a descriptor chain longer than the Queue Size
> > > >   of
> > > > 
> > > > the device.
> > > > 
> > > > Do you mean a broken/malicious guest driver that is violating the spec?
> > > > That's not a hidden limit, it's defined by the spec.
> > > 
> > > https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00781.html
> > > https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00788.html
> > > 
> > > You can already go beyond that queue size at runtime with the indirection
> > > table. The only actual limit is the currently hard coded value of 1k
> > > pages.
> > > Hence the suggestion to turn that into a variable.
> > 
> > Exceeding Queue Size is a VIRTIO spec violation. Drivers that operate
> > outside the spec do so at their own risk. They may not be compatible
> > with all device implementations.
> 
> Yes, I am aware of that. And still, this practice is already done, which 
> apparently is not limited to 9pfs.
> 
> > The limit is not hidden, it's Queue Size as defined by the spec :).
> > 
> > If you have a driver that is exceeding the limit, then please fix the
> > driver.
> 
> I absolutely understand your position, but I hope you also understand that 
> this violation of the specs is a theoretical issue, it is not a real-life 
> problem right now, and due to lack of man power unfortunately I have to 
> prioritize real-life problems over theoretical ones ATM. Keep in mind that 
> right now I am the only person working on 9pfs actively, I do this voluntarily 
> whenever I find a free time slice, and I am not paid for it either.
> 
> I don't see any reasonable way with reasonable effort to do what you are 
> asking for here in 9pfs, and Greg may correct me here if I am saying anything 
> wrong. If you are seeing any specific real-life issue here, then please tell 
> me which one, otherwise I have to postpone that "specs violation" issue.
> 
> There is still a long list of real problems that I need to hunt down in 9pfs, 
> afterwards I can continue with theoretical ones if you want, but right now I 
> simply can't, sorry.

I understand. If you don't have time to fix the Linux virtio-9p driver
then that's fine.

I still wanted us to agree on the spec position because the commit
description says it's a "hidden limit", which is incorrect. It might
seem pedantic, but my concern is that misconceptions can spread if we
let them. That could cause people to write incorrect code later on.
Please update the commit description either by dropping 2) or by
replacing it with something else. For example:

  2) The Linux virtio-9p guest driver does not honor the VIRTIO Queue
     Size value and can submit descriptor chains that exceed it. That is
     a spec violation but is accepted by QEMU's device implementation.

     When the guest creates a descriptor chain larger than 1024 the
     following QEMU error is printed and the guest hangs:

     virtio: too many write descriptors in indirect table

> > > > > (3) Unfortunately not all virtio users in QEMU would currently
> > > > > 
> > > > >     work correctly with the new value of 32768.
> > > > > 
> > > > > So let's turn this hard coded global value into a runtime
> > > > > variable as a first step in this commit, configurable for each
> > > > > virtio user by passing a corresponding value with virtio_init()
> > > > > call.
> > > > 
> > > > virtio_add_queue() already has an int queue_size argument, why isn't
> > > > that enough to deal with the maximum queue size? There's probably a good
> > > > reason for it, but please include it in the commit description.
> > > 
> > > [...]
> > > 
> > > > Can you make this value per-vq instead of per-vdev since virtqueues can
> > > > have different queue sizes?
> > > > 
> > > > The same applies to the rest of this patch. Anything using
> > > > vdev->queue_max_size should probably use vq->vring.num instead.
> > > 
> > > I would like to avoid that and keep it per device. The maximum size stored
> > > there is the maximum size supported by virtio user (or virtio device
> > > model,
> > > however you want to call it). So that's really a limit per device, not per
> > > queue, as no queue of the device would ever exceed that limit.
> > > 
> > > Plus a lot more code would need to be refactored, which I think is
> > > unnecessary.
> > 
> > I'm against a per-device limit because it's a concept that cannot
> > accurately describe reality. Some devices have multiple classes of
> 
> It describes current reality, because VIRTQUEUE_MAX_SIZE obviously is not per 
> queue either ATM, and nobody ever cared.
> 
> All this series does, is allowing to override that currently project-wide 
> compile-time constant to a per-driver-model compile-time constant. Which makes 
> sense, because that's what it is: some drivers could cope with any transfer 
> size, and some drivers are constrained to a certain maximum application 
> specific transfer size (e.g. IOV_MAX).
> 
> > virtqueues and they are sized differently, so a per-device limit is
> > insufficient. virtio-net has separate rx_queue_size and tx_queue_size
> > parameters (plus a control vq hardcoded to 64 descriptors).
> 
> I simply find this overkill. This value semantically means "my driver model 
> supports at any time and at any coincidence at the very most x * PAGE_SIZE = 
> max_transfer_size". Do you see any driver that might want a more fine graded 
> control over this?

One reason why per-vq limits could make sense is that some code paths
allocate the maximum possible number of struct elements upfront. Those
code paths may need to differentiate between per-vq limits for
performance or memory utilization reasons. Today some places allocate
1024 elements on the stack, but maybe that's not acceptable when the
per-device limit is 32k. This matters when a device has vqs with very
different sizes.
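
To illustrate the kind of pattern I mean (illustrative fragments, not literal
quotes from the tree):

    /* today: a worst-case sized array, regardless of the actual ring */
    struct iovec iov[VIRTQUEUE_MAX_SIZE];          /* 1024 entries */

    /* with per-vq limits: allocate for the ring size of this vq only */
    unsigned int num = virtio_queue_get_num(vdev, queue_index);
    g_autofree struct iovec *dyn_iov = g_new0(struct iovec, num);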

> As far as I can see, no other driver maintainer even seems to care to 
> transition to 32k. So I simply doubt that anybody would even want a more 
> fine-grained control over this in practice, but anyway ...
> 
> > The specification already gives us Queue Size (vring.num in QEMU). The
> > variable exists in QEMU and just needs to be used.
> > 
> > If per-vq limits require a lot of work, please describe why. I think
> > replacing the variable from this patch with virtio_queue_get_num()
> > should be fairly straightforward, but maybe I'm missing something? (If
> > you prefer VirtQueue *vq instead of the index-based
> > virtio_queue_get_num() API, you can introduce a virtqueue_get_num()
> > API.)
> > 
> > Stefan
> 
> ... I leave that up to Michael or whoever might be in charge to decide. I 
> still find this overkill, but I will adapt this to whatever the decision 
> eventually will be in v3.
> 
> But then please tell me the precise representation that you find appropriate, 
> i.e. whether you want a new function for that, or rather an additional 
> argument to virtio_add_queue(). Your call.

virtio_add_queue() already takes an int queue_size argument. I think the
necessary information is already there.

This patch just needs to be tweaked to use virtio_queue_get_num()
(or a new virtqueue_get_num() API if that's easier because only a
VirtQueue *vq pointer is available) instead of introducing a new
per-device limit.

The patch will probably become smaller.
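
In other words, checks in the patch would end up looking roughly like this
(sketch):

    /* bound descriptor chains by the per-queue size ("Queue Size" in the
     * spec) rather than by a new per-device field */
    static bool desc_chain_fits(VirtQueue *vq, unsigned int num_sg)
    {
        return num_sg <= vq->vring.num;
    }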

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 97+ messages in thread


* Re: [PATCH v2 1/3] virtio: turn VIRTQUEUE_MAX_SIZE into a variable
  2021-10-06 11:06             ` [Virtio-fs] " Stefan Hajnoczi
@ 2021-10-06 12:50               ` Christian Schoenebeck
  -1 siblings, 0 replies; 97+ messages in thread
From: Christian Schoenebeck @ 2021-10-06 12:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: Stefan Hajnoczi, Kevin Wolf, Laurent Vivier, qemu-block,
	Michael S. Tsirkin, Jason Wang, Amit Shah, David Hildenbrand,
	Greg Kurz, virtio-fs, Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Paolo Bonzini, Marc-André Lureau, Fam Zheng,
	Raphael Norwitz, Dr. David Alan Gilbert

On Wednesday, 6 October 2021 13:06:55 CEST Stefan Hajnoczi wrote:
> On Tue, Oct 05, 2021 at 06:32:46PM +0200, Christian Schoenebeck wrote:
> > On Tuesday, 5 October 2021 17:10:40 CEST Stefan Hajnoczi wrote:
> > > On Tue, Oct 05, 2021 at 03:15:26PM +0200, Christian Schoenebeck wrote:
> > > > On Tuesday, 5 October 2021 14:45:56 CEST Stefan Hajnoczi wrote:
> > > > > On Mon, Oct 04, 2021 at 09:38:04PM +0200, Christian Schoenebeck wrote:
> > > > > > Refactor VIRTQUEUE_MAX_SIZE to effectively become a runtime
> > > > > > variable per virtio user.
> > > > > 
> > > > > virtio user == virtio device model?
> > > > 
> > > > Yes
> > > > 
> > > > > > Reasons:
> > > > > > 
> > > > > > (1) VIRTQUEUE_MAX_SIZE should reflect the absolute theoretical
> > > > > > 
> > > > > >     maximum queue size possible. Which is actually the maximum
> > > > > >     queue size allowed by the virtio protocol. The appropriate
> > > > > >     value for VIRTQUEUE_MAX_SIZE would therefore be 32768:
> > > > > >     
> > > > > >     https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.
> > > > > >     1-cs
> > > > > >     01.h
> > > > > >     tml#x1-240006
> > > > > >     
> > > > > >     Apparently VIRTQUEUE_MAX_SIZE was instead defined with a
> > > > > >     more or less arbitrary value of 1024 in the past, which
> > > > > >     limits the maximum transfer size with virtio to 4M
> > > > > >     (more precise: 1024 * PAGE_SIZE, with the latter typically
> > > > > >     being 4k).
> > > > > 
> > > > > Being equal to IOV_MAX is a likely reason. Buffers with more iovecs
> > > > > than
> > > > > that cannot be passed to host system calls (sendmsg(2), pwritev(2),
> > > > > etc).
> > > > 
> > > > Yes, that's use case dependent. Hence the solution to opt-in if it is
> > > > desired and feasible.
> > > > 
> > > > > > (2) Additionally the current value of 1024 poses a hidden limit,
> > > > > > 
> > > > > >     invisible to guest, which causes a system hang with the
> > > > > >     following QEMU error if guest tries to exceed it:
> > > > > >     
> > > > > >     virtio: too many write descriptors in indirect table
> > > > > 
> > > > > I don't understand this point. 2.6.5 The Virtqueue Descriptor Table says:
> > > > >   The number of descriptors in the table is defined by the queue
> > > > >   size
> > > > >   for
> > > > > 
> > > > > this virtqueue: this is the maximum possible descriptor chain
> > > > > length.
> > > > > 
> > > > > and 2.6.5.3.1 Driver Requirements: Indirect Descriptors says:
> > > > >   A driver MUST NOT create a descriptor chain longer than the Queue
> > > > >   Size
> > > > >   of
> > > > > 
> > > > > the device.
> > > > > 
> > > > > Do you mean a broken/malicious guest driver that is violating the
> > > > > spec?
> > > > > That's not a hidden limit, it's defined by the spec.
> > > > 
> > > > https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00781.html
> > > > https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00788.html
> > > > 
> > > > You can already go beyond that queue size at runtime with the
> > > > indirection
> > > > table. The only actual limit is the currently hard coded value of 1k
> > > > pages.
> > > > Hence the suggestion to turn that into a variable.
> > > 
> > > Exceeding Queue Size is a VIRTIO spec violation. Drivers that operate
> > > outside the spec do so at their own risk. They may not be compatible
> > > with all device implementations.
> > 
> > Yes, I am aware of that. And still, this practice is already done, which
> > apparently is not limited to 9pfs.
> > 
> > > The limit is not hidden, it's Queue Size as defined by the spec :).
> > > 
> > > If you have a driver that is exceeding the limit, then please fix the
> > > driver.
> > 
> > I absolutely understand your position, but I hope you also understand that
> > this violation of the specs is a theoretical issue, it is not a real-life
> > problem right now, and due to lack of man power unfortunately I have to
> > prioritize real-life problems over theoretical ones ATM. Keep in mind that
> > right now I am the only person working on 9pfs actively, I do this
> > voluntarily whenever I find a free time slice, and I am not paid for it
> > either.
> > 
> > I don't see any reasonable way with reasonable effort to do what you are
> > asking for here in 9pfs, and Greg may correct me here if I am saying
> > anything wrong. If you are seeing any specific real-life issue here, then
> > please tell me which one, otherwise I have to postpone that "specs
> > violation" issue.
> > 
> > There is still a long list of real problems that I need to hunt down in
> > 9pfs, afterwards I can continue with theoretical ones if you want, but
> > right now I simply can't, sorry.
> 
> I understand. If you don't have time to fix the Linux virtio-9p driver
> then that's fine.

I will look at this again, but it might be tricky. If in doubt, I'll postpone it.

> I still wanted us to agree on the spec position because the commit
> description says it's a "hidden limit", which is incorrect. It might
> seem pedantic, but my concern is that misconceptions can spread if we
> let them. That could cause people to write incorrect code later on.
> Please update the commit description either by dropping 2) or by
> replacing it with something else. For example:
> 
>   2) The Linux virtio-9p guest driver does not honor the VIRTIO Queue
>      Size value and can submit descriptor chains that exceed it. That is
>      a spec violation but is accepted by QEMU's device implementation.
> 
>      When the guest creates a descriptor chain larger than 1024 the
>      following QEMU error is printed and the guest hangs:
> 
>      virtio: too many write descriptors in indirect table

I am fine with either, probably preferring the text block above instead of 
silently dropping the reason, just for clarity.

But keep in mind that this might not be limited to virtio-9p as your text 
would suggest; see below.

> > > > > > (3) Unfortunately not all virtio users in QEMU would currently
> > > > > > 
> > > > > >     work correctly with the new value of 32768.
> > > > > > 
> > > > > > So let's turn this hard coded global value into a runtime
> > > > > > variable as a first step in this commit, configurable for each
> > > > > > virtio user by passing a corresponding value with virtio_init()
> > > > > > call.
> > > > > 
> > > > > virtio_add_queue() already has an int queue_size argument, why isn't
> > > > > that enough to deal with the maximum queue size? There's probably a
> > > > > good
> > > > > reason for it, but please include it in the commit description.
> > > > 
> > > > [...]
> > > > 
> > > > > Can you make this value per-vq instead of per-vdev since virtqueues
> > > > > can
> > > > > have different queue sizes?
> > > > > 
> > > > > The same applies to the rest of this patch. Anything using
> > > > > vdev->queue_max_size should probably use vq->vring.num instead.
> > > > 
> > > > I would like to avoid that and keep it per device. The maximum size
> > > > stored
> > > > there is the maximum size supported by virtio user (or virtio device
> > > > model,
> > > > however you want to call it). So that's really a limit per device, not
> > > > per
> > > > queue, as no queue of the device would ever exceed that limit.
> > > > 
> > > > Plus a lot more code would need to be refactored, which I think is
> > > > unnecessary.
> > > 
> > > I'm against a per-device limit because it's a concept that cannot
> > > accurately describe reality. Some devices have multiple classes of
> > 
> > It describes current reality, because VIRTQUEUE_MAX_SIZE obviously is not
> > per queue either ATM, and nobody ever cared.
> > 
> > All this series does, is allowing to override that currently project-wide
> > compile-time constant to a per-driver-model compile-time constant. Which
> > makes sense, because that's what it is: some drivers could cope with any
> > transfer size, and some drivers are constrained to a certain maximum
> > application specific transfer size (e.g. IOV_MAX).
> > 
> > > virtqueues and they are sized differently, so a per-device limit is
> > > insufficient. virtio-net has separate rx_queue_size and tx_queue_size
> > > parameters (plus a control vq hardcoded to 64 descriptors).
> > 
> > I simply find this overkill. This value semantically means "my driver
> > model
> > supports at any time and at any coincidence at the very most x * PAGE_SIZE
> > = max_transfer_size". Do you see any driver that might want a more fine
> > graded control over this?
> 
> One reason why per-vq limits could make sense is that the maximum
> possible number of struct elements is allocated upfront in some code
> paths. Those code paths may need to differentiate between per-vq limits
> for performance or memory utilization reasons. Today some places
> allocate 1024 elements on the stack in some code paths, but maybe that's
> not acceptable when the per-device limit is 32k. This can matter when a
> device has vqs with very different sizes.
> 
[...]
> > ... I leave that up to Michael or whoever might be in charge to decide. I
> > still find this overkill, but I will adapt this to whatever the decision
> > eventually will be in v3.
> > 
> > But then please tell me the precise representation that you find
> > appropriate, i.e. whether you want a new function for that, or rather an
> > additional argument to virtio_add_queue(). Your call.
> 
> virtio_add_queue() already takes an int queue_size argument. I think the
> necessary information is already there.
> 
> This patch just needs to be tweaked to use the virtio_queue_get_num()
> (or a new virtqueue_get_num() API if that's easier because only a
> VirtQueue *vq pointer is available) instead of introducing a new
> per-device limit.

My understanding is that both the original 9p virtio device authors, as well 
as other virtio device authors in QEMU, have been and still are using this as a 
default value (i.e. to allocate some upfront, and the rest on demand).

So yes, I know your argument about the spec, but AFAICS if I just took this 
existing numeric argument as the limit, then it would probably break those 
other QEMU devices as well.
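
For example, in 9p the queue is currently created like this (from memory, so
treat it as approximate), with the argument acting as a start value rather
than as the hard limit:

    /* hw/9pfs/virtio-9p-device.c (approximate): */
    v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
    /* the hard bound is still the global VIRTQUEUE_MAX_SIZE (1024),
     * not the MAX_REQ (128) passed here */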

Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Virtio-fs] [PATCH v2 1/3] virtio: turn VIRTQUEUE_MAX_SIZE into a variable
@ 2021-10-06 12:50               ` Christian Schoenebeck
  0 siblings, 0 replies; 97+ messages in thread
From: Christian Schoenebeck @ 2021-10-06 12:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	Jason Wang, Amit Shah, David Hildenbrand, Raphael Norwitz,
	virtio-fs, Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Marc-André Lureau, Paolo Bonzini, Fam Zheng

On Wednesday, 6 October 2021 13:06:55 CEST Stefan Hajnoczi wrote:
> On Tue, Oct 05, 2021 at 06:32:46PM +0200, Christian Schoenebeck wrote:
> > On Tuesday, 5 October 2021 17:10:40 CEST Stefan Hajnoczi wrote:
> > > On Tue, Oct 05, 2021 at 03:15:26PM +0200, Christian Schoenebeck wrote:
> > > > On Tuesday, 5 October 2021 14:45:56 CEST Stefan Hajnoczi wrote:
> > > > > On Mon, Oct 04, 2021 at 09:38:04PM +0200, Christian Schoenebeck wrote:
> > > > > > Refactor VIRTQUEUE_MAX_SIZE to effectively become a runtime
> > > > > > variable per virtio user.
> > > > > 
> > > > > virtio user == virtio device model?
> > > > 
> > > > Yes
> > > > 
> > > > > > Reasons:
> > > > > > 
> > > > > > (1) VIRTQUEUE_MAX_SIZE should reflect the absolute theoretical
> > > > > > 
> > > > > >     maximum queue size possible. Which is actually the maximum
> > > > > >     queue size allowed by the virtio protocol. The appropriate
> > > > > >     value for VIRTQUEUE_MAX_SIZE would therefore be 32768:
> > > > > >     
> > > > > >     https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.
> > > > > >     1-cs
> > > > > >     01.h
> > > > > >     tml#x1-240006
> > > > > >     
> > > > > >     Apparently VIRTQUEUE_MAX_SIZE was instead defined with a
> > > > > >     more or less arbitrary value of 1024 in the past, which
> > > > > >     limits the maximum transfer size with virtio to 4M
> > > > > >     (more precise: 1024 * PAGE_SIZE, with the latter typically
> > > > > >     being 4k).
> > > > > 
> > > > > Being equal to IOV_MAX is a likely reason. Buffers with more iovecs
> > > > > than
> > > > > that cannot be passed to host system calls (sendmsg(2), pwritev(2),
> > > > > etc).
> > > > 
> > > > Yes, that's use case dependent. Hence the solution to opt-in if it is
> > > > desired and feasible.
> > > > 
> > > > > > (2) Additionally the current value of 1024 poses a hidden limit,
> > > > > > 
> > > > > >     invisible to guest, which causes a system hang with the
> > > > > >     following QEMU error if guest tries to exceed it:
> > > > > >     
> > > > > >     virtio: too many write descriptors in indirect table
> > > > > 
> > > > > I don't understand this point. 2.6.5 The Virtqueue Descriptor Table says:
> > > > >   The number of descriptors in the table is defined by the queue
> > > > >   size
> > > > >   for
> > > > > 
> > > > > this virtqueue: this is the maximum possible descriptor chain
> > > > > length.
> > > > > 
> > > > > and 2.6.5.3.1 Driver Requirements: Indirect Descriptors says:
> > > > >   A driver MUST NOT create a descriptor chain longer than the Queue
> > > > >   Size
> > > > >   of
> > > > > 
> > > > > the device.
> > > > > 
> > > > > Do you mean a broken/malicious guest driver that is violating the
> > > > > spec?
> > > > > That's not a hidden limit, it's defined by the spec.
> > > > 
> > > > https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00781.html
> > > > https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00788.html
> > > > 
> > > > You can already go beyond that queue size at runtime with the
> > > > indirection
> > > > table. The only actual limit is the currently hard coded value of 1k
> > > > pages.
> > > > Hence the suggestion to turn that into a variable.
> > > 
> > > Exceeding Queue Size is a VIRTIO spec violation. Drivers that operate
> > > outside the spec do so at their own risk. They may not be compatible
> > > with all device implementations.
> > 
> > Yes, I am aware of that. And still, this practice already happens, and
> > apparently it is not limited to 9pfs.
> > 
> > > The limit is not hidden, it's Queue Size as defined by the spec :).
> > > 
> > > If you have a driver that is exceeding the limit, then please fix the
> > > driver.
> > 
> > I absolutely understand your position, but I hope you also understand that
> > this violation of the specs is a theoretical issue, it is not a real-life
> > problem right now, and due to lack of man power unfortunately I have to
> > prioritize real-life problems over theoretical ones ATM. Keep in mind that
> > right now I am the only person working on 9pfs actively, I do this
> > voluntarily whenever I find a free time slice, and I am not paid for it
> > either.
> > 
> > I don't see any reasonable way with reasonable effort to do what you are
> > asking for here in 9pfs, and Greg may correct me here if I am saying
> > anything wrong. If you are seeing any specific real-life issue here, then
> > please tell me which one, otherwise I have to postpone that "specs
> > violation" issue.
> > 
> > There is still a long list of real problems that I need to hunt down in
> > 9pfs, afterwards I can continue with theoretical ones if you want, but
> > right now I simply can't, sorry.
> 
> I understand. If you don't have time to fix the Linux virtio-9p driver
> then that's fine.

I will look at this again, but it might be tricky. If in doubt, I'll postpone it.

> I still wanted us to agree on the spec position because the commit
> description says it's a "hidden limit", which is incorrect. It might
> seem pedantic, but my concern is that misconceptions can spread if we
> let them. That could cause people to write incorrect code later on.
> Please update the commit description either by dropping 2) or by
> replacing it with something else. For example:
> 
>   2) The Linux virtio-9p guest driver does not honor the VIRTIO Queue
>      Size value and can submit descriptor chains that exceed it. That is
>      a spec violation but is accepted by QEMU's device implementation.
> 
>      When the guest creates a descriptor chain larger than 1024 the
>      following QEMU error is printed and the guest hangs:
> 
>      virtio: too many write descriptors in indirect table

I am fine with both, probably preferring the text block above instead of 
silently dropping the reason, just for clarity.

But keep in mind that this might not be limited to virtio-9p as your text 
would suggest, see below.

> > > > > > (3) Unfortunately not all virtio users in QEMU would currently
> > > > > > 
> > > > > >     work correctly with the new value of 32768.
> > > > > > 
> > > > > > So let's turn this hard coded global value into a runtime
> > > > > > variable as a first step in this commit, configurable for each
> > > > > > virtio user by passing a corresponding value with virtio_init()
> > > > > > call.
> > > > > 
> > > > > virtio_add_queue() already has an int queue_size argument, why isn't
> > > > > that enough to deal with the maximum queue size? There's probably a
> > > > > good
> > > > > reason for it, but please include it in the commit description.
> > > > 
> > > > [...]
> > > > 
> > > > > Can you make this value per-vq instead of per-vdev since virtqueues
> > > > > can
> > > > > have different queue sizes?
> > > > > 
> > > > > The same applies to the rest of this patch. Anything using
> > > > > vdev->queue_max_size should probably use vq->vring.num instead.
> > > > 
> > > > I would like to avoid that and keep it per device. The maximum size
> > > > stored
> > > > there is the maximum size supported by virtio user (or virtio device
> > > > model,
> > > > however you want to call it). So that's really a limit per device, not
> > > > per
> > > > queue, as no queue of the device would ever exceed that limit.
> > > > 
> > > > Plus a lot more code would need to be refactored, which I think is
> > > > unnecessary.
> > > 
> > > I'm against a per-device limit because it's a concept that cannot
> > > accurately describe reality. Some devices have multiple classes of
> > 
> > It describes current reality, because VIRTQUEUE_MAX_SIZE obviously is not
> > per queue either ATM, and nobody ever cared.
> > 
> > All this series does, is allowing to override that currently project-wide
> > compile-time constant to a per-driver-model compile-time constant. Which
> > makes sense, because that's what it is: some drivers could cope with any
> > transfer size, and some drivers are constrained to a certain maximum
> > application specific transfer size (e.g. IOV_MAX).
> > 
> > > virtqueues and they are sized differently, so a per-device limit is
> > > insufficient. virtio-net has separate rx_queue_size and tx_queue_size
> > > parameters (plus a control vq hardcoded to 64 descriptors).
> > 
> > I simply find this overkill. This value semantically means "my driver
> > model
> > supports at any time and at any coincidence at the very most x * PAGE_SIZE
> > = max_transfer_size". Do you see any driver that might want a more
> > fine-grained control over this?
> 
> One reason why per-vq limits could make sense is that the maximum
> possible number of struct elements is allocated upfront in some code
> paths. Those code paths may need to differentiate between per-vq limits
> for performance or memory utilization reasons. Today some places
> allocate 1024 elements on the stack in some code paths, but maybe that's
> not acceptable when the per-device limit is 32k. This can matter when a
> device has vqs with very different sizes.
> 
[...]
> > ... I leave that up to Michael or whoever might be in charge to decide. I
> > still find this overkill, but I will adapt this to whatever the decision
> > eventually will be in v3.
> > 
> > But then please tell me the precise representation that you find
> > appropriate, i.e. whether you want a new function for that, or rather an
> > additional argument to virtio_add_queue(). Your call.
> 
> virtio_add_queue() already takes an int queue_size argument. I think the
> necessary information is already there.
> 
> This patch just needs to be tweaked to use the virtio_queue_get_num()
> (or a new virtqueue_get_num() API if that's easier because only a
> VirtQueue *vq pointer is available) instead of introducing a new
> per-device limit.

My understanding is that both the original 9p virtio device authors and other 
virtio device authors in QEMU have been, and still are, using this as a 
default value (i.e. to allocate some upfront, and the rest on demand).

So yes, I know your argument about the specs, but AFAICS if I just took this 
existing numeric argument as the limit, then it would probably break those 
other QEMU devices as well.
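
For instance, if I remember the QEMU code correctly, virtio-9p currently does 
roughly this (just from memory, names and exact location may be slightly off):

    #define MAX_REQ 128
    /* ... */
    v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);

i.e. the queue size passed there reflects how many requests are expected to be 
in flight at once, not the maximum transfer size the device model can handle; 
a single request may already use up to 1024 descriptors via the indirect 
table. Reusing that argument as the chain length limit would change exactly 
that behaviour.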

Best regards,
Christian Schoenebeck



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v2 1/3] virtio: turn VIRTQUEUE_MAX_SIZE into a variable
  2021-10-06 12:50               ` [Virtio-fs] " Christian Schoenebeck
@ 2021-10-06 14:42                 ` Stefan Hajnoczi
  -1 siblings, 0 replies; 97+ messages in thread
From: Stefan Hajnoczi @ 2021-10-06 14:42 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	Jason Wang, Amit Shah, David Hildenbrand, qemu-devel, Greg Kurz,
	virtio-fs, Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Marc-André Lureau, Paolo Bonzini, Fam Zheng,
	Raphael Norwitz, Dr. David Alan Gilbert

[-- Attachment #1: Type: text/plain, Size: 11702 bytes --]

On Wed, Oct 06, 2021 at 02:50:07PM +0200, Christian Schoenebeck wrote:
> On Mittwoch, 6. Oktober 2021 13:06:55 CEST Stefan Hajnoczi wrote:
> > On Tue, Oct 05, 2021 at 06:32:46PM +0200, Christian Schoenebeck wrote:
> > > On Dienstag, 5. Oktober 2021 17:10:40 CEST Stefan Hajnoczi wrote:
> > > > On Tue, Oct 05, 2021 at 03:15:26PM +0200, Christian Schoenebeck wrote:
> > > > > On Dienstag, 5. Oktober 2021 14:45:56 CEST Stefan Hajnoczi wrote:
> > > > > > On Mon, Oct 04, 2021 at 09:38:04PM +0200, Christian Schoenebeck 
> wrote:
> > > > > > > Refactor VIRTQUEUE_MAX_SIZE to effectively become a runtime
> > > > > > > variable per virtio user.
> > > > > > 
> > > > > > virtio user == virtio device model?
> > > > > 
> > > > > Yes
> > > > > 
> > > > > > > Reasons:
> > > > > > > 
> > > > > > > (1) VIRTQUEUE_MAX_SIZE should reflect the absolute theoretical
> > > > > > > 
> > > > > > >     maximum queue size possible. Which is actually the maximum
> > > > > > >     queue size allowed by the virtio protocol. The appropriate
> > > > > > >     value for VIRTQUEUE_MAX_SIZE would therefore be 32768:
> > > > > > >     
> > > > > > >     https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.
> > > > > > >     1-cs
> > > > > > >     01.h
> > > > > > >     tml#x1-240006
> > > > > > >     
> > > > > > >     Apparently VIRTQUEUE_MAX_SIZE was instead defined with a
> > > > > > >     more or less arbitrary value of 1024 in the past, which
> > > > > > >     limits the maximum transfer size with virtio to 4M
> > > > > > >     (more precise: 1024 * PAGE_SIZE, with the latter typically
> > > > > > >     being 4k).
> > > > > > 
> > > > > > Being equal to IOV_MAX is a likely reason. Buffers with more iovecs
> > > > > > than
> > > > > > that cannot be passed to host system calls (sendmsg(2), pwritev(2),
> > > > > > etc).
> > > > > 
> > > > > Yes, that's use case dependent. Hence the solution to opt-in if it is
> > > > > desired and feasible.
> > > > > 
> > > > > > > (2) Additionally the current value of 1024 poses a hidden limit,
> > > > > > > 
> > > > > > >     invisible to guest, which causes a system hang with the
> > > > > > >     following QEMU error if guest tries to exceed it:
> > > > > > >     
> > > > > > >     virtio: too many write descriptors in indirect table
> > > > > > 
> > > > > > I don't understand this point. 2.6.5 The Virtqueue Descriptor Table
> > > 
> > > says:
> > > > > >   The number of descriptors in the table is defined by the queue
> > > > > >   size
> > > > > >   for
> > > > > > 
> > > > > > this virtqueue: this is the maximum possible descriptor chain
> > > > > > length.
> > > > > > 
> > > > > > and 2.6.5.3.1 Driver Requirements: Indirect Descriptors says:
> > > > > >   A driver MUST NOT create a descriptor chain longer than the Queue
> > > > > >   Size
> > > > > >   of
> > > > > > 
> > > > > > the device.
> > > > > > 
> > > > > > Do you mean a broken/malicious guest driver that is violating the
> > > > > > spec?
> > > > > > That's not a hidden limit, it's defined by the spec.
> > > > > 
> > > > > https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00781.html
> > > > > https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00788.html
> > > > > 
> > > > > You can already go beyond that queue size at runtime with the
> > > > > indirection
> > > > > table. The only actual limit is the currently hard coded value of 1k
> > > > > pages.
> > > > > Hence the suggestion to turn that into a variable.
> > > > 
> > > > Exceeding Queue Size is a VIRTIO spec violation. Drivers that operate
> > > > outside the spec do so at their own risk. They may not be compatible
> > > > with all device implementations.
> > > 
> > > Yes, I am aware of that. And still, this practice already happens, and
> > > apparently it is not limited to 9pfs.
> > > 
> > > > The limit is not hidden, it's Queue Size as defined by the spec :).
> > > > 
> > > > If you have a driver that is exceeding the limit, then please fix the
> > > > driver.
> > > 
> > > I absolutely understand your position, but I hope you also understand that
> > > this violation of the specs is a theoretical issue, it is not a real-life
> > > problem right now, and due to lack of man power unfortunately I have to
> > > prioritize real-life problems over theoretical ones ATM. Keep in mind that
> > > right now I am the only person working on 9pfs actively, I do this
> > > voluntarily whenever I find a free time slice, and I am not paid for it
> > > either.
> > > 
> > > I don't see any reasonable way with reasonable effort to do what you are
> > > asking for here in 9pfs, and Greg may correct me here if I am saying
> > > anything wrong. If you are seeing any specific real-life issue here, then
> > > please tell me which one, otherwise I have to postpone that "specs
> > > violation" issue.
> > > 
> > > There is still a long list of real problems that I need to hunt down in
> > > 9pfs, afterwards I can continue with theoretical ones if you want, but
> > > right now I simply can't, sorry.
> > 
> > I understand. If you don't have time to fix the Linux virtio-9p driver
> > then that's fine.
> 
> I will look at this again, but it might be tricky. If in doubt, I'll postpone it.
> 
> > I still wanted us to agree on the spec position because the commit
> > description says it's a "hidden limit", which is incorrect. It might
> > seem pedantic, but my concern is that misconceptions can spread if we
> > let them. That could cause people to write incorrect code later on.
> > Please update the commit description either by dropping 2) or by
> > replacing it with something else. For example:
> > 
> >   2) The Linux virtio-9p guest driver does not honor the VIRTIO Queue
> >      Size value and can submit descriptor chains that exceed it. That is
> >      a spec violation but is accepted by QEMU's device implementation.
> > 
> >      When the guest creates a descriptor chain larger than 1024 the
> >      following QEMU error is printed and the guest hangs:
> > 
> >      virtio: too many write descriptors in indirect table
> 
> I am fine with both, probably preferring the text block above instead of 
> silently dropping the reason, just for clarity.
> 
> But keep in mind that this might not be limited to virtio-9p as your text 
> would suggest, see below.
> 
> > > > > > > (3) Unfortunately not all virtio users in QEMU would currently
> > > > > > > 
> > > > > > >     work correctly with the new value of 32768.
> > > > > > > 
> > > > > > > So let's turn this hard coded global value into a runtime
> > > > > > > variable as a first step in this commit, configurable for each
> > > > > > > virtio user by passing a corresponding value with virtio_init()
> > > > > > > call.
> > > > > > 
> > > > > > virtio_add_queue() already has an int queue_size argument, why isn't
> > > > > > that enough to deal with the maximum queue size? There's probably a
> > > > > > good
> > > > > > reason for it, but please include it in the commit description.
> > > > > 
> > > > > [...]
> > > > > 
> > > > > > Can you make this value per-vq instead of per-vdev since virtqueues
> > > > > > can
> > > > > > have different queue sizes?
> > > > > > 
> > > > > > The same applies to the rest of this patch. Anything using
> > > > > > vdev->queue_max_size should probably use vq->vring.num instead.
> > > > > 
> > > > > I would like to avoid that and keep it per device. The maximum size
> > > > > stored
> > > > > there is the maximum size supported by virtio user (or virtio device
> > > > > model,
> > > > > however you want to call it). So that's really a limit per device, not
> > > > > per
> > > > > queue, as no queue of the device would ever exceed that limit.
> > > > > 
> > > > > Plus a lot more code would need to be refactored, which I think is
> > > > > unnecessary.
> > > > 
> > > > I'm against a per-device limit because it's a concept that cannot
> > > > accurately describe reality. Some devices have multiple classes of
> > > 
> > > It describes current reality, because VIRTQUEUE_MAX_SIZE obviously is not
> > > per queue either ATM, and nobody ever cared.
> > > 
> > > All this series does, is allowing to override that currently project-wide
> > > compile-time constant to a per-driver-model compile-time constant. Which
> > > makes sense, because that's what it is: some drivers could cope with any
> > > transfer size, and some drivers are constrained to a certain maximum
> > > application specific transfer size (e.g. IOV_MAX).
> > > 
> > > > virtqueues and they are sized differently, so a per-device limit is
> > > > insufficient. virtio-net has separate rx_queue_size and tx_queue_size
> > > > parameters (plus a control vq hardcoded to 64 descriptors).
> > > 
> > > I simply find this overkill. This value semantically means "my driver
> > > model
> > > supports at any time and at any coincidence at the very most x * PAGE_SIZE
> > > = max_transfer_size". Do you see any driver that might want a more
> > > fine-grained control over this?
> > 
> > One reason why per-vq limits could make sense is that the maximum
> > possible number of struct elements is allocated upfront in some code
> > paths. Those code paths may need to differentiate between per-vq limits
> > for performance or memory utilization reasons. Today some places
> > allocate 1024 elements on the stack in some code paths, but maybe that's
> > not acceptable when the per-device limit is 32k. This can matter when a
> > device has vqs with very different sizes.
> > 
> [...]
> > > ... I leave that up to Michael or whoever might be in charge to decide. I
> > > still find this overkill, but I will adapt this to whatever the decision
> > > eventually will be in v3.
> > > 
> > > But then please tell me the precise representation that you find
> > > appropriate, i.e. whether you want a new function for that, or rather an
> > > additional argument to virtio_add_queue(). Your call.
> > 
> > virtio_add_queue() already takes an int queue_size argument. I think the
> > necessary information is already there.
> > 
> > This patch just needs to be tweaked to use the virtio_queue_get_num()
> > (or a new virtqueue_get_num() API if that's easier because only a
> > VirtQueue *vq pointer is available) instead of introducing a new
> > per-device limit.
> 
> My understanding is that both the original 9p virtio device authors, as well 
> as other virtio device authors in QEMU have been and are still using this as a 
> default value (i.e. to allocate some upfront, and the rest on demand).
> 
> So yes, I know your argument about the specs, but AFAICS if I would just take 
> this existing numeric argument for the limit, then it would probably break 
> those other QEMU devices as well.

This is a good point that I didn't consider. If guest drivers currently
violate the spec, then restricting descriptor chain length to vring.num
will introduce regressions.

We can't use virtio_queue_get_num() directly. A backwards-compatible
limit is required:

  int virtio_queue_get_desc_chain_max(VirtIODevice *vdev, int n)
  {
      /*
       * QEMU historically allowed 1024 descriptors even if the
       * descriptor table was smaller.
       */
      return MAX(virtio_queue_get_num(vdev, n), 1024);
  }

Device models should call virtio_queue_get_desc_chain_max(). It
preserves the 1024 descriptor chain length but also allows larger values
if the virtqueue was configured appropriately.
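
As a usage sketch (assuming the device's request virtqueue is at index 0), a
device model that currently sizes scratch iovec arrays with
VIRTQUEUE_MAX_SIZE could then do something like:

  int max_chain = virtio_queue_get_desc_chain_max(vdev, 0);
  /* heap allocation - 32k iovecs would be too large for the stack */
  struct iovec *iov = g_new(struct iovec, max_chain);
  ...
  g_free(iov);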

Does this address the breakage you were thinking about?

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
  2021-10-04 19:38 ` [Virtio-fs] " Christian Schoenebeck
@ 2021-10-07  5:23   ` Stefan Hajnoczi
  -1 siblings, 0 replies; 97+ messages in thread
From: Stefan Hajnoczi @ 2021-10-07  5:23 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	qemu-devel, Jason Wang, Amit Shah, David Hildenbrand, Greg Kurz,
	Raphael Norwitz, virtio-fs, Eric Auger, Hanna Reitz,
	Gonglei (Arei),
	Gerd Hoffmann, Paolo Bonzini, Marc-André Lureau, Fam Zheng,
	Dr. David Alan Gilbert

[-- Attachment #1: Type: text/plain, Size: 844 bytes --]

On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian Schoenebeck wrote:
> At the moment the maximum transfer size with virtio is limited to 4M
> (1024 * PAGE_SIZE). This series raises this limit to its maximum
> theoretical possible transfer size of 128M (32k pages) according to the
> virtio specs:
> 
> https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-240006

Hi Christian,
I took a quick look at the code:

- The Linux 9p driver restricts descriptor chains to 128 elements
  (net/9p/trans_virtio.c:VIRTQUEUE_NUM)

- The QEMU 9pfs code passes iovecs directly to preadv(2) and will fail
  with EINVAL when called with more than IOV_MAX iovecs
  (hw/9pfs/9p.c:v9fs_read())
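
(For concrete numbers: net/9p/trans_virtio.c has "#define VIRTQUEUE_NUM 128"
and IOV_MAX is 1024 on Linux, so the effective caps today are 128 descriptors
on the guest side and 1024 iovecs on the host side.)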

Unless I misunderstood the code, neither side can take advantage of the
new 32k descriptor chain limit?

Thanks,
Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
  2021-10-07  5:23   ` [Virtio-fs] " Stefan Hajnoczi
@ 2021-10-07 12:51     ` Christian Schoenebeck
  -1 siblings, 0 replies; 97+ messages in thread
From: Christian Schoenebeck @ 2021-10-07 12:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: Stefan Hajnoczi, Kevin Wolf, Laurent Vivier, qemu-block,
	Michael S. Tsirkin, Jason Wang, Amit Shah, David Hildenbrand,
	Greg Kurz, Raphael Norwitz, virtio-fs, Eric Auger, Hanna Reitz,
	Gonglei (Arei),
	Gerd Hoffmann, Paolo Bonzini, Marc-André Lureau, Fam Zheng,
	Dr. David Alan Gilbert

On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan Hajnoczi wrote:
> On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian Schoenebeck wrote:
> > At the moment the maximum transfer size with virtio is limited to 4M
> > (1024 * PAGE_SIZE). This series raises this limit to its maximum
> > theoretical possible transfer size of 128M (32k pages) according to the
> > virtio specs:
> > 
> > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#
> > x1-240006
> Hi Christian,
> I took a quick look at the code:
> 
> - The Linux 9p driver restricts descriptor chains to 128 elements
>   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)

Yes, that's the limitation that I am about to remove (WIP); current kernel 
patches:
https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyte.com/

> - The QEMU 9pfs code passes iovecs directly to preadv(2) and will fail
>   with EINVAL when called with more than IOV_MAX iovecs
>   (hw/9pfs/9p.c:v9fs_read())

Hmm, which makes me wonder why I never encountered this error during testing.

In practice most people use the 9p QEMU 'local' fs driver backend, so for them 
that v9fs_read() call translates to this implementation on the QEMU side 
(hw/9pfs/9p-local.c):

static ssize_t local_preadv(FsContext *ctx, V9fsFidOpenState *fs,
                            const struct iovec *iov,
                            int iovcnt, off_t offset)
{
#ifdef CONFIG_PREADV
    return preadv(fs->fd, iov, iovcnt, offset);
#else
    int err = lseek(fs->fd, offset, SEEK_SET);
    if (err == -1) {
        return err;
    } else {
        return readv(fs->fd, iov, iovcnt);
    }
#endif
}

> Unless I misunderstood the code, neither side can take advantage of the
> new 32k descriptor chain limit?
> 
> Thanks,
> Stefan

I need to check that when I have some more time. One possible explanation 
might be that preadv() already wraps this into a loop in its implementation 
to work around a limit like IOV_MAX. It might be another "it works, but is 
not portable" issue, but I am not sure.
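
If it turns out that preadv() does not do that for us, one portable option
would be to do the chunking on QEMU side. Just a rough sketch of what I mean
(hypothetical helper, not existing QEMU code; assumes IOV_MAX from <limits.h>
and QEMU's MIN macro):

static ssize_t preadv_chunked(int fd, const struct iovec *iov, int iovcnt,
                              off_t offset)
{
    ssize_t total = 0;

    while (iovcnt > 0) {
        int n = MIN(iovcnt, IOV_MAX);
        ssize_t chunk_len = 0;
        ssize_t len;
        int i;

        for (i = 0; i < n; i++) {
            chunk_len += iov[i].iov_len;
        }
        len = preadv(fd, iov, n, offset);
        if (len < 0) {
            /* report bytes read so far, or the error if nothing was read */
            return total ? total : len;
        }
        total += len;
        if (len < chunk_len) {
            break; /* EOF or short read */
        }
        offset += len;
        iov += n;
        iovcnt -= n;
    }
    return total;
}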

There are still a bunch of other issues I have to resolve. If you look at
net/9p/client.c on the kernel side, you'll notice that it basically does this ATM:

    kmalloc(msize);

for every 9p request. So not only does it allocate much more memory for every 
request than actually required (i.e. if 9pfs was mounted with msize=8M, then a 
9p request that actually needs just 1k would nevertheless allocate 8M), but it 
also allocates > PAGE_SIZE, which obviously may fail at any time.
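
(To put a number on it: with msize=8M that is a physically contiguous
allocation of 2048 pages for each and every request, even for one that only
actually needs a handful of bytes.)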

With those kernel patches above and QEMU patched with this series as well, I 
can go above 4M msize now, and the test system runs stably as long as 9pfs is 
mounted with an msize that is not "too high". If I try to mount 9pfs with a 
very high msize, the kmalloc() issue described above kicks in and causes an 
immediate kernel oops when mounting. So that's a high-priority issue that I 
still need to resolve.
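
Just for reference, such a test mount looks roughly like this (mount tag and
msize value being examples), with msize then varied up to and beyond the old
4M limit:

    mount -t 9p -o trans=virtio,version=9p2000.L,msize=104857600 svfs /mnt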

Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v2 1/3] virtio: turn VIRTQUEUE_MAX_SIZE into a variable
  2021-10-06 14:42                 ` [Virtio-fs] " Stefan Hajnoczi
@ 2021-10-07 13:09                   ` Christian Schoenebeck
  -1 siblings, 0 replies; 97+ messages in thread
From: Christian Schoenebeck @ 2021-10-07 13:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: Stefan Hajnoczi, Kevin Wolf, Laurent Vivier, qemu-block,
	Michael S. Tsirkin, Jason Wang, Amit Shah, David Hildenbrand,
	Greg Kurz, virtio-fs, Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Marc-André Lureau, Paolo Bonzini, Fam Zheng,
	Raphael Norwitz, Dr. David Alan Gilbert

On Mittwoch, 6. Oktober 2021 16:42:34 CEST Stefan Hajnoczi wrote:
> On Wed, Oct 06, 2021 at 02:50:07PM +0200, Christian Schoenebeck wrote:
> > On Mittwoch, 6. Oktober 2021 13:06:55 CEST Stefan Hajnoczi wrote:
> > > On Tue, Oct 05, 2021 at 06:32:46PM +0200, Christian Schoenebeck wrote:
> > > > On Dienstag, 5. Oktober 2021 17:10:40 CEST Stefan Hajnoczi wrote:
> > > > > On Tue, Oct 05, 2021 at 03:15:26PM +0200, Christian Schoenebeck 
wrote:
> > > > > > On Dienstag, 5. Oktober 2021 14:45:56 CEST Stefan Hajnoczi wrote:
> > > > > > > On Mon, Oct 04, 2021 at 09:38:04PM +0200, Christian Schoenebeck
> > 
> > wrote:
> > > > > > > > Refactor VIRTQUEUE_MAX_SIZE to effectively become a runtime
> > > > > > > > variable per virtio user.
> > > > > > > 
> > > > > > > virtio user == virtio device model?
> > > > > > 
> > > > > > Yes
> > > > > > 
> > > > > > > > Reasons:
> > > > > > > > 
> > > > > > > > (1) VIRTQUEUE_MAX_SIZE should reflect the absolute theoretical
> > > > > > > > 
> > > > > > > >     maximum queue size possible. Which is actually the maximum
> > > > > > > >     queue size allowed by the virtio protocol. The appropriate
> > > > > > > >     value for VIRTQUEUE_MAX_SIZE would therefore be 32768:
> > > > > > > >     
> > > > > > > >     https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio
> > > > > > > >     -v1.
> > > > > > > >     1-cs
> > > > > > > >     01.h
> > > > > > > >     tml#x1-240006
> > > > > > > >     
> > > > > > > >     Apparently VIRTQUEUE_MAX_SIZE was instead defined with a
> > > > > > > >     more or less arbitrary value of 1024 in the past, which
> > > > > > > >     limits the maximum transfer size with virtio to 4M
> > > > > > > >     (more precise: 1024 * PAGE_SIZE, with the latter typically
> > > > > > > >     being 4k).
> > > > > > > 
> > > > > > > Being equal to IOV_MAX is a likely reason. Buffers with more
> > > > > > > iovecs
> > > > > > > than
> > > > > > > that cannot be passed to host system calls (sendmsg(2),
> > > > > > > pwritev(2),
> > > > > > > etc).
> > > > > > 
> > > > > > Yes, that's use case dependent. Hence the solution to opt-in if it
> > > > > > is
> > > > > > desired and feasible.
> > > > > > 
> > > > > > > > (2) Additionally the current value of 1024 poses a hidden
> > > > > > > > limit,
> > > > > > > > 
> > > > > > > >     invisible to guest, which causes a system hang with the
> > > > > > > >     following QEMU error if guest tries to exceed it:
> > > > > > > >     
> > > > > > > >     virtio: too many write descriptors in indirect table
> > > > > > > 
> > > > > > > I don't understand this point. 2.6.5 The Virtqueue Descriptor
> > > > > > > Table
> > > > 
> > > > says:
> > > > > > >   The number of descriptors in the table is defined by the queue
> > > > > > >   size
> > > > > > >   for
> > > > > > > 
> > > > > > > this virtqueue: this is the maximum possible descriptor chain
> > > > > > > length.
> > > > > > > 
> > > > > > > and 2.6.5.3.1 Driver Requirements: Indirect Descriptors says:
> > > > > > >   A driver MUST NOT create a descriptor chain longer than the
> > > > > > >   Queue
> > > > > > >   Size
> > > > > > >   of
> > > > > > > 
> > > > > > > the device.
> > > > > > > 
> > > > > > > Do you mean a broken/malicious guest driver that is violating
> > > > > > > the
> > > > > > > spec?
> > > > > > > That's not a hidden limit, it's defined by the spec.
> > > > > > 
> > > > > > https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00781.htm
> > > > > > l
> > > > > > https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00788.htm
> > > > > > l
> > > > > > 
> > > > > > You can already go beyond that queue size at runtime with the
> > > > > > indirection
> > > > > > table. The only actual limit is the currently hard coded value of
> > > > > > 1k
> > > > > > pages.
> > > > > > Hence the suggestion to turn that into a variable.
> > > > > 
> > > > > Exceeding Queue Size is a VIRTIO spec violation. Drivers that
> > > > > operate
> > > > > outsided the spec do so at their own risk. They may not be
> > > > > compatible
> > > > > with all device implementations.
> > > > 
> > > > Yes, I am aware of that. And still, this practice already happens, and
> > > > apparently it is not limited to 9pfs.
> > > > 
> > > > > The limit is not hidden, it's Queue Size as defined by the spec :).
> > > > > 
> > > > > If you have a driver that is exceeding the limit, then please fix
> > > > > the
> > > > > driver.
> > > > 
> > > > I absolutely understand your position, but I hope you also understand
> > > > that
> > > > this violation of the specs is a theoretical issue, it is not a
> > > > real-life
> > > > problem right now, and due to lack of man power unfortunately I have
> > > > to
> > > > prioritize real-life problems over theoretical ones ATM. Keep in mind
> > > > that
> > > > right now I am the only person working on 9pfs actively, I do this
> > > > voluntarily whenever I find a free time slice, and I am not paid for
> > > > it
> > > > either.
> > > > 
> > > > I don't see any reasonable way with reasonable effort to do what you
> > > > are
> > > > asking for here in 9pfs, and Greg may correct me here if I am saying
> > > > anything wrong. If you are seeing any specific real-life issue here,
> > > > then
> > > > please tell me which one, otherwise I have to postpone that "specs
> > > > violation" issue.
> > > > 
> > > > There is still a long list of real problems that I need to hunt down
> > > > in
> > > > 9pfs, afterwards I can continue with theoretical ones if you want, but
> > > > right now I simply can't, sorry.
> > > 
> > > I understand. If you don't have time to fix the Linux virtio-9p driver
> > > then that's fine.
> > 
> > I will look at this again, but it might be tricky. If in doubt, I'll
> > postpone it.
> >
> > > I still wanted us to agree on the spec position because the commit
> > > description says it's a "hidden limit", which is incorrect. It might
> > > seem pedantic, but my concern is that misconceptions can spread if we
> > > let them. That could cause people to write incorrect code later on.
> > > Please update the commit description either by dropping 2) or by
> > > 
> > > replacing it with something else. For example:
> > >   2) The Linux virtio-9p guest driver does not honor the VIRTIO Queue
> > >   
> > >      Size value and can submit descriptor chains that exceed it. That is
> > >      a spec violation but is accepted by QEMU's device implementation.
> > >      
> > >      When the guest creates a descriptor chain larger than 1024 the
> > >      following QEMU error is printed and the guest hangs:
> > >      
> > >      virtio: too many write descriptors in indirect table
> > 
> > I am fine with both, probably preferring the text block above instead of
> > silently dropping the reason, just for clarity.
> > 
> > But keep in mind that this might not be limited to virtio-9p as your text
> > would suggest, see below.
> > 
> > > > > > > > (3) Unfortunately not all virtio users in QEMU would currently
> > > > > > > > 
> > > > > > > >     work correctly with the new value of 32768.
> > > > > > > > 
> > > > > > > > So let's turn this hard coded global value into a runtime
> > > > > > > > variable as a first step in this commit, configurable for each
> > > > > > > > virtio user by passing a corresponding value with
> > > > > > > > virtio_init()
> > > > > > > > call.
> > > > > > > 
> > > > > > > virtio_add_queue() already has an int queue_size argument, why
> > > > > > > isn't
> > > > > > > that enough to deal with the maximum queue size? There's
> > > > > > > probably a
> > > > > > > good
> > > > > > > reason for it, but please include it in the commit description.
> > > > > > 
> > > > > > [...]
> > > > > > 
> > > > > > > Can you make this value per-vq instead of per-vdev since
> > > > > > > virtqueues
> > > > > > > can
> > > > > > > have different queue sizes?
> > > > > > > 
> > > > > > > The same applies to the rest of this patch. Anything using
> > > > > > > vdev->queue_max_size should probably use vq->vring.num instead.
> > > > > > 
> > > > > > I would like to avoid that and keep it per device. The maximum
> > > > > > size
> > > > > > stored
> > > > > > there is the maximum size supported by virtio user (or virtio
> > > > > > device
> > > > > > model,
> > > > > > however you want to call it). So that's really a limit per device,
> > > > > > not
> > > > > > per
> > > > > > queue, as no queue of the device would ever exceed that limit.
> > > > > > 
> > > > > > Plus a lot more code would need to be refactored, which I think is
> > > > > > unnecessary.
> > > > > 
> > > > > I'm against a per-device limit because it's a concept that cannot
> > > > > accurately describe reality. Some devices have multiple classes of
> > > > 
> > > > It describes current reality, because VIRTQUEUE_MAX_SIZE obviously is
> > > > not
> > > > per queue either ATM, and nobody ever cared.
> > > > 
> > > > All this series does, is allowing to override that currently
> > > > project-wide
> > > > compile-time constant to a per-driver-model compile-time constant.
> > > > Which
> > > > makes sense, because that's what it is: some drivers could cope with
> > > > any
> > > > transfer size, and some drivers are constrained to a certain maximum
> > > > application specific transfer size (e.g. IOV_MAX).
> > > > 
> > > > > virtqueues and they are sized differently, so a per-device limit is
> > > > > insufficient. virtio-net has separate rx_queue_size and
> > > > > tx_queue_size
> > > > > parameters (plus a control vq hardcoded to 64 descriptors).
> > > > 
> > > > I simply find this overkill. This value semantically means "my driver
> > > > model
> > > > supports at any time and at any coincidence at the very most x *
> > > > PAGE_SIZE
> > > > = max_transfer_size". Do you see any driver that might want a more
> > > > fine-grained control over this?
> > > 
> > > One reason why per-vq limits could make sense is that the maximum
> > > possible number of struct elements is allocated upfront in some code
> > > paths. Those code paths may need to differentiate between per-vq limits
> > > for performance or memory utilization reasons. Today some places
> > > allocate 1024 elements on the stack in some code paths, but maybe that's
> > > not acceptable when the per-device limit is 32k. This can matter when a
> > > device has vqs with very different sizes.
> > 
> > [...]
> > 
> > > > ... I leave that up to Michael or whoever might be in charge to
> > > > decide. I
> > > > still find this overkill, but I will adapt this to whatever the
> > > > decision
> > > > eventually will be in v3.
> > > > 
> > > > But then please tell me the precise representation that you find
> > > > appropriate, i.e. whether you want a new function for that, or rather
> > > > an
> > > > additional argument to virtio_add_queue(). Your call.
> > > 
> > > virtio_add_queue() already takes an int queue_size argument. I think the
> > > necessary information is already there.
> > > 
> > > This patch just needs to be tweaked to use the virtio_queue_get_num()
> > > (or a new virtqueue_get_num() API if that's easier because only a
> > > VirtQueue *vq pointer is available) instead of introducing a new
> > > per-device limit.
> > 
> > My understanding is that both the original 9p virtio device authors, as
> > well as other virtio device authors in QEMU have been and are still using
> > this as a default value (i.e. to allocate some upfront, and the rest on
> > demand).
> > 
> > So yes, I know your argument about the specs, but AFAICS if I would just
> > take this existing numeric argument for the limit, then it would probably
> > break those other QEMU devices as well.
> 
> This is a good point that I didn't consider. If guest drivers currently
> violate the spec, then restricting descriptor chain length to vring.num
> will introduce regressions.
> 
> We can't use virtio_queue_get_num() directly. A backwards-compatible
> limit is required:
> 
>   int virtio_queue_get_desc_chain_max(VirtIODevice *vdev, int n)
>   {
>       /*
>        * QEMU historically allowed 1024 descriptors even if the
>        * descriptor table was smaller.
>        */
>       return MAX(virtio_queue_get_num(vdev, n), 1024);
>   }

That was an alternative that I thought about as well, but decided against it. 
It would require devices (that want to support large transmission sizes) to 
create their virtio queue(s) with the maximum possible size, i.e.:

  virtio_add_queue(32k);

And that's the point where my current lack of knowledge about what this would 
precisely mean for the resulting allocations made me decide against it. Would 
QEMU's virtio implementation then just a) allocate 32k scatter-gather list 
entries? Or would it rather b) additionally allocate the destination memory 
pages as well?

If you already know the answer to that, it would be appreciated. Otherwise I 
will check this when I find some time.

Because if it is b), then I guess many people would not be happy about this 
change, as it would mean allocating 128M for every 9p mount point, even for 
users who don't care about large transfer sizes.
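
(The 128M figure being the theoretical maximum chain: 32768 descriptors * 4k
page size = 128M.)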

> Device models should call virtio_queue_get_desc_chain_max(). It
> preserves the 1024 descriptor chain length but also allows larger values
> if the virtqueue was configured appropriately.
> 
> Does this address the breakage you were thinking about?
> 
> Stefan

Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Virtio-fs] [PATCH v2 1/3] virtio: turn VIRTQUEUE_MAX_SIZE into a variable
@ 2021-10-07 13:09                   ` Christian Schoenebeck
  0 siblings, 0 replies; 97+ messages in thread
From: Christian Schoenebeck @ 2021-10-07 13:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	Jason Wang, Amit Shah, David Hildenbrand, Raphael Norwitz,
	virtio-fs, Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Paolo Bonzini, Marc-André Lureau, Fam Zheng

On Mittwoch, 6. Oktober 2021 16:42:34 CEST Stefan Hajnoczi wrote:
> On Wed, Oct 06, 2021 at 02:50:07PM +0200, Christian Schoenebeck wrote:
> > On Mittwoch, 6. Oktober 2021 13:06:55 CEST Stefan Hajnoczi wrote:
> > > On Tue, Oct 05, 2021 at 06:32:46PM +0200, Christian Schoenebeck wrote:
> > > > On Dienstag, 5. Oktober 2021 17:10:40 CEST Stefan Hajnoczi wrote:
> > > > > On Tue, Oct 05, 2021 at 03:15:26PM +0200, Christian Schoenebeck 
wrote:
> > > > > > On Dienstag, 5. Oktober 2021 14:45:56 CEST Stefan Hajnoczi wrote:
> > > > > > > On Mon, Oct 04, 2021 at 09:38:04PM +0200, Christian Schoenebeck wrote:
> > > > > > > > Refactor VIRTQUEUE_MAX_SIZE to effectively become a runtime
> > > > > > > > variable per virtio user.
> > > > > > > 
> > > > > > > virtio user == virtio device model?
> > > > > > 
> > > > > > Yes
> > > > > > 
> > > > > > > > Reasons:
> > > > > > > > 
> > > > > > > > (1) VIRTQUEUE_MAX_SIZE should reflect the absolute theoretical
> > > > > > > > 
> > > > > > > >     maximum queue size possible. Which is actually the maximum
> > > > > > > >     queue size allowed by the virtio protocol. The appropriate
> > > > > > > >     value for VIRTQUEUE_MAX_SIZE would therefore be 32768:
> > > > > > > >     
> > > > > > > >     https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-240006
> > > > > > > >     
> > > > > > > >     Apparently VIRTQUEUE_MAX_SIZE was instead defined with a
> > > > > > > >     more or less arbitrary value of 1024 in the past, which
> > > > > > > >     limits the maximum transfer size with virtio to 4M
> > > > > > > >     (more precise: 1024 * PAGE_SIZE, with the latter typically
> > > > > > > >     being 4k).
> > > > > > > 
> > > > > > > Being equal to IOV_MAX is a likely reason. Buffers with more
> > > > > > > iovecs
> > > > > > > than
> > > > > > > that cannot be passed to host system calls (sendmsg(2),
> > > > > > > pwritev(2),
> > > > > > > etc).
> > > > > > 
> > > > > > Yes, that's use case dependent. Hence the solution to opt-in if it
> > > > > > is
> > > > > > desired and feasible.
> > > > > > 
> > > > > > > > (2) Additionally the current value of 1024 poses a hidden
> > > > > > > > limit,
> > > > > > > > 
> > > > > > > >     invisible to guest, which causes a system hang with the
> > > > > > > >     following QEMU error if guest tries to exceed it:
> > > > > > > >     
> > > > > > > >     virtio: too many write descriptors in indirect table
> > > > > > > 
> > > > > > > I don't understand this point. 2.6.5 The Virtqueue Descriptor Table
> > > > > > > says:
> > > > > > >   The number of descriptors in the table is defined by the queue
> > > > > > >   size
> > > > > > >   for
> > > > > > > 
> > > > > > > this virtqueue: this is the maximum possible descriptor chain
> > > > > > > length.
> > > > > > > 
> > > > > > > and 2.6.5.3.1 Driver Requirements: Indirect Descriptors says:
> > > > > > >   A driver MUST NOT create a descriptor chain longer than the
> > > > > > >   Queue
> > > > > > >   Size
> > > > > > >   of
> > > > > > > 
> > > > > > > the device.
> > > > > > > 
> > > > > > > Do you mean a broken/malicious guest driver that is violating
> > > > > > > the
> > > > > > > spec?
> > > > > > > That's not a hidden limit, it's defined by the spec.
> > > > > > 
> > > > > > https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00781.html
> > > > > > https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00788.html
> > > > > > 
> > > > > > You can already go beyond that queue size at runtime with the
> > > > > > indirection
> > > > > > table. The only actual limit is the currently hard coded value of
> > > > > > 1k
> > > > > > pages.
> > > > > > Hence the suggestion to turn that into a variable.
> > > > > 
> > > > > Exceeding Queue Size is a VIRTIO spec violation. Drivers that
> > > > > operate
> > > > > outside the spec do so at their own risk. They may not be
> > > > > compatible
> > > > > with all device implementations.
> > > > 
> > > > Yes, I am aware of that. And still, this practice is already done,
> > > > which
> > > > apparently is not limited to 9pfs.
> > > > 
> > > > > The limit is not hidden, it's Queue Size as defined by the spec :).
> > > > > 
> > > > > If you have a driver that is exceeding the limit, then please fix
> > > > > the
> > > > > driver.
> > > > 
> > > > I absolutely understand your position, but I hope you also understand
> > > > that
> > > > this violation of the specs is a theoretical issue, it is not a
> > > > real-life
> > > > problem right now, and due to lack of man power unfortunately I have
> > > > to
> > > > prioritize real-life problems over theoretical ones ATM. Keep in mind
> > > > that
> > > > right now I am the only person working on 9pfs actively, I do this
> > > > voluntarily whenever I find a free time slice, and I am not paid for
> > > > it
> > > > either.
> > > > 
> > > > I don't see any reasonable way with reasonable effort to do what you
> > > > are
> > > > asking for here in 9pfs, and Greg may correct me here if I am saying
> > > > anything wrong. If you are seeing any specific real-life issue here,
> > > > then
> > > > please tell me which one, otherwise I have to postpone that "specs
> > > > violation" issue.
> > > > 
> > > > There is still a long list of real problems that I need to hunt down
> > > > in
> > > > 9pfs, afterwards I can continue with theoretical ones if you want, but
> > > > right now I simply can't, sorry.
> > > 
> > > I understand. If you don't have time to fix the Linux virtio-9p driver
> > > then that's fine.
> > 
> > I will look at this again, but it might be tricky. If in doubt I'll postpone
> > it.
> > > I still wanted us to agree on the spec position because the commit
> > > description says it's a "hidden limit", which is incorrect. It might
> > > seem pedantic, but my concern is that misconceptions can spread if we
> > > let them. That could cause people to write incorrect code later on.
> > > Please update the commit description either by dropping 2) or by
> > > 
> > > replacing it with something else. For example:
> > >   2) The Linux virtio-9p guest driver does not honor the VIRTIO Queue
> > >   
> > >      Size value and can submit descriptor chains that exceed it. That is
> > >      a spec violation but is accepted by QEMU's device implementation.
> > >      
> > >      When the guest creates a descriptor chain larger than 1024 the
> > >      following QEMU error is printed and the guest hangs:
> > >      
> > >      virtio: too many write descriptors in indirect table
> > 
> > I am fine with both, probably preferring the text block above instead of
> > silently dropping the reason, just for clarity.
> > 
> > But keep in mind that this might not be limited to virtio-9p as your text
> > would suggest, see below.
> > 
> > > > > > > > (3) Unfortunately not all virtio users in QEMU would currently
> > > > > > > > 
> > > > > > > >     work correctly with the new value of 32768.
> > > > > > > > 
> > > > > > > > So let's turn this hard coded global value into a runtime
> > > > > > > > variable as a first step in this commit, configurable for each
> > > > > > > > virtio user by passing a corresponding value with
> > > > > > > > virtio_init()
> > > > > > > > call.
> > > > > > > 
> > > > > > > virtio_add_queue() already has an int queue_size argument, why
> > > > > > > isn't
> > > > > > > that enough to deal with the maximum queue size? There's
> > > > > > > probably a
> > > > > > > good
> > > > > > > reason for it, but please include it in the commit description.
> > > > > > 
> > > > > > [...]
> > > > > > 
> > > > > > > Can you make this value per-vq instead of per-vdev since
> > > > > > > virtqueues
> > > > > > > can
> > > > > > > have different queue sizes?
> > > > > > > 
> > > > > > > The same applies to the rest of this patch. Anything using
> > > > > > > vdev->queue_max_size should probably use vq->vring.num instead.
> > > > > > 
> > > > > > I would like to avoid that and keep it per device. The maximum
> > > > > > size
> > > > > > stored
> > > > > > there is the maximum size supported by virtio user (or virtio
> > > > > > device
> > > > > > model,
> > > > > > however you want to call it). So that's really a limit per device,
> > > > > > not
> > > > > > per
> > > > > > queue, as no queue of the device would ever exceed that limit.
> > > > > > 
> > > > > > Plus a lot more code would need to be refactored, which I think is
> > > > > > unnecessary.
> > > > > 
> > > > > I'm against a per-device limit because it's a concept that cannot
> > > > > accurately describe reality. Some devices have multiple classes of
> > > > 
> > > > It describes current reality, because VIRTQUEUE_MAX_SIZE obviously is
> > > > not
> > > > per queue either ATM, and nobody ever cared.
> > > > 
> > > > All this series does, is allowing to override that currently
> > > > project-wide
> > > > compile-time constant to a per-driver-model compile-time constant.
> > > > Which
> > > > makes sense, because that's what it is: some drivers could cope with
> > > > any
> > > > transfer size, and some drivers are constrained to a certain maximum
> > > > application specific transfer size (e.g. IOV_MAX).
> > > > 
> > > > > virtqueues and they are sized differently, so a per-device limit is
> > > > > insufficient. virtio-net has separate rx_queue_size and
> > > > > tx_queue_size
> > > > > parameters (plus a control vq hardcoded to 64 descriptors).
> > > > 
> > > > I simply find this overkill. This value semantically means "my driver
> > > > model
> > > > supports at any time and at any coincidence at the very most x *
> > > > PAGE_SIZE
> > > > = max_transfer_size". Do you see any driver that might want a more
> > > > fine
> > > > graded control over this?
> > > 
> > > One reason why per-vq limits could make sense is that the maximum
> > > possible number of struct elements is allocated upfront in some code
> > > paths. Those code paths may need to differentiate between per-vq limits
> > > for performance or memory utilization reasons. Today some places
> > > allocate 1024 elements on the stack in some code paths, but maybe that's
> > > not acceptable when the per-device limit is 32k. This can matter when a
> > > device has vqs with very different sizes.
> > 
> > [...]
> > 
> > > > ... I leave that up to Michael or whoever might be in charge to
> > > > decide. I
> > > > still find this overkill, but I will adapt this to whatever the
> > > > decision
> > > > eventually will be in v3.
> > > > 
> > > > But then please tell me the precise representation that you find
> > > > appropriate, i.e. whether you want a new function for that, or rather
> > > > an
> > > > additional argument to virtio_add_queue(). Your call.
> > > 
> > > virtio_add_queue() already takes an int queue_size argument. I think the
> > > necessary information is already there.
> > > 
> > > This patch just needs to be tweaked to use the virtio_queue_get_num()
> > > (or a new virtqueue_get_num() API if that's easier because only a
> > > VirtQueue *vq pointer is available) instead of introducing a new
> > > per-device limit.
> > 
> > My understanding is that both the original 9p virtio device authors, as
> > well as other virtio device authors in QEMU have been and are still using
> > this as a default value (i.e. to allocate some upfront, and the rest on
> > demand).
> > 
> > So yes, I know your argument about the specs, but AFAICS if I would just
> > take this existing numeric argument for the limit, then it would probably
> > break those other QEMU devices as well.
> 
> This is a good point that I didn't consider. If guest drivers currently
> violate the spec, then restricting descriptor chain length to vring.num
> will introduce regressions.
> 
> We can't use virtio_queue_get_num() directly. A backwards-compatible
> limit is required:
> 
>   int virtio_queue_get_desc_chain_max(VirtIODevice *vdev, int n)
>   {
>       /*
>        * QEMU historically allowed 1024 descriptors even if the
>        * descriptor table was smaller.
>        */
>       return MAX(virtio_queue_get_num(vdev, n), 1024);
>   }

That was an alternative that I thought about as well, but decided against. It
would require devices (that want to support large transmission sizes) to
create the virtio queue(s) with the maximum possible size, i.e.:

  virtio_add_queue(32k);

And that's the point where my current lack of knowledge about what this would
precisely mean for the resulting allocations made me decide against it. I
mean, would QEMU's virtio implementation then just a) allocate 32k scatter
gather list entries? Or would it rather b) additionally also allocate the
destination memory pages as well?

If you know the answer to that, that would be appreciated. Otherwise I will 
check this when I find some time.

Because if it is b) then I guess many people would not be happy about this,
as that change would mean allocating 128M for every 9p mount point, even if
people don't care about high transmission sizes.

> Device models should call virtio_queue_get_desc_chain_max(). It
> preserves the 1024 descriptor chain length but also allows larger values
> if the virtqueue was configured appropriately.
> 
> Does this address the breakage you were thinking about?
> 
> Stefan

Best regards,
Christian Schoenebeck



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v2 1/3] virtio: turn VIRTQUEUE_MAX_SIZE into a variable
  2021-10-07 13:09                   ` [Virtio-fs] " Christian Schoenebeck
@ 2021-10-07 15:18                     ` Stefan Hajnoczi
  -1 siblings, 0 replies; 97+ messages in thread
From: Stefan Hajnoczi @ 2021-10-07 15:18 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	Jason Wang, Amit Shah, David Hildenbrand, qemu-devel, Greg Kurz,
	virtio-fs, Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Paolo Bonzini, Marc-André Lureau, Fam Zheng,
	Raphael Norwitz, Dr. David Alan Gilbert

[-- Attachment #1: Type: text/plain, Size: 15345 bytes --]

On Thu, Oct 07, 2021 at 03:09:16PM +0200, Christian Schoenebeck wrote:
> On Mittwoch, 6. Oktober 2021 16:42:34 CEST Stefan Hajnoczi wrote:
> > On Wed, Oct 06, 2021 at 02:50:07PM +0200, Christian Schoenebeck wrote:
> > > On Mittwoch, 6. Oktober 2021 13:06:55 CEST Stefan Hajnoczi wrote:
> > > > On Tue, Oct 05, 2021 at 06:32:46PM +0200, Christian Schoenebeck wrote:
> > > > > On Dienstag, 5. Oktober 2021 17:10:40 CEST Stefan Hajnoczi wrote:
> > > > > > > On Tue, Oct 05, 2021 at 03:15:26PM +0200, Christian Schoenebeck wrote:
> > > > > > > On Dienstag, 5. Oktober 2021 14:45:56 CEST Stefan Hajnoczi wrote:
> > > > > > > > On Mon, Oct 04, 2021 at 09:38:04PM +0200, Christian Schoenebeck wrote:
> > > > > > > > > Refactor VIRTQUEUE_MAX_SIZE to effectively become a runtime
> > > > > > > > > variable per virtio user.
> > > > > > > > 
> > > > > > > > virtio user == virtio device model?
> > > > > > > 
> > > > > > > Yes
> > > > > > > 
> > > > > > > > > Reasons:
> > > > > > > > > 
> > > > > > > > > (1) VIRTQUEUE_MAX_SIZE should reflect the absolute theoretical
> > > > > > > > > 
> > > > > > > > >     maximum queue size possible. Which is actually the maximum
> > > > > > > > >     queue size allowed by the virtio protocol. The appropriate
> > > > > > > > >     value for VIRTQUEUE_MAX_SIZE would therefore be 32768:
> > > > > > > > >     
> > > > > > > > >     https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-240006
> > > > > > > > >     
> > > > > > > > >     Apparently VIRTQUEUE_MAX_SIZE was instead defined with a
> > > > > > > > >     more or less arbitrary value of 1024 in the past, which
> > > > > > > > >     limits the maximum transfer size with virtio to 4M
> > > > > > > > >     (more precise: 1024 * PAGE_SIZE, with the latter typically
> > > > > > > > >     being 4k).
> > > > > > > > 
> > > > > > > > Being equal to IOV_MAX is a likely reason. Buffers with more
> > > > > > > > iovecs
> > > > > > > > than
> > > > > > > > that cannot be passed to host system calls (sendmsg(2),
> > > > > > > > pwritev(2),
> > > > > > > > etc).
> > > > > > > 
> > > > > > > Yes, that's use case dependent. Hence the solution to opt-in if it
> > > > > > > is
> > > > > > > desired and feasible.
> > > > > > > 
> > > > > > > > > (2) Additionally the current value of 1024 poses a hidden
> > > > > > > > > limit,
> > > > > > > > > 
> > > > > > > > >     invisible to guest, which causes a system hang with the
> > > > > > > > >     following QEMU error if guest tries to exceed it:
> > > > > > > > >     
> > > > > > > > >     virtio: too many write descriptors in indirect table
> > > > > > > > 
> > > > > > > > I don't understand this point. 2.6.5 The Virtqueue Descriptor Table
> > > > > > > > says:
> > > > > > > >   The number of descriptors in the table is defined by the queue
> > > > > > > >   size
> > > > > > > >   for
> > > > > > > > 
> > > > > > > > this virtqueue: this is the maximum possible descriptor chain
> > > > > > > > length.
> > > > > > > > 
> > > > > > > > and 2.6.5.3.1 Driver Requirements: Indirect Descriptors says:
> > > > > > > >   A driver MUST NOT create a descriptor chain longer than the
> > > > > > > >   Queue
> > > > > > > >   Size
> > > > > > > >   of
> > > > > > > > 
> > > > > > > > the device.
> > > > > > > > 
> > > > > > > > Do you mean a broken/malicious guest driver that is violating
> > > > > > > > the
> > > > > > > > spec?
> > > > > > > > That's not a hidden limit, it's defined by the spec.
> > > > > > > 
> > > > > > > https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00781.html
> > > > > > > https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00788.html
> > > > > > > 
> > > > > > > You can already go beyond that queue size at runtime with the
> > > > > > > indirection
> > > > > > > table. The only actual limit is the currently hard coded value of
> > > > > > > 1k
> > > > > > > pages.
> > > > > > > Hence the suggestion to turn that into a variable.
> > > > > > 
> > > > > > Exceeding Queue Size is a VIRTIO spec violation. Drivers that
> > > > > > operate
> > > > > > outside the spec do so at their own risk. They may not be
> > > > > > compatible
> > > > > > with all device implementations.
> > > > > 
> > > > > Yes, I am aware of that. And still, this practice is already done,
> > > > > which
> > > > > apparently is not limited to 9pfs.
> > > > > 
> > > > > > The limit is not hidden, it's Queue Size as defined by the spec :).
> > > > > > 
> > > > > > If you have a driver that is exceeding the limit, then please fix
> > > > > > the
> > > > > > driver.
> > > > > 
> > > > > I absolutely understand your position, but I hope you also understand
> > > > > that
> > > > > this violation of the specs is a theoretical issue, it is not a
> > > > > real-life
> > > > > problem right now, and due to lack of man power unfortunately I have
> > > > > to
> > > > > prioritize real-life problems over theoretical ones ATM. Keep in mind
> > > > > that
> > > > > right now I am the only person working on 9pfs actively, I do this
> > > > > voluntarily whenever I find a free time slice, and I am not paid for
> > > > > it
> > > > > either.
> > > > > 
> > > > > I don't see any reasonable way with reasonable effort to do what you
> > > > > are
> > > > > asking for here in 9pfs, and Greg may correct me here if I am saying
> > > > > anything wrong. If you are seeing any specific real-life issue here,
> > > > > then
> > > > > please tell me which one, otherwise I have to postpone that "specs
> > > > > violation" issue.
> > > > > 
> > > > > There is still a long list of real problems that I need to hunt down
> > > > > in
> > > > > 9pfs, afterwards I can continue with theoretical ones if you want, but
> > > > > right now I simply can't, sorry.
> > > > 
> > > > I understand. If you don't have time to fix the Linux virtio-9p driver
> > > > then that's fine.
> > > 
> > > I will look at this again, but it might be tricky. If in doubt I'll postpone
> > > it.
> > > > I still wanted us to agree on the spec position because the commit
> > > > description says it's a "hidden limit", which is incorrect. It might
> > > > seem pedantic, but my concern is that misconceptions can spread if we
> > > > let them. That could cause people to write incorrect code later on.
> > > > Please update the commit description either by dropping 2) or by
> > > > 
> > > > replacing it with something else. For example:
> > > >   2) The Linux virtio-9p guest driver does not honor the VIRTIO Queue
> > > >   
> > > >      Size value and can submit descriptor chains that exceed it. That is
> > > >      a spec violation but is accepted by QEMU's device implementation.
> > > >      
> > > >      When the guest creates a descriptor chain larger than 1024 the
> > > >      following QEMU error is printed and the guest hangs:
> > > >      
> > > >      virtio: too many write descriptors in indirect table
> > > 
> > > I am fine with both, probably preferring the text block above instead of
> > > silently dropping the reason, just for clarity.
> > > 
> > > But keep in mind that this might not be limited to virtio-9p as your text
> > > would suggest, see below.
> > > 
> > > > > > > > > (3) Unfortunately not all virtio users in QEMU would currently
> > > > > > > > > 
> > > > > > > > >     work correctly with the new value of 32768.
> > > > > > > > > 
> > > > > > > > > So let's turn this hard coded global value into a runtime
> > > > > > > > > variable as a first step in this commit, configurable for each
> > > > > > > > > virtio user by passing a corresponding value with
> > > > > > > > > virtio_init()
> > > > > > > > > call.
> > > > > > > > 
> > > > > > > > virtio_add_queue() already has an int queue_size argument, why
> > > > > > > > isn't
> > > > > > > > that enough to deal with the maximum queue size? There's
> > > > > > > > probably a
> > > > > > > > good
> > > > > > > > reason for it, but please include it in the commit description.
> > > > > > > 
> > > > > > > [...]
> > > > > > > 
> > > > > > > > Can you make this value per-vq instead of per-vdev since
> > > > > > > > virtqueues
> > > > > > > > can
> > > > > > > > have different queue sizes?
> > > > > > > > 
> > > > > > > > The same applies to the rest of this patch. Anything using
> > > > > > > > vdev->queue_max_size should probably use vq->vring.num instead.
> > > > > > > 
> > > > > > > I would like to avoid that and keep it per device. The maximum
> > > > > > > size
> > > > > > > stored
> > > > > > > there is the maximum size supported by virtio user (or virtio
> > > > > > > device
> > > > > > > model,
> > > > > > > however you want to call it). So that's really a limit per device,
> > > > > > > not
> > > > > > > per
> > > > > > > queue, as no queue of the device would ever exceed that limit.
> > > > > > > 
> > > > > > > Plus a lot more code would need to be refactored, which I think is
> > > > > > > unnecessary.
> > > > > > 
> > > > > > I'm against a per-device limit because it's a concept that cannot
> > > > > > accurately describe reality. Some devices have multiple classes of
> > > > > 
> > > > > It describes current reality, because VIRTQUEUE_MAX_SIZE obviously is
> > > > > not
> > > > > per queue either ATM, and nobody ever cared.
> > > > > 
> > > > > All this series does, is allowing to override that currently
> > > > > project-wide
> > > > > compile-time constant to a per-driver-model compile-time constant.
> > > > > Which
> > > > > makes sense, because that's what it is: some drivers could cope with
> > > > > any
> > > > > transfer size, and some drivers are constrained to a certain maximum
> > > > > application specific transfer size (e.g. IOV_MAX).
> > > > > 
> > > > > > virtqueues and they are sized differently, so a per-device limit is
> > > > > > insufficient. virtio-net has separate rx_queue_size and
> > > > > > tx_queue_size
> > > > > > parameters (plus a control vq hardcoded to 64 descriptors).
> > > > > 
> > > > > I simply find this overkill. This value semantically means "my driver
> > > > > model
> > > > > supports at any time and at any coincidence at the very most x *
> > > > > PAGE_SIZE
> > > > > = max_transfer_size". Do you see any driver that might want a more
> > > > > fine
> > > > > graded control over this?
> > > > 
> > > > One reason why per-vq limits could make sense is that the maximum
> > > > possible number of struct elements is allocated upfront in some code
> > > > paths. Those code paths may need to differentiate between per-vq limits
> > > > for performance or memory utilization reasons. Today some places
> > > > allocate 1024 elements on the stack in some code paths, but maybe that's
> > > > not acceptable when the per-device limit is 32k. This can matter when a
> > > > device has vqs with very different sizes.
> > > 
> > > [...]
> > > 
> > > > > ... I leave that up to Michael or whoever might be in charge to
> > > > > decide. I
> > > > > still find this overkill, but I will adapt this to whatever the
> > > > > decision
> > > > > eventually will be in v3.
> > > > > 
> > > > > But then please tell me the precise representation that you find
> > > > > appropriate, i.e. whether you want a new function for that, or rather
> > > > > an
> > > > > additional argument to virtio_add_queue(). Your call.
> > > > 
> > > > virtio_add_queue() already takes an int queue_size argument. I think the
> > > > necessary information is already there.
> > > > 
> > > > This patch just needs to be tweaked to use the virtio_queue_get_num()
> > > > (or a new virtqueue_get_num() API if that's easier because only a
> > > > VirtQueue *vq pointer is available) instead of introducing a new
> > > > per-device limit.
> > > 
> > > My understanding is that both the original 9p virtio device authors, as
> > > well as other virtio device authors in QEMU have been and are still using
> > > this as a default value (i.e. to allocate some upfront, and the rest on
> > > demand).
> > > 
> > > So yes, I know your argument about the specs, but AFAICS if I would just
> > > take this existing numeric argument for the limit, then it would probably
> > > break those other QEMU devices as well.
> > 
> > This is a good point that I didn't consider. If guest drivers currently
> > violate the spec, then restricting descriptor chain length to vring.num
> > will introduce regressions.
> > 
> > We can't use virtio_queue_get_num() directly. A backwards-compatible
> > limit is required:
> > 
> >   int virtio_queue_get_desc_chain_max(VirtIODevice *vdev, int n)
> >   {
> >       /*
> >        * QEMU historically allowed 1024 descriptors even if the
> >        * descriptor table was smaller.
> >        */
> >       return MAX(virtio_queue_get_num(vdev, n), 1024);
> >   }
> 
> That was an alternative that I thought about as well, but decided against. It
> would require devices (that want to support large transmission sizes) to
> create the virtio queue(s) with the maximum possible size, i.e.:
> 
>   virtio_add_queue(32k);

The spec allows drivers to set the size of the vring as long as they do
not exceed Queue Size.

The Linux drivers accept the device's default size, so you're right that
this would cause large vrings to be allocated if the device sets the
virtqueue size to 32k.

> And that's the point where my current lack of knowledge about what this would
> precisely mean for the resulting allocations made me decide against it. I
> mean, would QEMU's virtio implementation then just a) allocate 32k scatter
> gather list entries? Or would it rather b) additionally also allocate the
> destination memory pages as well?

The vring consumes guest RAM but it just consists of descriptors, not
the buffer memory pages. The guest RAM requirements are:
- split layout: 32k * 16 + 6 + 32k * 2 + 6 + 8 * 32k = 851,980 bytes
- packed layout: 32k * 16 + 4 + 4 = 524,296 bytes

That's still quite large!
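
The arithmetic behind those numbers, as a standalone sketch (per-element sizes
taken from the split and packed ring layouts; alignment padding between the
split-ring parts is ignored, as above):

  #include <inttypes.h>
  #include <stdio.h>

  /* Guest RAM consumed by a split vring with 'num' descriptors. */
  static uint64_t split_vring_bytes(uint64_t num)
  {
      uint64_t desc  = num * 16;     /* 16-byte descriptor table entries   */
      uint64_t avail = 6 + num * 2;  /* flags, idx, ring[num], used_event  */
      uint64_t used  = 6 + num * 8;  /* flags, idx, ring[num], avail_event */
      return desc + avail + used;
  }

  /* Guest RAM consumed by a packed vring with 'num' descriptors. */
  static uint64_t packed_vring_bytes(uint64_t num)
  {
      return num * 16 + 4 + 4;       /* descriptor ring + the two event
                                        suppression structures             */
  }

  int main(void)
  {
      printf("split:  %" PRIu64 " bytes\n", split_vring_bytes(32768));  /* 851980 */
      printf("packed: %" PRIu64 " bytes\n", packed_vring_bytes(32768)); /* 524296 */
      return 0;
  }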

By the way, virtio-blk currently uses a virtqueue size of 256
descriptors and this has been found reasonable for disk I/O performance.
The Linux block layer splits requests at around 1.25 MB for virtio-blk.
The virtio-blk queue limits are reported by the device and the guest
Linux block layer uses them to size/split requests appropriately. I'm
not sure 9p really needs 32k, although you're right that fragmented
physical memory requires 32k descriptors to describe 128 MB of buffers.

Going back to the original problem, a vring feature bit could be added
to the VIRTIO specification indicating that indirect descriptor tables
are limited to the maximum (32k) instead of Queue Size. This way the
device's default vring size could be small but drivers could allocate
indirect descriptor tables that are large when necessary. Then the Linux
virtio driver API would need to report the maximum supported sglist
length for a given virtqueue so drivers can take advantage of this
information.
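
Purely as an illustration of that idea (the feature bit name, its number and
the helper below are invented for this sketch; nothing like this exists in the
VIRTIO spec or in Linux today):

  #include <stdbool.h>

  #define VIRTIO_RING_F_LARGE_INDIRECT  40     /* hypothetical feature bit  */
  #define VIRTQUEUE_MAX_SIZE            32768  /* ring maximum per the spec */

  /* Maximum descriptor chain length a driver could use for one request. */
  static unsigned int virtqueue_max_desc_chain(bool large_indirect_negotiated,
                                               unsigned int queue_size)
  {
      /* Without the hypothetical feature the limit stays at Queue Size;
       * with it, indirect descriptor tables could go up to the ring
       * maximum of 32k even though the vring itself stays small.
       */
      return large_indirect_negotiated ? VIRTQUEUE_MAX_SIZE : queue_size;
  }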

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Virtio-fs] [PATCH v2 1/3] virtio: turn VIRTQUEUE_MAX_SIZE into a variable
@ 2021-10-07 15:18                     ` Stefan Hajnoczi
  0 siblings, 0 replies; 97+ messages in thread
From: Stefan Hajnoczi @ 2021-10-07 15:18 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	Jason Wang, Amit Shah, David Hildenbrand, qemu-devel, virtio-fs,
	Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Paolo Bonzini, Marc-André Lureau, Fam Zheng,
	Raphael Norwitz

[-- Attachment #1: Type: text/plain, Size: 15345 bytes --]

On Thu, Oct 07, 2021 at 03:09:16PM +0200, Christian Schoenebeck wrote:
> On Mittwoch, 6. Oktober 2021 16:42:34 CEST Stefan Hajnoczi wrote:
> > On Wed, Oct 06, 2021 at 02:50:07PM +0200, Christian Schoenebeck wrote:
> > > On Mittwoch, 6. Oktober 2021 13:06:55 CEST Stefan Hajnoczi wrote:
> > > > On Tue, Oct 05, 2021 at 06:32:46PM +0200, Christian Schoenebeck wrote:
> > > > > On Dienstag, 5. Oktober 2021 17:10:40 CEST Stefan Hajnoczi wrote:
> > > > > > > On Tue, Oct 05, 2021 at 03:15:26PM +0200, Christian Schoenebeck wrote:
> > > > > > > On Dienstag, 5. Oktober 2021 14:45:56 CEST Stefan Hajnoczi wrote:
> > > > > > > > On Mon, Oct 04, 2021 at 09:38:04PM +0200, Christian Schoenebeck wrote:
> > > > > > > > > Refactor VIRTQUEUE_MAX_SIZE to effectively become a runtime
> > > > > > > > > variable per virtio user.
> > > > > > > > 
> > > > > > > > virtio user == virtio device model?
> > > > > > > 
> > > > > > > Yes
> > > > > > > 
> > > > > > > > > Reasons:
> > > > > > > > > 
> > > > > > > > > (1) VIRTQUEUE_MAX_SIZE should reflect the absolute theoretical
> > > > > > > > > 
> > > > > > > > >     maximum queue size possible. Which is actually the maximum
> > > > > > > > >     queue size allowed by the virtio protocol. The appropriate
> > > > > > > > >     value for VIRTQUEUE_MAX_SIZE would therefore be 32768:
> > > > > > > > >     
> > > > > > > > >     https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-240006
> > > > > > > > >     
> > > > > > > > >     Apparently VIRTQUEUE_MAX_SIZE was instead defined with a
> > > > > > > > >     more or less arbitrary value of 1024 in the past, which
> > > > > > > > >     limits the maximum transfer size with virtio to 4M
> > > > > > > > >     (more precise: 1024 * PAGE_SIZE, with the latter typically
> > > > > > > > >     being 4k).
> > > > > > > > 
> > > > > > > > Being equal to IOV_MAX is a likely reason. Buffers with more
> > > > > > > > iovecs
> > > > > > > > than
> > > > > > > > that cannot be passed to host system calls (sendmsg(2),
> > > > > > > > pwritev(2),
> > > > > > > > etc).
> > > > > > > 
> > > > > > > Yes, that's use case dependent. Hence the solution to opt-in if it
> > > > > > > is
> > > > > > > desired and feasible.
> > > > > > > 
> > > > > > > > > (2) Additionally the current value of 1024 poses a hidden
> > > > > > > > > limit,
> > > > > > > > > 
> > > > > > > > >     invisible to guest, which causes a system hang with the
> > > > > > > > >     following QEMU error if guest tries to exceed it:
> > > > > > > > >     
> > > > > > > > >     virtio: too many write descriptors in indirect table
> > > > > > > > 
> > > > > > > > I don't understand this point. 2.6.5 The Virtqueue Descriptor Table
> > > > > > > > says:
> > > > > > > >   The number of descriptors in the table is defined by the queue
> > > > > > > >   size
> > > > > > > >   for
> > > > > > > > 
> > > > > > > > this virtqueue: this is the maximum possible descriptor chain
> > > > > > > > length.
> > > > > > > > 
> > > > > > > > and 2.6.5.3.1 Driver Requirements: Indirect Descriptors says:
> > > > > > > >   A driver MUST NOT create a descriptor chain longer than the
> > > > > > > >   Queue
> > > > > > > >   Size
> > > > > > > >   of
> > > > > > > > 
> > > > > > > > the device.
> > > > > > > > 
> > > > > > > > Do you mean a broken/malicious guest driver that is violating
> > > > > > > > the
> > > > > > > > spec?
> > > > > > > > That's not a hidden limit, it's defined by the spec.
> > > > > > > 
> > > > > > > https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00781.html
> > > > > > > https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00788.html
> > > > > > > 
> > > > > > > You can already go beyond that queue size at runtime with the
> > > > > > > indirection
> > > > > > > table. The only actual limit is the currently hard coded value of
> > > > > > > 1k
> > > > > > > pages.
> > > > > > > Hence the suggestion to turn that into a variable.
> > > > > > 
> > > > > > Exceeding Queue Size is a VIRTIO spec violation. Drivers that
> > > > > > operate
> > > > > > outside the spec do so at their own risk. They may not be
> > > > > > compatible
> > > > > > with all device implementations.
> > > > > 
> > > > > Yes, I am aware of that. And still, this practice is already done,
> > > > > which
> > > > > apparently is not limited to 9pfs.
> > > > > 
> > > > > > The limit is not hidden, it's Queue Size as defined by the spec :).
> > > > > > 
> > > > > > If you have a driver that is exceeding the limit, then please fix
> > > > > > the
> > > > > > driver.
> > > > > 
> > > > > I absolutely understand your position, but I hope you also understand
> > > > > that
> > > > > this violation of the specs is a theoretical issue, it is not a
> > > > > real-life
> > > > > problem right now, and due to lack of man power unfortunately I have
> > > > > to
> > > > > prioritize real-life problems over theoretical ones ATM. Keep in mind
> > > > > that
> > > > > right now I am the only person working on 9pfs actively, I do this
> > > > > voluntarily whenever I find a free time slice, and I am not paid for
> > > > > it
> > > > > either.
> > > > > 
> > > > > I don't see any reasonable way with reasonable effort to do what you
> > > > > are
> > > > > asking for here in 9pfs, and Greg may correct me here if I am saying
> > > > > anything wrong. If you are seeing any specific real-life issue here,
> > > > > then
> > > > > please tell me which one, otherwise I have to postpone that "specs
> > > > > violation" issue.
> > > > > 
> > > > > There is still a long list of real problems that I need to hunt down
> > > > > in
> > > > > 9pfs, afterwards I can continue with theoretical ones if you want, but
> > > > > right now I simply can't, sorry.
> > > > 
> > > > I understand. If you don't have time to fix the Linux virtio-9p driver
> > > > then that's fine.
> > > 
> > > I will look at this again, but it might be tricky. If in doubt I'll postpone
> > > it.
> > > > I still wanted us to agree on the spec position because the commit
> > > > description says it's a "hidden limit", which is incorrect. It might
> > > > seem pedantic, but my concern is that misconceptions can spread if we
> > > > let them. That could cause people to write incorrect code later on.
> > > > Please update the commit description either by dropping 2) or by
> > > > 
> > > > replacing it with something else. For example:
> > > >   2) The Linux virtio-9p guest driver does not honor the VIRTIO Queue
> > > >   
> > > >      Size value and can submit descriptor chains that exceed it. That is
> > > >      a spec violation but is accepted by QEMU's device implementation.
> > > >      
> > > >      When the guest creates a descriptor chain larger than 1024 the
> > > >      following QEMU error is printed and the guest hangs:
> > > >      
> > > >      virtio: too many write descriptors in indirect table
> > > 
> > > I am fine with both, probably preferring the text block above instead of
> > > silently dropping the reason, just for clarity.
> > > 
> > > But keep in mind that this might not be limited to virtio-9p as your text
> > > would suggest, see below.
> > > 
> > > > > > > > > (3) Unfortunately not all virtio users in QEMU would currently
> > > > > > > > > 
> > > > > > > > >     work correctly with the new value of 32768.
> > > > > > > > > 
> > > > > > > > > So let's turn this hard coded global value into a runtime
> > > > > > > > > variable as a first step in this commit, configurable for each
> > > > > > > > > virtio user by passing a corresponding value with
> > > > > > > > > virtio_init()
> > > > > > > > > call.
> > > > > > > > 
> > > > > > > > virtio_add_queue() already has an int queue_size argument, why
> > > > > > > > isn't
> > > > > > > > that enough to deal with the maximum queue size? There's
> > > > > > > > probably a
> > > > > > > > good
> > > > > > > > reason for it, but please include it in the commit description.
> > > > > > > 
> > > > > > > [...]
> > > > > > > 
> > > > > > > > Can you make this value per-vq instead of per-vdev since
> > > > > > > > virtqueues
> > > > > > > > can
> > > > > > > > have different queue sizes?
> > > > > > > > 
> > > > > > > > The same applies to the rest of this patch. Anything using
> > > > > > > > vdev->queue_max_size should probably use vq->vring.num instead.
> > > > > > > 
> > > > > > > I would like to avoid that and keep it per device. The maximum
> > > > > > > size
> > > > > > > stored
> > > > > > > there is the maximum size supported by virtio user (or virtio
> > > > > > > device
> > > > > > > model,
> > > > > > > however you want to call it). So that's really a limit per device,
> > > > > > > not
> > > > > > > per
> > > > > > > queue, as no queue of the device would ever exceed that limit.
> > > > > > > 
> > > > > > > Plus a lot more code would need to be refactored, which I think is
> > > > > > > unnecessary.
> > > > > > 
> > > > > > I'm against a per-device limit because it's a concept that cannot
> > > > > > accurately describe reality. Some devices have multiple classes of
> > > > > 
> > > > > It describes current reality, because VIRTQUEUE_MAX_SIZE obviously is
> > > > > not
> > > > > per queue either ATM, and nobody ever cared.
> > > > > 
> > > > > All this series does, is allowing to override that currently
> > > > > project-wide
> > > > > compile-time constant to a per-driver-model compile-time constant.
> > > > > Which
> > > > > makes sense, because that's what it is: some drivers could cope with
> > > > > any
> > > > > transfer size, and some drivers are constrained to a certain maximum
> > > > > application specific transfer size (e.g. IOV_MAX).
> > > > > 
> > > > > > virtqueues and they are sized differently, so a per-device limit is
> > > > > > insufficient. virtio-net has separate rx_queue_size and
> > > > > > tx_queue_size
> > > > > > parameters (plus a control vq hardcoded to 64 descriptors).
> > > > > 
> > > > > I simply find this overkill. This value semantically means "my driver
> > > > > model
> > > > > supports at any time and at any coincidence at the very most x *
> > > > > PAGE_SIZE
> > > > > = max_transfer_size". Do you see any driver that might want a more
> > > > > fine
> > > > > graded control over this?
> > > > 
> > > > One reason why per-vq limits could make sense is that the maximum
> > > > possible number of struct elements is allocated upfront in some code
> > > > paths. Those code paths may need to differentiate between per-vq limits
> > > > for performance or memory utilization reasons. Today some places
> > > > allocate 1024 elements on the stack in some code paths, but maybe that's
> > > > not acceptable when the per-device limit is 32k. This can matter when a
> > > > device has vqs with very different sizes.
> > > 
> > > [...]
> > > 
> > > > > ... I leave that up to Michael or whoever might be in charge to
> > > > > decide. I
> > > > > still find this overkill, but I will adapt this to whatever the
> > > > > decision
> > > > > eventually will be in v3.
> > > > > 
> > > > > But then please tell me the precise representation that you find
> > > > > appropriate, i.e. whether you want a new function for that, or rather
> > > > > an
> > > > > additional argument to virtio_add_queue(). Your call.
> > > > 
> > > > virtio_add_queue() already takes an int queue_size argument. I think the
> > > > necessary information is already there.
> > > > 
> > > > This patch just needs to be tweaked to use the virtio_queue_get_num()
> > > > (or a new virtqueue_get_num() API if that's easier because only a
> > > > VirtQueue *vq pointer is available) instead of introducing a new
> > > > per-device limit.
> > > 
> > > My understanding is that both the original 9p virtio device authors, as
> > > well as other virtio device authors in QEMU have been and are still using
> > > this as a default value (i.e. to allocate some upfront, and the rest on
> > > demand).
> > > 
> > > So yes, I know your argument about the specs, but AFAICS if I would just
> > > take this existing numeric argument for the limit, then it would probably
> > > break those other QEMU devices as well.
> > 
> > This is a good point that I didn't consider. If guest drivers currently
> > violate the spec, then restricting descriptor chain length to vring.num
> > will introduce regressions.
> > 
> > We can't use virtio_queue_get_num() directly. A backwards-compatible
> > limit is required:
> > 
> >   int virtio_queue_get_desc_chain_max(VirtIODevice *vdev, int n)
> >   {
> >       /*
> >        * QEMU historically allowed 1024 descriptors even if the
> >        * descriptor table was smaller.
> >        */
> >       return MAX(virtio_queue_get_num(vdev, n), 1024);
> >   }
> 
> That was an alternative that I thought about as well, but decided against. It
> would require devices (that want to support large transmission sizes) to
> create the virtio queue(s) with the maximum possible size, i.e.:
> 
>   virtio_add_queue(32k);

The spec allows drivers to set the size of the vring as long as they do
not exceed Queue Size.

The Linux drivers accept the device's default size, so you're right that
this would cause large vrings to be allocated if the device sets the
virtqueue size to 32k.

> And that's the point where my current lack of knowledge about what this would
> precisely mean for the resulting allocations made me decide against it. I
> mean, would QEMU's virtio implementation then just a) allocate 32k scatter
> gather list entries? Or would it rather b) additionally also allocate the
> destination memory pages as well?

The vring consumes guest RAM but it just consists of descriptors, not
the buffer memory pages. The guest RAM requirements are:
- split layout: 32k * 16 + 6 + 32k * 2 + 6 + 8 * 32k = 851,980 bytes
- packed layout: 32k * 16 + 4 + 4 = 524,296 bytes

That's still quite large!
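
The arithmetic behind those numbers, as a standalone sketch (per-element sizes
taken from the split and packed ring layouts; alignment padding between the
split-ring parts is ignored, as above):

  #include <inttypes.h>
  #include <stdio.h>

  /* Guest RAM consumed by a split vring with 'num' descriptors. */
  static uint64_t split_vring_bytes(uint64_t num)
  {
      uint64_t desc  = num * 16;     /* 16-byte descriptor table entries   */
      uint64_t avail = 6 + num * 2;  /* flags, idx, ring[num], used_event  */
      uint64_t used  = 6 + num * 8;  /* flags, idx, ring[num], avail_event */
      return desc + avail + used;
  }

  /* Guest RAM consumed by a packed vring with 'num' descriptors. */
  static uint64_t packed_vring_bytes(uint64_t num)
  {
      return num * 16 + 4 + 4;       /* descriptor ring + the two event
                                        suppression structures             */
  }

  int main(void)
  {
      printf("split:  %" PRIu64 " bytes\n", split_vring_bytes(32768));  /* 851980 */
      printf("packed: %" PRIu64 " bytes\n", packed_vring_bytes(32768)); /* 524296 */
      return 0;
  }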

By the way, virtio-blk currently uses a virtqueue size of 256
descriptors and this has been found reasonable for disk I/O performance.
The Linux block layer splits requests at around 1.25 MB for virtio-blk.
The virtio-blk queue limits are reported by the device and the guest
Linux block layer uses them to size/split requests appropriately. I'm
not sure 9p really needs 32k, although you're right that fragmented
physical memory requires 32k descriptors to describe 128 MB of buffers.

Going back to the original problem, a vring feature bit could be added
to the VIRTIO specification indicating that indirect descriptor tables
are limited to the maximum (32k) instead of Queue Size. This way the
device's default vring size could be small but drivers could allocate
indirect descriptor tables that are large when necessary. Then the Linux
virtio driver API would need to report the maximum supported sglist
length for a given virtqueue so drivers can take advantage of this
information.
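
Purely as an illustration of that idea (the feature bit name, its number and
the helper below are invented for this sketch; nothing like this exists in the
VIRTIO spec or in Linux today):

  #include <stdbool.h>

  #define VIRTIO_RING_F_LARGE_INDIRECT  40     /* hypothetical feature bit  */
  #define VIRTQUEUE_MAX_SIZE            32768  /* ring maximum per the spec */

  /* Maximum descriptor chain length a driver could use for one request. */
  static unsigned int virtqueue_max_desc_chain(bool large_indirect_negotiated,
                                               unsigned int queue_size)
  {
      /* Without the hypothetical feature the limit stays at Queue Size;
       * with it, indirect descriptor tables could go up to the ring
       * maximum of 32k even though the vring itself stays small.
       */
      return large_indirect_negotiated ? VIRTQUEUE_MAX_SIZE : queue_size;
  }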

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
  2021-10-07 12:51     ` [Virtio-fs] " Christian Schoenebeck
@ 2021-10-07 15:42       ` Stefan Hajnoczi
  -1 siblings, 0 replies; 97+ messages in thread
From: Stefan Hajnoczi @ 2021-10-07 15:42 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	Jason Wang, Amit Shah, David Hildenbrand, qemu-devel, Greg Kurz,
	virtio-fs, Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Marc-André Lureau, Paolo Bonzini, Fam Zheng,
	Raphael Norwitz, Dr. David Alan Gilbert

[-- Attachment #1: Type: text/plain, Size: 3423 bytes --]

On Thu, Oct 07, 2021 at 02:51:55PM +0200, Christian Schoenebeck wrote:
> On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan Hajnoczi wrote:
> > On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian Schoenebeck wrote:
> > > At the moment the maximum transfer size with virtio is limited to 4M
> > > (1024 * PAGE_SIZE). This series raises this limit to its maximum
> > > theoretical possible transfer size of 128M (32k pages) according to the
> > > virtio specs:
> > > 
> > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-240006
> > Hi Christian,
> > I took a quick look at the code:
> > 
> > - The Linux 9p driver restricts descriptor chains to 128 elements
> >   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)
> 
> Yes, that's the limitation that I am about to remove (WIP); current kernel 
> patches:
> https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyte.com/

I haven't read the patches yet but I'm concerned that today the driver
is pretty well-behaved and this new patch series introduces a spec
violation. Not fixing existing spec violations is okay, but adding new
ones is a red flag. I think we need to figure out a clean solution.

> > - The QEMU 9pfs code passes iovecs directly to preadv(2) and will fail
> >   with EINVAL when called with more than IOV_MAX iovecs
> >   (hw/9pfs/9p.c:v9fs_read())
> 
> Hmm, which makes me wonder why I never encountered this error during testing.
> 
> Most people will use the 9p qemu 'local' fs driver backend in practice, so 
> that v9fs_read() call would translate for most people to this implementation 
> on QEMU side (hw/9p/9p-local.c):
> 
> static ssize_t local_preadv(FsContext *ctx, V9fsFidOpenState *fs,
>                             const struct iovec *iov,
>                             int iovcnt, off_t offset)
> {
> #ifdef CONFIG_PREADV
>     return preadv(fs->fd, iov, iovcnt, offset);
> #else
>     int err = lseek(fs->fd, offset, SEEK_SET);
>     if (err == -1) {
>         return err;
>     } else {
>         return readv(fs->fd, iov, iovcnt);
>     }
> #endif
> }
> 
> > Unless I misunderstood the code, neither side can take advantage of the
> > new 32k descriptor chain limit?
> > 
> > Thanks,
> > Stefan
> 
> I need to check that when I have some more time. One possible explanation 
> might be that preadv() already has this wrapped into a loop in its 
> implementation to circumvent a limit like IOV_MAX. It might be another "it 
> works, but not portable" issue, but not sure.
>
> There are still a bunch of other issues I have to resolve. If you look at
> net/9p/client.c on kernel side, you'll notice that it basically does this ATM
> 
>     kmalloc(msize);
> 
> for every 9p request. So not only does it allocate much more memory for every 
> request than actually required (i.e. say 9pfs was mounted with msize=8M, then 
> a 9p request that actually would just need 1k would nevertheless allocate 8M), 
> but also it allocates > PAGE_SIZE, which obviously may fail at any time.

The PAGE_SIZE limitation sounds like a kmalloc() vs vmalloc() situation.
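
For instance (a sketch only, not code from the series, and the helper names
are made up): kvmalloc() tries kmalloc() first and transparently falls back to
vmalloc() for large allocations, so a multi-megabyte msize buffer would not
depend on finding physically contiguous pages:

  #include <linux/mm.h>    /* kvmalloc(), kvfree() */

  /* Allocate a 9p message buffer of 'msize' bytes without requiring
   * physically contiguous memory when msize is large.
   */
  static void *p9_alloc_msg_buf(size_t msize)
  {
      return kvmalloc(msize, GFP_KERNEL);
  }

  static void p9_free_msg_buf(void *buf)
  {
      kvfree(buf);
  }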

I saw zerocopy code in the 9p guest driver but didn't investigate when
it's used. Maybe that should be used for large requests (file
reads/writes)? virtio-blk/scsi don't memcpy data into a new buffer, they
directly access page cache or O_DIRECT pinned pages.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Virtio-fs] [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
@ 2021-10-07 15:42       ` Stefan Hajnoczi
  0 siblings, 0 replies; 97+ messages in thread
From: Stefan Hajnoczi @ 2021-10-07 15:42 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	Jason Wang, Amit Shah, David Hildenbrand, qemu-devel, virtio-fs,
	Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Marc-André Lureau, Paolo Bonzini, Fam Zheng,
	Raphael Norwitz

[-- Attachment #1: Type: text/plain, Size: 3423 bytes --]

On Thu, Oct 07, 2021 at 02:51:55PM +0200, Christian Schoenebeck wrote:
> On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan Hajnoczi wrote:
> > On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian Schoenebeck wrote:
> > > At the moment the maximum transfer size with virtio is limited to 4M
> > > (1024 * PAGE_SIZE). This series raises this limit to its maximum
> > > theoretical possible transfer size of 128M (32k pages) according to the
> > > virtio specs:
> > > 
> > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-240006
> > Hi Christian,
> > I took a quick look at the code:
> > 
> > - The Linux 9p driver restricts descriptor chains to 128 elements
> >   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)
> 
> Yes, that's the limitation that I am about to remove (WIP); current kernel 
> patches:
> https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyte.com/

I haven't read the patches yet but I'm concerned that today the driver
is pretty well-behaved and this new patch series introduces a spec
violation. Not fixing existing spec violations is okay, but adding new
ones is a red flag. I think we need to figure out a clean solution.

> > - The QEMU 9pfs code passes iovecs directly to preadv(2) and will fail
> >   with EINVAL when called with more than IOV_MAX iovecs
> >   (hw/9pfs/9p.c:v9fs_read())
> 
> Hmm, which makes me wonder why I never encountered this error during testing.
> 
> Most people will use the 9p qemu 'local' fs driver backend in practice, so 
> that v9fs_read() call would translate for most people to this implementation 
> on QEMU side (hw/9p/9p-local.c):
> 
> static ssize_t local_preadv(FsContext *ctx, V9fsFidOpenState *fs,
>                             const struct iovec *iov,
>                             int iovcnt, off_t offset)
> {
> #ifdef CONFIG_PREADV
>     return preadv(fs->fd, iov, iovcnt, offset);
> #else
>     int err = lseek(fs->fd, offset, SEEK_SET);
>     if (err == -1) {
>         return err;
>     } else {
>         return readv(fs->fd, iov, iovcnt);
>     }
> #endif
> }
> 
> > Unless I misunderstood the code, neither side can take advantage of the
> > new 32k descriptor chain limit?
> > 
> > Thanks,
> > Stefan
> 
> I need to check that when I have some more time. One possible explanation 
> might be that preadv() already has this wrapped into a loop in its 
> implementation to circumvent a limit like IOV_MAX. It might be another "it 
> works, but not portable" issue, but not sure.
>
> There are still a bunch of other issues I have to resolve. If you look at
> net/9p/client.c on kernel side, you'll notice that it basically does this ATM
> 
>     kmalloc(msize);
> 
> for every 9p request. So not only does it allocate much more memory for every 
> request than actually required (i.e. say 9pfs was mounted with msize=8M, then 
> a 9p request that actually would just need 1k would nevertheless allocate 8M), 
> but also it allocates > PAGE_SIZE, which obviously may fail at any time.

The PAGE_SIZE limitation sounds like a kmalloc() vs vmalloc() situation.
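
For instance (a sketch only, not code from the series, and the helper names
are made up): kvmalloc() tries kmalloc() first and transparently falls back to
vmalloc() for large allocations, so a multi-megabyte msize buffer would not
depend on finding physically contiguous pages:

  #include <linux/mm.h>    /* kvmalloc(), kvfree() */

  /* Allocate a 9p message buffer of 'msize' bytes without requiring
   * physically contiguous memory when msize is large.
   */
  static void *p9_alloc_msg_buf(size_t msize)
  {
      return kvmalloc(msize, GFP_KERNEL);
  }

  static void p9_free_msg_buf(void *buf)
  {
      kvfree(buf);
  }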

I saw zerocopy code in the 9p guest driver but didn't investigate when
it's used. Maybe that should be used for large requests (file
reads/writes)? virtio-blk/scsi don't memcpy data into a new buffer, they
directly access page cache or O_DIRECT pinned pages.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
  2021-10-07 15:42       ` [Virtio-fs] " Stefan Hajnoczi
@ 2021-10-08  7:25         ` Greg Kurz
  -1 siblings, 0 replies; 97+ messages in thread
From: Greg Kurz @ 2021-10-08  7:25 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	Jason Wang, Amit Shah, Christian Schoenebeck, qemu-devel,
	Raphael Norwitz, virtio-fs, Eric Auger, Hanna Reitz,
	Gonglei (Arei),
	Gerd Hoffmann, Fam Zheng, Marc-André Lureau, Paolo Bonzini,
	David Hildenbrand, Dr. David Alan Gilbert

[-- Attachment #1: Type: text/plain, Size: 5054 bytes --]

On Thu, 7 Oct 2021 16:42:49 +0100
Stefan Hajnoczi <stefanha@redhat.com> wrote:

> On Thu, Oct 07, 2021 at 02:51:55PM +0200, Christian Schoenebeck wrote:
> > On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan Hajnoczi wrote:
> > > On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian Schoenebeck wrote:
> > > > At the moment the maximum transfer size with virtio is limited to 4M
> > > > (1024 * PAGE_SIZE). This series raises this limit to its maximum
> > > > theoretical possible transfer size of 128M (32k pages) according to the
> > > > virtio specs:
> > > > 
> > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-240006
> > > Hi Christian,
> > > I took a quick look at the code:
> > > 


Hi,

Thanks Stefan for sharing virtio expertise and helping Christian!

> > > - The Linux 9p driver restricts descriptor chains to 128 elements
> > >   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)
> > 
> > Yes, that's the limitation that I am about to remove (WIP); current kernel 
> > patches:
> > https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyte.com/
> 
> I haven't read the patches yet but I'm concerned that today the driver
> is pretty well-behaved and this new patch series introduces a spec
> violation. Not fixing existing spec violations is okay, but adding new
> ones is a red flag. I think we need to figure out a clean solution.
> 
> > > - The QEMU 9pfs code passes iovecs directly to preadv(2) and will fail
> > >   with EINVAL when called with more than IOV_MAX iovecs
> > >   (hw/9pfs/9p.c:v9fs_read())
> > 
> > Hmm, which makes me wonder why I never encountered this error during testing.
> > 
> > Most people will use the 9p qemu 'local' fs driver backend in practice, so 
> > that v9fs_read() call would translate for most people to this implementation 
> > on QEMU side (hw/9p/9p-local.c):
> > 
> > static ssize_t local_preadv(FsContext *ctx, V9fsFidOpenState *fs,
> >                             const struct iovec *iov,
> >                             int iovcnt, off_t offset)
> > {
> > #ifdef CONFIG_PREADV
> >     return preadv(fs->fd, iov, iovcnt, offset);
> > #else
> >     int err = lseek(fs->fd, offset, SEEK_SET);
> >     if (err == -1) {
> >         return err;
> >     } else {
> >         return readv(fs->fd, iov, iovcnt);
> >     }
> > #endif
> > }
> > 
> > > Unless I misunderstood the code, neither side can take advantage of the
> > > new 32k descriptor chain limit?
> > > 
> > > Thanks,
> > > Stefan
> > 
> > I need to check that when I have some more time. One possible explanation 
> > might be that preadv() already has this wrapped into a loop in its 
> > implementation to circumvent a limit like IOV_MAX. It might be another "it 
> > works, but not portable" issue, but not sure.
> >
> > There are still a bunch of other issues I have to resolve. If you look at
> > net/9p/client.c on kernel side, you'll notice that it basically does this ATM
> > 
> >     kmalloc(msize);
> > 

Note that this is done twice: once for the T message (client request) and once
for the R message (server answer). The 9p driver could adjust the size of the T
message to what's really needed instead of allocating the full msize. R message
size is not known though.

> > for every 9p request. So not only does it allocate much more memory for every 
> > request than actually required (i.e. say 9pfs was mounted with msize=8M, then 
> > a 9p request that actually would just need 1k would nevertheless allocate 8M), 
> > but also it allocates > PAGE_SIZE, which obviously may fail at any time.
> 
> The PAGE_SIZE limitation sounds like a kmalloc() vs vmalloc() situation.
> 
> I saw zerocopy code in the 9p guest driver but didn't investigate when
> it's used. Maybe that should be used for large requests (file
> reads/writes)?

This is the case already : zero-copy is only used for reads/writes/readdir
if the requested size is 1k or more.

Also you'll note that in this case, the 9p driver doesn't allocate msize
for the T/R messages but only 4k, which is largely enough to hold the
header.

	/*
	 * We allocate a inline protocol data of only 4k bytes.
	 * The actual content is passed in zero-copy fashion.
	 */
	req = p9_client_prepare_req(c, type, P9_ZC_HDR_SZ, fmt, ap);

and

/* size of header for zero copy read/write */
#define P9_ZC_HDR_SZ 4096

A huge msize only makes sense for Twrite, Rread and Rreaddir because
of the amount of data they convey. All other messages certainly fit
in a couple of kilobytes only (sorry, don't remember the numbers).

A first change should be to allocate MIN(XXX, msize) for the
regular non-zc case, where XXX could be a reasonable fixed
value (8k?). In the case of T messages, it is even possible
to adjust the size to what's exactly needed, ala snprintf(NULL).
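
A minimal sketch of that idea (hypothetical names, not actual net/9p code;
the constant stands for the fixed "XXX" value above):

/* Cap the buffer size of regular (non zero-copy) messages instead of
 * always allocating the full negotiated msize.
 */
#define P9_NONZC_MSG_MAX	(8 * 1024)

static int p9_nonzc_msg_size(int msize)
{
	return min_t(int, P9_NONZC_MSG_MAX, msize);
}

The non-zc allocation path could then pass this value as alloc_msize to
p9_fcall_init() instead of the full msize.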

> virtio-blk/scsi don't memcpy data into a new buffer, they
> directly access page cache or O_DIRECT pinned pages.
> 
> Stefan

Cheers,

--
Greg

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Virtio-fs] [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
@ 2021-10-08  7:25         ` Greg Kurz
  0 siblings, 0 replies; 97+ messages in thread
From: Greg Kurz @ 2021-10-08  7:25 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	Jason Wang, Amit Shah, Christian Schoenebeck, qemu-devel,
	Raphael Norwitz, virtio-fs, Eric Auger, Hanna Reitz,
	Gonglei (Arei),
	Gerd Hoffmann, Fam Zheng, Marc-André Lureau, Paolo Bonzini,
	David Hildenbrand

[-- Attachment #1: Type: text/plain, Size: 5054 bytes --]

On Thu, 7 Oct 2021 16:42:49 +0100
Stefan Hajnoczi <stefanha@redhat.com> wrote:

> On Thu, Oct 07, 2021 at 02:51:55PM +0200, Christian Schoenebeck wrote:
> > On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan Hajnoczi wrote:
> > > On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian Schoenebeck wrote:
> > > > At the moment the maximum transfer size with virtio is limited to 4M
> > > > (1024 * PAGE_SIZE). This series raises this limit to its maximum
> > > > theoretical possible transfer size of 128M (32k pages) according to the
> > > > virtio specs:
> > > > 
> > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#
> > > > x1-240006
> > > Hi Christian,
> > > I took a quick look at the code:
> > > 


Hi,

Thanks Stefan for sharing virtio expertise and helping Christian !

> > > - The Linux 9p driver restricts descriptor chains to 128 elements
> > >   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)
> > 
> > Yes, that's the limitation that I am about to remove (WIP); current kernel 
> > patches:
> > https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyte.com/
> 
> I haven't read the patches yet but I'm concerned that today the driver
> is pretty well-behaved and this new patch series introduces a spec
> violation. Not fixing existing spec violations is okay, but adding new
> ones is a red flag. I think we need to figure out a clean solution.
> 
> > > - The QEMU 9pfs code passes iovecs directly to preadv(2) and will fail
> > >   with EINVAL when called with more than IOV_MAX iovecs
> > >   (hw/9pfs/9p.c:v9fs_read())
> > 
> > Hmm, which makes me wonder why I never encountered this error during testing.
> > 
> > Most people will use the 9p qemu 'local' fs driver backend in practice, so 
> > that v9fs_read() call would translate for most people to this implementation 
> > on QEMU side (hw/9p/9p-local.c):
> > 
> > static ssize_t local_preadv(FsContext *ctx, V9fsFidOpenState *fs,
> >                             const struct iovec *iov,
> >                             int iovcnt, off_t offset)
> > {
> > #ifdef CONFIG_PREADV
> >     return preadv(fs->fd, iov, iovcnt, offset);
> > #else
> >     int err = lseek(fs->fd, offset, SEEK_SET);
> >     if (err == -1) {
> >         return err;
> >     } else {
> >         return readv(fs->fd, iov, iovcnt);
> >     }
> > #endif
> > }
> > 
> > > Unless I misunderstood the code, neither side can take advantage of the
> > > new 32k descriptor chain limit?
> > > 
> > > Thanks,
> > > Stefan
> > 
> > I need to check that when I have some more time. One possible explanation 
> > might be that preadv() already has this wrapped into a loop in its 
> > implementation to circumvent a limit like IOV_MAX. It might be another "it 
> > works, but not portable" issue, but not sure.
> >
> > There are still a bunch of other issues I have to resolve. If you look at
> > net/9p/client.c on kernel side, you'll notice that it basically does this ATM
> > 
> >     kmalloc(msize);
> > 

Note that this is done twice : once for the T message (client request) and once
for the R message (server answer). The 9p driver could adjust the size of the T
message to what's really needed instead of allocating the full msize. R message
size is not known though.

> > for every 9p request. So not only does it allocate much more memory for every 
> > request than actually required (i.e. say 9pfs was mounted with msize=8M, then 
> > a 9p request that actually would just need 1k would nevertheless allocate 8M), 
> > but also it allocates > PAGE_SIZE, which obviously may fail at any time.
> 
> The PAGE_SIZE limitation sounds like a kmalloc() vs vmalloc() situation.
> 
> I saw zerocopy code in the 9p guest driver but didn't investigate when
> it's used. Maybe that should be used for large requests (file
> reads/writes)?

This is the case already : zero-copy is only used for reads/writes/readdir
if the requested size is 1k or more.

Also you'll note that in this case, the 9p driver doesn't allocate msize
for the T/R messages but only 4k, which is largely enough to hold the
header.

	/*
	 * We allocate a inline protocol data of only 4k bytes.
	 * The actual content is passed in zero-copy fashion.
	 */
	req = p9_client_prepare_req(c, type, P9_ZC_HDR_SZ, fmt, ap);

and

/* size of header for zero copy read/write */
#define P9_ZC_HDR_SZ 4096

A huge msize only makes sense for Twrite, Rread and Rreaddir because
of the amount of data they convey. All other messages certainly fit
in a couple of kilobytes only (sorry, don't remember the numbers).

A first change should be to allocate MIN(XXX, msize) for the
regular non-zc case, where XXX could be a reasonable fixed
value (8k?). In the case of T messages, it is even possible
to adjust the size to what's exactly needed, ala snprintf(NULL).

> virtio-blk/scsi don't memcpy data into a new buffer, they
> directly access page cache or O_DIRECT pinned pages.
> 
> Stefan

Cheers,

--
Greg

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Virtio-fs] [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
  2021-10-08  7:25         ` [Virtio-fs] " Greg Kurz
  (?)
@ 2021-10-08  8:27         ` Greg Kurz
  -1 siblings, 0 replies; 97+ messages in thread
From: Greg Kurz @ 2021-10-08  8:27 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Fam Zheng, Michael S. Tsirkin, Jason Wang, Christian Schoenebeck,
	qemu-devel, Gerd Hoffmann, virtio-fs, qemu-block, David Hildenbrand,
	Gonglei (Arei),
	Marc-André Lureau, Laurent Vivier, Amit Shah, Eric Auger,
	Kevin Wolf, Raphael Norwitz, Hanna Reitz, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 5792 bytes --]

On Fri, 8 Oct 2021 09:25:33 +0200
Greg Kurz <groug@kaod.org> wrote:

> On Thu, 7 Oct 2021 16:42:49 +0100
> Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> > On Thu, Oct 07, 2021 at 02:51:55PM +0200, Christian Schoenebeck wrote:
> > > On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan Hajnoczi wrote:
> > > > On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian Schoenebeck wrote:
> > > > > At the moment the maximum transfer size with virtio is limited to 4M
> > > > > (1024 * PAGE_SIZE). This series raises this limit to its maximum
> > > > > theoretical possible transfer size of 128M (32k pages) according to the
> > > > > virtio specs:
> > > > > 
> > > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#
> > > > > x1-240006
> > > > Hi Christian,
> > > > I took a quick look at the code:
> > > > 
> 
> 
> Hi,
> 
> Thanks Stefan for sharing virtio expertise and helping Christian !
> 
> > > > - The Linux 9p driver restricts descriptor chains to 128 elements
> > > >   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)
> > > 
> > > Yes, that's the limitation that I am about to remove (WIP); current kernel 
> > > patches:
> > > https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyte.com/
> > 
> > I haven't read the patches yet but I'm concerned that today the driver
> > is pretty well-behaved and this new patch series introduces a spec
> > violation. Not fixing existing spec violations is okay, but adding new
> > ones is a red flag. I think we need to figure out a clean solution.
> > 
> > > > - The QEMU 9pfs code passes iovecs directly to preadv(2) and will fail
> > > >   with EINVAL when called with more than IOV_MAX iovecs
> > > >   (hw/9pfs/9p.c:v9fs_read())
> > > 
> > > Hmm, which makes me wonder why I never encountered this error during testing.
> > > 
> > > Most people will use the 9p qemu 'local' fs driver backend in practice, so 
> > > that v9fs_read() call would translate for most people to this implementation 
> > > on QEMU side (hw/9p/9p-local.c):
> > > 
> > > static ssize_t local_preadv(FsContext *ctx, V9fsFidOpenState *fs,
> > >                             const struct iovec *iov,
> > >                             int iovcnt, off_t offset)
> > > {
> > > #ifdef CONFIG_PREADV
> > >     return preadv(fs->fd, iov, iovcnt, offset);
> > > #else
> > >     int err = lseek(fs->fd, offset, SEEK_SET);
> > >     if (err == -1) {
> > >         return err;
> > >     } else {
> > >         return readv(fs->fd, iov, iovcnt);
> > >     }
> > > #endif
> > > }
> > > 
> > > > Unless I misunderstood the code, neither side can take advantage of the
> > > > new 32k descriptor chain limit?
> > > > 
> > > > Thanks,
> > > > Stefan
> > > 
> > > I need to check that when I have some more time. One possible explanation 
> > > might be that preadv() already has this wrapped into a loop in its 
> > > implementation to circumvent a limit like IOV_MAX. It might be another "it 
> > > works, but not portable" issue, but not sure.
> > >
> > > There are still a bunch of other issues I have to resolve. If you look at
> > > net/9p/client.c on kernel side, you'll notice that it basically does this ATM
> > > 
> > >     kmalloc(msize);
> > > 
> 
> Note that this is done twice : once for the T message (client request) and once
> for the R message (server answer). The 9p driver could adjust the size of the T
> message to what's really needed instead of allocating the full msize. R message
> size is not known though.
> 
> > > for every 9p request. So not only does it allocate much more memory for every 
> > > request than actually required (i.e. say 9pfs was mounted with msize=8M, then 
> > > a 9p request that actually would just need 1k would nevertheless allocate 8M), 
> > > but also it allocates > PAGE_SIZE, which obviously may fail at any time.
> > 
> > The PAGE_SIZE limitation sounds like a kmalloc() vs vmalloc() situation.
> > 
> > I saw zerocopy code in the 9p guest driver but didn't investigate when
> > it's used. Maybe that should be used for large requests (file
> > reads/writes)?
> 
> This is the case already : zero-copy is only used for reads/writes/readdir
> if the requested size is 1k or more.
> 
> Also you'll note that in this case, the 9p driver doesn't allocate msize
> for the T/R messages but only 4k, which is largely enough to hold the
> header.
> 
> 	/*
> 	 * We allocate a inline protocol data of only 4k bytes.
> 	 * The actual content is passed in zero-copy fashion.
> 	 */
> 	req = p9_client_prepare_req(c, type, P9_ZC_HDR_SZ, fmt, ap);
> 
> and
> 
> /* size of header for zero copy read/write */
> #define P9_ZC_HDR_SZ 4096
> 
> A huge msize only makes sense for Twrite, Rread and Rreaddir because
> of the amount of data they convey. All other messages certainly fit
> in a couple of kilobytes only (sorry, don't remember the numbers).
> 
> A first change should be to allocate MIN(XXX, msize) for the
> regular non-zc case, where XXX could be a reasonable fixed
> value (8k?). 


Note that this would violate the 9p spec, since the server
can legitimately use the negotiated msize for all R messages
even if all of them only need a couple of bytes in practice,
at worst a couple of kilobytes if a path is involved.

In an ideal world, this would call for a spec refinement to
special-case Rread and Rreaddir, which are the only ones
where a high msize is useful AFAICT.

> In the case of T messages, it is even possible
> to adjust the size to what's exactly needed, ala snprintf(NULL).
> 
> > virtio-blk/scsi don't memcpy data into a new buffer, they
> > directly access page cache or O_DIRECT pinned pages.
> > 
> > Stefan
> 
> Cheers,
> 
> --
> Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
  2021-10-08  7:25         ` [Virtio-fs] " Greg Kurz
@ 2021-10-08 14:24           ` Christian Schoenebeck
  -1 siblings, 0 replies; 97+ messages in thread
From: Christian Schoenebeck @ 2021-10-08 14:24 UTC (permalink / raw)
  To: Greg Kurz
  Cc: Stefan Hajnoczi, qemu-devel, Kevin Wolf, Laurent Vivier,
	qemu-block, Michael S. Tsirkin, Jason Wang, Amit Shah,
	David Hildenbrand, Raphael Norwitz, virtio-fs, Eric Auger,
	Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Paolo Bonzini, Marc-André Lureau, Fam Zheng,
	Dr. David Alan Gilbert

On Freitag, 8. Oktober 2021 09:25:33 CEST Greg Kurz wrote:
> On Thu, 7 Oct 2021 16:42:49 +0100
> 
> Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > On Thu, Oct 07, 2021 at 02:51:55PM +0200, Christian Schoenebeck wrote:
> > > On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan Hajnoczi wrote:
> > > > On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian Schoenebeck wrote:
> > > > > At the moment the maximum transfer size with virtio is limited to 4M
> > > > > (1024 * PAGE_SIZE). This series raises this limit to its maximum
> > > > > theoretical possible transfer size of 128M (32k pages) according to
> > > > > the
> > > > > virtio specs:
> > > > > 
> > > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01
> > > > > .html#
> > > > > x1-240006
> > > > 
> > > > Hi Christian,
> 
> > > > I took a quick look at the code:
> Hi,
> 
> Thanks Stefan for sharing virtio expertise and helping Christian !
> 
> > > > - The Linux 9p driver restricts descriptor chains to 128 elements
> > > > 
> > > >   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)
> > > 
> > > Yes, that's the limitation that I am about to remove (WIP); current
> > > kernel
> > > patches:
> > > https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyte.
> > > com/> 
> > I haven't read the patches yet but I'm concerned that today the driver
> > is pretty well-behaved and this new patch series introduces a spec
> > violation. Not fixing existing spec violations is okay, but adding new
> > ones is a red flag. I think we need to figure out a clean solution.

Nobody has reviewed the kernel patches yet. My main concern therefore is that 
the kernel patches are already too complex, because the current situation is 
that only Dominique is handling 9p patches on the kernel side, and he barely 
has time for 9p anymore.

Another reason for me to catch up on reading current kernel code and stepping 
in as reviewer of 9p on the kernel side ASAP, independent of this issue.

As for the current kernel patches' complexity: I can certainly drop patch 7 
entirely, as it is probably just overkill. Patch 4 is then the biggest chunk; 
I have to see if I can simplify it, and whether it would make sense to squash 
it with patch 3.

> > 
> > > > - The QEMU 9pfs code passes iovecs directly to preadv(2) and will fail
> > > > 
> > > >   with EINVAL when called with more than IOV_MAX iovecs
> > > >   (hw/9pfs/9p.c:v9fs_read())
> > > 
> > > Hmm, which makes me wonder why I never encountered this error during
> > > testing.
> > > 
> > > Most people will use the 9p qemu 'local' fs driver backend in practice,
> > > so
> > > that v9fs_read() call would translate for most people to this
> > > implementation on QEMU side (hw/9p/9p-local.c):
> > > 
> > > static ssize_t local_preadv(FsContext *ctx, V9fsFidOpenState *fs,
> > > 
> > >                             const struct iovec *iov,
> > >                             int iovcnt, off_t offset)
> > > 
> > > {
> > > #ifdef CONFIG_PREADV
> > > 
> > >     return preadv(fs->fd, iov, iovcnt, offset);
> > > 
> > > #else
> > > 
> > >     int err = lseek(fs->fd, offset, SEEK_SET);
> > >     if (err == -1) {
> > >     
> > >         return err;
> > >     
> > >     } else {
> > >     
> > >         return readv(fs->fd, iov, iovcnt);
> > >     
> > >     }
> > > 
> > > #endif
> > > }
> > > 
> > > > Unless I misunderstood the code, neither side can take advantage of
> > > > the
> > > > new 32k descriptor chain limit?
> > > > 
> > > > Thanks,
> > > > Stefan
> > > 
> > > I need to check that when I have some more time. One possible
> > > explanation
> > > might be that preadv() already has this wrapped into a loop in its
> > > implementation to circumvent a limit like IOV_MAX. It might be another
> > > "it
> > > works, but not portable" issue, but not sure.
> > > 
> > > There are still a bunch of other issues I have to resolve. If you look
> > > at
> > > net/9p/client.c on kernel side, you'll notice that it basically does
> > > this ATM> > 
> > >     kmalloc(msize);
> 
> Note that this is done twice : once for the T message (client request) and
> once for the R message (server answer). The 9p driver could adjust the size
> of the T message to what's really needed instead of allocating the full
> msize. R message size is not known though.

Would it make sense to add a second virtio ring, dedicated to server 
responses, to solve this? IIRC the 9p server already calculates appropriate 
exact sizes for each response type. So the server could just push the space 
that's really needed for its responses.
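
On the QEMU side that would mechanically boil down to registering one more 
queue next to the existing request queue, roughly (hypothetical sketch, 
resp_vq and handle_9p_response are made-up names, not existing code):

    v->vq      = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
    /* hypothetical second queue, reserved for server responses */
    v->resp_vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_response);

Whether that actually helps is a separate question, since with virtio the T 
and R buffers of a request are already supplied together in one descriptor 
chain.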

> > > for every 9p request. So not only does it allocate much more memory for
> > > every request than actually required (i.e. say 9pfs was mounted with
> > > msize=8M, then a 9p request that actually would just need 1k would
> > > nevertheless allocate 8M), but also it allocates > PAGE_SIZE, which
> > > obviously may fail at any time.> 
> > The PAGE_SIZE limitation sounds like a kmalloc() vs vmalloc() situation.

Huh, I didn't even consider vmalloc(). I just tried the kvmalloc() wrapper as 
a quick & dirty test, but it crashed in the same way as kmalloc() with large 
msize values, immediately on mounting:

diff --git a/net/9p/client.c b/net/9p/client.c
index a75034fa249b..cfe300a4b6ca 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -227,15 +227,18 @@ static int parse_opts(char *opts, struct p9_client 
*clnt)
 static int p9_fcall_init(struct p9_client *c, struct p9_fcall *fc,
                         int alloc_msize)
 {
-       if (likely(c->fcall_cache) && alloc_msize == c->msize) {
+       //if (likely(c->fcall_cache) && alloc_msize == c->msize) {
+       if (false) {
                fc->sdata = kmem_cache_alloc(c->fcall_cache, GFP_NOFS);
                fc->cache = c->fcall_cache;
        } else {
-               fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
+               fc->sdata = kvmalloc(alloc_msize, GFP_NOFS);
                fc->cache = NULL;
        }
-       if (!fc->sdata)
+       if (!fc->sdata) {
+               pr_info("%s !fc->sdata", __func__);
                return -ENOMEM;
+       }
        fc->capacity = alloc_msize;
        return 0;
 }

I'll try to look at this over the weekend; I would have expected this hack to 
bypass this issue.
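
Side note: if fc->sdata may now come from kvmalloc(), the corresponding free 
path has to use kvfree() instead of kfree(), otherwise a vmalloc()ed buffer 
would be handed to the wrong allocator. A minimal sketch, assuming the free 
helper mirrors the two allocation paths of p9_fcall_init() above:

static void p9_fcall_fini(struct p9_fcall *fc)
{
	if (fc->cache)
		kmem_cache_free(fc->cache, fc->sdata);
	else
		kvfree(fc->sdata);	/* handles kmalloc()ed and
					 * vmalloc()ed buffers alike */
}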

> > I saw zerocopy code in the 9p guest driver but didn't investigate when
> > it's used. Maybe that should be used for large requests (file
> > reads/writes)?
> 
> This is the case already : zero-copy is only used for reads/writes/readdir
> if the requested size is 1k or more.
> 
> Also you'll note that in this case, the 9p driver doesn't allocate msize
> for the T/R messages but only 4k, which is largely enough to hold the
> header.
> 
> 	/*
> 	 * We allocate a inline protocol data of only 4k bytes.
> 	 * The actual content is passed in zero-copy fashion.
> 	 */
> 	req = p9_client_prepare_req(c, type, P9_ZC_HDR_SZ, fmt, ap);
> 
> and
> 
> /* size of header for zero copy read/write */
> #define P9_ZC_HDR_SZ 4096
> 
> A huge msize only makes sense for Twrite, Rread and Rreaddir because
> of the amount of data they convey. All other messages certainly fit
> in a couple of kilobytes only (sorry, don't remember the numbers).
> 
> A first change should be to allocate MIN(XXX, msize) for the
> regular non-zc case, where XXX could be a reasonable fixed
> value (8k?). In the case of T messages, it is even possible
> to adjust the size to what's exactly needed, ala snprintf(NULL).

Good idea actually! That would limit this problem to reviewing the 9p specs 
and picking one reasonable max value. Because you are right, those message 
types are tiny. It's probably not worth piling up new code to calculate exact 
message sizes for each one of them.

Adding some safety net would make sense though, to ensure that e.g. if a new 
message type is added in the future, this value is reviewed as well; something 
like:

static int max_msg_size(int msg_type) {
    switch (msg_type) {
        /* large zero copy messages */
        case Twrite:
        case Tread:
        case Treaddir:
            BUG_ON(true);

        /* small messages */
        case Tversion:
        ....
            return 8 * 1024; /* to be replaced with appropriate max value */
    }
}

That way the compiler would bark on future additions (provided the switch is 
over the message type enum and has no default case, so that -Wswitch flags any 
unhandled type). But if in doubt, a simple comment on the msg type enum might 
do as well.

> > virtio-blk/scsi don't memcpy data into a new buffer, they
> > directly access page cache or O_DIRECT pinned pages.
> > 
> > Stefan
> 
> Cheers,
> 
> --
> Greg




^ permalink raw reply related	[flat|nested] 97+ messages in thread

* Re: [Virtio-fs] [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
@ 2021-10-08 14:24           ` Christian Schoenebeck
  0 siblings, 0 replies; 97+ messages in thread
From: Christian Schoenebeck @ 2021-10-08 14:24 UTC (permalink / raw)
  To: Greg Kurz
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	Jason Wang, Amit Shah, David Hildenbrand, qemu-devel,
	Raphael Norwitz, virtio-fs, Eric Auger, Hanna Reitz,
	Gonglei (Arei),
	Gerd Hoffmann, Marc-André Lureau, Paolo Bonzini, Fam Zheng

On Freitag, 8. Oktober 2021 09:25:33 CEST Greg Kurz wrote:
> On Thu, 7 Oct 2021 16:42:49 +0100
> 
> Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > On Thu, Oct 07, 2021 at 02:51:55PM +0200, Christian Schoenebeck wrote:
> > > On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan Hajnoczi wrote:
> > > > On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian Schoenebeck wrote:
> > > > > At the moment the maximum transfer size with virtio is limited to 4M
> > > > > (1024 * PAGE_SIZE). This series raises this limit to its maximum
> > > > > theoretical possible transfer size of 128M (32k pages) according to
> > > > > the
> > > > > virtio specs:
> > > > > 
> > > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01
> > > > > .html#
> > > > > x1-240006
> > > > 
> > > > Hi Christian,
> 
> > > > I took a quick look at the code:
> Hi,
> 
> Thanks Stefan for sharing virtio expertise and helping Christian !
> 
> > > > - The Linux 9p driver restricts descriptor chains to 128 elements
> > > > 
> > > >   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)
> > > 
> > > Yes, that's the limitation that I am about to remove (WIP); current
> > > kernel
> > > patches:
> > > https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyte.
> > > com/> 
> > I haven't read the patches yet but I'm concerned that today the driver
> > is pretty well-behaved and this new patch series introduces a spec
> > violation. Not fixing existing spec violations is okay, but adding new
> > ones is a red flag. I think we need to figure out a clean solution.

Nobody has reviewed the kernel patches yet. My main concern therefore is that 
the kernel patches are already too complex, because the current situation is 
that only Dominique is handling 9p patches on the kernel side, and he barely 
has time for 9p anymore.

Another reason for me to catch up on reading current kernel code and stepping 
in as reviewer of 9p on the kernel side ASAP, independent of this issue.

As for the current kernel patches' complexity: I can certainly drop patch 7 
entirely, as it is probably just overkill. Patch 4 is then the biggest chunk; 
I have to see if I can simplify it, and whether it would make sense to squash 
it with patch 3.

> > 
> > > > - The QEMU 9pfs code passes iovecs directly to preadv(2) and will fail
> > > > 
> > > >   with EINVAL when called with more than IOV_MAX iovecs
> > > >   (hw/9pfs/9p.c:v9fs_read())
> > > 
> > > Hmm, which makes me wonder why I never encountered this error during
> > > testing.
> > > 
> > > Most people will use the 9p qemu 'local' fs driver backend in practice,
> > > so
> > > that v9fs_read() call would translate for most people to this
> > > implementation on QEMU side (hw/9p/9p-local.c):
> > > 
> > > static ssize_t local_preadv(FsContext *ctx, V9fsFidOpenState *fs,
> > > 
> > >                             const struct iovec *iov,
> > >                             int iovcnt, off_t offset)
> > > 
> > > {
> > > #ifdef CONFIG_PREADV
> > > 
> > >     return preadv(fs->fd, iov, iovcnt, offset);
> > > 
> > > #else
> > > 
> > >     int err = lseek(fs->fd, offset, SEEK_SET);
> > >     if (err == -1) {
> > >     
> > >         return err;
> > >     
> > >     } else {
> > >     
> > >         return readv(fs->fd, iov, iovcnt);
> > >     
> > >     }
> > > 
> > > #endif
> > > }
> > > 
> > > > Unless I misunderstood the code, neither side can take advantage of
> > > > the
> > > > new 32k descriptor chain limit?
> > > > 
> > > > Thanks,
> > > > Stefan
> > > 
> > > I need to check that when I have some more time. One possible
> > > explanation
> > > might be that preadv() already has this wrapped into a loop in its
> > > implementation to circumvent a limit like IOV_MAX. It might be another
> > > "it
> > > works, but not portable" issue, but not sure.
> > > 
> > > There are still a bunch of other issues I have to resolve. If you look
> > > at
> > > net/9p/client.c on kernel side, you'll notice that it basically does
> > > this ATM> > 
> > >     kmalloc(msize);
> 
> Note that this is done twice : once for the T message (client request) and
> once for the R message (server answer). The 9p driver could adjust the size
> of the T message to what's really needed instead of allocating the full
> msize. R message size is not known though.

Would it make sense to add a second virtio ring, dedicated to server 
responses, to solve this? IIRC the 9p server already calculates appropriate 
exact sizes for each response type. So the server could just push the space 
that's really needed for its responses.

> > > for every 9p request. So not only does it allocate much more memory for
> > > every request than actually required (i.e. say 9pfs was mounted with
> > > msize=8M, then a 9p request that actually would just need 1k would
> > > nevertheless allocate 8M), but also it allocates > PAGE_SIZE, which
> > > obviously may fail at any time.> 
> > The PAGE_SIZE limitation sounds like a kmalloc() vs vmalloc() situation.

Huh, I didn't even consider vmalloc(). I just tried the kvmalloc() wrapper as 
a quick & dirty test, but it crashed in the same way as kmalloc() with large 
msize values, immediately on mounting:

diff --git a/net/9p/client.c b/net/9p/client.c
index a75034fa249b..cfe300a4b6ca 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -227,15 +227,18 @@ static int parse_opts(char *opts, struct p9_client 
*clnt)
 static int p9_fcall_init(struct p9_client *c, struct p9_fcall *fc,
                         int alloc_msize)
 {
-       if (likely(c->fcall_cache) && alloc_msize == c->msize) {
+       //if (likely(c->fcall_cache) && alloc_msize == c->msize) {
+       if (false) {
                fc->sdata = kmem_cache_alloc(c->fcall_cache, GFP_NOFS);
                fc->cache = c->fcall_cache;
        } else {
-               fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
+               fc->sdata = kvmalloc(alloc_msize, GFP_NOFS);
                fc->cache = NULL;
        }
-       if (!fc->sdata)
+       if (!fc->sdata) {
+               pr_info("%s !fc->sdata", __func__);
                return -ENOMEM;
+       }
        fc->capacity = alloc_msize;
        return 0;
 }

I'll try to look at this over the weekend; I would have expected this hack to 
bypass this issue.

> > I saw zerocopy code in the 9p guest driver but didn't investigate when
> > it's used. Maybe that should be used for large requests (file
> > reads/writes)?
> 
> This is the case already : zero-copy is only used for reads/writes/readdir
> if the requested size is 1k or more.
> 
> Also you'll note that in this case, the 9p driver doesn't allocate msize
> for the T/R messages but only 4k, which is largely enough to hold the
> header.
> 
> 	/*
> 	 * We allocate a inline protocol data of only 4k bytes.
> 	 * The actual content is passed in zero-copy fashion.
> 	 */
> 	req = p9_client_prepare_req(c, type, P9_ZC_HDR_SZ, fmt, ap);
> 
> and
> 
> /* size of header for zero copy read/write */
> #define P9_ZC_HDR_SZ 4096
> 
> A huge msize only makes sense for Twrite, Rread and Rreaddir because
> of the amount of data they convey. All other messages certainly fit
> in a couple of kilobytes only (sorry, don't remember the numbers).
> 
> A first change should be to allocate MIN(XXX, msize) for the
> regular non-zc case, where XXX could be a reasonable fixed
> value (8k?). In the case of T messages, it is even possible
> to adjust the size to what's exactly needed, ala snprintf(NULL).

Good idea actually! That would limit this problem to reviewing the 9p specs 
and picking one reasonable max value. Because you are right, those message 
types are tiny. It's probably not worth piling up new code to calculate exact 
message sizes for each one of them.

Adding some safety net would make sense though, to ensure that e.g. if a new 
message type is added in the future, this value is reviewed as well; something 
like:

static int max_msg_size(int msg_type) {
    switch (msg_type) {
        /* large zero copy messages */
        case Twrite:
        case Tread:
        case Treaddir:
            BUG_ON(true);

        /* small messages */
        case Tversion:
        ....
            return 8 * 1024; /* to be replaced with appropriate max value */
    }
}

That way the compiler would bark on future additions (provided the switch is 
over the message type enum and has no default case, so that -Wswitch flags any 
unhandled type). But if in doubt, a simple comment on the msg type enum might 
do as well.

> > virtio-blk/scsi don't memcpy data into a new buffer, they
> > directly access page cache or O_DIRECT pinned pages.
> > 
> > Stefan
> 
> Cheers,
> 
> --
> Greg



^ permalink raw reply related	[flat|nested] 97+ messages in thread

* Re: [PATCH v2 1/3] virtio: turn VIRTQUEUE_MAX_SIZE into a variable
  2021-10-07 15:18                     ` [Virtio-fs] " Stefan Hajnoczi
@ 2021-10-08 14:48                       ` Christian Schoenebeck
  -1 siblings, 0 replies; 97+ messages in thread
From: Christian Schoenebeck @ 2021-10-08 14:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Stefan Hajnoczi, Kevin Wolf, Laurent Vivier, qemu-block,
	Michael S. Tsirkin, Jason Wang, Amit Shah, David Hildenbrand,
	Greg Kurz, virtio-fs, Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Paolo Bonzini, Marc-André Lureau, Fam Zheng,
	Raphael Norwitz, Dr. David Alan Gilbert

On Donnerstag, 7. Oktober 2021 17:18:03 CEST Stefan Hajnoczi wrote:
> On Thu, Oct 07, 2021 at 03:09:16PM +0200, Christian Schoenebeck wrote:
> > On Mittwoch, 6. Oktober 2021 16:42:34 CEST Stefan Hajnoczi wrote:
> > > On Wed, Oct 06, 2021 at 02:50:07PM +0200, Christian Schoenebeck wrote:
> > > > On Mittwoch, 6. Oktober 2021 13:06:55 CEST Stefan Hajnoczi wrote:
> > > > > On Tue, Oct 05, 2021 at 06:32:46PM +0200, Christian Schoenebeck 
wrote:
> > > > > > On Dienstag, 5. Oktober 2021 17:10:40 CEST Stefan Hajnoczi wrote:
> > > > > > > On Tue, Oct 05, 2021 at 03:15:26PM +0200, Christian Schoenebeck
> > 
> > wrote:
> > > > > > > > On Dienstag, 5. Oktober 2021 14:45:56 CEST Stefan Hajnoczi 
wrote:
> > > > > > > > > On Mon, Oct 04, 2021 at 09:38:04PM +0200, Christian
> > > > > > > > > Schoenebeck
> > > > 
> > > > wrote:
> > > > > > > > > > Refactor VIRTQUEUE_MAX_SIZE to effectively become a
> > > > > > > > > > runtime
> > > > > > > > > > variable per virtio user.
> > > > > > > > > 
> > > > > > > > > virtio user == virtio device model?
> > > > > > > > 
> > > > > > > > Yes
> > > > > > > > 
> > > > > > > > > > Reasons:
> > > > > > > > > > 
> > > > > > > > > > (1) VIRTQUEUE_MAX_SIZE should reflect the absolute
> > > > > > > > > > theoretical
> > > > > > > > > > 
> > > > > > > > > >     maximum queue size possible. Which is actually the
> > > > > > > > > >     maximum
> > > > > > > > > >     queue size allowed by the virtio protocol. The
> > > > > > > > > >     appropriate
> > > > > > > > > >     value for VIRTQUEUE_MAX_SIZE would therefore be 32768:
> > > > > > > > > >     
> > > > > > > > > >     https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/vi
> > > > > > > > > >     rtio
> > > > > > > > > >     -v1.
> > > > > > > > > >     1-cs
> > > > > > > > > >     01.h
> > > > > > > > > >     tml#x1-240006
> > > > > > > > > >     
> > > > > > > > > >     Apparently VIRTQUEUE_MAX_SIZE was instead defined with
> > > > > > > > > >     a
> > > > > > > > > >     more or less arbitrary value of 1024 in the past,
> > > > > > > > > >     which
> > > > > > > > > >     limits the maximum transfer size with virtio to 4M
> > > > > > > > > >     (more precise: 1024 * PAGE_SIZE, with the latter
> > > > > > > > > >     typically
> > > > > > > > > >     being 4k).
> > > > > > > > > 
> > > > > > > > > Being equal to IOV_MAX is a likely reason. Buffers with more
> > > > > > > > > iovecs
> > > > > > > > > than
> > > > > > > > > that cannot be passed to host system calls (sendmsg(2),
> > > > > > > > > pwritev(2),
> > > > > > > > > etc).
> > > > > > > > 
> > > > > > > > Yes, that's use case dependent. Hence the solution to opt-in
> > > > > > > > if it
> > > > > > > > is
> > > > > > > > desired and feasible.
> > > > > > > > 
> > > > > > > > > > (2) Additionally the current value of 1024 poses a hidden
> > > > > > > > > > limit,
> > > > > > > > > > 
> > > > > > > > > >     invisible to guest, which causes a system hang with
> > > > > > > > > >     the
> > > > > > > > > >     following QEMU error if guest tries to exceed it:
> > > > > > > > > >     
> > > > > > > > > >     virtio: too many write descriptors in indirect table
> > > > > > > > > 
> > > > > > > > > I don't understand this point. 2.6.5 The Virtqueue
> > > > > > > > > Descriptor
> > > > > > > > > Table
> > > > > > 
> > > > > > says:
> > > > > > > > >   The number of descriptors in the table is defined by the
> > > > > > > > >   queue
> > > > > > > > >   size
> > > > > > > > >   for
> > > > > > > > > 
> > > > > > > > > this virtqueue: this is the maximum possible descriptor
> > > > > > > > > chain
> > > > > > > > > length.
> > > > > > > > > 
> > > > > > > > > and 2.6.5.3.1 Driver Requirements: Indirect Descriptors 
says:
> > > > > > > > >   A driver MUST NOT create a descriptor chain longer than
> > > > > > > > >   the
> > > > > > > > >   Queue
> > > > > > > > >   Size
> > > > > > > > >   of
> > > > > > > > > 
> > > > > > > > > the device.
> > > > > > > > > 
> > > > > > > > > Do you mean a broken/malicious guest driver that is
> > > > > > > > > violating
> > > > > > > > > the
> > > > > > > > > spec?
> > > > > > > > > That's not a hidden limit, it's defined by the spec.
> > > > > > > > 
> > > > > > > > https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00781
> > > > > > > > .htm
> > > > > > > > l
> > > > > > > > https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00788
> > > > > > > > .htm
> > > > > > > > l
> > > > > > > > 
> > > > > > > > You can already go beyond that queue size at runtime with the
> > > > > > > > indirection
> > > > > > > > table. The only actual limit is the currently hard coded value
> > > > > > > > of
> > > > > > > > 1k
> > > > > > > > pages.
> > > > > > > > Hence the suggestion to turn that into a variable.
> > > > > > > 
> > > > > > > Exceeding Queue Size is a VIRTIO spec violation. Drivers that
> > > > > > > operate
> > > > > > > outsided the spec do so at their own risk. They may not be
> > > > > > > compatible
> > > > > > > with all device implementations.
> > > > > > 
> > > > > > Yes, I am ware about that. And still, this practice is already
> > > > > > done,
> > > > > > which
> > > > > > apparently is not limited to 9pfs.
> > > > > > 
> > > > > > > The limit is not hidden, it's Queue Size as defined by the spec
> > > > > > > :).
> > > > > > > 
> > > > > > > If you have a driver that is exceeding the limit, then please
> > > > > > > fix
> > > > > > > the
> > > > > > > driver.
> > > > > > 
> > > > > > I absolutely understand your position, but I hope you also
> > > > > > understand
> > > > > > that
> > > > > > this violation of the specs is a theoretical issue, it is not a
> > > > > > real-life
> > > > > > problem right now, and due to lack of man power unfortunately I
> > > > > > have
> > > > > > to
> > > > > > prioritize real-life problems over theoretical ones ATM. Keep in
> > > > > > mind
> > > > > > that
> > > > > > right now I am the only person working on 9pfs actively, I do this
> > > > > > voluntarily whenever I find a free time slice, and I am not paid
> > > > > > for
> > > > > > it
> > > > > > either.
> > > > > > 
> > > > > > I don't see any reasonable way with reasonable effort to do what
> > > > > > you
> > > > > > are
> > > > > > asking for here in 9pfs, and Greg may correct me here if I am
> > > > > > saying
> > > > > > anything wrong. If you are seeing any specific real-life issue
> > > > > > here,
> > > > > > then
> > > > > > please tell me which one, otherwise I have to postpone that "specs
> > > > > > violation" issue.
> > > > > > 
> > > > > > There is still a long list of real problems that I need to hunt
> > > > > > down
> > > > > > in
> > > > > > 9pfs, afterwards I can continue with theoretical ones if you want,
> > > > > > but
> > > > > > right now I simply can't, sorry.
> > > > > 
> > > > > I understand. If you don't have time to fix the Linux virtio-9p
> > > > > driver
> > > > > then that's fine.
> > > > 
> > > > I will look at this again, but it might be tricky. On doubt I'll
> > > > postpone
> > > > it.>
> > > > 
> > > > > I still wanted us to agree on the spec position because the commit
> > > > > description says it's a "hidden limit", which is incorrect. It might
> > > > > seem pedantic, but my concern is that misconceptions can spread if
> > > > > we
> > > > > let them. That could cause people to write incorrect code later on.
> > > > > Please update the commit description either by dropping 2) or by
> > > > > 
> > > > > replacing it with something else. For example:
> > > > >   2) The Linux virtio-9p guest driver does not honor the VIRTIO
> > > > >   Queue
> > > > >   
> > > > >      Size value and can submit descriptor chains that exceed it.
> > > > >      That is
> > > > >      a spec violation but is accepted by QEMU's device
> > > > >      implementation.
> > > > >      
> > > > >      When the guest creates a descriptor chain larger than 1024 the
> > > > >      following QEMU error is printed and the guest hangs:
> > > > >      
> > > > >      virtio: too many write descriptors in indirect table
> > > > 
> > > > I am fine with both, probably preferring the text block above instead
> > > > of
> > > > silently dropping the reason, just for clarity.
> > > > 
> > > > But keep in mind that this might not be limited to virtio-9p as your
> > > > text
> > > > would suggest, see below.
> > > > 
> > > > > > > > > > (3) Unfortunately not all virtio users in QEMU would
> > > > > > > > > > currently
> > > > > > > > > > 
> > > > > > > > > >     work correctly with the new value of 32768.
> > > > > > > > > > 
> > > > > > > > > > So let's turn this hard coded global value into a runtime
> > > > > > > > > > variable as a first step in this commit, configurable for
> > > > > > > > > > each
> > > > > > > > > > virtio user by passing a corresponding value with
> > > > > > > > > > virtio_init()
> > > > > > > > > > call.
> > > > > > > > > 
> > > > > > > > > virtio_add_queue() already has an int queue_size argument,
> > > > > > > > > why
> > > > > > > > > isn't
> > > > > > > > > that enough to deal with the maximum queue size? There's
> > > > > > > > > probably a
> > > > > > > > > good
> > > > > > > > > reason for it, but please include it in the commit
> > > > > > > > > description.
> > > > > > > > 
> > > > > > > > [...]
> > > > > > > > 
> > > > > > > > > Can you make this value per-vq instead of per-vdev since
> > > > > > > > > virtqueues
> > > > > > > > > can
> > > > > > > > > have different queue sizes?
> > > > > > > > > 
> > > > > > > > > The same applies to the rest of this patch. Anything using
> > > > > > > > > vdev->queue_max_size should probably use vq->vring.num
> > > > > > > > > instead.
> > > > > > > > 
> > > > > > > > I would like to avoid that and keep it per device. The maximum
> > > > > > > > size
> > > > > > > > stored
> > > > > > > > there is the maximum size supported by virtio user (or vortio
> > > > > > > > device
> > > > > > > > model,
> > > > > > > > however you want to call it). So that's really a limit per
> > > > > > > > device,
> > > > > > > > not
> > > > > > > > per
> > > > > > > > queue, as no queue of the device would ever exceed that limit.
> > > > > > > > 
> > > > > > > > Plus a lot more code would need to be refactored, which I
> > > > > > > > think is
> > > > > > > > unnecessary.
> > > > > > > 
> > > > > > > I'm against a per-device limit because it's a concept that
> > > > > > > cannot
> > > > > > > accurately describe reality. Some devices have multiple classes
> > > > > > > of
> > > > > > 
> > > > > > It describes current reality, because VIRTQUEUE_MAX_SIZE obviously
> > > > > > is
> > > > > > not
> > > > > > per queue either ATM, and nobody ever cared.
> > > > > > 
> > > > > > All this series does, is allowing to override that currently
> > > > > > project-wide
> > > > > > compile-time constant to a per-driver-model compile-time constant.
> > > > > > Which
> > > > > > makes sense, because that's what it is: some drivers could cope
> > > > > > with
> > > > > > any
> > > > > > transfer size, and some drivers are constrained to a certain
> > > > > > maximum
> > > > > > application specific transfer size (e.g. IOV_MAX).
> > > > > > 
> > > > > > > virtqueues and they are sized differently, so a per-device limit
> > > > > > > is
> > > > > > > insufficient. virtio-net has separate rx_queue_size and
> > > > > > > tx_queue_size
> > > > > > > parameters (plus a control vq hardcoded to 64 descriptors).
> > > > > > 
> > > > > > I simply find this overkill. This value semantically means "my
> > > > > > driver
> > > > > > model
> > > > > > supports at any time and at any coincidence at the very most x *
> > > > > > PAGE_SIZE
> > > > > > = max_transfer_size". Do you see any driver that might want a more
> > > > > > fine
> > > > > > graded control over this?
> > > > > 
> > > > > One reason why per-vq limits could make sense is that the maximum
> > > > > possible number of struct elements is allocated upfront in some code
> > > > > paths. Those code paths may need to differentiate between per-vq
> > > > > limits
> > > > > for performance or memory utilization reasons. Today some places
> > > > > allocate 1024 elements on the stack in some code paths, but maybe
> > > > > that's
> > > > > not acceptable when the per-device limit is 32k. This can matter
> > > > > when a
> > > > > device has vqs with very different sizes.
> > > > 
> > > > [...]
> > > > 
> > > > > > ... I leave that up to Michael or whoever might be in charge to
> > > > > > decide. I
> > > > > > still find this overkill, but I will adapt this to whatever the
> > > > > > decision
> > > > > > eventually will be in v3.
> > > > > > 
> > > > > > But then please tell me the precise representation that you find
> > > > > > appropriate, i.e. whether you want a new function for that, or
> > > > > > rather
> > > > > > an
> > > > > > additional argument to virtio_add_queue(). Your call.
> > > > > 
> > > > > virtio_add_queue() already takes an int queue_size argument. I think
> > > > > the
> > > > > necessary information is already there.
> > > > > 
> > > > > This patch just needs to be tweaked to use the
> > > > > virtio_queue_get_num()
> > > > > (or a new virtqueue_get_num() API if that's easier because only a
> > > > > VirtQueue *vq pointer is available) instead of introducing a new
> > > > > per-device limit.
> > > > 
> > > > My understanding is that both the original 9p virtio device authors,
> > > > as
> > > > well as other virtio device authors in QEMU have been and are still
> > > > using
> > > > this as a default value (i.e. to allocate some upfront, and the rest
> > > > on
> > > > demand).
> > > > 
> > > > So yes, I know your argument about the specs, but AFAICS if I would
> > > > just
> > > > take this existing numeric argument for the limit, then it would
> > > > probably
> > > > break those other QEMU devices as well.
> > > 
> > > This is a good point that I didn't consider. If guest drivers currently
> > > violate the spec, then restricting descriptor chain length to vring.num
> > > will introduce regressions.
> > > 
> > > We can't use virtio_queue_get_num() directly. A backwards-compatible
> > > 
> > > limit is required:
> > >   int virtio_queue_get_desc_chain_max(VirtIODevice *vdev, int n)
> > >   {
> > >   
> > >       /*
> > >       
> > >        * QEMU historically allowed 1024 descriptors even if the
> > >        * descriptor table was smaller.
> > >        */
> > >       
> > >       return MAX(virtio_queue_get_num(vdev, qidx), 1024);
> > >   
> > >   }
> > 
> > That was an alternative that I thought about as well, but decided against.
> > It would require devices (that would want to support large transmissions
> > sizes)> 
> > to create the virtio queue(s) with the maximum possible size, i.e:
> >   virtio_add_queue(32k);
> 
> The spec allows drivers to set the size of the vring as long as they do
> not exceed Queue Size.
> 
> The Linux drivers accept the device's default size, so you're right that
> this would cause large vrings to be allocated if the device sets the
> virtqueue size to 32k.
> 
> > And that's the point where my current lack of knowledge, of what this
> > would
> > precisely mean to the resulting allocation set, decided against it. I mean
> > would that mean would QEMU's virtio implementation would just a) allocate
> > 32k scatter gather list entries? Or would it rather b) additionally also
> > allocate the destination memory pages as well?
> 
> The vring consumes guest RAM but it just consists of descriptors, not
> the buffer memory pages. The guest RAM requirements are:
> - split layout: 32k * 16 + 6 + 32k * 2 + 6 + 8 * 32k = 851,980 bytes
> - packed layout: 32k * 16 + 4 + 4 = 524,296 bytes
> 
> That's still quite large!
> 
> By the way, virtio-blk currently uses a virtqueue size of 256
> descriptors and this has been found reasonable for disk I/O performance.
> The Linux block layer splits requests at around 1.25 MB for virtio-blk.
> The virtio-blk queue limits are reported by the device and the guest
> Linux block layer uses them to size/split requests appropriately. I'm
> not sure 9p really needs 32k, although you're right that fragmented
> physical memory requires 32k descriptors to describe 128 MB of buffers.
> 
> Going back to the original problem, a vring feature bit could be added
> to the VIRTIO specification indicating that indirect descriptor tables
> are limited to the maximum (32k) instead of Queue Size. This way the
> device's default vring size could be small but drivers could allocate
> indirect descriptor tables that are large when necessary. Then the Linux
> virtio driver API would need to report the maximum supported sglist
> length for a given virtqueue so drivers can take advantage of this
> information.

Due to forced pragmatism, ~1M of unused/wasted space would be acceptable IMHO.

But if that might really go up to 128M+ as you said, if physical RAM is highly 
fragmented, then a cleaner solution would definitely make sense, yes, if 
that's possible.

But as changing the specs etc. is probably a long process, it would make sense 
to first do some more tests with the kernel patches, to find out whether there 
is some show stopper like IOV_MAX anyway.
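
For illustration only, a minimal sketch (not existing QEMU code) of how a 
host-side read could be split at IOV_MAX, to show why IOV_MAX alone would not 
necessarily be a show stopper:

#include <limits.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Like preadv(), but accepts more than IOV_MAX iovecs by issuing several
 * preadv() calls; stops early on error or short read.
 */
static ssize_t preadv_chunked(int fd, const struct iovec *iov, int iovcnt,
                              off_t offset)
{
    ssize_t total = 0;

    while (iovcnt > 0) {
        int n = iovcnt < IOV_MAX ? iovcnt : IOV_MAX;
        size_t chunk = 0;
        ssize_t len;

        for (int i = 0; i < n; i++) {
            chunk += iov[i].iov_len;
        }
        len = preadv(fd, iov, n, offset);
        if (len < 0) {
            return total ? total : -1;
        }
        total += len;
        if ((size_t)len < chunk) {
            break;              /* EOF or short read */
        }
        offset += len;
        iov += n;
        iovcnt -= n;
    }
    return total;
}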

As for whether large transfer sizes make sense at all: well, from the 
benchmarks I made so far I "think" it does make sense going >4M. It might be 
something specific to 9p (it's a full file server); I guess it has higher 
latency than raw virtio block devices. OTOH with M.2 SSDs we now have several 
thousand MB/s, so I'm not sure if the old common transfer size of 1M for block 
devices is still reasonable today.

Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Virtio-fs] [PATCH v2 1/3] virtio: turn VIRTQUEUE_MAX_SIZE into a variable
@ 2021-10-08 14:48                       ` Christian Schoenebeck
  0 siblings, 0 replies; 97+ messages in thread
From: Christian Schoenebeck @ 2021-10-08 14:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	Jason Wang, Amit Shah, David Hildenbrand, Raphael Norwitz,
	virtio-fs, Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Marc-André Lureau, Paolo Bonzini, Fam Zheng

On Donnerstag, 7. Oktober 2021 17:18:03 CEST Stefan Hajnoczi wrote:
> On Thu, Oct 07, 2021 at 03:09:16PM +0200, Christian Schoenebeck wrote:
> > On Mittwoch, 6. Oktober 2021 16:42:34 CEST Stefan Hajnoczi wrote:
> > > On Wed, Oct 06, 2021 at 02:50:07PM +0200, Christian Schoenebeck wrote:
> > > > On Mittwoch, 6. Oktober 2021 13:06:55 CEST Stefan Hajnoczi wrote:
> > > > > On Tue, Oct 05, 2021 at 06:32:46PM +0200, Christian Schoenebeck 
wrote:
> > > > > > On Dienstag, 5. Oktober 2021 17:10:40 CEST Stefan Hajnoczi wrote:
> > > > > > > On Tue, Oct 05, 2021 at 03:15:26PM +0200, Christian Schoenebeck
> > 
> > wrote:
> > > > > > > > On Dienstag, 5. Oktober 2021 14:45:56 CEST Stefan Hajnoczi 
wrote:
> > > > > > > > > On Mon, Oct 04, 2021 at 09:38:04PM +0200, Christian
> > > > > > > > > Schoenebeck
> > > > 
> > > > wrote:
> > > > > > > > > > Refactor VIRTQUEUE_MAX_SIZE to effectively become a
> > > > > > > > > > runtime
> > > > > > > > > > variable per virtio user.
> > > > > > > > > 
> > > > > > > > > virtio user == virtio device model?
> > > > > > > > 
> > > > > > > > Yes
> > > > > > > > 
> > > > > > > > > > Reasons:
> > > > > > > > > > 
> > > > > > > > > > (1) VIRTQUEUE_MAX_SIZE should reflect the absolute
> > > > > > > > > > theoretical
> > > > > > > > > > 
> > > > > > > > > >     maximum queue size possible. Which is actually the
> > > > > > > > > >     maximum
> > > > > > > > > >     queue size allowed by the virtio protocol. The
> > > > > > > > > >     appropriate
> > > > > > > > > >     value for VIRTQUEUE_MAX_SIZE would therefore be 32768:
> > > > > > > > > >     
> > > > > > > > > >     https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/vi
> > > > > > > > > >     rtio
> > > > > > > > > >     -v1.
> > > > > > > > > >     1-cs
> > > > > > > > > >     01.h
> > > > > > > > > >     tml#x1-240006
> > > > > > > > > >     
> > > > > > > > > >     Apparently VIRTQUEUE_MAX_SIZE was instead defined with
> > > > > > > > > >     a
> > > > > > > > > >     more or less arbitrary value of 1024 in the past,
> > > > > > > > > >     which
> > > > > > > > > >     limits the maximum transfer size with virtio to 4M
> > > > > > > > > >     (more precise: 1024 * PAGE_SIZE, with the latter
> > > > > > > > > >     typically
> > > > > > > > > >     being 4k).
> > > > > > > > > 
> > > > > > > > > Being equal to IOV_MAX is a likely reason. Buffers with more
> > > > > > > > > iovecs
> > > > > > > > > than
> > > > > > > > > that cannot be passed to host system calls (sendmsg(2),
> > > > > > > > > pwritev(2),
> > > > > > > > > etc).
> > > > > > > > 
> > > > > > > > Yes, that's use case dependent. Hence the solution to opt-in
> > > > > > > > if it
> > > > > > > > is
> > > > > > > > desired and feasible.
> > > > > > > > 
> > > > > > > > > > (2) Additionally the current value of 1024 poses a hidden
> > > > > > > > > > limit,
> > > > > > > > > > 
> > > > > > > > > >     invisible to guest, which causes a system hang with
> > > > > > > > > >     the
> > > > > > > > > >     following QEMU error if guest tries to exceed it:
> > > > > > > > > >     
> > > > > > > > > >     virtio: too many write descriptors in indirect table
> > > > > > > > > 
> > > > > > > > > I don't understand this point. 2.6.5 The Virtqueue
> > > > > > > > > Descriptor
> > > > > > > > > Table
> > > > > > 
> > > > > > says:
> > > > > > > > >   The number of descriptors in the table is defined by the
> > > > > > > > >   queue
> > > > > > > > >   size
> > > > > > > > >   for
> > > > > > > > > 
> > > > > > > > > this virtqueue: this is the maximum possible descriptor
> > > > > > > > > chain
> > > > > > > > > length.
> > > > > > > > > 
> > > > > > > > > and 2.6.5.3.1 Driver Requirements: Indirect Descriptors 
says:
> > > > > > > > >   A driver MUST NOT create a descriptor chain longer than
> > > > > > > > >   the
> > > > > > > > >   Queue
> > > > > > > > >   Size
> > > > > > > > >   of
> > > > > > > > > 
> > > > > > > > > the device.
> > > > > > > > > 
> > > > > > > > > Do you mean a broken/malicious guest driver that is
> > > > > > > > > violating
> > > > > > > > > the
> > > > > > > > > spec?
> > > > > > > > > That's not a hidden limit, it's defined by the spec.
> > > > > > > > 
> > > > > > > > https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00781
> > > > > > > > .htm
> > > > > > > > l
> > > > > > > > https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00788
> > > > > > > > .htm
> > > > > > > > l
> > > > > > > > 
> > > > > > > > You can already go beyond that queue size at runtime with the
> > > > > > > > indirection
> > > > > > > > table. The only actual limit is the currently hard coded value
> > > > > > > > of
> > > > > > > > 1k
> > > > > > > > pages.
> > > > > > > > Hence the suggestion to turn that into a variable.
> > > > > > > 
> > > > > > > Exceeding Queue Size is a VIRTIO spec violation. Drivers that
> > > > > > > operate
> > > > > > > outsided the spec do so at their own risk. They may not be
> > > > > > > compatible
> > > > > > > with all device implementations.
> > > > > > 
> > > > > > Yes, I am ware about that. And still, this practice is already
> > > > > > done,
> > > > > > which
> > > > > > apparently is not limited to 9pfs.
> > > > > > 
> > > > > > > The limit is not hidden, it's Queue Size as defined by the spec
> > > > > > > :).
> > > > > > > 
> > > > > > > If you have a driver that is exceeding the limit, then please
> > > > > > > fix
> > > > > > > the
> > > > > > > driver.
> > > > > > 
> > > > > > I absolutely understand your position, but I hope you also
> > > > > > understand
> > > > > > that
> > > > > > this violation of the specs is a theoretical issue, it is not a
> > > > > > real-life
> > > > > > problem right now, and due to lack of man power unfortunately I
> > > > > > have
> > > > > > to
> > > > > > prioritize real-life problems over theoretical ones ATM. Keep in
> > > > > > mind
> > > > > > that
> > > > > > right now I am the only person working on 9pfs actively, I do this
> > > > > > voluntarily whenever I find a free time slice, and I am not paid
> > > > > > for
> > > > > > it
> > > > > > either.
> > > > > > 
> > > > > > I don't see any reasonable way with reasonable effort to do what
> > > > > > you
> > > > > > are
> > > > > > asking for here in 9pfs, and Greg may correct me here if I am
> > > > > > saying
> > > > > > anything wrong. If you are seeing any specific real-life issue
> > > > > > here,
> > > > > > then
> > > > > > please tell me which one, otherwise I have to postpone that "specs
> > > > > > violation" issue.
> > > > > > 
> > > > > > There is still a long list of real problems that I need to hunt
> > > > > > down
> > > > > > in
> > > > > > 9pfs, afterwards I can continue with theoretical ones if you want,
> > > > > > but
> > > > > > right now I simply can't, sorry.
> > > > > 
> > > > > I understand. If you don't have time to fix the Linux virtio-9p
> > > > > driver
> > > > > then that's fine.
> > > > 
> > > > I will look at this again, but it might be tricky. On doubt I'll
> > > > postpone
> > > > it.>
> > > > 
> > > > > I still wanted us to agree on the spec position because the commit
> > > > > description says it's a "hidden limit", which is incorrect. It might
> > > > > seem pedantic, but my concern is that misconceptions can spread if
> > > > > we
> > > > > let them. That could cause people to write incorrect code later on.
> > > > > Please update the commit description either by dropping 2) or by
> > > > > 
> > > > > replacing it with something else. For example:
> > > > >   2) The Linux virtio-9p guest driver does not honor the VIRTIO
> > > > >   Queue
> > > > >   
> > > > >      Size value and can submit descriptor chains that exceed it.
> > > > >      That is
> > > > >      a spec violation but is accepted by QEMU's device
> > > > >      implementation.
> > > > >      
> > > > >      When the guest creates a descriptor chain larger than 1024 the
> > > > >      following QEMU error is printed and the guest hangs:
> > > > >      
> > > > >      virtio: too many write descriptors in indirect table
> > > > 
> > > > I am fine with both, probably preferring the text block above instead
> > > > of
> > > > silently dropping the reason, just for clarity.
> > > > 
> > > > But keep in mind that this might not be limited to virtio-9p as your
> > > > text
> > > > would suggest, see below.
> > > > 
> > > > > > > > > > (3) Unfortunately not all virtio users in QEMU would
> > > > > > > > > > currently
> > > > > > > > > > 
> > > > > > > > > >     work correctly with the new value of 32768.
> > > > > > > > > > 
> > > > > > > > > > So let's turn this hard coded global value into a runtime
> > > > > > > > > > variable as a first step in this commit, configurable for
> > > > > > > > > > each
> > > > > > > > > > virtio user by passing a corresponding value with
> > > > > > > > > > virtio_init()
> > > > > > > > > > call.
> > > > > > > > > 
> > > > > > > > > virtio_add_queue() already has an int queue_size argument,
> > > > > > > > > why
> > > > > > > > > isn't
> > > > > > > > > that enough to deal with the maximum queue size? There's
> > > > > > > > > probably a
> > > > > > > > > good
> > > > > > > > > reason for it, but please include it in the commit
> > > > > > > > > description.
> > > > > > > > 
> > > > > > > > [...]
> > > > > > > > 
> > > > > > > > > Can you make this value per-vq instead of per-vdev since
> > > > > > > > > virtqueues
> > > > > > > > > can
> > > > > > > > > have different queue sizes?
> > > > > > > > > 
> > > > > > > > > The same applies to the rest of this patch. Anything using
> > > > > > > > > vdev->queue_max_size should probably use vq->vring.num
> > > > > > > > > instead.
> > > > > > > > 
> > > > > > > > I would like to avoid that and keep it per device. The maximum
> > > > > > > > size
> > > > > > > > stored
> > > > > > > > there is the maximum size supported by virtio user (or virtio
> > > > > > > > device
> > > > > > > > model,
> > > > > > > > however you want to call it). So that's really a limit per
> > > > > > > > device,
> > > > > > > > not
> > > > > > > > per
> > > > > > > > queue, as no queue of the device would ever exceed that limit.
> > > > > > > > 
> > > > > > > > Plus a lot more code would need to be refactored, which I
> > > > > > > > think is
> > > > > > > > unnecessary.
> > > > > > > 
> > > > > > > I'm against a per-device limit because it's a concept that
> > > > > > > cannot
> > > > > > > accurately describe reality. Some devices have multiple classes
> > > > > > > of
> > > > > > 
> > > > > > It describes current reality, because VIRTQUEUE_MAX_SIZE obviously
> > > > > > is
> > > > > > not
> > > > > > per queue either ATM, and nobody ever cared.
> > > > > > 
> > > > > > All this series does is allow overriding that currently project-wide
> > > > > > compile-time constant with a per-driver-model compile-time constant.
> > > > > > Which makes sense, because that's what it is: some drivers could cope
> > > > > > with any transfer size, and some drivers are constrained to a certain
> > > > > > maximum application-specific transfer size (e.g. IOV_MAX).
> > > > > > 
> > > > > > > virtqueues and they are sized differently, so a per-device limit
> > > > > > > is
> > > > > > > insufficient. virtio-net has separate rx_queue_size and
> > > > > > > tx_queue_size
> > > > > > > parameters (plus a control vq hardcoded to 64 descriptors).
> > > > > > 
> > > > > > I simply find this overkill. This value semantically means "my driver
> > > > > > model supports, at any time and under any circumstance, at the very
> > > > > > most x * PAGE_SIZE = max_transfer_size". Do you see any driver that
> > > > > > might want a more fine-grained control over this?
> > > > > 
> > > > > One reason why per-vq limits could make sense is that the maximum
> > > > > possible number of struct elements is allocated upfront in some code
> > > > > paths. Those code paths may need to differentiate between per-vq
> > > > > limits
> > > > > for performance or memory utilization reasons. Today some places
> > > > > allocate 1024 elements on the stack in some code paths, but maybe
> > > > > that's
> > > > > not acceptable when the per-device limit is 32k. This can matter
> > > > > when a
> > > > > device has vqs with very different sizes.
> > > > 
> > > > [...]
> > > > 
> > > > > > ... I leave that up to Michael or whoever might be in charge to
> > > > > > decide. I
> > > > > > still find this overkill, but I will adapt this to whatever the
> > > > > > decision
> > > > > > eventually will be in v3.
> > > > > > 
> > > > > > But then please tell me the precise representation that you find
> > > > > > appropriate, i.e. whether you want a new function for that, or
> > > > > > rather
> > > > > > an
> > > > > > additional argument to virtio_add_queue(). Your call.
> > > > > 
> > > > > virtio_add_queue() already takes an int queue_size argument. I think
> > > > > the
> > > > > necessary information is already there.
> > > > > 
> > > > > This patch just needs to be tweaked to use the
> > > > > virtio_queue_get_num()
> > > > > (or a new virtqueue_get_num() API if that's easier because only a
> > > > > VirtQueue *vq pointer is available) instead of introducing a new
> > > > > per-device limit.
> > > > 
> > > > My understanding is that both the original 9p virtio device authors,
> > > > as
> > > > well as other virtio device authors in QEMU have been and are still
> > > > using
> > > > this as a default value (i.e. to allocate some upfront, and the rest
> > > > on
> > > > demand).
> > > > 
> > > > So yes, I know your argument about the specs, but AFAICS if I would
> > > > just
> > > > take this existing numeric argument for the limit, then it would
> > > > probably
> > > > break those other QEMU devices as well.
> > > 
> > > This is a good point that I didn't consider. If guest drivers currently
> > > violate the spec, then restricting descriptor chain length to vring.num
> > > will introduce regressions.
> > > 
> > > We can't use virtio_queue_get_num() directly. A backwards-compatible
> > > limit is required:
> > > 
> > >   int virtio_queue_get_desc_chain_max(VirtIODevice *vdev, int n)
> > >   {
> > >       /*
> > >        * QEMU historically allowed 1024 descriptors even if the
> > >        * descriptor table was smaller.
> > >        */
> > >       return MAX(virtio_queue_get_num(vdev, n), 1024);
> > >   }
> > 
> > That was an alternative that I thought about as well, but decided against.
> > It would require devices (that would want to support large transmission
> > sizes) to create the virtio queue(s) with the maximum possible size, i.e.:
> >   virtio_add_queue(32k);
> 
> The spec allows drivers to set the size of the vring as long as they do
> not exceed Queue Size.
> 
> The Linux drivers accept the device's default size, so you're right that
> this would cause large vrings to be allocated if the device sets the
> virtqueue size to 32k.
> 
> > And that's the point where my current lack of knowledge of what this would
> > precisely mean for the resulting allocation set decided against it. I mean,
> > would QEMU's virtio implementation then just a) allocate 32k scatter-gather
> > list entries? Or would it rather b) additionally allocate the destination
> > memory pages as well?
> 
> The vring consumes guest RAM but it just consists of descriptors, not
> the buffer memory pages. The guest RAM requirements are:
> - split layout: 32k * 16 + 6 + 32k * 2 + 6 + 8 * 32k = 851,980 bytes
> - packed layout: 32k * 16 + 4 + 4 = 524,296 bytes
> 
> That's still quite large!
> 
> By the way, virtio-blk currently uses a virtqueue size of 256
> descriptors and this has been found reasonable for disk I/O performance.
> The Linux block layer splits requests at around 1.25 MB for virtio-blk.
> The virtio-blk queue limits are reported by the device and the guest
> Linux block layer uses them to size/split requests appropriately. I'm
> not sure 9p really needs 32k, although you're right that fragmented
> physical memory requires 32k descriptors to describe 128 MB of buffers.
> 
> Going back to the original problem, a vring feature bit could be added
> to the VIRTIO specification indicating that indirect descriptor tables
> are limited to the maximum (32k) instead of Queue Size. This way the
> device's default vring size could be small but drivers could allocate
> indirect descriptor tables that are large when necessary. Then the Linux
> virtio driver API would need to report the maximum supported sglist
> length for a given virtqueue so drivers can take advantage of this
> information.
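Just to double-check the quoted figures, a quick sanity check (a minimal
sketch; it merely re-derives the numbers from the virtio 1.1 split/packed
ring layouts quoted above, i.e. 16-byte descriptors plus the avail/used
rings resp. the two event suppression structs):

#include <stdio.h>

int main(void)
{
    unsigned long num = 32 * 1024;

    /* split:  descriptors + avail ring + used ring */
    printf("split:  %lu bytes\n", num * 16 + (6 + num * 2) + (6 + num * 8));
    /* packed: descriptors + driver/device event suppression */
    printf("packed: %lu bytes\n", num * 16 + 4 + 4);
    return 0;   /* prints 851980 and 524296 */
}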

Due to forced pragmatism, ~1M unused/wasted space would be acceptable IMHO.

But if that could really go up to 128M+ when physical RAM is highly
fragmented, as you said, then a cleaner solution would definitely make sense,
yes, if that's possible.

But as changing the specs etc. is probably a long process, it would make sense
to first do some more tests with the kernel patches to find out whether there
is some show stopper like IOV_MAX anyway.

As for whether large transfer sizes make sense at all: well, from the
benchmarks I made so far I "think" it does make sense going >4M. It might be
something specific to 9p (it's a full file server); I guess it has a higher
latency than raw virtio block devices. OTOH with M.2 SSDs we now have several
thousand MB/s, so I am not sure if the old common transfer size of 1M for
block devices is still reasonable today.

Best regards,
Christian Schoenebeck



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
  2021-10-08 14:24           ` [Virtio-fs] " Christian Schoenebeck
@ 2021-10-08 16:08             ` Christian Schoenebeck
  -1 siblings, 0 replies; 97+ messages in thread
From: Christian Schoenebeck @ 2021-10-08 16:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: Greg Kurz, Stefan Hajnoczi, Kevin Wolf, Laurent Vivier,
	qemu-block, Michael S. Tsirkin, Jason Wang, Amit Shah,
	David Hildenbrand, Raphael Norwitz, virtio-fs, Eric Auger,
	Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Paolo Bonzini, Marc-André Lureau, Fam Zheng,
	Dr. David Alan Gilbert

On Freitag, 8. Oktober 2021 16:24:42 CEST Christian Schoenebeck wrote:
> On Freitag, 8. Oktober 2021 09:25:33 CEST Greg Kurz wrote:
> > On Thu, 7 Oct 2021 16:42:49 +0100
> > 
> > Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > On Thu, Oct 07, 2021 at 02:51:55PM +0200, Christian Schoenebeck wrote:
> > > > On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan Hajnoczi wrote:
> > > > > > On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian Schoenebeck wrote:
> > > > > > At the moment the maximum transfer size with virtio is limited to
> > > > > > 4M
> > > > > > (1024 * PAGE_SIZE). This series raises this limit to its maximum
> > > > > > theoretical possible transfer size of 128M (32k pages) according
> > > > > > to
> > > > > > the
> > > > > > virtio specs:
> > > > > > 
> > > > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-240006
> > > > > 
> > > > > Hi Christian,
> > 
> > > > > I took a quick look at the code:
> > Hi,
> > 
> > Thanks Stefan for sharing virtio expertise and helping Christian !
> > 
> > > > > - The Linux 9p driver restricts descriptor chains to 128 elements
> > > > > 
> > > > >   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)
> > > > 
> > > > Yes, that's the limitation that I am about to remove (WIP); current
> > > > kernel
> > > > patches:
> > > > https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyte.com/>
> > > 
> > > I haven't read the patches yet but I'm concerned that today the driver
> > > is pretty well-behaved and this new patch series introduces a spec
> > > violation. Not fixing existing spec violations is okay, but adding new
> > > ones is a red flag. I think we need to figure out a clean solution.
> 
> Nobody has reviewed the kernel patches yet. My main concern therefore
> actually is that the kernel patches are already too complex, because the
> current situation is that only Dominique is handling 9p patches on kernel
> side, and he barely has time for 9p anymore.
> 
> Another reason for me to catch up on reading current kernel code and
> stepping in as reviewer of 9p on kernel side ASAP, independent of this
> issue.
> 
> As for current kernel patches' complexity: I can certainly drop patch 7
> entirely as it is probably just overkill. Patch 4 is then the biggest chunk,
> I have to see if I can simplify it, and whether it would make sense to
> squash with patch 3.
> 
> > > > > - The QEMU 9pfs code passes iovecs directly to preadv(2) and will
> > > > > fail
> > > > > 
> > > > >   with EINVAL when called with more than IOV_MAX iovecs
> > > > >   (hw/9pfs/9p.c:v9fs_read())
> > > > 
> > > > Hmm, which makes me wonder why I never encountered this error during
> > > > testing.
> > > > 
> > > > Most people will use the 9p qemu 'local' fs driver backend in
> > > > practice,
> > > > so
> > > > that v9fs_read() call would translate for most people to this
> > > > implementation on QEMU side (hw/9p/9p-local.c):
> > > > 
> > > > static ssize_t local_preadv(FsContext *ctx, V9fsFidOpenState *fs,
> > > > 
> > > >                             const struct iovec *iov,
> > > >                             int iovcnt, off_t offset)
> > > > 
> > > > {
> > > > #ifdef CONFIG_PREADV
> > > > 
> > > >     return preadv(fs->fd, iov, iovcnt, offset);
> > > > 
> > > > #else
> > > > 
> > > >     int err = lseek(fs->fd, offset, SEEK_SET);
> > > >     if (err == -1) {
> > > >     
> > > >         return err;
> > > >     
> > > >     } else {
> > > >     
> > > >         return readv(fs->fd, iov, iovcnt);
> > > >     
> > > >     }
> > > > 
> > > > #endif
> > > > }
> > > > 
> > > > > Unless I misunderstood the code, neither side can take advantage of
> > > > > the
> > > > > new 32k descriptor chain limit?
> > > > > 
> > > > > Thanks,
> > > > > Stefan
> > > > 
> > > > I need to check that when I have some more time. One possible
> > > > explanation
> > > > might be that preadv() already has this wrapped into a loop in its
> > > > implementation to circumvent a limit like IOV_MAX. It might be another
> > > > "it
> > > > works, but not portable" issue, but not sure.
> > > > 
> > > > There are still a bunch of other issues I have to resolve. If you look
> > > > at
> > > > net/9p/client.c on kernel side, you'll notice that it basically does
> > > > this ATM> >
> > > > 
> > > >     kmalloc(msize);
> > 
> > Note that this is done twice : once for the T message (client request) and
> > once for the R message (server answer). The 9p driver could adjust the
> > size
> > of the T message to what's really needed instead of allocating the full
> > msize. R message size is not known though.
> 
> Would it make sense adding a second virtio ring, dedicated to server
> responses to solve this? IIRC 9p server already calculates appropriate
> exact sizes for each response type. So server could just push space that's
> really needed for its responses.
> 
> > > > for every 9p request. So not only does it allocate much more memory
> > > > for
> > > > every request than actually required (i.e. say 9pfs was mounted with
> > > > msize=8M, then a 9p request that actually would just need 1k would
> > > > nevertheless allocate 8M), but also it allocates > PAGE_SIZE, which
> > > > obviously may fail at any time.>
> > > 
> > > The PAGE_SIZE limitation sounds like a kmalloc() vs vmalloc() situation.
> 
> Hu, I didn't even consider vmalloc(). I just tried the kvmalloc() wrapper as
> a quick & dirty test, but it crashed in the same way as kmalloc() with
> large msize values immediately on mounting:
> 
> diff --git a/net/9p/client.c b/net/9p/client.c
> index a75034fa249b..cfe300a4b6ca 100644
> --- a/net/9p/client.c
> +++ b/net/9p/client.c
> @@ -227,15 +227,18 @@ static int parse_opts(char *opts, struct p9_client
> *clnt)
>  static int p9_fcall_init(struct p9_client *c, struct p9_fcall *fc,
>                          int alloc_msize)
>  {
> -       if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> +       //if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> +       if (false) {
>                 fc->sdata = kmem_cache_alloc(c->fcall_cache, GFP_NOFS);
>                 fc->cache = c->fcall_cache;
>         } else {
> -               fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
> +               fc->sdata = kvmalloc(alloc_msize, GFP_NOFS);

Ok, GFP_NOFS -> GFP_KERNEL did the trick.

Now I get:

   virtio: bogus descriptor or out of resources

So, still some work ahead on both ends.
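Side note on why the flag change matters: AFAICS kvmalloc() only falls back to
vmalloc() when the passed flags are GFP_KERNEL compatible (vmalloc() needs
GFP_KERNEL for its internal allocations), so with GFP_NOFS it degrades to a
plain kmalloc(), which is bound to fail for large msize. If the NOFS semantics
are actually needed here, the scoped API might be an alternative; a minimal
sketch, assuming the p9_fcall_init() context quoted above:

	/* <linux/sched/mm.h>: scoped NOFS instead of GFP_NOFS in the gfp mask */
	unsigned int nofs = memalloc_nofs_save();

	fc->sdata = kvmalloc(alloc_msize, GFP_KERNEL); /* vmalloc fallback allowed */
	memalloc_nofs_restore(nofs);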

>                 fc->cache = NULL;
>         }
> -       if (!fc->sdata)
> +       if (!fc->sdata) {
> +               pr_info("%s !fc->sdata", __func__);
>                 return -ENOMEM;
> +       }
>         fc->capacity = alloc_msize;
>         return 0;
>  }
> 
> I try to look at this at the weekend, I would have expected this hack to
> bypass this issue.
> 
> > > I saw zerocopy code in the 9p guest driver but didn't investigate when
> > > it's used. Maybe that should be used for large requests (file
> > > reads/writes)?
> > 
> > This is the case already : zero-copy is only used for reads/writes/readdir
> > if the requested size is 1k or more.
> > 
> > Also you'll note that in this case, the 9p driver doesn't allocate msize
> > for the T/R messages but only 4k, which is largely enough to hold the
> > header.
> > 
> > 	/*
> > 	
> > 	 * We allocate a inline protocol data of only 4k bytes.
> > 	 * The actual content is passed in zero-copy fashion.
> > 	 */
> > 	
> > 	req = p9_client_prepare_req(c, type, P9_ZC_HDR_SZ, fmt, ap);
> > 
> > and
> > 
> > /* size of header for zero copy read/write */
> > #define P9_ZC_HDR_SZ 4096
> > 
> > A huge msize only makes sense for Twrite, Rread and Rreaddir because
> > of the amount of data they convey. All other messages certainly fit
> > in a couple of kilobytes only (sorry, don't remember the numbers).
> > 
> > A first change should be to allocate MIN(XXX, msize) for the
> > regular non-zc case, where XXX could be a reasonable fixed
> > value (8k?). In the case of T messages, it is even possible
> > to adjust the size to what's exactly needed, ala snprintf(NULL).
> 
> Good idea actually! That would limit this problem to reviewing the 9p specs
> and picking one reasonable max value. Because you are right, those message
> types are tiny. Probably not worth to pile up new code to calculate exact
> message sizes for each one of them.
> 
> Adding some safety net would make sense though, to force e.g. if a new
> message type is added in future, that this value would be reviewed as well,
> something like:
> 
> static int max_msg_size(int msg_type) {
>     switch (msg_type) {
>         /* large zero copy messages */
>         case Twrite:
>         case Tread:
>         case Treaddir:
>             BUG_ON(true);
> 
>         /* small messages */
>         case Tversion:
>         ....
>             return 8k; /* to be replaced with appropriate max value */
>     }
> }
> 
> That way the compiler would bark on future additions. But on doubt, a simple
> comment on msg type enum might do as well though.
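A compilable variant of that sketch could look roughly like this (it uses the
P9_T* message constants from <net/9p/9p.h>; the 8 KiB value is still just a
placeholder, and the remaining message types are elided; leaving out a
default: label is what makes -Wswitch bark on newly added types):

static int max_msg_size(enum p9_msg_t msg_type)
{
	switch (msg_type) {
	/* large zero-copy messages must never take this path */
	case P9_TWRITE:
	case P9_TREAD:
	case P9_TREADDIR:
		BUG();
		break;
	/* small messages (all other types to be listed here) */
	case P9_TVERSION:
		return 8 * 1024;	/* placeholder for a reviewed maximum */
	}
	return -EINVAL;	/* unreachable once every type is listed above */
}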
> 
> > > virtio-blk/scsi don't memcpy data into a new buffer, they
> > > directly access page cache or O_DIRECT pinned pages.
> > > 
> > > Stefan
> > 
> > Cheers,
> > 
> > --
> > Greg




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Virtio-fs] [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
@ 2021-10-08 16:08             ` Christian Schoenebeck
  0 siblings, 0 replies; 97+ messages in thread
From: Christian Schoenebeck @ 2021-10-08 16:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	Jason Wang, Amit Shah, David Hildenbrand, Raphael Norwitz,
	virtio-fs, Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Marc-André Lureau, Paolo Bonzini, Fam Zheng

On Freitag, 8. Oktober 2021 16:24:42 CEST Christian Schoenebeck wrote:
> On Freitag, 8. Oktober 2021 09:25:33 CEST Greg Kurz wrote:
> > On Thu, 7 Oct 2021 16:42:49 +0100
> > 
> > Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > On Thu, Oct 07, 2021 at 02:51:55PM +0200, Christian Schoenebeck wrote:
> > > > On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan Hajnoczi wrote:
> > > > > > On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian Schoenebeck wrote:
> > > > > > At the moment the maximum transfer size with virtio is limited to
> > > > > > 4M
> > > > > > (1024 * PAGE_SIZE). This series raises this limit to its maximum
> > > > > > theoretical possible transfer size of 128M (32k pages) according
> > > > > > to
> > > > > > the
> > > > > > virtio specs:
> > > > > > 
> > > > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-240006
> > > > > 
> > > > > Hi Christian,
> > 
> > > > > I took a quick look at the code:
> > Hi,
> > 
> > Thanks Stefan for sharing virtio expertise and helping Christian !
> > 
> > > > > - The Linux 9p driver restricts descriptor chains to 128 elements
> > > > > 
> > > > >   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)
> > > > 
> > > > Yes, that's the limitation that I am about to remove (WIP); current
> > > > kernel
> > > > patches:
> > > > https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyte.com/>
> > > 
> > > I haven't read the patches yet but I'm concerned that today the driver
> > > is pretty well-behaved and this new patch series introduces a spec
> > > violation. Not fixing existing spec violations is okay, but adding new
> > > ones is a red flag. I think we need to figure out a clean solution.
> 
> Nobody has reviewed the kernel patches yet. My main concern therefore
> actually is that the kernel patches are already too complex, because the
> current situation is that only Dominique is handling 9p patches on kernel
> side, and he barely has time for 9p anymore.
> 
> Another reason for me to catch up on reading current kernel code and
> stepping in as reviewer of 9p on kernel side ASAP, independent of this
> issue.
> 
> As for current kernel patches' complexity: I can certainly drop patch 7
> entirely as it is probably just overkill. Patch 4 is then the biggest chunk,
> I have to see if I can simplify it, and whether it would make sense to
> squash with patch 3.
> 
> > > > > - The QEMU 9pfs code passes iovecs directly to preadv(2) and will
> > > > > fail
> > > > > 
> > > > >   with EINVAL when called with more than IOV_MAX iovecs
> > > > >   (hw/9pfs/9p.c:v9fs_read())
> > > > 
> > > > Hmm, which makes me wonder why I never encountered this error during
> > > > testing.
> > > > 
> > > > Most people will use the 9p qemu 'local' fs driver backend in
> > > > practice,
> > > > so
> > > > that v9fs_read() call would translate for most people to this
> > > > implementation on QEMU side (hw/9p/9p-local.c):
> > > > 
> > > > static ssize_t local_preadv(FsContext *ctx, V9fsFidOpenState *fs,
> > > > 
> > > >                             const struct iovec *iov,
> > > >                             int iovcnt, off_t offset)
> > > > 
> > > > {
> > > > #ifdef CONFIG_PREADV
> > > > 
> > > >     return preadv(fs->fd, iov, iovcnt, offset);
> > > > 
> > > > #else
> > > > 
> > > >     int err = lseek(fs->fd, offset, SEEK_SET);
> > > >     if (err == -1) {
> > > >     
> > > >         return err;
> > > >     
> > > >     } else {
> > > >     
> > > >         return readv(fs->fd, iov, iovcnt);
> > > >     
> > > >     }
> > > > 
> > > > #endif
> > > > }
> > > > 
> > > > > Unless I misunderstood the code, neither side can take advantage of
> > > > > the
> > > > > new 32k descriptor chain limit?
> > > > > 
> > > > > Thanks,
> > > > > Stefan
> > > > 
> > > > I need to check that when I have some more time. One possible
> > > > explanation
> > > > might be that preadv() already has this wrapped into a loop in its
> > > > implementation to circumvent a limit like IOV_MAX. It might be another
> > > > "it
> > > > works, but not portable" issue, but not sure.
> > > > 
> > > > There are still a bunch of other issues I have to resolve. If you look
> > > > at
> > > > net/9p/client.c on kernel side, you'll notice that it basically does
> > > > this ATM> >
> > > > 
> > > >     kmalloc(msize);
> > 
> > Note that this is done twice : once for the T message (client request) and
> > once for the R message (server answer). The 9p driver could adjust the
> > size
> > of the T message to what's really needed instead of allocating the full
> > msize. R message size is not known though.
> 
> Would it make sense adding a second virtio ring, dedicated to server
> responses to solve this? IIRC 9p server already calculates appropriate
> exact sizes for each response type. So server could just push space that's
> really needed for its responses.
> 
> > > > for every 9p request. So not only does it allocate much more memory
> > > > for
> > > > every request than actually required (i.e. say 9pfs was mounted with
> > > > msize=8M, then a 9p request that actually would just need 1k would
> > > > nevertheless allocate 8M), but also it allocates > PAGE_SIZE, which
> > > > obviously may fail at any time.>
> > > 
> > > The PAGE_SIZE limitation sounds like a kmalloc() vs vmalloc() situation.
> 
> Hu, I didn't even consider vmalloc(). I just tried the kvmalloc() wrapper as
> a quick & dirty test, but it crashed in the same way as kmalloc() with
> large msize values immediately on mounting:
> 
> diff --git a/net/9p/client.c b/net/9p/client.c
> index a75034fa249b..cfe300a4b6ca 100644
> --- a/net/9p/client.c
> +++ b/net/9p/client.c
> @@ -227,15 +227,18 @@ static int parse_opts(char *opts, struct p9_client
> *clnt)
>  static int p9_fcall_init(struct p9_client *c, struct p9_fcall *fc,
>                          int alloc_msize)
>  {
> -       if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> +       //if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> +       if (false) {
>                 fc->sdata = kmem_cache_alloc(c->fcall_cache, GFP_NOFS);
>                 fc->cache = c->fcall_cache;
>         } else {
> -               fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
> +               fc->sdata = kvmalloc(alloc_msize, GFP_NOFS);

Ok, GFP_NOFS -> GFP_KERNEL did the trick.

Now I get:

   virtio: bogus descriptor or out of resources

So, still some work ahead on both ends.

>                 fc->cache = NULL;
>         }
> -       if (!fc->sdata)
> +       if (!fc->sdata) {
> +               pr_info("%s !fc->sdata", __func__);
>                 return -ENOMEM;
> +       }
>         fc->capacity = alloc_msize;
>         return 0;
>  }
> 
> I try to look at this at the weekend, I would have expected this hack to
> bypass this issue.
> 
> > > I saw zerocopy code in the 9p guest driver but didn't investigate when
> > > it's used. Maybe that should be used for large requests (file
> > > reads/writes)?
> > 
> > This is the case already : zero-copy is only used for reads/writes/readdir
> > if the requested size is 1k or more.
> > 
> > Also you'll note that in this case, the 9p driver doesn't allocate msize
> > for the T/R messages but only 4k, which is largely enough to hold the
> > header.
> > 
> > 	/*
> > 	
> > 	 * We allocate a inline protocol data of only 4k bytes.
> > 	 * The actual content is passed in zero-copy fashion.
> > 	 */
> > 	
> > 	req = p9_client_prepare_req(c, type, P9_ZC_HDR_SZ, fmt, ap);
> > 
> > and
> > 
> > /* size of header for zero copy read/write */
> > #define P9_ZC_HDR_SZ 4096
> > 
> > A huge msize only makes sense for Twrite, Rread and Rreaddir because
> > of the amount of data they convey. All other messages certainly fit
> > in a couple of kilobytes only (sorry, don't remember the numbers).
> > 
> > A first change should be to allocate MIN(XXX, msize) for the
> > regular non-zc case, where XXX could be a reasonable fixed
> > value (8k?). In the case of T messages, it is even possible
> > to adjust the size to what's exactly needed, ala snprintf(NULL).
> 
> Good idea actually! That would limit this problem to reviewing the 9p specs
> and picking one reasonable max value. Because you are right, those message
> types are tiny. Probably not worth to pile up new code to calculate exact
> message sizes for each one of them.
> 
> Adding some safety net would make sense though, to force e.g. if a new
> message type is added in future, that this value would be reviewed as well,
> something like:
> 
> static int max_msg_size(int msg_type) {
>     switch (msg_type) {
>         /* large zero copy messages */
>         case Twrite:
>         case Tread:
>         case Treaddir:
>             BUG_ON(true);
> 
>         /* small messages */
>         case Tversion:
>         ....
>             return 8k; /* to be replaced with appropriate max value */
>     }
> }
> 
> That way the compiler would bark on future additions. But on doubt, a simple
> comment on msg type enum might do as well though.
> 
> > > virtio-blk/scsi don't memcpy data into a new buffer, they
> > > directly access page cache or O_DIRECT pinned pages.
> > > 
> > > Stefan
> > 
> > Cheers,
> > 
> > --
> > Greg



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
  2021-10-08 16:08             ` [Virtio-fs] " Christian Schoenebeck
@ 2021-10-21 15:39               ` Christian Schoenebeck
  -1 siblings, 0 replies; 97+ messages in thread
From: Christian Schoenebeck @ 2021-10-21 15:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: Greg Kurz, Stefan Hajnoczi, Kevin Wolf, Laurent Vivier,
	qemu-block, Michael S. Tsirkin, Jason Wang, Amit Shah,
	David Hildenbrand, Raphael Norwitz, virtio-fs, Eric Auger,
	Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Paolo Bonzini, Marc-André Lureau, Fam Zheng,
	Dr. David Alan Gilbert

On Freitag, 8. Oktober 2021 18:08:48 CEST Christian Schoenebeck wrote:
> On Freitag, 8. Oktober 2021 16:24:42 CEST Christian Schoenebeck wrote:
> > On Freitag, 8. Oktober 2021 09:25:33 CEST Greg Kurz wrote:
> > > On Thu, 7 Oct 2021 16:42:49 +0100
> > > 
> > > Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > On Thu, Oct 07, 2021 at 02:51:55PM +0200, Christian Schoenebeck wrote:
> > > > > On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan Hajnoczi wrote:
> > > > > > On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian Schoenebeck wrote:
> > > > > > > At the moment the maximum transfer size with virtio is limited
> > > > > > > to
> > > > > > > 4M
> > > > > > > (1024 * PAGE_SIZE). This series raises this limit to its maximum
> > > > > > > theoretical possible transfer size of 128M (32k pages) according
> > > > > > > to
> > > > > > > the
> > > > > > > virtio specs:
> > > > > > > 
> > > > > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-240006
> > > > > > 
> > > > > > Hi Christian,
> > > 
> > > > > > I took a quick look at the code:
> > > Hi,
> > > 
> > > Thanks Stefan for sharing virtio expertise and helping Christian !
> > > 
> > > > > > - The Linux 9p driver restricts descriptor chains to 128 elements
> > > > > > 
> > > > > >   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)
> > > > > 
> > > > > Yes, that's the limitation that I am about to remove (WIP); current
> > > > > kernel
> > > > > patches:
> > > > > https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyte.com/>
> > > > 
> > > > I haven't read the patches yet but I'm concerned that today the driver
> > > > is pretty well-behaved and this new patch series introduces a spec
> > > > violation. Not fixing existing spec violations is okay, but adding new
> > > > ones is a red flag. I think we need to figure out a clean solution.
> > 
> > Nobody has reviewed the kernel patches yet. My main concern therefore
> > actually is that the kernel patches are already too complex, because the
> > current situation is that only Dominique is handling 9p patches on kernel
> > side, and he barely has time for 9p anymore.
> > 
> > Another reason for me to catch up on reading current kernel code and
> > stepping in as reviewer of 9p on kernel side ASAP, independent of this
> > issue.
> > 
> > As for current kernel patches' complexity: I can certainly drop patch 7
> > entirely as it is probably just overkill. Patch 4 is then the biggest
> > chunk, I have to see if I can simplify it, and whether it would make
> > sense to squash with patch 3.
> > 
> > > > > > - The QEMU 9pfs code passes iovecs directly to preadv(2) and will
> > > > > > fail
> > > > > > 
> > > > > >   with EINVAL when called with more than IOV_MAX iovecs
> > > > > >   (hw/9pfs/9p.c:v9fs_read())
> > > > > 
> > > > > Hmm, which makes me wonder why I never encountered this error during
> > > > > testing.
> > > > > 
> > > > > Most people will use the 9p qemu 'local' fs driver backend in
> > > > > practice,
> > > > > so
> > > > > that v9fs_read() call would translate for most people to this
> > > > > implementation on QEMU side (hw/9p/9p-local.c):
> > > > > 
> > > > > static ssize_t local_preadv(FsContext *ctx, V9fsFidOpenState *fs,
> > > > > 
> > > > >                             const struct iovec *iov,
> > > > >                             int iovcnt, off_t offset)
> > > > > 
> > > > > {
> > > > > #ifdef CONFIG_PREADV
> > > > > 
> > > > >     return preadv(fs->fd, iov, iovcnt, offset);
> > > > > 
> > > > > #else
> > > > > 
> > > > >     int err = lseek(fs->fd, offset, SEEK_SET);
> > > > >     if (err == -1) {
> > > > >     
> > > > >         return err;
> > > > >     
> > > > >     } else {
> > > > >     
> > > > >         return readv(fs->fd, iov, iovcnt);
> > > > >     
> > > > >     }
> > > > > 
> > > > > #endif
> > > > > }
> > > > > 
> > > > > > Unless I misunderstood the code, neither side can take advantage
> > > > > > of
> > > > > > the
> > > > > > new 32k descriptor chain limit?
> > > > > > 
> > > > > > Thanks,
> > > > > > Stefan
> > > > > 
> > > > > I need to check that when I have some more time. One possible
> > > > > explanation
> > > > > might be that preadv() already has this wrapped into a loop in its
> > > > > implementation to circumvent a limit like IOV_MAX. It might be
> > > > > another
> > > > > "it
> > > > > works, but not portable" issue, but not sure.
> > > > > 
> > > > > There are still a bunch of other issues I have to resolve. If you
> > > > > look
> > > > > at
> > > > > net/9p/client.c on kernel side, you'll notice that it basically does
> > > > > this ATM> >
> > > > > 
> > > > >     kmalloc(msize);
> > > 
> > > Note that this is done twice : once for the T message (client request)
> > > and
> > > once for the R message (server answer). The 9p driver could adjust the
> > > size
> > > of the T message to what's really needed instead of allocating the full
> > > msize. R message size is not known though.
> > 
> > Would it make sense adding a second virtio ring, dedicated to server
> > responses to solve this? IIRC 9p server already calculates appropriate
> > exact sizes for each response type. So server could just push space that's
> > really needed for its responses.
> > 
> > > > > for every 9p request. So not only does it allocate much more memory
> > > > > for
> > > > > every request than actually required (i.e. say 9pfs was mounted with
> > > > > msize=8M, then a 9p request that actually would just need 1k would
> > > > > nevertheless allocate 8M), but also it allocates > PAGE_SIZE, which
> > > > > obviously may fail at any time.>
> > > > 
> > > > The PAGE_SIZE limitation sounds like a kmalloc() vs vmalloc()
> > > > situation.
> > 
> > Hu, I didn't even consider vmalloc(). I just tried the kvmalloc() wrapper
> > as a quick & dirty test, but it crashed in the same way as kmalloc() with
> > large msize values immediately on mounting:
> > 
> > diff --git a/net/9p/client.c b/net/9p/client.c
> > index a75034fa249b..cfe300a4b6ca 100644
> > --- a/net/9p/client.c
> > +++ b/net/9p/client.c
> > @@ -227,15 +227,18 @@ static int parse_opts(char *opts, struct p9_client
> > *clnt)
> > 
> >  static int p9_fcall_init(struct p9_client *c, struct p9_fcall *fc,
> >  
> >                          int alloc_msize)
> >  
> >  {
> > 
> > -       if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> > +       //if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> > +       if (false) {
> > 
> >                 fc->sdata = kmem_cache_alloc(c->fcall_cache, GFP_NOFS);
> >                 fc->cache = c->fcall_cache;
> >         
> >         } else {
> > 
> > -               fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
> > +               fc->sdata = kvmalloc(alloc_msize, GFP_NOFS);
> 
> Ok, GFP_NOFS -> GFP_KERNEL did the trick.
> 
> Now I get:
> 
>    virtio: bogus descriptor or out of resources
> 
> So, still some work ahead on both ends.

Few hacks later (only changes on 9p client side) I got this running stable
now. The reason for the virtio error above was that kvmalloc() returns a
non-logical kernel address for any kvmalloc(>4M), i.e. an address that is
inaccessible from host side, hence that "bogus descriptor" message by QEMU.
So I had to split those linear 9p client buffers into sparse ones (set of
individual pages).
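"Sparse" here means one scatter-gather entry per page instead of one
sg_set_buf() over the whole linear buffer, roughly along the lines of the
following sketch (p9_sdata_to_sg() is a hypothetical helper, not the actual
patch; it assumes a page-aligned buffer and a sufficiently sized sg table):

static int p9_sdata_to_sg(struct scatterlist *sg, void *data, size_t len)
{
	char *buf = data;
	int n = 0;

	while (len) {
		size_t chunk = min_t(size_t, len, PAGE_SIZE);
		/* vmalloc memory has no valid virt_to_page() mapping */
		struct page *page = is_vmalloc_addr(buf) ?
				vmalloc_to_page(buf) : virt_to_page(buf);

		sg_set_page(&sg[n++], page, chunk, offset_in_page(buf));
		buf += chunk;
		len -= chunk;
	}
	return n;
}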

I tested this for some days with various virtio transmission sizes and it
works as expected up to 128 MB (more precisely: 128 MB read space + 128 MB
write space per virtio round trip message).

I did not encounter a show stopper for large virtio transmission sizes
(4 MB ... 128 MB) on virtio level, neither as a result of testing, nor after
reviewing the existing code.

About IOV_MAX: that's apparently not an issue on the virtio level. Most of the
iovec code, both on Linux kernel side and on QEMU side, does not have this
limitation. It is, however, indeed a limitation for userland apps calling the
Linux kernel's syscalls.
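Should the QEMU side ever run into IOV_MAX in practice, the usual way out
would be to split the vector into IOV_MAX-sized syscalls, e.g. along the lines
of this sketch (preadv_chunked() is hypothetical and glosses over short reads
in the middle of the vector):

#include <limits.h>     /* IOV_MAX */
#include <sys/uio.h>    /* preadv(), struct iovec */

static ssize_t preadv_chunked(int fd, const struct iovec *iov, int iovcnt,
                              off_t offset)
{
    ssize_t total = 0;

    while (iovcnt > 0) {
        int n = iovcnt < IOV_MAX ? iovcnt : IOV_MAX;
        ssize_t len = preadv(fd, iov, n, offset);

        if (len <= 0)
            return total ? total : len;
        total  += len;
        offset += len;
        iov    += n;
        iovcnt -= n;
    }
    return total;
}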

Stefan, as it stands now, I am even more convinced that the upper virtio
transmission size limit should not be squeezed into the queue size argument of
virtio_add_queue(). Not because of the previous argument that it would waste
space (~1MB), but rather because they are two different things. To outline
this, just a quick recap of what happens exactly when a bulk message is pushed
over the virtio wire (assuming virtio "split" layout here):

---------- [recap-start] ----------

For each bulk message sent guest <-> host, exactly *one* of the pre-allocated
descriptors is taken and placed (subsequently) into exactly *one* position of
the two available/used ring buffers. The actual descriptor table though,
containing all the DMA addresses of the message bulk data, is allocated just
in time for each round trip message. Say it is the first message sent; it
yields the following structure:

Ring Buffer   Descriptor Table      Bulk Data Pages

   +-+              +-+           +-----------------+
   |D|------------->|d|---------->| Bulk data block |
   +-+              |d|--------+  +-----------------+
   | |              |d|------+ |
   +-+               .       | |  +-----------------+
   | |               .       | +->| Bulk data block |
    .                .       |    +-----------------+
    .               |d|-+    |
    .               +-+ |    |    +-----------------+
   | |                  |    +--->| Bulk data block |
   +-+                  |         +-----------------+
   | |                  |                 .
   +-+                  |                 .
                        |                 .
                        |         +-----------------+
                        +-------->| Bulk data block |
                                  +-----------------+
Legend:
D: pre-allocated descriptor
d: just in time allocated descriptor
-->: memory pointer (DMA)

The bulk data blocks are allocated by the respective device driver above
virtio subsystem level (guest side).

There are exactly as many descriptors pre-allocated (D) as the size of a ring
buffer.

A "descriptor" is more or less just a chainable DMA memory pointer; defined
as:

/* Virtio ring descriptors: 16 bytes.  These can chain together via "next". */
struct vring_desc {
	/* Address (guest-physical). */
	__virtio64 addr;
	/* Length. */
	__virtio32 len;
	/* The flags as indicated above. */
	__virtio16 flags;
	/* We chain unused descriptors via this, too */
	__virtio16 next;
};

There are 2 ring buffers; the "available" ring buffer is for sending a message
guest->host (which will transmit DMA addresses of guest allocated bulk data
blocks that are used for data sent to device, and separate guest allocated
bulk data blocks that will be used by host side to place its response bulk
data), and the "used" ring buffer is for sending host->guest to let guest know
about host's response and that it could now safely consume and then deallocate
the bulk data blocks subsequently.
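On the guest driver side, one such round trip boils down to roughly the
following sketch (vq, req and the out_sgl/in_sgl scatter-gather lists are
assumed from context):

	struct scatterlist *sgs[2] = { out_sgl, in_sgl }; /* T data, R buffer */
	unsigned int len;
	int rc;

	rc = virtqueue_add_sgs(vq, sgs, 1, 1, req, GFP_ATOMIC); /* 1 ring slot */
	if (rc < 0)
		return rc;      /* ring full or descriptor table alloc failed */
	virtqueue_kick(vq);     /* "available" ring: notify the host */

	/* ... later, from the virtqueue callback ... */
	while (virtqueue_get_buf(vq, &len))  /* "used" ring: host is done */
		;               /* consume response, then free bulk data blocks */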

---------- [recap-end] ----------

So the "queue size" actually defines the ringbuffer size. It does not define
the maximum amount of descriptors. The "queue size" rather defines how many
pending messages can be pushed into either one ringbuffer before the other
side would need to wait until the counter side would step up (i.e. ring buffer
full).

The maximum number of descriptors (which is what VIRTQUEUE_MAX_SIZE actually
is), OTOH, defines the max. bulk data size that can be transmitted with each
virtio round trip message.
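To put numbers on that (assuming 4 KiB pages and the 16-byte descriptor shown
above):

  32768 descriptors * 4 KiB page      = 128 MiB bulk data per direction
  32768 descriptors * 16 B descriptor = 512 KiB descriptor table per direction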

And in fact, 9p currently treats the virtio "queue size" as directly
associated with the maximum number of active 9p requests the server can
handle simultaneously:

  hw/9pfs/9p.h:#define MAX_REQ         128
  hw/9pfs/9p.h:    V9fsPDU pdus[MAX_REQ];
  hw/9pfs/virtio-9p-device.c:    v->vq = virtio_add_queue(vdev, MAX_REQ,
                                 handle_9p_output);

So if I changed it like this, just for the purpose of increasing the max.
virtio transmission size:

--- a/hw/9pfs/virtio-9p-device.c
+++ b/hw/9pfs/virtio-9p-device.c
@@ -218,7 +218,7 @@ static void virtio_9p_device_realize(DeviceState *dev, Error **errp)
     v->config_size = sizeof(struct virtio_9p_config) + strlen(s->fsconf.tag);
     virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
                 VIRTQUEUE_MAX_SIZE);
-    v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
+    v->vq = virtio_add_queue(vdev, 32*1024, handle_9p_output);
 }
 
Then it would require additional synchronization code on both ends and
therefore unnecessary complexity, because it would now be possible that more
requests are pushed into the ringbuffer than the server (with its MAX_REQ =
128 pdus) could handle.

There is one potential issue though that probably did justify the "don't
exceed the queue size" rule:

ATM the descriptor table is allocated (just in time) as *one* continuous
buffer via kmalloc_array():
https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086f7c7d33a4/drivers/virtio/virtio_ring.c#L440

So assuming a transmission size of 2 * 128 MB, that kmalloc_array() call would
yield a kmalloc(1M) (2 * 32768 descriptors * 16 bytes each), and the latter
might fail if the guest had highly fragmented physical memory. For such an
error case there is currently a fallback path in virtqueue_add_split() that
would then use the required number of pre-allocated descriptors instead:
https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086f7c7d33a4/drivers/virtio/virtio_ring.c#L525

That fallback recovery path would no longer be viable if the queue size was
exceeded. There would be alternatives though, e.g. allowing indirect
descriptor tables to be chained (currently prohibited by the virtio specs).

Best regards,
Christian Schoenebeck

> 
> >                 fc->cache = NULL;
> >         
> >         }
> > 
> > -       if (!fc->sdata)
> > +       if (!fc->sdata) {
> > +               pr_info("%s !fc->sdata", __func__);
> > 
> >                 return -ENOMEM;
> > 
> > +       }
> > 
> >         fc->capacity = alloc_msize;
> >         return 0;
> >  
> >  }
> > 
> > I try to look at this at the weekend, I would have expected this hack to
> > bypass this issue.
> > 
> > > > I saw zerocopy code in the 9p guest driver but didn't investigate when
> > > > it's used. Maybe that should be used for large requests (file
> > > > reads/writes)?
> > > 
> > > This is the case already : zero-copy is only used for
> > > reads/writes/readdir
> > > if the requested size is 1k or more.
> > > 
> > > Also you'll note that in this case, the 9p driver doesn't allocate msize
> > > for the T/R messages but only 4k, which is largely enough to hold the
> > > header.
> > > 
> > > 	/*
> > > 	
> > > 	 * We allocate a inline protocol data of only 4k bytes.
> > > 	 * The actual content is passed in zero-copy fashion.
> > > 	 */
> > > 	
> > > 	req = p9_client_prepare_req(c, type, P9_ZC_HDR_SZ, fmt, ap);
> > > 
> > > and
> > > 
> > > /* size of header for zero copy read/write */
> > > #define P9_ZC_HDR_SZ 4096
> > > 
> > > A huge msize only makes sense for Twrite, Rread and Rreaddir because
> > > of the amount of data they convey. All other messages certainly fit
> > > in a couple of kilobytes only (sorry, don't remember the numbers).
> > > 
> > > A first change should be to allocate MIN(XXX, msize) for the
> > > regular non-zc case, where XXX could be a reasonable fixed
> > > value (8k?). In the case of T messages, it is even possible
> > > to adjust the size to what's exactly needed, ala snprintf(NULL).
> > 
> > Good idea actually! That would limit this problem to reviewing the 9p
> > specs
> > and picking one reasonable max value. Because you are right, those message
> > types are tiny. Probably not worth to pile up new code to calculate exact
> > message sizes for each one of them.
> > 
> > Adding some safety net would make sense though, to force e.g. if a new
> > message type is added in future, that this value would be reviewed as
> > well,
> > something like:
> > 
> > static int max_msg_size(int msg_type) {
> > 
> >     switch (msg_type) {
> >     
> >         /* large zero copy messages */
> >         case Twrite:
> >         case Tread:
> >         
> >         case Treaddir:
> >             BUG_ON(true);
> >         
> >         /* small messages */
> >         case Tversion:
> >         ....
> >         
> >             return 8k; /* to be replaced with appropriate max value */
> >     
> >     }
> > 
> > }
> > 
> > That way the compiler would bark on future additions. But on doubt, a
> > simple comment on msg type enum might do as well though.
> > 
> > > > virtio-blk/scsi don't memcpy data into a new buffer, they
> > > > directly access page cache or O_DIRECT pinned pages.
> > > > 
> > > > Stefan
> > > 
> > > Cheers,
> > > 
> > > --
> > > Greg




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Virtio-fs] [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
@ 2021-10-21 15:39               ` Christian Schoenebeck
  0 siblings, 0 replies; 97+ messages in thread
From: Christian Schoenebeck @ 2021-10-21 15:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	Jason Wang, Amit Shah, David Hildenbrand, Raphael Norwitz,
	virtio-fs, Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Marc-André Lureau, Paolo Bonzini, Fam Zheng

On Freitag, 8. Oktober 2021 18:08:48 CEST Christian Schoenebeck wrote:
> On Freitag, 8. Oktober 2021 16:24:42 CEST Christian Schoenebeck wrote:
> > On Freitag, 8. Oktober 2021 09:25:33 CEST Greg Kurz wrote:
> > > On Thu, 7 Oct 2021 16:42:49 +0100
> > > 
> > > Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > On Thu, Oct 07, 2021 at 02:51:55PM +0200, Christian Schoenebeck wrote:
> > > > > On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan Hajnoczi wrote:
> > > > > > On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian Schoenebeck wrote:
> > > > > > > At the moment the maximum transfer size with virtio is limited
> > > > > > > to
> > > > > > > 4M
> > > > > > > (1024 * PAGE_SIZE). This series raises this limit to its maximum
> > > > > > > theoretical possible transfer size of 128M (32k pages) according
> > > > > > > to
> > > > > > > the
> > > > > > > virtio specs:
> > > > > > > 
> > > > > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-240006
> > > > > > 
> > > > > > Hi Christian,
> > > 
> > > > > > I took a quick look at the code:
> > > Hi,
> > > 
> > > Thanks Stefan for sharing virtio expertise and helping Christian !
> > > 
> > > > > > - The Linux 9p driver restricts descriptor chains to 128 elements
> > > > > > 
> > > > > >   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)
> > > > > 
> > > > > Yes, that's the limitation that I am about to remove (WIP); current
> > > > > kernel
> > > > > patches:
> > > > > https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyte.com/>
> > > > 
> > > > I haven't read the patches yet but I'm concerned that today the driver
> > > > is pretty well-behaved and this new patch series introduces a spec
> > > > violation. Not fixing existing spec violations is okay, but adding new
> > > > ones is a red flag. I think we need to figure out a clean solution.
> > 
> > Nobody has reviewed the kernel patches yet. My main concern therefore
> > actually is that the kernel patches are already too complex, because the
> > current situation is that only Dominique is handling 9p patches on kernel
> > side, and he barely has time for 9p anymore.
> > 
> > Another reason for me to catch up on reading current kernel code and
> > stepping in as reviewer of 9p on kernel side ASAP, independent of this
> > issue.
> > 
> > As for current kernel patches' complexity: I can certainly drop patch 7
> > entirely as it is probably just overkill. Patch 4 is then the biggest
> > chunk, I have to see if I can simplify it, and whether it would make
> > sense to squash with patch 3.
> > 
> > > > > > - The QEMU 9pfs code passes iovecs directly to preadv(2) and will
> > > > > > fail
> > > > > > 
> > > > > >   with EINVAL when called with more than IOV_MAX iovecs
> > > > > >   (hw/9pfs/9p.c:v9fs_read())
> > > > > 
> > > > > Hmm, which makes me wonder why I never encountered this error during
> > > > > testing.
> > > > > 
> > > > > Most people will use the 9p qemu 'local' fs driver backend in
> > > > > practice,
> > > > > so
> > > > > that v9fs_read() call would translate for most people to this
> > > > > implementation on QEMU side (hw/9p/9p-local.c):
> > > > > 
> > > > > static ssize_t local_preadv(FsContext *ctx, V9fsFidOpenState *fs,
> > > > > 
> > > > >                             const struct iovec *iov,
> > > > >                             int iovcnt, off_t offset)
> > > > > 
> > > > > {
> > > > > #ifdef CONFIG_PREADV
> > > > > 
> > > > >     return preadv(fs->fd, iov, iovcnt, offset);
> > > > > 
> > > > > #else
> > > > > 
> > > > >     int err = lseek(fs->fd, offset, SEEK_SET);
> > > > >     if (err == -1) {
> > > > >     
> > > > >         return err;
> > > > >     
> > > > >     } else {
> > > > >     
> > > > >         return readv(fs->fd, iov, iovcnt);
> > > > >     
> > > > >     }
> > > > > 
> > > > > #endif
> > > > > }
> > > > > 
> > > > > > Unless I misunderstood the code, neither side can take advantage
> > > > > > of
> > > > > > the
> > > > > > new 32k descriptor chain limit?
> > > > > > 
> > > > > > Thanks,
> > > > > > Stefan
> > > > > 
> > > > > I need to check that when I have some more time. One possible
> > > > > explanation
> > > > > might be that preadv() already has this wrapped into a loop in its
> > > > > implementation to circumvent a limit like IOV_MAX. It might be
> > > > > another
> > > > > "it
> > > > > works, but not portable" issue, but not sure.
> > > > > 
> > > > > There are still a bunch of other issues I have to resolve. If you
> > > > > look
> > > > > at
> > > > > net/9p/client.c on kernel side, you'll notice that it basically does
> > > > > this ATM> >
> > > > > 
> > > > >     kmalloc(msize);
> > > 
> > > Note that this is done twice : once for the T message (client request)
> > > and
> > > once for the R message (server answer). The 9p driver could adjust the
> > > size
> > > of the T message to what's really needed instead of allocating the full
> > > msize. R message size is not known though.
> > 
> > Would it make sense adding a second virtio ring, dedicated to server
> > responses to solve this? IIRC 9p server already calculates appropriate
> > exact sizes for each response type. So server could just push space that's
> > really needed for its responses.
> > 
> > > > > for every 9p request. So not only does it allocate much more memory
> > > > > for
> > > > > every request than actually required (i.e. say 9pfs was mounted with
> > > > > msize=8M, then a 9p request that actually would just need 1k would
> > > > > nevertheless allocate 8M), but also it allocates > PAGE_SIZE, which
> > > > > obviously may fail at any time.>
> > > > 
> > > > The PAGE_SIZE limitation sounds like a kmalloc() vs vmalloc()
> > > > situation.
> > 
> > Hu, I didn't even consider vmalloc(). I just tried the kvmalloc() wrapper
> > as a quick & dirty test, but it crashed in the same way as kmalloc() with
> > large msize values immediately on mounting:
> > 
> > diff --git a/net/9p/client.c b/net/9p/client.c
> > index a75034fa249b..cfe300a4b6ca 100644
> > --- a/net/9p/client.c
> > +++ b/net/9p/client.c
> > @@ -227,15 +227,18 @@ static int parse_opts(char *opts, struct p9_client
> > *clnt)
> > 
> >  static int p9_fcall_init(struct p9_client *c, struct p9_fcall *fc,
> >  
> >                          int alloc_msize)
> >  
> >  {
> > 
> > -       if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> > +       //if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> > +       if (false) {
> > 
> >                 fc->sdata = kmem_cache_alloc(c->fcall_cache, GFP_NOFS);
> >                 fc->cache = c->fcall_cache;
> >         
> >         } else {
> > 
> > -               fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
> > +               fc->sdata = kvmalloc(alloc_msize, GFP_NOFS);
> 
> Ok, GFP_NOFS -> GFP_KERNEL did the trick.
> 
> Now I get:
> 
>    virtio: bogus descriptor or out of resources
> 
> So, still some work ahead on both ends.

Few hacks later (only changes on 9p client side) I got this running stable
now. The reason for the virtio error above was that kvmalloc() returns a
non-logical kernel address for any kvmalloc(>4M), i.e. an address that is
inaccessible from host side, hence that "bogus descriptor" message by QEMU.
So I had to split those linear 9p client buffers into sparse ones (set of
individual pages).

I tested this for some days with various virtio transmission sizes and it
works as expected up to 128 MB (more precisely: 128 MB read space + 128 MB
write space per virtio round trip message).

I did not encounter a showstopper for large virtio transmission sizes
(4 MB ... 128 MB) on the virtio level, neither as a result of testing nor
after reviewing the existing code.

About IOV_MAX: that's apparently not an issue on the virtio level. Most of
the iovec code, both on the Linux kernel side and on the QEMU side, does not
have this limitation. It does however remain a limitation for userland apps
calling the Linux kernel's syscalls.
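
For completeness, userland can query that per-syscall limit at runtime; a
trivial standalone sketch (plain POSIX, nothing 9p-specific):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* _SC_IOV_MAX reports the kernel's per-syscall iovec limit,
	 * i.e. the IOV_MAX referred to above (1024 on Linux). */
	printf("IOV_MAX: %ld\n", sysconf(_SC_IOV_MAX));
	return 0;
}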

Stefan, as it stands now, I am even more convinced that the upper virtio
transmission size limit should not be squeezed into the queue size argument of
virtio_add_queue(). Not because of the previous argument that it would waste
space (~1MB), but rather because they are two different things. To outline
this, here is a quick recap of what exactly happens when a bulk message is
pushed over the virtio wire (assuming the virtio "split" layout here):

---------- [recap-start] ----------

For each bulk message sent guest <-> host, exactly *one* of the pre-allocated
descriptors is taken and placed (subsequently) into exactly *one* position of
the two available/used ring buffers. The actual descriptor table though,
containing all the DMA addresses of the message bulk data, is allocated just
in time for each round trip message. Say it is the first message sent; it
results in the following structure:

Ring Buffer   Descriptor Table      Bulk Data Pages

   +-+              +-+           +-----------------+
   |D|------------->|d|---------->| Bulk data block |
   +-+              |d|--------+  +-----------------+
   | |              |d|------+ |
   +-+               .       | |  +-----------------+
   | |               .       | +->| Bulk data block |
    .                .       |    +-----------------+
    .               |d|-+    |
    .               +-+ |    |    +-----------------+
   | |                  |    +--->| Bulk data block |
   +-+                  |         +-----------------+
   | |                  |                 .
   +-+                  |                 .
                        |                 .
                        |         +-----------------+
                        +-------->| Bulk data block |
                                  +-----------------+
Legend:
D: pre-allocated descriptor
d: just in time allocated descriptor
-->: memory pointer (DMA)

The bulk data blocks are allocated by the respective device driver above the
virtio subsystem level (guest side).

There are exactly as many descriptors pre-allocated (D) as the ring buffer
size.

A "descriptor" is more or less just a chainable DMA memory pointer; defined
as:

/* Virtio ring descriptors: 16 bytes.  These can chain together via "next". */
struct vring_desc {
	/* Address (guest-physical). */
	__virtio64 addr;
	/* Length. */
	__virtio32 len;
	/* The flags as indicated above. */
	__virtio16 flags;
	/* We chain unused descriptors via this, too */
	__virtio16 next;
};

There are 2 ring buffers: the "available" ring buffer is for sending a
message guest->host (it transmits the DMA addresses of guest allocated bulk
data blocks used for data sent to the device, plus separate guest allocated
bulk data blocks that the host side will use to place its response bulk
data), and the "used" ring buffer is for sending host->guest, to let the
guest know about the host's response and that it can now safely consume and
then deallocate the bulk data blocks.
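
For reference, the two rings are plain structures as well; they look roughly
like this in the same UAPI header as vring_desc above
(include/uapi/linux/virtio_ring.h):

struct vring_avail {
	__virtio16 flags;
	__virtio16 idx;
	__virtio16 ring[];
};

struct vring_used_elem {
	/* Index of start of used descriptor chain. */
	__virtio32 id;
	/* Total length of the descriptor chain which was written to. */
	__virtio32 len;
};

struct vring_used {
	__virtio16 flags;
	__virtio16 idx;
	struct vring_used_elem ring[];
};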

---------- [recap-end] ----------

So the "queue size" actually defines the ringbuffer size. It does not define
the maximum amount of descriptors. The "queue size" rather defines how many
pending messages can be pushed into either one ringbuffer before the other
side would need to wait until the counter side would step up (i.e. ring buffer
full).

The maximum number of descriptors (which is what VIRTQUEUE_MAX_SIZE actually
is), OTOH, defines the max. bulk data size that can be transmitted with each
virtio round trip message.
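
In concrete numbers (assuming 4 KiB pages and one page per descriptor, as in
the figures quoted at the top of this thread):

	 1024 descriptors * 4 KiB =   4 MiB per message (current limit)
	32768 descriptors * 4 KiB = 128 MiB per message (proposed limit)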

And in fact, 9p currently treats the virtio "queue size" as directly
associated with the maximum number of active 9p requests the server can
handle simultaneously:

  hw/9pfs/9p.h:#define MAX_REQ         128
  hw/9pfs/9p.h:    V9fsPDU pdus[MAX_REQ];
  hw/9pfs/virtio-9p-device.c:    v->vq = virtio_add_queue(vdev, MAX_REQ,
                                 handle_9p_output);

So if I changed it like this, just for the purpose of increasing the max.
virtio transmission size:

--- a/hw/9pfs/virtio-9p-device.c
+++ b/hw/9pfs/virtio-9p-device.c
@@ -218,7 +218,7 @@ static void virtio_9p_device_realize(DeviceState *dev, Error **errp)
     v->config_size = sizeof(struct virtio_9p_config) + strlen(s->fsconf.tag);
     virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
                 VIRTQUEUE_MAX_SIZE);
-    v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
+    v->vq = virtio_add_queue(vdev, 32*1024, handle_9p_output);
 }
 
Then it would require additional synchronization code on both ends and
therefore unnecessary complexity, because it would now be possible that more
requests are pushed into the ring buffer than the server could handle.

There is one potential issue though that probably justified the "don't
exceed the queue size" rule:

ATM the descriptor table is allocated (just in time) as *one* contiguous
buffer via kmalloc_array():
https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086f7c7d33a4/drivers/virtio/virtio_ring.c#L440

So assuming a transmission size of 2 * 128 MB, that kmalloc_array() call
would result in a kmalloc(1M), and the latter might fail if the guest had
highly fragmented physical memory. For that kind of error case there is
currently a fallback path in virtqueue_add_split() that then uses the
required number of pre-allocated descriptors instead:
https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086f7c7d33a4/drivers/virtio/virtio_ring.c#L525
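
The 1M figure follows directly from the descriptor size shown above:

	2 * 32768 descriptors * 16 bytes/descriptor = 1 MiB descriptor table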

That fallback recovery path would no longer be viable if the queue size was
exceeded. There would be alternatives though, e.g. allowing indirect
descriptor tables to be chained (currently prohibited by the virtio specs).
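
For context, the descriptor flags involved (also from
include/uapi/linux/virtio_ring.h); the "prohibited" part refers to the spec
rule that a driver must not set both the INDIRECT and NEXT flags on the same
descriptor:

/* This marks a buffer as continuing via the next field. */
#define VRING_DESC_F_NEXT	1
/* This marks a buffer as write-only (otherwise read-only). */
#define VRING_DESC_F_WRITE	2
/* This means the buffer contains a list of buffer descriptors. */
#define VRING_DESC_F_INDIRECT	4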

Best regards,
Christian Schoenebeck

> 
> >                 fc->cache = NULL;
> >         
> >         }
> > 
> > -       if (!fc->sdata)
> > +       if (!fc->sdata) {
> > +               pr_info("%s !fc->sdata", __func__);
> > 
> >                 return -ENOMEM;
> > 
> > +       }
> > 
> >         fc->capacity = alloc_msize;
> >         return 0;
> >  
> >  }
> > 
> > I try to look at this at the weekend, I would have expected this hack to
> > bypass this issue.
> > 
> > > > I saw zerocopy code in the 9p guest driver but didn't investigate when
> > > > it's used. Maybe that should be used for large requests (file
> > > > reads/writes)?
> > > 
> > > This is the case already : zero-copy is only used for
> > > reads/writes/readdir
> > > if the requested size is 1k or more.
> > > 
> > > Also you'll note that in this case, the 9p driver doesn't allocate msize
> > > for the T/R messages but only 4k, which is largely enough to hold the
> > > header.
> > > 
> > > 	/*
> > > 	
> > > 	 * We allocate a inline protocol data of only 4k bytes.
> > > 	 * The actual content is passed in zero-copy fashion.
> > > 	 */
> > > 	
> > > 	req = p9_client_prepare_req(c, type, P9_ZC_HDR_SZ, fmt, ap);
> > > 
> > > and
> > > 
> > > /* size of header for zero copy read/write */
> > > #define P9_ZC_HDR_SZ 4096
> > > 
> > > A huge msize only makes sense for Twrite, Rread and Rreaddir because
> > > of the amount of data they convey. All other messages certainly fit
> > > in a couple of kilobytes only (sorry, don't remember the numbers).
> > > 
> > > A first change should be to allocate MIN(XXX, msize) for the
> > > regular non-zc case, where XXX could be a reasonable fixed
> > > value (8k?). In the case of T messages, it is even possible
> > > to adjust the size to what's exactly needed, ala snprintf(NULL).
> > 
> > Good idea actually! That would limit this problem to reviewing the 9p
> > specs
> > and picking one reasonable max value. Because you are right, those message
> > types are tiny. Probably not worth to pile up new code to calculate exact
> > message sizes for each one of them.
> > 
> > Adding some safety net would make sense though, to force e.g. if a new
> > message type is added in future, that this value would be reviewed as
> > well,
> > something like:
> > 
> > static int max_msg_size(int msg_type) {
> > 
> >     switch (msg_type) {
> >     
> >         /* large zero copy messages */
> >         case Twrite:
> >         case Tread:
> >         
> >         case Treaddir:
> >             BUG_ON(true);
> >         
> >         /* small messages */
> >         case Tversion:
> >         ....
> >         
> >             return 8k; /* to be replaced with appropriate max value */
> >     
> >     }
> > 
> > }
> > 
> > That way the compiler would bark on future additions. But on doubt, a
> > simple comment on msg type enum might do as well though.
> > 
> > > > virtio-blk/scsi don't memcpy data into a new buffer, they
> > > > directly access page cache or O_DIRECT pinned pages.
> > > > 
> > > > Stefan
> > > 
> > > Cheers,
> > > 
> > > --
> > > Greg



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
  2021-10-21 15:39               ` [Virtio-fs] " Christian Schoenebeck
@ 2021-10-25 10:30                 ` Stefan Hajnoczi
  -1 siblings, 0 replies; 97+ messages in thread
From: Stefan Hajnoczi @ 2021-10-25 10:30 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	Jason Wang, Amit Shah, David Hildenbrand, qemu-devel, Greg Kurz,
	virtio-fs, Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Marc-André Lureau, Paolo Bonzini, Fam Zheng,
	Raphael Norwitz, Dr. David Alan Gilbert

[-- Attachment #1: Type: text/plain, Size: 15382 bytes --]

On Thu, Oct 21, 2021 at 05:39:28PM +0200, Christian Schoenebeck wrote:
> On Freitag, 8. Oktober 2021 18:08:48 CEST Christian Schoenebeck wrote:
> > On Freitag, 8. Oktober 2021 16:24:42 CEST Christian Schoenebeck wrote:
> > > On Freitag, 8. Oktober 2021 09:25:33 CEST Greg Kurz wrote:
> > > > On Thu, 7 Oct 2021 16:42:49 +0100
> > > > 
> > > > Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > On Thu, Oct 07, 2021 at 02:51:55PM +0200, Christian Schoenebeck wrote:
> > > > > > On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan Hajnoczi wrote:
> > > > > > > On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian Schoenebeck
> > 
> > wrote:
> > > > > > > > At the moment the maximum transfer size with virtio is limited
> > > > > > > > to
> > > > > > > > 4M
> > > > > > > > (1024 * PAGE_SIZE). This series raises this limit to its maximum
> > > > > > > > theoretical possible transfer size of 128M (32k pages) according
> > > > > > > > to
> > > > > > > > the
> > > > > > > > virtio specs:
> > > > > > > > 
> > > > > > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-
> > > > > > > > cs
> > > > > > > > 01
> > > > > > > > .html#
> > > > > > > > x1-240006
> > > > > > > 
> > > > > > > Hi Christian,
> > > > 
> > > > > > > I took a quick look at the code:
> > > > Hi,
> > > > 
> > > > Thanks Stefan for sharing virtio expertise and helping Christian !
> > > > 
> > > > > > > - The Linux 9p driver restricts descriptor chains to 128 elements
> > > > > > > 
> > > > > > >   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)
> > > > > > 
> > > > > > Yes, that's the limitation that I am about to remove (WIP); current
> > > > > > kernel
> > > > > > patches:
> > > > > > https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudeb
> > > > > > yt
> > > > > > e.
> > > > > > com/>
> > > > > 
> > > > > I haven't read the patches yet but I'm concerned that today the driver
> > > > > is pretty well-behaved and this new patch series introduces a spec
> > > > > violation. Not fixing existing spec violations is okay, but adding new
> > > > > ones is a red flag. I think we need to figure out a clean solution.
> > > 
> > > Nobody has reviewed the kernel patches yet. My main concern therefore
> > > actually is that the kernel patches are already too complex, because the
> > > current situation is that only Dominique is handling 9p patches on kernel
> > > side, and he barely has time for 9p anymore.
> > > 
> > > Another reason for me to catch up on reading current kernel code and
> > > stepping in as reviewer of 9p on kernel side ASAP, independent of this
> > > issue.
> > > 
> > > As for current kernel patches' complexity: I can certainly drop patch 7
> > > entirely as it is probably just overkill. Patch 4 is then the biggest
> > > chunk, I have to see if I can simplify it, and whether it would make
> > > sense to squash with patch 3.
> > > 
> > > > > > > - The QEMU 9pfs code passes iovecs directly to preadv(2) and will
> > > > > > > fail
> > > > > > > 
> > > > > > >   with EINVAL when called with more than IOV_MAX iovecs
> > > > > > >   (hw/9pfs/9p.c:v9fs_read())
> > > > > > 
> > > > > > Hmm, which makes me wonder why I never encountered this error during
> > > > > > testing.
> > > > > > 
> > > > > > Most people will use the 9p qemu 'local' fs driver backend in
> > > > > > practice,
> > > > > > so
> > > > > > that v9fs_read() call would translate for most people to this
> > > > > > implementation on QEMU side (hw/9p/9p-local.c):
> > > > > > 
> > > > > > static ssize_t local_preadv(FsContext *ctx, V9fsFidOpenState *fs,
> > > > > > 
> > > > > >                             const struct iovec *iov,
> > > > > >                             int iovcnt, off_t offset)
> > > > > > 
> > > > > > {
> > > > > > #ifdef CONFIG_PREADV
> > > > > > 
> > > > > >     return preadv(fs->fd, iov, iovcnt, offset);
> > > > > > 
> > > > > > #else
> > > > > > 
> > > > > >     int err = lseek(fs->fd, offset, SEEK_SET);
> > > > > >     if (err == -1) {
> > > > > >     
> > > > > >         return err;
> > > > > >     
> > > > > >     } else {
> > > > > >     
> > > > > >         return readv(fs->fd, iov, iovcnt);
> > > > > >     
> > > > > >     }
> > > > > > 
> > > > > > #endif
> > > > > > }
> > > > > > 
> > > > > > > Unless I misunderstood the code, neither side can take advantage
> > > > > > > of
> > > > > > > the
> > > > > > > new 32k descriptor chain limit?
> > > > > > > 
> > > > > > > Thanks,
> > > > > > > Stefan
> > > > > > 
> > > > > > I need to check that when I have some more time. One possible
> > > > > > explanation
> > > > > > might be that preadv() already has this wrapped into a loop in its
> > > > > > implementation to circumvent a limit like IOV_MAX. It might be
> > > > > > another
> > > > > > "it
> > > > > > works, but not portable" issue, but not sure.
> > > > > > 
> > > > > > There are still a bunch of other issues I have to resolve. If you
> > > > > > look
> > > > > > at
> > > > > > net/9p/client.c on kernel side, you'll notice that it basically does
> > > > > > this ATM> >
> > > > > > 
> > > > > >     kmalloc(msize);
> > > > 
> > > > Note that this is done twice : once for the T message (client request)
> > > > and
> > > > once for the R message (server answer). The 9p driver could adjust the
> > > > size
> > > > of the T message to what's really needed instead of allocating the full
> > > > msize. R message size is not known though.
> > > 
> > > Would it make sense adding a second virtio ring, dedicated to server
> > > responses to solve this? IIRC 9p server already calculates appropriate
> > > exact sizes for each response type. So server could just push space that's
> > > really needed for its responses.
> > > 
> > > > > > for every 9p request. So not only does it allocate much more memory
> > > > > > for
> > > > > > every request than actually required (i.e. say 9pfs was mounted with
> > > > > > msize=8M, then a 9p request that actually would just need 1k would
> > > > > > nevertheless allocate 8M), but also it allocates > PAGE_SIZE, which
> > > > > > obviously may fail at any time.>
> > > > > 
> > > > > The PAGE_SIZE limitation sounds like a kmalloc() vs vmalloc()
> > > > > situation.
> > > 
> > > Hu, I didn't even consider vmalloc(). I just tried the kvmalloc() wrapper
> > > as a quick & dirty test, but it crashed in the same way as kmalloc() with
> > > large msize values immediately on mounting:
> > > 
> > > diff --git a/net/9p/client.c b/net/9p/client.c
> > > index a75034fa249b..cfe300a4b6ca 100644
> > > --- a/net/9p/client.c
> > > +++ b/net/9p/client.c
> > > @@ -227,15 +227,18 @@ static int parse_opts(char *opts, struct p9_client
> > > *clnt)
> > > 
> > >  static int p9_fcall_init(struct p9_client *c, struct p9_fcall *fc,
> > >  
> > >                          int alloc_msize)
> > >  
> > >  {
> > > 
> > > -       if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> > > +       //if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> > > +       if (false) {
> > > 
> > >                 fc->sdata = kmem_cache_alloc(c->fcall_cache, GFP_NOFS);
> > >                 fc->cache = c->fcall_cache;
> > >         
> > >         } else {
> > > 
> > > -               fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
> > > +               fc->sdata = kvmalloc(alloc_msize, GFP_NOFS);
> > 
> > Ok, GFP_NOFS -> GFP_KERNEL did the trick.
> > 
> > Now I get:
> > 
> >    virtio: bogus descriptor or out of resources
> > 
> > So, still some work ahead on both ends.
> 
> Few hacks later (only changes on 9p client side) I got this running stable
> now. The reason for the virtio error above was that kvmalloc() returns a
> non-logical kernel address for any kvmalloc(>4M), i.e. an address that is
> inaccessible from host side, hence that "bogus descriptor" message by QEMU.
> So I had to split those linear 9p client buffers into sparse ones (set of
> individual pages).
> 
> I tested this for some days with various virtio transmission sizes and it
> works as expected up to 128 MB (more precisely: 128 MB read space + 128 MB
> write space per virtio round trip message).
> 
> I did not encounter a show stopper for large virtio transmission sizes
> (4 MB ... 128 MB) on virtio level, neither as a result of testing, nor after
> reviewing the existing code.
> 
> About IOV_MAX: that's apparently not an issue on virtio level. Most of the
> iovec code, both on Linux kernel side and on QEMU side do not have this
> limitation. It is apparently however indeed a limitation for userland apps
> calling the Linux kernel's syscalls yet.
> 
> Stefan, as it stands now, I am even more convinced that the upper virtio
> transmission size limit should not be squeezed into the queue size argument of
> virtio_add_queue(). Not because of the previous argument that it would waste
> space (~1MB), but rather because they are two different things. To outline
> this, just a quick recap of what happens exactly when a bulk message is pushed
> over the virtio wire (assuming virtio "split" layout here):
> 
> ---------- [recap-start] ----------
> 
> For each bulk message sent guest <-> host, exactly *one* of the pre-allocated
> descriptors is taken and placed (subsequently) into exactly *one* position of
> the two available/used ring buffers. The actual descriptor table though,
> containing all the DMA addresses of the message bulk data, is allocated just
> in time for each round trip message. Say, it is the first message sent, it
> yields in the following structure:
> 
> Ring Buffer   Descriptor Table      Bulk Data Pages
> 
>    +-+              +-+           +-----------------+
>    |D|------------->|d|---------->| Bulk data block |
>    +-+              |d|--------+  +-----------------+
>    | |              |d|------+ |
>    +-+               .       | |  +-----------------+
>    | |               .       | +->| Bulk data block |
>     .                .       |    +-----------------+
>     .               |d|-+    |
>     .               +-+ |    |    +-----------------+
>    | |                  |    +--->| Bulk data block |
>    +-+                  |         +-----------------+
>    | |                  |                 .
>    +-+                  |                 .
>                         |                 .
>                         |         +-----------------+
>                         +-------->| Bulk data block |
>                                   +-----------------+
> Legend:
> D: pre-allocated descriptor
> d: just in time allocated descriptor
> -->: memory pointer (DMA)
> 
> The bulk data blocks are allocated by the respective device driver above
> virtio subsystem level (guest side).
> 
> There are exactly as many descriptors pre-allocated (D) as the size of a ring
> buffer.
> 
> A "descriptor" is more or less just a chainable DMA memory pointer; defined
> as:
> 
> /* Virtio ring descriptors: 16 bytes.  These can chain together via "next". */
> struct vring_desc {
> 	/* Address (guest-physical). */
> 	__virtio64 addr;
> 	/* Length. */
> 	__virtio32 len;
> 	/* The flags as indicated above. */
> 	__virtio16 flags;
> 	/* We chain unused descriptors via this, too */
> 	__virtio16 next;
> };
> 
> There are 2 ring buffers; the "available" ring buffer is for sending a message
> guest->host (which will transmit DMA addresses of guest allocated bulk data
> blocks that are used for data sent to device, and separate guest allocated
> bulk data blocks that will be used by host side to place its response bulk
> data), and the "used" ring buffer is for sending host->guest to let guest know
> about host's response and that it could now safely consume and then deallocate
> the bulk data blocks subsequently.
> 
> ---------- [recap-end] ----------
> 
> So the "queue size" actually defines the ringbuffer size. It does not define
> the maximum amount of descriptors. The "queue size" rather defines how many
> pending messages can be pushed into either one ringbuffer before the other
> side would need to wait until the counter side would step up (i.e. ring buffer
> full).
> 
> The maximum amount of descriptors (what VIRTQUEUE_MAX_SIZE actually is) OTOH
> defines the max. bulk data size that could be transmitted with each virtio
> round trip message.
> 
> And in fact, 9p currently handles the virtio "queue size" as directly
> associative with its maximum amount of active 9p requests the server could
> handle simultaniously:
> 
>   hw/9pfs/9p.h:#define MAX_REQ         128
>   hw/9pfs/9p.h:    V9fsPDU pdus[MAX_REQ];
>   hw/9pfs/virtio-9p-device.c:    v->vq = virtio_add_queue(vdev, MAX_REQ,
>                                  handle_9p_output);
> 
> So if I would change it like this, just for the purpose to increase the max.
> virtio transmission size:
> 
> --- a/hw/9pfs/virtio-9p-device.c
> +++ b/hw/9pfs/virtio-9p-device.c
> @@ -218,7 +218,7 @@ static void virtio_9p_device_realize(DeviceState *dev, Error **errp)
>      v->config_size = sizeof(struct virtio_9p_config) + strlen(s->fsconf.tag);
>      virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
>                  VIRTQUEUE_MAX_SIZE);
> -    v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
> +    v->vq = virtio_add_queue(vdev, 32*1024, handle_9p_output);
>  }
>  
> Then it would require additional synchronization code on both ends and
> therefore unnecessary complexity, because it would now be possible that more
> requests are pushed into the ringbuffer than server could handle.
> 
> There is one potential issue though that probably did justify the "don't
> exceed the queue size" rule:
> 
> ATM the descriptor table is allocated (just in time) as *one* continuous
> buffer via kmalloc_array():
> https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086f7c7d33a4/drivers/virtio/virtio_ring.c#L440
> 
> So assuming transmission size of 2 * 128 MB that kmalloc_array() call would
> yield in kmalloc(1M) and the latter might fail if guest had highly fragmented
> physical memory. For such kind of error case there is currently a fallback
> path in virtqueue_add_split() that would then use the required amount of
> pre-allocated descriptors instead:
> https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086f7c7d33a4/drivers/virtio/virtio_ring.c#L525
> 
> That fallback recovery path would no longer be viable if the queue size was
> exceeded. There would be alternatives though, e.g. by allowing to chain
> indirect descriptor tables (currently prohibited by the virtio specs).

Making the maximum number of descriptors independent of the queue size
requires a change to the VIRTIO spec since the two values are currently
explicitly tied together by the spec.

Before doing that, are there benchmark results showing that 1 MB vs 128
MB produces a performance improvement? I'm asking because if performance
with 1 MB is good then you can probably do that without having to change
VIRTIO and also because it's counter-intuitive that 9p needs 128 MB for
good performance when it's ultimately implemented on top of disk and
network I/O that have lower size limits.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Virtio-fs] [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
@ 2021-10-25 10:30                 ` Stefan Hajnoczi
  0 siblings, 0 replies; 97+ messages in thread
From: Stefan Hajnoczi @ 2021-10-25 10:30 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	Jason Wang, Amit Shah, David Hildenbrand, qemu-devel, virtio-fs,
	Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Marc-André Lureau, Paolo Bonzini, Fam Zheng,
	Raphael Norwitz

[-- Attachment #1: Type: text/plain, Size: 15382 bytes --]

On Thu, Oct 21, 2021 at 05:39:28PM +0200, Christian Schoenebeck wrote:
> On Freitag, 8. Oktober 2021 18:08:48 CEST Christian Schoenebeck wrote:
> > On Freitag, 8. Oktober 2021 16:24:42 CEST Christian Schoenebeck wrote:
> > > On Freitag, 8. Oktober 2021 09:25:33 CEST Greg Kurz wrote:
> > > > On Thu, 7 Oct 2021 16:42:49 +0100
> > > > 
> > > > Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > On Thu, Oct 07, 2021 at 02:51:55PM +0200, Christian Schoenebeck wrote:
> > > > > > On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan Hajnoczi wrote:
> > > > > > > On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian Schoenebeck
> > 
> > wrote:
> > > > > > > > At the moment the maximum transfer size with virtio is limited
> > > > > > > > to
> > > > > > > > 4M
> > > > > > > > (1024 * PAGE_SIZE). This series raises this limit to its maximum
> > > > > > > > theoretical possible transfer size of 128M (32k pages) according
> > > > > > > > to
> > > > > > > > the
> > > > > > > > virtio specs:
> > > > > > > > 
> > > > > > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-
> > > > > > > > cs
> > > > > > > > 01
> > > > > > > > .html#
> > > > > > > > x1-240006
> > > > > > > 
> > > > > > > Hi Christian,
> > > > 
> > > > > > > I took a quick look at the code:
> > > > Hi,
> > > > 
> > > > Thanks Stefan for sharing virtio expertise and helping Christian !
> > > > 
> > > > > > > - The Linux 9p driver restricts descriptor chains to 128 elements
> > > > > > > 
> > > > > > >   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)
> > > > > > 
> > > > > > Yes, that's the limitation that I am about to remove (WIP); current
> > > > > > kernel
> > > > > > patches:
> > > > > > https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudeb
> > > > > > yt
> > > > > > e.
> > > > > > com/>
> > > > > 
> > > > > I haven't read the patches yet but I'm concerned that today the driver
> > > > > is pretty well-behaved and this new patch series introduces a spec
> > > > > violation. Not fixing existing spec violations is okay, but adding new
> > > > > ones is a red flag. I think we need to figure out a clean solution.
> > > 
> > > Nobody has reviewed the kernel patches yet. My main concern therefore
> > > actually is that the kernel patches are already too complex, because the
> > > current situation is that only Dominique is handling 9p patches on kernel
> > > side, and he barely has time for 9p anymore.
> > > 
> > > Another reason for me to catch up on reading current kernel code and
> > > stepping in as reviewer of 9p on kernel side ASAP, independent of this
> > > issue.
> > > 
> > > As for current kernel patches' complexity: I can certainly drop patch 7
> > > entirely as it is probably just overkill. Patch 4 is then the biggest
> > > chunk, I have to see if I can simplify it, and whether it would make
> > > sense to squash with patch 3.
> > > 
> > > > > > > - The QEMU 9pfs code passes iovecs directly to preadv(2) and will
> > > > > > > fail
> > > > > > > 
> > > > > > >   with EINVAL when called with more than IOV_MAX iovecs
> > > > > > >   (hw/9pfs/9p.c:v9fs_read())
> > > > > > 
> > > > > > Hmm, which makes me wonder why I never encountered this error during
> > > > > > testing.
> > > > > > 
> > > > > > Most people will use the 9p qemu 'local' fs driver backend in
> > > > > > practice,
> > > > > > so
> > > > > > that v9fs_read() call would translate for most people to this
> > > > > > implementation on QEMU side (hw/9p/9p-local.c):
> > > > > > 
> > > > > > static ssize_t local_preadv(FsContext *ctx, V9fsFidOpenState *fs,
> > > > > > 
> > > > > >                             const struct iovec *iov,
> > > > > >                             int iovcnt, off_t offset)
> > > > > > 
> > > > > > {
> > > > > > #ifdef CONFIG_PREADV
> > > > > > 
> > > > > >     return preadv(fs->fd, iov, iovcnt, offset);
> > > > > > 
> > > > > > #else
> > > > > > 
> > > > > >     int err = lseek(fs->fd, offset, SEEK_SET);
> > > > > >     if (err == -1) {
> > > > > >     
> > > > > >         return err;
> > > > > >     
> > > > > >     } else {
> > > > > >     
> > > > > >         return readv(fs->fd, iov, iovcnt);
> > > > > >     
> > > > > >     }
> > > > > > 
> > > > > > #endif
> > > > > > }
> > > > > > 
> > > > > > > Unless I misunderstood the code, neither side can take advantage
> > > > > > > of
> > > > > > > the
> > > > > > > new 32k descriptor chain limit?
> > > > > > > 
> > > > > > > Thanks,
> > > > > > > Stefan
> > > > > > 
> > > > > > I need to check that when I have some more time. One possible
> > > > > > explanation
> > > > > > might be that preadv() already has this wrapped into a loop in its
> > > > > > implementation to circumvent a limit like IOV_MAX. It might be
> > > > > > another
> > > > > > "it
> > > > > > works, but not portable" issue, but not sure.
> > > > > > 
> > > > > > There are still a bunch of other issues I have to resolve. If you
> > > > > > look
> > > > > > at
> > > > > > net/9p/client.c on kernel side, you'll notice that it basically does
> > > > > > this ATM> >
> > > > > > 
> > > > > >     kmalloc(msize);
> > > > 
> > > > Note that this is done twice : once for the T message (client request)
> > > > and
> > > > once for the R message (server answer). The 9p driver could adjust the
> > > > size
> > > > of the T message to what's really needed instead of allocating the full
> > > > msize. R message size is not known though.
> > > 
> > > Would it make sense adding a second virtio ring, dedicated to server
> > > responses to solve this? IIRC 9p server already calculates appropriate
> > > exact sizes for each response type. So server could just push space that's
> > > really needed for its responses.
> > > 
> > > > > > for every 9p request. So not only does it allocate much more memory
> > > > > > for
> > > > > > every request than actually required (i.e. say 9pfs was mounted with
> > > > > > msize=8M, then a 9p request that actually would just need 1k would
> > > > > > nevertheless allocate 8M), but also it allocates > PAGE_SIZE, which
> > > > > > obviously may fail at any time.>
> > > > > 
> > > > > The PAGE_SIZE limitation sounds like a kmalloc() vs vmalloc()
> > > > > situation.
> > > 
> > > Hu, I didn't even consider vmalloc(). I just tried the kvmalloc() wrapper
> > > as a quick & dirty test, but it crashed in the same way as kmalloc() with
> > > large msize values immediately on mounting:
> > > 
> > > diff --git a/net/9p/client.c b/net/9p/client.c
> > > index a75034fa249b..cfe300a4b6ca 100644
> > > --- a/net/9p/client.c
> > > +++ b/net/9p/client.c
> > > @@ -227,15 +227,18 @@ static int parse_opts(char *opts, struct p9_client
> > > *clnt)
> > > 
> > >  static int p9_fcall_init(struct p9_client *c, struct p9_fcall *fc,
> > >  
> > >                          int alloc_msize)
> > >  
> > >  {
> > > 
> > > -       if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> > > +       //if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> > > +       if (false) {
> > > 
> > >                 fc->sdata = kmem_cache_alloc(c->fcall_cache, GFP_NOFS);
> > >                 fc->cache = c->fcall_cache;
> > >         
> > >         } else {
> > > 
> > > -               fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
> > > +               fc->sdata = kvmalloc(alloc_msize, GFP_NOFS);
> > 
> > Ok, GFP_NOFS -> GFP_KERNEL did the trick.
> > 
> > Now I get:
> > 
> >    virtio: bogus descriptor or out of resources
> > 
> > So, still some work ahead on both ends.
> 
> Few hacks later (only changes on 9p client side) I got this running stable
> now. The reason for the virtio error above was that kvmalloc() returns a
> non-logical kernel address for any kvmalloc(>4M), i.e. an address that is
> inaccessible from host side, hence that "bogus descriptor" message by QEMU.
> So I had to split those linear 9p client buffers into sparse ones (set of
> individual pages).
> 
> I tested this for some days with various virtio transmission sizes and it
> works as expected up to 128 MB (more precisely: 128 MB read space + 128 MB
> write space per virtio round trip message).
> 
> I did not encounter a show stopper for large virtio transmission sizes
> (4 MB ... 128 MB) on virtio level, neither as a result of testing, nor after
> reviewing the existing code.
> 
> About IOV_MAX: that's apparently not an issue on virtio level. Most of the
> iovec code, both on Linux kernel side and on QEMU side do not have this
> limitation. It is apparently however indeed a limitation for userland apps
> calling the Linux kernel's syscalls yet.
> 
> Stefan, as it stands now, I am even more convinced that the upper virtio
> transmission size limit should not be squeezed into the queue size argument of
> virtio_add_queue(). Not because of the previous argument that it would waste
> space (~1MB), but rather because they are two different things. To outline
> this, just a quick recap of what happens exactly when a bulk message is pushed
> over the virtio wire (assuming virtio "split" layout here):
> 
> ---------- [recap-start] ----------
> 
> For each bulk message sent guest <-> host, exactly *one* of the pre-allocated
> descriptors is taken and placed (subsequently) into exactly *one* position of
> the two available/used ring buffers. The actual descriptor table though,
> containing all the DMA addresses of the message bulk data, is allocated just
> in time for each round trip message. Say, it is the first message sent, it
> yields in the following structure:
> 
> Ring Buffer   Descriptor Table      Bulk Data Pages
> 
>    +-+              +-+           +-----------------+
>    |D|------------->|d|---------->| Bulk data block |
>    +-+              |d|--------+  +-----------------+
>    | |              |d|------+ |
>    +-+               .       | |  +-----------------+
>    | |               .       | +->| Bulk data block |
>     .                .       |    +-----------------+
>     .               |d|-+    |
>     .               +-+ |    |    +-----------------+
>    | |                  |    +--->| Bulk data block |
>    +-+                  |         +-----------------+
>    | |                  |                 .
>    +-+                  |                 .
>                         |                 .
>                         |         +-----------------+
>                         +-------->| Bulk data block |
>                                   +-----------------+
> Legend:
> D: pre-allocated descriptor
> d: just in time allocated descriptor
> -->: memory pointer (DMA)
> 
> The bulk data blocks are allocated by the respective device driver above
> virtio subsystem level (guest side).
> 
> There are exactly as many descriptors pre-allocated (D) as the size of a ring
> buffer.
> 
> A "descriptor" is more or less just a chainable DMA memory pointer; defined
> as:
> 
> /* Virtio ring descriptors: 16 bytes.  These can chain together via "next". */
> struct vring_desc {
> 	/* Address (guest-physical). */
> 	__virtio64 addr;
> 	/* Length. */
> 	__virtio32 len;
> 	/* The flags as indicated above. */
> 	__virtio16 flags;
> 	/* We chain unused descriptors via this, too */
> 	__virtio16 next;
> };
> 
> There are 2 ring buffers; the "available" ring buffer is for sending a message
> guest->host (which will transmit DMA addresses of guest allocated bulk data
> blocks that are used for data sent to device, and separate guest allocated
> bulk data blocks that will be used by host side to place its response bulk
> data), and the "used" ring buffer is for sending host->guest to let guest know
> about host's response and that it could now safely consume and then deallocate
> the bulk data blocks subsequently.
> 
> ---------- [recap-end] ----------
> 
> So the "queue size" actually defines the ringbuffer size. It does not define
> the maximum amount of descriptors. The "queue size" rather defines how many
> pending messages can be pushed into either one ringbuffer before the other
> side would need to wait until the counter side would step up (i.e. ring buffer
> full).
> 
> The maximum amount of descriptors (what VIRTQUEUE_MAX_SIZE actually is) OTOH
> defines the max. bulk data size that could be transmitted with each virtio
> round trip message.
> 
> And in fact, 9p currently handles the virtio "queue size" as directly
> associative with its maximum amount of active 9p requests the server could
> handle simultaniously:
> 
>   hw/9pfs/9p.h:#define MAX_REQ         128
>   hw/9pfs/9p.h:    V9fsPDU pdus[MAX_REQ];
>   hw/9pfs/virtio-9p-device.c:    v->vq = virtio_add_queue(vdev, MAX_REQ,
>                                  handle_9p_output);
> 
> So if I would change it like this, just for the purpose to increase the max.
> virtio transmission size:
> 
> --- a/hw/9pfs/virtio-9p-device.c
> +++ b/hw/9pfs/virtio-9p-device.c
> @@ -218,7 +218,7 @@ static void virtio_9p_device_realize(DeviceState *dev, Error **errp)
>      v->config_size = sizeof(struct virtio_9p_config) + strlen(s->fsconf.tag);
>      virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
>                  VIRTQUEUE_MAX_SIZE);
> -    v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
> +    v->vq = virtio_add_queue(vdev, 32*1024, handle_9p_output);
>  }
>  
> Then it would require additional synchronization code on both ends and
> therefore unnecessary complexity, because it would now be possible that more
> requests are pushed into the ringbuffer than server could handle.
> 
> There is one potential issue though that probably did justify the "don't
> exceed the queue size" rule:
> 
> ATM the descriptor table is allocated (just in time) as *one* continuous
> buffer via kmalloc_array():
> https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086f7c7d33a4/drivers/virtio/virtio_ring.c#L440
> 
> So assuming transmission size of 2 * 128 MB that kmalloc_array() call would
> yield in kmalloc(1M) and the latter might fail if guest had highly fragmented
> physical memory. For such kind of error case there is currently a fallback
> path in virtqueue_add_split() that would then use the required amount of
> pre-allocated descriptors instead:
> https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086f7c7d33a4/drivers/virtio/virtio_ring.c#L525
> 
> That fallback recovery path would no longer be viable if the queue size was
> exceeded. There would be alternatives though, e.g. by allowing to chain
> indirect descriptor tables (currently prohibited by the virtio specs).

Making the maximum number of descriptors independent of the queue size
requires a change to the VIRTIO spec since the two values are currently
explicitly tied together by the spec.

Before doing that, are there benchmark results showing that 1 MB vs 128
MB produces a performance improvement? I'm asking because if performance
with 1 MB is good then you can probably do that without having to change
VIRTIO and also because it's counter-intuitive that 9p needs 128 MB for
good performance when it's ultimately implemented on top of disk and
network I/O that have lower size limits.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
  2021-10-25 10:30                 ` [Virtio-fs] " Stefan Hajnoczi
@ 2021-10-25 15:03                   ` Christian Schoenebeck
  -1 siblings, 0 replies; 97+ messages in thread
From: Christian Schoenebeck @ 2021-10-25 15:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: Stefan Hajnoczi, Kevin Wolf, Laurent Vivier, qemu-block,
	Michael S. Tsirkin, Jason Wang, Amit Shah, David Hildenbrand,
	Greg Kurz, virtio-fs, Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Marc-André Lureau, Paolo Bonzini, Fam Zheng,
	Raphael Norwitz, Dr. David Alan Gilbert

On Montag, 25. Oktober 2021 12:30:41 CEST Stefan Hajnoczi wrote:
> On Thu, Oct 21, 2021 at 05:39:28PM +0200, Christian Schoenebeck wrote:
> > On Freitag, 8. Oktober 2021 18:08:48 CEST Christian Schoenebeck wrote:
> > > On Freitag, 8. Oktober 2021 16:24:42 CEST Christian Schoenebeck wrote:
> > > > On Freitag, 8. Oktober 2021 09:25:33 CEST Greg Kurz wrote:
> > > > > On Thu, 7 Oct 2021 16:42:49 +0100
> > > > > 
> > > > > Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > On Thu, Oct 07, 2021 at 02:51:55PM +0200, Christian Schoenebeck wrote:
> > > > > > > On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan Hajnoczi wrote:
> > > > > > > > On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian
> > > > > > > > Schoenebeck
> > > 
> > > wrote:
> > > > > > > > > At the moment the maximum transfer size with virtio is
> > > > > > > > > limited
> > > > > > > > > to
> > > > > > > > > 4M
> > > > > > > > > (1024 * PAGE_SIZE). This series raises this limit to its
> > > > > > > > > maximum
> > > > > > > > > theoretical possible transfer size of 128M (32k pages)
> > > > > > > > > according
> > > > > > > > > to
> > > > > > > > > the
> > > > > > > > > virtio specs:
> > > > > > > > > 
> > > > > > > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v
> > > > > > > > > 1.1-
> > > > > > > > > cs
> > > > > > > > > 01
> > > > > > > > > .html#
> > > > > > > > > x1-240006
> > > > > > > > 
> > > > > > > > Hi Christian,
> > > > > 
> > > > > > > > I took a quick look at the code:
> > > > > Hi,
> > > > > 
> > > > > Thanks Stefan for sharing virtio expertise and helping Christian !
> > > > > 
> > > > > > > > - The Linux 9p driver restricts descriptor chains to 128
> > > > > > > > elements
> > > > > > > > 
> > > > > > > >   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)
> > > > > > > 
> > > > > > > Yes, that's the limitation that I am about to remove (WIP);
> > > > > > > current
> > > > > > > kernel
> > > > > > > patches:
> > > > > > > https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@cr
> > > > > > > udeb
> > > > > > > yt
> > > > > > > e.
> > > > > > > com/>
> > > > > > 
> > > > > > I haven't read the patches yet but I'm concerned that today the
> > > > > > driver
> > > > > > is pretty well-behaved and this new patch series introduces a spec
> > > > > > violation. Not fixing existing spec violations is okay, but adding
> > > > > > new
> > > > > > ones is a red flag. I think we need to figure out a clean
> > > > > > solution.
> > > > 
> > > > Nobody has reviewed the kernel patches yet. My main concern therefore
> > > > actually is that the kernel patches are already too complex, because
> > > > the
> > > > current situation is that only Dominique is handling 9p patches on
> > > > kernel
> > > > side, and he barely has time for 9p anymore.
> > > > 
> > > > Another reason for me to catch up on reading current kernel code and
> > > > stepping in as reviewer of 9p on kernel side ASAP, independent of this
> > > > issue.
> > > > 
> > > > As for current kernel patches' complexity: I can certainly drop patch
> > > > 7
> > > > entirely as it is probably just overkill. Patch 4 is then the biggest
> > > > chunk, I have to see if I can simplify it, and whether it would make
> > > > sense to squash with patch 3.
> > > > 
> > > > > > > > - The QEMU 9pfs code passes iovecs directly to preadv(2) and
> > > > > > > > will
> > > > > > > > fail
> > > > > > > > 
> > > > > > > >   with EINVAL when called with more than IOV_MAX iovecs
> > > > > > > >   (hw/9pfs/9p.c:v9fs_read())
> > > > > > > 
> > > > > > > Hmm, which makes me wonder why I never encountered this error
> > > > > > > during
> > > > > > > testing.
> > > > > > > 
> > > > > > > Most people will use the 9p qemu 'local' fs driver backend in
> > > > > > > practice,
> > > > > > > so
> > > > > > > that v9fs_read() call would translate for most people to this
> > > > > > > implementation on QEMU side (hw/9p/9p-local.c):
> > > > > > > 
> > > > > > > static ssize_t local_preadv(FsContext *ctx, V9fsFidOpenState
> > > > > > > *fs,
> > > > > > > 
> > > > > > >                             const struct iovec *iov,
> > > > > > >                             int iovcnt, off_t offset)
> > > > > > > 
> > > > > > > {
> > > > > > > #ifdef CONFIG_PREADV
> > > > > > > 
> > > > > > >     return preadv(fs->fd, iov, iovcnt, offset);
> > > > > > > 
> > > > > > > #else
> > > > > > > 
> > > > > > >     int err = lseek(fs->fd, offset, SEEK_SET);
> > > > > > >     if (err == -1) {
> > > > > > >     
> > > > > > >         return err;
> > > > > > >     
> > > > > > >     } else {
> > > > > > >     
> > > > > > >         return readv(fs->fd, iov, iovcnt);
> > > > > > >     
> > > > > > >     }
> > > > > > > 
> > > > > > > #endif
> > > > > > > }
> > > > > > > 
> > > > > > > > Unless I misunderstood the code, neither side can take
> > > > > > > > advantage
> > > > > > > > of
> > > > > > > > the
> > > > > > > > new 32k descriptor chain limit?
> > > > > > > > 
> > > > > > > > Thanks,
> > > > > > > > Stefan
> > > > > > > 
> > > > > > > I need to check that when I have some more time. One possible
> > > > > > > explanation
> > > > > > > might be that preadv() already has this wrapped into a loop in
> > > > > > > its
> > > > > > > implementation to circumvent a limit like IOV_MAX. It might be
> > > > > > > another
> > > > > > > "it
> > > > > > > works, but not portable" issue, but not sure.
> > > > > > > 
> > > > > > > There are still a bunch of other issues I have to resolve. If
> > > > > > > you
> > > > > > > look
> > > > > > > at
> > > > > > > net/9p/client.c on kernel side, you'll notice that it basically
> > > > > > > does
> > > > > > > this ATM> >
> > > > > > > 
> > > > > > >     kmalloc(msize);
> > > > > 
> > > > > Note that this is done twice : once for the T message (client
> > > > > request)
> > > > > and
> > > > > once for the R message (server answer). The 9p driver could adjust
> > > > > the
> > > > > size
> > > > > of the T message to what's really needed instead of allocating the
> > > > > full
> > > > > msize. R message size is not known though.
> > > > 
> > > > Would it make sense adding a second virtio ring, dedicated to server
> > > > responses to solve this? IIRC 9p server already calculates appropriate
> > > > exact sizes for each response type. So server could just push space
> > > > that's
> > > > really needed for its responses.
> > > > 
> > > > > > > for every 9p request. So not only does it allocate much more
> > > > > > > memory
> > > > > > > for
> > > > > > > every request than actually required (i.e. say 9pfs was mounted
> > > > > > > with
> > > > > > > msize=8M, then a 9p request that actually would just need 1k
> > > > > > > would
> > > > > > > nevertheless allocate 8M), but also it allocates > PAGE_SIZE,
> > > > > > > which
> > > > > > > obviously may fail at any time.>
> > > > > > 
> > > > > > The PAGE_SIZE limitation sounds like a kmalloc() vs vmalloc()
> > > > > > situation.
> > > > 
> > > > Hu, I didn't even consider vmalloc(). I just tried the kvmalloc()
> > > > wrapper
> > > > as a quick & dirty test, but it crashed in the same way as kmalloc()
> > > > with
> > > > large msize values immediately on mounting:
> > > > 
> > > > diff --git a/net/9p/client.c b/net/9p/client.c
> > > > index a75034fa249b..cfe300a4b6ca 100644
> > > > --- a/net/9p/client.c
> > > > +++ b/net/9p/client.c
> > > > @@ -227,15 +227,18 @@ static int parse_opts(char *opts, struct
> > > > p9_client
> > > > *clnt)
> > > > 
> > > >  static int p9_fcall_init(struct p9_client *c, struct p9_fcall *fc,
> > > >  
> > > >                          int alloc_msize)
> > > >  
> > > >  {
> > > > 
> > > > -       if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> > > > +       //if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> > > > +       if (false) {
> > > > 
> > > >                 fc->sdata = kmem_cache_alloc(c->fcall_cache,
> > > >                 GFP_NOFS);
> > > >                 fc->cache = c->fcall_cache;
> > > >         
> > > >         } else {
> > > > 
> > > > -               fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
> > > > +               fc->sdata = kvmalloc(alloc_msize, GFP_NOFS);
> > > 
> > > Ok, GFP_NOFS -> GFP_KERNEL did the trick.
> > > 
> > > Now I get:
> > >    virtio: bogus descriptor or out of resources
> > > 
> > > So, still some work ahead on both ends.
> > 
> > Few hacks later (only changes on 9p client side) I got this running stable
> > now. The reason for the virtio error above was that kvmalloc() returns a
> > non-logical kernel address for any kvmalloc(>4M), i.e. an address that is
> > inaccessible from host side, hence that "bogus descriptor" message by
> > QEMU.
> > So I had to split those linear 9p client buffers into sparse ones (set of
> > individual pages).
> > 
> > I tested this for some days with various virtio transmission sizes and it
> > works as expected up to 128 MB (more precisely: 128 MB read space + 128 MB
> > write space per virtio round trip message).
> > 
> > I did not encounter a show stopper for large virtio transmission sizes
> > (4 MB ... 128 MB) on virtio level, neither as a result of testing, nor
> > after reviewing the existing code.
> > 
> > About IOV_MAX: that's apparently not an issue on virtio level. Most of the
> > iovec code, both on Linux kernel side and on QEMU side do not have this
> > limitation. It is apparently however indeed a limitation for userland apps
> > calling the Linux kernel's syscalls yet.
> > 
> > Stefan, as it stands now, I am even more convinced that the upper virtio
> > transmission size limit should not be squeezed into the queue size
> > argument of virtio_add_queue(). Not because of the previous argument that
> > it would waste space (~1MB), but rather because they are two different
> > things. To outline this, just a quick recap of what happens exactly when
> > a bulk message is pushed over the virtio wire (assuming virtio "split"
> > layout here):
> > 
> > ---------- [recap-start] ----------
> > 
> > For each bulk message sent guest <-> host, exactly *one* of the
> > pre-allocated descriptors is taken and placed (subsequently) into exactly
> > *one* position of the two available/used ring buffers. The actual
> > descriptor table though, containing all the DMA addresses of the message
> > bulk data, is allocated just in time for each round trip message. Say, it
> > is the first message sent, it yields in the following structure:
> > 
> > Ring Buffer   Descriptor Table      Bulk Data Pages
> > 
> >    +-+              +-+           +-----------------+
> >    
> >    |D|------------->|d|---------->| Bulk data block |
> >    
> >    +-+              |d|--------+  +-----------------+
> >    
> >    | |              |d|------+ |
> >    
> >    +-+               .       | |  +-----------------+
> >    
> >    | |               .       | +->| Bulk data block |
> >     
> >     .                .       |    +-----------------+
> >     .               |d|-+    |
> >     .               +-+ |    |    +-----------------+
> >     
> >    | |                  |    +--->| Bulk data block |
> >    
> >    +-+                  |         +-----------------+
> >    
> >    | |                  |                 .
> >    
> >    +-+                  |                 .
> >    
> >                         |                 .
> >                         |         
> >                         |         +-----------------+
> >                         
> >                         +-------->| Bulk data block |
> >                         
> >                                   +-----------------+
> > 
> > Legend:
> > D: pre-allocated descriptor
> > d: just in time allocated descriptor
> > -->: memory pointer (DMA)
> > 
> > The bulk data blocks are allocated by the respective device driver above
> > virtio subsystem level (guest side).
> > 
> > There are exactly as many descriptors pre-allocated (D) as the size of a
> > ring buffer.
> > 
> > A "descriptor" is more or less just a chainable DMA memory pointer;
> > defined
> > as:
> > 
> > /* Virtio ring descriptors: 16 bytes.  These can chain together via
> > "next". */ struct vring_desc {
> > 
> > 	/* Address (guest-physical). */
> > 	__virtio64 addr;
> > 	/* Length. */
> > 	__virtio32 len;
> > 	/* The flags as indicated above. */
> > 	__virtio16 flags;
> > 	/* We chain unused descriptors via this, too */
> > 	__virtio16 next;
> > 
> > };
> > 
> > There are 2 ring buffers; the "available" ring buffer is for sending a
> > message guest->host (which will transmit DMA addresses of guest allocated
> > bulk data blocks that are used for data sent to device, and separate
> > guest allocated bulk data blocks that will be used by host side to place
> > its response bulk data), and the "used" ring buffer is for sending
> > host->guest to let guest know about host's response and that it could now
> > safely consume and then deallocate the bulk data blocks subsequently.
> > 
> > ---------- [recap-end] ----------
> > 
> > So the "queue size" actually defines the ringbuffer size. It does not
> > define the maximum amount of descriptors. The "queue size" rather defines
> > how many pending messages can be pushed into either one ringbuffer before
> > the other side would need to wait until the counter side would step up
> > (i.e. ring buffer full).
> > 
> > The maximum amount of descriptors (what VIRTQUEUE_MAX_SIZE actually is)
> > OTOH defines the max. bulk data size that could be transmitted with each
> > virtio round trip message.
> > 
> > And in fact, 9p currently handles the virtio "queue size" as directly
> > associative with its maximum amount of active 9p requests the server could
> > 
> > handle simultaniously:
> >   hw/9pfs/9p.h:#define MAX_REQ         128
> >   hw/9pfs/9p.h:    V9fsPDU pdus[MAX_REQ];
> >   hw/9pfs/virtio-9p-device.c:    v->vq = virtio_add_queue(vdev, MAX_REQ,
> >   
> >                                  handle_9p_output);
> > 
> > So if I would change it like this, just for the purpose to increase the
> > max. virtio transmission size:
> > 
> > --- a/hw/9pfs/virtio-9p-device.c
> > +++ b/hw/9pfs/virtio-9p-device.c
> > @@ -218,7 +218,7 @@ static void virtio_9p_device_realize(DeviceState *dev,
> > Error **errp)> 
> >      v->config_size = sizeof(struct virtio_9p_config) +
> >      strlen(s->fsconf.tag);
> >      virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
> >      
> >                  VIRTQUEUE_MAX_SIZE);
> > 
> > -    v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
> > +    v->vq = virtio_add_queue(vdev, 32*1024, handle_9p_output);
> > 
> >  }
> > 
> > Then it would require additional synchronization code on both ends and
> > therefore unnecessary complexity, because it would now be possible that
> > more requests are pushed into the ringbuffer than server could handle.
> > 
> > There is one potential issue though that probably did justify the "don't
> > exceed the queue size" rule:
> > 
> > ATM the descriptor table is allocated (just in time) as *one* continuous
> > buffer via kmalloc_array():
> > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086f7c7
> > d33a4/drivers/virtio/virtio_ring.c#L440
> > 
> > So assuming transmission size of 2 * 128 MB that kmalloc_array() call
> > would
> > yield in kmalloc(1M) and the latter might fail if guest had highly
> > fragmented physical memory. For such kind of error case there is
> > currently a fallback path in virtqueue_add_split() that would then use
> > the required amount of pre-allocated descriptors instead:
> > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086f7c7
> > d33a4/drivers/virtio/virtio_ring.c#L525
> > 
> > That fallback recovery path would no longer be viable if the queue size
> > was
> > exceeded. There would be alternatives though, e.g. by allowing to chain
> > indirect descriptor tables (currently prohibited by the virtio specs).
> 
> Making the maximum number of descriptors independent of the queue size
> requires a change to the VIRTIO spec since the two values are currently
> explicitly tied together by the spec.

Yes, that's what the virtio specs say. But they don't say why, nor did I hear
a reason in this discussion.

That's why I invested time reviewing the current virtio implementation and
specs, as well as actually testing beyond that limit. And as I outlined in
detail in my previous email, I only found one theoretical issue, which could
be addressed.

> Before doing that, are there benchmark results showing that 1 MB vs 128
> MB produces a performance improvement? I'm asking because if performance
> with 1 MB is good then you can probably do that without having to change
> VIRTIO and also because it's counter-intuitive that 9p needs 128 MB for
> good performance when it's ultimately implemented on top of disk and
> network I/O that have lower size limits.

First some numbers, from linearly reading a 12 GB file:

msize    average      notes

8 kB     52.0 MB/s    default msize of Linux kernel <v5.15
128 kB   624.8 MB/s   default msize of Linux kernel >=v5.15
512 kB   1961 MB/s    current max. msize with any Linux kernel <=v5.15
1 MB     2551 MB/s    this msize would already violate virtio specs
2 MB     2521 MB/s    this msize would already violate virtio specs
4 MB     2628 MB/s    planned max. msize of my current kernel patches [1]

Note that the current 9p Linux client implementation used here has a bunch of
known code simplifications that cost more the larger msize gets, which also
explains the bump at 1 MB vs 2 MB here. I will address these issues with my
kernel patches soon. The current numbers already suggest, though, that you
will see growing performance above 4 MB msize as well.
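
For reference, a minimal sketch of such a sequential read benchmark (not
necessarily the exact tool used for the numbers above, it just illustrates
the kind of measurement):

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        static char buf[1 << 20];   /* read in 1 MiB chunks */
        struct timespec t0, t1;
        long long total = 0;
        double secs;
        ssize_t n;
        int fd;

        if (argc < 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 1;
        }
        fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        clock_gettime(CLOCK_MONOTONIC, &t0);
        while ((n = read(fd, buf, sizeof(buf))) > 0)
                total += n;
        clock_gettime(CLOCK_MONOTONIC, &t1);
        close(fd);
        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%lld bytes in %.2f s = %.1f MB/s\n", total, secs,
               total / secs / 1e6);
        return 0;
}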

I have not even bothered benchmarking my current, heavily hacked kernel for
the 4 MB .. 128 MB range, because I'm using ridiculously expensive hacks that
copy huge buffers between 9p client level (linear buffers, non-logical address
space) and virtio level (sparse buffers, logical address space for DMA)
several times back and forth.

The point of my current hacks was just to find out whether it is feasible
and sane to exceed the current virtio limit, and I think it is.

But again, this is not just about performance. My conclusion, as described in
my previous email, is that virtio currently squeezes

	"max. simultaneous number of bulk messages"

vs.

	"max. bulk data transmission size per bulk message"

into the same configuration parameter, which is IMO inappropriate. Hence
splitting them into 2 separate parameters when creating a queue makes sense,
independent of the performance benchmarks.
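
Just to illustrate what I mean, a purely hypothetical sketch (this API does
not exist in QEMU; the name and signature are made up for illustration):

/* Hypothetical: separate the ring size from the max. descriptor chain
 * length per message instead of conflating both into "queue size". */
VirtQueue *virtio_add_queue_sized(VirtIODevice *vdev,
                                  int ring_size,      /* max. parallel messages */
                                  int max_desc_chain, /* max. descriptors per message */
                                  VirtIOHandleOutput handle_output);

/* 9p could then keep MAX_REQ parallel requests while still allowing large
 * transfers per request, e.g.: */
v->vq = virtio_add_queue_sized(vdev, MAX_REQ, 32 * 1024, handle_9p_output);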

[1] https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyte.com/

Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Virtio-fs] [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
@ 2021-10-25 15:03                   ` Christian Schoenebeck
  0 siblings, 0 replies; 97+ messages in thread
From: Christian Schoenebeck @ 2021-10-25 15:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	Jason Wang, Amit Shah, David Hildenbrand, Raphael Norwitz,
	virtio-fs, Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Paolo Bonzini, Marc-André Lureau, Fam Zheng

On Montag, 25. Oktober 2021 12:30:41 CEST Stefan Hajnoczi wrote:
> On Thu, Oct 21, 2021 at 05:39:28PM +0200, Christian Schoenebeck wrote:
> > On Freitag, 8. Oktober 2021 18:08:48 CEST Christian Schoenebeck wrote:
> > > On Freitag, 8. Oktober 2021 16:24:42 CEST Christian Schoenebeck wrote:
> > > > On Freitag, 8. Oktober 2021 09:25:33 CEST Greg Kurz wrote:
> > > > > On Thu, 7 Oct 2021 16:42:49 +0100
> > > > > 
> > > > > Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > On Thu, Oct 07, 2021 at 02:51:55PM +0200, Christian Schoenebeck wrote:
> > > > > > > On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan Hajnoczi wrote:
> > > > > > > > On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian
> > > > > > > > Schoenebeck
> > > 
> > > wrote:
> > > > > > > > > At the moment the maximum transfer size with virtio is
> > > > > > > > > limited
> > > > > > > > > to
> > > > > > > > > 4M
> > > > > > > > > (1024 * PAGE_SIZE). This series raises this limit to its
> > > > > > > > > maximum
> > > > > > > > > theoretical possible transfer size of 128M (32k pages)
> > > > > > > > > according
> > > > > > > > > to
> > > > > > > > > the
> > > > > > > > > virtio specs:
> > > > > > > > > 
> > > > > > > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v
> > > > > > > > > 1.1-
> > > > > > > > > cs
> > > > > > > > > 01
> > > > > > > > > .html#
> > > > > > > > > x1-240006
> > > > > > > > 
> > > > > > > > Hi Christian,
> > > > > 
> > > > > > > > I took a quick look at the code:
> > > > > Hi,
> > > > > 
> > > > > Thanks Stefan for sharing virtio expertise and helping Christian !
> > > > > 
> > > > > > > > - The Linux 9p driver restricts descriptor chains to 128
> > > > > > > > elements
> > > > > > > > 
> > > > > > > >   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)
> > > > > > > 
> > > > > > > Yes, that's the limitation that I am about to remove (WIP);
> > > > > > > current
> > > > > > > kernel
> > > > > > > patches:
> > > > > > > https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@cr
> > > > > > > udeb
> > > > > > > yt
> > > > > > > e.
> > > > > > > com/>
> > > > > > 
> > > > > > I haven't read the patches yet but I'm concerned that today the
> > > > > > driver
> > > > > > is pretty well-behaved and this new patch series introduces a spec
> > > > > > violation. Not fixing existing spec violations is okay, but adding
> > > > > > new
> > > > > > ones is a red flag. I think we need to figure out a clean
> > > > > > solution.
> > > > 
> > > > Nobody has reviewed the kernel patches yet. My main concern therefore
> > > > actually is that the kernel patches are already too complex, because
> > > > the
> > > > current situation is that only Dominique is handling 9p patches on
> > > > kernel
> > > > side, and he barely has time for 9p anymore.
> > > > 
> > > > Another reason for me to catch up on reading current kernel code and
> > > > stepping in as reviewer of 9p on kernel side ASAP, independent of this
> > > > issue.
> > > > 
> > > > As for current kernel patches' complexity: I can certainly drop patch
> > > > 7
> > > > entirely as it is probably just overkill. Patch 4 is then the biggest
> > > > chunk, I have to see if I can simplify it, and whether it would make
> > > > sense to squash with patch 3.
> > > > 
> > > > > > > > - The QEMU 9pfs code passes iovecs directly to preadv(2) and
> > > > > > > > will
> > > > > > > > fail
> > > > > > > > 
> > > > > > > >   with EINVAL when called with more than IOV_MAX iovecs
> > > > > > > >   (hw/9pfs/9p.c:v9fs_read())
> > > > > > > 
> > > > > > > Hmm, which makes me wonder why I never encountered this error
> > > > > > > during
> > > > > > > testing.
> > > > > > > 
> > > > > > > Most people will use the 9p qemu 'local' fs driver backend in
> > > > > > > practice,
> > > > > > > so
> > > > > > > that v9fs_read() call would translate for most people to this
> > > > > > > implementation on QEMU side (hw/9p/9p-local.c):
> > > > > > > 
> > > > > > > static ssize_t local_preadv(FsContext *ctx, V9fsFidOpenState
> > > > > > > *fs,
> > > > > > > 
> > > > > > >                             const struct iovec *iov,
> > > > > > >                             int iovcnt, off_t offset)
> > > > > > > 
> > > > > > > {
> > > > > > > #ifdef CONFIG_PREADV
> > > > > > > 
> > > > > > >     return preadv(fs->fd, iov, iovcnt, offset);
> > > > > > > 
> > > > > > > #else
> > > > > > > 
> > > > > > >     int err = lseek(fs->fd, offset, SEEK_SET);
> > > > > > >     if (err == -1) {
> > > > > > >     
> > > > > > >         return err;
> > > > > > >     
> > > > > > >     } else {
> > > > > > >     
> > > > > > >         return readv(fs->fd, iov, iovcnt);
> > > > > > >     
> > > > > > >     }
> > > > > > > 
> > > > > > > #endif
> > > > > > > }
> > > > > > > 
> > > > > > > > Unless I misunderstood the code, neither side can take
> > > > > > > > advantage
> > > > > > > > of
> > > > > > > > the
> > > > > > > > new 32k descriptor chain limit?
> > > > > > > > 
> > > > > > > > Thanks,
> > > > > > > > Stefan
> > > > > > > 
> > > > > > > I need to check that when I have some more time. One possible
> > > > > > > explanation
> > > > > > > might be that preadv() already has this wrapped into a loop in
> > > > > > > its
> > > > > > > implementation to circumvent a limit like IOV_MAX. It might be
> > > > > > > another
> > > > > > > "it
> > > > > > > works, but not portable" issue, but not sure.
> > > > > > > 
> > > > > > > There are still a bunch of other issues I have to resolve. If
> > > > > > > you
> > > > > > > look
> > > > > > > at
> > > > > > > net/9p/client.c on kernel side, you'll notice that it basically
> > > > > > > does
> > > > > > > this ATM> >
> > > > > > > 
> > > > > > >     kmalloc(msize);
> > > > > 
> > > > > Note that this is done twice : once for the T message (client
> > > > > request)
> > > > > and
> > > > > once for the R message (server answer). The 9p driver could adjust
> > > > > the
> > > > > size
> > > > > of the T message to what's really needed instead of allocating the
> > > > > full
> > > > > msize. R message size is not known though.
> > > > 
> > > > Would it make sense adding a second virtio ring, dedicated to server
> > > > responses to solve this? IIRC 9p server already calculates appropriate
> > > > exact sizes for each response type. So server could just push space
> > > > that's
> > > > really needed for its responses.
> > > > 
> > > > > > > for every 9p request. So not only does it allocate much more
> > > > > > > memory
> > > > > > > for
> > > > > > > every request than actually required (i.e. say 9pfs was mounted
> > > > > > > with
> > > > > > > msize=8M, then a 9p request that actually would just need 1k
> > > > > > > would
> > > > > > > nevertheless allocate 8M), but also it allocates > PAGE_SIZE,
> > > > > > > which
> > > > > > > obviously may fail at any time.>
> > > > > > 
> > > > > > The PAGE_SIZE limitation sounds like a kmalloc() vs vmalloc()
> > > > > > situation.
> > > > 
> > > > Hu, I didn't even consider vmalloc(). I just tried the kvmalloc()
> > > > wrapper
> > > > as a quick & dirty test, but it crashed in the same way as kmalloc()
> > > > with
> > > > large msize values immediately on mounting:
> > > > 
> > > > diff --git a/net/9p/client.c b/net/9p/client.c
> > > > index a75034fa249b..cfe300a4b6ca 100644
> > > > --- a/net/9p/client.c
> > > > +++ b/net/9p/client.c
> > > > @@ -227,15 +227,18 @@ static int parse_opts(char *opts, struct
> > > > p9_client
> > > > *clnt)
> > > > 
> > > >  static int p9_fcall_init(struct p9_client *c, struct p9_fcall *fc,
> > > >  
> > > >                          int alloc_msize)
> > > >  
> > > >  {
> > > > 
> > > > -       if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> > > > +       //if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> > > > +       if (false) {
> > > > 
> > > >                 fc->sdata = kmem_cache_alloc(c->fcall_cache,
> > > >                 GFP_NOFS);
> > > >                 fc->cache = c->fcall_cache;
> > > >         
> > > >         } else {
> > > > 
> > > > -               fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
> > > > +               fc->sdata = kvmalloc(alloc_msize, GFP_NOFS);
> > > 
> > > Ok, GFP_NOFS -> GFP_KERNEL did the trick.
> > > 
> > > Now I get:
> > >    virtio: bogus descriptor or out of resources
> > > 
> > > So, still some work ahead on both ends.
> > 
> > Few hacks later (only changes on 9p client side) I got this running stable
> > now. The reason for the virtio error above was that kvmalloc() returns a
> > non-logical kernel address for any kvmalloc(>4M), i.e. an address that is
> > inaccessible from host side, hence that "bogus descriptor" message by
> > QEMU.
> > So I had to split those linear 9p client buffers into sparse ones (set of
> > individual pages).
> > 
> > I tested this for some days with various virtio transmission sizes and it
> > works as expected up to 128 MB (more precisely: 128 MB read space + 128 MB
> > write space per virtio round trip message).
> > 
> > I did not encounter a show stopper for large virtio transmission sizes
> > (4 MB ... 128 MB) on virtio level, neither as a result of testing, nor
> > after reviewing the existing code.
> > 
> > About IOV_MAX: that's apparently not an issue on virtio level. Most of the
> > iovec code, both on Linux kernel side and on QEMU side do not have this
> > limitation. It is apparently however indeed a limitation for userland apps
> > calling the Linux kernel's syscalls yet.
> > 
> > Stefan, as it stands now, I am even more convinced that the upper virtio
> > transmission size limit should not be squeezed into the queue size
> > argument of virtio_add_queue(). Not because of the previous argument that
> > it would waste space (~1MB), but rather because they are two different
> > things. To outline this, just a quick recap of what happens exactly when
> > a bulk message is pushed over the virtio wire (assuming virtio "split"
> > layout here):
> > 
> > ---------- [recap-start] ----------
> > 
> > For each bulk message sent guest <-> host, exactly *one* of the
> > pre-allocated descriptors is taken and placed (subsequently) into exactly
> > *one* position of the two available/used ring buffers. The actual
> > descriptor table though, containing all the DMA addresses of the message
> > bulk data, is allocated just in time for each round trip message. Say, it
> > is the first message sent, it yields in the following structure:
> > 
> > Ring Buffer   Descriptor Table      Bulk Data Pages
> > 
> >    +-+              +-+           +-----------------+
> >    
> >    |D|------------->|d|---------->| Bulk data block |
> >    
> >    +-+              |d|--------+  +-----------------+
> >    
> >    | |              |d|------+ |
> >    
> >    +-+               .       | |  +-----------------+
> >    
> >    | |               .       | +->| Bulk data block |
> >     
> >     .                .       |    +-----------------+
> >     .               |d|-+    |
> >     .               +-+ |    |    +-----------------+
> >     
> >    | |                  |    +--->| Bulk data block |
> >    
> >    +-+                  |         +-----------------+
> >    
> >    | |                  |                 .
> >    
> >    +-+                  |                 .
> >    
> >                         |                 .
> >                         |         
> >                         |         +-----------------+
> >                         
> >                         +-------->| Bulk data block |
> >                         
> >                                   +-----------------+
> > 
> > Legend:
> > D: pre-allocated descriptor
> > d: just in time allocated descriptor
> > -->: memory pointer (DMA)
> > 
> > The bulk data blocks are allocated by the respective device driver above
> > virtio subsystem level (guest side).
> > 
> > There are exactly as many descriptors pre-allocated (D) as the size of a
> > ring buffer.
> > 
> > A "descriptor" is more or less just a chainable DMA memory pointer;
> > defined
> > as:
> > 
> > /* Virtio ring descriptors: 16 bytes.  These can chain together via
> > "next". */ struct vring_desc {
> > 
> > 	/* Address (guest-physical). */
> > 	__virtio64 addr;
> > 	/* Length. */
> > 	__virtio32 len;
> > 	/* The flags as indicated above. */
> > 	__virtio16 flags;
> > 	/* We chain unused descriptors via this, too */
> > 	__virtio16 next;
> > 
> > };
> > 
> > There are 2 ring buffers; the "available" ring buffer is for sending a
> > message guest->host (which will transmit DMA addresses of guest allocated
> > bulk data blocks that are used for data sent to device, and separate
> > guest allocated bulk data blocks that will be used by host side to place
> > its response bulk data), and the "used" ring buffer is for sending
> > host->guest to let guest know about host's response and that it could now
> > safely consume and then deallocate the bulk data blocks subsequently.
> > 
> > ---------- [recap-end] ----------
> > 
> > So the "queue size" actually defines the ringbuffer size. It does not
> > define the maximum amount of descriptors. The "queue size" rather defines
> > how many pending messages can be pushed into either one ringbuffer before
> > the other side would need to wait until the counter side would step up
> > (i.e. ring buffer full).
> > 
> > The maximum amount of descriptors (what VIRTQUEUE_MAX_SIZE actually is)
> > OTOH defines the max. bulk data size that could be transmitted with each
> > virtio round trip message.
> > 
> > And in fact, 9p currently handles the virtio "queue size" as directly
> > associative with its maximum amount of active 9p requests the server could
> > 
> > handle simultaniously:
> >   hw/9pfs/9p.h:#define MAX_REQ         128
> >   hw/9pfs/9p.h:    V9fsPDU pdus[MAX_REQ];
> >   hw/9pfs/virtio-9p-device.c:    v->vq = virtio_add_queue(vdev, MAX_REQ,
> >   
> >                                  handle_9p_output);
> > 
> > So if I would change it like this, just for the purpose to increase the
> > max. virtio transmission size:
> > 
> > --- a/hw/9pfs/virtio-9p-device.c
> > +++ b/hw/9pfs/virtio-9p-device.c
> > @@ -218,7 +218,7 @@ static void virtio_9p_device_realize(DeviceState *dev,
> > Error **errp)> 
> >      v->config_size = sizeof(struct virtio_9p_config) +
> >      strlen(s->fsconf.tag);
> >      virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
> >      
> >                  VIRTQUEUE_MAX_SIZE);
> > 
> > -    v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
> > +    v->vq = virtio_add_queue(vdev, 32*1024, handle_9p_output);
> > 
> >  }
> > 
> > Then it would require additional synchronization code on both ends and
> > therefore unnecessary complexity, because it would now be possible that
> > more requests are pushed into the ringbuffer than server could handle.
> > 
> > There is one potential issue though that probably did justify the "don't
> > exceed the queue size" rule:
> > 
> > ATM the descriptor table is allocated (just in time) as *one* continuous
> > buffer via kmalloc_array():
> > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086f7c7
> > d33a4/drivers/virtio/virtio_ring.c#L440
> > 
> > So assuming transmission size of 2 * 128 MB that kmalloc_array() call
> > would
> > yield in kmalloc(1M) and the latter might fail if guest had highly
> > fragmented physical memory. For such kind of error case there is
> > currently a fallback path in virtqueue_add_split() that would then use
> > the required amount of pre-allocated descriptors instead:
> > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086f7c7
> > d33a4/drivers/virtio/virtio_ring.c#L525
> > 
> > That fallback recovery path would no longer be viable if the queue size
> > was
> > exceeded. There would be alternatives though, e.g. by allowing to chain
> > indirect descriptor tables (currently prohibited by the virtio specs).
> 
> Making the maximum number of descriptors independent of the queue size
> requires a change to the VIRTIO spec since the two values are currently
> explicitly tied together by the spec.

Yes, that's what the virtio specs say. But they don't say why, nor did I hear
a reason in this discussion.

That's why I invested time reviewing the current virtio implementation and
specs, as well as actually testing beyond that limit. And as I outlined in
detail in my previous email, I only found one theoretical issue, which could
be addressed.

> Before doing that, are there benchmark results showing that 1 MB vs 128
> MB produces a performance improvement? I'm asking because if performance
> with 1 MB is good then you can probably do that without having to change
> VIRTIO and also because it's counter-intuitive that 9p needs 128 MB for
> good performance when it's ultimately implemented on top of disk and
> network I/O that have lower size limits.

First some numbers, from linearly reading a 12 GB file:

msize    average      notes

8 kB     52.0 MB/s    default msize of Linux kernel <v5.15
128 kB   624.8 MB/s   default msize of Linux kernel >=v5.15
512 kB   1961 MB/s    current max. msize with any Linux kernel <=v5.15
1 MB     2551 MB/s    this msize would already violate virtio specs
2 MB     2521 MB/s    this msize would already violate virtio specs
4 MB     2628 MB/s    planned max. msize of my current kernel patches [1]

Note that the current 9p Linux client implementation used here has a bunch of
known code simplifications that cost more the larger msize gets, which also
explains the bump at 1 MB vs 2 MB here. I will address these issues with my
kernel patches soon. The current numbers already suggest, though, that you
will see growing performance above 4 MB msize as well.

I have not even bothered benchmarking my current, heavily hacked kernel for
the 4 MB .. 128 MB range, because I'm using ridiculously expensive hacks that
copy huge buffers between 9p client level (linear buffers, non-logical address
space) and virtio level (sparse buffers, logical address space for DMA)
several times back and forth.

The point of my current hacks was just to find out whether it is feasible
and sane to exceed the current virtio limit, and I think it is.

But again, this is not just about performance. My conclusion, as described in
my previous email, is that virtio currently squeezes

	"max. simultaneous number of bulk messages"

vs.

	"max. bulk data transmission size per bulk message"

into the same configuration parameter, which is IMO inappropriate. Hence
splitting them into 2 separate parameters when creating a queue makes sense,
independent of the performance benchmarks.

[1] https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyte.com/

Best regards,
Christian Schoenebeck



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
  2021-10-25 15:03                   ` [Virtio-fs] " Christian Schoenebeck
@ 2021-10-28  9:00                     ` Stefan Hajnoczi
  -1 siblings, 0 replies; 97+ messages in thread
From: Stefan Hajnoczi @ 2021-10-28  9:00 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	Jason Wang, Amit Shah, David Hildenbrand, qemu-devel, Greg Kurz,
	virtio-fs, Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Paolo Bonzini, Marc-André Lureau, Fam Zheng,
	Raphael Norwitz, Dr. David Alan Gilbert

[-- Attachment #1: Type: text/plain, Size: 22053 bytes --]

On Mon, Oct 25, 2021 at 05:03:25PM +0200, Christian Schoenebeck wrote:
> On Montag, 25. Oktober 2021 12:30:41 CEST Stefan Hajnoczi wrote:
> > On Thu, Oct 21, 2021 at 05:39:28PM +0200, Christian Schoenebeck wrote:
> > > On Freitag, 8. Oktober 2021 18:08:48 CEST Christian Schoenebeck wrote:
> > > > On Freitag, 8. Oktober 2021 16:24:42 CEST Christian Schoenebeck wrote:
> > > > > On Freitag, 8. Oktober 2021 09:25:33 CEST Greg Kurz wrote:
> > > > > > On Thu, 7 Oct 2021 16:42:49 +0100
> > > > > > 
> > > > > > Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > > On Thu, Oct 07, 2021 at 02:51:55PM +0200, Christian Schoenebeck wrote:
> > > > > > > > On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan Hajnoczi wrote:
> > > > > > > > > On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian
> > > > > > > > > Schoenebeck
> > > > 
> > > > wrote:
> > > > > > > > > > At the moment the maximum transfer size with virtio is
> > > > > > > > > > limited
> > > > > > > > > > to
> > > > > > > > > > 4M
> > > > > > > > > > (1024 * PAGE_SIZE). This series raises this limit to its
> > > > > > > > > > maximum
> > > > > > > > > > theoretical possible transfer size of 128M (32k pages)
> > > > > > > > > > according
> > > > > > > > > > to
> > > > > > > > > > the
> > > > > > > > > > virtio specs:
> > > > > > > > > > 
> > > > > > > > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v
> > > > > > > > > > 1.1-
> > > > > > > > > > cs
> > > > > > > > > > 01
> > > > > > > > > > .html#
> > > > > > > > > > x1-240006
> > > > > > > > > 
> > > > > > > > > Hi Christian,
> > > > > > 
> > > > > > > > > I took a quick look at the code:
> > > > > > Hi,
> > > > > > 
> > > > > > Thanks Stefan for sharing virtio expertise and helping Christian !
> > > > > > 
> > > > > > > > > - The Linux 9p driver restricts descriptor chains to 128
> > > > > > > > > elements
> > > > > > > > > 
> > > > > > > > >   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)
> > > > > > > > 
> > > > > > > > Yes, that's the limitation that I am about to remove (WIP);
> > > > > > > > current
> > > > > > > > kernel
> > > > > > > > patches:
> > > > > > > > https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@cr
> > > > > > > > udeb
> > > > > > > > yt
> > > > > > > > e.
> > > > > > > > com/>
> > > > > > > 
> > > > > > > I haven't read the patches yet but I'm concerned that today the
> > > > > > > driver
> > > > > > > is pretty well-behaved and this new patch series introduces a spec
> > > > > > > violation. Not fixing existing spec violations is okay, but adding
> > > > > > > new
> > > > > > > ones is a red flag. I think we need to figure out a clean
> > > > > > > solution.
> > > > > 
> > > > > Nobody has reviewed the kernel patches yet. My main concern therefore
> > > > > actually is that the kernel patches are already too complex, because
> > > > > the
> > > > > current situation is that only Dominique is handling 9p patches on
> > > > > kernel
> > > > > side, and he barely has time for 9p anymore.
> > > > > 
> > > > > Another reason for me to catch up on reading current kernel code and
> > > > > stepping in as reviewer of 9p on kernel side ASAP, independent of this
> > > > > issue.
> > > > > 
> > > > > As for current kernel patches' complexity: I can certainly drop patch
> > > > > 7
> > > > > entirely as it is probably just overkill. Patch 4 is then the biggest
> > > > > chunk, I have to see if I can simplify it, and whether it would make
> > > > > sense to squash with patch 3.
> > > > > 
> > > > > > > > > - The QEMU 9pfs code passes iovecs directly to preadv(2) and
> > > > > > > > > will
> > > > > > > > > fail
> > > > > > > > > 
> > > > > > > > >   with EINVAL when called with more than IOV_MAX iovecs
> > > > > > > > >   (hw/9pfs/9p.c:v9fs_read())
> > > > > > > > 
> > > > > > > > Hmm, which makes me wonder why I never encountered this error
> > > > > > > > during
> > > > > > > > testing.
> > > > > > > > 
> > > > > > > > Most people will use the 9p qemu 'local' fs driver backend in
> > > > > > > > practice,
> > > > > > > > so
> > > > > > > > that v9fs_read() call would translate for most people to this
> > > > > > > > implementation on QEMU side (hw/9p/9p-local.c):
> > > > > > > > 
> > > > > > > > static ssize_t local_preadv(FsContext *ctx, V9fsFidOpenState
> > > > > > > > *fs,
> > > > > > > > 
> > > > > > > >                             const struct iovec *iov,
> > > > > > > >                             int iovcnt, off_t offset)
> > > > > > > > 
> > > > > > > > {
> > > > > > > > #ifdef CONFIG_PREADV
> > > > > > > > 
> > > > > > > >     return preadv(fs->fd, iov, iovcnt, offset);
> > > > > > > > 
> > > > > > > > #else
> > > > > > > > 
> > > > > > > >     int err = lseek(fs->fd, offset, SEEK_SET);
> > > > > > > >     if (err == -1) {
> > > > > > > >     
> > > > > > > >         return err;
> > > > > > > >     
> > > > > > > >     } else {
> > > > > > > >     
> > > > > > > >         return readv(fs->fd, iov, iovcnt);
> > > > > > > >     
> > > > > > > >     }
> > > > > > > > 
> > > > > > > > #endif
> > > > > > > > }
> > > > > > > > 
> > > > > > > > > Unless I misunderstood the code, neither side can take
> > > > > > > > > advantage
> > > > > > > > > of
> > > > > > > > > the
> > > > > > > > > new 32k descriptor chain limit?
> > > > > > > > > 
> > > > > > > > > Thanks,
> > > > > > > > > Stefan
> > > > > > > > 
> > > > > > > > I need to check that when I have some more time. One possible
> > > > > > > > explanation
> > > > > > > > might be that preadv() already has this wrapped into a loop in
> > > > > > > > its
> > > > > > > > implementation to circumvent a limit like IOV_MAX. It might be
> > > > > > > > another
> > > > > > > > "it
> > > > > > > > works, but not portable" issue, but not sure.
> > > > > > > > 
> > > > > > > > There are still a bunch of other issues I have to resolve. If
> > > > > > > > you
> > > > > > > > look
> > > > > > > > at
> > > > > > > > net/9p/client.c on kernel side, you'll notice that it basically
> > > > > > > > does
> > > > > > > > this ATM> >
> > > > > > > > 
> > > > > > > >     kmalloc(msize);
> > > > > > 
> > > > > > Note that this is done twice : once for the T message (client
> > > > > > request)
> > > > > > and
> > > > > > once for the R message (server answer). The 9p driver could adjust
> > > > > > the
> > > > > > size
> > > > > > of the T message to what's really needed instead of allocating the
> > > > > > full
> > > > > > msize. R message size is not known though.
> > > > > 
> > > > > Would it make sense adding a second virtio ring, dedicated to server
> > > > > responses to solve this? IIRC 9p server already calculates appropriate
> > > > > exact sizes for each response type. So server could just push space
> > > > > that's
> > > > > really needed for its responses.
> > > > > 
> > > > > > > > for every 9p request. So not only does it allocate much more
> > > > > > > > memory
> > > > > > > > for
> > > > > > > > every request than actually required (i.e. say 9pfs was mounted
> > > > > > > > with
> > > > > > > > msize=8M, then a 9p request that actually would just need 1k
> > > > > > > > would
> > > > > > > > nevertheless allocate 8M), but also it allocates > PAGE_SIZE,
> > > > > > > > which
> > > > > > > > obviously may fail at any time.>
> > > > > > > 
> > > > > > > The PAGE_SIZE limitation sounds like a kmalloc() vs vmalloc()
> > > > > > > situation.
> > > > > 
> > > > > Hu, I didn't even consider vmalloc(). I just tried the kvmalloc()
> > > > > wrapper
> > > > > as a quick & dirty test, but it crashed in the same way as kmalloc()
> > > > > with
> > > > > large msize values immediately on mounting:
> > > > > 
> > > > > diff --git a/net/9p/client.c b/net/9p/client.c
> > > > > index a75034fa249b..cfe300a4b6ca 100644
> > > > > --- a/net/9p/client.c
> > > > > +++ b/net/9p/client.c
> > > > > @@ -227,15 +227,18 @@ static int parse_opts(char *opts, struct
> > > > > p9_client
> > > > > *clnt)
> > > > > 
> > > > >  static int p9_fcall_init(struct p9_client *c, struct p9_fcall *fc,
> > > > >  
> > > > >                          int alloc_msize)
> > > > >  
> > > > >  {
> > > > > 
> > > > > -       if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> > > > > +       //if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> > > > > +       if (false) {
> > > > > 
> > > > >                 fc->sdata = kmem_cache_alloc(c->fcall_cache,
> > > > >                 GFP_NOFS);
> > > > >                 fc->cache = c->fcall_cache;
> > > > >         
> > > > >         } else {
> > > > > 
> > > > > -               fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
> > > > > +               fc->sdata = kvmalloc(alloc_msize, GFP_NOFS);
> > > > 
> > > > Ok, GFP_NOFS -> GFP_KERNEL did the trick.
> > > > 
> > > > Now I get:
> > > >    virtio: bogus descriptor or out of resources
> > > > 
> > > > So, still some work ahead on both ends.
> > > 
> > > Few hacks later (only changes on 9p client side) I got this running stable
> > > now. The reason for the virtio error above was that kvmalloc() returns a
> > > non-logical kernel address for any kvmalloc(>4M), i.e. an address that is
> > > inaccessible from host side, hence that "bogus descriptor" message by
> > > QEMU.
> > > So I had to split those linear 9p client buffers into sparse ones (set of
> > > individual pages).
> > > 
> > > I tested this for some days with various virtio transmission sizes and it
> > > works as expected up to 128 MB (more precisely: 128 MB read space + 128 MB
> > > write space per virtio round trip message).
> > > 
> > > I did not encounter a show stopper for large virtio transmission sizes
> > > (4 MB ... 128 MB) on virtio level, neither as a result of testing, nor
> > > after reviewing the existing code.
> > > 
> > > About IOV_MAX: that's apparently not an issue on virtio level. Most of the
> > > iovec code, both on Linux kernel side and on QEMU side do not have this
> > > limitation. It is apparently however indeed a limitation for userland apps
> > > calling the Linux kernel's syscalls yet.
> > > 
> > > Stefan, as it stands now, I am even more convinced that the upper virtio
> > > transmission size limit should not be squeezed into the queue size
> > > argument of virtio_add_queue(). Not because of the previous argument that
> > > it would waste space (~1MB), but rather because they are two different
> > > things. To outline this, just a quick recap of what happens exactly when
> > > a bulk message is pushed over the virtio wire (assuming virtio "split"
> > > layout here):
> > > 
> > > ---------- [recap-start] ----------
> > > 
> > > For each bulk message sent guest <-> host, exactly *one* of the
> > > pre-allocated descriptors is taken and placed (subsequently) into exactly
> > > *one* position of the two available/used ring buffers. The actual
> > > descriptor table though, containing all the DMA addresses of the message
> > > bulk data, is allocated just in time for each round trip message. Say, it
> > > is the first message sent, it yields in the following structure:
> > > 
> > > Ring Buffer   Descriptor Table      Bulk Data Pages
> > > 
> > >    +-+              +-+           +-----------------+
> > >    
> > >    |D|------------->|d|---------->| Bulk data block |
> > >    
> > >    +-+              |d|--------+  +-----------------+
> > >    
> > >    | |              |d|------+ |
> > >    
> > >    +-+               .       | |  +-----------------+
> > >    
> > >    | |               .       | +->| Bulk data block |
> > >     
> > >     .                .       |    +-----------------+
> > >     .               |d|-+    |
> > >     .               +-+ |    |    +-----------------+
> > >     
> > >    | |                  |    +--->| Bulk data block |
> > >    
> > >    +-+                  |         +-----------------+
> > >    
> > >    | |                  |                 .
> > >    
> > >    +-+                  |                 .
> > >    
> > >                         |                 .
> > >                         |         
> > >                         |         +-----------------+
> > >                         
> > >                         +-------->| Bulk data block |
> > >                         
> > >                                   +-----------------+
> > > 
> > > Legend:
> > > D: pre-allocated descriptor
> > > d: just in time allocated descriptor
> > > -->: memory pointer (DMA)
> > > 
> > > The bulk data blocks are allocated by the respective device driver above
> > > virtio subsystem level (guest side).
> > > 
> > > There are exactly as many descriptors pre-allocated (D) as the size of a
> > > ring buffer.
> > > 
> > > A "descriptor" is more or less just a chainable DMA memory pointer;
> > > defined
> > > as:
> > > 
> > > /* Virtio ring descriptors: 16 bytes.  These can chain together via
> > > "next". */ struct vring_desc {
> > > 
> > > 	/* Address (guest-physical). */
> > > 	__virtio64 addr;
> > > 	/* Length. */
> > > 	__virtio32 len;
> > > 	/* The flags as indicated above. */
> > > 	__virtio16 flags;
> > > 	/* We chain unused descriptors via this, too */
> > > 	__virtio16 next;
> > > 
> > > };
> > > 
> > > There are 2 ring buffers; the "available" ring buffer is for sending a
> > > message guest->host (which will transmit DMA addresses of guest allocated
> > > bulk data blocks that are used for data sent to device, and separate
> > > guest allocated bulk data blocks that will be used by host side to place
> > > its response bulk data), and the "used" ring buffer is for sending
> > > host->guest to let guest know about host's response and that it could now
> > > safely consume and then deallocate the bulk data blocks subsequently.
> > > 
> > > ---------- [recap-end] ----------
> > > 
> > > So the "queue size" actually defines the ringbuffer size. It does not
> > > define the maximum amount of descriptors. The "queue size" rather defines
> > > how many pending messages can be pushed into either one ringbuffer before
> > > the other side would need to wait until the counter side would step up
> > > (i.e. ring buffer full).
> > > 
> > > The maximum amount of descriptors (what VIRTQUEUE_MAX_SIZE actually is)
> > > OTOH defines the max. bulk data size that could be transmitted with each
> > > virtio round trip message.
> > > 
> > > And in fact, 9p currently handles the virtio "queue size" as directly
> > > associative with its maximum amount of active 9p requests the server could
> > > 
> > > handle simultaniously:
> > >   hw/9pfs/9p.h:#define MAX_REQ         128
> > >   hw/9pfs/9p.h:    V9fsPDU pdus[MAX_REQ];
> > >   hw/9pfs/virtio-9p-device.c:    v->vq = virtio_add_queue(vdev, MAX_REQ,
> > >   
> > >                                  handle_9p_output);
> > > 
> > > So if I would change it like this, just for the purpose to increase the
> > > max. virtio transmission size:
> > > 
> > > --- a/hw/9pfs/virtio-9p-device.c
> > > +++ b/hw/9pfs/virtio-9p-device.c
> > > @@ -218,7 +218,7 @@ static void virtio_9p_device_realize(DeviceState *dev,
> > > Error **errp)> 
> > >      v->config_size = sizeof(struct virtio_9p_config) +
> > >      strlen(s->fsconf.tag);
> > >      virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
> > >      
> > >                  VIRTQUEUE_MAX_SIZE);
> > > 
> > > -    v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
> > > +    v->vq = virtio_add_queue(vdev, 32*1024, handle_9p_output);
> > > 
> > >  }
> > > 
> > > Then it would require additional synchronization code on both ends and
> > > therefore unnecessary complexity, because it would now be possible that
> > > more requests are pushed into the ringbuffer than server could handle.
> > > 
> > > There is one potential issue though that probably did justify the "don't
> > > exceed the queue size" rule:
> > > 
> > > ATM the descriptor table is allocated (just in time) as *one* continuous
> > > buffer via kmalloc_array():
> > > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086f7c7
> > > d33a4/drivers/virtio/virtio_ring.c#L440
> > > 
> > > So assuming transmission size of 2 * 128 MB that kmalloc_array() call
> > > would
> > > yield in kmalloc(1M) and the latter might fail if guest had highly
> > > fragmented physical memory. For such kind of error case there is
> > > currently a fallback path in virtqueue_add_split() that would then use
> > > the required amount of pre-allocated descriptors instead:
> > > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086f7c7
> > > d33a4/drivers/virtio/virtio_ring.c#L525
> > > 
> > > That fallback recovery path would no longer be viable if the queue size
> > > was
> > > exceeded. There would be alternatives though, e.g. by allowing to chain
> > > indirect descriptor tables (currently prohibited by the virtio specs).
> > 
> > Making the maximum number of descriptors independent of the queue size
> > requires a change to the VIRTIO spec since the two values are currently
> > explicitly tied together by the spec.
> 
> Yes, that's what the virtio specs say. But they don't say why, nor did I hear
> a reason in this dicussion.
> 
> That's why I invested time reviewing current virtio implementation and specs,
> as well as actually testing exceeding that limit. And as I outlined in detail
> in my previous email, I only found one theoretical issue that could be
> addressed though.

I agree that there is a limitation in the VIRTIO spec, but violating the
spec isn't an acceptable solution:

1. QEMU and Linux aren't the only components that implement VIRTIO. You
   cannot make assumptions about their implementations, because doing so
   may break spec-compliant implementations that you haven't looked at.

   Your patches weren't able to increase Queue Size because some device
   implementations break when descriptor chains are too long. This shows
   there is a practical issue even in QEMU.

2. The specific spec violation that we discussed creates the problem
   that drivers can no longer determine the maximum descriptor chain
   length. This in turn will lead to more implementation-specific
   assumptions being baked into drivers and cause problems with
   interoperability and future changes.

The spec needs to be extended instead. I included an idea for how to do
that below.

> > Before doing that, are there benchmark results showing that 1 MB vs 128
> > MB produces a performance improvement? I'm asking because if performance
> > with 1 MB is good then you can probably do that without having to change
> > VIRTIO and also because it's counter-intuitive that 9p needs 128 MB for
> > good performance when it's ultimately implemented on top of disk and
> > network I/O that have lower size limits.
> 
> First some numbers, linear reading a 12 GB file:
> 
> msize    average      notes
> 
> 8 kB     52.0 MB/s    default msize of Linux kernel <v5.15
> 128 kB   624.8 MB/s   default msize of Linux kernel >=v5.15
> 512 kB   1961 MB/s    current max. msize with any Linux kernel <=v5.15
> 1 MB     2551 MB/s    this msize would already violate virtio specs
> 2 MB     2521 MB/s    this msize would already violate virtio specs
> 4 MB     2628 MB/s    planned max. msize of my current kernel patches [1]

How many descriptors are used? 4 MB can be covered by a single
descriptor if the data is physically contiguous in memory, so this data
doesn't demonstrate a need for more descriptors.

> But again, this is not just about performance. My conclusion as described in
> my previous email is that virtio currently squeezes
> 
> 	"max. simultanious amount of bulk messages"
> 
> vs.
> 
> 	"max. bulk data transmission size per bulk messaage"
> 
> into the same configuration parameter, which is IMO inappropriate and hence
> splitting them into 2 separate parameters when creating a queue makes sense,
> independent of the performance benchmarks.
> 
> [1] https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyte.com/

Some devices effectively already have this because the device advertises
a maximum number of descriptors via device-specific mechanisms like the
struct virtio_blk_config seg_max field. But today these fields can only
reduce the maximum descriptor chain length because the spec still limits
the length to Queue Size.
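
For reference, this is roughly how such a device-specific limit reaches the
driver today in the virtio-blk case (a simplified sketch, not verbatim kernel
code; details differ between kernel versions):

        u32 sg_elems;
        int err;

        /* Read the device's advertised max. number of data segments per
         * request, if it offers VIRTIO_BLK_F_SEG_MAX ... */
        err = virtio_cread_feature(vdev, VIRTIO_BLK_F_SEG_MAX,
                                   struct virtio_blk_config, seg_max,
                                   &sg_elems);
        if (err)
                sg_elems = 1;

        /* ... and cap the block layer accordingly, so no request ever needs
         * more descriptors than the device (and, today, the Queue Size)
         * allows. */
        blk_queue_max_segments(q, sg_elems);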

We can build on this approach to raise the length above Queue Size. This
approach has the advantage that the maximum number of segments isn't per
device or per virtqueue; it's fine-grained. If the device supports two
request types, then different max descriptor chain limits could be given
for them by introducing two separate configuration space fields.

Here are the corresponding spec changes:

1. A new feature bit called VIRTIO_RING_F_LARGE_INDIRECT_DESC is added
   to indicate that indirect descriptor table size and maximum
   descriptor chain length are not limited by the Queue Size value. (Maybe
   there still needs to be a limit like 2^15?)

2. "2.6.5.3.1 Driver Requirements: Indirect Descriptors" is updated to
   say that VIRTIO_RING_F_LARGE_INDIRECT_DESC overrides the maximum
   descriptor chain length.

3. A new configuration space field is added for 9p indicating the
   maximum descriptor chain length (a purely hypothetical sketch of
   points 1. and 3. follows below).
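
A purely hypothetical sketch of points 1. and 3. (none of these names, values
or struct fields exist in the VIRTIO spec or in Linux today; they only
illustrate the proposal):

/* 1. New feature bit; the bit number is made up for illustration. */
#define VIRTIO_RING_F_LARGE_INDIRECT_DESC  36

/* 3. New 9p config space field advertising the per-request descriptor
 *    chain limit. Shown standalone here; in practice it would be appended
 *    after the existing mount tag fields of struct virtio_9p_config. */
struct virtio_9p_config_ext {
        __le32 max_desc_chain_len;
};

/* A driver could then size its requests like this: */
static u32 p9_virtio_max_chain(struct virtio_device *vdev, struct virtqueue *vq)
{
        u32 max_chain;

        if (!virtio_has_feature(vdev, VIRTIO_RING_F_LARGE_INDIRECT_DESC))
                return virtqueue_get_vring_size(vq);  /* old rule: Queue Size */

        virtio_cread(vdev, struct virtio_9p_config_ext, max_desc_chain_len,
                     &max_chain);
        return max_chain;
}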

One thing that's messy is that we've been discussing the maximum
descriptor chain length but 9p has the "msize" concept, which isn't
aware of contiguous memory. It may be necessary to extend the 9p driver
code to size requests not just according to their length in bytes but
also according to the descriptor chain length. That's how the Linux
block layer deals with queue limits (struct queue_limits max_segments vs
max_hw_sectors).
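
To make the msize <-> descriptor chain relationship concrete (simple
arithmetic, assuming 4 KiB pages and no physically contiguous merging):

    descriptors per request  ~  msize / PAGE_SIZE
      4 MB msize:    4 MiB / 4 KiB  =  1024 descriptors (the current limit)
    128 MB msize:  128 MiB / 4 KiB  = 32768 descriptors (the proposed 32k)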

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Virtio-fs] [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
@ 2021-10-28  9:00                     ` Stefan Hajnoczi
  0 siblings, 0 replies; 97+ messages in thread
From: Stefan Hajnoczi @ 2021-10-28  9:00 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	Jason Wang, Amit Shah, David Hildenbrand, qemu-devel, virtio-fs,
	Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Paolo Bonzini, Marc-André Lureau, Fam Zheng,
	Raphael Norwitz

[-- Attachment #1: Type: text/plain, Size: 22053 bytes --]

On Mon, Oct 25, 2021 at 05:03:25PM +0200, Christian Schoenebeck wrote:
> On Montag, 25. Oktober 2021 12:30:41 CEST Stefan Hajnoczi wrote:
> > On Thu, Oct 21, 2021 at 05:39:28PM +0200, Christian Schoenebeck wrote:
> > > On Freitag, 8. Oktober 2021 18:08:48 CEST Christian Schoenebeck wrote:
> > > > On Freitag, 8. Oktober 2021 16:24:42 CEST Christian Schoenebeck wrote:
> > > > > On Freitag, 8. Oktober 2021 09:25:33 CEST Greg Kurz wrote:
> > > > > > On Thu, 7 Oct 2021 16:42:49 +0100
> > > > > > 
> > > > > > Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > > On Thu, Oct 07, 2021 at 02:51:55PM +0200, Christian Schoenebeck wrote:
> > > > > > > > On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan Hajnoczi wrote:
> > > > > > > > > On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian
> > > > > > > > > Schoenebeck
> > > > 
> > > > wrote:
> > > > > > > > > > At the moment the maximum transfer size with virtio is
> > > > > > > > > > limited
> > > > > > > > > > to
> > > > > > > > > > 4M
> > > > > > > > > > (1024 * PAGE_SIZE). This series raises this limit to its
> > > > > > > > > > maximum
> > > > > > > > > > theoretical possible transfer size of 128M (32k pages)
> > > > > > > > > > according
> > > > > > > > > > to
> > > > > > > > > > the
> > > > > > > > > > virtio specs:
> > > > > > > > > > 
> > > > > > > > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v
> > > > > > > > > > 1.1-
> > > > > > > > > > cs
> > > > > > > > > > 01
> > > > > > > > > > .html#
> > > > > > > > > > x1-240006
> > > > > > > > > 
> > > > > > > > > Hi Christian,
> > > > > > 
> > > > > > > > > I took a quick look at the code:
> > > > > > Hi,
> > > > > > 
> > > > > > Thanks Stefan for sharing virtio expertise and helping Christian !
> > > > > > 
> > > > > > > > > - The Linux 9p driver restricts descriptor chains to 128
> > > > > > > > > elements
> > > > > > > > > 
> > > > > > > > >   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)
> > > > > > > > 
> > > > > > > > Yes, that's the limitation that I am about to remove (WIP);
> > > > > > > > current
> > > > > > > > kernel
> > > > > > > > patches:
> > > > > > > > https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@cr
> > > > > > > > udeb
> > > > > > > > yt
> > > > > > > > e.
> > > > > > > > com/>
> > > > > > > 
> > > > > > > I haven't read the patches yet but I'm concerned that today the
> > > > > > > driver
> > > > > > > is pretty well-behaved and this new patch series introduces a spec
> > > > > > > violation. Not fixing existing spec violations is okay, but adding
> > > > > > > new
> > > > > > > ones is a red flag. I think we need to figure out a clean
> > > > > > > solution.
> > > > > 
> > > > > Nobody has reviewed the kernel patches yet. My main concern therefore
> > > > > actually is that the kernel patches are already too complex, because
> > > > > the
> > > > > current situation is that only Dominique is handling 9p patches on
> > > > > kernel
> > > > > side, and he barely has time for 9p anymore.
> > > > > 
> > > > > Another reason for me to catch up on reading current kernel code and
> > > > > stepping in as reviewer of 9p on kernel side ASAP, independent of this
> > > > > issue.
> > > > > 
> > > > > As for current kernel patches' complexity: I can certainly drop patch
> > > > > 7
> > > > > entirely as it is probably just overkill. Patch 4 is then the biggest
> > > > > chunk, I have to see if I can simplify it, and whether it would make
> > > > > sense to squash with patch 3.
> > > > > 
> > > > > > > > > - The QEMU 9pfs code passes iovecs directly to preadv(2) and
> > > > > > > > > will
> > > > > > > > > fail
> > > > > > > > > 
> > > > > > > > >   with EINVAL when called with more than IOV_MAX iovecs
> > > > > > > > >   (hw/9pfs/9p.c:v9fs_read())
> > > > > > > > 
> > > > > > > > Hmm, which makes me wonder why I never encountered this error
> > > > > > > > during
> > > > > > > > testing.
> > > > > > > > 
> > > > > > > > Most people will use the 9p qemu 'local' fs driver backend in
> > > > > > > > practice,
> > > > > > > > so
> > > > > > > > that v9fs_read() call would translate for most people to this
> > > > > > > > implementation on QEMU side (hw/9p/9p-local.c):
> > > > > > > > 
> > > > > > > > static ssize_t local_preadv(FsContext *ctx, V9fsFidOpenState
> > > > > > > > *fs,
> > > > > > > > 
> > > > > > > >                             const struct iovec *iov,
> > > > > > > >                             int iovcnt, off_t offset)
> > > > > > > > 
> > > > > > > > {
> > > > > > > > #ifdef CONFIG_PREADV
> > > > > > > > 
> > > > > > > >     return preadv(fs->fd, iov, iovcnt, offset);
> > > > > > > > 
> > > > > > > > #else
> > > > > > > > 
> > > > > > > >     int err = lseek(fs->fd, offset, SEEK_SET);
> > > > > > > >     if (err == -1) {
> > > > > > > >     
> > > > > > > >         return err;
> > > > > > > >     
> > > > > > > >     } else {
> > > > > > > >     
> > > > > > > >         return readv(fs->fd, iov, iovcnt);
> > > > > > > >     
> > > > > > > >     }
> > > > > > > > 
> > > > > > > > #endif
> > > > > > > > }
> > > > > > > > 
> > > > > > > > > Unless I misunderstood the code, neither side can take
> > > > > > > > > advantage
> > > > > > > > > of
> > > > > > > > > the
> > > > > > > > > new 32k descriptor chain limit?
> > > > > > > > > 
> > > > > > > > > Thanks,
> > > > > > > > > Stefan
> > > > > > > > 
> > > > > > > > I need to check that when I have some more time. One possible
> > > > > > > > explanation
> > > > > > > > might be that preadv() already has this wrapped into a loop in
> > > > > > > > its
> > > > > > > > implementation to circumvent a limit like IOV_MAX. It might be
> > > > > > > > another
> > > > > > > > "it
> > > > > > > > works, but not portable" issue, but not sure.
> > > > > > > > 
> > > > > > > > There are still a bunch of other issues I have to resolve. If
> > > > > > > > you
> > > > > > > > look
> > > > > > > > at
> > > > > > > > net/9p/client.c on kernel side, you'll notice that it basically
> > > > > > > > does
> > > > > > > > this ATM> >
> > > > > > > > 
> > > > > > > >     kmalloc(msize);
> > > > > > 
> > > > > > Note that this is done twice : once for the T message (client
> > > > > > request)
> > > > > > and
> > > > > > once for the R message (server answer). The 9p driver could adjust
> > > > > > the
> > > > > > size
> > > > > > of the T message to what's really needed instead of allocating the
> > > > > > full
> > > > > > msize. R message size is not known though.
> > > > > 
> > > > > Would it make sense adding a second virtio ring, dedicated to server
> > > > > responses to solve this? IIRC 9p server already calculates appropriate
> > > > > exact sizes for each response type. So server could just push space
> > > > > that's
> > > > > really needed for its responses.
> > > > > 
> > > > > > > > for every 9p request. So not only does it allocate much more
> > > > > > > > memory
> > > > > > > > for
> > > > > > > > every request than actually required (i.e. say 9pfs was mounted
> > > > > > > > with
> > > > > > > > msize=8M, then a 9p request that actually would just need 1k
> > > > > > > > would
> > > > > > > > nevertheless allocate 8M), but also it allocates > PAGE_SIZE,
> > > > > > > > which
> > > > > > > > obviously may fail at any time.>
> > > > > > > 
> > > > > > > The PAGE_SIZE limitation sounds like a kmalloc() vs vmalloc()
> > > > > > > situation.
> > > > > 
> > > > > Hu, I didn't even consider vmalloc(). I just tried the kvmalloc()
> > > > > wrapper
> > > > > as a quick & dirty test, but it crashed in the same way as kmalloc()
> > > > > with
> > > > > large msize values immediately on mounting:
> > > > > 
> > > > > diff --git a/net/9p/client.c b/net/9p/client.c
> > > > > index a75034fa249b..cfe300a4b6ca 100644
> > > > > --- a/net/9p/client.c
> > > > > +++ b/net/9p/client.c
> > > > > @@ -227,15 +227,18 @@ static int parse_opts(char *opts, struct
> > > > > p9_client
> > > > > *clnt)
> > > > > 
> > > > >  static int p9_fcall_init(struct p9_client *c, struct p9_fcall *fc,
> > > > >  
> > > > >                          int alloc_msize)
> > > > >  
> > > > >  {
> > > > > 
> > > > > -       if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> > > > > +       //if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> > > > > +       if (false) {
> > > > > 
> > > > >                 fc->sdata = kmem_cache_alloc(c->fcall_cache,
> > > > >                 GFP_NOFS);
> > > > >                 fc->cache = c->fcall_cache;
> > > > >         
> > > > >         } else {
> > > > > 
> > > > > -               fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
> > > > > +               fc->sdata = kvmalloc(alloc_msize, GFP_NOFS);
> > > > 
> > > > Ok, GFP_NOFS -> GFP_KERNEL did the trick.
> > > > 
> > > > Now I get:
> > > >    virtio: bogus descriptor or out of resources
> > > > 
> > > > So, still some work ahead on both ends.
> > > 
> > > Few hacks later (only changes on 9p client side) I got this running stable
> > > now. The reason for the virtio error above was that kvmalloc() returns a
> > > non-logical kernel address for any kvmalloc(>4M), i.e. an address that is
> > > inaccessible from host side, hence that "bogus descriptor" message by
> > > QEMU.
> > > So I had to split those linear 9p client buffers into sparse ones (set of
> > > individual pages).
> > > 
> > > I tested this for some days with various virtio transmission sizes and it
> > > works as expected up to 128 MB (more precisely: 128 MB read space + 128 MB
> > > write space per virtio round trip message).
> > > 
> > > I did not encounter a show stopper for large virtio transmission sizes
> > > (4 MB ... 128 MB) on virtio level, neither as a result of testing, nor
> > > after reviewing the existing code.
> > > 
> > > About IOV_MAX: that's apparently not an issue on virtio level. Most of the
> > > iovec code, both on Linux kernel side and on QEMU side do not have this
> > > limitation. It is apparently however indeed a limitation for userland apps
> > > calling the Linux kernel's syscalls yet.
> > > 
> > > Stefan, as it stands now, I am even more convinced that the upper virtio
> > > transmission size limit should not be squeezed into the queue size
> > > argument of virtio_add_queue(). Not because of the previous argument that
> > > it would waste space (~1MB), but rather because they are two different
> > > things. To outline this, just a quick recap of what happens exactly when
> > > a bulk message is pushed over the virtio wire (assuming virtio "split"
> > > layout here):
> > > 
> > > ---------- [recap-start] ----------
> > > 
> > > For each bulk message sent guest <-> host, exactly *one* of the
> > > pre-allocated descriptors is taken and placed (subsequently) into exactly
> > > *one* position of the two available/used ring buffers. The actual
> > > descriptor table though, containing all the DMA addresses of the message
> > > bulk data, is allocated just in time for each round trip message. Say, it
> > > is the first message sent, it yields in the following structure:
> > > 
> > > Ring Buffer   Descriptor Table      Bulk Data Pages
> > > 
> > >    +-+              +-+           +-----------------+
> > >    
> > >    |D|------------->|d|---------->| Bulk data block |
> > >    
> > >    +-+              |d|--------+  +-----------------+
> > >    
> > >    | |              |d|------+ |
> > >    
> > >    +-+               .       | |  +-----------------+
> > >    
> > >    | |               .       | +->| Bulk data block |
> > >     
> > >     .                .       |    +-----------------+
> > >     .               |d|-+    |
> > >     .               +-+ |    |    +-----------------+
> > >     
> > >    | |                  |    +--->| Bulk data block |
> > >    
> > >    +-+                  |         +-----------------+
> > >    
> > >    | |                  |                 .
> > >    
> > >    +-+                  |                 .
> > >    
> > >                         |                 .
> > >                         |         
> > >                         |         +-----------------+
> > >                         
> > >                         +-------->| Bulk data block |
> > >                         
> > >                                   +-----------------+
> > > 
> > > Legend:
> > > D: pre-allocated descriptor
> > > d: just in time allocated descriptor
> > > -->: memory pointer (DMA)
> > > 
> > > The bulk data blocks are allocated by the respective device driver above
> > > virtio subsystem level (guest side).
> > > 
> > > There are exactly as many descriptors pre-allocated (D) as the size of a
> > > ring buffer.
> > > 
> > > A "descriptor" is more or less just a chainable DMA memory pointer;
> > > defined
> > > as:
> > > 
> > > /* Virtio ring descriptors: 16 bytes.  These can chain together via
> > > "next". */ struct vring_desc {
> > > 
> > > 	/* Address (guest-physical). */
> > > 	__virtio64 addr;
> > > 	/* Length. */
> > > 	__virtio32 len;
> > > 	/* The flags as indicated above. */
> > > 	__virtio16 flags;
> > > 	/* We chain unused descriptors via this, too */
> > > 	__virtio16 next;
> > > 
> > > };
> > > 
> > > There are 2 ring buffers; the "available" ring buffer is for sending a
> > > message guest->host (which will transmit DMA addresses of guest allocated
> > > bulk data blocks that are used for data sent to device, and separate
> > > guest allocated bulk data blocks that will be used by host side to place
> > > its response bulk data), and the "used" ring buffer is for sending
> > > host->guest to let guest know about host's response and that it could now
> > > safely consume and then deallocate the bulk data blocks subsequently.
> > > 
> > > ---------- [recap-end] ----------
> > > 
> > > So the "queue size" actually defines the ringbuffer size. It does not
> > > define the maximum amount of descriptors. The "queue size" rather defines
> > > how many pending messages can be pushed into either one ringbuffer before
> > > the other side would need to wait until the counter side would step up
> > > (i.e. ring buffer full).
> > > 
> > > The maximum amount of descriptors (what VIRTQUEUE_MAX_SIZE actually is)
> > > OTOH defines the max. bulk data size that could be transmitted with each
> > > virtio round trip message.
> > > 
> > > And in fact, 9p currently handles the virtio "queue size" as directly
> > > associative with its maximum amount of active 9p requests the server could
> > > 
> > > handle simultaniously:
> > >   hw/9pfs/9p.h:#define MAX_REQ         128
> > >   hw/9pfs/9p.h:    V9fsPDU pdus[MAX_REQ];
> > >   hw/9pfs/virtio-9p-device.c:    v->vq = virtio_add_queue(vdev, MAX_REQ,
> > >   
> > >                                  handle_9p_output);
> > > 
> > > So if I would change it like this, just for the purpose to increase the
> > > max. virtio transmission size:
> > > 
> > > --- a/hw/9pfs/virtio-9p-device.c
> > > +++ b/hw/9pfs/virtio-9p-device.c
> > > @@ -218,7 +218,7 @@ static void virtio_9p_device_realize(DeviceState *dev,
> > > Error **errp)> 
> > >      v->config_size = sizeof(struct virtio_9p_config) +
> > >      strlen(s->fsconf.tag);
> > >      virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
> > >      
> > >                  VIRTQUEUE_MAX_SIZE);
> > > 
> > > -    v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
> > > +    v->vq = virtio_add_queue(vdev, 32*1024, handle_9p_output);
> > > 
> > >  }
> > > 
> > > Then it would require additional synchronization code on both ends and
> > > therefore unnecessary complexity, because it would now be possible that
> > > more requests are pushed into the ringbuffer than server could handle.
> > > 
> > > There is one potential issue though that probably did justify the "don't
> > > exceed the queue size" rule:
> > > 
> > > ATM the descriptor table is allocated (just in time) as *one* continuous
> > > buffer via kmalloc_array():
> > > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086f7c7
> > > d33a4/drivers/virtio/virtio_ring.c#L440
> > > 
> > > So assuming transmission size of 2 * 128 MB that kmalloc_array() call
> > > would
> > > yield in kmalloc(1M) and the latter might fail if guest had highly
> > > fragmented physical memory. For such kind of error case there is
> > > currently a fallback path in virtqueue_add_split() that would then use
> > > the required amount of pre-allocated descriptors instead:
> > > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086f7c7
> > > d33a4/drivers/virtio/virtio_ring.c#L525
> > > 
> > > That fallback recovery path would no longer be viable if the queue size
> > > was
> > > exceeded. There would be alternatives though, e.g. by allowing to chain
> > > indirect descriptor tables (currently prohibited by the virtio specs).
> > 
> > Making the maximum number of descriptors independent of the queue size
> > requires a change to the VIRTIO spec since the two values are currently
> > explicitly tied together by the spec.
> 
> Yes, that's what the virtio specs say. But they don't say why, nor did I hear
> a reason in this dicussion.
> 
> That's why I invested time reviewing current virtio implementation and specs,
> as well as actually testing exceeding that limit. And as I outlined in detail
> in my previous email, I only found one theoretical issue that could be
> addressed though.

I agree that there is a limitation in the VIRTIO spec, but violating the
spec isn't an acceptable solution:

1. QEMU and Linux aren't the only components that implement VIRTIO. You
   cannot make assumptions about their implementations because it may
   break spec-compliant implementations that you haven't looked at.

   Your patches weren't able to increase Queue Size because some device
   implementations break when descriptor chains are too long. This shows
   there is a practical issue even in QEMU.

2. The specific spec violation that we discussed creates the problem
   that drivers can no longer determine the maximum descriptor chain
   length. This in turn will lead to more implementation-specific
   assumptions being baked into drivers and cause problems with
   interoperability and future changes.

The spec needs to be extended instead. I included an idea for how to do
that below.

> > Before doing that, are there benchmark results showing that 1 MB vs 128
> > MB produces a performance improvement? I'm asking because if performance
> > with 1 MB is good then you can probably do that without having to change
> > VIRTIO and also because it's counter-intuitive that 9p needs 128 MB for
> > good performance when it's ultimately implemented on top of disk and
> > network I/O that have lower size limits.
> 
> First some numbers, linear reading a 12 GB file:
> 
> msize    average      notes
> 
> 8 kB     52.0 MB/s    default msize of Linux kernel <v5.15
> 128 kB   624.8 MB/s   default msize of Linux kernel >=v5.15
> 512 kB   1961 MB/s    current max. msize with any Linux kernel <=v5.15
> 1 MB     2551 MB/s    this msize would already violate virtio specs
> 2 MB     2521 MB/s    this msize would already violate virtio specs
> 4 MB     2628 MB/s    planned max. msize of my current kernel patches [1]

How many descriptors are used? 4 MB can be covered by a single
descriptor if the data is physically contiguous in memory, so this data
doesn't demonstrate a need for more descriptors.

> But again, this is not just about performance. My conclusion as described in
> my previous email is that virtio currently squeezes
> 
> 	"max. simultanious amount of bulk messages"
> 
> vs.
> 
> 	"max. bulk data transmission size per bulk messaage"
> 
> into the same configuration parameter, which is IMO inappropriate and hence
> splitting them into 2 separate parameters when creating a queue makes sense,
> independent of the performance benchmarks.
> 
> [1] https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyte.com/

Some devices effectively already have this because the device advertises
a maximum number of descriptors via device-specific mechanisms like the
struct virtio_blk_config seg_max field. But today these fields can only
reduce the maximum descriptor chain length because the spec still limits
the length to Queue Size.

We can build on this approach to raise the length above Queue Size. This
approach has the advantage that the maximum number of segments isn't per
device or per virtqueue, it's fine-grained. If the device supports two
request types then different max descriptor chain limits could be given
for them by introducing two separate configuration space fields.

Here are the corresponding spec changes:

1. A new feature bit called VIRTIO_RING_F_LARGE_INDIRECT_DESC is added
   to indicate that indirect descriptor table size and maximum
   descriptor chain length are not limited by Queue Size value. (Maybe
   there still needs to be a limit like 2^15?)

2. "2.6.5.3.1 Driver Requirements: Indirect Descriptors" is updated to
   say that VIRTIO_RING_F_LARGE_INDIRECT_DESC overrides the maximum
   descriptor chain length.

3. A new configuration space field is added for 9p indicating the
   maximum descriptor chain length.
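
To make the 9p part a bit more concrete, here is a rough sketch of what such
a field could look like (the feature bit number, struct and field names below
are placeholders only, none of this is in the spec):

/* Placeholder sketch, not part of any spec or driver yet: a device-specific
 * config field advertising the maximum descriptor chain length, only
 * meaningful once the proposed VIRTIO_RING_F_LARGE_INDIRECT_DESC feature
 * bit has been negotiated. */
#include <linux/types.h>

#define VIRTIO_RING_F_LARGE_INDIRECT_DESC 36    /* bit number: placeholder */

struct virtio_9p_config_ext {
        __le32 max_desc_chain_len;      /* e.g. 32768 descriptors */
};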

One thing that's messy is that we've been discussing the maximum
descriptor chain length but 9p has the "msize" concept, which isn't
aware of contiguous memory. It may be necessary to extend the 9p driver
code to size requests not just according to their length in bytes but
also according to the descriptor chain length. That's how the Linux
block layer deals with queue limits (struct queue_limits max_segments vs
max_hw_sectors).
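
As a rough sketch of that idea (nothing of this exists in the 9p driver, the
names are invented), a request would be clamped by both limits:

/* Sketch only: limit a request both by a byte budget (msize) and by the
 * number of descriptors it would consume, assuming the current
 * one-descriptor-per-page splitting done by pack_sg_list(). */
static size_t p9_clamp_request_size(size_t bytes, size_t max_bytes,
                                    unsigned int max_segments)
{
        size_t seg_bytes = (size_t)max_segments * PAGE_SIZE;
        size_t cap = max_bytes < seg_bytes ? max_bytes : seg_bytes;

        return bytes < cap ? bytes : cap;
}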

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
  2021-10-28  9:00                     ` [Virtio-fs] " Stefan Hajnoczi
@ 2021-11-01 20:29                       ` Christian Schoenebeck
  -1 siblings, 0 replies; 97+ messages in thread
From: Christian Schoenebeck @ 2021-11-01 20:29 UTC (permalink / raw)
  To: qemu-devel
  Cc: Stefan Hajnoczi, Kevin Wolf, Laurent Vivier, qemu-block,
	Michael S. Tsirkin, Jason Wang, Amit Shah, David Hildenbrand,
	Greg Kurz, virtio-fs, Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Paolo Bonzini, Marc-André Lureau, Fam Zheng,
	Raphael Norwitz, Dr. David Alan Gilbert

On Donnerstag, 28. Oktober 2021 11:00:48 CET Stefan Hajnoczi wrote:
> On Mon, Oct 25, 2021 at 05:03:25PM +0200, Christian Schoenebeck wrote:
> > On Montag, 25. Oktober 2021 12:30:41 CEST Stefan Hajnoczi wrote:
> > > On Thu, Oct 21, 2021 at 05:39:28PM +0200, Christian Schoenebeck wrote:
> > > > On Freitag, 8. Oktober 2021 18:08:48 CEST Christian Schoenebeck wrote:
> > > > > On Freitag, 8. Oktober 2021 16:24:42 CEST Christian Schoenebeck wrote:
> > > > > > On Freitag, 8. Oktober 2021 09:25:33 CEST Greg Kurz wrote:
> > > > > > > On Thu, 7 Oct 2021 16:42:49 +0100
> > > > > > > 
> > > > > > > Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > > > On Thu, Oct 07, 2021 at 02:51:55PM +0200, Christian Schoenebeck wrote:
> > > > > > > > > On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan Hajnoczi wrote:
> > > > > > > > > > On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian
> > > > > > > > > > Schoenebeck
> > > > > 
> > > > > wrote:
> > > > > > > > > > > At the moment the maximum transfer size with virtio is
> > > > > > > > > > > limited
> > > > > > > > > > > to
> > > > > > > > > > > 4M
> > > > > > > > > > > (1024 * PAGE_SIZE). This series raises this limit to its
> > > > > > > > > > > maximum
> > > > > > > > > > > theoretical possible transfer size of 128M (32k pages)
> > > > > > > > > > > according
> > > > > > > > > > > to
> > > > > > > > > > > the
> > > > > > > > > > > virtio specs:
> > > > > > > > > > > 
> > > > > > > > > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virt
> > > > > > > > > > > io-v
> > > > > > > > > > > 1.1-
> > > > > > > > > > > cs
> > > > > > > > > > > 01
> > > > > > > > > > > .html#
> > > > > > > > > > > x1-240006
> > > > > > > > > > 
> > > > > > > > > > Hi Christian,
> > > > > > > 
> > > > > > > > > > I took a quick look at the code:
> > > > > > > Hi,
> > > > > > > 
> > > > > > > Thanks Stefan for sharing virtio expertise and helping Christian
> > > > > > > !
> > > > > > > 
> > > > > > > > > > - The Linux 9p driver restricts descriptor chains to 128
> > > > > > > > > > elements
> > > > > > > > > > 
> > > > > > > > > >   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)
> > > > > > > > > 
> > > > > > > > > Yes, that's the limitation that I am about to remove (WIP);
> > > > > > > > > current
> > > > > > > > > kernel
> > > > > > > > > patches:
> > > > > > > > > https://lore.kernel.org/netdev/cover.1632327421.git.linux_os
> > > > > > > > > s@cr
> > > > > > > > > udeb
> > > > > > > > > yt
> > > > > > > > > e.
> > > > > > > > > com/>
> > > > > > > > 
> > > > > > > > I haven't read the patches yet but I'm concerned that today
> > > > > > > > the
> > > > > > > > driver
> > > > > > > > is pretty well-behaved and this new patch series introduces a
> > > > > > > > spec
> > > > > > > > violation. Not fixing existing spec violations is okay, but
> > > > > > > > adding
> > > > > > > > new
> > > > > > > > ones is a red flag. I think we need to figure out a clean
> > > > > > > > solution.
> > > > > > 
> > > > > > Nobody has reviewed the kernel patches yet. My main concern
> > > > > > therefore
> > > > > > actually is that the kernel patches are already too complex,
> > > > > > because
> > > > > > the
> > > > > > current situation is that only Dominique is handling 9p patches on
> > > > > > kernel
> > > > > > side, and he barely has time for 9p anymore.
> > > > > > 
> > > > > > Another reason for me to catch up on reading current kernel code
> > > > > > and
> > > > > > stepping in as reviewer of 9p on kernel side ASAP, independent of
> > > > > > this
> > > > > > issue.
> > > > > > 
> > > > > > As for current kernel patches' complexity: I can certainly drop
> > > > > > patch
> > > > > > 7
> > > > > > entirely as it is probably just overkill. Patch 4 is then the
> > > > > > biggest
> > > > > > chunk, I have to see if I can simplify it, and whether it would
> > > > > > make
> > > > > > sense to squash with patch 3.
> > > > > > 
> > > > > > > > > > - The QEMU 9pfs code passes iovecs directly to preadv(2)
> > > > > > > > > > and
> > > > > > > > > > will
> > > > > > > > > > fail
> > > > > > > > > > 
> > > > > > > > > >   with EINVAL when called with more than IOV_MAX iovecs
> > > > > > > > > >   (hw/9pfs/9p.c:v9fs_read())
> > > > > > > > > 
> > > > > > > > > Hmm, which makes me wonder why I never encountered this
> > > > > > > > > error
> > > > > > > > > during
> > > > > > > > > testing.
> > > > > > > > > 
> > > > > > > > > Most people will use the 9p qemu 'local' fs driver backend
> > > > > > > > > in
> > > > > > > > > practice,
> > > > > > > > > so
> > > > > > > > > that v9fs_read() call would translate for most people to
> > > > > > > > > this
> > > > > > > > > implementation on QEMU side (hw/9p/9p-local.c):
> > > > > > > > > 
> > > > > > > > > static ssize_t local_preadv(FsContext *ctx, V9fsFidOpenState
> > > > > > > > > *fs,
> > > > > > > > > 
> > > > > > > > >                             const struct iovec *iov,
> > > > > > > > >                             int iovcnt, off_t offset)
> > > > > > > > > 
> > > > > > > > > {
> > > > > > > > > #ifdef CONFIG_PREADV
> > > > > > > > > 
> > > > > > > > >     return preadv(fs->fd, iov, iovcnt, offset);
> > > > > > > > > 
> > > > > > > > > #else
> > > > > > > > > 
> > > > > > > > >     int err = lseek(fs->fd, offset, SEEK_SET);
> > > > > > > > >     if (err == -1) {
> > > > > > > > >     
> > > > > > > > >         return err;
> > > > > > > > >     
> > > > > > > > >     } else {
> > > > > > > > >     
> > > > > > > > >         return readv(fs->fd, iov, iovcnt);
> > > > > > > > >     
> > > > > > > > >     }
> > > > > > > > > 
> > > > > > > > > #endif
> > > > > > > > > }
> > > > > > > > > 
> > > > > > > > > > Unless I misunderstood the code, neither side can take
> > > > > > > > > > advantage
> > > > > > > > > > of
> > > > > > > > > > the
> > > > > > > > > > new 32k descriptor chain limit?
> > > > > > > > > > 
> > > > > > > > > > Thanks,
> > > > > > > > > > Stefan
> > > > > > > > > 
> > > > > > > > > I need to check that when I have some more time. One
> > > > > > > > > possible
> > > > > > > > > explanation
> > > > > > > > > might be that preadv() already has this wrapped into a loop
> > > > > > > > > in
> > > > > > > > > its
> > > > > > > > > implementation to circumvent a limit like IOV_MAX. It might
> > > > > > > > > be
> > > > > > > > > another
> > > > > > > > > "it
> > > > > > > > > works, but not portable" issue, but not sure.
> > > > > > > > > 
> > > > > > > > > There are still a bunch of other issues I have to resolve.
> > > > > > > > > If
> > > > > > > > > you
> > > > > > > > > look
> > > > > > > > > at
> > > > > > > > > net/9p/client.c on kernel side, you'll notice that it
> > > > > > > > > basically
> > > > > > > > > does
> > > > > > > > > this ATM> >
> > > > > > > > > 
> > > > > > > > >     kmalloc(msize);
> > > > > > > 
> > > > > > > Note that this is done twice : once for the T message (client
> > > > > > > request)
> > > > > > > and
> > > > > > > once for the R message (server answer). The 9p driver could
> > > > > > > adjust
> > > > > > > the
> > > > > > > size
> > > > > > > of the T message to what's really needed instead of allocating
> > > > > > > the
> > > > > > > full
> > > > > > > msize. R message size is not known though.
> > > > > > 
> > > > > > Would it make sense adding a second virtio ring, dedicated to
> > > > > > server
> > > > > > responses to solve this? IIRC 9p server already calculates
> > > > > > appropriate
> > > > > > exact sizes for each response type. So server could just push
> > > > > > space
> > > > > > that's
> > > > > > really needed for its responses.
> > > > > > 
> > > > > > > > > for every 9p request. So not only does it allocate much more
> > > > > > > > > memory
> > > > > > > > > for
> > > > > > > > > every request than actually required (i.e. say 9pfs was
> > > > > > > > > mounted
> > > > > > > > > with
> > > > > > > > > msize=8M, then a 9p request that actually would just need 1k
> > > > > > > > > would
> > > > > > > > > nevertheless allocate 8M), but also it allocates >
> > > > > > > > > PAGE_SIZE,
> > > > > > > > > which
> > > > > > > > > obviously may fail at any time.>
> > > > > > > > 
> > > > > > > > The PAGE_SIZE limitation sounds like a kmalloc() vs vmalloc()
> > > > > > > > situation.
> > > > > > 
> > > > > > Hu, I didn't even consider vmalloc(). I just tried the kvmalloc()
> > > > > > wrapper
> > > > > > as a quick & dirty test, but it crashed in the same way as
> > > > > > kmalloc()
> > > > > > with
> > > > > > large msize values immediately on mounting:
> > > > > > 
> > > > > > diff --git a/net/9p/client.c b/net/9p/client.c
> > > > > > index a75034fa249b..cfe300a4b6ca 100644
> > > > > > --- a/net/9p/client.c
> > > > > > +++ b/net/9p/client.c
> > > > > > @@ -227,15 +227,18 @@ static int parse_opts(char *opts, struct
> > > > > > p9_client
> > > > > > *clnt)
> > > > > > 
> > > > > >  static int p9_fcall_init(struct p9_client *c, struct p9_fcall
> > > > > >  *fc,
> > > > > >  
> > > > > >                          int alloc_msize)
> > > > > >  
> > > > > >  {
> > > > > > 
> > > > > > -       if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> > > > > > +       //if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> > > > > > +       if (false) {
> > > > > > 
> > > > > >                 fc->sdata = kmem_cache_alloc(c->fcall_cache,
> > > > > >                 GFP_NOFS);
> > > > > >                 fc->cache = c->fcall_cache;
> > > > > >         
> > > > > >         } else {
> > > > > > 
> > > > > > -               fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
> > > > > > +               fc->sdata = kvmalloc(alloc_msize, GFP_NOFS);
> > > > > 
> > > > > Ok, GFP_NOFS -> GFP_KERNEL did the trick.
> > > > > 
> > > > > Now I get:
> > > > >    virtio: bogus descriptor or out of resources
> > > > > 
> > > > > So, still some work ahead on both ends.
> > > > 
> > > > Few hacks later (only changes on 9p client side) I got this running
> > > > stable
> > > > now. The reason for the virtio error above was that kvmalloc() returns
> > > > a
> > > > non-logical kernel address for any kvmalloc(>4M), i.e. an address that
> > > > is
> > > > inaccessible from host side, hence that "bogus descriptor" message by
> > > > QEMU.
> > > > So I had to split those linear 9p client buffers into sparse ones (set
> > > > of
> > > > individual pages).
> > > > 
> > > > I tested this for some days with various virtio transmission sizes and
> > > > it
> > > > works as expected up to 128 MB (more precisely: 128 MB read space +
> > > > 128 MB
> > > > write space per virtio round trip message).
> > > > 
> > > > I did not encounter a show stopper for large virtio transmission sizes
> > > > (4 MB ... 128 MB) on virtio level, neither as a result of testing, nor
> > > > after reviewing the existing code.
> > > > 
> > > > About IOV_MAX: that's apparently not an issue on virtio level. Most of
> > > > the
> > > > iovec code, both on Linux kernel side and on QEMU side do not have
> > > > this
> > > > limitation. It is apparently however indeed a limitation for userland
> > > > apps
> > > > calling the Linux kernel's syscalls yet.
> > > > 
> > > > Stefan, as it stands now, I am even more convinced that the upper
> > > > virtio
> > > > transmission size limit should not be squeezed into the queue size
> > > > argument of virtio_add_queue(). Not because of the previous argument
> > > > that
> > > > it would waste space (~1MB), but rather because they are two different
> > > > things. To outline this, just a quick recap of what happens exactly
> > > > when
> > > > a bulk message is pushed over the virtio wire (assuming virtio "split"
> > > > layout here):
> > > > 
> > > > ---------- [recap-start] ----------
> > > > 
> > > > For each bulk message sent guest <-> host, exactly *one* of the
> > > > pre-allocated descriptors is taken and placed (subsequently) into
> > > > exactly
> > > > *one* position of the two available/used ring buffers. The actual
> > > > descriptor table though, containing all the DMA addresses of the
> > > > message
> > > > bulk data, is allocated just in time for each round trip message. Say,
> > > > it
> > > > is the first message sent, it yields in the following structure:
> > > > 
> > > > Ring Buffer   Descriptor Table      Bulk Data Pages
> > > > 
> > > >    +-+              +-+           +-----------------+
> > > >    
> > > >    |D|------------->|d|---------->| Bulk data block |
> > > >    
> > > >    +-+              |d|--------+  +-----------------+
> > > >    
> > > >    | |              |d|------+ |
> > > >    
> > > >    +-+               .       | |  +-----------------+
> > > >    
> > > >    | |               .       | +->| Bulk data block |
> > > >     
> > > >     .                .       |    +-----------------+
> > > >     .               |d|-+    |
> > > >     .               +-+ |    |    +-----------------+
> > > >     
> > > >    | |                  |    +--->| Bulk data block |
> > > >    
> > > >    +-+                  |         +-----------------+
> > > >    
> > > >    | |                  |                 .
> > > >    
> > > >    +-+                  |                 .
> > > >    
> > > >                         |                 .
> > > >                         |         
> > > >                         |         +-----------------+
> > > >                         
> > > >                         +-------->| Bulk data block |
> > > >                         
> > > >                                   +-----------------+
> > > > 
> > > > Legend:
> > > > D: pre-allocated descriptor
> > > > d: just in time allocated descriptor
> > > > -->: memory pointer (DMA)
> > > > 
> > > > The bulk data blocks are allocated by the respective device driver
> > > > above
> > > > virtio subsystem level (guest side).
> > > > 
> > > > There are exactly as many descriptors pre-allocated (D) as the size of
> > > > a
> > > > ring buffer.
> > > > 
> > > > A "descriptor" is more or less just a chainable DMA memory pointer;
> > > > defined
> > > > as:
> > > > 
> > > > /* Virtio ring descriptors: 16 bytes.  These can chain together via
> > > > "next". */ struct vring_desc {
> > > > 
> > > > 	/* Address (guest-physical). */
> > > > 	__virtio64 addr;
> > > > 	/* Length. */
> > > > 	__virtio32 len;
> > > > 	/* The flags as indicated above. */
> > > > 	__virtio16 flags;
> > > > 	/* We chain unused descriptors via this, too */
> > > > 	__virtio16 next;
> > > > 
> > > > };
> > > > 
> > > > There are 2 ring buffers; the "available" ring buffer is for sending a
> > > > message guest->host (which will transmit DMA addresses of guest
> > > > allocated
> > > > bulk data blocks that are used for data sent to device, and separate
> > > > guest allocated bulk data blocks that will be used by host side to
> > > > place
> > > > its response bulk data), and the "used" ring buffer is for sending
> > > > host->guest to let guest know about host's response and that it could
> > > > now
> > > > safely consume and then deallocate the bulk data blocks subsequently.
> > > > 
> > > > ---------- [recap-end] ----------
> > > > 
> > > > So the "queue size" actually defines the ringbuffer size. It does not
> > > > define the maximum amount of descriptors. The "queue size" rather
> > > > defines
> > > > how many pending messages can be pushed into either one ringbuffer
> > > > before
> > > > the other side would need to wait until the counter side would step up
> > > > (i.e. ring buffer full).
> > > > 
> > > > The maximum amount of descriptors (what VIRTQUEUE_MAX_SIZE actually
> > > > is)
> > > > OTOH defines the max. bulk data size that could be transmitted with
> > > > each
> > > > virtio round trip message.
> > > > 
> > > > And in fact, 9p currently handles the virtio "queue size" as directly
> > > > associative with its maximum amount of active 9p requests the server
> > > > could
> > > > 
> > > > handle simultaniously:
> > > >   hw/9pfs/9p.h:#define MAX_REQ         128
> > > >   hw/9pfs/9p.h:    V9fsPDU pdus[MAX_REQ];
> > > >   hw/9pfs/virtio-9p-device.c:    v->vq = virtio_add_queue(vdev,
> > > >   MAX_REQ,
> > > >   
> > > >                                  handle_9p_output);
> > > > 
> > > > So if I would change it like this, just for the purpose to increase
> > > > the
> > > > max. virtio transmission size:
> > > > 
> > > > --- a/hw/9pfs/virtio-9p-device.c
> > > > +++ b/hw/9pfs/virtio-9p-device.c
> > > > @@ -218,7 +218,7 @@ static void virtio_9p_device_realize(DeviceState
> > > > *dev,
> > > > Error **errp)>
> > > > 
> > > >      v->config_size = sizeof(struct virtio_9p_config) +
> > > >      strlen(s->fsconf.tag);
> > > >      virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
> > > >      
> > > >                  VIRTQUEUE_MAX_SIZE);
> > > > 
> > > > -    v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
> > > > +    v->vq = virtio_add_queue(vdev, 32*1024, handle_9p_output);
> > > > 
> > > >  }
> > > > 
> > > > Then it would require additional synchronization code on both ends and
> > > > therefore unnecessary complexity, because it would now be possible
> > > > that
> > > > more requests are pushed into the ringbuffer than server could handle.
> > > > 
> > > > There is one potential issue though that probably did justify the
> > > > "don't
> > > > exceed the queue size" rule:
> > > > 
> > > > ATM the descriptor table is allocated (just in time) as *one*
> > > > continuous
> > > > buffer via kmalloc_array():
> > > > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086
> > > > f7c7
> > > > d33a4/drivers/virtio/virtio_ring.c#L440
> > > > 
> > > > So assuming transmission size of 2 * 128 MB that kmalloc_array() call
> > > > would
> > > > yield in kmalloc(1M) and the latter might fail if guest had highly
> > > > fragmented physical memory. For such kind of error case there is
> > > > currently a fallback path in virtqueue_add_split() that would then use
> > > > the required amount of pre-allocated descriptors instead:
> > > > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086
> > > > f7c7
> > > > d33a4/drivers/virtio/virtio_ring.c#L525
> > > > 
> > > > That fallback recovery path would no longer be viable if the queue
> > > > size
> > > > was
> > > > exceeded. There would be alternatives though, e.g. by allowing to
> > > > chain
> > > > indirect descriptor tables (currently prohibited by the virtio specs).
> > > 
> > > Making the maximum number of descriptors independent of the queue size
> > > requires a change to the VIRTIO spec since the two values are currently
> > > explicitly tied together by the spec.
> > 
> > Yes, that's what the virtio specs say. But they don't say why, nor did I
> > hear a reason in this dicussion.
> > 
> > That's why I invested time reviewing current virtio implementation and
> > specs, as well as actually testing exceeding that limit. And as I
> > outlined in detail in my previous email, I only found one theoretical
> > issue that could be addressed though.
> 
> I agree that there is a limitation in the VIRTIO spec, but violating the
> spec isn't an acceptable solution:
> 
> 1. QEMU and Linux aren't the only components that implement VIRTIO. You
>    cannot make assumptions about their implementations because it may
>    break spec-compliant implementations that you haven't looked at.
> 
>    Your patches weren't able to increase Queue Size because some device
>    implementations break when descriptor chains are too long. This shows
>    there is a practical issue even in QEMU.
> 
> 2. The specific spec violation that we discussed creates the problem
>    that drivers can no longer determine the maximum description chain
>    length. This in turn will lead to more implementation-specific
>    assumptions being baked into drivers and cause problems with
>    interoperability and future changes.
> 
> The spec needs to be extended instead. I included an idea for how to do
> that below.

Sure, I just wanted to see whether there was a non-negligible "hard" show stopper
per se that I probably had not seen yet. I have not questioned aiming for a clean
solution.

Thanks for the clarification!

> > > Before doing that, are there benchmark results showing that 1 MB vs 128
> > > MB produces a performance improvement? I'm asking because if performance
> > > with 1 MB is good then you can probably do that without having to change
> > > VIRTIO and also because it's counter-intuitive that 9p needs 128 MB for
> > > good performance when it's ultimately implemented on top of disk and
> > > network I/O that have lower size limits.
> > 
> > First some numbers, linear reading a 12 GB file:
> > 
> > msize    average      notes
> > 
> > 8 kB     52.0 MB/s    default msize of Linux kernel <v5.15
> > 128 kB   624.8 MB/s   default msize of Linux kernel >=v5.15
> > 512 kB   1961 MB/s    current max. msize with any Linux kernel <=v5.15
> > 1 MB     2551 MB/s    this msize would already violate virtio specs
> > 2 MB     2521 MB/s    this msize would already violate virtio specs
> > 4 MB     2628 MB/s    planned max. msize of my current kernel patches [1]
> 
> How many descriptors are used? 4 MB can be covered by a single
> descriptor if the data is physically contiguous in memory, so this data
> doesn't demonstrate a need for more descriptors.

No, in the last couple of years there was apparently no kernel version that used
just one descriptor, nor did my benchmarked version. Even though the Linux 9p
client still uses simple linear buffers (contiguous physical memory) at the 9p
client level, these are split into PAGE_SIZE chunks by the function
pack_sg_list() [1] before being fed to the virtio level:

static unsigned int rest_of_page(void *data)
{
	return PAGE_SIZE - offset_in_page(data);
}
...
static int pack_sg_list(struct scatterlist *sg, int start,
			int limit, char *data, int count)
{
	int s;
	int index = start;

	while (count) {
		s = rest_of_page(data);
		...
		sg_set_buf(&sg[index++], data, s);
		count -= s;
		data += s;
	}
	...
}

[1] https://github.com/torvalds/linux/blob/19901165d90fdca1e57c9baa0d5b4c63d15c476a/net/9p/trans_virtio.c#L171

So when sending 4 MB over the virtio wire, it currently yields 1k descriptors.
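
Just to spell out that arithmetic (assuming the usual 4 KiB page size and the
one-sg-entry-per-page splitting above):

  4 MB   / 4 KiB per descriptor = 1024 descriptors  (~1k)
  128 MB / 4 KiB per descriptor = 32768 descriptors (32k, the proposed max.)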

I have wondered about this before, but did not question it, because due to
virtio's cross-platform nature I couldn't say for certain whether that splitting
is needed somewhere. For the virtio-PCI case I know for sure that one
descriptor (i.e. >PAGE_SIZE) would be fine, but I don't know whether that applies
to all buses and architectures.

> > But again, this is not just about performance. My conclusion as described
> > in my previous email is that virtio currently squeezes
> > 
> > 	"max. simultanious amount of bulk messages"
> > 
> > vs.
> > 
> > 	"max. bulk data transmission size per bulk messaage"
> > 
> > into the same configuration parameter, which is IMO inappropriate and
> > hence
> > splitting them into 2 separate parameters when creating a queue makes
> > sense, independent of the performance benchmarks.
> > 
> > [1]
> > https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyte.c
> > om/
> Some devices effectively already have this because the device advertises
> a maximum number of descriptors via device-specific mechanisms like the
> struct virtio_blk_config seg_max field. But today these fields can only
> reduce the maximum descriptor chain length because the spec still limits
> the length to Queue Size.
> 
> We can build on this approach to raise the length above Queue Size. This
> approach has the advantage that the maximum number of segments isn't per
> device or per virtqueue, it's fine-grained. If the device supports two
> requests types then different max descriptor chain limits could be given
> for them by introducing two separate configuration space fields.
> 
> Here are the corresponding spec changes:
> 
> 1. A new feature bit called VIRTIO_RING_F_LARGE_INDIRECT_DESC is added
>    to indicate that indirect descriptor table size and maximum
>    descriptor chain length are not limited by Queue Size value. (Maybe
>    there still needs to be a limit like 2^15?)

Sounds good to me!

AFAIK it is effectively limited to 2^16 because of vring_desc->next:

/* Virtio ring descriptors: 16 bytes.  These can chain together via "next". */
struct vring_desc {
        /* Address (guest-physical). */
        __virtio64 addr;
        /* Length. */
        __virtio32 len;
        /* The flags as indicated above. */
        __virtio16 flags;
        /* We chain unused descriptors via this, too */
        __virtio16 next;
};

At least unless either chained indirect descriptor tables or nested indirect
descriptor tables were allowed as well, both of which are prohibited by the
specs ATM. I'm not saying that this would be needed anytime soon. :)
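
Just for scale (assuming 4 KiB pages and the current one-page-per-descriptor
splitting): 2^15 descriptors cover 32768 * 4 KiB = 128 MB per direction, which
is where the 128 MB figure of this series comes from, while 2^16 would cover
256 MB.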

> 2. "2.6.5.3.1 Driver Requirements: Indirect Descriptors" is updated to
>    say that VIRTIO_RING_F_LARGE_INDIRECT_DESC overrides the maximum
>    descriptor chain length.

OK

> 2. A new configuration space field is added for 9p indicating the
>    maximum descriptor chain length.

So in addition to the VIRTIO_RING_F_LARGE_INDIRECT_DESC feature bit there would
also be a numeric field that specifies the exact max. chain length. Sure,
why not.

> One thing that's messy is that we've been discussing the maximum
> descriptor chain length but 9p has the "msize" concept, which isn't
> aware of contiguous memory. It may be necessary to extend the 9p driver
> code to size requests not just according to their length in bytes but
> also according to the descriptor chain length. That's how the Linux
> block layer deals with queue limits (struct queue_limits max_segments vs
> max_hw_sectors).

Hmm, I can't follow you on that one. What would that be needed for in the case
of 9p? My plan was to have the 9p client simply limit msize at session start to
whatever the max. number of virtio descriptors supported by the host is, using
PAGE_SIZE as the size per descriptor, because that's what the 9p client actually
does ATM (see above). So you think that should be changed to e.g. just one
descriptor for 4 MB, right?
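
To put my plan above into rough code (just a sketch; the config struct and the
feature bit below are placeholders taken from this discussion and don't exist
anywhere yet):

/* Sketch only: at session start, derive the max. msize from a hypothetical
 * device config field advertising the max. descriptor chain length, with
 * PAGE_SIZE per descriptor as the 9p client currently splits its buffers. */
struct virtio_9p_config_ext {                /* placeholder, not in the spec */
        __u32 max_desc_chain_len;
};

static unsigned int p9_virtio_max_msize(struct virtio_device *vdev)
{
        u32 chain_len = 128;                 /* conservative VIRTQUEUE_NUM default */

        if (virtio_has_feature(vdev, VIRTIO_RING_F_LARGE_INDIRECT_DESC))
                virtio_cread(vdev, struct virtio_9p_config_ext,
                             max_desc_chain_len, &chain_len);

        return chain_len * PAGE_SIZE;
}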

Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Virtio-fs] [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
@ 2021-11-01 20:29                       ` Christian Schoenebeck
  0 siblings, 0 replies; 97+ messages in thread
From: Christian Schoenebeck @ 2021-11-01 20:29 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	Jason Wang, Amit Shah, David Hildenbrand, Raphael Norwitz,
	virtio-fs, Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Marc-André Lureau, Paolo Bonzini, Fam Zheng

On Donnerstag, 28. Oktober 2021 11:00:48 CET Stefan Hajnoczi wrote:
> On Mon, Oct 25, 2021 at 05:03:25PM +0200, Christian Schoenebeck wrote:
> > On Montag, 25. Oktober 2021 12:30:41 CEST Stefan Hajnoczi wrote:
> > > On Thu, Oct 21, 2021 at 05:39:28PM +0200, Christian Schoenebeck wrote:
> > > > On Freitag, 8. Oktober 2021 18:08:48 CEST Christian Schoenebeck wrote:
> > > > > On Freitag, 8. Oktober 2021 16:24:42 CEST Christian Schoenebeck wrote:
> > > > > > On Freitag, 8. Oktober 2021 09:25:33 CEST Greg Kurz wrote:
> > > > > > > On Thu, 7 Oct 2021 16:42:49 +0100
> > > > > > > 
> > > > > > > Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > > > On Thu, Oct 07, 2021 at 02:51:55PM +0200, Christian Schoenebeck wrote:
> > > > > > > > > On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan Hajnoczi wrote:
> > > > > > > > > > On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian
> > > > > > > > > > Schoenebeck
> > > > > 
> > > > > wrote:
> > > > > > > > > > > At the moment the maximum transfer size with virtio is
> > > > > > > > > > > limited
> > > > > > > > > > > to
> > > > > > > > > > > 4M
> > > > > > > > > > > (1024 * PAGE_SIZE). This series raises this limit to its
> > > > > > > > > > > maximum
> > > > > > > > > > > theoretical possible transfer size of 128M (32k pages)
> > > > > > > > > > > according
> > > > > > > > > > > to
> > > > > > > > > > > the
> > > > > > > > > > > virtio specs:
> > > > > > > > > > > 
> > > > > > > > > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virt
> > > > > > > > > > > io-v
> > > > > > > > > > > 1.1-
> > > > > > > > > > > cs
> > > > > > > > > > > 01
> > > > > > > > > > > .html#
> > > > > > > > > > > x1-240006
> > > > > > > > > > 
> > > > > > > > > > Hi Christian,
> > > > > > > 
> > > > > > > > > > I took a quick look at the code:
> > > > > > > Hi,
> > > > > > > 
> > > > > > > Thanks Stefan for sharing virtio expertise and helping Christian
> > > > > > > !
> > > > > > > 
> > > > > > > > > > - The Linux 9p driver restricts descriptor chains to 128
> > > > > > > > > > elements
> > > > > > > > > > 
> > > > > > > > > >   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)
> > > > > > > > > 
> > > > > > > > > Yes, that's the limitation that I am about to remove (WIP);
> > > > > > > > > current
> > > > > > > > > kernel
> > > > > > > > > patches:
> > > > > > > > > https://lore.kernel.org/netdev/cover.1632327421.git.linux_os
> > > > > > > > > s@cr
> > > > > > > > > udeb
> > > > > > > > > yt
> > > > > > > > > e.
> > > > > > > > > com/>
> > > > > > > > 
> > > > > > > > I haven't read the patches yet but I'm concerned that today
> > > > > > > > the
> > > > > > > > driver
> > > > > > > > is pretty well-behaved and this new patch series introduces a
> > > > > > > > spec
> > > > > > > > violation. Not fixing existing spec violations is okay, but
> > > > > > > > adding
> > > > > > > > new
> > > > > > > > ones is a red flag. I think we need to figure out a clean
> > > > > > > > solution.
> > > > > > 
> > > > > > Nobody has reviewed the kernel patches yet. My main concern
> > > > > > therefore
> > > > > > actually is that the kernel patches are already too complex,
> > > > > > because
> > > > > > the
> > > > > > current situation is that only Dominique is handling 9p patches on
> > > > > > kernel
> > > > > > side, and he barely has time for 9p anymore.
> > > > > > 
> > > > > > Another reason for me to catch up on reading current kernel code
> > > > > > and
> > > > > > stepping in as reviewer of 9p on kernel side ASAP, independent of
> > > > > > this
> > > > > > issue.
> > > > > > 
> > > > > > As for current kernel patches' complexity: I can certainly drop
> > > > > > patch
> > > > > > 7
> > > > > > entirely as it is probably just overkill. Patch 4 is then the
> > > > > > biggest
> > > > > > chunk, I have to see if I can simplify it, and whether it would
> > > > > > make
> > > > > > sense to squash with patch 3.
> > > > > > 
> > > > > > > > > > - The QEMU 9pfs code passes iovecs directly to preadv(2)
> > > > > > > > > > and
> > > > > > > > > > will
> > > > > > > > > > fail
> > > > > > > > > > 
> > > > > > > > > >   with EINVAL when called with more than IOV_MAX iovecs
> > > > > > > > > >   (hw/9pfs/9p.c:v9fs_read())
> > > > > > > > > 
> > > > > > > > > Hmm, which makes me wonder why I never encountered this
> > > > > > > > > error
> > > > > > > > > during
> > > > > > > > > testing.
> > > > > > > > > 
> > > > > > > > > Most people will use the 9p qemu 'local' fs driver backend
> > > > > > > > > in
> > > > > > > > > practice,
> > > > > > > > > so
> > > > > > > > > that v9fs_read() call would translate for most people to
> > > > > > > > > this
> > > > > > > > > implementation on QEMU side (hw/9p/9p-local.c):
> > > > > > > > > 
> > > > > > > > > static ssize_t local_preadv(FsContext *ctx, V9fsFidOpenState
> > > > > > > > > *fs,
> > > > > > > > > 
> > > > > > > > >                             const struct iovec *iov,
> > > > > > > > >                             int iovcnt, off_t offset)
> > > > > > > > > 
> > > > > > > > > {
> > > > > > > > > #ifdef CONFIG_PREADV
> > > > > > > > > 
> > > > > > > > >     return preadv(fs->fd, iov, iovcnt, offset);
> > > > > > > > > 
> > > > > > > > > #else
> > > > > > > > > 
> > > > > > > > >     int err = lseek(fs->fd, offset, SEEK_SET);
> > > > > > > > >     if (err == -1) {
> > > > > > > > >     
> > > > > > > > >         return err;
> > > > > > > > >     
> > > > > > > > >     } else {
> > > > > > > > >     
> > > > > > > > >         return readv(fs->fd, iov, iovcnt);
> > > > > > > > >     
> > > > > > > > >     }
> > > > > > > > > 
> > > > > > > > > #endif
> > > > > > > > > }
> > > > > > > > > 
> > > > > > > > > > Unless I misunderstood the code, neither side can take
> > > > > > > > > > advantage
> > > > > > > > > > of
> > > > > > > > > > the
> > > > > > > > > > new 32k descriptor chain limit?
> > > > > > > > > > 
> > > > > > > > > > Thanks,
> > > > > > > > > > Stefan
> > > > > > > > > 
> > > > > > > > > I need to check that when I have some more time. One
> > > > > > > > > possible
> > > > > > > > > explanation
> > > > > > > > > might be that preadv() already has this wrapped into a loop
> > > > > > > > > in
> > > > > > > > > its
> > > > > > > > > implementation to circumvent a limit like IOV_MAX. It might
> > > > > > > > > be
> > > > > > > > > another
> > > > > > > > > "it
> > > > > > > > > works, but not portable" issue, but not sure.
> > > > > > > > > 
> > > > > > > > > There are still a bunch of other issues I have to resolve.
> > > > > > > > > If
> > > > > > > > > you
> > > > > > > > > look
> > > > > > > > > at
> > > > > > > > > net/9p/client.c on kernel side, you'll notice that it
> > > > > > > > > basically
> > > > > > > > > does
> > > > > > > > > this ATM> >
> > > > > > > > > 
> > > > > > > > >     kmalloc(msize);
> > > > > > > 
> > > > > > > Note that this is done twice : once for the T message (client
> > > > > > > request)
> > > > > > > and
> > > > > > > once for the R message (server answer). The 9p driver could
> > > > > > > adjust
> > > > > > > the
> > > > > > > size
> > > > > > > of the T message to what's really needed instead of allocating
> > > > > > > the
> > > > > > > full
> > > > > > > msize. R message size is not known though.
> > > > > > 
> > > > > > Would it make sense adding a second virtio ring, dedicated to
> > > > > > server
> > > > > > responses to solve this? IIRC 9p server already calculates
> > > > > > appropriate
> > > > > > exact sizes for each response type. So server could just push
> > > > > > space
> > > > > > that's
> > > > > > really needed for its responses.
> > > > > > 
> > > > > > > > > for every 9p request. So not only does it allocate much more
> > > > > > > > > memory
> > > > > > > > > for
> > > > > > > > > every request than actually required (i.e. say 9pfs was
> > > > > > > > > mounted
> > > > > > > > > with
> > > > > > > > > msize=8M, then a 9p request that actually would just need 1k
> > > > > > > > > would
> > > > > > > > > nevertheless allocate 8M), but also it allocates >
> > > > > > > > > PAGE_SIZE,
> > > > > > > > > which
> > > > > > > > > obviously may fail at any time.>
> > > > > > > > 
> > > > > > > > The PAGE_SIZE limitation sounds like a kmalloc() vs vmalloc()
> > > > > > > > situation.
> > > > > > 
> > > > > > Hu, I didn't even consider vmalloc(). I just tried the kvmalloc()
> > > > > > wrapper
> > > > > > as a quick & dirty test, but it crashed in the same way as
> > > > > > kmalloc()
> > > > > > with
> > > > > > large msize values immediately on mounting:
> > > > > > 
> > > > > > diff --git a/net/9p/client.c b/net/9p/client.c
> > > > > > index a75034fa249b..cfe300a4b6ca 100644
> > > > > > --- a/net/9p/client.c
> > > > > > +++ b/net/9p/client.c
> > > > > > @@ -227,15 +227,18 @@ static int parse_opts(char *opts, struct
> > > > > > p9_client
> > > > > > *clnt)
> > > > > > 
> > > > > >  static int p9_fcall_init(struct p9_client *c, struct p9_fcall
> > > > > >  *fc,
> > > > > >  
> > > > > >                          int alloc_msize)
> > > > > >  
> > > > > >  {
> > > > > > 
> > > > > > -       if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> > > > > > +       //if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> > > > > > +       if (false) {
> > > > > > 
> > > > > >                 fc->sdata = kmem_cache_alloc(c->fcall_cache,
> > > > > >                 GFP_NOFS);
> > > > > >                 fc->cache = c->fcall_cache;
> > > > > >         
> > > > > >         } else {
> > > > > > 
> > > > > > -               fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
> > > > > > +               fc->sdata = kvmalloc(alloc_msize, GFP_NOFS);
> > > > > 
> > > > > Ok, GFP_NOFS -> GFP_KERNEL did the trick.
> > > > > 
> > > > > Now I get:
> > > > >    virtio: bogus descriptor or out of resources
> > > > > 
> > > > > So, still some work ahead on both ends.
> > > > 
> > > > Few hacks later (only changes on 9p client side) I got this running
> > > > stable
> > > > now. The reason for the virtio error above was that kvmalloc() returns
> > > > a
> > > > non-logical kernel address for any kvmalloc(>4M), i.e. an address that
> > > > is
> > > > inaccessible from host side, hence that "bogus descriptor" message by
> > > > QEMU.
> > > > So I had to split those linear 9p client buffers into sparse ones (set
> > > > of
> > > > individual pages).
> > > > 
> > > > I tested this for some days with various virtio transmission sizes and
> > > > it
> > > > works as expected up to 128 MB (more precisely: 128 MB read space +
> > > > 128 MB
> > > > write space per virtio round trip message).
> > > > 
> > > > I did not encounter a show stopper for large virtio transmission sizes
> > > > (4 MB ... 128 MB) on virtio level, neither as a result of testing, nor
> > > > after reviewing the existing code.
> > > > 
> > > > About IOV_MAX: that's apparently not an issue on virtio level. Most of
> > > > the
> > > > iovec code, both on Linux kernel side and on QEMU side do not have
> > > > this
> > > > limitation. It is apparently however indeed a limitation for userland
> > > > apps
> > > > calling the Linux kernel's syscalls yet.
> > > > 
> > > > Stefan, as it stands now, I am even more convinced that the upper
> > > > virtio
> > > > transmission size limit should not be squeezed into the queue size
> > > > argument of virtio_add_queue(). Not because of the previous argument
> > > > that
> > > > it would waste space (~1MB), but rather because they are two different
> > > > things. To outline this, just a quick recap of what happens exactly
> > > > when
> > > > a bulk message is pushed over the virtio wire (assuming virtio "split"
> > > > layout here):
> > > > 
> > > > ---------- [recap-start] ----------
> > > > 
> > > > For each bulk message sent guest <-> host, exactly *one* of the
> > > > pre-allocated descriptors is taken and placed (subsequently) into
> > > > exactly
> > > > *one* position of the two available/used ring buffers. The actual
> > > > descriptor table though, containing all the DMA addresses of the
> > > > message
> > > > bulk data, is allocated just in time for each round trip message. Say,
> > > > it
> > > > is the first message sent, it yields in the following structure:
> > > > 
> > > > Ring Buffer   Descriptor Table      Bulk Data Pages
> > > > 
> > > >    +-+              +-+           +-----------------+
> > > >    
> > > >    |D|------------->|d|---------->| Bulk data block |
> > > >    
> > > >    +-+              |d|--------+  +-----------------+
> > > >    
> > > >    | |              |d|------+ |
> > > >    
> > > >    +-+               .       | |  +-----------------+
> > > >    
> > > >    | |               .       | +->| Bulk data block |
> > > >     
> > > >     .                .       |    +-----------------+
> > > >     .               |d|-+    |
> > > >     .               +-+ |    |    +-----------------+
> > > >     
> > > >    | |                  |    +--->| Bulk data block |
> > > >    
> > > >    +-+                  |         +-----------------+
> > > >    
> > > >    | |                  |                 .
> > > >    
> > > >    +-+                  |                 .
> > > >    
> > > >                         |                 .
> > > >                         |         
> > > >                         |         +-----------------+
> > > >                         
> > > >                         +-------->| Bulk data block |
> > > >                         
> > > >                                   +-----------------+
> > > > 
> > > > Legend:
> > > > D: pre-allocated descriptor
> > > > d: just in time allocated descriptor
> > > > -->: memory pointer (DMA)
> > > > 
> > > > The bulk data blocks are allocated by the respective device driver
> > > > above
> > > > virtio subsystem level (guest side).
> > > > 
> > > > There are exactly as many descriptors pre-allocated (D) as the size of
> > > > a
> > > > ring buffer.
> > > > 
> > > > A "descriptor" is more or less just a chainable DMA memory pointer;
> > > > defined
> > > > as:
> > > > 
> > > > /* Virtio ring descriptors: 16 bytes.  These can chain together via
> > > > "next". */ struct vring_desc {
> > > > 
> > > > 	/* Address (guest-physical). */
> > > > 	__virtio64 addr;
> > > > 	/* Length. */
> > > > 	__virtio32 len;
> > > > 	/* The flags as indicated above. */
> > > > 	__virtio16 flags;
> > > > 	/* We chain unused descriptors via this, too */
> > > > 	__virtio16 next;
> > > > 
> > > > };
> > > > 
> > > > There are 2 ring buffers; the "available" ring buffer is for sending a
> > > > message guest->host (which will transmit DMA addresses of guest
> > > > allocated
> > > > bulk data blocks that are used for data sent to device, and separate
> > > > guest allocated bulk data blocks that will be used by host side to
> > > > place
> > > > its response bulk data), and the "used" ring buffer is for sending
> > > > host->guest to let guest know about host's response and that it could
> > > > now
> > > > safely consume and then deallocate the bulk data blocks subsequently.
> > > > 
> > > > ---------- [recap-end] ----------
> > > > 
> > > > So the "queue size" actually defines the ringbuffer size. It does not
> > > > define the maximum amount of descriptors. The "queue size" rather
> > > > defines
> > > > how many pending messages can be pushed into either one ringbuffer
> > > > before
> > > > the other side would need to wait until the counter side would step up
> > > > (i.e. ring buffer full).
> > > > 
> > > > The maximum amount of descriptors (what VIRTQUEUE_MAX_SIZE actually
> > > > is)
> > > > OTOH defines the max. bulk data size that could be transmitted with
> > > > each
> > > > virtio round trip message.
> > > > 
> > > > And in fact, 9p currently handles the virtio "queue size" as directly
> > > > associative with its maximum amount of active 9p requests the server
> > > > could
> > > > 
> > > > handle simultaniously:
> > > >   hw/9pfs/9p.h:#define MAX_REQ         128
> > > >   hw/9pfs/9p.h:    V9fsPDU pdus[MAX_REQ];
> > > >   hw/9pfs/virtio-9p-device.c:    v->vq = virtio_add_queue(vdev,
> > > >   MAX_REQ,
> > > >   
> > > >                                  handle_9p_output);
> > > > 
> > > > So if I would change it like this, just for the purpose to increase
> > > > the
> > > > max. virtio transmission size:
> > > > 
> > > > --- a/hw/9pfs/virtio-9p-device.c
> > > > +++ b/hw/9pfs/virtio-9p-device.c
> > > > @@ -218,7 +218,7 @@ static void virtio_9p_device_realize(DeviceState
> > > > *dev,
> > > > Error **errp)>
> > > > 
> > > >      v->config_size = sizeof(struct virtio_9p_config) +
> > > >      strlen(s->fsconf.tag);
> > > >      virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
> > > >      
> > > >                  VIRTQUEUE_MAX_SIZE);
> > > > 
> > > > -    v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
> > > > +    v->vq = virtio_add_queue(vdev, 32*1024, handle_9p_output);
> > > > 
> > > >  }
> > > > 
> > > > Then it would require additional synchronization code on both ends and
> > > > therefore unnecessary complexity, because it would now be possible
> > > > that
> > > > more requests are pushed into the ringbuffer than server could handle.
> > > > 
> > > > There is one potential issue though that probably did justify the
> > > > "don't
> > > > exceed the queue size" rule:
> > > > 
> > > > ATM the descriptor table is allocated (just in time) as *one*
> > > > continuous
> > > > buffer via kmalloc_array():
> > > > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086
> > > > f7c7
> > > > d33a4/drivers/virtio/virtio_ring.c#L440
> > > > 
> > > > So assuming transmission size of 2 * 128 MB that kmalloc_array() call
> > > > would
> > > > yield in kmalloc(1M) and the latter might fail if guest had highly
> > > > fragmented physical memory. For such kind of error case there is
> > > > currently a fallback path in virtqueue_add_split() that would then use
> > > > the required amount of pre-allocated descriptors instead:
> > > > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086
> > > > f7c7
> > > > d33a4/drivers/virtio/virtio_ring.c#L525
> > > > 
> > > > That fallback recovery path would no longer be viable if the queue
> > > > size
> > > > was
> > > > exceeded. There would be alternatives though, e.g. by allowing to
> > > > chain
> > > > indirect descriptor tables (currently prohibited by the virtio specs).
> > > 
> > > Making the maximum number of descriptors independent of the queue size
> > > requires a change to the VIRTIO spec since the two values are currently
> > > explicitly tied together by the spec.
> > 
> > Yes, that's what the virtio specs say. But they don't say why, nor did I
> > hear a reason in this dicussion.
> > 
> > That's why I invested time reviewing current virtio implementation and
> > specs, as well as actually testing exceeding that limit. And as I
> > outlined in detail in my previous email, I only found one theoretical
> > issue that could be addressed though.
> 
> I agree that there is a limitation in the VIRTIO spec, but violating the
> spec isn't an acceptable solution:
> 
> 1. QEMU and Linux aren't the only components that implement VIRTIO. You
>    cannot make assumptions about their implementations because it may
>    break spec-compliant implementations that you haven't looked at.
> 
>    Your patches weren't able to increase Queue Size because some device
>    implementations break when descriptor chains are too long. This shows
>    there is a practical issue even in QEMU.
> 
> 2. The specific spec violation that we discussed creates the problem
>    that drivers can no longer determine the maximum description chain
>    length. This in turn will lead to more implementation-specific
>    assumptions being baked into drivers and cause problems with
>    interoperability and future changes.
> 
> The spec needs to be extended instead. I included an idea for how to do
> that below.

Sure, I just wanted to see whether there was a non-negligible "hard" show
stopper per se that I might have missed so far. I am not questioning the aim
of a clean solution.

Thanks for the clarification!

> > > Before doing that, are there benchmark results showing that 1 MB vs 128
> > > MB produces a performance improvement? I'm asking because if performance
> > > with 1 MB is good then you can probably do that without having to change
> > > VIRTIO and also because it's counter-intuitive that 9p needs 128 MB for
> > > good performance when it's ultimately implemented on top of disk and
> > > network I/O that have lower size limits.
> > 
> > First some numbers, linear reading a 12 GB file:
> > 
> > msize    average      notes
> > 
> > 8 kB     52.0 MB/s    default msize of Linux kernel <v5.15
> > 128 kB   624.8 MB/s   default msize of Linux kernel >=v5.15
> > 512 kB   1961 MB/s    current max. msize with any Linux kernel <=v5.15
> > 1 MB     2551 MB/s    this msize would already violate virtio specs
> > 2 MB     2521 MB/s    this msize would already violate virtio specs
> > 4 MB     2628 MB/s    planned max. msize of my current kernel patches [1]
> 
> How many descriptors are used? 4 MB can be covered by a single
> descriptor if the data is physically contiguous in memory, so this data
> doesn't demonstrate a need for more descriptors.

No, in the last couple of years there was apparently no kernel version that
used just one descriptor, nor did my benchmarked version. Even though the
Linux 9p client still uses simple linear buffers (contiguous physical memory)
at the 9p client level, these are split into PAGE_SIZE chunks by the function
pack_sg_list() [1] before being handed down to the virtio level:

static unsigned int rest_of_page(void *data)
{
	return PAGE_SIZE - offset_in_page(data);
}
...
static int pack_sg_list(struct scatterlist *sg, int start,
			int limit, char *data, int count)
{
	int s;
	int index = start;

	while (count) {
		s = rest_of_page(data);
		...
		sg_set_buf(&sg[index++], data, s);
		count -= s;
		data += s;
	}
	...
}

[1] https://github.com/torvalds/linux/blob/19901165d90fdca1e57c9baa0d5b4c63d15c476a/net/9p/trans_virtio.c#L171

So sending 4 MB over the virtio wire currently yields 1k descriptors.

I have wondered about this before, but did not question it, because due to
virtio's cross-platform nature I couldn't say for certain whether that
page-splitting is needed somewhere. For virtio-PCI I know for sure that a
single descriptor larger than PAGE_SIZE would be fine, but I don't know
whether that applies to all buses and architectures.
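
Just to put numbers on the current behaviour (assuming 4 KiB pages, and
ignoring the handful of extra sg elements for the 9p message header):

      4 MiB msize  ->    4 MiB / 4 KiB =  1024 descriptors per direction
    128 MiB msize  ->  128 MiB / 4 KiB = 32768 descriptors per direction

which is where the 32k pages figure of this series comes from.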

> > But again, this is not just about performance. My conclusion as described
> > in my previous email is that virtio currently squeezes
> > 
> > 	"max. simultanious amount of bulk messages"
> > 
> > vs.
> > 
> > 	"max. bulk data transmission size per bulk messaage"
> > 
> > into the same configuration parameter, which is IMO inappropriate and
> > hence
> > splitting them into 2 separate parameters when creating a queue makes
> > sense, independent of the performance benchmarks.
> > 
> > [1]
> > https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyte.c
> > om/
> Some devices effectively already have this because the device advertises
> a maximum number of descriptors via device-specific mechanisms like the
> struct virtio_blk_config seg_max field. But today these fields can only
> reduce the maximum descriptor chain length because the spec still limits
> the length to Queue Size.
> 
> We can build on this approach to raise the length above Queue Size. This
> approach has the advantage that the maximum number of segments isn't per
> device or per virtqueue, it's fine-grained. If the device supports two
> requests types then different max descriptor chain limits could be given
> for them by introducing two separate configuration space fields.
> 
> Here are the corresponding spec changes:
> 
> 1. A new feature bit called VIRTIO_RING_F_LARGE_INDIRECT_DESC is added
>    to indicate that indirect descriptor table size and maximum
>    descriptor chain length are not limited by Queue Size value. (Maybe
>    there still needs to be a limit like 2^15?)

Sounds good to me!

AFAIK it is effectively limited to 2^16 because of vring_desc->next:

/* Virtio ring descriptors: 16 bytes.  These can chain together via "next". */
struct vring_desc {
        /* Address (guest-physical). */
        __virtio64 addr;
        /* Length. */
        __virtio32 len;
        /* The flags as indicated above. */
        __virtio16 flags;
        /* We chain unused descriptors via this, too */
        __virtio16 next;
};

At least unless either chained indirect descriptor tables or nested indirect
descriptor tables were allowed as well, both of which the spec currently
prohibits. I'm not saying that this would be needed anytime soon. :)
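
For reference, a back-of-the-envelope bound (assuming 4 KiB pages):

    2^16 descriptors * 4 KiB per descriptor = 256 MiB per indirect table

So even with the 16-bit "next" field, a single descriptor table could in
theory address twice the 128 MiB per direction targeted by this series.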

> 2. "2.6.5.3.1 Driver Requirements: Indirect Descriptors" is updated to
>    say that VIRTIO_RING_F_LARGE_INDIRECT_DESC overrides the maximum
>    descriptor chain length.

OK

> 2. A new configuration space field is added for 9p indicating the
>    maximum descriptor chain length.

So in addition to the VIRTIO_RING_F_LARGE_INDIRECT_DESC feature bit, also a
numeric configuration space field that specifies the exact max. chain length.
Sure, why not.

> One thing that's messy is that we've been discussing the maximum
> descriptor chain length but 9p has the "msize" concept, which isn't
> aware of contiguous memory. It may be necessary to extend the 9p driver
> code to size requests not just according to their length in bytes but
> also according to the descriptor chain length. That's how the Linux
> block layer deals with queue limits (struct queue_limits max_segments vs
> max_hw_sectors).

Hmm, I can't follow you on that one. Why would that be needed in the case of
9p? My plan was to have the 9p client simply limit msize at session start to
the maximum number of virtio descriptors supported by the host, with
PAGE_SIZE as the size per descriptor, because that's what the 9p client
actually does ATM (see above). So you think that should be changed to e.g.
just one descriptor for 4 MB, right?
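
Just to make that concrete, here is roughly the clamping I had in mind
(hypothetical helper, not from my current patches; it assumes one PAGE_SIZE
chunk per descriptor, as pack_sg_list() produces today):

static size_t p9_virtio_max_msize(struct virtio_chan *chan)
{
	/* vring size == max. number of descriptors we may chain per request */
	return (size_t)virtqueue_get_vring_size(chan->vq) * PAGE_SIZE;
}
...
	/* at session start, before negotiating msize with the server */
	if (clnt->msize > p9_virtio_max_msize(chan))
		clnt->msize = p9_virtio_max_msize(chan);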

Best regards,
Christian Schoenebeck



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
  2021-11-01 20:29                       ` [Virtio-fs] " Christian Schoenebeck
@ 2021-11-03 11:33                         ` Stefan Hajnoczi
  -1 siblings, 0 replies; 97+ messages in thread
From: Stefan Hajnoczi @ 2021-11-03 11:33 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	Jason Wang, Amit Shah, David Hildenbrand, qemu-devel, Greg Kurz,
	virtio-fs, Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Marc-André Lureau, Paolo Bonzini, Fam Zheng,
	Raphael Norwitz, Dr. David Alan Gilbert

[-- Attachment #1: Type: text/plain, Size: 28306 bytes --]

On Mon, Nov 01, 2021 at 09:29:26PM +0100, Christian Schoenebeck wrote:
> On Donnerstag, 28. Oktober 2021 11:00:48 CET Stefan Hajnoczi wrote:
> > On Mon, Oct 25, 2021 at 05:03:25PM +0200, Christian Schoenebeck wrote:
> > > On Montag, 25. Oktober 2021 12:30:41 CEST Stefan Hajnoczi wrote:
> > > > On Thu, Oct 21, 2021 at 05:39:28PM +0200, Christian Schoenebeck wrote:
> > > > > On Freitag, 8. Oktober 2021 18:08:48 CEST Christian Schoenebeck wrote:
> > > > > > On Freitag, 8. Oktober 2021 16:24:42 CEST Christian Schoenebeck wrote:
> > > > > > > On Freitag, 8. Oktober 2021 09:25:33 CEST Greg Kurz wrote:
> > > > > > > > On Thu, 7 Oct 2021 16:42:49 +0100
> > > > > > > > 
> > > > > > > > Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > > > > On Thu, Oct 07, 2021 at 02:51:55PM +0200, Christian Schoenebeck wrote:
> > > > > > > > > > On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan Hajnoczi wrote:
> > > > > > > > > > > On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian
> > > > > > > > > > > Schoenebeck
> > > > > > 
> > > > > > wrote:
> > > > > > > > > > > > At the moment the maximum transfer size with virtio is
> > > > > > > > > > > > limited
> > > > > > > > > > > > to
> > > > > > > > > > > > 4M
> > > > > > > > > > > > (1024 * PAGE_SIZE). This series raises this limit to its
> > > > > > > > > > > > maximum
> > > > > > > > > > > > theoretical possible transfer size of 128M (32k pages)
> > > > > > > > > > > > according
> > > > > > > > > > > > to
> > > > > > > > > > > > the
> > > > > > > > > > > > virtio specs:
> > > > > > > > > > > > 
> > > > > > > > > > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virt
> > > > > > > > > > > > io-v
> > > > > > > > > > > > 1.1-
> > > > > > > > > > > > cs
> > > > > > > > > > > > 01
> > > > > > > > > > > > .html#
> > > > > > > > > > > > x1-240006
> > > > > > > > > > > 
> > > > > > > > > > > Hi Christian,
> > > > > > > > 
> > > > > > > > > > > I took a quick look at the code:
> > > > > > > > Hi,
> > > > > > > > 
> > > > > > > > Thanks Stefan for sharing virtio expertise and helping Christian
> > > > > > > > !
> > > > > > > > 
> > > > > > > > > > > - The Linux 9p driver restricts descriptor chains to 128
> > > > > > > > > > > elements
> > > > > > > > > > > 
> > > > > > > > > > >   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)
> > > > > > > > > > 
> > > > > > > > > > Yes, that's the limitation that I am about to remove (WIP);
> > > > > > > > > > current
> > > > > > > > > > kernel
> > > > > > > > > > patches:
> > > > > > > > > > https://lore.kernel.org/netdev/cover.1632327421.git.linux_os
> > > > > > > > > > s@cr
> > > > > > > > > > udeb
> > > > > > > > > > yt
> > > > > > > > > > e.
> > > > > > > > > > com/>
> > > > > > > > > 
> > > > > > > > > I haven't read the patches yet but I'm concerned that today
> > > > > > > > > the
> > > > > > > > > driver
> > > > > > > > > is pretty well-behaved and this new patch series introduces a
> > > > > > > > > spec
> > > > > > > > > violation. Not fixing existing spec violations is okay, but
> > > > > > > > > adding
> > > > > > > > > new
> > > > > > > > > ones is a red flag. I think we need to figure out a clean
> > > > > > > > > solution.
> > > > > > > 
> > > > > > > Nobody has reviewed the kernel patches yet. My main concern
> > > > > > > therefore
> > > > > > > actually is that the kernel patches are already too complex,
> > > > > > > because
> > > > > > > the
> > > > > > > current situation is that only Dominique is handling 9p patches on
> > > > > > > kernel
> > > > > > > side, and he barely has time for 9p anymore.
> > > > > > > 
> > > > > > > Another reason for me to catch up on reading current kernel code
> > > > > > > and
> > > > > > > stepping in as reviewer of 9p on kernel side ASAP, independent of
> > > > > > > this
> > > > > > > issue.
> > > > > > > 
> > > > > > > As for current kernel patches' complexity: I can certainly drop
> > > > > > > patch
> > > > > > > 7
> > > > > > > entirely as it is probably just overkill. Patch 4 is then the
> > > > > > > biggest
> > > > > > > chunk, I have to see if I can simplify it, and whether it would
> > > > > > > make
> > > > > > > sense to squash with patch 3.
> > > > > > > 
> > > > > > > > > > > - The QEMU 9pfs code passes iovecs directly to preadv(2)
> > > > > > > > > > > and
> > > > > > > > > > > will
> > > > > > > > > > > fail
> > > > > > > > > > > 
> > > > > > > > > > >   with EINVAL when called with more than IOV_MAX iovecs
> > > > > > > > > > >   (hw/9pfs/9p.c:v9fs_read())
> > > > > > > > > > 
> > > > > > > > > > Hmm, which makes me wonder why I never encountered this
> > > > > > > > > > error
> > > > > > > > > > during
> > > > > > > > > > testing.
> > > > > > > > > > 
> > > > > > > > > > Most people will use the 9p qemu 'local' fs driver backend
> > > > > > > > > > in
> > > > > > > > > > practice,
> > > > > > > > > > so
> > > > > > > > > > that v9fs_read() call would translate for most people to
> > > > > > > > > > this
> > > > > > > > > > implementation on QEMU side (hw/9p/9p-local.c):
> > > > > > > > > > 
> > > > > > > > > > static ssize_t local_preadv(FsContext *ctx, V9fsFidOpenState
> > > > > > > > > > *fs,
> > > > > > > > > > 
> > > > > > > > > >                             const struct iovec *iov,
> > > > > > > > > >                             int iovcnt, off_t offset)
> > > > > > > > > > 
> > > > > > > > > > {
> > > > > > > > > > #ifdef CONFIG_PREADV
> > > > > > > > > > 
> > > > > > > > > >     return preadv(fs->fd, iov, iovcnt, offset);
> > > > > > > > > > 
> > > > > > > > > > #else
> > > > > > > > > > 
> > > > > > > > > >     int err = lseek(fs->fd, offset, SEEK_SET);
> > > > > > > > > >     if (err == -1) {
> > > > > > > > > >     
> > > > > > > > > >         return err;
> > > > > > > > > >     
> > > > > > > > > >     } else {
> > > > > > > > > >     
> > > > > > > > > >         return readv(fs->fd, iov, iovcnt);
> > > > > > > > > >     
> > > > > > > > > >     }
> > > > > > > > > > 
> > > > > > > > > > #endif
> > > > > > > > > > }
> > > > > > > > > > 
> > > > > > > > > > > Unless I misunderstood the code, neither side can take
> > > > > > > > > > > advantage
> > > > > > > > > > > of
> > > > > > > > > > > the
> > > > > > > > > > > new 32k descriptor chain limit?
> > > > > > > > > > > 
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Stefan
> > > > > > > > > > 
> > > > > > > > > > I need to check that when I have some more time. One
> > > > > > > > > > possible
> > > > > > > > > > explanation
> > > > > > > > > > might be that preadv() already has this wrapped into a loop
> > > > > > > > > > in
> > > > > > > > > > its
> > > > > > > > > > implementation to circumvent a limit like IOV_MAX. It might
> > > > > > > > > > be
> > > > > > > > > > another
> > > > > > > > > > "it
> > > > > > > > > > works, but not portable" issue, but not sure.
> > > > > > > > > > 
> > > > > > > > > > There are still a bunch of other issues I have to resolve.
> > > > > > > > > > If
> > > > > > > > > > you
> > > > > > > > > > look
> > > > > > > > > > at
> > > > > > > > > > net/9p/client.c on kernel side, you'll notice that it
> > > > > > > > > > basically
> > > > > > > > > > does
> > > > > > > > > > this ATM> >
> > > > > > > > > > 
> > > > > > > > > >     kmalloc(msize);
> > > > > > > > 
> > > > > > > > Note that this is done twice : once for the T message (client
> > > > > > > > request)
> > > > > > > > and
> > > > > > > > once for the R message (server answer). The 9p driver could
> > > > > > > > adjust
> > > > > > > > the
> > > > > > > > size
> > > > > > > > of the T message to what's really needed instead of allocating
> > > > > > > > the
> > > > > > > > full
> > > > > > > > msize. R message size is not known though.
> > > > > > > 
> > > > > > > Would it make sense adding a second virtio ring, dedicated to
> > > > > > > server
> > > > > > > responses to solve this? IIRC 9p server already calculates
> > > > > > > appropriate
> > > > > > > exact sizes for each response type. So server could just push
> > > > > > > space
> > > > > > > that's
> > > > > > > really needed for its responses.
> > > > > > > 
> > > > > > > > > > for every 9p request. So not only does it allocate much more
> > > > > > > > > > memory
> > > > > > > > > > for
> > > > > > > > > > every request than actually required (i.e. say 9pfs was
> > > > > > > > > > mounted
> > > > > > > > > > with
> > > > > > > > > > msize=8M, then a 9p request that actually would just need 1k
> > > > > > > > > > would
> > > > > > > > > > nevertheless allocate 8M), but also it allocates >
> > > > > > > > > > PAGE_SIZE,
> > > > > > > > > > which
> > > > > > > > > > obviously may fail at any time.>
> > > > > > > > > 
> > > > > > > > > The PAGE_SIZE limitation sounds like a kmalloc() vs vmalloc()
> > > > > > > > > situation.
> > > > > > > 
> > > > > > > Hu, I didn't even consider vmalloc(). I just tried the kvmalloc()
> > > > > > > wrapper
> > > > > > > as a quick & dirty test, but it crashed in the same way as
> > > > > > > kmalloc()
> > > > > > > with
> > > > > > > large msize values immediately on mounting:
> > > > > > > 
> > > > > > > diff --git a/net/9p/client.c b/net/9p/client.c
> > > > > > > index a75034fa249b..cfe300a4b6ca 100644
> > > > > > > --- a/net/9p/client.c
> > > > > > > +++ b/net/9p/client.c
> > > > > > > @@ -227,15 +227,18 @@ static int parse_opts(char *opts, struct
> > > > > > > p9_client
> > > > > > > *clnt)
> > > > > > > 
> > > > > > >  static int p9_fcall_init(struct p9_client *c, struct p9_fcall
> > > > > > >  *fc,
> > > > > > >  
> > > > > > >                          int alloc_msize)
> > > > > > >  
> > > > > > >  {
> > > > > > > 
> > > > > > > -       if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> > > > > > > +       //if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> > > > > > > +       if (false) {
> > > > > > > 
> > > > > > >                 fc->sdata = kmem_cache_alloc(c->fcall_cache,
> > > > > > >                 GFP_NOFS);
> > > > > > >                 fc->cache = c->fcall_cache;
> > > > > > >         
> > > > > > >         } else {
> > > > > > > 
> > > > > > > -               fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
> > > > > > > +               fc->sdata = kvmalloc(alloc_msize, GFP_NOFS);
> > > > > > 
> > > > > > Ok, GFP_NOFS -> GFP_KERNEL did the trick.
> > > > > > 
> > > > > > Now I get:
> > > > > >    virtio: bogus descriptor or out of resources
> > > > > > 
> > > > > > So, still some work ahead on both ends.
> > > > > 
> > > > > Few hacks later (only changes on 9p client side) I got this running
> > > > > stable
> > > > > now. The reason for the virtio error above was that kvmalloc() returns
> > > > > a
> > > > > non-logical kernel address for any kvmalloc(>4M), i.e. an address that
> > > > > is
> > > > > inaccessible from host side, hence that "bogus descriptor" message by
> > > > > QEMU.
> > > > > So I had to split those linear 9p client buffers into sparse ones (set
> > > > > of
> > > > > individual pages).
> > > > > 
> > > > > I tested this for some days with various virtio transmission sizes and
> > > > > it
> > > > > works as expected up to 128 MB (more precisely: 128 MB read space +
> > > > > 128 MB
> > > > > write space per virtio round trip message).
> > > > > 
> > > > > I did not encounter a show stopper for large virtio transmission sizes
> > > > > (4 MB ... 128 MB) on virtio level, neither as a result of testing, nor
> > > > > after reviewing the existing code.
> > > > > 
> > > > > About IOV_MAX: that's apparently not an issue on virtio level. Most of
> > > > > the
> > > > > iovec code, both on Linux kernel side and on QEMU side do not have
> > > > > this
> > > > > limitation. It is apparently however indeed a limitation for userland
> > > > > apps
> > > > > calling the Linux kernel's syscalls yet.
> > > > > 
> > > > > Stefan, as it stands now, I am even more convinced that the upper
> > > > > virtio
> > > > > transmission size limit should not be squeezed into the queue size
> > > > > argument of virtio_add_queue(). Not because of the previous argument
> > > > > that
> > > > > it would waste space (~1MB), but rather because they are two different
> > > > > things. To outline this, just a quick recap of what happens exactly
> > > > > when
> > > > > a bulk message is pushed over the virtio wire (assuming virtio "split"
> > > > > layout here):
> > > > > 
> > > > > ---------- [recap-start] ----------
> > > > > 
> > > > > For each bulk message sent guest <-> host, exactly *one* of the
> > > > > pre-allocated descriptors is taken and placed (subsequently) into
> > > > > exactly
> > > > > *one* position of the two available/used ring buffers. The actual
> > > > > descriptor table though, containing all the DMA addresses of the
> > > > > message
> > > > > bulk data, is allocated just in time for each round trip message. Say,
> > > > > it
> > > > > is the first message sent, it yields in the following structure:
> > > > > 
> > > > > Ring Buffer   Descriptor Table      Bulk Data Pages
> > > > > 
> > > > >    +-+              +-+           +-----------------+
> > > > >    
> > > > >    |D|------------->|d|---------->| Bulk data block |
> > > > >    
> > > > >    +-+              |d|--------+  +-----------------+
> > > > >    
> > > > >    | |              |d|------+ |
> > > > >    
> > > > >    +-+               .       | |  +-----------------+
> > > > >    
> > > > >    | |               .       | +->| Bulk data block |
> > > > >     
> > > > >     .                .       |    +-----------------+
> > > > >     .               |d|-+    |
> > > > >     .               +-+ |    |    +-----------------+
> > > > >     
> > > > >    | |                  |    +--->| Bulk data block |
> > > > >    
> > > > >    +-+                  |         +-----------------+
> > > > >    
> > > > >    | |                  |                 .
> > > > >    
> > > > >    +-+                  |                 .
> > > > >    
> > > > >                         |                 .
> > > > >                         |         
> > > > >                         |         +-----------------+
> > > > >                         
> > > > >                         +-------->| Bulk data block |
> > > > >                         
> > > > >                                   +-----------------+
> > > > > 
> > > > > Legend:
> > > > > D: pre-allocated descriptor
> > > > > d: just in time allocated descriptor
> > > > > -->: memory pointer (DMA)
> > > > > 
> > > > > The bulk data blocks are allocated by the respective device driver
> > > > > above
> > > > > virtio subsystem level (guest side).
> > > > > 
> > > > > There are exactly as many descriptors pre-allocated (D) as the size of
> > > > > a
> > > > > ring buffer.
> > > > > 
> > > > > A "descriptor" is more or less just a chainable DMA memory pointer;
> > > > > defined
> > > > > as:
> > > > > 
> > > > > /* Virtio ring descriptors: 16 bytes.  These can chain together via
> > > > > "next". */ struct vring_desc {
> > > > > 
> > > > > 	/* Address (guest-physical). */
> > > > > 	__virtio64 addr;
> > > > > 	/* Length. */
> > > > > 	__virtio32 len;
> > > > > 	/* The flags as indicated above. */
> > > > > 	__virtio16 flags;
> > > > > 	/* We chain unused descriptors via this, too */
> > > > > 	__virtio16 next;
> > > > > 
> > > > > };
> > > > > 
> > > > > There are 2 ring buffers; the "available" ring buffer is for sending a
> > > > > message guest->host (which will transmit DMA addresses of guest
> > > > > allocated
> > > > > bulk data blocks that are used for data sent to device, and separate
> > > > > guest allocated bulk data blocks that will be used by host side to
> > > > > place
> > > > > its response bulk data), and the "used" ring buffer is for sending
> > > > > host->guest to let guest know about host's response and that it could
> > > > > now
> > > > > safely consume and then deallocate the bulk data blocks subsequently.
> > > > > 
> > > > > ---------- [recap-end] ----------
> > > > > 
> > > > > So the "queue size" actually defines the ringbuffer size. It does not
> > > > > define the maximum amount of descriptors. The "queue size" rather
> > > > > defines
> > > > > how many pending messages can be pushed into either one ringbuffer
> > > > > before
> > > > > the other side would need to wait until the counter side would step up
> > > > > (i.e. ring buffer full).
> > > > > 
> > > > > The maximum amount of descriptors (what VIRTQUEUE_MAX_SIZE actually
> > > > > is)
> > > > > OTOH defines the max. bulk data size that could be transmitted with
> > > > > each
> > > > > virtio round trip message.
> > > > > 
> > > > > And in fact, 9p currently handles the virtio "queue size" as directly
> > > > > associative with its maximum amount of active 9p requests the server
> > > > > could
> > > > > 
> > > > > handle simultaniously:
> > > > >   hw/9pfs/9p.h:#define MAX_REQ         128
> > > > >   hw/9pfs/9p.h:    V9fsPDU pdus[MAX_REQ];
> > > > >   hw/9pfs/virtio-9p-device.c:    v->vq = virtio_add_queue(vdev,
> > > > >   MAX_REQ,
> > > > >   
> > > > >                                  handle_9p_output);
> > > > > 
> > > > > So if I would change it like this, just for the purpose to increase
> > > > > the
> > > > > max. virtio transmission size:
> > > > > 
> > > > > --- a/hw/9pfs/virtio-9p-device.c
> > > > > +++ b/hw/9pfs/virtio-9p-device.c
> > > > > @@ -218,7 +218,7 @@ static void virtio_9p_device_realize(DeviceState
> > > > > *dev,
> > > > > Error **errp)>
> > > > > 
> > > > >      v->config_size = sizeof(struct virtio_9p_config) +
> > > > >      strlen(s->fsconf.tag);
> > > > >      virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
> > > > >      
> > > > >                  VIRTQUEUE_MAX_SIZE);
> > > > > 
> > > > > -    v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
> > > > > +    v->vq = virtio_add_queue(vdev, 32*1024, handle_9p_output);
> > > > > 
> > > > >  }
> > > > > 
> > > > > Then it would require additional synchronization code on both ends and
> > > > > therefore unnecessary complexity, because it would now be possible
> > > > > that
> > > > > more requests are pushed into the ringbuffer than server could handle.
> > > > > 
> > > > > There is one potential issue though that probably did justify the
> > > > > "don't
> > > > > exceed the queue size" rule:
> > > > > 
> > > > > ATM the descriptor table is allocated (just in time) as *one*
> > > > > continuous
> > > > > buffer via kmalloc_array():
> > > > > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086
> > > > > f7c7
> > > > > d33a4/drivers/virtio/virtio_ring.c#L440
> > > > > 
> > > > > So assuming transmission size of 2 * 128 MB that kmalloc_array() call
> > > > > would
> > > > > yield in kmalloc(1M) and the latter might fail if guest had highly
> > > > > fragmented physical memory. For such kind of error case there is
> > > > > currently a fallback path in virtqueue_add_split() that would then use
> > > > > the required amount of pre-allocated descriptors instead:
> > > > > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086
> > > > > f7c7
> > > > > d33a4/drivers/virtio/virtio_ring.c#L525
> > > > > 
> > > > > That fallback recovery path would no longer be viable if the queue
> > > > > size
> > > > > was
> > > > > exceeded. There would be alternatives though, e.g. by allowing to
> > > > > chain
> > > > > indirect descriptor tables (currently prohibited by the virtio specs).
> > > > 
> > > > Making the maximum number of descriptors independent of the queue size
> > > > requires a change to the VIRTIO spec since the two values are currently
> > > > explicitly tied together by the spec.
> > > 
> > > Yes, that's what the virtio specs say. But they don't say why, nor did I
> > > hear a reason in this dicussion.
> > > 
> > > That's why I invested time reviewing current virtio implementation and
> > > specs, as well as actually testing exceeding that limit. And as I
> > > outlined in detail in my previous email, I only found one theoretical
> > > issue that could be addressed though.
> > 
> > I agree that there is a limitation in the VIRTIO spec, but violating the
> > spec isn't an acceptable solution:
> > 
> > 1. QEMU and Linux aren't the only components that implement VIRTIO. You
> >    cannot make assumptions about their implementations because it may
> >    break spec-compliant implementations that you haven't looked at.
> > 
> >    Your patches weren't able to increase Queue Size because some device
> >    implementations break when descriptor chains are too long. This shows
> >    there is a practical issue even in QEMU.
> > 
> > 2. The specific spec violation that we discussed creates the problem
> >    that drivers can no longer determine the maximum description chain
> >    length. This in turn will lead to more implementation-specific
> >    assumptions being baked into drivers and cause problems with
> >    interoperability and future changes.
> > 
> > The spec needs to be extended instead. I included an idea for how to do
> > that below.
> 
> Sure, I just wanted to see if there was a non-neglectable "hard" show stopper
> per se that I probably haven't seen yet. I have not questioned aiming a clean
> solution.
> 
> Thanks for the clarification!
> 
> > > > Before doing that, are there benchmark results showing that 1 MB vs 128
> > > > MB produces a performance improvement? I'm asking because if performance
> > > > with 1 MB is good then you can probably do that without having to change
> > > > VIRTIO and also because it's counter-intuitive that 9p needs 128 MB for
> > > > good performance when it's ultimately implemented on top of disk and
> > > > network I/O that have lower size limits.
> > > 
> > > First some numbers, linear reading a 12 GB file:
> > > 
> > > msize    average      notes
> > > 
> > > 8 kB     52.0 MB/s    default msize of Linux kernel <v5.15
> > > 128 kB   624.8 MB/s   default msize of Linux kernel >=v5.15
> > > 512 kB   1961 MB/s    current max. msize with any Linux kernel <=v5.15
> > > 1 MB     2551 MB/s    this msize would already violate virtio specs
> > > 2 MB     2521 MB/s    this msize would already violate virtio specs
> > > 4 MB     2628 MB/s    planned max. msize of my current kernel patches [1]
> > 
> > How many descriptors are used? 4 MB can be covered by a single
> > descriptor if the data is physically contiguous in memory, so this data
> > doesn't demonstrate a need for more descriptors.
> 
> No, in the last couple years there was apparently no kernel version that used
> just one descriptor, nor did my benchmarked version. Even though the Linux 9p
> client uses (yet) simple linear buffers (contiguous physical memory) on 9p
> client level, these are however split into PAGE_SIZE chunks by function
> pack_sg_list() [1] before being fed to virtio level:
> 
> static unsigned int rest_of_page(void *data)
> {
> 	return PAGE_SIZE - offset_in_page(data);
> }
> ...
> static int pack_sg_list(struct scatterlist *sg, int start,
> 			int limit, char *data, int count)
> {
> 	int s;
> 	int index = start;
> 
> 	while (count) {
> 		s = rest_of_page(data);
> 		...
> 		sg_set_buf(&sg[index++], data, s);
> 		count -= s;
> 		data += s;
> 	}
> 	...
> }
> 
> [1] https://github.com/torvalds/linux/blob/19901165d90fdca1e57c9baa0d5b4c63d15c476a/net/9p/trans_virtio.c#L171
> 
> So when sending 4MB over virtio wire, it would yield in 1k descriptors ATM.
> 
> I have wondered about this before, but did not question it, because due to the
> cross-platform nature I couldn't say for certain whether that's probably
> needed somewhere. I mean for the case virtio-PCI I know for sure that one
> descriptor (i.e. >PAGE_SIZE) would be fine, but I don't know if that applies
> to all buses and architectures.

VIRTIO does not limit the descriptor len field to PAGE_SIZE, so I don't
think there is a limit at the VIRTIO level.

If this function coalesced adjacent pages then the descriptor chain
length issue could be reduced.

> > > But again, this is not just about performance. My conclusion as described
> > > in my previous email is that virtio currently squeezes
> > > 
> > > 	"max. simultanious amount of bulk messages"
> > > 
> > > vs.
> > > 
> > > 	"max. bulk data transmission size per bulk messaage"
> > > 
> > > into the same configuration parameter, which is IMO inappropriate and
> > > hence
> > > splitting them into 2 separate parameters when creating a queue makes
> > > sense, independent of the performance benchmarks.
> > > 
> > > [1]
> > > https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyte.c
> > > om/
> > Some devices effectively already have this because the device advertises
> > a maximum number of descriptors via device-specific mechanisms like the
> > struct virtio_blk_config seg_max field. But today these fields can only
> > reduce the maximum descriptor chain length because the spec still limits
> > the length to Queue Size.
> > 
> > We can build on this approach to raise the length above Queue Size. This
> > approach has the advantage that the maximum number of segments isn't per
> > device or per virtqueue, it's fine-grained. If the device supports two
> > requests types then different max descriptor chain limits could be given
> > for them by introducing two separate configuration space fields.
> > 
> > Here are the corresponding spec changes:
> > 
> > 1. A new feature bit called VIRTIO_RING_F_LARGE_INDIRECT_DESC is added
> >    to indicate that indirect descriptor table size and maximum
> >    descriptor chain length are not limited by Queue Size value. (Maybe
> >    there still needs to be a limit like 2^15?)
> 
> Sounds good to me!
> 
> AFAIK it is effectively limited to 2^16 because of vring_desc->next:
> 
> /* Virtio ring descriptors: 16 bytes.  These can chain together via "next". */
> struct vring_desc {
>         /* Address (guest-physical). */
>         __virtio64 addr;
>         /* Length. */
>         __virtio32 len;
>         /* The flags as indicated above. */
>         __virtio16 flags;
>         /* We chain unused descriptors via this, too */
>         __virtio16 next;
> };

Yes, Split Virtqueues have a fundamental limit on indirect table size
due to the "next" field. Packed Virtqueue descriptors don't have a
"next" field so descriptor chains could be longer in theory (currently
forbidden by the spec).

> > One thing that's messy is that we've been discussing the maximum
> > descriptor chain length but 9p has the "msize" concept, which isn't
> > aware of contiguous memory. It may be necessary to extend the 9p driver
> > code to size requests not just according to their length in bytes but
> > also according to the descriptor chain length. That's how the Linux
> > block layer deals with queue limits (struct queue_limits max_segments vs
> > max_hw_sectors).
> 
> Hmm, can't follow on that one. For what should that be needed in case of 9p?
> My plan was to limit msize by 9p client simply at session start to whatever is
> the max. amount virtio descriptors supported by host and using PAGE_SIZE as
> size per descriptor, because that's what 9p client actually does ATM (see
> above). So you think that should be changed to e.g. just one descriptor for
> 4MB, right?

Limiting msize to the 9p transport device's maximum number of
descriptors is conservative (i.e. 128 descriptors = 512 KB msize)
because it doesn't take advantage of contiguous memory. I suggest
leaving msize alone, adding a separate limit at which requests are split
according to the maximum descriptor chain length, and tweaking
pack_sg_list() to coalesce adjacent pages.

That way msize can be large without necessarily using lots of
descriptors (depending on the memory layout).
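
Here is a rough, untested sketch of the kind of coalescing I mean, against
the pack_sg_list() code quoted above (it only handles lowmem/kmalloc buffers,
which is all that function deals with today anyway):

static int pack_sg_list(struct scatterlist *sg, int start,
			int limit, char *data, int count)
{
	int s;
	int index = start;

	while (count) {
		s = rest_of_page(data);
		if (s > count)
			s = count;
		/* Merge with the previous entry if this chunk is physically
		 * contiguous with it (true for linearly mapped kmalloc
		 * buffers), instead of consuming another descriptor.
		 */
		if (index > start &&
		    virt_to_phys(data) ==
		    virt_to_phys(sg_virt(&sg[index - 1])) + sg[index - 1].length) {
			sg[index - 1].length += s;
		} else {
			BUG_ON(index >= limit);
			sg_set_buf(&sg[index++], data, s);
		}
		count -= s;
		data += s;
	}
	if (index - start)
		sg_mark_end(&sg[index - 1]);
	return index - start;
}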

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Virtio-fs] [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
@ 2021-11-03 11:33                         ` Stefan Hajnoczi
  0 siblings, 0 replies; 97+ messages in thread
From: Stefan Hajnoczi @ 2021-11-03 11:33 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	Jason Wang, Amit Shah, David Hildenbrand, qemu-devel, virtio-fs,
	Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Marc-André Lureau, Paolo Bonzini, Fam Zheng,
	Raphael Norwitz

[-- Attachment #1: Type: text/plain, Size: 28306 bytes --]

On Mon, Nov 01, 2021 at 09:29:26PM +0100, Christian Schoenebeck wrote:
> On Donnerstag, 28. Oktober 2021 11:00:48 CET Stefan Hajnoczi wrote:
> > On Mon, Oct 25, 2021 at 05:03:25PM +0200, Christian Schoenebeck wrote:
> > > On Montag, 25. Oktober 2021 12:30:41 CEST Stefan Hajnoczi wrote:
> > > > On Thu, Oct 21, 2021 at 05:39:28PM +0200, Christian Schoenebeck wrote:
> > > > > On Freitag, 8. Oktober 2021 18:08:48 CEST Christian Schoenebeck wrote:
> > > > > > On Freitag, 8. Oktober 2021 16:24:42 CEST Christian Schoenebeck wrote:
> > > > > > > On Freitag, 8. Oktober 2021 09:25:33 CEST Greg Kurz wrote:
> > > > > > > > On Thu, 7 Oct 2021 16:42:49 +0100
> > > > > > > > 
> > > > > > > > Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > > > > On Thu, Oct 07, 2021 at 02:51:55PM +0200, Christian Schoenebeck wrote:
> > > > > > > > > > On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan Hajnoczi wrote:
> > > > > > > > > > > On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian
> > > > > > > > > > > Schoenebeck
> > > > > > 
> > > > > > wrote:
> > > > > > > > > > > > At the moment the maximum transfer size with virtio is
> > > > > > > > > > > > limited
> > > > > > > > > > > > to
> > > > > > > > > > > > 4M
> > > > > > > > > > > > (1024 * PAGE_SIZE). This series raises this limit to its
> > > > > > > > > > > > maximum
> > > > > > > > > > > > theoretical possible transfer size of 128M (32k pages)
> > > > > > > > > > > > according
> > > > > > > > > > > > to
> > > > > > > > > > > > the
> > > > > > > > > > > > virtio specs:
> > > > > > > > > > > > 
> > > > > > > > > > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virt
> > > > > > > > > > > > io-v
> > > > > > > > > > > > 1.1-
> > > > > > > > > > > > cs
> > > > > > > > > > > > 01
> > > > > > > > > > > > .html#
> > > > > > > > > > > > x1-240006
> > > > > > > > > > > 
> > > > > > > > > > > Hi Christian,
> > > > > > > > 
> > > > > > > > > > > I took a quick look at the code:
> > > > > > > > Hi,
> > > > > > > > 
> > > > > > > > Thanks Stefan for sharing virtio expertise and helping Christian
> > > > > > > > !
> > > > > > > > 
> > > > > > > > > > > - The Linux 9p driver restricts descriptor chains to 128
> > > > > > > > > > > elements
> > > > > > > > > > > 
> > > > > > > > > > >   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)
> > > > > > > > > > 
> > > > > > > > > > Yes, that's the limitation that I am about to remove (WIP);
> > > > > > > > > > current
> > > > > > > > > > kernel
> > > > > > > > > > patches:
> > > > > > > > > > https://lore.kernel.org/netdev/cover.1632327421.git.linux_os
> > > > > > > > > > s@cr
> > > > > > > > > > udeb
> > > > > > > > > > yt
> > > > > > > > > > e.
> > > > > > > > > > com/>
> > > > > > > > > 
> > > > > > > > > I haven't read the patches yet but I'm concerned that today
> > > > > > > > > the
> > > > > > > > > driver
> > > > > > > > > is pretty well-behaved and this new patch series introduces a
> > > > > > > > > spec
> > > > > > > > > violation. Not fixing existing spec violations is okay, but
> > > > > > > > > adding
> > > > > > > > > new
> > > > > > > > > ones is a red flag. I think we need to figure out a clean
> > > > > > > > > solution.
> > > > > > > 
> > > > > > > Nobody has reviewed the kernel patches yet. My main concern
> > > > > > > therefore
> > > > > > > actually is that the kernel patches are already too complex,
> > > > > > > because
> > > > > > > the
> > > > > > > current situation is that only Dominique is handling 9p patches on
> > > > > > > kernel
> > > > > > > side, and he barely has time for 9p anymore.
> > > > > > > 
> > > > > > > Another reason for me to catch up on reading current kernel code
> > > > > > > and
> > > > > > > stepping in as reviewer of 9p on kernel side ASAP, independent of
> > > > > > > this
> > > > > > > issue.
> > > > > > > 
> > > > > > > As for current kernel patches' complexity: I can certainly drop
> > > > > > > patch
> > > > > > > 7
> > > > > > > entirely as it is probably just overkill. Patch 4 is then the
> > > > > > > biggest
> > > > > > > chunk, I have to see if I can simplify it, and whether it would
> > > > > > > make
> > > > > > > sense to squash with patch 3.
> > > > > > > 
> > > > > > > > > > > - The QEMU 9pfs code passes iovecs directly to preadv(2)
> > > > > > > > > > > and
> > > > > > > > > > > will
> > > > > > > > > > > fail
> > > > > > > > > > > 
> > > > > > > > > > >   with EINVAL when called with more than IOV_MAX iovecs
> > > > > > > > > > >   (hw/9pfs/9p.c:v9fs_read())
> > > > > > > > > > 
> > > > > > > > > > Hmm, which makes me wonder why I never encountered this
> > > > > > > > > > error
> > > > > > > > > > during
> > > > > > > > > > testing.
> > > > > > > > > > 
> > > > > > > > > > Most people will use the 9p qemu 'local' fs driver backend
> > > > > > > > > > in
> > > > > > > > > > practice,
> > > > > > > > > > so
> > > > > > > > > > that v9fs_read() call would translate for most people to
> > > > > > > > > > this
> > > > > > > > > > implementation on QEMU side (hw/9p/9p-local.c):
> > > > > > > > > > 
> > > > > > > > > > static ssize_t local_preadv(FsContext *ctx, V9fsFidOpenState
> > > > > > > > > > *fs,
> > > > > > > > > > 
> > > > > > > > > >                             const struct iovec *iov,
> > > > > > > > > >                             int iovcnt, off_t offset)
> > > > > > > > > > 
> > > > > > > > > > {
> > > > > > > > > > #ifdef CONFIG_PREADV
> > > > > > > > > > 
> > > > > > > > > >     return preadv(fs->fd, iov, iovcnt, offset);
> > > > > > > > > > 
> > > > > > > > > > #else
> > > > > > > > > > 
> > > > > > > > > >     int err = lseek(fs->fd, offset, SEEK_SET);
> > > > > > > > > >     if (err == -1) {
> > > > > > > > > >     
> > > > > > > > > >         return err;
> > > > > > > > > >     
> > > > > > > > > >     } else {
> > > > > > > > > >     
> > > > > > > > > >         return readv(fs->fd, iov, iovcnt);
> > > > > > > > > >     
> > > > > > > > > >     }
> > > > > > > > > > 
> > > > > > > > > > #endif
> > > > > > > > > > }
> > > > > > > > > > 
> > > > > > > > > > > Unless I misunderstood the code, neither side can take
> > > > > > > > > > > advantage
> > > > > > > > > > > of
> > > > > > > > > > > the
> > > > > > > > > > > new 32k descriptor chain limit?
> > > > > > > > > > > 
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Stefan
> > > > > > > > > > 
> > > > > > > > > > I need to check that when I have some more time. One
> > > > > > > > > > possible
> > > > > > > > > > explanation
> > > > > > > > > > might be that preadv() already has this wrapped into a loop
> > > > > > > > > > in
> > > > > > > > > > its
> > > > > > > > > > implementation to circumvent a limit like IOV_MAX. It might
> > > > > > > > > > be
> > > > > > > > > > another
> > > > > > > > > > "it
> > > > > > > > > > works, but not portable" issue, but not sure.
> > > > > > > > > > 
> > > > > > > > > > There are still a bunch of other issues I have to resolve.
> > > > > > > > > > If
> > > > > > > > > > you
> > > > > > > > > > look
> > > > > > > > > > at
> > > > > > > > > > net/9p/client.c on kernel side, you'll notice that it
> > > > > > > > > > basically
> > > > > > > > > > does
> > > > > > > > > > this ATM> >
> > > > > > > > > > 
> > > > > > > > > >     kmalloc(msize);
> > > > > > > > 
> > > > > > > > Note that this is done twice : once for the T message (client
> > > > > > > > request)
> > > > > > > > and
> > > > > > > > once for the R message (server answer). The 9p driver could
> > > > > > > > adjust
> > > > > > > > the
> > > > > > > > size
> > > > > > > > of the T message to what's really needed instead of allocating
> > > > > > > > the
> > > > > > > > full
> > > > > > > > msize. R message size is not known though.
> > > > > > > 
> > > > > > > Would it make sense adding a second virtio ring, dedicated to
> > > > > > > server
> > > > > > > responses to solve this? IIRC 9p server already calculates
> > > > > > > appropriate
> > > > > > > exact sizes for each response type. So server could just push
> > > > > > > space
> > > > > > > that's
> > > > > > > really needed for its responses.
> > > > > > > 
> > > > > > > > > > for every 9p request. So not only does it allocate much more
> > > > > > > > > > memory
> > > > > > > > > > for
> > > > > > > > > > every request than actually required (i.e. say 9pfs was
> > > > > > > > > > mounted
> > > > > > > > > > with
> > > > > > > > > > msize=8M, then a 9p request that actually would just need 1k
> > > > > > > > > > would
> > > > > > > > > > nevertheless allocate 8M), but also it allocates >
> > > > > > > > > > PAGE_SIZE,
> > > > > > > > > > which
> > > > > > > > > > obviously may fail at any time.>
> > > > > > > > > 
> > > > > > > > > The PAGE_SIZE limitation sounds like a kmalloc() vs vmalloc()
> > > > > > > > > situation.
> > > > > > > 
> > > > > > > Hu, I didn't even consider vmalloc(). I just tried the kvmalloc()
> > > > > > > wrapper
> > > > > > > as a quick & dirty test, but it crashed in the same way as
> > > > > > > kmalloc()
> > > > > > > with
> > > > > > > large msize values immediately on mounting:
> > > > > > > 
> > > > > > > diff --git a/net/9p/client.c b/net/9p/client.c
> > > > > > > index a75034fa249b..cfe300a4b6ca 100644
> > > > > > > --- a/net/9p/client.c
> > > > > > > +++ b/net/9p/client.c
> > > > > > > @@ -227,15 +227,18 @@ static int parse_opts(char *opts, struct
> > > > > > > p9_client
> > > > > > > *clnt)
> > > > > > > 
> > > > > > >  static int p9_fcall_init(struct p9_client *c, struct p9_fcall
> > > > > > >  *fc,
> > > > > > >  
> > > > > > >                          int alloc_msize)
> > > > > > >  
> > > > > > >  {
> > > > > > > 
> > > > > > > -       if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> > > > > > > +       //if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> > > > > > > +       if (false) {
> > > > > > > 
> > > > > > >                 fc->sdata = kmem_cache_alloc(c->fcall_cache,
> > > > > > >                 GFP_NOFS);
> > > > > > >                 fc->cache = c->fcall_cache;
> > > > > > >         
> > > > > > >         } else {
> > > > > > > 
> > > > > > > -               fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
> > > > > > > +               fc->sdata = kvmalloc(alloc_msize, GFP_NOFS);
> > > > > > 
> > > > > > Ok, GFP_NOFS -> GFP_KERNEL did the trick.
> > > > > > 
> > > > > > Now I get:
> > > > > >    virtio: bogus descriptor or out of resources
> > > > > > 
> > > > > > So, still some work ahead on both ends.
> > > > > 
> > > > > Few hacks later (only changes on 9p client side) I got this running
> > > > > stable
> > > > > now. The reason for the virtio error above was that kvmalloc() returns
> > > > > a
> > > > > non-logical kernel address for any kvmalloc(>4M), i.e. an address that
> > > > > is
> > > > > inaccessible from host side, hence that "bogus descriptor" message by
> > > > > QEMU.
> > > > > So I had to split those linear 9p client buffers into sparse ones (set
> > > > > of
> > > > > individual pages).
> > > > > 
> > > > > I tested this for some days with various virtio transmission sizes and
> > > > > it
> > > > > works as expected up to 128 MB (more precisely: 128 MB read space +
> > > > > 128 MB
> > > > > write space per virtio round trip message).
> > > > > 
> > > > > I did not encounter a show stopper for large virtio transmission sizes
> > > > > (4 MB ... 128 MB) on virtio level, neither as a result of testing, nor
> > > > > after reviewing the existing code.
> > > > > 
> > > > > About IOV_MAX: that's apparently not an issue on virtio level. Most of
> > > > > the
> > > > > iovec code, both on Linux kernel side and on QEMU side do not have
> > > > > this
> > > > > limitation. It is apparently however indeed a limitation for userland
> > > > > apps
> > > > > calling the Linux kernel's syscalls yet.
> > > > > 
> > > > > Stefan, as it stands now, I am even more convinced that the upper
> > > > > virtio
> > > > > transmission size limit should not be squeezed into the queue size
> > > > > argument of virtio_add_queue(). Not because of the previous argument
> > > > > that
> > > > > it would waste space (~1MB), but rather because they are two different
> > > > > things. To outline this, just a quick recap of what happens exactly
> > > > > when
> > > > > a bulk message is pushed over the virtio wire (assuming virtio "split"
> > > > > layout here):
> > > > > 
> > > > > ---------- [recap-start] ----------
> > > > > 
> > > > > For each bulk message sent guest <-> host, exactly *one* of the
> > > > > pre-allocated descriptors is taken and placed (subsequently) into
> > > > > exactly
> > > > > *one* position of the two available/used ring buffers. The actual
> > > > > descriptor table though, containing all the DMA addresses of the
> > > > > message
> > > > > bulk data, is allocated just in time for each round trip message. Say,
> > > > > it
> > > > > is the first message sent, it yields in the following structure:
> > > > > 
> > > > > Ring Buffer   Descriptor Table      Bulk Data Pages
> > > > > 
> > > > >    +-+              +-+           +-----------------+
> > > > >    
> > > > >    |D|------------->|d|---------->| Bulk data block |
> > > > >    
> > > > >    +-+              |d|--------+  +-----------------+
> > > > >    
> > > > >    | |              |d|------+ |
> > > > >    
> > > > >    +-+               .       | |  +-----------------+
> > > > >    
> > > > >    | |               .       | +->| Bulk data block |
> > > > >     
> > > > >     .                .       |    +-----------------+
> > > > >     .               |d|-+    |
> > > > >     .               +-+ |    |    +-----------------+
> > > > >     
> > > > >    | |                  |    +--->| Bulk data block |
> > > > >    
> > > > >    +-+                  |         +-----------------+
> > > > >    
> > > > >    | |                  |                 .
> > > > >    
> > > > >    +-+                  |                 .
> > > > >    
> > > > >                         |                 .
> > > > >                         |         
> > > > >                         |         +-----------------+
> > > > >                         
> > > > >                         +-------->| Bulk data block |
> > > > >                         
> > > > >                                   +-----------------+
> > > > > 
> > > > > Legend:
> > > > > D: pre-allocated descriptor
> > > > > d: just in time allocated descriptor
> > > > > -->: memory pointer (DMA)
> > > > > 
> > > > > The bulk data blocks are allocated by the respective device driver
> > > > > above
> > > > > virtio subsystem level (guest side).
> > > > > 
> > > > > There are exactly as many descriptors pre-allocated (D) as the size of
> > > > > a
> > > > > ring buffer.
> > > > > 
> > > > > A "descriptor" is more or less just a chainable DMA memory pointer;
> > > > > defined
> > > > > as:
> > > > > 
> > > > > /* Virtio ring descriptors: 16 bytes.  These can chain together via
> > > > > "next". */ struct vring_desc {
> > > > > 
> > > > > 	/* Address (guest-physical). */
> > > > > 	__virtio64 addr;
> > > > > 	/* Length. */
> > > > > 	__virtio32 len;
> > > > > 	/* The flags as indicated above. */
> > > > > 	__virtio16 flags;
> > > > > 	/* We chain unused descriptors via this, too */
> > > > > 	__virtio16 next;
> > > > > 
> > > > > };
> > > > > 
> > > > > There are 2 ring buffers; the "available" ring buffer is for sending a
> > > > > message guest->host (which will transmit DMA addresses of guest
> > > > > allocated
> > > > > bulk data blocks that are used for data sent to device, and separate
> > > > > guest allocated bulk data blocks that will be used by host side to
> > > > > place
> > > > > its response bulk data), and the "used" ring buffer is for sending
> > > > > host->guest to let guest know about host's response and that it could
> > > > > now
> > > > > safely consume and then deallocate the bulk data blocks subsequently.
> > > > > 
> > > > > ---------- [recap-end] ----------
> > > > > 
> > > > > So the "queue size" actually defines the ringbuffer size. It does not
> > > > > define the maximum amount of descriptors. The "queue size" rather
> > > > > defines
> > > > > how many pending messages can be pushed into either one ringbuffer
> > > > > before
> > > > > the other side would need to wait until the counter side would step up
> > > > > (i.e. ring buffer full).
> > > > > 
> > > > > The maximum amount of descriptors (what VIRTQUEUE_MAX_SIZE actually
> > > > > is)
> > > > > OTOH defines the max. bulk data size that could be transmitted with
> > > > > each
> > > > > virtio round trip message.
> > > > > 
> > > > > And in fact, 9p currently handles the virtio "queue size" as directly
> > > > > associative with its maximum amount of active 9p requests the server
> > > > > could
> > > > > 
> > > > > handle simultaniously:
> > > > >   hw/9pfs/9p.h:#define MAX_REQ         128
> > > > >   hw/9pfs/9p.h:    V9fsPDU pdus[MAX_REQ];
> > > > >   hw/9pfs/virtio-9p-device.c:    v->vq = virtio_add_queue(vdev,
> > > > >   MAX_REQ,
> > > > >   
> > > > >                                  handle_9p_output);
> > > > > 
> > > > > So if I would change it like this, just for the purpose to increase
> > > > > the
> > > > > max. virtio transmission size:
> > > > > 
> > > > > --- a/hw/9pfs/virtio-9p-device.c
> > > > > +++ b/hw/9pfs/virtio-9p-device.c
> > > > > @@ -218,7 +218,7 @@ static void virtio_9p_device_realize(DeviceState
> > > > > *dev,
> > > > > Error **errp)>
> > > > > 
> > > > >      v->config_size = sizeof(struct virtio_9p_config) +
> > > > >      strlen(s->fsconf.tag);
> > > > >      virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
> > > > >      
> > > > >                  VIRTQUEUE_MAX_SIZE);
> > > > > 
> > > > > -    v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
> > > > > +    v->vq = virtio_add_queue(vdev, 32*1024, handle_9p_output);
> > > > > 
> > > > >  }
> > > > > 
> > > > > Then it would require additional synchronization code on both ends and
> > > > > therefore unnecessary complexity, because it would now be possible
> > > > > that
> > > > > more requests are pushed into the ringbuffer than server could handle.
> > > > > 
> > > > > There is one potential issue though that probably did justify the
> > > > > "don't
> > > > > exceed the queue size" rule:
> > > > > 
> > > > > ATM the descriptor table is allocated (just in time) as *one*
> > > > > continuous
> > > > > buffer via kmalloc_array():
> > > > > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086
> > > > > f7c7
> > > > > d33a4/drivers/virtio/virtio_ring.c#L440
> > > > > 
> > > > > So assuming transmission size of 2 * 128 MB that kmalloc_array() call
> > > > > would
> > > > > yield in kmalloc(1M) and the latter might fail if guest had highly
> > > > > fragmented physical memory. For such kind of error case there is
> > > > > currently a fallback path in virtqueue_add_split() that would then use
> > > > > the required amount of pre-allocated descriptors instead:
> > > > > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086
> > > > > f7c7
> > > > > d33a4/drivers/virtio/virtio_ring.c#L525
> > > > > 
> > > > > That fallback recovery path would no longer be viable if the queue
> > > > > size
> > > > > was
> > > > > exceeded. There would be alternatives though, e.g. by allowing to
> > > > > chain
> > > > > indirect descriptor tables (currently prohibited by the virtio specs).
> > > > 
> > > > Making the maximum number of descriptors independent of the queue size
> > > > requires a change to the VIRTIO spec since the two values are currently
> > > > explicitly tied together by the spec.
> > > 
> > > Yes, that's what the virtio specs say. But they don't say why, nor did I
> > > hear a reason in this dicussion.
> > > 
> > > That's why I invested time reviewing current virtio implementation and
> > > specs, as well as actually testing exceeding that limit. And as I
> > > outlined in detail in my previous email, I only found one theoretical
> > > issue that could be addressed though.
> > 
> > I agree that there is a limitation in the VIRTIO spec, but violating the
> > spec isn't an acceptable solution:
> > 
> > 1. QEMU and Linux aren't the only components that implement VIRTIO. You
> >    cannot make assumptions about their implementations because it may
> >    break spec-compliant implementations that you haven't looked at.
> > 
> >    Your patches weren't able to increase Queue Size because some device
> >    implementations break when descriptor chains are too long. This shows
> >    there is a practical issue even in QEMU.
> > 
> > 2. The specific spec violation that we discussed creates the problem
> >    that drivers can no longer determine the maximum description chain
> >    length. This in turn will lead to more implementation-specific
> >    assumptions being baked into drivers and cause problems with
> >    interoperability and future changes.
> > 
> > The spec needs to be extended instead. I included an idea for how to do
> > that below.
> 
> Sure, I just wanted to see if there was a non-negligible "hard" show stopper
> per se that I probably haven't seen yet. I have not questioned aiming for a
> clean solution.
> 
> Thanks for the clarification!
> 
> > > > Before doing that, are there benchmark results showing that 1 MB vs 128
> > > > MB produces a performance improvement? I'm asking because if performance
> > > > with 1 MB is good then you can probably do that without having to change
> > > > VIRTIO and also because it's counter-intuitive that 9p needs 128 MB for
> > > > good performance when it's ultimately implemented on top of disk and
> > > > network I/O that have lower size limits.
> > > 
> > > First some numbers, linear reading a 12 GB file:
> > > 
> > > msize    average      notes
> > > 
> > > 8 kB     52.0 MB/s    default msize of Linux kernel <v5.15
> > > 128 kB   624.8 MB/s   default msize of Linux kernel >=v5.15
> > > 512 kB   1961 MB/s    current max. msize with any Linux kernel <=v5.15
> > > 1 MB     2551 MB/s    this msize would already violate virtio specs
> > > 2 MB     2521 MB/s    this msize would already violate virtio specs
> > > 4 MB     2628 MB/s    planned max. msize of my current kernel patches [1]
> > 
> > How many descriptors are used? 4 MB can be covered by a single
> > descriptor if the data is physically contiguous in memory, so this data
> > doesn't demonstrate a need for more descriptors.
> 
> No, in the last couple years there was apparently no kernel version that used
> just one descriptor, nor did my benchmarked version. Even though the Linux 9p
> client uses (yet) simple linear buffers (contiguous physical memory) on 9p
> client level, these are however split into PAGE_SIZE chunks by function
> pack_sg_list() [1] before being fed to virtio level:
> 
> static unsigned int rest_of_page(void *data)
> {
> 	return PAGE_SIZE - offset_in_page(data);
> }
> ...
> static int pack_sg_list(struct scatterlist *sg, int start,
> 			int limit, char *data, int count)
> {
> 	int s;
> 	int index = start;
> 
> 	while (count) {
> 		s = rest_of_page(data);
> 		...
> 		sg_set_buf(&sg[index++], data, s);
> 		count -= s;
> 		data += s;
> 	}
> 	...
> }
> 
> [1] https://github.com/torvalds/linux/blob/19901165d90fdca1e57c9baa0d5b4c63d15c476a/net/9p/trans_virtio.c#L171
> 
> So when sending 4MB over virtio wire, it would yield in 1k descriptors ATM.
> 
> I have wondered about this before, but did not question it, because due to the
> cross-platform nature I couldn't say for certain whether that's probably
> needed somewhere. I mean for the case virtio-PCI I know for sure that one
> descriptor (i.e. >PAGE_SIZE) would be fine, but I don't know if that applies
> to all buses and architectures.

VIRTIO does not limit the descriptor len field to PAGE_SIZE, so I don't
think there is a limit at the VIRTIO level.

If this function coalesces adjacent pages then the descriptor chain
length issues could be reduced.
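
To make the idea concrete, here is a minimal sketch for the non-zero-copy
path, under the assumption that the 9p buffer comes from a single kmalloc()
allocation and is therefore physically contiguous (pack_sg_list_coalesced()
is a made-up name, not existing kernel code):

static int pack_sg_list_coalesced(struct scatterlist *sg, int start,
                                  int limit, char *data, int count)
{
        /*
         * kmalloc() memory is physically contiguous, so the whole linear
         * 9p buffer can be described by a single scatterlist element (and
         * hence a single virtio descriptor) instead of one element per
         * PAGE_SIZE chunk.
         */
        BUG_ON(start >= limit);
        sg_unmark_end(&sg[start]);
        sg_set_buf(&sg[start], data, count);
        sg_mark_end(&sg[start]);
        return 1;
}

A real version would still need per-page (or per-run) entries for memory that
is not physically contiguous, e.g. pinned user pages on the zero-copy path.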

> > > But again, this is not just about performance. My conclusion as described
> > > in my previous email is that virtio currently squeezes
> > > 
> > > 	"max. simultaneous amount of bulk messages"
> > > 
> > > vs.
> > > 
> > > 	"max. bulk data transmission size per bulk message"
> > > 
> > > into the same configuration parameter, which is IMO inappropriate and
> > > hence
> > > splitting them into 2 separate parameters when creating a queue makes
> > > sense, independent of the performance benchmarks.
> > > 
> > > [1]
> > > https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyte.c
> > > om/
> > Some devices effectively already have this because the device advertises
> > a maximum number of descriptors via device-specific mechanisms like the
> > struct virtio_blk_config seg_max field. But today these fields can only
> > reduce the maximum descriptor chain length because the spec still limits
> > the length to Queue Size.
> > 
> > We can build on this approach to raise the length above Queue Size. This
> > approach has the advantage that the maximum number of segments isn't per
> > device or per virtqueue, it's fine-grained. If the device supports two
> > requests types then different max descriptor chain limits could be given
> > for them by introducing two separate configuration space fields.
> > 
> > Here are the corresponding spec changes:
> > 
> > 1. A new feature bit called VIRTIO_RING_F_LARGE_INDIRECT_DESC is added
> >    to indicate that indirect descriptor table size and maximum
> >    descriptor chain length are not limited by Queue Size value. (Maybe
> >    there still needs to be a limit like 2^15?)
> 
> Sounds good to me!
> 
> AFAIK it is effectively limited to 2^16 because of vring_desc->next:
> 
> /* Virtio ring descriptors: 16 bytes.  These can chain together via "next". */
> struct vring_desc {
>         /* Address (guest-physical). */
>         __virtio64 addr;
>         /* Length. */
>         __virtio32 len;
>         /* The flags as indicated above. */
>         __virtio16 flags;
>         /* We chain unused descriptors via this, too */
>         __virtio16 next;
> };

Yes, Split Virtqueues have a fundamental limit on indirect table size
due to the "next" field. Packed Virtqueue descriptors don't have a
"next" field so descriptor chains could be longer in theory (currently
forbidden by the spec).
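
For comparison, the packed ring descriptor layout (as in the virtio spec and
mirrored by Linux's include/uapi/linux/virtio_ring.h) has no "next" field;
chains are formed by placing descriptors consecutively in the ring:

struct vring_packed_desc {
        /* Buffer Address. */
        __le64 addr;
        /* Buffer Length. */
        __le32 len;
        /* Buffer ID. */
        __le16 id;
        /* The flags depending on descriptor type. */
        __le16 flags;
};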

> > One thing that's messy is that we've been discussing the maximum
> > descriptor chain length but 9p has the "msize" concept, which isn't
> > aware of contiguous memory. It may be necessary to extend the 9p driver
> > code to size requests not just according to their length in bytes but
> > also according to the descriptor chain length. That's how the Linux
> > block layer deals with queue limits (struct queue_limits max_segments vs
> > max_hw_sectors).
> 
> Hmm, can't follow on that one. For what should that be needed in case of 9p?
> My plan was to limit msize by 9p client simply at session start to whatever is
> the max. amount virtio descriptors supported by host and using PAGE_SIZE as
> size per descriptor, because that's what 9p client actually does ATM (see
> above). So you think that should be changed to e.g. just one descriptor for
> 4MB, right?

Limiting msize to the 9p transport device's maximum number of
descriptors is conservative (i.e. 128 descriptors = 512 KB msize)
because it doesn't take advantage of contiguous memory. I suggest
leaving msize alone, adding a separate limit at which requests are split
according to the maximum descriptor chain length, and tweaking
pack_sg_list() to coalesce adjacent pages.

That way msize can be large without necessarily using lots of
descriptors (depending on the memory layout).
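
A very rough sketch of that separate split limit (p9_virtio_max_req_size() is
a hypothetical helper, not existing code), assuming the worst case where
every segment ends up being a separate, non-adjacent page:

static size_t p9_virtio_max_req_size(unsigned int max_segs)
{
        /*
         * Worst case: each segment covers exactly one page, so a single
         * request must not need more than max_segs pages. msize itself can
         * stay larger; requests above this threshold would be split.
         */
        return (size_t)max_segs * PAGE_SIZE;
}

With page coalescing in pack_sg_list() the effective per-request limit would
usually be much higher than this worst case.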

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
  2021-11-03 11:33                         ` [Virtio-fs] " Stefan Hajnoczi
@ 2021-11-04 14:41                           ` Christian Schoenebeck
  -1 siblings, 0 replies; 97+ messages in thread
From: Christian Schoenebeck @ 2021-11-04 14:41 UTC (permalink / raw)
  To: qemu-devel
  Cc: Stefan Hajnoczi, Kevin Wolf, Laurent Vivier, qemu-block,
	Michael S. Tsirkin, Jason Wang, Amit Shah, David Hildenbrand,
	Greg Kurz, virtio-fs, Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Marc-André Lureau, Paolo Bonzini, Fam Zheng,
	Raphael Norwitz, Dr. David Alan Gilbert

On Mittwoch, 3. November 2021 12:33:33 CET Stefan Hajnoczi wrote:
> On Mon, Nov 01, 2021 at 09:29:26PM +0100, Christian Schoenebeck wrote:
> > On Donnerstag, 28. Oktober 2021 11:00:48 CET Stefan Hajnoczi wrote:
> > > On Mon, Oct 25, 2021 at 05:03:25PM +0200, Christian Schoenebeck wrote:
> > > > On Montag, 25. Oktober 2021 12:30:41 CEST Stefan Hajnoczi wrote:
> > > > > On Thu, Oct 21, 2021 at 05:39:28PM +0200, Christian Schoenebeck wrote:
> > > > > > On Freitag, 8. Oktober 2021 18:08:48 CEST Christian Schoenebeck wrote:
> > > > > > > On Freitag, 8. Oktober 2021 16:24:42 CEST Christian Schoenebeck wrote:
> > > > > > > > On Freitag, 8. Oktober 2021 09:25:33 CEST Greg Kurz wrote:
> > > > > > > > > On Thu, 7 Oct 2021 16:42:49 +0100
> > > > > > > > > 
> > > > > > > > > Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > > > > > On Thu, Oct 07, 2021 at 02:51:55PM +0200, Christian Schoenebeck wrote:
> > > > > > > > > > > On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan Hajnoczi wrote:
> > > > > > > > > > > > On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian
> > > > > > > > > > > > Schoenebeck
> > > > > > > 
> > > > > > > wrote:
> > > > > > > > > > > > > At the moment the maximum transfer size with virtio
> > > > > > > > > > > > > is
> > > > > > > > > > > > > limited
> > > > > > > > > > > > > to
> > > > > > > > > > > > > 4M
> > > > > > > > > > > > > (1024 * PAGE_SIZE). This series raises this limit to
> > > > > > > > > > > > > its
> > > > > > > > > > > > > maximum
> > > > > > > > > > > > > theoretical possible transfer size of 128M (32k
> > > > > > > > > > > > > pages)
> > > > > > > > > > > > > according
> > > > > > > > > > > > > to
> > > > > > > > > > > > > the
> > > > > > > > > > > > > virtio specs:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/
> > > > > > > > > > > > > virt
> > > > > > > > > > > > > io-v
> > > > > > > > > > > > > 1.1-
> > > > > > > > > > > > > cs
> > > > > > > > > > > > > 01
> > > > > > > > > > > > > .html#
> > > > > > > > > > > > > x1-240006
> > > > > > > > > > > > 
> > > > > > > > > > > > Hi Christian,
> > > > > > > > > 
> > > > > > > > > > > > I took a quick look at the code:
> > > > > > > > > Hi,
> > > > > > > > > 
> > > > > > > > > Thanks Stefan for sharing virtio expertise and helping
> > > > > > > > > Christian
> > > > > > > > > !
> > > > > > > > > 
> > > > > > > > > > > > - The Linux 9p driver restricts descriptor chains to
> > > > > > > > > > > > 128
> > > > > > > > > > > > elements
> > > > > > > > > > > > 
> > > > > > > > > > > >   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)
> > > > > > > > > > > 
> > > > > > > > > > > Yes, that's the limitation that I am about to remove
> > > > > > > > > > > (WIP);
> > > > > > > > > > > current
> > > > > > > > > > > kernel
> > > > > > > > > > > patches:
> > > > > > > > > > > https://lore.kernel.org/netdev/cover.1632327421.git.linu
> > > > > > > > > > > x_os
> > > > > > > > > > > s@cr
> > > > > > > > > > > udeb
> > > > > > > > > > > yt
> > > > > > > > > > > e.
> > > > > > > > > > > com/>
> > > > > > > > > > 
> > > > > > > > > > I haven't read the patches yet but I'm concerned that
> > > > > > > > > > today
> > > > > > > > > > the
> > > > > > > > > > driver
> > > > > > > > > > is pretty well-behaved and this new patch series
> > > > > > > > > > introduces a
> > > > > > > > > > spec
> > > > > > > > > > violation. Not fixing existing spec violations is okay,
> > > > > > > > > > but
> > > > > > > > > > adding
> > > > > > > > > > new
> > > > > > > > > > ones is a red flag. I think we need to figure out a clean
> > > > > > > > > > solution.
> > > > > > > > 
> > > > > > > > Nobody has reviewed the kernel patches yet. My main concern
> > > > > > > > therefore
> > > > > > > > actually is that the kernel patches are already too complex,
> > > > > > > > because
> > > > > > > > the
> > > > > > > > current situation is that only Dominique is handling 9p
> > > > > > > > patches on
> > > > > > > > kernel
> > > > > > > > side, and he barely has time for 9p anymore.
> > > > > > > > 
> > > > > > > > Another reason for me to catch up on reading current kernel
> > > > > > > > code
> > > > > > > > and
> > > > > > > > stepping in as reviewer of 9p on kernel side ASAP, independent
> > > > > > > > of
> > > > > > > > this
> > > > > > > > issue.
> > > > > > > > 
> > > > > > > > As for current kernel patches' complexity: I can certainly
> > > > > > > > drop
> > > > > > > > patch
> > > > > > > > 7
> > > > > > > > entirely as it is probably just overkill. Patch 4 is then the
> > > > > > > > biggest
> > > > > > > > chunk, I have to see if I can simplify it, and whether it
> > > > > > > > would
> > > > > > > > make
> > > > > > > > sense to squash with patch 3.
> > > > > > > > 
> > > > > > > > > > > > - The QEMU 9pfs code passes iovecs directly to
> > > > > > > > > > > > preadv(2)
> > > > > > > > > > > > and
> > > > > > > > > > > > will
> > > > > > > > > > > > fail
> > > > > > > > > > > > 
> > > > > > > > > > > >   with EINVAL when called with more than IOV_MAX
> > > > > > > > > > > >   iovecs
> > > > > > > > > > > >   (hw/9pfs/9p.c:v9fs_read())
> > > > > > > > > > > 
> > > > > > > > > > > Hmm, which makes me wonder why I never encountered this
> > > > > > > > > > > error
> > > > > > > > > > > during
> > > > > > > > > > > testing.
> > > > > > > > > > > 
> > > > > > > > > > > Most people will use the 9p qemu 'local' fs driver
> > > > > > > > > > > backend
> > > > > > > > > > > in
> > > > > > > > > > > practice,
> > > > > > > > > > > so
> > > > > > > > > > > that v9fs_read() call would translate for most people to
> > > > > > > > > > > this
> > > > > > > > > > > implementation on QEMU side (hw/9p/9p-local.c):
> > > > > > > > > > > 
> > > > > > > > > > > static ssize_t local_preadv(FsContext *ctx,
> > > > > > > > > > > V9fsFidOpenState
> > > > > > > > > > > *fs,
> > > > > > > > > > > 
> > > > > > > > > > >                             const struct iovec *iov,
> > > > > > > > > > >                             int iovcnt, off_t offset)
> > > > > > > > > > > 
> > > > > > > > > > > {
> > > > > > > > > > > #ifdef CONFIG_PREADV
> > > > > > > > > > > 
> > > > > > > > > > >     return preadv(fs->fd, iov, iovcnt, offset);
> > > > > > > > > > > 
> > > > > > > > > > > #else
> > > > > > > > > > > 
> > > > > > > > > > >     int err = lseek(fs->fd, offset, SEEK_SET);
> > > > > > > > > > >     if (err == -1) {
> > > > > > > > > > >     
> > > > > > > > > > >         return err;
> > > > > > > > > > >     
> > > > > > > > > > >     } else {
> > > > > > > > > > >     
> > > > > > > > > > >         return readv(fs->fd, iov, iovcnt);
> > > > > > > > > > >     
> > > > > > > > > > >     }
> > > > > > > > > > > 
> > > > > > > > > > > #endif
> > > > > > > > > > > }
> > > > > > > > > > > 
> > > > > > > > > > > > Unless I misunderstood the code, neither side can take
> > > > > > > > > > > > advantage
> > > > > > > > > > > > of
> > > > > > > > > > > > the
> > > > > > > > > > > > new 32k descriptor chain limit?
> > > > > > > > > > > > 
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Stefan
> > > > > > > > > > > 
> > > > > > > > > > > I need to check that when I have some more time. One
> > > > > > > > > > > possible
> > > > > > > > > > > explanation
> > > > > > > > > > > might be that preadv() already has this wrapped into a
> > > > > > > > > > > loop
> > > > > > > > > > > in
> > > > > > > > > > > its
> > > > > > > > > > > implementation to circumvent a limit like IOV_MAX. It
> > > > > > > > > > > might
> > > > > > > > > > > be
> > > > > > > > > > > another
> > > > > > > > > > > "it
> > > > > > > > > > > works, but not portable" issue, but not sure.
> > > > > > > > > > > 
> > > > > > > > > > > There are still a bunch of other issues I have to
> > > > > > > > > > > resolve.
> > > > > > > > > > > If
> > > > > > > > > > > you
> > > > > > > > > > > look
> > > > > > > > > > > at
> > > > > > > > > > > net/9p/client.c on kernel side, you'll notice that it
> > > > > > > > > > > basically
> > > > > > > > > > > does
> > > > > > > > > > > this ATM> >
> > > > > > > > > > > 
> > > > > > > > > > >     kmalloc(msize);
> > > > > > > > > 
> > > > > > > > > Note that this is done twice : once for the T message
> > > > > > > > > (client
> > > > > > > > > request)
> > > > > > > > > and
> > > > > > > > > once for the R message (server answer). The 9p driver could
> > > > > > > > > adjust
> > > > > > > > > the
> > > > > > > > > size
> > > > > > > > > of the T message to what's really needed instead of
> > > > > > > > > allocating
> > > > > > > > > the
> > > > > > > > > full
> > > > > > > > > msize. R message size is not known though.
> > > > > > > > 
> > > > > > > > Would it make sense adding a second virtio ring, dedicated to
> > > > > > > > server
> > > > > > > > responses to solve this? IIRC 9p server already calculates
> > > > > > > > appropriate
> > > > > > > > exact sizes for each response type. So server could just push
> > > > > > > > space
> > > > > > > > that's
> > > > > > > > really needed for its responses.
> > > > > > > > 
> > > > > > > > > > > for every 9p request. So not only does it allocate much
> > > > > > > > > > > more
> > > > > > > > > > > memory
> > > > > > > > > > > for
> > > > > > > > > > > every request than actually required (i.e. say 9pfs was
> > > > > > > > > > > mounted
> > > > > > > > > > > with
> > > > > > > > > > > msize=8M, then a 9p request that actually would just
> > > > > > > > > > > need 1k
> > > > > > > > > > > would
> > > > > > > > > > > nevertheless allocate 8M), but also it allocates >
> > > > > > > > > > > PAGE_SIZE,
> > > > > > > > > > > which
> > > > > > > > > > > obviously may fail at any time.>
> > > > > > > > > > 
> > > > > > > > > > The PAGE_SIZE limitation sounds like a kmalloc() vs
> > > > > > > > > > vmalloc()
> > > > > > > > > > situation.
> > > > > > > > 
> > > > > > > > Hu, I didn't even consider vmalloc(). I just tried the
> > > > > > > > kvmalloc()
> > > > > > > > wrapper
> > > > > > > > as a quick & dirty test, but it crashed in the same way as
> > > > > > > > kmalloc()
> > > > > > > > with
> > > > > > > > large msize values immediately on mounting:
> > > > > > > > 
> > > > > > > > diff --git a/net/9p/client.c b/net/9p/client.c
> > > > > > > > index a75034fa249b..cfe300a4b6ca 100644
> > > > > > > > --- a/net/9p/client.c
> > > > > > > > +++ b/net/9p/client.c
> > > > > > > > @@ -227,15 +227,18 @@ static int parse_opts(char *opts, struct
> > > > > > > > p9_client
> > > > > > > > *clnt)
> > > > > > > > 
> > > > > > > >  static int p9_fcall_init(struct p9_client *c, struct p9_fcall
> > > > > > > >  *fc,
> > > > > > > >  
> > > > > > > >                          int alloc_msize)
> > > > > > > >  
> > > > > > > >  {
> > > > > > > > 
> > > > > > > > -       if (likely(c->fcall_cache) && alloc_msize == c->msize)
> > > > > > > > {
> > > > > > > > +       //if (likely(c->fcall_cache) && alloc_msize ==
> > > > > > > > c->msize) {
> > > > > > > > +       if (false) {
> > > > > > > > 
> > > > > > > >                 fc->sdata = kmem_cache_alloc(c->fcall_cache,
> > > > > > > >                 GFP_NOFS);
> > > > > > > >                 fc->cache = c->fcall_cache;
> > > > > > > >         
> > > > > > > >         } else {
> > > > > > > > 
> > > > > > > > -               fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
> > > > > > > > +               fc->sdata = kvmalloc(alloc_msize, GFP_NOFS);
> > > > > > > 
> > > > > > > Ok, GFP_NOFS -> GFP_KERNEL did the trick.
> > > > > > > 
> > > > > > > Now I get:
> > > > > > >    virtio: bogus descriptor or out of resources
> > > > > > > 
> > > > > > > So, still some work ahead on both ends.
> > > > > > 
> > > > > > Few hacks later (only changes on 9p client side) I got this
> > > > > > running
> > > > > > stable
> > > > > > now. The reason for the virtio error above was that kvmalloc()
> > > > > > returns
> > > > > > a
> > > > > > non-logical kernel address for any kvmalloc(>4M), i.e. an address
> > > > > > that
> > > > > > is
> > > > > > inaccessible from host side, hence that "bogus descriptor" message
> > > > > > by
> > > > > > QEMU.
> > > > > > So I had to split those linear 9p client buffers into sparse ones
> > > > > > (set
> > > > > > of
> > > > > > individual pages).
> > > > > > 
> > > > > > I tested this for some days with various virtio transmission sizes
> > > > > > and
> > > > > > it
> > > > > > works as expected up to 128 MB (more precisely: 128 MB read space
> > > > > > +
> > > > > > 128 MB
> > > > > > write space per virtio round trip message).
> > > > > > 
> > > > > > I did not encounter a show stopper for large virtio transmission
> > > > > > sizes
> > > > > > (4 MB ... 128 MB) on virtio level, neither as a result of testing,
> > > > > > nor
> > > > > > after reviewing the existing code.
> > > > > > 
> > > > > > About IOV_MAX: that's apparently not an issue on virtio level.
> > > > > > Most of
> > > > > > the
> > > > > > iovec code, both on Linux kernel side and on QEMU side do not have
> > > > > > this
> > > > > > limitation. It is apparently however indeed a limitation for
> > > > > > userland
> > > > > > apps
> > > > > > calling the Linux kernel's syscalls yet.
> > > > > > 
> > > > > > Stefan, as it stands now, I am even more convinced that the upper
> > > > > > virtio
> > > > > > transmission size limit should not be squeezed into the queue size
> > > > > > argument of virtio_add_queue(). Not because of the previous
> > > > > > argument
> > > > > > that
> > > > > > it would waste space (~1MB), but rather because they are two
> > > > > > different
> > > > > > things. To outline this, just a quick recap of what happens
> > > > > > exactly
> > > > > > when
> > > > > > a bulk message is pushed over the virtio wire (assuming virtio
> > > > > > "split"
> > > > > > layout here):
> > > > > > 
> > > > > > ---------- [recap-start] ----------
> > > > > > 
> > > > > > For each bulk message sent guest <-> host, exactly *one* of the
> > > > > > pre-allocated descriptors is taken and placed (subsequently) into
> > > > > > exactly
> > > > > > *one* position of the two available/used ring buffers. The actual
> > > > > > descriptor table though, containing all the DMA addresses of the
> > > > > > message
> > > > > > bulk data, is allocated just in time for each round trip message.
> > > > > > Say,
> > > > > > it
> > > > > > is the first message sent, it yields in the following structure:
> > > > > > 
> > > > > > Ring Buffer   Descriptor Table      Bulk Data Pages
> > > > > > 
> > > > > >    +-+              +-+           +-----------------+
> > > > > >    
> > > > > >    |D|------------->|d|---------->| Bulk data block |
> > > > > >    
> > > > > >    +-+              |d|--------+  +-----------------+
> > > > > >    
> > > > > >    | |              |d|------+ |
> > > > > >    
> > > > > >    +-+               .       | |  +-----------------+
> > > > > >    
> > > > > >    | |               .       | +->| Bulk data block |
> > > > > >     
> > > > > >     .                .       |    +-----------------+
> > > > > >     .               |d|-+    |
> > > > > >     .               +-+ |    |    +-----------------+
> > > > > >     
> > > > > >    | |                  |    +--->| Bulk data block |
> > > > > >    
> > > > > >    +-+                  |         +-----------------+
> > > > > >    
> > > > > >    | |                  |                 .
> > > > > >    
> > > > > >    +-+                  |                 .
> > > > > >    
> > > > > >                         |                 .
> > > > > >                         |         
> > > > > >                         |         +-----------------+
> > > > > >                         
> > > > > >                         +-------->| Bulk data block |
> > > > > >                         
> > > > > >                                   +-----------------+
> > > > > > 
> > > > > > Legend:
> > > > > > D: pre-allocated descriptor
> > > > > > d: just in time allocated descriptor
> > > > > > -->: memory pointer (DMA)
> > > > > > 
> > > > > > The bulk data blocks are allocated by the respective device driver
> > > > > > above
> > > > > > virtio subsystem level (guest side).
> > > > > > 
> > > > > > There are exactly as many descriptors pre-allocated (D) as the
> > > > > > size of
> > > > > > a
> > > > > > ring buffer.
> > > > > > 
> > > > > > A "descriptor" is more or less just a chainable DMA memory
> > > > > > pointer;
> > > > > > defined
> > > > > > as:
> > > > > > 
> > > > > > /* Virtio ring descriptors: 16 bytes.  These can chain together
> > > > > > via
> > > > > > "next". */ struct vring_desc {
> > > > > > 
> > > > > > 	/* Address (guest-physical). */
> > > > > > 	__virtio64 addr;
> > > > > > 	/* Length. */
> > > > > > 	__virtio32 len;
> > > > > > 	/* The flags as indicated above. */
> > > > > > 	__virtio16 flags;
> > > > > > 	/* We chain unused descriptors via this, too */
> > > > > > 	__virtio16 next;
> > > > > > 
> > > > > > };
> > > > > > 
> > > > > > There are 2 ring buffers; the "available" ring buffer is for
> > > > > > sending a
> > > > > > message guest->host (which will transmit DMA addresses of guest
> > > > > > allocated
> > > > > > bulk data blocks that are used for data sent to device, and
> > > > > > separate
> > > > > > guest allocated bulk data blocks that will be used by host side to
> > > > > > place
> > > > > > its response bulk data), and the "used" ring buffer is for sending
> > > > > > host->guest to let guest know about host's response and that it
> > > > > > could
> > > > > > now
> > > > > > safely consume and then deallocate the bulk data blocks
> > > > > > subsequently.
> > > > > > 
> > > > > > ---------- [recap-end] ----------
> > > > > > 
> > > > > > So the "queue size" actually defines the ringbuffer size. It does
> > > > > > not
> > > > > > define the maximum amount of descriptors. The "queue size" rather
> > > > > > defines
> > > > > > how many pending messages can be pushed into either one ringbuffer
> > > > > > before
> > > > > > the other side would need to wait until the counter side would
> > > > > > step up
> > > > > > (i.e. ring buffer full).
> > > > > > 
> > > > > > The maximum amount of descriptors (what VIRTQUEUE_MAX_SIZE
> > > > > > actually
> > > > > > is)
> > > > > > OTOH defines the max. bulk data size that could be transmitted
> > > > > > with
> > > > > > each
> > > > > > virtio round trip message.
> > > > > > 
> > > > > > And in fact, 9p currently handles the virtio "queue size" as
> > > > > > directly
> > > > > > associative with its maximum amount of active 9p requests the
> > > > > > server
> > > > > > could
> > > > > > 
> > > > > > handle simultaneously:
> > > > > >   hw/9pfs/9p.h:#define MAX_REQ         128
> > > > > >   hw/9pfs/9p.h:    V9fsPDU pdus[MAX_REQ];
> > > > > >   hw/9pfs/virtio-9p-device.c:    v->vq = virtio_add_queue(vdev,
> > > > > >   MAX_REQ,
> > > > > >   
> > > > > >                                  handle_9p_output);
> > > > > > 
> > > > > > So if I would change it like this, just for the purpose to
> > > > > > increase
> > > > > > the
> > > > > > max. virtio transmission size:
> > > > > > 
> > > > > > --- a/hw/9pfs/virtio-9p-device.c
> > > > > > +++ b/hw/9pfs/virtio-9p-device.c
> > > > > > @@ -218,7 +218,7 @@ static void
> > > > > > virtio_9p_device_realize(DeviceState
> > > > > > *dev,
> > > > > > Error **errp)>
> > > > > > 
> > > > > >      v->config_size = sizeof(struct virtio_9p_config) +
> > > > > >      strlen(s->fsconf.tag);
> > > > > >      virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
> > > > > >      
> > > > > >                  VIRTQUEUE_MAX_SIZE);
> > > > > > 
> > > > > > -    v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
> > > > > > +    v->vq = virtio_add_queue(vdev, 32*1024, handle_9p_output);
> > > > > > 
> > > > > >  }
> > > > > > 
> > > > > > Then it would require additional synchronization code on both ends
> > > > > > and
> > > > > > therefore unnecessary complexity, because it would now be possible
> > > > > > that
> > > > > > more requests are pushed into the ringbuffer than server could
> > > > > > handle.
> > > > > > 
> > > > > > There is one potential issue though that probably did justify the
> > > > > > "don't
> > > > > > exceed the queue size" rule:
> > > > > > 
> > > > > > ATM the descriptor table is allocated (just in time) as *one*
> > > > > > continuous
> > > > > > buffer via kmalloc_array():
> > > > > > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798c
> > > > > > a086
> > > > > > f7c7
> > > > > > d33a4/drivers/virtio/virtio_ring.c#L440
> > > > > > 
> > > > > > So assuming transmission size of 2 * 128 MB that kmalloc_array()
> > > > > > call
> > > > > > would
> > > > > > yield in kmalloc(1M) and the latter might fail if guest had highly
> > > > > > fragmented physical memory. For such kind of error case there is
> > > > > > currently a fallback path in virtqueue_add_split() that would then
> > > > > > use
> > > > > > the required amount of pre-allocated descriptors instead:
> > > > > > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798c
> > > > > > a086
> > > > > > f7c7
> > > > > > d33a4/drivers/virtio/virtio_ring.c#L525
> > > > > > 
> > > > > > That fallback recovery path would no longer be viable if the queue
> > > > > > size
> > > > > > was
> > > > > > exceeded. There would be alternatives though, e.g. by allowing to
> > > > > > chain
> > > > > > indirect descriptor tables (currently prohibited by the virtio
> > > > > > specs).
> > > > > 
> > > > > Making the maximum number of descriptors independent of the queue
> > > > > size
> > > > > requires a change to the VIRTIO spec since the two values are
> > > > > currently
> > > > > explicitly tied together by the spec.
> > > > 
> > > > Yes, that's what the virtio specs say. But they don't say why, nor did
> > > > I
> > > > hear a reason in this discussion.
> > > > 
> > > > That's why I invested time reviewing current virtio implementation and
> > > > specs, as well as actually testing exceeding that limit. And as I
> > > > outlined in detail in my previous email, I only found one theoretical
> > > > issue that could be addressed though.
> > > 
> > > I agree that there is a limitation in the VIRTIO spec, but violating the
> > > spec isn't an acceptable solution:
> > > 
> > > 1. QEMU and Linux aren't the only components that implement VIRTIO. You
> > > 
> > >    cannot make assumptions about their implementations because it may
> > >    break spec-compliant implementations that you haven't looked at.
> > >    
> > >    Your patches weren't able to increase Queue Size because some device
> > >    implementations break when descriptor chains are too long. This shows
> > >    there is a practical issue even in QEMU.
> > > 
> > > 2. The specific spec violation that we discussed creates the problem
> > > 
> > >    that drivers can no longer determine the maximum description chain
> > >    length. This in turn will lead to more implementation-specific
> > >    assumptions being baked into drivers and cause problems with
> > >    interoperability and future changes.
> > > 
> > > The spec needs to be extended instead. I included an idea for how to do
> > > that below.
> > 
> > Sure, I just wanted to see if there was a non-negligible "hard" show
> > stopper per se that I probably haven't seen yet. I have not questioned
> > aiming for a clean solution.
> > 
> > Thanks for the clarification!
> > 
> > > > > Before doing that, are there benchmark results showing that 1 MB vs
> > > > > 128
> > > > > MB produces a performance improvement? I'm asking because if
> > > > > performance
> > > > > with 1 MB is good then you can probably do that without having to
> > > > > change
> > > > > VIRTIO and also because it's counter-intuitive that 9p needs 128 MB
> > > > > for
> > > > > good performance when it's ultimately implemented on top of disk and
> > > > > network I/O that have lower size limits.
> > > > 
> > > > First some numbers, linear reading a 12 GB file:
> > > > 
> > > > msize    average      notes
> > > > 
> > > > 8 kB     52.0 MB/s    default msize of Linux kernel <v5.15
> > > > 128 kB   624.8 MB/s   default msize of Linux kernel >=v5.15
> > > > 512 kB   1961 MB/s    current max. msize with any Linux kernel <=v5.15
> > > > 1 MB     2551 MB/s    this msize would already violate virtio specs
> > > > 2 MB     2521 MB/s    this msize would already violate virtio specs
> > > > 4 MB     2628 MB/s    planned max. msize of my current kernel patches
> > > > [1]
> > > 
> > > How many descriptors are used? 4 MB can be covered by a single
> > > descriptor if the data is physically contiguous in memory, so this data
> > > doesn't demonstrate a need for more descriptors.
> > 
> > No, in the last couple years there was apparently no kernel version that
> > used just one descriptor, nor did my benchmarked version. Even though the
> > Linux 9p client uses (yet) simple linear buffers (contiguous physical
> > memory) on 9p client level, these are however split into PAGE_SIZE chunks
> > by function pack_sg_list() [1] before being fed to virtio level:
> > 
> > static unsigned int rest_of_page(void *data)
> > {
> > 
> > 	return PAGE_SIZE - offset_in_page(data);
> > 
> > }
> > ...
> > static int pack_sg_list(struct scatterlist *sg, int start,
> > 
> > 			int limit, char *data, int count)
> > 
> > {
> > 
> > 	int s;
> > 	int index = start;
> > 	
> > 	while (count) {
> > 	
> > 		s = rest_of_page(data);
> > 		...
> > 		sg_set_buf(&sg[index++], data, s);
> > 		count -= s;
> > 		data += s;
> > 	
> > 	}
> > 	...
> > 
> > }
> > 
> > [1]
> > https://github.com/torvalds/linux/blob/19901165d90fdca1e57c9baa0d5b4c63d1
> > 5c476a/net/9p/trans_virtio.c#L171
> > 
> > So when sending 4MB over virtio wire, it would yield in 1k descriptors
> > ATM.
> > 
> > I have wondered about this before, but did not question it, because due to
> > the cross-platform nature I couldn't say for certain whether that's
> > probably needed somewhere. I mean for the case virtio-PCI I know for sure
> > that one descriptor (i.e. >PAGE_SIZE) would be fine, but I don't know if
> > that applies to all buses and architectures.
> 
> VIRTIO does not limit the descriptor len field to PAGE_SIZE, so I don't
> think there is a limit at the VIRTIO level.

So you are viewing this purely from the virtio specs' PoV: in the sense that
if it is not prohibited by the virtio specs, then it should work. Maybe.

> If this function coalesces adjacent pages then the descriptor chain
> length issues could be reduced.
> 
> > > > But again, this is not just about performance. My conclusion as
> > > > described
> > > > in my previous email is that virtio currently squeezes
> > > > 
> > > > 	"max. simultaneous amount of bulk messages"
> > > > 
> > > > vs.
> > > > 
> > > > 	"max. bulk data transmission size per bulk message"
> > > > 
> > > > into the same configuration parameter, which is IMO inappropriate and
> > > > hence
> > > > splitting them into 2 separate parameters when creating a queue makes
> > > > sense, independent of the performance benchmarks.
> > > > 
> > > > [1]
> > > > https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyt
> > > > e.c
> > > > om/
> > > 
> > > Some devices effectively already have this because the device advertises
> > > a maximum number of descriptors via device-specific mechanisms like the
> > > struct virtio_blk_config seg_max field. But today these fields can only
> > > reduce the maximum descriptor chain length because the spec still limits
> > > the length to Queue Size.
> > > 
> > > We can build on this approach to raise the length above Queue Size. This
> > > approach has the advantage that the maximum number of segments isn't per
> > > device or per virtqueue, it's fine-grained. If the device supports two
> > > requests types then different max descriptor chain limits could be given
> > > for them by introducing two separate configuration space fields.
> > > 
> > > Here are the corresponding spec changes:
> > > 
> > > 1. A new feature bit called VIRTIO_RING_F_LARGE_INDIRECT_DESC is added
> > > 
> > >    to indicate that indirect descriptor table size and maximum
> > >    descriptor chain length are not limited by Queue Size value. (Maybe
> > >    there still needs to be a limit like 2^15?)
> > 
> > Sounds good to me!
> > 
> > AFAIK it is effectively limited to 2^16 because of vring_desc->next:
> > 
> > /* Virtio ring descriptors: 16 bytes.  These can chain together via
> > "next". */ struct vring_desc {
> > 
> >         /* Address (guest-physical). */
> >         __virtio64 addr;
> >         /* Length. */
> >         __virtio32 len;
> >         /* The flags as indicated above. */
> >         __virtio16 flags;
> >         /* We chain unused descriptors via this, too */
> >         __virtio16 next;
> > 
> > };
> 
> Yes, Split Virtqueues have a fundamental limit on indirect table size
> due to the "next" field. Packed Virtqueue descriptors don't have a
> "next" field so descriptor chains could be longer in theory (currently
> forbidden by the spec).
> 
> > > One thing that's messy is that we've been discussing the maximum
> > > descriptor chain length but 9p has the "msize" concept, which isn't
> > > aware of contiguous memory. It may be necessary to extend the 9p driver
> > > code to size requests not just according to their length in bytes but
> > > also according to the descriptor chain length. That's how the Linux
> > > block layer deals with queue limits (struct queue_limits max_segments vs
> > > max_hw_sectors).
> > 
> > Hmm, can't follow on that one. For what should that be needed in case of
> > 9p? My plan was to limit msize by 9p client simply at session start to
> > whatever is the max. amount virtio descriptors supported by host and
> > using PAGE_SIZE as size per descriptor, because that's what 9p client
> > actually does ATM (see above). So you think that should be changed to
> > e.g. just one descriptor for 4MB, right?
> 
> Limiting msize to the 9p transport device's maximum number of
> descriptors is conservative (i.e. 128 descriptors = 512 KB msize)
> because it doesn't take advantage of contiguous memory. I suggest
> leaving msize alone, adding a separate limit at which requests are split
> according to the maximum descriptor chain length, and tweaking
> pack_sg_list() to coalesce adjacent pages.
> 
> That way msize can be large without necessarily using lots of
> descriptors (depending on the memory layout).

That was actually a tempting solution, because it would not require changes to
the virtio specs (at least for a while) and it would also work with older QEMU
versions. And for the pack_sg_list() portion of the code it would work well
and easily, as the buffer passed to pack_sg_list() is already contiguous.

However, I just realized that this would be trickier for the zero-copy version
of the code. The ZC version already uses individual pages (struct page, hence
PAGE_SIZE each) which are pinned, i.e. it uses pack_sg_list_p() [1] in
combination with p9_get_mapped_pages() [2]:

[1] https://github.com/torvalds/linux/blob/7ddb58cb0ecae8e8b6181d736a87667cc9ab8389/net/9p/trans_virtio.c#L218
[2] https://github.com/torvalds/linux/blob/7ddb58cb0ecae8e8b6181d736a87667cc9ab8389/net/9p/trans_virtio.c#L309

So that would require much more work and code trying to sort and coalesce
individual pages into contiguous physical memory for the sake of reducing
virtio descriptors. And there is no guarantee that this is even possible: the
kernel may simply return a non-contiguous set of pages, which would eventually
end up exceeding the virtio descriptor limit again.
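
Just to sketch what such a coalescing attempt for the pinned pages could look
like (hypothetical code, pack_pages_coalesced() does not exist in the kernel;
it merges runs of physically adjacent pages into single scatterlist entries
and fails when the remaining runs still exceed the descriptor limit):

static int pack_pages_coalesced(struct scatterlist *sg, int limit,
                                struct page **pages, int nr_pages,
                                size_t offset, size_t count)
{
        int index = 0;
        int i = 0;

        while (count && i < nr_pages) {
                struct page *first = pages[i];
                size_t len = PAGE_SIZE - offset;

                /* extend the run while the next page is physically adjacent */
                while (i + 1 < nr_pages && len < count &&
                       page_to_pfn(pages[i + 1]) == page_to_pfn(pages[i]) + 1) {
                        len += PAGE_SIZE;
                        i++;
                }
                if (len > count)
                        len = count;

                if (index >= limit)
                        return -ENOMEM; /* still too many non-adjacent runs */
                sg_set_page(&sg[index++], first, len, offset);

                count -= len;
                offset = 0;
                i++;
        }
        if (index)
                sg_mark_end(&sg[index - 1]);
        return index;
}

Whether that ever pays off depends entirely on how fragmented the pinned user
memory happens to be, which is exactly the "no guarantee" problem above.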

So it looks like it is probably still easier and more realistic to just add a
virtio capability for now that allows exceeding the current descriptor limit.
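
For illustration only, negotiating such a capability on the guest side could
look roughly like this (the feature bit name and value are placeholders, none
of this exists in the spec or the kernel yet):

/* hypothetical feature bit; the actual value would be assigned by the spec */
#define VIRTIO_RING_F_LARGE_INDIRECT_DESC  42

static unsigned int p9_virtio_max_segs(struct virtio_device *vdev,
                                       struct virtqueue *vq)
{
        if (virtio_has_feature(vdev, VIRTIO_RING_F_LARGE_INDIRECT_DESC))
                return 32 * 1024;               /* e.g. capped at 2^15 */
        /* default per current spec: limited by Queue Size */
        return virtqueue_get_vring_size(vq);
}

The 9p client would then derive its request split limit (and possibly the
msize it advertises) from that value instead of from the queue size.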

Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Virtio-fs] [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
@ 2021-11-04 14:41                           ` Christian Schoenebeck
  0 siblings, 0 replies; 97+ messages in thread
From: Christian Schoenebeck @ 2021-11-04 14:41 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	Jason Wang, Amit Shah, David Hildenbrand, Raphael Norwitz,
	virtio-fs, Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Paolo Bonzini, Marc-André Lureau, Fam Zheng

On Mittwoch, 3. November 2021 12:33:33 CET Stefan Hajnoczi wrote:
> On Mon, Nov 01, 2021 at 09:29:26PM +0100, Christian Schoenebeck wrote:
> > On Donnerstag, 28. Oktober 2021 11:00:48 CET Stefan Hajnoczi wrote:
> > > On Mon, Oct 25, 2021 at 05:03:25PM +0200, Christian Schoenebeck wrote:
> > > > On Montag, 25. Oktober 2021 12:30:41 CEST Stefan Hajnoczi wrote:
> > > > > On Thu, Oct 21, 2021 at 05:39:28PM +0200, Christian Schoenebeck wrote:
> > > > > > On Freitag, 8. Oktober 2021 18:08:48 CEST Christian Schoenebeck wrote:
> > > > > > > On Freitag, 8. Oktober 2021 16:24:42 CEST Christian Schoenebeck wrote:
> > > > > > > > On Freitag, 8. Oktober 2021 09:25:33 CEST Greg Kurz wrote:
> > > > > > > > > On Thu, 7 Oct 2021 16:42:49 +0100
> > > > > > > > > 
> > > > > > > > > Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > > > > > On Thu, Oct 07, 2021 at 02:51:55PM +0200, Christian Schoenebeck wrote:
> > > > > > > > > > > On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan Hajnoczi wrote:
> > > > > > > > > > > > On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian
> > > > > > > > > > > > Schoenebeck
> > > > > > > 
> > > > > > > wrote:
> > > > > > > > > > > > > At the moment the maximum transfer size with virtio
> > > > > > > > > > > > > is
> > > > > > > > > > > > > limited
> > > > > > > > > > > > > to
> > > > > > > > > > > > > 4M
> > > > > > > > > > > > > (1024 * PAGE_SIZE). This series raises this limit to
> > > > > > > > > > > > > its
> > > > > > > > > > > > > maximum
> > > > > > > > > > > > > theoretical possible transfer size of 128M (32k
> > > > > > > > > > > > > pages)
> > > > > > > > > > > > > according
> > > > > > > > > > > > > to
> > > > > > > > > > > > > the
> > > > > > > > > > > > > virtio specs:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/
> > > > > > > > > > > > > virt
> > > > > > > > > > > > > io-v
> > > > > > > > > > > > > 1.1-
> > > > > > > > > > > > > cs
> > > > > > > > > > > > > 01
> > > > > > > > > > > > > .html#
> > > > > > > > > > > > > x1-240006
> > > > > > > > > > > > 
> > > > > > > > > > > > Hi Christian,
> > > > > > > > > 
> > > > > > > > > > > > I took a quick look at the code:
> > > > > > > > > Hi,
> > > > > > > > > 
> > > > > > > > > Thanks Stefan for sharing virtio expertise and helping
> > > > > > > > > Christian
> > > > > > > > > !
> > > > > > > > > 
> > > > > > > > > > > > - The Linux 9p driver restricts descriptor chains to
> > > > > > > > > > > > 128
> > > > > > > > > > > > elements
> > > > > > > > > > > > 
> > > > > > > > > > > >   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)
> > > > > > > > > > > 
> > > > > > > > > > > Yes, that's the limitation that I am about to remove
> > > > > > > > > > > (WIP);
> > > > > > > > > > > current
> > > > > > > > > > > kernel
> > > > > > > > > > > patches:
> > > > > > > > > > > https://lore.kernel.org/netdev/cover.1632327421.git.linu
> > > > > > > > > > > x_os
> > > > > > > > > > > s@cr
> > > > > > > > > > > udeb
> > > > > > > > > > > yt
> > > > > > > > > > > e.
> > > > > > > > > > > com/>
> > > > > > > > > > 
> > > > > > > > > > I haven't read the patches yet but I'm concerned that
> > > > > > > > > > today
> > > > > > > > > > the
> > > > > > > > > > driver
> > > > > > > > > > is pretty well-behaved and this new patch series
> > > > > > > > > > introduces a
> > > > > > > > > > spec
> > > > > > > > > > violation. Not fixing existing spec violations is okay,
> > > > > > > > > > but
> > > > > > > > > > adding
> > > > > > > > > > new
> > > > > > > > > > ones is a red flag. I think we need to figure out a clean
> > > > > > > > > > solution.
> > > > > > > > 
> > > > > > > > Nobody has reviewed the kernel patches yet. My main concern
> > > > > > > > therefore
> > > > > > > > actually is that the kernel patches are already too complex,
> > > > > > > > because
> > > > > > > > the
> > > > > > > > current situation is that only Dominique is handling 9p
> > > > > > > > patches on
> > > > > > > > kernel
> > > > > > > > side, and he barely has time for 9p anymore.
> > > > > > > > 
> > > > > > > > Another reason for me to catch up on reading current kernel
> > > > > > > > code
> > > > > > > > and
> > > > > > > > stepping in as reviewer of 9p on kernel side ASAP, independent
> > > > > > > > of
> > > > > > > > this
> > > > > > > > issue.
> > > > > > > > 
> > > > > > > > As for current kernel patches' complexity: I can certainly
> > > > > > > > drop
> > > > > > > > patch
> > > > > > > > 7
> > > > > > > > entirely as it is probably just overkill. Patch 4 is then the
> > > > > > > > biggest
> > > > > > > > chunk, I have to see if I can simplify it, and whether it
> > > > > > > > would
> > > > > > > > make
> > > > > > > > sense to squash with patch 3.
> > > > > > > > 
> > > > > > > > > > > > - The QEMU 9pfs code passes iovecs directly to
> > > > > > > > > > > > preadv(2)
> > > > > > > > > > > > and
> > > > > > > > > > > > will
> > > > > > > > > > > > fail
> > > > > > > > > > > > 
> > > > > > > > > > > >   with EINVAL when called with more than IOV_MAX
> > > > > > > > > > > >   iovecs
> > > > > > > > > > > >   (hw/9pfs/9p.c:v9fs_read())
> > > > > > > > > > > 
> > > > > > > > > > > Hmm, which makes me wonder why I never encountered this
> > > > > > > > > > > error
> > > > > > > > > > > during
> > > > > > > > > > > testing.
> > > > > > > > > > > 
> > > > > > > > > > > Most people will use the 9p qemu 'local' fs driver
> > > > > > > > > > > backend
> > > > > > > > > > > in
> > > > > > > > > > > practice,
> > > > > > > > > > > so
> > > > > > > > > > > that v9fs_read() call would translate for most people to
> > > > > > > > > > > this
> > > > > > > > > > > implementation on QEMU side (hw/9p/9p-local.c):
> > > > > > > > > > > 
> > > > > > > > > > > static ssize_t local_preadv(FsContext *ctx,
> > > > > > > > > > > V9fsFidOpenState
> > > > > > > > > > > *fs,
> > > > > > > > > > > 
> > > > > > > > > > >                             const struct iovec *iov,
> > > > > > > > > > >                             int iovcnt, off_t offset)
> > > > > > > > > > > 
> > > > > > > > > > > {
> > > > > > > > > > > #ifdef CONFIG_PREADV
> > > > > > > > > > > 
> > > > > > > > > > >     return preadv(fs->fd, iov, iovcnt, offset);
> > > > > > > > > > > 
> > > > > > > > > > > #else
> > > > > > > > > > > 
> > > > > > > > > > >     int err = lseek(fs->fd, offset, SEEK_SET);
> > > > > > > > > > >     if (err == -1) {
> > > > > > > > > > >     
> > > > > > > > > > >         return err;
> > > > > > > > > > >     
> > > > > > > > > > >     } else {
> > > > > > > > > > >     
> > > > > > > > > > >         return readv(fs->fd, iov, iovcnt);
> > > > > > > > > > >     
> > > > > > > > > > >     }
> > > > > > > > > > > 
> > > > > > > > > > > #endif
> > > > > > > > > > > }
> > > > > > > > > > > 
> > > > > > > > > > > > Unless I misunderstood the code, neither side can take
> > > > > > > > > > > > advantage
> > > > > > > > > > > > of
> > > > > > > > > > > > the
> > > > > > > > > > > > new 32k descriptor chain limit?
> > > > > > > > > > > > 
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Stefan
> > > > > > > > > > > 
> > > > > > > > > > > I need to check that when I have some more time. One
> > > > > > > > > > > possible
> > > > > > > > > > > explanation
> > > > > > > > > > > might be that preadv() already has this wrapped into a
> > > > > > > > > > > loop
> > > > > > > > > > > in
> > > > > > > > > > > its
> > > > > > > > > > > implementation to circumvent a limit like IOV_MAX. It
> > > > > > > > > > > might
> > > > > > > > > > > be
> > > > > > > > > > > another
> > > > > > > > > > > "it
> > > > > > > > > > > works, but not portable" issue, but not sure.
> > > > > > > > > > > 
> > > > > > > > > > > There are still a bunch of other issues I have to
> > > > > > > > > > > resolve.
> > > > > > > > > > > If
> > > > > > > > > > > you
> > > > > > > > > > > look
> > > > > > > > > > > at
> > > > > > > > > > > net/9p/client.c on kernel side, you'll notice that it
> > > > > > > > > > > basically
> > > > > > > > > > > does
> > > > > > > > > > > this ATM> >
> > > > > > > > > > > 
> > > > > > > > > > >     kmalloc(msize);
> > > > > > > > > 
> > > > > > > > > Note that this is done twice : once for the T message
> > > > > > > > > (client
> > > > > > > > > request)
> > > > > > > > > and
> > > > > > > > > once for the R message (server answer). The 9p driver could
> > > > > > > > > adjust
> > > > > > > > > the
> > > > > > > > > size
> > > > > > > > > of the T message to what's really needed instead of
> > > > > > > > > allocating
> > > > > > > > > the
> > > > > > > > > full
> > > > > > > > > msize. R message size is not known though.
> > > > > > > > 
> > > > > > > > Would it make sense adding a second virtio ring, dedicated to
> > > > > > > > server
> > > > > > > > responses to solve this? IIRC 9p server already calculates
> > > > > > > > appropriate
> > > > > > > > exact sizes for each response type. So server could just push
> > > > > > > > space
> > > > > > > > that's
> > > > > > > > really needed for its responses.
> > > > > > > > 
> > > > > > > > > > > for every 9p request. So not only does it allocate much
> > > > > > > > > > > more
> > > > > > > > > > > memory
> > > > > > > > > > > for
> > > > > > > > > > > every request than actually required (i.e. say 9pfs was
> > > > > > > > > > > mounted
> > > > > > > > > > > with
> > > > > > > > > > > msize=8M, then a 9p request that actually would just
> > > > > > > > > > > need 1k
> > > > > > > > > > > would
> > > > > > > > > > > nevertheless allocate 8M), but also it allocates >
> > > > > > > > > > > PAGE_SIZE,
> > > > > > > > > > > which
> > > > > > > > > > > obviously may fail at any time.>
> > > > > > > > > > 
> > > > > > > > > > The PAGE_SIZE limitation sounds like a kmalloc() vs
> > > > > > > > > > vmalloc()
> > > > > > > > > > situation.
> > > > > > > > 
> > > > > > > > Hu, I didn't even consider vmalloc(). I just tried the
> > > > > > > > kvmalloc()
> > > > > > > > wrapper
> > > > > > > > as a quick & dirty test, but it crashed in the same way as
> > > > > > > > kmalloc()
> > > > > > > > with
> > > > > > > > large msize values immediately on mounting:
> > > > > > > > 
> > > > > > > > diff --git a/net/9p/client.c b/net/9p/client.c
> > > > > > > > index a75034fa249b..cfe300a4b6ca 100644
> > > > > > > > --- a/net/9p/client.c
> > > > > > > > +++ b/net/9p/client.c
> > > > > > > > @@ -227,15 +227,18 @@ static int parse_opts(char *opts, struct
> > > > > > > > p9_client
> > > > > > > > *clnt)
> > > > > > > > 
> > > > > > > >  static int p9_fcall_init(struct p9_client *c, struct p9_fcall
> > > > > > > >  *fc,
> > > > > > > >  
> > > > > > > >                          int alloc_msize)
> > > > > > > >  
> > > > > > > >  {
> > > > > > > > 
> > > > > > > > -       if (likely(c->fcall_cache) && alloc_msize == c->msize)
> > > > > > > > {
> > > > > > > > +       //if (likely(c->fcall_cache) && alloc_msize ==
> > > > > > > > c->msize) {
> > > > > > > > +       if (false) {
> > > > > > > > 
> > > > > > > >                 fc->sdata = kmem_cache_alloc(c->fcall_cache,
> > > > > > > >                 GFP_NOFS);
> > > > > > > >                 fc->cache = c->fcall_cache;
> > > > > > > >         
> > > > > > > >         } else {
> > > > > > > > 
> > > > > > > > -               fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
> > > > > > > > +               fc->sdata = kvmalloc(alloc_msize, GFP_NOFS);
> > > > > > > 
> > > > > > > Ok, GFP_NOFS -> GFP_KERNEL did the trick.
> > > > > > > 
> > > > > > > Now I get:
> > > > > > >    virtio: bogus descriptor or out of resources
> > > > > > > 
> > > > > > > So, still some work ahead on both ends.
> > > > > > 
> > > > > > Few hacks later (only changes on 9p client side) I got this
> > > > > > running
> > > > > > stable
> > > > > > now. The reason for the virtio error above was that kvmalloc()
> > > > > > returns
> > > > > > a
> > > > > > non-logical kernel address for any kvmalloc(>4M), i.e. an address
> > > > > > that
> > > > > > is
> > > > > > inaccessible from host side, hence that "bogus descriptor" message
> > > > > > by
> > > > > > QEMU.
> > > > > > So I had to split those linear 9p client buffers into sparse ones
> > > > > > (set
> > > > > > of
> > > > > > individual pages).
> > > > > > 
> > > > > > I tested this for some days with various virtio transmission sizes
> > > > > > and
> > > > > > it
> > > > > > works as expected up to 128 MB (more precisely: 128 MB read space
> > > > > > +
> > > > > > 128 MB
> > > > > > write space per virtio round trip message).
> > > > > > 
> > > > > > I did not encounter a show stopper for large virtio transmission
> > > > > > sizes
> > > > > > (4 MB ... 128 MB) on virtio level, neither as a result of testing,
> > > > > > nor
> > > > > > after reviewing the existing code.
> > > > > > 
> > > > > > About IOV_MAX: that's apparently not an issue on virtio level.
> > > > > > Most of
> > > > > > the
> > > > > > iovec code, both on Linux kernel side and on QEMU side do not have
> > > > > > this
> > > > > > limitation. It is apparently however indeed a limitation for
> > > > > > userland
> > > > > > apps
> > > > > > calling the Linux kernel's syscalls yet.
> > > > > > 
> > > > > > Stefan, as it stands now, I am even more convinced that the upper
> > > > > > virtio
> > > > > > transmission size limit should not be squeezed into the queue size
> > > > > > argument of virtio_add_queue(). Not because of the previous
> > > > > > argument
> > > > > > that
> > > > > > it would waste space (~1MB), but rather because they are two
> > > > > > different
> > > > > > things. To outline this, just a quick recap of what happens
> > > > > > exactly
> > > > > > when
> > > > > > a bulk message is pushed over the virtio wire (assuming virtio
> > > > > > "split"
> > > > > > layout here):
> > > > > > 
> > > > > > ---------- [recap-start] ----------
> > > > > > 
> > > > > > For each bulk message sent guest <-> host, exactly *one* of the
> > > > > > pre-allocated descriptors is taken and placed (subsequently) into
> > > > > > exactly
> > > > > > *one* position of the two available/used ring buffers. The actual
> > > > > > descriptor table though, containing all the DMA addresses of the
> > > > > > message
> > > > > > bulk data, is allocated just in time for each round trip message.
> > > > > > Say,
> > > > > > it
> > > > > > is the first message sent, it yields in the following structure:
> > > > > > 
> > > > > > Ring Buffer   Descriptor Table      Bulk Data Pages
> > > > > > 
> > > > > >    +-+              +-+           +-----------------+
> > > > > >    
> > > > > >    |D|------------->|d|---------->| Bulk data block |
> > > > > >    
> > > > > >    +-+              |d|--------+  +-----------------+
> > > > > >    
> > > > > >    | |              |d|------+ |
> > > > > >    
> > > > > >    +-+               .       | |  +-----------------+
> > > > > >    
> > > > > >    | |               .       | +->| Bulk data block |
> > > > > >     
> > > > > >     .                .       |    +-----------------+
> > > > > >     .               |d|-+    |
> > > > > >     .               +-+ |    |    +-----------------+
> > > > > >     
> > > > > >    | |                  |    +--->| Bulk data block |
> > > > > >    
> > > > > >    +-+                  |         +-----------------+
> > > > > >    
> > > > > >    | |                  |                 .
> > > > > >    
> > > > > >    +-+                  |                 .
> > > > > >    
> > > > > >                         |                 .
> > > > > >                         |         
> > > > > >                         |         +-----------------+
> > > > > >                         
> > > > > >                         +-------->| Bulk data block |
> > > > > >                         
> > > > > >                                   +-----------------+
> > > > > > 
> > > > > > Legend:
> > > > > > D: pre-allocated descriptor
> > > > > > d: just in time allocated descriptor
> > > > > > -->: memory pointer (DMA)
> > > > > > 
> > > > > > The bulk data blocks are allocated by the respective device driver
> > > > > > above
> > > > > > virtio subsystem level (guest side).
> > > > > > 
> > > > > > There are exactly as many descriptors pre-allocated (D) as the
> > > > > > size of
> > > > > > a
> > > > > > ring buffer.
> > > > > > 
> > > > > > A "descriptor" is more or less just a chainable DMA memory
> > > > > > pointer;
> > > > > > defined
> > > > > > as:
> > > > > > 
> > > > > > /* Virtio ring descriptors: 16 bytes.  These can chain together
> > > > > > via
> > > > > > "next". */ struct vring_desc {
> > > > > > 
> > > > > > 	/* Address (guest-physical). */
> > > > > > 	__virtio64 addr;
> > > > > > 	/* Length. */
> > > > > > 	__virtio32 len;
> > > > > > 	/* The flags as indicated above. */
> > > > > > 	__virtio16 flags;
> > > > > > 	/* We chain unused descriptors via this, too */
> > > > > > 	__virtio16 next;
> > > > > > 
> > > > > > };
> > > > > > 
> > > > > > There are 2 ring buffers; the "available" ring buffer is for
> > > > > > sending a
> > > > > > message guest->host (which will transmit DMA addresses of guest
> > > > > > allocated
> > > > > > bulk data blocks that are used for data sent to device, and
> > > > > > separate
> > > > > > guest allocated bulk data blocks that will be used by host side to
> > > > > > place
> > > > > > its response bulk data), and the "used" ring buffer is for sending
> > > > > > host->guest to let guest know about host's response and that it
> > > > > > could
> > > > > > now
> > > > > > safely consume and then deallocate the bulk data blocks
> > > > > > subsequently.
> > > > > > 
> > > > > > ---------- [recap-end] ----------
> > > > > > 
> > > > > > So the "queue size" actually defines the ringbuffer size. It does
> > > > > > not
> > > > > > define the maximum amount of descriptors. The "queue size" rather
> > > > > > defines
> > > > > > how many pending messages can be pushed into either one ringbuffer
> > > > > > before
> > > > > > the other side would need to wait until the counterpart catches up
> > > > > > (i.e. ring buffer full).
> > > > > > 
> > > > > > The maximum amount of descriptors (what VIRTQUEUE_MAX_SIZE
> > > > > > actually
> > > > > > is)
> > > > > > OTOH defines the max. bulk data size that could be transmitted
> > > > > > with
> > > > > > each
> > > > > > virtio round trip message.
> > > > > > 
> > > > > > And in fact, 9p currently handles the virtio "queue size" as
> > > > > > directly
> > > > > > associated with the maximum number of active 9p requests the
> > > > > > server
> > > > > > could
> > > > > > 
> > > > > > handle simultaneously:
> > > > > >   hw/9pfs/9p.h:#define MAX_REQ         128
> > > > > >   hw/9pfs/9p.h:    V9fsPDU pdus[MAX_REQ];
> > > > > >   hw/9pfs/virtio-9p-device.c:    v->vq = virtio_add_queue(vdev,
> > > > > >   MAX_REQ,
> > > > > >   
> > > > > >                                  handle_9p_output);
> > > > > > 
> > > > > > So if I changed it like this, just in order to
> > > > > > increase
> > > > > > the
> > > > > > max. virtio transmission size:
> > > > > > 
> > > > > > --- a/hw/9pfs/virtio-9p-device.c
> > > > > > +++ b/hw/9pfs/virtio-9p-device.c
> > > > > > @@ -218,7 +218,7 @@ static void
> > > > > > virtio_9p_device_realize(DeviceState
> > > > > > *dev,
> > > > > > Error **errp)>
> > > > > > 
> > > > > >      v->config_size = sizeof(struct virtio_9p_config) +
> > > > > >      strlen(s->fsconf.tag);
> > > > > >      virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
> > > > > >      
> > > > > >                  VIRTQUEUE_MAX_SIZE);
> > > > > > 
> > > > > > -    v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
> > > > > > +    v->vq = virtio_add_queue(vdev, 32*1024, handle_9p_output);
> > > > > > 
> > > > > >  }
> > > > > > 
> > > > > > Then it would require additional synchronization code on both ends
> > > > > > and
> > > > > > therefore unnecessary complexity, because it would now be possible
> > > > > > that
> > > > > > more requests are pushed into the ringbuffer than the server could
> > > > > > handle.
> > > > > > 
> > > > > > There is one potential issue though that probably did justify the
> > > > > > "don't
> > > > > > exceed the queue size" rule:
> > > > > > 
> > > > > > ATM the descriptor table is allocated (just in time) as *one*
> > > > > > contiguous
> > > > > > buffer via kmalloc_array():
> > > > > > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798c
> > > > > > a086
> > > > > > f7c7
> > > > > > d33a4/drivers/virtio/virtio_ring.c#L440
> > > > > > 
> > > > > > So assuming transmission size of 2 * 128 MB that kmalloc_array()
> > > > > > call
> > > > > > would
> > > > > > result in kmalloc(1M) and the latter might fail if the guest had highly
> > > > > > fragmented physical memory. For such kind of error case there is
> > > > > > currently a fallback path in virtqueue_add_split() that would then
> > > > > > use
> > > > > > the required amount of pre-allocated descriptors instead:
> > > > > > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798c
> > > > > > a086
> > > > > > f7c7
> > > > > > d33a4/drivers/virtio/virtio_ring.c#L525
> > > > > > 
> > > > > > That fallback recovery path would no longer be viable if the queue
> > > > > > size
> > > > > > was
> > > > > > exceeded. There would be alternatives though, e.g. by allowing to
> > > > > > chain
> > > > > > indirect descriptor tables (currently prohibited by the virtio
> > > > > > specs).
> > > > > 
> > > > > Making the maximum number of descriptors independent of the queue
> > > > > size
> > > > > requires a change to the VIRTIO spec since the two values are
> > > > > currently
> > > > > explicitly tied together by the spec.
> > > > 
> > > > Yes, that's what the virtio specs say. But they don't say why, nor did
> > > > I
> > > > hear a reason in this discussion.
> > > > 
> > > > That's why I invested time reviewing current virtio implementation and
> > > > specs, as well as actually testing exceeding that limit. And as I
> > > > outlined in detail in my previous email, I only found one theoretical
> > > > issue that could be addressed though.
> > > 
> > > I agree that there is a limitation in the VIRTIO spec, but violating the
> > > spec isn't an acceptable solution:
> > > 
> > > 1. QEMU and Linux aren't the only components that implement VIRTIO. You
> > > 
> > >    cannot make assumptions about their implementations because it may
> > >    break spec-compliant implementations that you haven't looked at.
> > >    
> > >    Your patches weren't able to increase Queue Size because some device
> > >    implementations break when descriptor chains are too long. This shows
> > >    there is a practical issue even in QEMU.
> > > 
> > > 2. The specific spec violation that we discussed creates the problem
> > > 
> > >    that drivers can no longer determine the maximum descriptor chain
> > >    length. This in turn will lead to more implementation-specific
> > >    assumptions being baked into drivers and cause problems with
> > >    interoperability and future changes.
> > > 
> > > The spec needs to be extended instead. I included an idea for how to do
> > > that below.
> > 
> > Sure, I just wanted to see if there was a non-negligible "hard" show
> > stopper per se that I probably haven't seen yet. I have not questioned
> > aiming at a clean solution.
> > 
> > Thanks for the clarification!
> > 
> > > > > Before doing that, are there benchmark results showing that 1 MB vs
> > > > > 128
> > > > > MB produces a performance improvement? I'm asking because if
> > > > > performance
> > > > > with 1 MB is good then you can probably do that without having to
> > > > > change
> > > > > VIRTIO and also because it's counter-intuitive that 9p needs 128 MB
> > > > > for
> > > > > good performance when it's ultimately implemented on top of disk and
> > > > > network I/O that have lower size limits.
> > > > 
> > > > First some numbers, linear reading a 12 GB file:
> > > > 
> > > > msize    average      notes
> > > > 
> > > > 8 kB     52.0 MB/s    default msize of Linux kernel <v5.15
> > > > 128 kB   624.8 MB/s   default msize of Linux kernel >=v5.15
> > > > 512 kB   1961 MB/s    current max. msize with any Linux kernel <=v5.15
> > > > 1 MB     2551 MB/s    this msize would already violate virtio specs
> > > > 2 MB     2521 MB/s    this msize would already violate virtio specs
> > > > 4 MB     2628 MB/s    planned max. msize of my current kernel patches
> > > > [1]
> > > 
> > > How many descriptors are used? 4 MB can be covered by a single
> > > descriptor if the data is physically contiguous in memory, so this data
> > > doesn't demonstrate a need for more descriptors.
> > 
> > No, in the last couple years there was apparently no kernel version that
> > used just one descriptor, nor did my benchmarked version. Even though the
> > Linux 9p client still uses simple linear buffers (contiguous physical
> > memory) on 9p client level, these are however split into PAGE_SIZE chunks
> > by function pack_sg_list() [1] before being fed to virtio level:
> > 
> > static unsigned int rest_of_page(void *data)
> > {
> > 
> > 	return PAGE_SIZE - offset_in_page(data);
> > 
> > }
> > ...
> > static int pack_sg_list(struct scatterlist *sg, int start,
> > 
> > 			int limit, char *data, int count)
> > 
> > {
> > 
> > 	int s;
> > 	int index = start;
> > 	
> > 	while (count) {
> > 	
> > 		s = rest_of_page(data);
> > 		...
> > 		sg_set_buf(&sg[index++], data, s);
> > 		count -= s;
> > 		data += s;
> > 	
> > 	}
> > 	...
> > 
> > }
> > 
> > [1]
> > https://github.com/torvalds/linux/blob/19901165d90fdca1e57c9baa0d5b4c63d1
> > 5c476a/net/9p/trans_virtio.c#L171
> > 
> > So when sending 4MB over the virtio wire, it would yield 1k descriptors
> > ATM.
> > 
> > I have wondered about this before, but did not question it, because due to
> > the cross-platform nature I couldn't say for certain whether that's
> > perhaps needed somewhere. I mean in the case of virtio-PCI I know for sure
> > that one descriptor (i.e. >PAGE_SIZE) would be fine, but I don't know if
> > that applies to all buses and architectures.
> 
> VIRTIO does not limit the descriptor len field to PAGE_SIZE,
> so I don't think there is a limit at the VIRTIO level.

So you are viewing this purely from the virtio specs' PoV: in the sense that
if it is not prohibited by the virtio specs, then it should work. Maybe.

> If this function coalesces adjacent pages then the descriptor chain
> length issues could be reduced.
> 
> > > > But again, this is not just about performance. My conclusion as
> > > > described
> > > > in my previous email is that virtio currently squeezes
> > > > 
> > > > 	"max. simultanious amount of bulk messages"
> > > > 
> > > > vs.
> > > > 
> > > > 	"max. bulk data transmission size per bulk messaage"
> > > > 
> > > > into the same configuration parameter, which is IMO inappropriate and
> > > > hence
> > > > splitting them into 2 separate parameters when creating a queue makes
> > > > sense, independent of the performance benchmarks.
> > > > 
> > > > [1]
> > > > https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyt
> > > > e.c
> > > > om/
> > > 
> > > Some devices effectively already have this because the device advertises
> > > a maximum number of descriptors via device-specific mechanisms like the
> > > struct virtio_blk_config seg_max field. But today these fields can only
> > > reduce the maximum descriptor chain length because the spec still limits
> > > the length to Queue Size.
> > > 
> > > We can build on this approach to raise the length above Queue Size. This
> > > approach has the advantage that the maximum number of segments isn't per
> > > device or per virtqueue, it's fine-grained. If the device supports two
> > > requests types then different max descriptor chain limits could be given
> > > for them by introducing two separate configuration space fields.
> > > 
> > > Here are the corresponding spec changes:
> > > 
> > > 1. A new feature bit called VIRTIO_RING_F_LARGE_INDIRECT_DESC is added
> > > 
> > >    to indicate that indirect descriptor table size and maximum
> > >    descriptor chain length are not limited by Queue Size value. (Maybe
> > >    there still needs to be a limit like 2^15?)
> > 
> > Sounds good to me!
> > 
> > AFAIK it is effectively limited to 2^16 because of vring_desc->next:
> > 
> > /* Virtio ring descriptors: 16 bytes.  These can chain together via
> > "next". */ struct vring_desc {
> > 
> >         /* Address (guest-physical). */
> >         __virtio64 addr;
> >         /* Length. */
> >         __virtio32 len;
> >         /* The flags as indicated above. */
> >         __virtio16 flags;
> >         /* We chain unused descriptors via this, too */
> >         __virtio16 next;
> > 
> > };
> 
> Yes, Split Virtqueues have a fundamental limit on indirect table size
> due to the "next" field. Packed Virtqueue descriptors don't have a
> "next" field so descriptor chains could be longer in theory (currently
> forbidden by the spec).
> 
> > > One thing that's messy is that we've been discussing the maximum
> > > descriptor chain length but 9p has the "msize" concept, which isn't
> > > aware of contiguous memory. It may be necessary to extend the 9p driver
> > > code to size requests not just according to their length in bytes but
> > > also according to the descriptor chain length. That's how the Linux
> > > block layer deals with queue limits (struct queue_limits max_segments vs
> > > max_hw_sectors).
> > 
> > Hmm, I can't follow you on that one. What would that be needed for in case of
> > 9p? My plan was to limit msize by 9p client simply at session start to
> > whatever is the max. number of virtio descriptors supported by the host and
> > using PAGE_SIZE as size per descriptor, because that's what 9p client
> > actually does ATM (see above). So you think that should be changed to
> > e.g. just one descriptor for 4MB, right?
> 
> Limiting msize to the 9p transport device's maximum number of
> descriptors is conservative (i.e. 128 descriptors = 512 KB msize)
> because it doesn't take advantage of contiguous memory. I suggest
> leaving msize alone, adding a separate limit at which requests are split
> according to the maximum descriptor chain length, and tweaking
> pack_sg_list() to coalesce adjacent pages.
> 
> That way msize can be large without necessarily using lots of
> descriptors (depending on the memory layout).

That was actually a tempting solution, because it would neither require
changes to the virtio specs (at least for a while) nor break compatibility with
older QEMU versions. And for the pack_sg_list() portion of the code it would
work well and easily, as the buffer passed to pack_sg_list() is already
contiguous.

However, I just realized that for the zero-copy version of the code this would
be more tricky. The ZC version already uses individual pages (struct page,
hence PAGE_SIZE each) which are pinned, i.e. it uses pack_sg_list_p() [1] in
combination with p9_get_mapped_pages() [2].

[1] https://github.com/torvalds/linux/blob/7ddb58cb0ecae8e8b6181d736a87667cc9ab8389/net/9p/trans_virtio.c#L218
[2] https://github.com/torvalds/linux/blob/7ddb58cb0ecae8e8b6181d736a87667cc9ab8389/net/9p/trans_virtio.c#L309

So that would require much more work and code trying to sort and coalesce
individual pages to contiguous physical memory for the sake of reducing virtio
descriptors. And there is no guarantee that this is even possible. The kernel
may simply return a non-contiguous set of pages which would eventually end up
exceeding the virtio descriptor limit again.

So it looks like it was probably still easier and more realistic to just add
virtio capabilities for now to allow exceeding the current descriptor limit.
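
Just for reference, the rough numbers behind that (assuming 4 KiB pages and
one page per virtio descriptor, which is what the 9p client produces ATM):

	  128 descriptors * 4 KiB = 512 KiB  max. per virtio message (current 9p limit)
	32768 descriptors * 4 KiB = 128 MiB  max. per virtio message (proposed limit)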

Best regards,
Christian Schoenebeck



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
  2021-11-04 14:41                           ` [Virtio-fs] " Christian Schoenebeck
@ 2021-11-09 10:56                             ` Stefan Hajnoczi
  -1 siblings, 0 replies; 97+ messages in thread
From: Stefan Hajnoczi @ 2021-11-09 10:56 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	Jason Wang, Amit Shah, David Hildenbrand, qemu-devel, Greg Kurz,
	virtio-fs, Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Paolo Bonzini, Marc-André Lureau, Fam Zheng,
	Raphael Norwitz, Dr. David Alan Gilbert

[-- Attachment #1: Type: text/plain, Size: 34829 bytes --]

On Thu, Nov 04, 2021 at 03:41:23PM +0100, Christian Schoenebeck wrote:
> On Mittwoch, 3. November 2021 12:33:33 CET Stefan Hajnoczi wrote:
> > On Mon, Nov 01, 2021 at 09:29:26PM +0100, Christian Schoenebeck wrote:
> > > On Donnerstag, 28. Oktober 2021 11:00:48 CET Stefan Hajnoczi wrote:
> > > > On Mon, Oct 25, 2021 at 05:03:25PM +0200, Christian Schoenebeck wrote:
> > > > > On Montag, 25. Oktober 2021 12:30:41 CEST Stefan Hajnoczi wrote:
> > > > > > On Thu, Oct 21, 2021 at 05:39:28PM +0200, Christian Schoenebeck wrote:
> > > > > > > On Freitag, 8. Oktober 2021 18:08:48 CEST Christian Schoenebeck wrote:
> > > > > > > > On Freitag, 8. Oktober 2021 16:24:42 CEST Christian Schoenebeck wrote:
> > > > > > > > > On Freitag, 8. Oktober 2021 09:25:33 CEST Greg Kurz wrote:
> > > > > > > > > > On Thu, 7 Oct 2021 16:42:49 +0100
> > > > > > > > > > 
> > > > > > > > > > Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > > > > > > On Thu, Oct 07, 2021 at 02:51:55PM +0200, Christian Schoenebeck wrote:
> > > > > > > > > > > > On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan Hajnoczi wrote:
> > > > > > > > > > > > > On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian
> > > > > > > > > > > > > Schoenebeck
> > > > > > > > 
> > > > > > > > wrote:
> > > > > > > > > > > > > > At the moment the maximum transfer size with virtio
> > > > > > > > > > > > > > is
> > > > > > > > > > > > > > limited
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > 4M
> > > > > > > > > > > > > > (1024 * PAGE_SIZE). This series raises this limit to
> > > > > > > > > > > > > > its
> > > > > > > > > > > > > > maximum
> > > > > > > > > > > > > > theoretical possible transfer size of 128M (32k
> > > > > > > > > > > > > > pages)
> > > > > > > > > > > > > > according
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > virtio specs:
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/
> > > > > > > > > > > > > > virt
> > > > > > > > > > > > > > io-v
> > > > > > > > > > > > > > 1.1-
> > > > > > > > > > > > > > cs
> > > > > > > > > > > > > > 01
> > > > > > > > > > > > > > .html#
> > > > > > > > > > > > > > x1-240006
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Hi Christian,
> > > > > > > > > > 
> > > > > > > > > > > > > I took a quick look at the code:
> > > > > > > > > > Hi,
> > > > > > > > > > 
> > > > > > > > > > Thanks Stefan for sharing virtio expertise and helping
> > > > > > > > > > Christian
> > > > > > > > > > !
> > > > > > > > > > 
> > > > > > > > > > > > > - The Linux 9p driver restricts descriptor chains to
> > > > > > > > > > > > > 128
> > > > > > > > > > > > > elements
> > > > > > > > > > > > > 
> > > > > > > > > > > > >   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)
> > > > > > > > > > > > 
> > > > > > > > > > > > Yes, that's the limitation that I am about to remove
> > > > > > > > > > > > (WIP);
> > > > > > > > > > > > current
> > > > > > > > > > > > kernel
> > > > > > > > > > > > patches:
> > > > > > > > > > > > https://lore.kernel.org/netdev/cover.1632327421.git.linu
> > > > > > > > > > > > x_os
> > > > > > > > > > > > s@cr
> > > > > > > > > > > > udeb
> > > > > > > > > > > > yt
> > > > > > > > > > > > e.
> > > > > > > > > > > > com/>
> > > > > > > > > > > 
> > > > > > > > > > > I haven't read the patches yet but I'm concerned that
> > > > > > > > > > > today
> > > > > > > > > > > the
> > > > > > > > > > > driver
> > > > > > > > > > > is pretty well-behaved and this new patch series
> > > > > > > > > > > introduces a
> > > > > > > > > > > spec
> > > > > > > > > > > violation. Not fixing existing spec violations is okay,
> > > > > > > > > > > but
> > > > > > > > > > > adding
> > > > > > > > > > > new
> > > > > > > > > > > ones is a red flag. I think we need to figure out a clean
> > > > > > > > > > > solution.
> > > > > > > > > 
> > > > > > > > > Nobody has reviewed the kernel patches yet. My main concern
> > > > > > > > > therefore
> > > > > > > > > actually is that the kernel patches are already too complex,
> > > > > > > > > because
> > > > > > > > > the
> > > > > > > > > current situation is that only Dominique is handling 9p
> > > > > > > > > patches on
> > > > > > > > > kernel
> > > > > > > > > side, and he barely has time for 9p anymore.
> > > > > > > > > 
> > > > > > > > > Another reason for me to catch up on reading current kernel
> > > > > > > > > code
> > > > > > > > > and
> > > > > > > > > stepping in as reviewer of 9p on kernel side ASAP, independent
> > > > > > > > > of
> > > > > > > > > this
> > > > > > > > > issue.
> > > > > > > > > 
> > > > > > > > > As for current kernel patches' complexity: I can certainly
> > > > > > > > > drop
> > > > > > > > > patch
> > > > > > > > > 7
> > > > > > > > > entirely as it is probably just overkill. Patch 4 is then the
> > > > > > > > > biggest
> > > > > > > > > chunk, I have to see if I can simplify it, and whether it
> > > > > > > > > would
> > > > > > > > > make
> > > > > > > > > sense to squash with patch 3.
> > > > > > > > > 
> > > > > > > > > > > > > - The QEMU 9pfs code passes iovecs directly to
> > > > > > > > > > > > > preadv(2)
> > > > > > > > > > > > > and
> > > > > > > > > > > > > will
> > > > > > > > > > > > > fail
> > > > > > > > > > > > > 
> > > > > > > > > > > > >   with EINVAL when called with more than IOV_MAX
> > > > > > > > > > > > >   iovecs
> > > > > > > > > > > > >   (hw/9pfs/9p.c:v9fs_read())
> > > > > > > > > > > > 
> > > > > > > > > > > > Hmm, which makes me wonder why I never encountered this
> > > > > > > > > > > > error
> > > > > > > > > > > > during
> > > > > > > > > > > > testing.
> > > > > > > > > > > > 
> > > > > > > > > > > > Most people will use the 9p qemu 'local' fs driver
> > > > > > > > > > > > backend
> > > > > > > > > > > > in
> > > > > > > > > > > > practice,
> > > > > > > > > > > > so
> > > > > > > > > > > > that v9fs_read() call would translate for most people to
> > > > > > > > > > > > this
> > > > > > > > > > > > implementation on QEMU side (hw/9p/9p-local.c):
> > > > > > > > > > > > 
> > > > > > > > > > > > static ssize_t local_preadv(FsContext *ctx,
> > > > > > > > > > > > V9fsFidOpenState
> > > > > > > > > > > > *fs,
> > > > > > > > > > > > 
> > > > > > > > > > > >                             const struct iovec *iov,
> > > > > > > > > > > >                             int iovcnt, off_t offset)
> > > > > > > > > > > > 
> > > > > > > > > > > > {
> > > > > > > > > > > > #ifdef CONFIG_PREADV
> > > > > > > > > > > > 
> > > > > > > > > > > >     return preadv(fs->fd, iov, iovcnt, offset);
> > > > > > > > > > > > 
> > > > > > > > > > > > #else
> > > > > > > > > > > > 
> > > > > > > > > > > >     int err = lseek(fs->fd, offset, SEEK_SET);
> > > > > > > > > > > >     if (err == -1) {
> > > > > > > > > > > >     
> > > > > > > > > > > >         return err;
> > > > > > > > > > > >     
> > > > > > > > > > > >     } else {
> > > > > > > > > > > >     
> > > > > > > > > > > >         return readv(fs->fd, iov, iovcnt);
> > > > > > > > > > > >     
> > > > > > > > > > > >     }
> > > > > > > > > > > > 
> > > > > > > > > > > > #endif
> > > > > > > > > > > > }
> > > > > > > > > > > > 
> > > > > > > > > > > > > Unless I misunderstood the code, neither side can take
> > > > > > > > > > > > > advantage
> > > > > > > > > > > > > of
> > > > > > > > > > > > > the
> > > > > > > > > > > > > new 32k descriptor chain limit?
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > Stefan
> > > > > > > > > > > > 
> > > > > > > > > > > > I need to check that when I have some more time. One
> > > > > > > > > > > > possible
> > > > > > > > > > > > explanation
> > > > > > > > > > > > might be that preadv() already has this wrapped into a
> > > > > > > > > > > > loop
> > > > > > > > > > > > in
> > > > > > > > > > > > its
> > > > > > > > > > > > implementation to circumvent a limit like IOV_MAX. It
> > > > > > > > > > > > might
> > > > > > > > > > > > be
> > > > > > > > > > > > another
> > > > > > > > > > > > "it
> > > > > > > > > > > > works, but not portable" issue, but not sure.
> > > > > > > > > > > > 
> > > > > > > > > > > > There are still a bunch of other issues I have to
> > > > > > > > > > > > resolve.
> > > > > > > > > > > > If
> > > > > > > > > > > > you
> > > > > > > > > > > > look
> > > > > > > > > > > > at
> > > > > > > > > > > > net/9p/client.c on kernel side, you'll notice that it
> > > > > > > > > > > > basically
> > > > > > > > > > > > does
> > > > > > > > > > > > this ATM> >
> > > > > > > > > > > > 
> > > > > > > > > > > >     kmalloc(msize);
> > > > > > > > > > 
> > > > > > > > > > Note that this is done twice : once for the T message
> > > > > > > > > > (client
> > > > > > > > > > request)
> > > > > > > > > > and
> > > > > > > > > > once for the R message (server answer). The 9p driver could
> > > > > > > > > > adjust
> > > > > > > > > > the
> > > > > > > > > > size
> > > > > > > > > > of the T message to what's really needed instead of
> > > > > > > > > > allocating
> > > > > > > > > > the
> > > > > > > > > > full
> > > > > > > > > > msize. R message size is not known though.
> > > > > > > > > 
> > > > > > > > > Would it make sense adding a second virtio ring, dedicated to
> > > > > > > > > server
> > > > > > > > > responses to solve this? IIRC 9p server already calculates
> > > > > > > > > appropriate
> > > > > > > > > exact sizes for each response type. So server could just push
> > > > > > > > > space
> > > > > > > > > that's
> > > > > > > > > really needed for its responses.
> > > > > > > > > 
> > > > > > > > > > > > for every 9p request. So not only does it allocate much
> > > > > > > > > > > > more
> > > > > > > > > > > > memory
> > > > > > > > > > > > for
> > > > > > > > > > > > every request than actually required (i.e. say 9pfs was
> > > > > > > > > > > > mounted
> > > > > > > > > > > > with
> > > > > > > > > > > > msize=8M, then a 9p request that actually would just
> > > > > > > > > > > > need 1k
> > > > > > > > > > > > would
> > > > > > > > > > > > nevertheless allocate 8M), but also it allocates >
> > > > > > > > > > > > PAGE_SIZE,
> > > > > > > > > > > > which
> > > > > > > > > > > > obviously may fail at any time.>
> > > > > > > > > > > 
> > > > > > > > > > > The PAGE_SIZE limitation sounds like a kmalloc() vs
> > > > > > > > > > > vmalloc()
> > > > > > > > > > > situation.
> > > > > > > > > 
> > > > > > > > > Hu, I didn't even consider vmalloc(). I just tried the
> > > > > > > > > kvmalloc()
> > > > > > > > > wrapper
> > > > > > > > > as a quick & dirty test, but it crashed in the same way as
> > > > > > > > > kmalloc()
> > > > > > > > > with
> > > > > > > > > large msize values immediately on mounting:
> > > > > > > > > 
> > > > > > > > > diff --git a/net/9p/client.c b/net/9p/client.c
> > > > > > > > > index a75034fa249b..cfe300a4b6ca 100644
> > > > > > > > > --- a/net/9p/client.c
> > > > > > > > > +++ b/net/9p/client.c
> > > > > > > > > @@ -227,15 +227,18 @@ static int parse_opts(char *opts, struct
> > > > > > > > > p9_client
> > > > > > > > > *clnt)
> > > > > > > > > 
> > > > > > > > >  static int p9_fcall_init(struct p9_client *c, struct p9_fcall
> > > > > > > > >  *fc,
> > > > > > > > >  
> > > > > > > > >                          int alloc_msize)
> > > > > > > > >  
> > > > > > > > >  {
> > > > > > > > > 
> > > > > > > > > -       if (likely(c->fcall_cache) && alloc_msize == c->msize)
> > > > > > > > > {
> > > > > > > > > +       //if (likely(c->fcall_cache) && alloc_msize ==
> > > > > > > > > c->msize) {
> > > > > > > > > +       if (false) {
> > > > > > > > > 
> > > > > > > > >                 fc->sdata = kmem_cache_alloc(c->fcall_cache,
> > > > > > > > >                 GFP_NOFS);
> > > > > > > > >                 fc->cache = c->fcall_cache;
> > > > > > > > >         
> > > > > > > > >         } else {
> > > > > > > > > 
> > > > > > > > > -               fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
> > > > > > > > > +               fc->sdata = kvmalloc(alloc_msize, GFP_NOFS);
> > > > > > > > 
> > > > > > > > Ok, GFP_NOFS -> GFP_KERNEL did the trick.
> > > > > > > > 
> > > > > > > > Now I get:
> > > > > > > >    virtio: bogus descriptor or out of resources
> > > > > > > > 
> > > > > > > > So, still some work ahead on both ends.
> > > > > > > 
> > > > > > > Few hacks later (only changes on 9p client side) I got this
> > > > > > > running
> > > > > > > stable
> > > > > > > now. The reason for the virtio error above was that kvmalloc()
> > > > > > > returns
> > > > > > > a
> > > > > > > non-logical kernel address for any kvmalloc(>4M), i.e. an address
> > > > > > > that
> > > > > > > is
> > > > > > > inaccessible from host side, hence that "bogus descriptor" message
> > > > > > > by
> > > > > > > QEMU.
> > > > > > > So I had to split those linear 9p client buffers into sparse ones
> > > > > > > (set
> > > > > > > of
> > > > > > > individual pages).
> > > > > > > 
> > > > > > > I tested this for some days with various virtio transmission sizes
> > > > > > > and
> > > > > > > it
> > > > > > > works as expected up to 128 MB (more precisely: 128 MB read space
> > > > > > > +
> > > > > > > 128 MB
> > > > > > > write space per virtio round trip message).
> > > > > > > 
> > > > > > > I did not encounter a show stopper for large virtio transmission
> > > > > > > sizes
> > > > > > > (4 MB ... 128 MB) on virtio level, neither as a result of testing,
> > > > > > > nor
> > > > > > > after reviewing the existing code.
> > > > > > > 
> > > > > > > About IOV_MAX: that's apparently not an issue on virtio level.
> > > > > > > Most of
> > > > > > > the
> > > > > > > iovec code, both on Linux kernel side and on QEMU side do not have
> > > > > > > this
> > > > > > > limitation. It is apparently however indeed a limitation for
> > > > > > > userland
> > > > > > > apps
> > > > > > > calling the Linux kernel's syscalls yet.
> > > > > > > 
> > > > > > > Stefan, as it stands now, I am even more convinced that the upper
> > > > > > > virtio
> > > > > > > transmission size limit should not be squeezed into the queue size
> > > > > > > argument of virtio_add_queue(). Not because of the previous
> > > > > > > argument
> > > > > > > that
> > > > > > > it would waste space (~1MB), but rather because they are two
> > > > > > > different
> > > > > > > things. To outline this, just a quick recap of what happens
> > > > > > > exactly
> > > > > > > when
> > > > > > > a bulk message is pushed over the virtio wire (assuming virtio
> > > > > > > "split"
> > > > > > > layout here):
> > > > > > > 
> > > > > > > ---------- [recap-start] ----------
> > > > > > > 
> > > > > > > For each bulk message sent guest <-> host, exactly *one* of the
> > > > > > > pre-allocated descriptors is taken and placed (subsequently) into
> > > > > > > exactly
> > > > > > > *one* position of the two available/used ring buffers. The actual
> > > > > > > descriptor table though, containing all the DMA addresses of the
> > > > > > > message
> > > > > > > bulk data, is allocated just in time for each round trip message.
> > > > > > > Say it is the first message sent; it yields the following structure:
> > > > > > > 
> > > > > > > Ring Buffer   Descriptor Table      Bulk Data Pages
> > > > > > > 
> > > > > > >    +-+              +-+           +-----------------+
> > > > > > >    
> > > > > > >    |D|------------->|d|---------->| Bulk data block |
> > > > > > >    
> > > > > > >    +-+              |d|--------+  +-----------------+
> > > > > > >    
> > > > > > >    | |              |d|------+ |
> > > > > > >    
> > > > > > >    +-+               .       | |  +-----------------+
> > > > > > >    
> > > > > > >    | |               .       | +->| Bulk data block |
> > > > > > >     
> > > > > > >     .                .       |    +-----------------+
> > > > > > >     .               |d|-+    |
> > > > > > >     .               +-+ |    |    +-----------------+
> > > > > > >     
> > > > > > >    | |                  |    +--->| Bulk data block |
> > > > > > >    
> > > > > > >    +-+                  |         +-----------------+
> > > > > > >    
> > > > > > >    | |                  |                 .
> > > > > > >    
> > > > > > >    +-+                  |                 .
> > > > > > >    
> > > > > > >                         |                 .
> > > > > > >                         |         
> > > > > > >                         |         +-----------------+
> > > > > > >                         
> > > > > > >                         +-------->| Bulk data block |
> > > > > > >                         
> > > > > > >                                   +-----------------+
> > > > > > > 
> > > > > > > Legend:
> > > > > > > D: pre-allocated descriptor
> > > > > > > d: just in time allocated descriptor
> > > > > > > -->: memory pointer (DMA)
> > > > > > > 
> > > > > > > The bulk data blocks are allocated by the respective device driver
> > > > > > > above
> > > > > > > virtio subsystem level (guest side).
> > > > > > > 
> > > > > > > There are exactly as many descriptors pre-allocated (D) as the
> > > > > > > size of
> > > > > > > a
> > > > > > > ring buffer.
> > > > > > > 
> > > > > > > A "descriptor" is more or less just a chainable DMA memory
> > > > > > > pointer;
> > > > > > > defined
> > > > > > > as:
> > > > > > > 
> > > > > > > /* Virtio ring descriptors: 16 bytes.  These can chain together
> > > > > > > via
> > > > > > > "next". */ struct vring_desc {
> > > > > > > 
> > > > > > > 	/* Address (guest-physical). */
> > > > > > > 	__virtio64 addr;
> > > > > > > 	/* Length. */
> > > > > > > 	__virtio32 len;
> > > > > > > 	/* The flags as indicated above. */
> > > > > > > 	__virtio16 flags;
> > > > > > > 	/* We chain unused descriptors via this, too */
> > > > > > > 	__virtio16 next;
> > > > > > > 
> > > > > > > };
> > > > > > > 
> > > > > > > There are 2 ring buffers; the "available" ring buffer is for
> > > > > > > sending a
> > > > > > > message guest->host (which will transmit DMA addresses of guest
> > > > > > > allocated
> > > > > > > bulk data blocks that are used for data sent to device, and
> > > > > > > separate
> > > > > > > guest allocated bulk data blocks that will be used by host side to
> > > > > > > place
> > > > > > > its response bulk data), and the "used" ring buffer is for sending
> > > > > > > host->guest to let guest know about host's response and that it
> > > > > > > could
> > > > > > > now
> > > > > > > safely consume and then deallocate the bulk data blocks
> > > > > > > subsequently.
> > > > > > > 
> > > > > > > ---------- [recap-end] ----------
> > > > > > > 
> > > > > > > So the "queue size" actually defines the ringbuffer size. It does
> > > > > > > not
> > > > > > > define the maximum amount of descriptors. The "queue size" rather
> > > > > > > defines
> > > > > > > how many pending messages can be pushed into either one ringbuffer
> > > > > > > before
> > > > > > > the other side would need to wait until the counterpart catches up
> > > > > > > (i.e. ring buffer full).
> > > > > > > 
> > > > > > > The maximum amount of descriptors (what VIRTQUEUE_MAX_SIZE
> > > > > > > actually
> > > > > > > is)
> > > > > > > OTOH defines the max. bulk data size that could be transmitted
> > > > > > > with
> > > > > > > each
> > > > > > > virtio round trip message.
> > > > > > > 
> > > > > > > And in fact, 9p currently handles the virtio "queue size" as
> > > > > > > directly
> > > > > > > associated with the maximum number of active 9p requests the
> > > > > > > server
> > > > > > > could
> > > > > > > 
> > > > > > > handle simultaneously:
> > > > > > >   hw/9pfs/9p.h:#define MAX_REQ         128
> > > > > > >   hw/9pfs/9p.h:    V9fsPDU pdus[MAX_REQ];
> > > > > > >   hw/9pfs/virtio-9p-device.c:    v->vq = virtio_add_queue(vdev,
> > > > > > >   MAX_REQ,
> > > > > > >   
> > > > > > >                                  handle_9p_output);
> > > > > > > 
> > > > > > > So if I changed it like this, just in order to
> > > > > > > increase
> > > > > > > the
> > > > > > > max. virtio transmission size:
> > > > > > > 
> > > > > > > --- a/hw/9pfs/virtio-9p-device.c
> > > > > > > +++ b/hw/9pfs/virtio-9p-device.c
> > > > > > > @@ -218,7 +218,7 @@ static void
> > > > > > > virtio_9p_device_realize(DeviceState
> > > > > > > *dev,
> > > > > > > Error **errp)>
> > > > > > > 
> > > > > > >      v->config_size = sizeof(struct virtio_9p_config) +
> > > > > > >      strlen(s->fsconf.tag);
> > > > > > >      virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
> > > > > > >      
> > > > > > >                  VIRTQUEUE_MAX_SIZE);
> > > > > > > 
> > > > > > > -    v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
> > > > > > > +    v->vq = virtio_add_queue(vdev, 32*1024, handle_9p_output);
> > > > > > > 
> > > > > > >  }
> > > > > > > 
> > > > > > > Then it would require additional synchronization code on both ends
> > > > > > > and
> > > > > > > therefore unnecessary complexity, because it would now be possible
> > > > > > > that
> > > > > > > more requests are pushed into the ringbuffer than the server could
> > > > > > > handle.
> > > > > > > 
> > > > > > > There is one potential issue though that probably did justify the
> > > > > > > "don't
> > > > > > > exceed the queue size" rule:
> > > > > > > 
> > > > > > > ATM the descriptor table is allocated (just in time) as *one*
> > > > > > > contiguous
> > > > > > > buffer via kmalloc_array():
> > > > > > > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798c
> > > > > > > a086
> > > > > > > f7c7
> > > > > > > d33a4/drivers/virtio/virtio_ring.c#L440
> > > > > > > 
> > > > > > > So assuming transmission size of 2 * 128 MB that kmalloc_array()
> > > > > > > call
> > > > > > > would
> > > > > > > result in kmalloc(1M) and the latter might fail if the guest had highly
> > > > > > > fragmented physical memory. For such kind of error case there is
> > > > > > > currently a fallback path in virtqueue_add_split() that would then
> > > > > > > use
> > > > > > > the required amount of pre-allocated descriptors instead:
> > > > > > > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798c
> > > > > > > a086
> > > > > > > f7c7
> > > > > > > d33a4/drivers/virtio/virtio_ring.c#L525
> > > > > > > 
> > > > > > > That fallback recovery path would no longer be viable if the queue
> > > > > > > size
> > > > > > > was
> > > > > > > exceeded. There would be alternatives though, e.g. by allowing to
> > > > > > > chain
> > > > > > > indirect descriptor tables (currently prohibited by the virtio
> > > > > > > specs).
> > > > > > 
> > > > > > Making the maximum number of descriptors independent of the queue
> > > > > > size
> > > > > > requires a change to the VIRTIO spec since the two values are
> > > > > > currently
> > > > > > explicitly tied together by the spec.
> > > > > 
> > > > > Yes, that's what the virtio specs say. But they don't say why, nor did
> > > > > I
> > > > > hear a reason in this discussion.
> > > > > 
> > > > > That's why I invested time reviewing current virtio implementation and
> > > > > specs, as well as actually testing exceeding that limit. And as I
> > > > > outlined in detail in my previous email, I only found one theoretical
> > > > > issue that could be addressed though.
> > > > 
> > > > I agree that there is a limitation in the VIRTIO spec, but violating the
> > > > spec isn't an acceptable solution:
> > > > 
> > > > 1. QEMU and Linux aren't the only components that implement VIRTIO. You
> > > > 
> > > >    cannot make assumptions about their implementations because it may
> > > >    break spec-compliant implementations that you haven't looked at.
> > > >    
> > > >    Your patches weren't able to increase Queue Size because some device
> > > >    implementations break when descriptor chains are too long. This shows
> > > >    there is a practical issue even in QEMU.
> > > > 
> > > > 2. The specific spec violation that we discussed creates the problem
> > > > 
> > > >    that drivers can no longer determine the maximum descriptor chain
> > > >    length. This in turn will lead to more implementation-specific
> > > >    assumptions being baked into drivers and cause problems with
> > > >    interoperability and future changes.
> > > > 
> > > > The spec needs to be extended instead. I included an idea for how to do
> > > > that below.
> > > 
> > > Sure, I just wanted to see if there was a non-negligible "hard" show
> > > stopper per se that I probably haven't seen yet. I have not questioned
> > > aiming at a clean solution.
> > > 
> > > Thanks for the clarification!
> > > 
> > > > > > Before doing that, are there benchmark results showing that 1 MB vs
> > > > > > 128
> > > > > > MB produces a performance improvement? I'm asking because if
> > > > > > performance
> > > > > > with 1 MB is good then you can probably do that without having to
> > > > > > change
> > > > > > VIRTIO and also because it's counter-intuitive that 9p needs 128 MB
> > > > > > for
> > > > > > good performance when it's ultimately implemented on top of disk and
> > > > > > network I/O that have lower size limits.
> > > > > 
> > > > > First some numbers, linear reading a 12 GB file:
> > > > > 
> > > > > msize    average      notes
> > > > > 
> > > > > 8 kB     52.0 MB/s    default msize of Linux kernel <v5.15
> > > > > 128 kB   624.8 MB/s   default msize of Linux kernel >=v5.15
> > > > > 512 kB   1961 MB/s    current max. msize with any Linux kernel <=v5.15
> > > > > 1 MB     2551 MB/s    this msize would already violate virtio specs
> > > > > 2 MB     2521 MB/s    this msize would already violate virtio specs
> > > > > 4 MB     2628 MB/s    planned max. msize of my current kernel patches
> > > > > [1]
> > > > 
> > > > How many descriptors are used? 4 MB can be covered by a single
> > > > descriptor if the data is physically contiguous in memory, so this data
> > > > doesn't demonstrate a need for more descriptors.
> > > 
> > > No, in the last couple years there was apparently no kernel version that
> > > used just one descriptor, nor did my benchmarked version. Even though the
> > > Linux 9p client still uses simple linear buffers (contiguous physical
> > > memory) on 9p client level, these are however split into PAGE_SIZE chunks
> > > by function pack_sg_list() [1] before being fed to virtio level:
> > > 
> > > static unsigned int rest_of_page(void *data)
> > > {
> > > 
> > > 	return PAGE_SIZE - offset_in_page(data);
> > > 
> > > }
> > > ...
> > > static int pack_sg_list(struct scatterlist *sg, int start,
> > > 
> > > 			int limit, char *data, int count)
> > > 
> > > {
> > > 
> > > 	int s;
> > > 	int index = start;
> > > 	
> > > 	while (count) {
> > > 	
> > > 		s = rest_of_page(data);
> > > 		...
> > > 		sg_set_buf(&sg[index++], data, s);
> > > 		count -= s;
> > > 		data += s;
> > > 	
> > > 	}
> > > 	...
> > > 
> > > }
> > > 
> > > [1]
> > > https://github.com/torvalds/linux/blob/19901165d90fdca1e57c9baa0d5b4c63d1
> > > 5c476a/net/9p/trans_virtio.c#L171
> > > 
> > > So when sending 4MB over the virtio wire, it would yield 1k descriptors
> > > ATM.
> > > 
> > > I have wondered about this before, but did not question it, because due to
> > > the cross-platform nature I couldn't say for certain whether that's
> > > perhaps needed somewhere. I mean in the case of virtio-PCI I know for sure
> > > that one descriptor (i.e. >PAGE_SIZE) would be fine, but I don't know if
> > > that applies to all buses and architectures.
> > 
> > VIRTIO does not limit the descriptor len field to PAGE_SIZE,
> > so I don't think there is a limit at the VIRTIO level.
> 
> So you are viewing this purely from the virtio specs' PoV: in the sense that
> if it is not prohibited by the virtio specs, then it should work. Maybe.

Limitations must be specified either in the 9P protocol or the VIRTIO
specification. Drivers and devices will not be able to operate correctly
if there are limitations that aren't covered by the specs.

Do you have something in mind that isn't covered by the specs?

> > If this function coalesces adjacent pages then the descriptor chain
> > length issues could be reduced.
> > 
> > > > > But again, this is not just about performance. My conclusion as
> > > > > described
> > > > > in my previous email is that virtio currently squeezes
> > > > > 
> > > > > 	"max. simultanious amount of bulk messages"
> > > > > 
> > > > > vs.
> > > > > 
> > > > > 	"max. bulk data transmission size per bulk messaage"
> > > > > 
> > > > > into the same configuration parameter, which is IMO inappropriate and
> > > > > hence
> > > > > splitting them into 2 separate parameters when creating a queue makes
> > > > > sense, independent of the performance benchmarks.
> > > > > 
> > > > > [1]
> > > > > https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyt
> > > > > e.c
> > > > > om/
> > > > 
> > > > Some devices effectively already have this because the device advertises
> > > > a maximum number of descriptors via device-specific mechanisms like the
> > > > struct virtio_blk_config seg_max field. But today these fields can only
> > > > reduce the maximum descriptor chain length because the spec still limits
> > > > the length to Queue Size.
> > > > 
> > > > We can build on this approach to raise the length above Queue Size. This
> > > > approach has the advantage that the maximum number of segments isn't per
> > > > device or per virtqueue, it's fine-grained. If the device supports two
> > > > requests types then different max descriptor chain limits could be given
> > > > for them by introducing two separate configuration space fields.
> > > > 
> > > > Here are the corresponding spec changes:
> > > > 
> > > > 1. A new feature bit called VIRTIO_RING_F_LARGE_INDIRECT_DESC is added
> > > > 
> > > >    to indicate that indirect descriptor table size and maximum
> > > >    descriptor chain length are not limited by Queue Size value. (Maybe
> > > >    there still needs to be a limit like 2^15?)
> > > 
> > > Sounds good to me!
> > > 
> > > AFAIK it is effectively limited to 2^16 because of vring_desc->next:
> > > 
> > > /* Virtio ring descriptors: 16 bytes.  These can chain together via
> > > "next". */ struct vring_desc {
> > > 
> > >         /* Address (guest-physical). */
> > >         __virtio64 addr;
> > >         /* Length. */
> > >         __virtio32 len;
> > >         /* The flags as indicated above. */
> > >         __virtio16 flags;
> > >         /* We chain unused descriptors via this, too */
> > >         __virtio16 next;
> > > 
> > > };
> > 
> > Yes, Split Virtqueues have a fundamental limit on indirect table size
> > due to the "next" field. Packed Virtqueue descriptors don't have a
> > "next" field so descriptor chains could be longer in theory (currently
> > forbidden by the spec).
> > 
> > > > One thing that's messy is that we've been discussing the maximum
> > > > descriptor chain length but 9p has the "msize" concept, which isn't
> > > > aware of contiguous memory. It may be necessary to extend the 9p driver
> > > > code to size requests not just according to their length in bytes but
> > > > also according to the descriptor chain length. That's how the Linux
> > > > block layer deals with queue limits (struct queue_limits max_segments vs
> > > > max_hw_sectors).
> > > 
> > > Hmm, I can't follow you on that one. What would that be needed for in case of
> > > 9p? My plan was to limit msize by 9p client simply at session start to
> > > whatever is the max. number of virtio descriptors supported by the host and
> > > using PAGE_SIZE as size per descriptor, because that's what 9p client
> > > actually does ATM (see above). So you think that should be changed to
> > > e.g. just one descriptor for 4MB, right?
> > 
> > Limiting msize to the 9p transport device's maximum number of
> > descriptors is conservative (i.e. 128 descriptors = 512 KB msize)
> > because it doesn't take advantage of contiguous memory. I suggest
> > leaving msize alone, adding a separate limit at which requests are split
> > according to the maximum descriptor chain length, and tweaking
> > pack_sg_list() to coalesce adjacent pages.
> > 
> > That way msize can be large without necessarily using lots of
> > descriptors (depending on the memory layout).
> 
> That was actually a tempting solution, because it would neither require
> changes to the virtio specs (at least for a while) nor break compatibility with
> older QEMU versions. And for the pack_sg_list() portion of the code it would
> work well and easily, as the buffer passed to pack_sg_list() is already
> contiguous.
> 
> However, I just realized that for the zero-copy version of the code this would
> be more tricky. The ZC version already uses individual pages (struct page,
> hence PAGE_SIZE each) which are pinned, i.e. it uses pack_sg_list_p() [1] in
> combination with p9_get_mapped_pages() [2].
> 
> [1] https://github.com/torvalds/linux/blob/7ddb58cb0ecae8e8b6181d736a87667cc9ab8389/net/9p/trans_virtio.c#L218
> [2] https://github.com/torvalds/linux/blob/7ddb58cb0ecae8e8b6181d736a87667cc9ab8389/net/9p/trans_virtio.c#L309
> 
> So that would require much more work and code trying to sort and coalesce
> individual pages to contiguous physical memory for the sake of reducing virtio
> descriptors. And there is no guarantee that this is even possible. The kernel
> may simply return a non-contiguous set of pages which would eventually end up
> exceeding the virtio descriptor limit again.

Order must be preserved so pages cannot be sorted by physical address.
How about simply coalescing when pages are adjacent?
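
Something along these lines maybe; a rough, untested sketch (names made up,
not the actual net/9p code) that merges runs of physically adjacent pages
into single sg entries while preserving their order:

static int pack_sg_pages_coalesced(struct scatterlist *sg, int start, int limit,
				   struct page **pages, int nr_pages,
				   size_t offset, int count)
{
	int index = start;
	int i = 0;

	while (count > 0 && i < nr_pages) {
		struct page *first = pages[i];
		int len = min_t(int, PAGE_SIZE - offset, count);

		/* extend the run while the next page is physically adjacent */
		while (i + 1 < nr_pages && count > len &&
		       page_to_pfn(pages[i + 1]) == page_to_pfn(pages[i]) + 1) {
			len += min_t(int, PAGE_SIZE, count - len);
			i++;
		}
		if (index >= limit)
			return -ENOMEM;
		/* one sg entry (i.e. one descriptor) per contiguous run */
		sg_set_page(&sg[index++], first, len, offset);
		count -= len;
		offset = 0;
		i++;
	}
	return index - start;
}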

> So it looks like it was probably still easier and more realistic to just add
> virtio capabilities for now to allow exceeding the current descriptor limit.

I'm still not sure why virtio-net, virtio-blk, virtio-fs, etc perform
fine under today's limits while virtio-9p needs a much higher limit to
achieve good performance. Maybe there is an issue in a layer above the
vring that's causing the virtio-9p performance you've observed?

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Virtio-fs] [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
@ 2021-11-09 10:56                             ` Stefan Hajnoczi
  0 siblings, 0 replies; 97+ messages in thread
From: Stefan Hajnoczi @ 2021-11-09 10:56 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	Jason Wang, Amit Shah, David Hildenbrand, qemu-devel, virtio-fs,
	Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Paolo Bonzini, Marc-André Lureau, Fam Zheng,
	Raphael Norwitz

[-- Attachment #1: Type: text/plain, Size: 34829 bytes --]

On Thu, Nov 04, 2021 at 03:41:23PM +0100, Christian Schoenebeck wrote:
> On Mittwoch, 3. November 2021 12:33:33 CET Stefan Hajnoczi wrote:
> > On Mon, Nov 01, 2021 at 09:29:26PM +0100, Christian Schoenebeck wrote:
> > > On Donnerstag, 28. Oktober 2021 11:00:48 CET Stefan Hajnoczi wrote:
> > > > On Mon, Oct 25, 2021 at 05:03:25PM +0200, Christian Schoenebeck wrote:
> > > > > On Montag, 25. Oktober 2021 12:30:41 CEST Stefan Hajnoczi wrote:
> > > > > > On Thu, Oct 21, 2021 at 05:39:28PM +0200, Christian Schoenebeck wrote:
> > > > > > > On Freitag, 8. Oktober 2021 18:08:48 CEST Christian Schoenebeck wrote:
> > > > > > > > On Freitag, 8. Oktober 2021 16:24:42 CEST Christian Schoenebeck wrote:
> > > > > > > > > On Freitag, 8. Oktober 2021 09:25:33 CEST Greg Kurz wrote:
> > > > > > > > > > On Thu, 7 Oct 2021 16:42:49 +0100
> > > > > > > > > > 
> > > > > > > > > > Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > > > > > > On Thu, Oct 07, 2021 at 02:51:55PM +0200, Christian Schoenebeck wrote:
> > > > > > > > > > > > On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan Hajnoczi wrote:
> > > > > > > > > > > > > On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian
> > > > > > > > > > > > > Schoenebeck
> > > > > > > > 
> > > > > > > > wrote:
> > > > > > > > > > > > > > At the moment the maximum transfer size with virtio
> > > > > > > > > > > > > > is
> > > > > > > > > > > > > > limited
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > 4M
> > > > > > > > > > > > > > (1024 * PAGE_SIZE). This series raises this limit to
> > > > > > > > > > > > > > its
> > > > > > > > > > > > > > maximum
> > > > > > > > > > > > > > theoretical possible transfer size of 128M (32k
> > > > > > > > > > > > > > pages)
> > > > > > > > > > > > > > according
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > virtio specs:
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/
> > > > > > > > > > > > > > virt
> > > > > > > > > > > > > > io-v
> > > > > > > > > > > > > > 1.1-
> > > > > > > > > > > > > > cs
> > > > > > > > > > > > > > 01
> > > > > > > > > > > > > > .html#
> > > > > > > > > > > > > > x1-240006
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Hi Christian,
> > > > > > > > > > 
> > > > > > > > > > > > > I took a quick look at the code:
> > > > > > > > > > Hi,
> > > > > > > > > > 
> > > > > > > > > > Thanks Stefan for sharing virtio expertise and helping
> > > > > > > > > > Christian
> > > > > > > > > > !
> > > > > > > > > > 
> > > > > > > > > > > > > - The Linux 9p driver restricts descriptor chains to
> > > > > > > > > > > > > 128
> > > > > > > > > > > > > elements
> > > > > > > > > > > > > 
> > > > > > > > > > > > >   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)
> > > > > > > > > > > > 
> > > > > > > > > > > > Yes, that's the limitation that I am about to remove
> > > > > > > > > > > > (WIP);
> > > > > > > > > > > > current
> > > > > > > > > > > > kernel
> > > > > > > > > > > > patches:
> > > > > > > > > > > > https://lore.kernel.org/netdev/cover.1632327421.git.linu
> > > > > > > > > > > > x_os
> > > > > > > > > > > > s@cr
> > > > > > > > > > > > udeb
> > > > > > > > > > > > yt
> > > > > > > > > > > > e.
> > > > > > > > > > > > com/>
> > > > > > > > > > > 
> > > > > > > > > > > I haven't read the patches yet but I'm concerned that
> > > > > > > > > > > today
> > > > > > > > > > > the
> > > > > > > > > > > driver
> > > > > > > > > > > is pretty well-behaved and this new patch series
> > > > > > > > > > > introduces a
> > > > > > > > > > > spec
> > > > > > > > > > > violation. Not fixing existing spec violations is okay,
> > > > > > > > > > > but
> > > > > > > > > > > adding
> > > > > > > > > > > new
> > > > > > > > > > > ones is a red flag. I think we need to figure out a clean
> > > > > > > > > > > solution.
> > > > > > > > > 
> > > > > > > > > Nobody has reviewed the kernel patches yet. My main concern
> > > > > > > > > therefore
> > > > > > > > > actually is that the kernel patches are already too complex,
> > > > > > > > > because
> > > > > > > > > the
> > > > > > > > > current situation is that only Dominique is handling 9p
> > > > > > > > > patches on
> > > > > > > > > kernel
> > > > > > > > > side, and he barely has time for 9p anymore.
> > > > > > > > > 
> > > > > > > > > Another reason for me to catch up on reading current kernel
> > > > > > > > > code
> > > > > > > > > and
> > > > > > > > > stepping in as reviewer of 9p on kernel side ASAP, independent
> > > > > > > > > of
> > > > > > > > > this
> > > > > > > > > issue.
> > > > > > > > > 
> > > > > > > > > As for current kernel patches' complexity: I can certainly
> > > > > > > > > drop
> > > > > > > > > patch
> > > > > > > > > 7
> > > > > > > > > entirely as it is probably just overkill. Patch 4 is then the
> > > > > > > > > biggest
> > > > > > > > > chunk, I have to see if I can simplify it, and whether it
> > > > > > > > > would
> > > > > > > > > make
> > > > > > > > > sense to squash with patch 3.
> > > > > > > > > 
> > > > > > > > > > > > > - The QEMU 9pfs code passes iovecs directly to
> > > > > > > > > > > > > preadv(2)
> > > > > > > > > > > > > and
> > > > > > > > > > > > > will
> > > > > > > > > > > > > fail
> > > > > > > > > > > > > 
> > > > > > > > > > > > >   with EINVAL when called with more than IOV_MAX
> > > > > > > > > > > > >   iovecs
> > > > > > > > > > > > >   (hw/9pfs/9p.c:v9fs_read())
> > > > > > > > > > > > 
> > > > > > > > > > > > Hmm, which makes me wonder why I never encountered this
> > > > > > > > > > > > error
> > > > > > > > > > > > during
> > > > > > > > > > > > testing.
> > > > > > > > > > > > 
> > > > > > > > > > > > Most people will use the 9p qemu 'local' fs driver
> > > > > > > > > > > > backend
> > > > > > > > > > > > in
> > > > > > > > > > > > practice,
> > > > > > > > > > > > so
> > > > > > > > > > > > that v9fs_read() call would translate for most people to
> > > > > > > > > > > > this
> > > > > > > > > > > > implementation on QEMU side (hw/9p/9p-local.c):
> > > > > > > > > > > > 
> > > > > > > > > > > > static ssize_t local_preadv(FsContext *ctx,
> > > > > > > > > > > > V9fsFidOpenState
> > > > > > > > > > > > *fs,
> > > > > > > > > > > > 
> > > > > > > > > > > >                             const struct iovec *iov,
> > > > > > > > > > > >                             int iovcnt, off_t offset)
> > > > > > > > > > > > 
> > > > > > > > > > > > {
> > > > > > > > > > > > #ifdef CONFIG_PREADV
> > > > > > > > > > > > 
> > > > > > > > > > > >     return preadv(fs->fd, iov, iovcnt, offset);
> > > > > > > > > > > > 
> > > > > > > > > > > > #else
> > > > > > > > > > > > 
> > > > > > > > > > > >     int err = lseek(fs->fd, offset, SEEK_SET);
> > > > > > > > > > > >     if (err == -1) {
> > > > > > > > > > > >     
> > > > > > > > > > > >         return err;
> > > > > > > > > > > >     
> > > > > > > > > > > >     } else {
> > > > > > > > > > > >     
> > > > > > > > > > > >         return readv(fs->fd, iov, iovcnt);
> > > > > > > > > > > >     
> > > > > > > > > > > >     }
> > > > > > > > > > > > 
> > > > > > > > > > > > #endif
> > > > > > > > > > > > }
> > > > > > > > > > > > 
> > > > > > > > > > > > > Unless I misunderstood the code, neither side can take
> > > > > > > > > > > > > advantage
> > > > > > > > > > > > > of
> > > > > > > > > > > > > the
> > > > > > > > > > > > > new 32k descriptor chain limit?
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > Stefan
> > > > > > > > > > > > 
> > > > > > > > > > > > I need to check that when I have some more time. One
> > > > > > > > > > > > possible
> > > > > > > > > > > > explanation
> > > > > > > > > > > > might be that preadv() already has this wrapped into a
> > > > > > > > > > > > loop
> > > > > > > > > > > > in
> > > > > > > > > > > > its
> > > > > > > > > > > > implementation to circumvent a limit like IOV_MAX. It
> > > > > > > > > > > > might
> > > > > > > > > > > > be
> > > > > > > > > > > > another
> > > > > > > > > > > > "it
> > > > > > > > > > > > works, but not portable" issue, but not sure.
> > > > > > > > > > > > 
> > > > > > > > > > > > There are still a bunch of other issues I have to
> > > > > > > > > > > > resolve.
> > > > > > > > > > > > If
> > > > > > > > > > > > you
> > > > > > > > > > > > look
> > > > > > > > > > > > at
> > > > > > > > > > > > net/9p/client.c on kernel side, you'll notice that it
> > > > > > > > > > > > basically
> > > > > > > > > > > > does
> > > > > > > > > > > > this ATM:
> > > > > > > > > > > > 
> > > > > > > > > > > >     kmalloc(msize);
> > > > > > > > > > 
> > > > > > > > > > Note that this is done twice : once for the T message
> > > > > > > > > > (client
> > > > > > > > > > request)
> > > > > > > > > > and
> > > > > > > > > > once for the R message (server answer). The 9p driver could
> > > > > > > > > > adjust
> > > > > > > > > > the
> > > > > > > > > > size
> > > > > > > > > > of the T message to what's really needed instead of
> > > > > > > > > > allocating
> > > > > > > > > > the
> > > > > > > > > > full
> > > > > > > > > > msize. R message size is not known though.
> > > > > > > > > 
> > > > > > > > > Would it make sense adding a second virtio ring, dedicated to
> > > > > > > > > server
> > > > > > > > > responses to solve this? IIRC 9p server already calculates
> > > > > > > > > appropriate
> > > > > > > > > exact sizes for each response type. So server could just push
> > > > > > > > > space
> > > > > > > > > that's
> > > > > > > > > really needed for its responses.
> > > > > > > > > 
> > > > > > > > > > > > for every 9p request. So not only does it allocate much
> > > > > > > > > > > > more
> > > > > > > > > > > > memory
> > > > > > > > > > > > for
> > > > > > > > > > > > every request than actually required (i.e. say 9pfs was
> > > > > > > > > > > > mounted
> > > > > > > > > > > > with
> > > > > > > > > > > > msize=8M, then a 9p request that actually would just
> > > > > > > > > > > > need 1k
> > > > > > > > > > > > would
> > > > > > > > > > > > nevertheless allocate 8M), but also it allocates >
> > > > > > > > > > > > PAGE_SIZE,
> > > > > > > > > > > > which
> > > > > > > > > > > > obviously may fail at any time.
> > > > > > > > > > > 
> > > > > > > > > > > The PAGE_SIZE limitation sounds like a kmalloc() vs
> > > > > > > > > > > vmalloc()
> > > > > > > > > > > situation.
> > > > > > > > > 
> > > > > > > > > Hu, I didn't even consider vmalloc(). I just tried the
> > > > > > > > > kvmalloc()
> > > > > > > > > wrapper
> > > > > > > > > as a quick & dirty test, but it crashed in the same way as
> > > > > > > > > kmalloc()
> > > > > > > > > with
> > > > > > > > > large msize values immediately on mounting:
> > > > > > > > > 
> > > > > > > > > diff --git a/net/9p/client.c b/net/9p/client.c
> > > > > > > > > index a75034fa249b..cfe300a4b6ca 100644
> > > > > > > > > --- a/net/9p/client.c
> > > > > > > > > +++ b/net/9p/client.c
> > > > > > > > > @@ -227,15 +227,18 @@ static int parse_opts(char *opts, struct
> > > > > > > > > p9_client
> > > > > > > > > *clnt)
> > > > > > > > > 
> > > > > > > > >  static int p9_fcall_init(struct p9_client *c, struct p9_fcall
> > > > > > > > >  *fc,
> > > > > > > > >  
> > > > > > > > >                          int alloc_msize)
> > > > > > > > >  
> > > > > > > > >  {
> > > > > > > > > 
> > > > > > > > > -       if (likely(c->fcall_cache) && alloc_msize == c->msize)
> > > > > > > > > {
> > > > > > > > > +       //if (likely(c->fcall_cache) && alloc_msize ==
> > > > > > > > > c->msize) {
> > > > > > > > > +       if (false) {
> > > > > > > > > 
> > > > > > > > >                 fc->sdata = kmem_cache_alloc(c->fcall_cache,
> > > > > > > > >                 GFP_NOFS);
> > > > > > > > >                 fc->cache = c->fcall_cache;
> > > > > > > > >         
> > > > > > > > >         } else {
> > > > > > > > > 
> > > > > > > > > -               fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
> > > > > > > > > +               fc->sdata = kvmalloc(alloc_msize, GFP_NOFS);
> > > > > > > > 
> > > > > > > > Ok, GFP_NOFS -> GFP_KERNEL did the trick.
> > > > > > > > 
> > > > > > > > Now I get:
> > > > > > > >    virtio: bogus descriptor or out of resources
> > > > > > > > 
> > > > > > > > So, still some work ahead on both ends.
> > > > > > > 
> > > > > > > Few hacks later (only changes on 9p client side) I got this
> > > > > > > running
> > > > > > > stable
> > > > > > > now. The reason for the virtio error above was that kvmalloc()
> > > > > > > returns
> > > > > > > a
> > > > > > > non-logical kernel address for any kvmalloc(>4M), i.e. an address
> > > > > > > that
> > > > > > > is
> > > > > > > inaccessible from host side, hence that "bogus descriptor" message
> > > > > > > by
> > > > > > > QEMU.
> > > > > > > So I had to split those linear 9p client buffers into sparse ones
> > > > > > > (set
> > > > > > > of
> > > > > > > individual pages).
> > > > > > > 
> > > > > > > I tested this for some days with various virtio transmission sizes
> > > > > > > and
> > > > > > > it
> > > > > > > works as expected up to 128 MB (more precisely: 128 MB read space
> > > > > > > +
> > > > > > > 128 MB
> > > > > > > write space per virtio round trip message).
> > > > > > > 
> > > > > > > I did not encounter a show stopper for large virtio transmission
> > > > > > > sizes
> > > > > > > (4 MB ... 128 MB) on virtio level, neither as a result of testing,
> > > > > > > nor
> > > > > > > after reviewing the existing code.
> > > > > > > 
> > > > > > > About IOV_MAX: that's apparently not an issue on virtio level.
> > > > > > > Most of
> > > > > > > the
> > > > > > > iovec code, both on Linux kernel side and on QEMU side do not have
> > > > > > > this
> > > > > > > limitation. It is apparently however indeed a limitation for
> > > > > > > userland
> > > > > > > apps
> > > > > > > calling the Linux kernel's syscalls yet.
> > > > > > > 
> > > > > > > Stefan, as it stands now, I am even more convinced that the upper
> > > > > > > virtio
> > > > > > > transmission size limit should not be squeezed into the queue size
> > > > > > > argument of virtio_add_queue(). Not because of the previous
> > > > > > > argument
> > > > > > > that
> > > > > > > it would waste space (~1MB), but rather because they are two
> > > > > > > different
> > > > > > > things. To outline this, just a quick recap of what happens
> > > > > > > exactly
> > > > > > > when
> > > > > > > a bulk message is pushed over the virtio wire (assuming virtio
> > > > > > > "split"
> > > > > > > layout here):
> > > > > > > 
> > > > > > > ---------- [recap-start] ----------
> > > > > > > 
> > > > > > > For each bulk message sent guest <-> host, exactly *one* of the
> > > > > > > pre-allocated descriptors is taken and placed (subsequently) into
> > > > > > > exactly
> > > > > > > *one* position of the two available/used ring buffers. The actual
> > > > > > > descriptor table though, containing all the DMA addresses of the
> > > > > > > message
> > > > > > > bulk data, is allocated just in time for each round trip message.
> > > > > > > Say,
> > > > > > > it
> > > > > > > is the first message sent, it yields the following structure:
> > > > > > > 
> > > > > > > Ring Buffer   Descriptor Table      Bulk Data Pages
> > > > > > > 
> > > > > > >    +-+              +-+           +-----------------+
> > > > > > >    
> > > > > > >    |D|------------->|d|---------->| Bulk data block |
> > > > > > >    
> > > > > > >    +-+              |d|--------+  +-----------------+
> > > > > > >    
> > > > > > >    | |              |d|------+ |
> > > > > > >    
> > > > > > >    +-+               .       | |  +-----------------+
> > > > > > >    
> > > > > > >    | |               .       | +->| Bulk data block |
> > > > > > >     
> > > > > > >     .                .       |    +-----------------+
> > > > > > >     .               |d|-+    |
> > > > > > >     .               +-+ |    |    +-----------------+
> > > > > > >     
> > > > > > >    | |                  |    +--->| Bulk data block |
> > > > > > >    
> > > > > > >    +-+                  |         +-----------------+
> > > > > > >    
> > > > > > >    | |                  |                 .
> > > > > > >    
> > > > > > >    +-+                  |                 .
> > > > > > >    
> > > > > > >                         |                 .
> > > > > > >                         |         
> > > > > > >                         |         +-----------------+
> > > > > > >                         
> > > > > > >                         +-------->| Bulk data block |
> > > > > > >                         
> > > > > > >                                   +-----------------+
> > > > > > > 
> > > > > > > Legend:
> > > > > > > D: pre-allocated descriptor
> > > > > > > d: just in time allocated descriptor
> > > > > > > -->: memory pointer (DMA)
> > > > > > > 
> > > > > > > The bulk data blocks are allocated by the respective device driver
> > > > > > > above
> > > > > > > virtio subsystem level (guest side).
> > > > > > > 
> > > > > > > There are exactly as many descriptors pre-allocated (D) as the
> > > > > > > size of
> > > > > > > a
> > > > > > > ring buffer.
> > > > > > > 
> > > > > > > A "descriptor" is more or less just a chainable DMA memory
> > > > > > > pointer;
> > > > > > > defined
> > > > > > > as:
> > > > > > > 
> > > > > > > /* Virtio ring descriptors: 16 bytes.  These can chain together
> > > > > > > via
> > > > > > > "next". */ struct vring_desc {
> > > > > > > 
> > > > > > > 	/* Address (guest-physical). */
> > > > > > > 	__virtio64 addr;
> > > > > > > 	/* Length. */
> > > > > > > 	__virtio32 len;
> > > > > > > 	/* The flags as indicated above. */
> > > > > > > 	__virtio16 flags;
> > > > > > > 	/* We chain unused descriptors via this, too */
> > > > > > > 	__virtio16 next;
> > > > > > > 
> > > > > > > };
> > > > > > > 
> > > > > > > There are 2 ring buffers; the "available" ring buffer is for
> > > > > > > sending a
> > > > > > > message guest->host (which will transmit DMA addresses of guest
> > > > > > > allocated
> > > > > > > bulk data blocks that are used for data sent to device, and
> > > > > > > separate
> > > > > > > guest allocated bulk data blocks that will be used by host side to
> > > > > > > place
> > > > > > > its response bulk data), and the "used" ring buffer is for sending
> > > > > > > host->guest to let guest know about host's response and that it
> > > > > > > could
> > > > > > > now
> > > > > > > safely consume and then deallocate the bulk data blocks
> > > > > > > subsequently.
> > > > > > > 
> > > > > > > ---------- [recap-end] ----------
> > > > > > > 
> > > > > > > So the "queue size" actually defines the ringbuffer size. It does
> > > > > > > not
> > > > > > > define the maximum amount of descriptors. The "queue size" rather
> > > > > > > defines
> > > > > > > how many pending messages can be pushed into either one ringbuffer
> > > > > > > before
> > > > > > > the other side would need to wait until the counter side would
> > > > > > > step up
> > > > > > > (i.e. ring buffer full).
> > > > > > > 
> > > > > > > The maximum amount of descriptors (what VIRTQUEUE_MAX_SIZE
> > > > > > > actually
> > > > > > > is)
> > > > > > > OTOH defines the max. bulk data size that could be transmitted
> > > > > > > with
> > > > > > > each
> > > > > > > virtio round trip message.
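
To put that distinction into rough numbers (an illustration only, assuming 4 KiB
pages, the current queue size of 128 and the 32k descriptor figure discussed in
this thread):

/* Illustration only, not taken from the patches. */
#define RING_SIZE       128           /* "queue size": slots in the avail/used rings */
#define MAX_DESC_CHAIN  (32 * 1024)   /* max. descriptors chained per message        */
#define PAGE_SZ         4096

/* Bounded by the ring: how many messages may be in flight at once. */
unsigned int  max_inflight_msgs = RING_SIZE;                                /* 128     */

/* Bounded by the chain length: how much bulk data one message carries
 * (one page per descriptor, as pack_sg_list() currently produces). */
unsigned long max_bytes_per_msg = (unsigned long)MAX_DESC_CHAIN * PAGE_SZ;  /* 128 MiB */
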
> > > > > > > 
> > > > > > > And in fact, 9p currently handles the virtio "queue size" as
> > > > > > > directly
> > > > > > > associative with its maximum amount of active 9p requests the
> > > > > > > server
> > > > > > > could
> > > > > > > 
> > > > > > > handle simultaneously:
> > > > > > >   hw/9pfs/9p.h:#define MAX_REQ         128
> > > > > > >   hw/9pfs/9p.h:    V9fsPDU pdus[MAX_REQ];
> > > > > > >   hw/9pfs/virtio-9p-device.c:    v->vq = virtio_add_queue(vdev,
> > > > > > >   MAX_REQ,
> > > > > > >   
> > > > > > >                                  handle_9p_output);
> > > > > > > 
> > > > > > > So if I would change it like this, just for the purpose to
> > > > > > > increase
> > > > > > > the
> > > > > > > max. virtio transmission size:
> > > > > > > 
> > > > > > > --- a/hw/9pfs/virtio-9p-device.c
> > > > > > > +++ b/hw/9pfs/virtio-9p-device.c
> > > > > > > @@ -218,7 +218,7 @@ static void
> > > > > > > virtio_9p_device_realize(DeviceState
> > > > > > > *dev,
> > > > > > > Error **errp)
> > > > > > > 
> > > > > > >      v->config_size = sizeof(struct virtio_9p_config) +
> > > > > > >      strlen(s->fsconf.tag);
> > > > > > >      virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
> > > > > > >      
> > > > > > >                  VIRTQUEUE_MAX_SIZE);
> > > > > > > 
> > > > > > > -    v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
> > > > > > > +    v->vq = virtio_add_queue(vdev, 32*1024, handle_9p_output);
> > > > > > > 
> > > > > > >  }
> > > > > > > 
> > > > > > > Then it would require additional synchronization code on both ends
> > > > > > > and
> > > > > > > therefore unnecessary complexity, because it would now be possible
> > > > > > > that
> > > > > > > more requests are pushed into the ringbuffer than server could
> > > > > > > handle.
> > > > > > > 
> > > > > > > There is one potential issue though that probably did justify the
> > > > > > > "don't
> > > > > > > exceed the queue size" rule:
> > > > > > > 
> > > > > > > ATM the descriptor table is allocated (just in time) as *one*
> > > > > > > continuous
> > > > > > > buffer via kmalloc_array():
> > > > > > > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086f7c7d33a4/drivers/virtio/virtio_ring.c#L440
> > > > > > > 
> > > > > > > So assuming transmission size of 2 * 128 MB that kmalloc_array()
> > > > > > > call
> > > > > > > would
> > > > > > > yield in kmalloc(1M) and the latter might fail if guest had highly
> > > > > > > fragmented physical memory. For such kind of error case there is
> > > > > > > currently a fallback path in virtqueue_add_split() that would then
> > > > > > > use
> > > > > > > the required amount of pre-allocated descriptors instead:
> > > > > > > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086f7c7d33a4/drivers/virtio/virtio_ring.c#L525
> > > > > > > 
> > > > > > > That fallback recovery path would no longer be viable if the queue
> > > > > > > size
> > > > > > > was
> > > > > > > exceeded. There would be alternatives though, e.g. by allowing to
> > > > > > > chain
> > > > > > > indirect descriptor tables (currently prohibited by the virtio
> > > > > > > specs).
> > > > > > 
> > > > > > Making the maximum number of descriptors independent of the queue
> > > > > > size
> > > > > > requires a change to the VIRTIO spec since the two values are
> > > > > > currently
> > > > > > explicitly tied together by the spec.
> > > > > 
> > > > > Yes, that's what the virtio specs say. But they don't say why, nor did
> > > > > I
> > > > > hear a reason in this discussion.
> > > > > 
> > > > > That's why I invested time reviewing current virtio implementation and
> > > > > specs, as well as actually testing exceeding that limit. And as I
> > > > > outlined in detail in my previous email, I only found one theoretical
> > > > > issue that could be addressed though.
> > > > 
> > > > I agree that there is a limitation in the VIRTIO spec, but violating the
> > > > spec isn't an acceptable solution:
> > > > 
> > > > 1. QEMU and Linux aren't the only components that implement VIRTIO. You
> > > > 
> > > >    cannot make assumptions about their implementations because it may
> > > >    break spec-compliant implementations that you haven't looked at.
> > > >    
> > > >    Your patches weren't able to increase Queue Size because some device
> > > >    implementations break when descriptor chains are too long. This shows
> > > >    there is a practical issue even in QEMU.
> > > > 
> > > > 2. The specific spec violation that we discussed creates the problem
> > > > 
> > > >    that drivers can no longer determine the maximum descriptor chain
> > > >    length. This in turn will lead to more implementation-specific
> > > >    assumptions being baked into drivers and cause problems with
> > > >    interoperability and future changes.
> > > > 
> > > > The spec needs to be extended instead. I included an idea for how to do
> > > > that below.
> > > 
> > > Sure, I just wanted to see if there was a non-neglectable "hard" show
> > > stopper per se that I probably haven't seen yet. I have not questioned
> > > aiming a clean solution.
> > > 
> > > Thanks for the clarification!
> > > 
> > > > > > Before doing that, are there benchmark results showing that 1 MB vs
> > > > > > 128
> > > > > > MB produces a performance improvement? I'm asking because if
> > > > > > performance
> > > > > > with 1 MB is good then you can probably do that without having to
> > > > > > change
> > > > > > VIRTIO and also because it's counter-intuitive that 9p needs 128 MB
> > > > > > for
> > > > > > good performance when it's ultimately implemented on top of disk and
> > > > > > network I/O that have lower size limits.
> > > > > 
> > > > > First some numbers, linear reading a 12 GB file:
> > > > > 
> > > > > msize    average      notes
> > > > > 
> > > > > 8 kB     52.0 MB/s    default msize of Linux kernel <v5.15
> > > > > 128 kB   624.8 MB/s   default msize of Linux kernel >=v5.15
> > > > > 512 kB   1961 MB/s    current max. msize with any Linux kernel <=v5.15
> > > > > 1 MB     2551 MB/s    this msize would already violate virtio specs
> > > > > 2 MB     2521 MB/s    this msize would already violate virtio specs
> > > > > 4 MB     2628 MB/s    planned max. msize of my current kernel patches
> > > > > [1]
> > > > 
> > > > How many descriptors are used? 4 MB can be covered by a single
> > > > descriptor if the data is physically contiguous in memory, so this data
> > > > doesn't demonstrate a need for more descriptors.
> > > 
> > > No, in the last couple years there was apparently no kernel version that
> > > used just one descriptor, nor did my benchmarked version. Even though the
> > > Linux 9p client uses (yet) simple linear buffers (contiguous physical
> > > memory) on 9p client level, these are however split into PAGE_SIZE chunks
> > > by function pack_sg_list() [1] before being fed to virtio level:
> > > 
> > > static unsigned int rest_of_page(void *data)
> > > {
> > > 
> > > 	return PAGE_SIZE - offset_in_page(data);
> > > 
> > > }
> > > ...
> > > static int pack_sg_list(struct scatterlist *sg, int start,
> > > 
> > > 			int limit, char *data, int count)
> > > 
> > > {
> > > 
> > > 	int s;
> > > 	int index = start;
> > > 	
> > > 	while (count) {
> > > 	
> > > 		s = rest_of_page(data);
> > > 		...
> > > 		sg_set_buf(&sg[index++], data, s);
> > > 		count -= s;
> > > 		data += s;
> > > 	
> > > 	}
> > > 	...
> > > 
> > > }
> > > 
> > > [1]
> > > https://github.com/torvalds/linux/blob/19901165d90fdca1e57c9baa0d5b4c63d15c476a/net/9p/trans_virtio.c#L171
> > > 
> > > So when sending 4MB over virtio wire, it would yield 1k descriptors
> > > ATM.
> > > 
> > > I have wondered about this before, but did not question it, because due to
> > > the cross-platform nature I couldn't say for certain whether that's
> > > probably needed somewhere. I mean for the case virtio-PCI I know for sure
> > > that one descriptor (i.e. >PAGE_SIZE) would be fine, but I don't know if
> > > that applies to all buses and architectures.
> > 
> > VIRTIO does not limit the descriptor len field to PAGE_SIZE,
> > so I don't think there is a limit at the VIRTIO level.
> 
> So you are viewing this purely from virtio specs PoV: in the sense, if it is
> not prohibited by the virtio specs, then it should work. Maybe.

Limitations must be specified either in the 9P protocol or the VIRTIO
specification. Drivers and devices will not be able to operate correctly
if there are limitations that aren't covered by the specs.

Do you have something in mind that isn't covered by the specs?

> > If this function coalesces adjacent pages then the descriptor chain
> > length issues could be reduced.
> > 
> > > > > But again, this is not just about performance. My conclusion as
> > > > > described
> > > > > in my previous email is that virtio currently squeezes
> > > > > 
> > > > > 	"max. simultaneous amount of bulk messages"
> > > > > 
> > > > > vs.
> > > > > 
> > > > > 	"max. bulk data transmission size per bulk message"
> > > > > 
> > > > > into the same configuration parameter, which is IMO inappropriate and
> > > > > hence
> > > > > splitting them into 2 separate parameters when creating a queue makes
> > > > > sense, independent of the performance benchmarks.
> > > > > 
> > > > > [1]
> > > > > https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyte.com/
> > > > 
> > > > Some devices effectively already have this because the device advertises
> > > > a maximum number of descriptors via device-specific mechanisms like the
> > > > struct virtio_blk_config seg_max field. But today these fields can only
> > > > reduce the maximum descriptor chain length because the spec still limits
> > > > the length to Queue Size.
> > > > 
> > > > We can build on this approach to raise the length above Queue Size. This
> > > > approach has the advantage that the maximum number of segments isn't per
> > > > device or per virtqueue, it's fine-grained. If the device supports two
> > > > requests types then different max descriptor chain limits could be given
> > > > for them by introducing two separate configuration space fields.
> > > > 
> > > > Here are the corresponding spec changes:
> > > > 
> > > > 1. A new feature bit called VIRTIO_RING_F_LARGE_INDIRECT_DESC is added
> > > > 
> > > >    to indicate that indirect descriptor table size and maximum
> > > >    descriptor chain length are not limited by Queue Size value. (Maybe
> > > >    there still needs to be a limit like 2^15?)
> > > 
> > > Sounds good to me!
> > > 
> > > AFAIK it is effectively limited to 2^16 because of vring_desc->next:
> > > 
> > > /* Virtio ring descriptors: 16 bytes.  These can chain together via
> > > "next". */ struct vring_desc {
> > > 
> > >         /* Address (guest-physical). */
> > >         __virtio64 addr;
> > >         /* Length. */
> > >         __virtio32 len;
> > >         /* The flags as indicated above. */
> > >         __virtio16 flags;
> > >         /* We chain unused descriptors via this, too */
> > >         __virtio16 next;
> > > 
> > > };
> > 
> > Yes, Split Virtqueues have a fundamental limit on indirect table size
> > due to the "next" field. Packed Virtqueue descriptors don't have a
> > "next" field so descriptor chains could be longer in theory (currently
> > forbidden by the spec).
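
Just to make the feature bit idea more concrete, here is a purely hypothetical
sketch of how a driver could consume it. The bit is the one proposed above, its
number is made up, and the helper function is invented for illustration; only
virtio_has_feature() exists today:

/* Hypothetical: proposed feature bit, number chosen arbitrarily here. */
#define VIRTIO_RING_F_LARGE_INDIRECT_DESC  36

static u32 p9_virtio_max_chain_len(struct virtio_device *vdev, u32 queue_size)
{
	if (virtio_has_feature(vdev, VIRTIO_RING_F_LARGE_INDIRECT_DESC))
		return 32 * 1024;   /* still capped by the 16-bit "next" field */
	return queue_size;          /* today's rule: chain length <= Queue Size */
}
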
> > 
> > > > One thing that's messy is that we've been discussing the maximum
> > > > descriptor chain length but 9p has the "msize" concept, which isn't
> > > > aware of contiguous memory. It may be necessary to extend the 9p driver
> > > > code to size requests not just according to their length in bytes but
> > > > also according to the descriptor chain length. That's how the Linux
> > > > block layer deals with queue limits (struct queue_limits max_segments vs
> > > > max_hw_sectors).
> > > 
> > > Hmm, can't follow on that one. For what should that be needed in case of
> > > 9p? My plan was to limit msize by 9p client simply at session start to
> > > whatever is the max. amount virtio descriptors supported by host and
> > > using PAGE_SIZE as size per descriptor, because that's what 9p client
> > > actually does ATM (see above). So you think that should be changed to
> > > e.g. just one descriptor for 4MB, right?
> > 
> > Limiting msize to the 9p transport device's maximum number of
> > descriptors is conservative (i.e. 128 descriptors = 512 KB msize)
> > because it doesn't take advantage of contiguous memory. I suggest
> > leaving msize alone, adding a separate limit at which requests are split
> > according to the maximum descriptor chain length, and tweaking
> > pack_sg_list() to coalesce adjacent pages.
> > 
> > That way msize can be large without necessarily using lots of
> > descriptors (depending on the memory layout).
> 
> That was actually a tempting solution, because it would not require
> changes to the virtio specs (at least for a while) and it would also work with
> older QEMU versions. And for that pack_sg_list() portion of the code it would
> work well and easy as the buffer passed to pack_sg_list() is contiguous
> already.
> 
> However I just realized for the zero-copy version of the code that would be
> more tricky. The ZC version already uses individual pages (struct page, hence
> PAGE_SIZE each) which are pinned, i.e. it uses pack_sg_list_p() [1] in
> combination with p9_get_mapped_pages() [2]
> 
> [1] https://github.com/torvalds/linux/blob/7ddb58cb0ecae8e8b6181d736a87667cc9ab8389/net/9p/trans_virtio.c#L218
> [2] https://github.com/torvalds/linux/blob/7ddb58cb0ecae8e8b6181d736a87667cc9ab8389/net/9p/trans_virtio.c#L309
> 
> So that would require much more work and code trying to sort and coalesce
> individual pages to contiguous physical memory for the sake of reducing virtio
> descriptors. And there is no guarantee that this is even possible. The kernel
> may simply return a non-contiguous set of pages which would eventually end up
> exceeding the virtio descriptor limit again.

Order must be preserved so pages cannot be sorted by physical address.
How about simply coalescing when pages are adjacent?
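
Roughly something like this (untested sketch; it reuses the existing
rest_of_page() helper, assumes lowmem buffers where virtually adjacent chunks
are also physically adjacent, and leaves out the sg end-marking that the real
pack_sg_list() does):

static int pack_sg_list_coalesced(struct scatterlist *sg, int start,
				  int limit, char *data, int count)
{
	int s;
	int index = start;

	while (count) {
		s = rest_of_page(data);
		if (s > count)
			s = count;
		if (index > start &&
		    (char *)sg_virt(&sg[index - 1]) + sg[index - 1].length == data) {
			/* contiguous with the previous chunk: grow that entry
			 * instead of consuming another descriptor */
			sg[index - 1].length += s;
		} else {
			BUG_ON(index >= limit);
			sg_set_buf(&sg[index++], data, s);
		}
		count -= s;
		data += s;
	}
	return index - start;
}

For the kmalloc'd linear buffers the client uses today this would collapse the
payload into very few entries; the zero-copy path would only merge where the
pinned pages happen to be adjacent.
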

> So looks like it was probably still easier and realistic to just add virtio
> capabilities for now for allowing to exceed current descriptor limit.

I'm still not sure why virtio-net, virtio-blk, virtio-fs, etc perform
fine under today's limits while virtio-9p needs a much higher limit to
achieve good performance. Maybe there is an issue in a layer above the
vring that's causing the virtio-9p performance you've observed?

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
  2021-11-09 10:56                             ` [Virtio-fs] " Stefan Hajnoczi
@ 2021-11-09 13:09                               ` Christian Schoenebeck
  -1 siblings, 0 replies; 97+ messages in thread
From: Christian Schoenebeck @ 2021-11-09 13:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: Stefan Hajnoczi, Kevin Wolf, Laurent Vivier, qemu-block,
	Michael S. Tsirkin, Jason Wang, Amit Shah, David Hildenbrand,
	Greg Kurz, virtio-fs, Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Paolo Bonzini, Marc-André Lureau, Fam Zheng,
	Raphael Norwitz, Dr. David Alan Gilbert

On Dienstag, 9. November 2021 11:56:35 CET Stefan Hajnoczi wrote:
> On Thu, Nov 04, 2021 at 03:41:23PM +0100, Christian Schoenebeck wrote:
> > On Mittwoch, 3. November 2021 12:33:33 CET Stefan Hajnoczi wrote:
> > > On Mon, Nov 01, 2021 at 09:29:26PM +0100, Christian Schoenebeck wrote:
> > > > On Donnerstag, 28. Oktober 2021 11:00:48 CET Stefan Hajnoczi wrote:
> > > > > > On Mon, Oct 25, 2021 at 05:03:25PM +0200, Christian Schoenebeck wrote:
> > > > > > On Montag, 25. Oktober 2021 12:30:41 CEST Stefan Hajnoczi wrote:
> > > > > > > > On Thu, Oct 21, 2021 at 05:39:28PM +0200, Christian Schoenebeck wrote:
> > > > > > > > > On Freitag, 8. Oktober 2021 18:08:48 CEST Christian Schoenebeck wrote:
> > > > > > > > > > On Freitag, 8. Oktober 2021 16:24:42 CEST Christian Schoenebeck wrote:
> > > > > > > > > > On Freitag, 8. Oktober 2021 09:25:33 CEST Greg Kurz wrote:
> > > > > > > > > > > On Thu, 7 Oct 2021 16:42:49 +0100
> > > > > > > > > > > 
> > > > > > > > > > > Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > > > > > > > On Thu, Oct 07, 2021 at 02:51:55PM +0200, Christian Schoenebeck wrote:
> > > > > > > > > > > > > On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan Hajnoczi wrote:
> > > > > > > > > > > > > > On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian Schoenebeck wrote:
> > > > > > > > > > > > > > > At the moment the maximum transfer size with
> > > > > > > > > > > > > > > virtio
> > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > limited
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > 4M
> > > > > > > > > > > > > > > (1024 * PAGE_SIZE). This series raises this
> > > > > > > > > > > > > > > limit to
> > > > > > > > > > > > > > > its
> > > > > > > > > > > > > > > maximum
> > > > > > > > > > > > > > > theoretical possible transfer size of 128M (32k
> > > > > > > > > > > > > > > pages)
> > > > > > > > > > > > > > > according
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > virtio specs:
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-240006
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Hi Christian,
> > > > > > > > > > > 
> > > > > > > > > > > > > > I took a quick look at the code:
> > > > > > > > > > > Hi,
> > > > > > > > > > > 
> > > > > > > > > > > Thanks Stefan for sharing virtio expertise and helping
> > > > > > > > > > > Christian
> > > > > > > > > > > !
> > > > > > > > > > > 
> > > > > > > > > > > > > > - The Linux 9p driver restricts descriptor chains
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > 128
> > > > > > > > > > > > > > elements
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > >   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Yes, that's the limitation that I am about to remove
> > > > > > > > > > > > > (WIP);
> > > > > > > > > > > > > current
> > > > > > > > > > > > > kernel
> > > > > > > > > > > > > patches:
> > > > > > > > > > > > > https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyte.com/>
> > > > > > > > > > > > 
> > > > > > > > > > > > I haven't read the patches yet but I'm concerned that
> > > > > > > > > > > > today
> > > > > > > > > > > > the
> > > > > > > > > > > > driver
> > > > > > > > > > > > is pretty well-behaved and this new patch series
> > > > > > > > > > > > introduces a
> > > > > > > > > > > > spec
> > > > > > > > > > > > violation. Not fixing existing spec violations is
> > > > > > > > > > > > okay,
> > > > > > > > > > > > but
> > > > > > > > > > > > adding
> > > > > > > > > > > > new
> > > > > > > > > > > > ones is a red flag. I think we need to figure out a
> > > > > > > > > > > > clean
> > > > > > > > > > > > solution.
> > > > > > > > > > 
> > > > > > > > > > Nobody has reviewed the kernel patches yet. My main
> > > > > > > > > > concern
> > > > > > > > > > therefore
> > > > > > > > > > actually is that the kernel patches are already too
> > > > > > > > > > complex,
> > > > > > > > > > because
> > > > > > > > > > the
> > > > > > > > > > current situation is that only Dominique is handling 9p
> > > > > > > > > > patches on
> > > > > > > > > > kernel
> > > > > > > > > > side, and he barely has time for 9p anymore.
> > > > > > > > > > 
> > > > > > > > > > Another reason for me to catch up on reading current
> > > > > > > > > > kernel
> > > > > > > > > > code
> > > > > > > > > > and
> > > > > > > > > > stepping in as reviewer of 9p on kernel side ASAP,
> > > > > > > > > > independent
> > > > > > > > > > of
> > > > > > > > > > this
> > > > > > > > > > issue.
> > > > > > > > > > 
> > > > > > > > > > As for current kernel patches' complexity: I can certainly
> > > > > > > > > > drop
> > > > > > > > > > patch
> > > > > > > > > > 7
> > > > > > > > > > entirely as it is probably just overkill. Patch 4 is then
> > > > > > > > > > the
> > > > > > > > > > biggest
> > > > > > > > > > chunk, I have to see if I can simplify it, and whether it
> > > > > > > > > > would
> > > > > > > > > > make
> > > > > > > > > > sense to squash with patch 3.
> > > > > > > > > > 
> > > > > > > > > > > > > > - The QEMU 9pfs code passes iovecs directly to
> > > > > > > > > > > > > > preadv(2)
> > > > > > > > > > > > > > and
> > > > > > > > > > > > > > will
> > > > > > > > > > > > > > fail
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > >   with EINVAL when called with more than IOV_MAX
> > > > > > > > > > > > > >   iovecs
> > > > > > > > > > > > > >   (hw/9pfs/9p.c:v9fs_read())
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Hmm, which makes me wonder why I never encountered
> > > > > > > > > > > > > this
> > > > > > > > > > > > > error
> > > > > > > > > > > > > during
> > > > > > > > > > > > > testing.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Most people will use the 9p qemu 'local' fs driver
> > > > > > > > > > > > > backend
> > > > > > > > > > > > > in
> > > > > > > > > > > > > practice,
> > > > > > > > > > > > > so
> > > > > > > > > > > > > that v9fs_read() call would translate for most
> > > > > > > > > > > > > people to
> > > > > > > > > > > > > this
> > > > > > > > > > > > > implementation on QEMU side (hw/9p/9p-local.c):
> > > > > > > > > > > > > 
> > > > > > > > > > > > > static ssize_t local_preadv(FsContext *ctx,
> > > > > > > > > > > > > V9fsFidOpenState
> > > > > > > > > > > > > *fs,
> > > > > > > > > > > > > 
> > > > > > > > > > > > >                             const struct iovec *iov,
> > > > > > > > > > > > >                             int iovcnt, off_t
> > > > > > > > > > > > >                             offset)
> > > > > > > > > > > > > 
> > > > > > > > > > > > > {
> > > > > > > > > > > > > #ifdef CONFIG_PREADV
> > > > > > > > > > > > > 
> > > > > > > > > > > > >     return preadv(fs->fd, iov, iovcnt, offset);
> > > > > > > > > > > > > 
> > > > > > > > > > > > > #else
> > > > > > > > > > > > > 
> > > > > > > > > > > > >     int err = lseek(fs->fd, offset, SEEK_SET);
> > > > > > > > > > > > >     if (err == -1) {
> > > > > > > > > > > > >     
> > > > > > > > > > > > >         return err;
> > > > > > > > > > > > >     
> > > > > > > > > > > > >     } else {
> > > > > > > > > > > > >     
> > > > > > > > > > > > >         return readv(fs->fd, iov, iovcnt);
> > > > > > > > > > > > >     
> > > > > > > > > > > > >     }
> > > > > > > > > > > > > 
> > > > > > > > > > > > > #endif
> > > > > > > > > > > > > }
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > Unless I misunderstood the code, neither side can
> > > > > > > > > > > > > > take
> > > > > > > > > > > > > > advantage
> > > > > > > > > > > > > > of
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > new 32k descriptor chain limit?
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > Stefan
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I need to check that when I have some more time. One
> > > > > > > > > > > > > possible
> > > > > > > > > > > > > explanation
> > > > > > > > > > > > > might be that preadv() already has this wrapped into
> > > > > > > > > > > > > a
> > > > > > > > > > > > > loop
> > > > > > > > > > > > > in
> > > > > > > > > > > > > its
> > > > > > > > > > > > > implementation to circumvent a limit like IOV_MAX.
> > > > > > > > > > > > > It
> > > > > > > > > > > > > might
> > > > > > > > > > > > > be
> > > > > > > > > > > > > another
> > > > > > > > > > > > > "it
> > > > > > > > > > > > > works, but not portable" issue, but not sure.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > There are still a bunch of other issues I have to
> > > > > > > > > > > > > resolve.
> > > > > > > > > > > > > If
> > > > > > > > > > > > > you
> > > > > > > > > > > > > look
> > > > > > > > > > > > > at
> > > > > > > > > > > > > net/9p/client.c on kernel side, you'll notice that
> > > > > > > > > > > > > it
> > > > > > > > > > > > > basically
> > > > > > > > > > > > > does
> > > > > > > > > > > > > this ATM:
> > > > > > > > > > > > > 
> > > > > > > > > > > > >     kmalloc(msize);
> > > > > > > > > > > 
> > > > > > > > > > > Note that this is done twice : once for the T message
> > > > > > > > > > > (client
> > > > > > > > > > > request)
> > > > > > > > > > > and
> > > > > > > > > > > once for the R message (server answer). The 9p driver
> > > > > > > > > > > could
> > > > > > > > > > > adjust
> > > > > > > > > > > the
> > > > > > > > > > > size
> > > > > > > > > > > of the T message to what's really needed instead of
> > > > > > > > > > > allocating
> > > > > > > > > > > the
> > > > > > > > > > > full
> > > > > > > > > > > msize. R message size is not known though.
> > > > > > > > > > 
> > > > > > > > > > Would it make sense adding a second virtio ring, dedicated
> > > > > > > > > > to
> > > > > > > > > > server
> > > > > > > > > > responses to solve this? IIRC 9p server already calculates
> > > > > > > > > > appropriate
> > > > > > > > > > exact sizes for each response type. So server could just
> > > > > > > > > > push
> > > > > > > > > > space
> > > > > > > > > > that's
> > > > > > > > > > really needed for its responses.
> > > > > > > > > > 
> > > > > > > > > > > > > for every 9p request. So not only does it allocate
> > > > > > > > > > > > > much
> > > > > > > > > > > > > more
> > > > > > > > > > > > > memory
> > > > > > > > > > > > > for
> > > > > > > > > > > > > every request than actually required (i.e. say 9pfs
> > > > > > > > > > > > > was
> > > > > > > > > > > > > mounted
> > > > > > > > > > > > > with
> > > > > > > > > > > > > msize=8M, then a 9p request that actually would just
> > > > > > > > > > > > > need 1k
> > > > > > > > > > > > > would
> > > > > > > > > > > > > nevertheless allocate 8M), but also it allocates >
> > > > > > > > > > > > > PAGE_SIZE,
> > > > > > > > > > > > > which
> > > > > > > > > > > > > obviously may fail at any time.
> > > > > > > > > > > > 
> > > > > > > > > > > > The PAGE_SIZE limitation sounds like a kmalloc() vs
> > > > > > > > > > > > vmalloc()
> > > > > > > > > > > > situation.
> > > > > > > > > > 
> > > > > > > > > > Hu, I didn't even consider vmalloc(). I just tried the
> > > > > > > > > > kvmalloc()
> > > > > > > > > > wrapper
> > > > > > > > > > as a quick & dirty test, but it crashed in the same way as
> > > > > > > > > > kmalloc()
> > > > > > > > > > with
> > > > > > > > > > large msize values immediately on mounting:
> > > > > > > > > > 
> > > > > > > > > > diff --git a/net/9p/client.c b/net/9p/client.c
> > > > > > > > > > index a75034fa249b..cfe300a4b6ca 100644
> > > > > > > > > > --- a/net/9p/client.c
> > > > > > > > > > +++ b/net/9p/client.c
> > > > > > > > > > @@ -227,15 +227,18 @@ static int parse_opts(char *opts,
> > > > > > > > > > struct
> > > > > > > > > > p9_client
> > > > > > > > > > *clnt)
> > > > > > > > > > 
> > > > > > > > > >  static int p9_fcall_init(struct p9_client *c, struct
> > > > > > > > > >  p9_fcall
> > > > > > > > > >  *fc,
> > > > > > > > > >  
> > > > > > > > > >                          int alloc_msize)
> > > > > > > > > >  
> > > > > > > > > >  {
> > > > > > > > > > 
> > > > > > > > > > -       if (likely(c->fcall_cache) && alloc_msize ==
> > > > > > > > > > c->msize)
> > > > > > > > > > {
> > > > > > > > > > +       //if (likely(c->fcall_cache) && alloc_msize ==
> > > > > > > > > > c->msize) {
> > > > > > > > > > +       if (false) {
> > > > > > > > > > 
> > > > > > > > > >                 fc->sdata =
> > > > > > > > > >                 kmem_cache_alloc(c->fcall_cache,
> > > > > > > > > >                 GFP_NOFS);
> > > > > > > > > >                 fc->cache = c->fcall_cache;
> > > > > > > > > >         
> > > > > > > > > >         } else {
> > > > > > > > > > 
> > > > > > > > > > -               fc->sdata = kmalloc(alloc_msize,
> > > > > > > > > > GFP_NOFS);
> > > > > > > > > > +               fc->sdata = kvmalloc(alloc_msize,
> > > > > > > > > > GFP_NOFS);
> > > > > > > > > 
> > > > > > > > > Ok, GFP_NOFS -> GFP_KERNEL did the trick.
> > > > > > > > > 
> > > > > > > > > Now I get:
> > > > > > > > >    virtio: bogus descriptor or out of resources
> > > > > > > > > 
> > > > > > > > > So, still some work ahead on both ends.
> > > > > > > > 
> > > > > > > > Few hacks later (only changes on 9p client side) I got this
> > > > > > > > running
> > > > > > > > stable
> > > > > > > > now. The reason for the virtio error above was that kvmalloc()
> > > > > > > > returns
> > > > > > > > a
> > > > > > > > non-logical kernel address for any kvmalloc(>4M), i.e. an
> > > > > > > > address
> > > > > > > > that
> > > > > > > > is
> > > > > > > > inaccessible from host side, hence that "bogus descriptor"
> > > > > > > > message
> > > > > > > > by
> > > > > > > > QEMU.
> > > > > > > > So I had to split those linear 9p client buffers into sparse
> > > > > > > > ones
> > > > > > > > (set
> > > > > > > > of
> > > > > > > > individual pages).
> > > > > > > > 
> > > > > > > > I tested this for some days with various virtio transmission
> > > > > > > > sizes
> > > > > > > > and
> > > > > > > > it
> > > > > > > > works as expected up to 128 MB (more precisely: 128 MB read
> > > > > > > > space
> > > > > > > > +
> > > > > > > > 128 MB
> > > > > > > > write space per virtio round trip message).
> > > > > > > > 
> > > > > > > > I did not encounter a show stopper for large virtio
> > > > > > > > transmission
> > > > > > > > sizes
> > > > > > > > (4 MB ... 128 MB) on virtio level, neither as a result of
> > > > > > > > testing,
> > > > > > > > nor
> > > > > > > > after reviewing the existing code.
> > > > > > > > 
> > > > > > > > About IOV_MAX: that's apparently not an issue on virtio level.
> > > > > > > > Most of
> > > > > > > > the
> > > > > > > > iovec code, both on Linux kernel side and on QEMU side do not
> > > > > > > > have
> > > > > > > > this
> > > > > > > > limitation. It is apparently however indeed a limitation for
> > > > > > > > userland
> > > > > > > > apps
> > > > > > > > calling the Linux kernel's syscalls yet.
> > > > > > > > 
> > > > > > > > Stefan, as it stands now, I am even more convinced that the
> > > > > > > > upper
> > > > > > > > virtio
> > > > > > > > transmission size limit should not be squeezed into the queue
> > > > > > > > size
> > > > > > > > argument of virtio_add_queue(). Not because of the previous
> > > > > > > > argument
> > > > > > > > that
> > > > > > > > it would waste space (~1MB), but rather because they are two
> > > > > > > > different
> > > > > > > > things. To outline this, just a quick recap of what happens
> > > > > > > > exactly
> > > > > > > > when
> > > > > > > > a bulk message is pushed over the virtio wire (assuming virtio
> > > > > > > > "split"
> > > > > > > > layout here):
> > > > > > > > 
> > > > > > > > ---------- [recap-start] ----------
> > > > > > > > 
> > > > > > > > For each bulk message sent guest <-> host, exactly *one* of
> > > > > > > > the
> > > > > > > > pre-allocated descriptors is taken and placed (subsequently)
> > > > > > > > into
> > > > > > > > exactly
> > > > > > > > *one* position of the two available/used ring buffers. The
> > > > > > > > actual
> > > > > > > > descriptor table though, containing all the DMA addresses of
> > > > > > > > the
> > > > > > > > message
> > > > > > > > bulk data, is allocated just in time for each round trip
> > > > > > > > message.
> > > > > > > > Say,
> > > > > > > > it
> > > > > > > > is the first message sent, it yields the following
> > > > > > > > structure:
> > > > > > > > 
> > > > > > > > Ring Buffer   Descriptor Table      Bulk Data Pages
> > > > > > > > 
> > > > > > > >    +-+              +-+           +-----------------+
> > > > > > > >    
> > > > > > > >    |D|------------->|d|---------->| Bulk data block |
> > > > > > > >    
> > > > > > > >    +-+              |d|--------+  +-----------------+
> > > > > > > >    
> > > > > > > >    | |              |d|------+ |
> > > > > > > >    
> > > > > > > >    +-+               .       | |  +-----------------+
> > > > > > > >    
> > > > > > > >    | |               .       | +->| Bulk data block |
> > > > > > > >     
> > > > > > > >     .                .       |    +-----------------+
> > > > > > > >     .               |d|-+    |
> > > > > > > >     .               +-+ |    |    +-----------------+
> > > > > > > >     
> > > > > > > >    | |                  |    +--->| Bulk data block |
> > > > > > > >    
> > > > > > > >    +-+                  |         +-----------------+
> > > > > > > >    
> > > > > > > >    | |                  |                 .
> > > > > > > >    
> > > > > > > >    +-+                  |                 .
> > > > > > > >    
> > > > > > > >                         |                 .
> > > > > > > >                         |         
> > > > > > > >                         |         +-----------------+
> > > > > > > >                         
> > > > > > > >                         +-------->| Bulk data block |
> > > > > > > >                         
> > > > > > > >                                   +-----------------+
> > > > > > > > 
> > > > > > > > Legend:
> > > > > > > > D: pre-allocated descriptor
> > > > > > > > d: just in time allocated descriptor
> > > > > > > > -->: memory pointer (DMA)
> > > > > > > > 
> > > > > > > > The bulk data blocks are allocated by the respective device
> > > > > > > > driver
> > > > > > > > above
> > > > > > > > virtio subsystem level (guest side).
> > > > > > > > 
> > > > > > > > There are exactly as many descriptors pre-allocated (D) as the
> > > > > > > > size of
> > > > > > > > a
> > > > > > > > ring buffer.
> > > > > > > > 
> > > > > > > > A "descriptor" is more or less just a chainable DMA memory
> > > > > > > > pointer;
> > > > > > > > defined
> > > > > > > > as:
> > > > > > > > 
> > > > > > > > /* Virtio ring descriptors: 16 bytes.  These can chain
> > > > > > > > together
> > > > > > > > via
> > > > > > > > "next". */ struct vring_desc {
> > > > > > > > 
> > > > > > > > 	/* Address (guest-physical). */
> > > > > > > > 	__virtio64 addr;
> > > > > > > > 	/* Length. */
> > > > > > > > 	__virtio32 len;
> > > > > > > > 	/* The flags as indicated above. */
> > > > > > > > 	__virtio16 flags;
> > > > > > > > 	/* We chain unused descriptors via this, too */
> > > > > > > > 	__virtio16 next;
> > > > > > > > 
> > > > > > > > };
> > > > > > > > 
> > > > > > > > There are 2 ring buffers; the "available" ring buffer is for
> > > > > > > > sending a
> > > > > > > > message guest->host (which will transmit DMA addresses of
> > > > > > > > guest
> > > > > > > > allocated
> > > > > > > > bulk data blocks that are used for data sent to device, and
> > > > > > > > separate
> > > > > > > > guest allocated bulk data blocks that will be used by host
> > > > > > > > side to
> > > > > > > > place
> > > > > > > > its response bulk data), and the "used" ring buffer is for
> > > > > > > > sending
> > > > > > > > host->guest to let guest know about host's response and that
> > > > > > > > it
> > > > > > > > could
> > > > > > > > now
> > > > > > > > safely consume and then deallocate the bulk data blocks
> > > > > > > > subsequently.
> > > > > > > > 
> > > > > > > > ---------- [recap-end] ----------
> > > > > > > > 
> > > > > > > > So the "queue size" actually defines the ringbuffer size. It
> > > > > > > > does
> > > > > > > > not
> > > > > > > > define the maximum amount of descriptors. The "queue size"
> > > > > > > > rather
> > > > > > > > defines
> > > > > > > > how many pending messages can be pushed into either one
> > > > > > > > ringbuffer
> > > > > > > > before
> > > > > > > > the other side would need to wait until the counter side would
> > > > > > > > step up
> > > > > > > > (i.e. ring buffer full).
> > > > > > > > 
> > > > > > > > The maximum amount of descriptors (what VIRTQUEUE_MAX_SIZE
> > > > > > > > actually
> > > > > > > > is)
> > > > > > > > OTOH defines the max. bulk data size that could be transmitted
> > > > > > > > with
> > > > > > > > each
> > > > > > > > virtio round trip message.
> > > > > > > > 
> > > > > > > > And in fact, 9p currently handles the virtio "queue size" as
> > > > > > > > directly
> > > > > > > > associated with its maximum amount of active 9p requests the
> > > > > > > > server
> > > > > > > > could
> > > > > > > > 
> > > > > > > > handle simultaneously:
> > > > > > > >   hw/9pfs/9p.h:#define MAX_REQ         128
> > > > > > > >   hw/9pfs/9p.h:    V9fsPDU pdus[MAX_REQ];
> > > > > > > >   hw/9pfs/virtio-9p-device.c:    v->vq =
> > > > > > > >   virtio_add_queue(vdev,
> > > > > > > >   MAX_REQ,
> > > > > > > >   
> > > > > > > >                                  handle_9p_output);
> > > > > > > > 
> > > > > > > > So if I would change it like this, just for the purpose to
> > > > > > > > increase
> > > > > > > > the
> > > > > > > > max. virtio transmission size:
> > > > > > > > 
> > > > > > > > --- a/hw/9pfs/virtio-9p-device.c
> > > > > > > > +++ b/hw/9pfs/virtio-9p-device.c
> > > > > > > > @@ -218,7 +218,7 @@ static void
> > > > > > > > virtio_9p_device_realize(DeviceState
> > > > > > > > *dev,
> > > > > > > > Error **errp)
> > > > > > > > 
> > > > > > > >      v->config_size = sizeof(struct virtio_9p_config) +
> > > > > > > >      strlen(s->fsconf.tag);
> > > > > > > >      virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P,
> > > > > > > >      v->config_size,
> > > > > > > >      
> > > > > > > >                  VIRTQUEUE_MAX_SIZE);
> > > > > > > > 
> > > > > > > > -    v->vq = virtio_add_queue(vdev, MAX_REQ,
> > > > > > > > handle_9p_output);
> > > > > > > > +    v->vq = virtio_add_queue(vdev, 32*1024,
> > > > > > > > handle_9p_output);
> > > > > > > > 
> > > > > > > >  }
> > > > > > > > 
> > > > > > > > Then it would require additional synchronization code on both
> > > > > > > > ends
> > > > > > > > and
> > > > > > > > therefore unnecessary complexity, because it would now be
> > > > > > > > possible
> > > > > > > > that
> > > > > > > > more requests are pushed into the ringbuffer than server could
> > > > > > > > handle.
> > > > > > > > 
> > > > > > > > There is one potential issue though that probably did justify
> > > > > > > > the
> > > > > > > > "don't
> > > > > > > > exceed the queue size" rule:
> > > > > > > > 
> > > > > > > > ATM the descriptor table is allocated (just in time) as *one*
> > > > > > > > continuous
> > > > > > > > buffer via kmalloc_array():
> > > > > > > > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086f7c7d33a4/drivers/virtio/virtio_ring.c#L440
> > > > > > > > 
> > > > > > > > So assuming transmission size of 2 * 128 MB that
> > > > > > > > kmalloc_array()
> > > > > > > > call
> > > > > > > > would
> > > > > > > > yield in kmalloc(1M) and the latter might fail if guest had
> > > > > > > > highly
> > > > > > > > fragmented physical memory. For such kind of error case there
> > > > > > > > is
> > > > > > > > currently a fallback path in virtqueue_add_split() that would
> > > > > > > > then
> > > > > > > > use
> > > > > > > > the required amount of pre-allocated descriptors instead:
> > > > > > > > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086f7c7d33a4/drivers/virtio/virtio_ring.c#L525
> > > > > > > > 
> > > > > > > > That fallback recovery path would no longer be viable if the
> > > > > > > > queue
> > > > > > > > size
> > > > > > > > was
> > > > > > > > exceeded. There would be alternatives though, e.g. by allowing
> > > > > > > > to
> > > > > > > > chain
> > > > > > > > indirect descriptor tables (currently prohibited by the virtio
> > > > > > > > specs).
> > > > > > > 
> > > > > > > Making the maximum number of descriptors independent of the
> > > > > > > queue
> > > > > > > size
> > > > > > > requires a change to the VIRTIO spec since the two values are
> > > > > > > currently
> > > > > > > explicitly tied together by the spec.
> > > > > > 
> > > > > > Yes, that's what the virtio specs say. But they don't say why, nor
> > > > > > did
> > > > > > I
> > > > > > hear a reason in this discussion.
> > > > > > 
> > > > > > That's why I invested time reviewing current virtio implementation
> > > > > > and
> > > > > > specs, as well as actually testing exceeding that limit. And as I
> > > > > > outlined in detail in my previous email, I only found one
> > > > > > theoretical
> > > > > > issue that could be addressed though.
> > > > > 
> > > > > I agree that there is a limitation in the VIRTIO spec, but violating
> > > > > the
> > > > > spec isn't an acceptable solution:
> > > > > 
> > > > > 1. QEMU and Linux aren't the only components that implement VIRTIO.
> > > > > You
> > > > > 
> > > > >    cannot make assumptions about their implementations because it
> > > > >    may
> > > > >    break spec-compliant implementations that you haven't looked at.
> > > > >    
> > > > >    Your patches weren't able to increase Queue Size because some
> > > > >    device
> > > > >    implementations break when descriptor chains are too long. This
> > > > >    shows
> > > > >    there is a practical issue even in QEMU.
> > > > > 
> > > > > 2. The specific spec violation that we discussed creates the problem
> > > > > 
> > > > >    that drivers can no longer determine the maximum descriptor
> > > > >    chain
> > > > >    length. This in turn will lead to more implementation-specific
> > > > >    assumptions being baked into drivers and cause problems with
> > > > >    interoperability and future changes.
> > > > > 
> > > > > The spec needs to be extended instead. I included an idea for how to
> > > > > do
> > > > > that below.
> > > > 
> > > > Sure, I just wanted to see if there was a non-neglectable "hard" show
> > > > stopper per se that I probably haven't seen yet. I have not questioned
> > > > aiming for a clean solution.
> > > > 
> > > > Thanks for the clarification!
> > > > 
> > > > > > > Before doing that, are there benchmark results showing that 1 MB
> > > > > > > vs
> > > > > > > 128
> > > > > > > MB produces a performance improvement? I'm asking because if
> > > > > > > performance
> > > > > > > with 1 MB is good then you can probably do that without having
> > > > > > > to
> > > > > > > change
> > > > > > > VIRTIO and also because it's counter-intuitive that 9p needs 128
> > > > > > > MB
> > > > > > > for
> > > > > > > good performance when it's ultimately implemented on top of disk
> > > > > > > and
> > > > > > > network I/O that have lower size limits.
> > > > > > 
> > > > > > First some numbers, linear reading a 12 GB file:
> > > > > > 
> > > > > > msize    average      notes
> > > > > > 
> > > > > > 8 kB     52.0 MB/s    default msize of Linux kernel <v5.15
> > > > > > 128 kB   624.8 MB/s   default msize of Linux kernel >=v5.15
> > > > > > 512 kB   1961 MB/s    current max. msize with any Linux kernel
> > > > > > <=v5.15
> > > > > > 1 MB     2551 MB/s    this msize would already violate virtio
> > > > > > specs
> > > > > > 2 MB     2521 MB/s    this msize would already violate virtio
> > > > > > specs
> > > > > > 4 MB     2628 MB/s    planned max. msize of my current kernel
> > > > > > patches
> > > > > > [1]
> > > > > 
> > > > > How many descriptors are used? 4 MB can be covered by a single
> > > > > descriptor if the data is physically contiguous in memory, so this
> > > > > data
> > > > > doesn't demonstrate a need for more descriptors.
> > > > 
> > > > No, in the last couple of years there was apparently no kernel version
> > > > that used just one descriptor, nor did my benchmarked version. Even though
> > > > the Linux 9p client (still) uses simple linear buffers (contiguous physical
> > > > memory) at the 9p client level, these are split into PAGE_SIZE chunks by
> > > > the function pack_sg_list() [1] before being fed to the virtio level:
> > > > 
> > > > static unsigned int rest_of_page(void *data)
> > > > {
> > > > 	return PAGE_SIZE - offset_in_page(data);
> > > > }
> > > > ...
> > > > static int pack_sg_list(struct scatterlist *sg, int start,
> > > > 			int limit, char *data, int count)
> > > > {
> > > > 	int s;
> > > > 	int index = start;
> > > > 
> > > > 	while (count) {
> > > > 		s = rest_of_page(data);
> > > > 		...
> > > > 		sg_set_buf(&sg[index++], data, s);
> > > > 		count -= s;
> > > > 		data += s;
> > > > 	}
> > > > 	...
> > > > }
> > > > 
> > > > [1]
> > > > https://github.com/torvalds/linux/blob/19901165d90fdca1e57c9baa0d5b4c63d15c476a/net/9p/trans_virtio.c#L171
> > > > 
> > > > So when sending 4 MB over the virtio wire, it currently yields 1k descriptors.
> > > > 
> > > > I have wondered about this before, but did not question it, because due
> > > > to the cross-platform nature I couldn't say for certain whether that
> > > > splitting is perhaps needed somewhere. For virtio-PCI I know for sure
> > > > that one descriptor (i.e. >PAGE_SIZE) would be fine, but I don't know
> > > > if that applies to all buses and architectures.
> > > 
> > > VIRTIO does not limit the descriptor len field to PAGE_SIZE,
> > > so I don't think there is a limit at the VIRTIO level.
> > 
> > So you are viewing this purely from the virtio specs' PoV: in the sense that
> > if it is not prohibited by the virtio specs, then it should work. Maybe.
> 
> Limitations must be specified either in the 9P protocol or the VIRTIO
> specification. Drivers and devices will not be able to operate correctly
> if there are limitations that aren't covered by the specs.
> 
> Do you have something in mind that isn't covered by the specs?

Not sure whether that's something that should be specified by the virtio
specs, probably not. I simply do not know whether there is any bus or
architecture that limits the maximum size of a memory block passed per DMA
address.
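
For what it's worth, the kernel does have a per-device answer to that
question via the DMA mapping layer: IIRC virtio-blk for instance caps its
segment size with virtio_max_dma_size(). Purely as an illustrative, untested
sketch (the helper name and its placement are made up here, this is not part
of my patches), the 9p transport could in principle query that limit instead
of hard-coding PAGE_SIZE:

/* Hypothetical sketch, untested: ask the virtio core / DMA layer how large a
 * single physically contiguous segment may be on this transport, and fall
 * back to PAGE_SIZE-sized segments if the reported limit is not useful.
 */
static size_t p9_virtio_max_seg_size(struct virtio_device *vdev)
{
	size_t max = virtio_max_dma_size(vdev); /* SIZE_MAX if unrestricted */

	return max >= PAGE_SIZE ? max : PAGE_SIZE;
}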

> > > If this function coalesces adjacent pages then the descriptor chain
> > > length issues could be reduced.
> > > 
> > > > > > But again, this is not just about performance. My conclusion as
> > > > > > described
> > > > > > in my previous email is that virtio currently squeezes
> > > > > > 
> > > > > > 	"max. simultaneous amount of bulk messages"
> > > > > > 
> > > > > > vs.
> > > > > > 
> > > > > > 	"max. bulk data transmission size per bulk message"
> > > > > > 
> > > > > > into the same configuration parameter, which is IMO inappropriate
> > > > > > and
> > > > > > hence
> > > > > > splitting them into 2 separate parameters when creating a queue
> > > > > > makes
> > > > > > sense, independent of the performance benchmarks.
> > > > > > 
> > > > > > [1]
> > > > > > https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyte.com/
> > > > > 
> > > > > Some devices effectively already have this because the device
> > > > > advertises
> > > > > a maximum number of descriptors via device-specific mechanisms like
> > > > > the
> > > > > struct virtio_blk_config seg_max field. But today these fields can
> > > > > only
> > > > > reduce the maximum descriptor chain length because the spec still
> > > > > limits
> > > > > the length to Queue Size.
> > > > > 
> > > > > We can build on this approach to raise the length above Queue Size.
> > > > > This
> > > > > approach has the advantage that the maximum number of segments isn't
> > > > > per
> > > > > device or per virtqueue, it's fine-grained. If the device supports
> > > > > two
> > > > > request types then different max descriptor chain limits could be
> > > > > given
> > > > > for them by introducing two separate configuration space fields.
> > > > > 
> > > > > Here are the corresponding spec changes:
> > > > > 
> > > > > 1. A new feature bit called VIRTIO_RING_F_LARGE_INDIRECT_DESC is
> > > > > added
> > > > > 
> > > > >    to indicate that indirect descriptor table size and maximum
> > > > >    descriptor chain length are not limited by Queue Size value.
> > > > >    (Maybe
> > > > >    there still needs to be a limit like 2^15?)
> > > > 
> > > > Sounds good to me!
> > > > 
> > > > AFAIK it is effectively limited to 2^16 because of vring_desc->next:
> > > > 
> > > > /* Virtio ring descriptors: 16 bytes.  These can chain together via "next". */
> > > > struct vring_desc {
> > > >         /* Address (guest-physical). */
> > > >         __virtio64 addr;
> > > >         /* Length. */
> > > >         __virtio32 len;
> > > >         /* The flags as indicated above. */
> > > >         __virtio16 flags;
> > > >         /* We chain unused descriptors via this, too */
> > > >         __virtio16 next;
> > > > };
> > > 
> > > Yes, Split Virtqueues have a fundamental limit on indirect table size
> > > due to the "next" field. Packed Virtqueue descriptors don't have a
> > > "next" field so descriptor chains could be longer in theory (currently
> > > forbidden by the spec).
> > > 
> > > > > One thing that's messy is that we've been discussing the maximum
> > > > > descriptor chain length but 9p has the "msize" concept, which isn't
> > > > > aware of contiguous memory. It may be necessary to extend the 9p
> > > > > driver
> > > > > code to size requests not just according to their length in bytes
> > > > > but
> > > > > also according to the descriptor chain length. That's how the Linux
> > > > > block layer deals with queue limits (struct queue_limits
> > > > > max_segments vs
> > > > > max_hw_sectors).
> > > > 
> > > > Hmm, I can't follow you on that one. Why would that be needed in the case
> > > > of 9p? My plan was for the 9p client to simply limit msize at session start
> > > > to the max. number of virtio descriptors supported by the host, using
> > > > PAGE_SIZE as the size per descriptor, because that's what the 9p client
> > > > actually does ATM (see above). So you think that should be changed to
> > > > e.g. just one descriptor for 4 MB, right?
> > > 
> > > Limiting msize to the 9p transport device's maximum number of
> > > descriptors is conservative (i.e. 128 descriptors = 512 KB msize)
> > > because it doesn't take advantage of contiguous memory. I suggest
> > > leaving msize alone, adding a separate limit at which requests are split
> > > according to the maximum descriptor chain length, and tweaking
> > > pack_sg_list() to coalesce adjacent pages.
> > > 
> > > That way msize can be large without necessarily using lots of
> > > descriptors (depending on the memory layout).
> > 
> > That was actually a tempting solution, because it would not require
> > changes to the virtio specs (at least for a while) and it would also work
> > with older QEMU versions. And for the pack_sg_list() portion of the code
> > it would be easy and work well, as the buffer passed to pack_sg_list() is
> > already contiguous.
> > 
> > However I just realized that for the zero-copy version of the code this
> > would be more tricky. The ZC version already uses individual pinned pages
> > (struct page, hence PAGE_SIZE each), i.e. it uses pack_sg_list_p() [1] in
> > combination with p9_get_mapped_pages() [2]
> > 
> > [1] https://github.com/torvalds/linux/blob/7ddb58cb0ecae8e8b6181d736a87667cc9ab8389/net/9p/trans_virtio.c#L218
> > [2] https://github.com/torvalds/linux/blob/7ddb58cb0ecae8e8b6181d736a87667cc9ab8389/net/9p/trans_virtio.c#L309
> > 
> > So that would require much more work and code trying to sort and coalesce
> > individual pages to contiguous physical memory for the sake of reducing
> > virtio descriptors. And there is no guarantee that this is even possible.
> > The kernel may simply return a non-contiguous set of pages which would
> > eventually end up exceeding the virtio descriptor limit again.
> 
> Order must be preserved so pages cannot be sorted by physical address.
> How about simply coalescing when pages are adjacent?

It would help, but not solve the issue we are talking about here: if 99% of 
the cases could successfully merge descriptors to stay below the descriptor 
count limit, but in 1% of the cases it could not, then this still constitutes a
severe runtime issue that could trigger at any time.
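
It would certainly help on average, no doubt. Just to make that trade-off
concrete, here is a rough, untested sketch (hypothetical helper, not part of
my kernel patches) of what adjacency coalescing could look like for the
pinned-pages (ZC) path; note that the worst case still degrades to one
descriptor per page:

/* Sketch only: build sg entries from an ordered set of pinned pages, merging
 * runs of physically adjacent pages into a single entry. If no pages happen
 * to be adjacent, this still yields one descriptor per page, so the
 * descriptor count limit could still be exceeded at runtime.
 */
static int pack_sg_pages_coalesced(struct scatterlist *sg, int start, int limit,
				   struct page **pages, int nr_pages,
				   size_t offset, size_t count)
{
	int index = start;
	int i = 0;

	while (count && i < nr_pages) {
		struct page *first = pages[i];
		size_t len = min_t(size_t, PAGE_SIZE - offset, count);

		/* extend the run while the next page is physically adjacent */
		while (len < count && i + 1 < nr_pages &&
		       page_to_pfn(pages[i + 1]) == page_to_pfn(pages[i]) + 1) {
			i++;
			len += min_t(size_t, PAGE_SIZE, count - len);
		}

		if (index >= limit)
			return -ENOSPC; /* still too many segments */

		sg_set_page(&sg[index++], first, len, offset);
		count -= len;
		offset = 0;
		i++;
	}
	return index - start;
}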

> > So looks like it was probably still easier and realistic to just add
> > virtio
> > capabilities for now for allowing to exceed current descriptor limit.
> 
> I'm still not sure why virtio-net, virtio-blk, virtio-fs, etc perform
> fine under today's limits while virtio-9p needs a much higher limit to
> achieve good performance. Maybe there is an issue in a layer above the
> vring that's causing the virtio-9p performance you've observed?

Are you referring to (somewhat) recent benchmarks when saying those would all 
still perform fine today?

Vivek was running detailed benchmarks for virtiofs vs. 9p:
https://lists.gnu.org/archive/html/qemu-devel/2020-12/msg02704.html

For the virtio aspect discussed here, only the benchmark configurations
without cache are relevant (9p-none, vtfs-none), and under this aspect the
situation seems to be quite similar between 9p and virtio-fs. You'll also note
that once DAX is enabled (vtfs-none-dax), virtio-fs performance is boosted
significantly, which however seems to correlate to the numbers I get when
running 9p with msize > 300k. Note: Vivek was presumably running 9p
effectively with msize=300k, as this was the kernel limitation at that time.
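
(Rough back-of-the-envelope estimate on my side: a linear 12 GB read needs
about 12 GiB / 512 KiB = 24576 request/response round trips at msize=512k,
but only 12 GiB / 4 MiB = 3072 round trips at msize=4M, so any fixed
per-request overhead is amortized roughly 8 times better, which matches the
trend of the msize table quoted earlier.)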

To put things into perspective: there are known performance aspects in 9p that
can be improved, yes, both on the Linux kernel side and on the 9p server side
in QEMU. For instance the 9p server uses coroutines [1] and currently
dispatches between worker thread(s) and main thread too often per request
(partly addressed already [2], but still WIP), which adds to overall latency.
But Vivek was actually using a 9p patch here which disabled coroutines
entirely, which suggests that the virtio transmission size limit still
represents a bottleneck.

[1] https://wiki.qemu.org/Documentation/9p#Coroutines
[2] https://wiki.qemu.org/Documentation/9p#Implementation_Plans

Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
  2021-11-09 13:09                               ` [Virtio-fs] " Christian Schoenebeck
@ 2021-11-10 10:05                                 ` Stefan Hajnoczi
  -1 siblings, 0 replies; 97+ messages in thread
From: Stefan Hajnoczi @ 2021-11-10 10:05 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	Jason Wang, Amit Shah, David Hildenbrand, qemu-devel, Greg Kurz,
	virtio-fs, Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Marc-André Lureau, Paolo Bonzini, Fam Zheng,
	Raphael Norwitz, Dr. David Alan Gilbert

[-- Attachment #1: Type: text/plain, Size: 43910 bytes --]

On Tue, Nov 09, 2021 at 02:09:59PM +0100, Christian Schoenebeck wrote:
> On Dienstag, 9. November 2021 11:56:35 CET Stefan Hajnoczi wrote:
> > On Thu, Nov 04, 2021 at 03:41:23PM +0100, Christian Schoenebeck wrote:
> > > On Mittwoch, 3. November 2021 12:33:33 CET Stefan Hajnoczi wrote:
> > > > On Mon, Nov 01, 2021 at 09:29:26PM +0100, Christian Schoenebeck wrote:
> > > > > On Donnerstag, 28. Oktober 2021 11:00:48 CET Stefan Hajnoczi wrote:
> > > > > > On Mon, Oct 25, 2021 at 05:03:25PM +0200, Christian Schoenebeck wrote:
> > > > > > > On Montag, 25. Oktober 2021 12:30:41 CEST Stefan Hajnoczi wrote:
> > > > > > > > On Thu, Oct 21, 2021 at 05:39:28PM +0200, Christian Schoenebeck wrote:
> > > > > > > > > On Freitag, 8. Oktober 2021 18:08:48 CEST Christian Schoenebeck wrote:
> > > > > > > > > > On Freitag, 8. Oktober 2021 16:24:42 CEST Christian Schoenebeck wrote:
> > > > > > > > > > > On Freitag, 8. Oktober 2021 09:25:33 CEST Greg Kurz wrote:
> > > > > > > > > > > > On Thu, 7 Oct 2021 16:42:49 +0100
> > > > > > > > > > > > 
> > > > > > > > > > > > Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > > > > > > > > On Thu, Oct 07, 2021 at 02:51:55PM +0200, Christian Schoenebeck wrote:
> > > > > > > > > > > > > > On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan Hajnoczi wrote:
> > > > > > > > > > > > > > > On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian Schoenebeck wrote:
> > > > > > > > > > > > > > > > At the moment the maximum transfer size with
> > > > > > > > > > > > > > > > virtio
> > > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > limited
> > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > 4M
> > > > > > > > > > > > > > > > (1024 * PAGE_SIZE). This series raises this
> > > > > > > > > > > > > > > > limit to
> > > > > > > > > > > > > > > > its
> > > > > > > > > > > > > > > > maximum
> > > > > > > > > > > > > > > > theoretical possible transfer size of 128M (32k
> > > > > > > > > > > > > > > > pages)
> > > > > > > > > > > > > > > > according
> > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > virtio specs:
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-240006
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Hi Christian,
> > > > > > > > > > > > 
> > > > > > > > > > > > > > > I took a quick look at the code:
> > > > > > > > > > > > Hi,
> > > > > > > > > > > > 
> > > > > > > > > > > > Thanks Stefan for sharing virtio expertise and helping
> > > > > > > > > > > > Christian
> > > > > > > > > > > > !
> > > > > > > > > > > > 
> > > > > > > > > > > > > > > - The Linux 9p driver restricts descriptor chains
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > 128
> > > > > > > > > > > > > > > elements
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > >   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Yes, that's the limitation that I am about to remove
> > > > > > > > > > > > > > (WIP);
> > > > > > > > > > > > > > current
> > > > > > > > > > > > > > kernel
> > > > > > > > > > > > > > patches:
> > > > > > > > > > > > > > https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyte.com/>
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I haven't read the patches yet but I'm concerned that
> > > > > > > > > > > > > today
> > > > > > > > > > > > > the
> > > > > > > > > > > > > driver
> > > > > > > > > > > > > is pretty well-behaved and this new patch series
> > > > > > > > > > > > > introduces a
> > > > > > > > > > > > > spec
> > > > > > > > > > > > > violation. Not fixing existing spec violations is
> > > > > > > > > > > > > okay,
> > > > > > > > > > > > > but
> > > > > > > > > > > > > adding
> > > > > > > > > > > > > new
> > > > > > > > > > > > > ones is a red flag. I think we need to figure out a
> > > > > > > > > > > > > clean
> > > > > > > > > > > > > solution.
> > > > > > > > > > > 
> > > > > > > > > > > Nobody has reviewed the kernel patches yet. My main concern therefore
> > > > > > > > > > > actually is that the kernel patches are already too complex, because the
> > > > > > > > > > > current situation is that only Dominique is handling 9p patches on
> > > > > > > > > > > kernel side, and he barely has time for 9p anymore.
> > > > > > > > > > > 
> > > > > > > > > > > Another reason for me to catch up on reading current kernel code and
> > > > > > > > > > > stepping in as reviewer of 9p on kernel side ASAP, independent of this
> > > > > > > > > > > issue.
> > > > > > > > > > > 
> > > > > > > > > > > As for current kernel patches' complexity: I can certainly drop patch 7
> > > > > > > > > > > entirely as it is probably just overkill. Patch 4 is then the biggest
> > > > > > > > > > > chunk, I have to see if I can simplify it, and whether it would make
> > > > > > > > > > > sense to squash with patch 3.
> > > > > > > > > > > 
> > > > > > > > > > > > > > > - The QEMU 9pfs code passes iovecs directly to preadv(2) and will fail
> > > > > > > > > > > > > > >   with EINVAL when called with more than IOV_MAX iovecs
> > > > > > > > > > > > > > >   (hw/9pfs/9p.c:v9fs_read())
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Hmm, which makes me wonder why I never encountered this error during
> > > > > > > > > > > > > > testing.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Most people will use the 9p qemu 'local' fs driver backend in practice,
> > > > > > > > > > > > > > so that v9fs_read() call would translate for most people to this
> > > > > > > > > > > > > > implementation on QEMU side (hw/9p/9p-local.c):
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > static ssize_t local_preadv(FsContext *ctx, V9fsFidOpenState *fs,
> > > > > > > > > > > > > >                             const struct iovec *iov,
> > > > > > > > > > > > > >                             int iovcnt, off_t offset)
> > > > > > > > > > > > > > {
> > > > > > > > > > > > > > #ifdef CONFIG_PREADV
> > > > > > > > > > > > > >     return preadv(fs->fd, iov, iovcnt, offset);
> > > > > > > > > > > > > > #else
> > > > > > > > > > > > > >     int err = lseek(fs->fd, offset, SEEK_SET);
> > > > > > > > > > > > > >     if (err == -1) {
> > > > > > > > > > > > > >         return err;
> > > > > > > > > > > > > >     } else {
> > > > > > > > > > > > > >         return readv(fs->fd, iov, iovcnt);
> > > > > > > > > > > > > >     }
> > > > > > > > > > > > > > #endif
> > > > > > > > > > > > > > }
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Unless I misunderstood the code, neither side can take advantage of
> > > > > > > > > > > > > > > the new 32k descriptor chain limit?
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > Stefan
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > I need to check that when I have some more time. One possible
> > > > > > > > > > > > > > explanation might be that preadv() already has this wrapped into a loop
> > > > > > > > > > > > > > in its implementation to circumvent a limit like IOV_MAX. It might be
> > > > > > > > > > > > > > another "it works, but not portable" issue, but not sure.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > There are still a bunch of other issues I have to resolve. If you look
> > > > > > > > > > > > > > at net/9p/client.c on kernel side, you'll notice that it basically does
> > > > > > > > > > > > > > this ATM:
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > >     kmalloc(msize);
> > > > > > > > > > > > 
> > > > > > > > > > > > Note that this is done twice: once for the T message (client request)
> > > > > > > > > > > > and once for the R message (server answer). The 9p driver could adjust
> > > > > > > > > > > > the size of the T message to what's really needed instead of allocating
> > > > > > > > > > > > the full msize. R message size is not known though.
> > > > > > > > > > > 
> > > > > > > > > > > Would it make sense adding a second virtio ring, dedicated to server
> > > > > > > > > > > responses, to solve this? IIRC 9p server already calculates appropriate
> > > > > > > > > > > exact sizes for each response type. So server could just push space
> > > > > > > > > > > that's really needed for its responses.
> > > > > > > > > > > 
> > > > > > > > > > > > > > for every 9p request. So not only does it allocate much more memory for
> > > > > > > > > > > > > > every request than actually required (i.e. say 9pfs was mounted with
> > > > > > > > > > > > > > msize=8M, then a 9p request that actually would just need 1k would
> > > > > > > > > > > > > > nevertheless allocate 8M), but also it allocates > PAGE_SIZE, which
> > > > > > > > > > > > > > obviously may fail at any time.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > The PAGE_SIZE limitation sounds like a kmalloc() vs vmalloc() situation.
> > > > > > > > > > > 
> > > > > > > > > > > Hu, I didn't even consider vmalloc(). I just tried the kvmalloc() wrapper
> > > > > > > > > > > as a quick & dirty test, but it crashed in the same way as kmalloc() with
> > > > > > > > > > > large msize values immediately on mounting:
> > > > > > > > > > > 
> > > > > > > > > > > diff --git a/net/9p/client.c b/net/9p/client.c
> > > > > > > > > > > index a75034fa249b..cfe300a4b6ca 100644
> > > > > > > > > > > --- a/net/9p/client.c
> > > > > > > > > > > +++ b/net/9p/client.c
> > > > > > > > > > > @@ -227,15 +227,18 @@ static int parse_opts(char *opts, struct p9_client *clnt)
> > > > > > > > > > >  static int p9_fcall_init(struct p9_client *c, struct p9_fcall *fc,
> > > > > > > > > > >                          int alloc_msize)
> > > > > > > > > > >  {
> > > > > > > > > > > -       if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> > > > > > > > > > > +       //if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> > > > > > > > > > > +       if (false) {
> > > > > > > > > > >                 fc->sdata = kmem_cache_alloc(c->fcall_cache, GFP_NOFS);
> > > > > > > > > > >                 fc->cache = c->fcall_cache;
> > > > > > > > > > >         } else {
> > > > > > > > > > > -               fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
> > > > > > > > > > > +               fc->sdata = kvmalloc(alloc_msize, GFP_NOFS);
> > > > > > > > > > 
> > > > > > > > > > Ok, GFP_NOFS -> GFP_KERNEL did the trick.
> > > > > > > > > > 
> > > > > > > > > > Now I get:
> > > > > > > > > >    virtio: bogus descriptor or out of resources
> > > > > > > > > > 
> > > > > > > > > > So, still some work ahead on both ends.
> > > > > > > > > 
> > > > > > > > > Few hacks later (only changes on 9p client side) I got this
> > > > > > > > > running
> > > > > > > > > stable
> > > > > > > > > now. The reason for the virtio error above was that kvmalloc()
> > > > > > > > > returns
> > > > > > > > > a
> > > > > > > > > non-logical kernel address for any kvmalloc(>4M), i.e. an
> > > > > > > > > address
> > > > > > > > > that
> > > > > > > > > is
> > > > > > > > > inaccessible from host side, hence that "bogus descriptor"
> > > > > > > > > message
> > > > > > > > > by
> > > > > > > > > QEMU.
> > > > > > > > > So I had to split those linear 9p client buffers into sparse
> > > > > > > > > ones
> > > > > > > > > (set
> > > > > > > > > of
> > > > > > > > > individual pages).
> > > > > > > > > 
> > > > > > > > > I tested this for some days with various virtio transmission
> > > > > > > > > sizes
> > > > > > > > > and
> > > > > > > > > it
> > > > > > > > > works as expected up to 128 MB (more precisely: 128 MB read
> > > > > > > > > space
> > > > > > > > > +
> > > > > > > > > 128 MB
> > > > > > > > > write space per virtio round trip message).
> > > > > > > > > 
> > > > > > > > > I did not encounter a show stopper for large virtio
> > > > > > > > > transmission
> > > > > > > > > sizes
> > > > > > > > > (4 MB ... 128 MB) on virtio level, neither as a result of
> > > > > > > > > testing,
> > > > > > > > > nor
> > > > > > > > > after reviewing the existing code.
> > > > > > > > > 
> > > > > > > > > About IOV_MAX: that's apparently not an issue on virtio level.
> > > > > > > > > Most of
> > > > > > > > > the
> > > > > > > > > iovec code, both on Linux kernel side and on QEMU side do not
> > > > > > > > > have
> > > > > > > > > this
> > > > > > > > > limitation. It is apparently however indeed a limitation for
> > > > > > > > > userland
> > > > > > > > > apps
> > > > > > > > > calling the Linux kernel's syscalls yet.
> > > > > > > > > 
> > > > > > > > > Stefan, as it stands now, I am even more convinced that the
> > > > > > > > > upper
> > > > > > > > > virtio
> > > > > > > > > transmission size limit should not be squeezed into the queue
> > > > > > > > > size
> > > > > > > > > argument of virtio_add_queue(). Not because of the previous
> > > > > > > > > argument
> > > > > > > > > that
> > > > > > > > > it would waste space (~1MB), but rather because they are two
> > > > > > > > > different
> > > > > > > > > things. To outline this, just a quick recap of what happens
> > > > > > > > > exactly
> > > > > > > > > when
> > > > > > > > > a bulk message is pushed over the virtio wire (assuming virtio
> > > > > > > > > "split"
> > > > > > > > > layout here):
> > > > > > > > > 
> > > > > > > > > ---------- [recap-start] ----------
> > > > > > > > > 
> > > > > > > > > For each bulk message sent guest <-> host, exactly *one* of
> > > > > > > > > the
> > > > > > > > > pre-allocated descriptors is taken and placed (subsequently)
> > > > > > > > > into
> > > > > > > > > exactly
> > > > > > > > > *one* position of the two available/used ring buffers. The
> > > > > > > > > actual
> > > > > > > > > descriptor table though, containing all the DMA addresses of
> > > > > > > > > the
> > > > > > > > > message
> > > > > > > > > bulk data, is allocated just in time for each round trip
> > > > > > > > > message.
> > > > > > > > > Say,
> > > > > > > > > it
> > > > > > > > > is the first message sent, it yields in the following
> > > > > > > > > structure:
> > > > > > > > > 
> > > > > > > > > Ring Buffer   Descriptor Table      Bulk Data Pages
> > > > > > > > > 
> > > > > > > > >    +-+              +-+           +-----------------+
> > > > > > > > >    
> > > > > > > > >    |D|------------->|d|---------->| Bulk data block |
> > > > > > > > >    
> > > > > > > > >    +-+              |d|--------+  +-----------------+
> > > > > > > > >    
> > > > > > > > >    | |              |d|------+ |
> > > > > > > > >    
> > > > > > > > >    +-+               .       | |  +-----------------+
> > > > > > > > >    
> > > > > > > > >    | |               .       | +->| Bulk data block |
> > > > > > > > >     
> > > > > > > > >     .                .       |    +-----------------+
> > > > > > > > >     .               |d|-+    |
> > > > > > > > >     .               +-+ |    |    +-----------------+
> > > > > > > > >     
> > > > > > > > >    | |                  |    +--->| Bulk data block |
> > > > > > > > >    
> > > > > > > > >    +-+                  |         +-----------------+
> > > > > > > > >    
> > > > > > > > >    | |                  |                 .
> > > > > > > > >    
> > > > > > > > >    +-+                  |                 .
> > > > > > > > >    
> > > > > > > > >                         |                 .
> > > > > > > > >                         |         
> > > > > > > > >                         |         +-----------------+
> > > > > > > > >                         
> > > > > > > > >                         +-------->| Bulk data block |
> > > > > > > > >                         
> > > > > > > > >                                   +-----------------+
> > > > > > > > > 
> > > > > > > > > Legend:
> > > > > > > > > D: pre-allocated descriptor
> > > > > > > > > d: just in time allocated descriptor
> > > > > > > > > -->: memory pointer (DMA)
> > > > > > > > > 
> > > > > > > > > The bulk data blocks are allocated by the respective device
> > > > > > > > > driver
> > > > > > > > > above
> > > > > > > > > virtio subsystem level (guest side).
> > > > > > > > > 
> > > > > > > > > There are exactly as many descriptors pre-allocated (D) as the
> > > > > > > > > size of
> > > > > > > > > a
> > > > > > > > > ring buffer.
> > > > > > > > > 
> > > > > > > > > A "descriptor" is more or less just a chainable DMA memory
> > > > > > > > > pointer;
> > > > > > > > > defined
> > > > > > > > > as:
> > > > > > > > > 
> > > > > > > > > /* Virtio ring descriptors: 16 bytes.  These can chain
> > > > > > > > > together
> > > > > > > > > via
> > > > > > > > > "next". */ struct vring_desc {
> > > > > > > > > 
> > > > > > > > > 	/* Address (guest-physical). */
> > > > > > > > > 	__virtio64 addr;
> > > > > > > > > 	/* Length. */
> > > > > > > > > 	__virtio32 len;
> > > > > > > > > 	/* The flags as indicated above. */
> > > > > > > > > 	__virtio16 flags;
> > > > > > > > > 	/* We chain unused descriptors via this, too */
> > > > > > > > > 	__virtio16 next;
> > > > > > > > > 
> > > > > > > > > };
> > > > > > > > > 
> > > > > > > > > There are 2 ring buffers; the "available" ring buffer is for
> > > > > > > > > sending a
> > > > > > > > > message guest->host (which will transmit DMA addresses of
> > > > > > > > > guest
> > > > > > > > > allocated
> > > > > > > > > bulk data blocks that are used for data sent to device, and
> > > > > > > > > separate
> > > > > > > > > guest allocated bulk data blocks that will be used by host
> > > > > > > > > side to
> > > > > > > > > place
> > > > > > > > > its response bulk data), and the "used" ring buffer is for
> > > > > > > > > sending
> > > > > > > > > host->guest to let guest know about host's response and that
> > > > > > > > > it
> > > > > > > > > could
> > > > > > > > > now
> > > > > > > > > safely consume and then deallocate the bulk data blocks
> > > > > > > > > subsequently.
> > > > > > > > > 
> > > > > > > > > ---------- [recap-end] ----------
> > > > > > > > > 
> > > > > > > > > So the "queue size" actually defines the ringbuffer size. It
> > > > > > > > > does
> > > > > > > > > not
> > > > > > > > > define the maximum amount of descriptors. The "queue size"
> > > > > > > > > rather
> > > > > > > > > defines
> > > > > > > > > how many pending messages can be pushed into either one
> > > > > > > > > ringbuffer
> > > > > > > > > before
> > > > > > > > > the other side would need to wait until the counter side would
> > > > > > > > > step up
> > > > > > > > > (i.e. ring buffer full).
> > > > > > > > > 
> > > > > > > > > The maximum amount of descriptors (what VIRTQUEUE_MAX_SIZE
> > > > > > > > > actually
> > > > > > > > > is)
> > > > > > > > > OTOH defines the max. bulk data size that could be transmitted
> > > > > > > > > with
> > > > > > > > > each
> > > > > > > > > virtio round trip message.
> > > > > > > > > 
> > > > > > > > > And in fact, 9p currently handles the virtio "queue size" as
> > > > > > > > > directly
> > > > > > > > > associative with its maximum amount of active 9p requests the
> > > > > > > > > server
> > > > > > > > > could
> > > > > > > > > 
> > > > > > > > > handle simultaneously:
> > > > > > > > >   hw/9pfs/9p.h:#define MAX_REQ         128
> > > > > > > > >   hw/9pfs/9p.h:    V9fsPDU pdus[MAX_REQ];
> > > > > > > > >   hw/9pfs/virtio-9p-device.c:    v->vq =
> > > > > > > > >   virtio_add_queue(vdev,
> > > > > > > > >   MAX_REQ,
> > > > > > > > >   
> > > > > > > > >                                  handle_9p_output);
> > > > > > > > > 
> > > > > > > > > So if I would change it like this, just for the purpose to
> > > > > > > > > increase
> > > > > > > > > the
> > > > > > > > > max. virtio transmission size:
> > > > > > > > > 
> > > > > > > > > --- a/hw/9pfs/virtio-9p-device.c
> > > > > > > > > +++ b/hw/9pfs/virtio-9p-device.c
> > > > > > > > > @@ -218,7 +218,7 @@ static void
> > > > > > > > > virtio_9p_device_realize(DeviceState
> > > > > > > > > *dev,
> > > > > > > > > Error **errp)>
> > > > > > > > > 
> > > > > > > > >      v->config_size = sizeof(struct virtio_9p_config) +
> > > > > > > > >      strlen(s->fsconf.tag);
> > > > > > > > >      virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P,
> > > > > > > > >      v->config_size,
> > > > > > > > >      
> > > > > > > > >                  VIRTQUEUE_MAX_SIZE);
> > > > > > > > > 
> > > > > > > > > -    v->vq = virtio_add_queue(vdev, MAX_REQ,
> > > > > > > > > handle_9p_output);
> > > > > > > > > +    v->vq = virtio_add_queue(vdev, 32*1024,
> > > > > > > > > handle_9p_output);
> > > > > > > > > 
> > > > > > > > >  }
> > > > > > > > > 
> > > > > > > > > Then it would require additional synchronization code on both
> > > > > > > > > ends
> > > > > > > > > and
> > > > > > > > > therefore unnecessary complexity, because it would now be
> > > > > > > > > possible
> > > > > > > > > that
> > > > > > > > > more requests are pushed into the ringbuffer than server could
> > > > > > > > > handle.
> > > > > > > > > 
> > > > > > > > > There is one potential issue though that probably did justify
> > > > > > > > > the
> > > > > > > > > "don't
> > > > > > > > > exceed the queue size" rule:
> > > > > > > > > 
> > > > > > > > > ATM the descriptor table is allocated (just in time) as *one*
> > > > > > > > > continuous
> > > > > > > > > buffer via kmalloc_array():
> > > > > > > > > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086f7c7d33a4/drivers/virtio/virtio_ring.c#L440
> > > > > > > > > 
> > > > > > > > > So assuming transmission size of 2 * 128 MB that
> > > > > > > > > kmalloc_array()
> > > > > > > > > call
> > > > > > > > > would
> > > > > > > > > yield in kmalloc(1M) and the latter might fail if guest had
> > > > > > > > > highly
> > > > > > > > > fragmented physical memory. For such kind of error case there
> > > > > > > > > is
> > > > > > > > > currently a fallback path in virtqueue_add_split() that would
> > > > > > > > > then
> > > > > > > > > use
> > > > > > > > > the required amount of pre-allocated descriptors instead:
> > > > > > > > > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086f7c7d33a4/drivers/virtio/virtio_ring.c#L525
> > > > > > > > > 
> > > > > > > > > That fallback recovery path would no longer be viable if the
> > > > > > > > > queue
> > > > > > > > > size
> > > > > > > > > was
> > > > > > > > > exceeded. There would be alternatives though, e.g. by allowing
> > > > > > > > > to
> > > > > > > > > chain
> > > > > > > > > indirect descriptor tables (currently prohibited by the virtio
> > > > > > > > > specs).
> > > > > > > > 
> > > > > > > > Making the maximum number of descriptors independent of the
> > > > > > > > queue
> > > > > > > > size
> > > > > > > > requires a change to the VIRTIO spec since the two values are
> > > > > > > > currently
> > > > > > > > explicitly tied together by the spec.
> > > > > > > 
> > > > > > > Yes, that's what the virtio specs say. But they don't say why, nor
> > > > > > > did
> > > > > > > I
> > > hear a reason in this discussion.
> > > > > > > 
> > > > > > > That's why I invested time reviewing current virtio implementation
> > > > > > > and
> > > > > > > specs, as well as actually testing exceeding that limit. And as I
> > > > > > > outlined in detail in my previous email, I only found one
> > > > > > > theoretical
> > > > > > > issue that could be addressed though.
> > > > > > 
> > > > > > I agree that there is a limitation in the VIRTIO spec, but violating
> > > > > > the
> > > > > > spec isn't an acceptable solution:
> > > > > > 
> > > > > > 1. QEMU and Linux aren't the only components that implement VIRTIO.
> > > > > > You
> > > > > > 
> > > > > >    cannot make assumptions about their implementations because it
> > > > > >    may
> > > > > >    break spec-compliant implementations that you haven't looked at.
> > > > > >    
> > > > > >    Your patches weren't able to increase Queue Size because some
> > > > > >    device
> > > > > >    implementations break when descriptor chains are too long. This
> > > > > >    shows
> > > > > >    there is a practical issue even in QEMU.
> > > > > > 
> > > > > > 2. The specific spec violation that we discussed creates the problem
> > > > > > 
> > > > > >    that drivers can no longer determine the maximum descriptor
> > > > > >    chain
> > > > > >    length. This in turn will lead to more implementation-specific
> > > > > >    assumptions being baked into drivers and cause problems with
> > > > > >    interoperability and future changes.
> > > > > > 
> > > > > > The spec needs to be extended instead. I included an idea for how to
> > > > > > do
> > > > > > that below.
> > > > > 
> > > > > Sure, I just wanted to see if there was a non-negligible "hard" show
> > > > > stopper per se that I probably haven't seen yet. I have not questioned
> > > > > aiming for a clean solution.
> > > > > 
> > > > > Thanks for the clarification!
> > > > > 
> > > > > > > > Before doing that, are there benchmark results showing that 1 MB
> > > > > > > > vs
> > > > > > > > 128
> > > > > > > > MB produces a performance improvement? I'm asking because if
> > > > > > > > performance
> > > > > > > > with 1 MB is good then you can probably do that without having
> > > > > > > > to
> > > > > > > > change
> > > > > > > > VIRTIO and also because it's counter-intuitive that 9p needs 128
> > > > > > > > MB
> > > > > > > > for
> > > > > > > > good performance when it's ultimately implemented on top of disk
> > > > > > > > and
> > > > > > > > network I/O that have lower size limits.
> > > > > > > 
> > > > > > > First some numbers, linear reading a 12 GB file:
> > > > > > > 
> > > > > > > msize    average      notes
> > > > > > > 
> > > > > > > 8 kB     52.0 MB/s    default msize of Linux kernel <v5.15
> > > > > > > 128 kB   624.8 MB/s   default msize of Linux kernel >=v5.15
> > > > > > > 512 kB   1961 MB/s    current max. msize with any Linux kernel
> > > > > > > <=v5.15
> > > > > > > 1 MB     2551 MB/s    this msize would already violate virtio
> > > > > > > specs
> > > > > > > 2 MB     2521 MB/s    this msize would already violate virtio
> > > > > > > specs
> > > > > > > 4 MB     2628 MB/s    planned max. msize of my current kernel
> > > > > > > patches
> > > > > > > [1]
> > > > > > 
> > > > > > How many descriptors are used? 4 MB can be covered by a single
> > > > > > descriptor if the data is physically contiguous in memory, so this
> > > > > > data
> > > > > > doesn't demonstrate a need for more descriptors.
> > > > > 
> > > > > No, in the last couple years there was apparently no kernel version
> > > > > that
> > > > > used just one descriptor, nor did my benchmarked version. Even though
> > > > > the
> > > > > Linux 9p client still uses simple linear buffers (contiguous physical
> > > > > memory) on 9p client level, these are however split into PAGE_SIZE
> > > > > chunks
> > > > > by function pack_sg_list() [1] before being fed to virtio level:
> > > > > 
> > > > > static unsigned int rest_of_page(void *data)
> > > > > {
> > > > > 
> > > > > 	return PAGE_SIZE - offset_in_page(data);
> > > > > 
> > > > > }
> > > > > ...
> > > > > static int pack_sg_list(struct scatterlist *sg, int start,
> > > > > 
> > > > > 			int limit, char *data, int count)
> > > > > 
> > > > > {
> > > > > 
> > > > > 	int s;
> > > > > 	int index = start;
> > > > > 	
> > > > > 	while (count) {
> > > > > 	
> > > > > 		s = rest_of_page(data);
> > > > > 		...
> > > > > 		sg_set_buf(&sg[index++], data, s);
> > > > > 		count -= s;
> > > > > 		data += s;
> > > > > 	
> > > > > 	}
> > > > > 	...
> > > > > 
> > > > > }
> > > > > 
> > > > > [1]
> > > > > https://github.com/torvalds/linux/blob/19901165d90fdca1e57c9baa0d5b4c63d15c476a/net/9p/trans_virtio.c#L171
> > > > > 
> > > > > So when sending 4MB over virtio wire, it currently results in 1k
> > > > > descriptors.
> > > > > 
> > > > > I have wondered about this before, but did not question it, because
> > > > > due to
> > > > > the cross-platform nature I couldn't say for certain whether that's
> > > > > probably needed somewhere. I mean for the case virtio-PCI I know for
> > > > > sure
> > > > > that one descriptor (i.e. >PAGE_SIZE) would be fine, but I don't know
> > > > > if
> > > > > that applies to all buses and architectures.
> > > > 
> > > > VIRTIO does not limit the descriptor len field to PAGE_SIZE,
> > > > so I don't think there is a limit at the VIRTIO level.
> > > 
> > > So you are viewing this purely from virtio specs PoV: in the sense, if it
> > > is not prohibited by the virtio specs, then it should work. Maybe.
> > 
> > Limitations must be specified either in the 9P protocol or the VIRTIO
> > specification. Drivers and devices will not be able to operate correctly
> > if there are limitations that aren't covered by the specs.
> > 
> > Do you have something in mind that isn't covered by the specs?
> 
> Not sure whether that's something that should be specified by the virtio 
> specs, probably not. I simply do not know if there was any bus or architecture 
> that would have a limitation for max. size for a memory block passed per one 
> DMA address.

Host-side limitations like that can exist. For example when a physical
storage device on the host has limits that the VIRTIO device does not
have. In this case both virtio-scsi and virtio-blk report those limits
to the guest so that the guest won't submit requests that the physical
device would reject. I guess networking MTU is kind of similar too. What
they have in common is that the limit needs to be reported to the guest,
typically using a VIRTIO Configuration Space field. It is an explicit
limit that is part of the host<->guest interface (VIRTIO spec, SCSI,
etc).
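
To make that shape concrete, here is a minimal sketch (not the actual
virtio-blk driver code, just an illustration of the pattern; error handling
omitted) of how a guest driver reads such a limit from the VIRTIO
configuration space, falling back to a minimal value when the device does not
offer the corresponding feature bit:

#include <linux/types.h>
#include <linux/virtio.h>
#include <linux/virtio_config.h>
#include <linux/virtio_blk.h>

/* Sketch: query the per-request segment limit advertised by the device. */
static u32 example_read_seg_max(struct virtio_device *vdev)
{
        u32 seg_max = 1;  /* assume the minimum if the device does not say */

        if (virtio_has_feature(vdev, VIRTIO_BLK_F_SEG_MAX))
                virtio_cread(vdev, struct virtio_blk_config, seg_max, &seg_max);

        return seg_max;
}

A hypothetical 9p limit could be reported the same way, e.g. through a new
field in the virtio-9p configuration space, once the spec actually defines
one.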

> > > > If this function coalesces adjacent pages then the descriptor chain
> > > > length issues could be reduced.
> > > > 
> > > > > > > But again, this is not just about performance. My conclusion as
> > > > > > > described
> > > > > > > in my previous email is that virtio currently squeezes
> > > > > > > 
> > > > > > > 	"max. simultaneous amount of bulk messages"
> > > > > > > 
> > > > > > > vs.
> > > > > > > 
> > > > > > > 	"max. bulk data transmission size per bulk message"
> > > > > > > 
> > > > > > > into the same configuration parameter, which is IMO inappropriate
> > > > > > > and
> > > > > > > hence
> > > > > > > splitting them into 2 separate parameters when creating a queue
> > > > > > > makes
> > > > > > > sense, independent of the performance benchmarks.
> > > > > > > 
> > > > > > > [1]
> > > > > > > https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyte.com/
> > > > > > 
> > > > > > Some devices effectively already have this because the device
> > > > > > advertises
> > > > > > a maximum number of descriptors via device-specific mechanisms like
> > > > > > the
> > > > > > struct virtio_blk_config seg_max field. But today these fields can
> > > > > > only
> > > > > > reduce the maximum descriptor chain length because the spec still
> > > > > > limits
> > > > > > the length to Queue Size.
> > > > > > 
> > > > > > We can build on this approach to raise the length above Queue Size.
> > > > > > This
> > > > > > approach has the advantage that the maximum number of segments isn't
> > > > > > per
> > > > > > device or per virtqueue, it's fine-grained. If the device supports
> > > > > > two
> > > > > > requests types then different max descriptor chain limits could be
> > > > > > given
> > > > > > for them by introducing two separate configuration space fields.
> > > > > > 
> > > > > > Here are the corresponding spec changes:
> > > > > > 
> > > > > > 1. A new feature bit called VIRTIO_RING_F_LARGE_INDIRECT_DESC is
> > > > > > added
> > > > > > 
> > > > > >    to indicate that indirect descriptor table size and maximum
> > > > > >    descriptor chain length are not limited by Queue Size value.
> > > > > >    (Maybe
> > > > > >    there still needs to be a limit like 2^15?)
> > > > > 
> > > > > Sounds good to me!
> > > > > 
> > > > > AFAIK it is effectively limited to 2^16 because of vring_desc->next:
> > > > > 
> > > > > /* Virtio ring descriptors: 16 bytes.  These can chain together via
> > > > > "next". */ struct vring_desc {
> > > > > 
> > > > >         /* Address (guest-physical). */
> > > > >         __virtio64 addr;
> > > > >         /* Length. */
> > > > >         __virtio32 len;
> > > > >         /* The flags as indicated above. */
> > > > >         __virtio16 flags;
> > > > >         /* We chain unused descriptors via this, too */
> > > > >         __virtio16 next;
> > > > > 
> > > > > };
> > > > 
> > > > Yes, Split Virtqueues have a fundamental limit on indirect table size
> > > > due to the "next" field. Packed Virtqueue descriptors don't have a
> > > > "next" field so descriptor chains could be longer in theory (currently
> > > > forbidden by the spec).
> > > > 
> > > > > > One thing that's messy is that we've been discussing the maximum
> > > > > > descriptor chain length but 9p has the "msize" concept, which isn't
> > > > > > aware of contiguous memory. It may be necessary to extend the 9p
> > > > > > driver
> > > > > > code to size requests not just according to their length in bytes
> > > > > > but
> > > > > > also according to the descriptor chain length. That's how the Linux
> > > > > > block layer deals with queue limits (struct queue_limits
> > > > > > max_segments vs
> > > > > > max_hw_sectors).
> > > > > 
> > > > > Hmm, can't follow on that one. For what should that be needed in case
> > > > > of
> > > > > 9p? My plan was to limit msize by 9p client simply at session start to
> > > > > whatever is the max. amount virtio descriptors supported by host and
> > > > > using PAGE_SIZE as size per descriptor, because that's what 9p client
> > > > > actually does ATM (see above). So you think that should be changed to
> > > > > e.g. just one descriptor for 4MB, right?
> > > > 
> > > > Limiting msize to the 9p transport device's maximum number of
> > > > descriptors is conservative (i.e. 128 descriptors = 512 KB msize)
> > > > because it doesn't take advantage of contiguous memory. I suggest
> > > > leaving msize alone, adding a separate limit at which requests are split
> > > > according to the maximum descriptor chain length, and tweaking
> > > > pack_sg_list() to coalesce adjacent pages.
> > > > 
> > > > That way msize can be large without necessarily using lots of
> > > > descriptors (depending on the memory layout).
> > > 
> > > That was actually a tempting solution. Because it would neither require
> > > changes to the virtio specs (at least for a while) and it would also work
> > > with older QEMU versions. And for that pack_sg_list() portion of the code
> > > it would work well and easy as the buffer passed to pack_sg_list() is
> > > contiguous already.
> > > 
> > > However I just realized for the zero-copy version of the code that would
> > > be
> > > more tricky. The ZC version already uses individual pages (struct page,
> > > hence PAGE_SIZE each) which are pinned, i.e. it uses pack_sg_list_p() [1]
> > > in combination with p9_get_mapped_pages() [2]
> > > 
> > > [1]
> > > https://github.com/torvalds/linux/blob/7ddb58cb0ecae8e8b6181d736a87667cc9ab8389/net/9p/trans_virtio.c#L218
> > > [2]
> > > https://github.com/torvalds/linux/blob/7ddb58cb0ecae8e8b6181d736a87667cc9ab8389/net/9p/trans_virtio.c#L309
> > > 
> > > So that would require much more work and code trying to sort and coalesce
> > > individual pages to contiguous physical memory for the sake of reducing
> > > virtio descriptors. And there is no guarantee that this is even possible.
> > > The kernel may simply return a non-contiguous set of pages which would
> > > eventually end up exceeding the virtio descriptor limit again.
> > 
> > Order must be preserved so pages cannot be sorted by physical address.
> > How about simply coalescing when pages are adjacent?
> 
> It would help, but not solve the issue we are talking about here: if 99% of 
> the cases could successfully merge descriptors to stay below the descriptor 
> count limit, but in 1% of the cases it could not, then this still constitutes a 
> severe runtime issue that could trigger at any time.
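
For illustration of the coalescing idea itself, a rough sketch (hypothetical
helper, not taken from net/9p/trans_virtio.c; page offsets and error handling
omitted) of merging runs of physically adjacent pages into single scatterlist
entries:

#include <linux/errno.h>
#include <linux/mm.h>
#include <linux/scatterlist.h>

/* Sketch: map pages into sg entries, merging physically contiguous runs. */
static int example_pack_pages_coalesced(struct scatterlist *sg, int limit,
                                        struct page **pages, int nr_pages)
{
        int i = 0, index = 0;

        while (i < nr_pages) {
                struct page *first = pages[i];
                unsigned int len = PAGE_SIZE;

                /* extend the run while the next page is physically adjacent */
                while (i + 1 < nr_pages &&
                       page_to_pfn(pages[i + 1]) == page_to_pfn(pages[i]) + 1) {
                        len += PAGE_SIZE;
                        i++;
                }
                i++;

                if (index == limit)
                        return -ENOMEM;  /* still too many segments */
                sg_set_page(&sg[index++], first, len, 0);
        }
        return index;  /* number of scatterlist entries actually used */
}

The non-mergeable cases would still need either request splitting or a higher
descriptor chain limit, which is what the spec discussion above is about.
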
> 
> > > So looks like it was probably still easier and realistic to just add
> > > virtio
> > > capabilities for now for allowing to exceed current descriptor limit.
> > 
> > I'm still not sure why virtio-net, virtio-blk, virtio-fs, etc perform
> > fine under today's limits while virtio-9p needs a much higher limit to
> > achieve good performance. Maybe there is an issue in a layer above the
> > vring that's causing the virtio-9p performance you've observed?
> 
> Are you referring to (somewhat) recent benchmarks when saying those would all 
> still perform fine today?

I'm not referring to specific benchmark results. Just that none of those
devices needed to raise the descriptor chain length, so I'm surprised
that virtio-9p needs it because it's conceptually similar to these
devices.

> Vivek was running detailed benchmarks for virtiofs vs. 9p:
> https://lists.gnu.org/archive/html/qemu-devel/2020-12/msg02704.html
> 
> For the virtio aspect discussed here, only the benchmark configurations 
> without cache are relevant (9p-none, vtfs-none) and under this aspect the 
> situation seems to be quite similar between 9p and virtio-fs. You'll also note 
> once DAX is enabled (vtfs-none-dax) that apparently boosts virtio-fs 
> performance significantly, which however seems to correlate to numbers when I 
> am running 9p with msize > 300k. Note: Vivek was presumably running 9p 
> effectively with msize=300k, as this was the kernel limitation at that time.

Agreed, virtio-9p and virtiofs are similar without caching.

I think we shouldn't consider DAX here since it bypasses the virtqueue.

> To bring things into relation: there are known performance aspects in 9p that 
> can be improved, yes, both on Linux kernel side and on 9p server side in QEMU. 
> For instance 9p server uses coroutines [1] and currently dispatches between 
> worker thread(s) and main thread too often per request (partly addressed 
> already [2], but still WIP), which accumulates to overall latency. But Vivek 
> was actually using a 9p patch here which disabled coroutines entirely, which 
> suggests that the virtio transmission size limit still represents a 
> bottleneck.

These results were collected with 4k block size. Neither msize nor the
descriptor chain length limits will be stressed, so I don't think these
results are relevant here.

Maybe a more relevant comparison would be virtio-9p, virtiofs, and
virtio-blk when block size is large (e.g. 1M). The Linux block layer in
the guest will split virtio-blk requests when they exceed the block
queue limits.
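
For reference, those limits are what the driver registers with the block
layer; a rough sketch (assuming the classic blk_queue_* limit helpers, not
the actual virtio-blk code):

#include <linux/blkdev.h>

/*
 * Sketch: advertise per-request limits so the guest block layer splits any
 * larger I/O before it is mapped onto the virtqueue.
 */
static void example_apply_queue_limits(struct request_queue *q,
                                       unsigned short seg_max,
                                       unsigned int max_hw_sectors)
{
        /* max. scatter-gather segments per request (descriptor chain length) */
        blk_queue_max_segments(q, seg_max);
        /* max. request size, in 512-byte sectors */
        blk_queue_max_hw_sectors(q, max_hw_sectors);
}

9p has no equivalent of this today: msize is a byte limit only, so a separate
segment-count limit (as discussed above) would be needed to get the same
splitting behaviour.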

Stefan

> 
> [1] https://wiki.qemu.org/Documentation/9p#Coroutines
> [2] https://wiki.qemu.org/Documentation/9p#Implementation_Plans
> 
> Best regards,
> Christian Schoenebeck
> 
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Virtio-fs] [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
@ 2021-11-10 10:05                                 ` Stefan Hajnoczi
  0 siblings, 0 replies; 97+ messages in thread
From: Stefan Hajnoczi @ 2021-11-10 10:05 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	Jason Wang, Amit Shah, David Hildenbrand, qemu-devel, virtio-fs,
	Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Marc-André Lureau, Paolo Bonzini, Fam Zheng,
	Raphael Norwitz

[-- Attachment #1: Type: text/plain, Size: 43910 bytes --]

On Tue, Nov 09, 2021 at 02:09:59PM +0100, Christian Schoenebeck wrote:
> On Dienstag, 9. November 2021 11:56:35 CET Stefan Hajnoczi wrote:
> > On Thu, Nov 04, 2021 at 03:41:23PM +0100, Christian Schoenebeck wrote:
> > > On Mittwoch, 3. November 2021 12:33:33 CET Stefan Hajnoczi wrote:
> > > > On Mon, Nov 01, 2021 at 09:29:26PM +0100, Christian Schoenebeck wrote:
> > > > > On Donnerstag, 28. Oktober 2021 11:00:48 CET Stefan Hajnoczi wrote:
> > > > > > On Mon, Oct 25, 2021 at 05:03:25PM +0200, Christian Schoenebeck 
> wrote:
> > > > > > > On Montag, 25. Oktober 2021 12:30:41 CEST Stefan Hajnoczi wrote:
> > > > > > > > On Thu, Oct 21, 2021 at 05:39:28PM +0200, Christian Schoenebeck 
> wrote:
> > > > > > > > > On Freitag, 8. Oktober 2021 18:08:48 CEST Christian 
> Schoenebeck wrote:
> > > > > > > > > > On Freitag, 8. Oktober 2021 16:24:42 CEST Christian 
> Schoenebeck wrote:
> > > > > > > > > > > On Freitag, 8. Oktober 2021 09:25:33 CEST Greg Kurz wrote:
> > > > > > > > > > > > On Thu, 7 Oct 2021 16:42:49 +0100
> > > > > > > > > > > > 
> > > > > > > > > > > > Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > > > > > > > > On Thu, Oct 07, 2021 at 02:51:55PM +0200, Christian 
> Schoenebeck wrote:
> > > > > > > > > > > > > > On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan 
> Hajnoczi wrote:
> > > > > > > > > > > > > > > On Mon, Oct 04, 2021 at 09:38:00PM +0200,
> > > > > > > > > > > > > > > Christian
> > > > > > > > > > > > > > > Schoenebeck
> > > > > > > > > > 
> > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > At the moment the maximum transfer size with virtio is limited to 4M
> > > > > > > > > > > > > > > > (1024 * PAGE_SIZE). This series raises this limit to its maximum
> > > > > > > > > > > > > > > > theoretical possible transfer size of 128M (32k pages) according to
> > > > > > > > > > > > > > > > the virtio specs:
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-240006
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Hi Christian,
> > > > > > > > > > > > 
> > > > > > > > > > > > > > > I took a quick look at the code:
> > > > > > > > > > > > Hi,
> > > > > > > > > > > > 
> > > > > > > > > > > > Thanks Stefan for sharing virtio expertise and helping
> > > > > > > > > > > > Christian
> > > > > > > > > > > > !
> > > > > > > > > > > > 
> > > > > > > > > > > > > > > - The Linux 9p driver restricts descriptor chains
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > 128
> > > > > > > > > > > > > > > elements
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > >   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Yes, that's the limitation that I am about to remove (WIP); current
> > > > > > > > > > > > > > kernel patches:
> > > > > > > > > > > > > > https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyte.com/
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I haven't read the patches yet but I'm concerned that
> > > > > > > > > > > > > today
> > > > > > > > > > > > > the
> > > > > > > > > > > > > driver
> > > > > > > > > > > > > is pretty well-behaved and this new patch series
> > > > > > > > > > > > > introduces a
> > > > > > > > > > > > > spec
> > > > > > > > > > > > > violation. Not fixing existing spec violations is
> > > > > > > > > > > > > okay,
> > > > > > > > > > > > > but
> > > > > > > > > > > > > adding
> > > > > > > > > > > > > new
> > > > > > > > > > > > > ones is a red flag. I think we need to figure out a
> > > > > > > > > > > > > clean
> > > > > > > > > > > > > solution.
> > > > > > > > > > > 
> > > > > > > > > > > Nobody has reviewed the kernel patches yet. My main
> > > > > > > > > > > concern
> > > > > > > > > > > therefore
> > > > > > > > > > > actually is that the kernel patches are already too
> > > > > > > > > > > complex,
> > > > > > > > > > > because
> > > > > > > > > > > the
> > > > > > > > > > > current situation is that only Dominique is handling 9p
> > > > > > > > > > > patches on
> > > > > > > > > > > kernel
> > > > > > > > > > > side, and he barely has time for 9p anymore.
> > > > > > > > > > > 
> > > > > > > > > > > Another reason for me to catch up on reading current
> > > > > > > > > > > kernel
> > > > > > > > > > > code
> > > > > > > > > > > and
> > > > > > > > > > > stepping in as reviewer of 9p on kernel side ASAP,
> > > > > > > > > > > independent
> > > > > > > > > > > of
> > > > > > > > > > > this
> > > > > > > > > > > issue.
> > > > > > > > > > > 
> > > > > > > > > > > As for current kernel patches' complexity: I can certainly
> > > > > > > > > > > drop
> > > > > > > > > > > patch
> > > > > > > > > > > 7
> > > > > > > > > > > entirely as it is probably just overkill. Patch 4 is then
> > > > > > > > > > > the
> > > > > > > > > > > biggest
> > > > > > > > > > > chunk, I have to see if I can simplify it, and whether it
> > > > > > > > > > > would
> > > > > > > > > > > make
> > > > > > > > > > > sense to squash with patch 3.
> > > > > > > > > > > 
> > > > > > > > > > > > > > > - The QEMU 9pfs code passes iovecs directly to
> > > > > > > > > > > > > > > preadv(2)
> > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > will
> > > > > > > > > > > > > > > fail
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > >   with EINVAL when called with more than IOV_MAX
> > > > > > > > > > > > > > >   iovecs
> > > > > > > > > > > > > > >   (hw/9pfs/9p.c:v9fs_read())
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Hmm, which makes me wonder why I never encountered
> > > > > > > > > > > > > > this
> > > > > > > > > > > > > > error
> > > > > > > > > > > > > > during
> > > > > > > > > > > > > > testing.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Most people will use the 9p qemu 'local' fs driver
> > > > > > > > > > > > > > backend
> > > > > > > > > > > > > > in
> > > > > > > > > > > > > > practice,
> > > > > > > > > > > > > > so
> > > > > > > > > > > > > > that v9fs_read() call would translate for most
> > > > > > > > > > > > > > people to
> > > > > > > > > > > > > > this
> > > > > > > > > > > > > > implementation on QEMU side (hw/9p/9p-local.c):
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > static ssize_t local_preadv(FsContext *ctx,
> > > > > > > > > > > > > > V9fsFidOpenState
> > > > > > > > > > > > > > *fs,
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > >                             const struct iovec *iov,
> > > > > > > > > > > > > >                             int iovcnt, off_t
> > > > > > > > > > > > > >                             offset)
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > {
> > > > > > > > > > > > > > #ifdef CONFIG_PREADV
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > >     return preadv(fs->fd, iov, iovcnt, offset);
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > #else
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > >     int err = lseek(fs->fd, offset, SEEK_SET);
> > > > > > > > > > > > > >     if (err == -1) {
> > > > > > > > > > > > > >     
> > > > > > > > > > > > > >         return err;
> > > > > > > > > > > > > >     
> > > > > > > > > > > > > >     } else {
> > > > > > > > > > > > > >     
> > > > > > > > > > > > > >         return readv(fs->fd, iov, iovcnt);
> > > > > > > > > > > > > >     
> > > > > > > > > > > > > >     }
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > #endif
> > > > > > > > > > > > > > }
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Unless I misunderstood the code, neither side can
> > > > > > > > > > > > > > > take
> > > > > > > > > > > > > > > advantage
> > > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > new 32k descriptor chain limit?
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > Stefan
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > I need to check that when I have some more time. One
> > > > > > > > > > > > > > possible
> > > > > > > > > > > > > > explanation
> > > > > > > > > > > > > > might be that preadv() already has this wrapped into
> > > > > > > > > > > > > > a
> > > > > > > > > > > > > > loop
> > > > > > > > > > > > > > in
> > > > > > > > > > > > > > its
> > > > > > > > > > > > > > implementation to circumvent a limit like IOV_MAX.
> > > > > > > > > > > > > > It
> > > > > > > > > > > > > > might
> > > > > > > > > > > > > > be
> > > > > > > > > > > > > > another
> > > > > > > > > > > > > > "it
> > > > > > > > > > > > > > works, but not portable" issue, but not sure.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > There are still a bunch of other issues I have to
> > > > > > > > > > > > > > resolve.
> > > > > > > > > > > > > > If
> > > > > > > > > > > > > > you
> > > > > > > > > > > > > > look
> > > > > > > > > > > > > > at
> > > > > > > > > > > > > > net/9p/client.c on kernel side, you'll notice that
> > > > > > > > > > > > > > it
> > > > > > > > > > > > > > basically
> > > > > > > > > > > > > > does
> > > > > > > > > > > > > > this ATM> >
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > >     kmalloc(msize);
> > > > > > > > > > > > 
> > > > > > > > > > > > Note that this is done twice : once for the T message
> > > > > > > > > > > > (client
> > > > > > > > > > > > request)
> > > > > > > > > > > > and
> > > > > > > > > > > > once for the R message (server answer). The 9p driver
> > > > > > > > > > > > could
> > > > > > > > > > > > adjust
> > > > > > > > > > > > the
> > > > > > > > > > > > size
> > > > > > > > > > > > of the T message to what's really needed instead of
> > > > > > > > > > > > allocating
> > > > > > > > > > > > the
> > > > > > > > > > > > full
> > > > > > > > > > > > msize. R message size is not known though.
> > > > > > > > > > > 
> > > > > > > > > > > Would it make sense adding a second virtio ring, dedicated
> > > > > > > > > > > to
> > > > > > > > > > > server
> > > > > > > > > > > responses to solve this? IIRC 9p server already calculates
> > > > > > > > > > > appropriate
> > > > > > > > > > > exact sizes for each response type. So server could just
> > > > > > > > > > > push
> > > > > > > > > > > space
> > > > > > > > > > > that's
> > > > > > > > > > > really needed for its responses.
> > > > > > > > > > > 
> > > > > > > > > > > > > > for every 9p request. So not only does it allocate
> > > > > > > > > > > > > > much
> > > > > > > > > > > > > > more
> > > > > > > > > > > > > > memory
> > > > > > > > > > > > > > for
> > > > > > > > > > > > > > every request than actually required (i.e. say 9pfs
> > > > > > > > > > > > > > was
> > > > > > > > > > > > > > mounted
> > > > > > > > > > > > > > with
> > > > > > > > > > > > > > msize=8M, then a 9p request that actually would just
> > > > > > > > > > > > > > need 1k
> > > > > > > > > > > > > > would
> > > > > > > > > > > > > > nevertheless allocate 8M), but also it allocates >
> > > > > > > > > > > > > > PAGE_SIZE,
> > > > > > > > > > > > > > which
> > > > > > > > > > > > > > obviously may fail at any time.>
> > > > > > > > > > > > > 
> > > > > > > > > > > > > The PAGE_SIZE limitation sounds like a kmalloc() vs
> > > > > > > > > > > > > vmalloc()
> > > > > > > > > > > > > situation.
> > > > > > > > > > > 
> > > > > > > > > > > Hu, I didn't even consider vmalloc(). I just tried the
> > > > > > > > > > > kvmalloc()
> > > > > > > > > > > wrapper
> > > > > > > > > > > as a quick & dirty test, but it crashed in the same way as
> > > > > > > > > > > kmalloc()
> > > > > > > > > > > with
> > > > > > > > > > > large msize values immediately on mounting:
> > > > > > > > > > > 
> > > > > > > > > > > diff --git a/net/9p/client.c b/net/9p/client.c
> > > > > > > > > > > index a75034fa249b..cfe300a4b6ca 100644
> > > > > > > > > > > --- a/net/9p/client.c
> > > > > > > > > > > +++ b/net/9p/client.c
> > > > > > > > > > > @@ -227,15 +227,18 @@ static int parse_opts(char *opts,
> > > > > > > > > > > struct
> > > > > > > > > > > p9_client
> > > > > > > > > > > *clnt)
> > > > > > > > > > > 
> > > > > > > > > > >  static int p9_fcall_init(struct p9_client *c, struct
> > > > > > > > > > >  p9_fcall
> > > > > > > > > > >  *fc,
> > > > > > > > > > >  
> > > > > > > > > > >                          int alloc_msize)
> > > > > > > > > > >  
> > > > > > > > > > >  {
> > > > > > > > > > > 
> > > > > > > > > > > -       if (likely(c->fcall_cache) && alloc_msize ==
> > > > > > > > > > > c->msize)
> > > > > > > > > > > {
> > > > > > > > > > > +       //if (likely(c->fcall_cache) && alloc_msize ==
> > > > > > > > > > > c->msize) {
> > > > > > > > > > > +       if (false) {
> > > > > > > > > > > 
> > > > > > > > > > >                 fc->sdata =
> > > > > > > > > > >                 kmem_cache_alloc(c->fcall_cache,
> > > > > > > > > > >                 GFP_NOFS);
> > > > > > > > > > >                 fc->cache = c->fcall_cache;
> > > > > > > > > > >         
> > > > > > > > > > >         } else {
> > > > > > > > > > > 
> > > > > > > > > > > -               fc->sdata = kmalloc(alloc_msize,
> > > > > > > > > > > GFP_NOFS);
> > > > > > > > > > > +               fc->sdata = kvmalloc(alloc_msize,
> > > > > > > > > > > GFP_NOFS);
> > > > > > > > > > 
> > > > > > > > > > Ok, GFP_NOFS -> GFP_KERNEL did the trick.
> > > > > > > > > > 
> > > > > > > > > > Now I get:
> > > > > > > > > >    virtio: bogus descriptor or out of resources
> > > > > > > > > > 
> > > > > > > > > > So, still some work ahead on both ends.
> > > > > > > > > 
> > > > > > > > > Few hacks later (only changes on 9p client side) I got this
> > > > > > > > > running
> > > > > > > > > stable
> > > > > > > > > now. The reason for the virtio error above was that kvmalloc()
> > > > > > > > > returns
> > > > > > > > > a
> > > > > > > > > non-logical kernel address for any kvmalloc(>4M), i.e. an
> > > > > > > > > address
> > > > > > > > > that
> > > > > > > > > is
> > > > > > > > > inaccessible from host side, hence that "bogus descriptor"
> > > > > > > > > message
> > > > > > > > > by
> > > > > > > > > QEMU.
> > > > > > > > > So I had to split those linear 9p client buffers into sparse
> > > > > > > > > ones
> > > > > > > > > (set
> > > > > > > > > of
> > > > > > > > > individual pages).
> > > > > > > > > 
> > > > > > > > > I tested this for some days with various virtio transmission
> > > > > > > > > sizes
> > > > > > > > > and
> > > > > > > > > it
> > > > > > > > > works as expected up to 128 MB (more precisely: 128 MB read
> > > > > > > > > space
> > > > > > > > > +
> > > > > > > > > 128 MB
> > > > > > > > > write space per virtio round trip message).
> > > > > > > > > 
> > > > > > > > > I did not encounter a show stopper for large virtio
> > > > > > > > > transmission
> > > > > > > > > sizes
> > > > > > > > > (4 MB ... 128 MB) on virtio level, neither as a result of
> > > > > > > > > testing,
> > > > > > > > > nor
> > > > > > > > > after reviewing the existing code.
> > > > > > > > > 
> > > > > > > > > About IOV_MAX: that's apparently not an issue on virtio level.
> > > > > > > > > Most of
> > > > > > > > > the
> > > > > > > > > iovec code, both on Linux kernel side and on QEMU side do not
> > > > > > > > > have
> > > > > > > > > this
> > > > > > > > > limitation. It is apparently however indeed a limitation for
> > > > > > > > > userland
> > > > > > > > > apps
> > > > > > > > > calling the Linux kernel's syscalls yet.
> > > > > > > > > 
> > > > > > > > > Stefan, as it stands now, I am even more convinced that the
> > > > > > > > > upper
> > > > > > > > > virtio
> > > > > > > > > transmission size limit should not be squeezed into the queue
> > > > > > > > > size
> > > > > > > > > argument of virtio_add_queue(). Not because of the previous
> > > > > > > > > argument
> > > > > > > > > that
> > > > > > > > > it would waste space (~1MB), but rather because they are two
> > > > > > > > > different
> > > > > > > > > things. To outline this, just a quick recap of what happens
> > > > > > > > > exactly
> > > > > > > > > when
> > > > > > > > > a bulk message is pushed over the virtio wire (assuming virtio
> > > > > > > > > "split"
> > > > > > > > > layout here):
> > > > > > > > > 
> > > > > > > > > ---------- [recap-start] ----------
> > > > > > > > > 
> > > > > > > > > For each bulk message sent guest <-> host, exactly *one* of
> > > > > > > > > the
> > > > > > > > > pre-allocated descriptors is taken and placed (subsequently)
> > > > > > > > > into
> > > > > > > > > exactly
> > > > > > > > > *one* position of the two available/used ring buffers. The
> > > > > > > > > actual
> > > > > > > > > descriptor table though, containing all the DMA addresses of
> > > > > > > > > the
> > > > > > > > > message
> > > > > > > > > bulk data, is allocated just in time for each round trip
> > > > > > > > > message.
> > > > > > > > > Say,
> > > > > > > > > it
> > > > > > > > > is the first message sent, it yields in the following
> > > > > > > > > structure:
> > > > > > > > > 
> > > > > > > > > Ring Buffer   Descriptor Table      Bulk Data Pages
> > > > > > > > > 
> > > > > > > > >    +-+              +-+           +-----------------+
> > > > > > > > >    
> > > > > > > > >    |D|------------->|d|---------->| Bulk data block |
> > > > > > > > >    
> > > > > > > > >    +-+              |d|--------+  +-----------------+
> > > > > > > > >    
> > > > > > > > >    | |              |d|------+ |
> > > > > > > > >    
> > > > > > > > >    +-+               .       | |  +-----------------+
> > > > > > > > >    
> > > > > > > > >    | |               .       | +->| Bulk data block |
> > > > > > > > >     
> > > > > > > > >     .                .       |    +-----------------+
> > > > > > > > >     .               |d|-+    |
> > > > > > > > >     .               +-+ |    |    +-----------------+
> > > > > > > > >     
> > > > > > > > >    | |                  |    +--->| Bulk data block |
> > > > > > > > >    
> > > > > > > > >    +-+                  |         +-----------------+
> > > > > > > > >    
> > > > > > > > >    | |                  |                 .
> > > > > > > > >    
> > > > > > > > >    +-+                  |                 .
> > > > > > > > >    
> > > > > > > > >                         |                 .
> > > > > > > > >                         |         
> > > > > > > > >                         |         +-----------------+
> > > > > > > > >                         
> > > > > > > > >                         +-------->| Bulk data block |
> > > > > > > > >                         
> > > > > > > > >                                   +-----------------+
> > > > > > > > > 
> > > > > > > > > Legend:
> > > > > > > > > D: pre-allocated descriptor
> > > > > > > > > d: just in time allocated descriptor
> > > > > > > > > -->: memory pointer (DMA)
> > > > > > > > > 
> > > > > > > > > The bulk data blocks are allocated by the respective device
> > > > > > > > > driver
> > > > > > > > > above
> > > > > > > > > virtio subsystem level (guest side).
> > > > > > > > > 
> > > > > > > > > There are exactly as many descriptors pre-allocated (D) as the
> > > > > > > > > size of
> > > > > > > > > a
> > > > > > > > > ring buffer.
> > > > > > > > > 
> > > > > > > > > A "descriptor" is more or less just a chainable DMA memory
> > > > > > > > > pointer;
> > > > > > > > > defined
> > > > > > > > > as:
> > > > > > > > > 
> > > > > > > > > /* Virtio ring descriptors: 16 bytes.  These can chain
> > > > > > > > > together
> > > > > > > > > via
> > > > > > > > > "next". */ struct vring_desc {
> > > > > > > > > 
> > > > > > > > > 	/* Address (guest-physical). */
> > > > > > > > > 	__virtio64 addr;
> > > > > > > > > 	/* Length. */
> > > > > > > > > 	__virtio32 len;
> > > > > > > > > 	/* The flags as indicated above. */
> > > > > > > > > 	__virtio16 flags;
> > > > > > > > > 	/* We chain unused descriptors via this, too */
> > > > > > > > > 	__virtio16 next;
> > > > > > > > > 
> > > > > > > > > };
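
A minimal sketch of how a driver might chain such descriptors for one request
with n_out device-readable and n_in device-writable segments (illustrative
only, allocation and free-list handling omitted, not the actual
drivers/virtio/virtio_ring.c code):

#include <linux/virtio_config.h>
#include <linux/virtio_ring.h>

/* Illustrative only: fill a descriptor chain for one request. */
static void fill_desc_chain(struct virtio_device *vdev,
                            struct vring_desc *desc, u16 head,
                            const dma_addr_t *addr, const u32 *len,
                            u16 n_out, u16 n_in)
{
	u16 i, n = n_out + n_in;

	for (i = 0; i < n; i++) {
		u16 flags = 0;

		if (i >= n_out)
			flags |= VRING_DESC_F_WRITE;   /* device writes here */
		if (i + 1 < n)
			flags |= VRING_DESC_F_NEXT;    /* more segments follow */

		desc[head + i].addr  = cpu_to_virtio64(vdev, addr[i]);
		desc[head + i].len   = cpu_to_virtio32(vdev, len[i]);
		desc[head + i].flags = cpu_to_virtio16(vdev, flags);
		desc[head + i].next  = cpu_to_virtio16(vdev, head + i + 1);
	}
	/* Only "head" is then placed into the available ring, so the whole
	 * bulk transfer still occupies a single ring slot. */
}
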
> > > > > > > > > 
> > > > > > > > > There are 2 ring buffers; the "available" ring buffer is for
> > > > > > > > > sending a
> > > > > > > > > message guest->host (which will transmit DMA addresses of
> > > > > > > > > guest
> > > > > > > > > allocated
> > > > > > > > > bulk data blocks that are used for data sent to device, and
> > > > > > > > > separate
> > > > > > > > > guest allocated bulk data blocks that will be used by host
> > > > > > > > > side to
> > > > > > > > > place
> > > > > > > > > its response bulk data), and the "used" ring buffer is for
> > > > > > > > > sending
> > > > > > > > > host->guest to let guest know about host's response and that
> > > > > > > > > it
> > > > > > > > > could
> > > > > > > > > now
> > > > > > > > > safely consume and then deallocate the bulk data blocks
> > > > > > > > > subsequently.
> > > > > > > > > 
> > > > > > > > > ---------- [recap-end] ----------
> > > > > > > > > 
> > > > > > > > > So the "queue size" actually defines the ringbuffer size. It
> > > > > > > > > does
> > > > > > > > > not
> > > > > > > > > define the maximum amount of descriptors. The "queue size"
> > > > > > > > > rather
> > > > > > > > > defines
> > > > > > > > > how many pending messages can be pushed into either one
> > > > > > > > > ringbuffer
> > > > > > > > > before
> > > > > > > > > the other side would need to wait until the counter side would
> > > > > > > > > step up
> > > > > > > > > (i.e. ring buffer full).
> > > > > > > > > 
> > > > > > > > > The maximum amount of descriptors (what VIRTQUEUE_MAX_SIZE
> > > > > > > > > actually
> > > > > > > > > is)
> > > > > > > > > OTOH defines the max. bulk data size that could be transmitted
> > > > > > > > > with
> > > > > > > > > each
> > > > > > > > > virtio round trip message.
> > > > > > > > > 
> > > > > > > > > And in fact, 9p currently handles the virtio "queue size" as
> > > > > > > > > directly
> > > > > > > > > associative with its maximum amount of active 9p requests the
> > > > > > > > > server
> > > > > > > > > could
> > > > > > > > > 
> > > > > > > > > > handle simultaneously:
> > > > > > > > >   hw/9pfs/9p.h:#define MAX_REQ         128
> > > > > > > > >   hw/9pfs/9p.h:    V9fsPDU pdus[MAX_REQ];
> > > > > > > > >   hw/9pfs/virtio-9p-device.c:    v->vq =
> > > > > > > > >   virtio_add_queue(vdev,
> > > > > > > > >   MAX_REQ,
> > > > > > > > >   
> > > > > > > > >                                  handle_9p_output);
> > > > > > > > > 
> > > > > > > > > So if I changed it like this, just for the purpose to
> > > > > > > > > increase
> > > > > > > > > the
> > > > > > > > > max. virtio transmission size:
> > > > > > > > > 
> > > > > > > > > --- a/hw/9pfs/virtio-9p-device.c
> > > > > > > > > +++ b/hw/9pfs/virtio-9p-device.c
> > > > > > > > > @@ -218,7 +218,7 @@ static void
> > > > > > > > > virtio_9p_device_realize(DeviceState
> > > > > > > > > *dev,
> > > > > > > > > Error **errp)>
> > > > > > > > > 
> > > > > > > > >      v->config_size = sizeof(struct virtio_9p_config) +
> > > > > > > > >      strlen(s->fsconf.tag);
> > > > > > > > >      virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P,
> > > > > > > > >      v->config_size,
> > > > > > > > >      
> > > > > > > > >                  VIRTQUEUE_MAX_SIZE);
> > > > > > > > > 
> > > > > > > > > -    v->vq = virtio_add_queue(vdev, MAX_REQ,
> > > > > > > > > handle_9p_output);
> > > > > > > > > +    v->vq = virtio_add_queue(vdev, 32*1024,
> > > > > > > > > handle_9p_output);
> > > > > > > > > 
> > > > > > > > >  }
> > > > > > > > > 
> > > > > > > > > Then it would require additional synchronization code on both
> > > > > > > > > ends
> > > > > > > > > and
> > > > > > > > > therefore unnecessary complexity, because it would now be
> > > > > > > > > possible
> > > > > > > > > that
> > > > > > > > > more requests are pushed into the ringbuffer than the server could
> > > > > > > > > handle.
> > > > > > > > > 
> > > > > > > > > There is one potential issue though that probably did justify
> > > > > > > > > the
> > > > > > > > > "don't
> > > > > > > > > exceed the queue size" rule:
> > > > > > > > > 
> > > > > > > > > ATM the descriptor table is allocated (just in time) as *one*
> > > > > > > > > continuous
> > > > > > > > > buffer via kmalloc_array():
> > > > > > > > > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53
> > > > > > > > > 798c
> > > > > > > > > a086
> > > > > > > > > f7c7
> > > > > > > > > d33a4/drivers/virtio/virtio_ring.c#L440
> > > > > > > > > 
> > > > > > > > > So assuming transmission size of 2 * 128 MB that
> > > > > > > > > kmalloc_array()
> > > > > > > > > call
> > > > > > > > > would
> > > > > > > > > result in a kmalloc(1M) and the latter might fail if the guest had
> > > > > > > > > highly
> > > > > > > > > fragmented physical memory. For such kind of error case there
> > > > > > > > > is
> > > > > > > > > currently a fallback path in virtqueue_add_split() that would
> > > > > > > > > then
> > > > > > > > > use
> > > > > > > > > the required amount of pre-allocated descriptors instead:
> > > > > > > > > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53
> > > > > > > > > 798c
> > > > > > > > > a086
> > > > > > > > > f7c7
> > > > > > > > > d33a4/drivers/virtio/virtio_ring.c#L525
> > > > > > > > > 
> > > > > > > > > That fallback recovery path would no longer be viable if the
> > > > > > > > > queue
> > > > > > > > > size
> > > > > > > > > was
> > > > > > > > > exceeded. There would be alternatives though, e.g. by allowing
> > > > > > > > > to
> > > > > > > > > chain
> > > > > > > > > indirect descriptor tables (currently prohibited by the virtio
> > > > > > > > > specs).
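
A rough sketch of that allocation decision (names and signature are
illustrative, not the actual virtqueue_add_split() code):

#include <linux/slab.h>
#include <linux/virtio_ring.h>

static struct vring_desc *alloc_request_descs(struct vring_desc *ring_descs,
					      unsigned int ring_size,
					      unsigned int free_descs,
					      unsigned int total_sg,
					      bool *indirect)
{
	struct vring_desc *desc;

	/* Preferred path: one just-in-time, physically contiguous table. */
	desc = kmalloc_array(total_sg, sizeof(*desc), GFP_ATOMIC);
	if (desc) {
		*indirect = true;
		return desc;
	}

	/*
	 * Fallback: use the pre-allocated ring descriptors directly. This
	 * recovery path only works while total_sg <= ring_size (and enough
	 * of them are currently free), which is the theoretical issue with
	 * exceeding the queue size.
	 */
	if (total_sg > ring_size || total_sg > free_descs)
		return NULL;

	*indirect = false;
	return ring_descs;
}
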
> > > > > > > > 
> > > > > > > > Making the maximum number of descriptors independent of the
> > > > > > > > queue
> > > > > > > > size
> > > > > > > > requires a change to the VIRTIO spec since the two values are
> > > > > > > > currently
> > > > > > > > explicitly tied together by the spec.
> > > > > > > 
> > > > > > > Yes, that's what the virtio specs say. But they don't say why, nor
> > > > > > > did
> > > > > > > I
> > > > > > > hear a reason in this discussion.
> > > > > > > 
> > > > > > > That's why I invested time reviewing current virtio implementation
> > > > > > > and
> > > > > > > specs, as well as actually testing exceeding that limit. And as I
> > > > > > > outlined in detail in my previous email, I only found one
> > > > > > > theoretical
> > > > > > > issue that could be addressed though.
> > > > > > 
> > > > > > I agree that there is a limitation in the VIRTIO spec, but violating
> > > > > > the
> > > > > > spec isn't an acceptable solution:
> > > > > > 
> > > > > > 1. QEMU and Linux aren't the only components that implement VIRTIO.
> > > > > > You
> > > > > > 
> > > > > >    cannot make assumptions about their implementations because it
> > > > > >    may
> > > > > >    break spec-compliant implementations that you haven't looked at.
> > > > > >    
> > > > > >    Your patches weren't able to increase Queue Size because some
> > > > > >    device
> > > > > >    implementations break when descriptor chains are too long. This
> > > > > >    shows
> > > > > >    there is a practical issue even in QEMU.
> > > > > > 
> > > > > > 2. The specific spec violation that we discussed creates the problem
> > > > > > 
> > > > > >    that drivers can no longer determine the maximum descriptor
> > > > > >    chain
> > > > > >    length. This in turn will lead to more implementation-specific
> > > > > >    assumptions being baked into drivers and cause problems with
> > > > > >    interoperability and future changes.
> > > > > > 
> > > > > > The spec needs to be extended instead. I included an idea for how to
> > > > > > do
> > > > > > that below.
> > > > > 
> > > > > Sure, I just wanted to see if there was a non-negligible "hard" show
> > > > > stopper per se that I probably haven't seen yet. I have not questioned
> > > > > aiming for a clean solution.
> > > > > 
> > > > > Thanks for the clarification!
> > > > > 
> > > > > > > > Before doing that, are there benchmark results showing that 1 MB
> > > > > > > > vs
> > > > > > > > 128
> > > > > > > > MB produces a performance improvement? I'm asking because if
> > > > > > > > performance
> > > > > > > > with 1 MB is good then you can probably do that without having
> > > > > > > > to
> > > > > > > > change
> > > > > > > > VIRTIO and also because it's counter-intuitive that 9p needs 128
> > > > > > > > MB
> > > > > > > > for
> > > > > > > > good performance when it's ultimately implemented on top of disk
> > > > > > > > and
> > > > > > > > network I/O that have lower size limits.
> > > > > > > 
> > > > > > > First some numbers, linear reading a 12 GB file:
> > > > > > > 
> > > > > > > msize    average      notes
> > > > > > > 
> > > > > > > 8 kB     52.0 MB/s    default msize of Linux kernel <v5.15
> > > > > > > 128 kB   624.8 MB/s   default msize of Linux kernel >=v5.15
> > > > > > > 512 kB   1961 MB/s    current max. msize with any Linux kernel <=v5.15
> > > > > > > 1 MB     2551 MB/s    this msize would already violate virtio specs
> > > > > > > 2 MB     2521 MB/s    this msize would already violate virtio specs
> > > > > > > 4 MB     2628 MB/s    planned max. msize of my current kernel patches [1]
> > > > > > 
> > > > > > How many descriptors are used? 4 MB can be covered by a single
> > > > > > descriptor if the data is physically contiguous in memory, so this
> > > > > > data
> > > > > > doesn't demonstrate a need for more descriptors.
> > > > > 
> > > > > No, in the last couple years there was apparently no kernel version
> > > > > that
> > > > > used just one descriptor, nor did my benchmarked version. Even though
> > > > > the
> > > > > Linux 9p client still uses simple linear buffers (contiguous physical
> > > > > memory) on 9p client level, these are however split into PAGE_SIZE
> > > > > chunks
> > > > > by function pack_sg_list() [1] before being fed to virtio level:
> > > > > 
> > > > > static unsigned int rest_of_page(void *data)
> > > > > {
> > > > > 
> > > > > 	return PAGE_SIZE - offset_in_page(data);
> > > > > 
> > > > > }
> > > > > ...
> > > > > static int pack_sg_list(struct scatterlist *sg, int start,
> > > > > 
> > > > > 			int limit, char *data, int count)
> > > > > 
> > > > > {
> > > > > 
> > > > > 	int s;
> > > > > 	int index = start;
> > > > > 	
> > > > > 	while (count) {
> > > > > 	
> > > > > 		s = rest_of_page(data);
> > > > > 		...
> > > > > 		sg_set_buf(&sg[index++], data, s);
> > > > > 		count -= s;
> > > > > 		data += s;
> > > > > 	
> > > > > 	}
> > > > > 	...
> > > > > 
> > > > > }
> > > > > 
> > > > > [1]
> > > > > https://github.com/torvalds/linux/blob/19901165d90fdca1e57c9baa0d5b4c6
> > > > > 3d1
> > > > > 5c476a/net/9p/trans_virtio.c#L171
> > > > > 
> > > > > So when sending 4 MB over the virtio wire, it would currently yield 1k
> > > > > descriptors.
> > > > > 
> > > > > I have wondered about this before, but did not question it, because
> > > > > due to
> > > > > the cross-platform nature I couldn't say for certain whether that's
> > > > > perhaps needed somewhere. I mean for the virtio-PCI case I know for
> > > > > sure
> > > > > that one descriptor (i.e. >PAGE_SIZE) would be fine, but I don't know
> > > > > if
> > > > > that applies to all buses and architectures.
> > > > 
> > > > VIRTIO does not limit the descriptor len field to PAGE_SIZE,
> > > > so I don't think there is a limit at the VIRTIO level.
> > > 
> > > So you are viewing this purely from virtio specs PoV: in the sense, if it
> > > is not prohibited by the virtio specs, then it should work. Maybe.
> > 
> > Limitations must be specified either in the 9P protocol or the VIRTIO
> > specification. Drivers and devices will not be able to operate correctly
> > if there are limitations that aren't covered by the specs.
> > 
> > Do you have something in mind that isn't covered by the specs?
> 
> Not sure whether that's something that should be specified by the virtio 
> specs, probably not. I simply do not know whether there is any bus or 
> architecture that would have a limit on the max. size of a memory block 
> passed per one DMA address.

Host-side limitations like that can exist. For example when a physical
storage device on the host has limits that the VIRTIO device does not
have. In this case both virtio-scsi and virtio-blk report those limits
to the guest so that the guest won't submit requests that the physical
device would reject. I guess networking MTU is kind of similar too. What
they have in common is that the limit needs to be reported to the guest,
typically using a VIRTIO Configuration Space field. It is an explicit
limit that is part of the host<->guest interface (VIRTIO spec, SCSI,
etc).
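
For illustration, a minimal sketch of how such a limit typically reaches the
guest driver, loosely modelled on the virtio-blk seg_max field (simplified,
not the verbatim drivers/block/virtio_blk.c code):

#include <linux/virtio_config.h>
#include <linux/virtio_blk.h>
#include <linux/blkdev.h>

static void apply_seg_max(struct virtio_device *vdev, struct request_queue *q)
{
	u32 seg_max = 1;	/* conservative default if the feature is absent */

	if (virtio_has_feature(vdev, VIRTIO_BLK_F_SEG_MAX))
		virtio_cread(vdev, struct virtio_blk_config, seg_max, &seg_max);

	/* The block layer then splits requests so they never exceed it. */
	blk_queue_max_segments(q, seg_max);
}
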

> > > > If this function coalesces adjacent pages then the descriptor chain
> > > > length issues could be reduced.
> > > > 
> > > > > > > But again, this is not just about performance. My conclusion as
> > > > > > > described
> > > > > > > in my previous email is that virtio currently squeezes
> > > > > > > 
> > > > > > > 	"max. simultaneous amount of bulk messages"
> > > > > > > 
> > > > > > > vs.
> > > > > > > 
> > > > > > > 	"max. bulk data transmission size per bulk message"
> > > > > > > 
> > > > > > > into the same configuration parameter, which is IMO inappropriate
> > > > > > > and
> > > > > > > hence
> > > > > > > splitting them into 2 separate parameters when creating a queue
> > > > > > > makes
> > > > > > > sense, independent of the performance benchmarks.
> > > > > > > 
> > > > > > > [1]
> > > > > > > https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crud
> > > > > > > ebyt
> > > > > > > e.c
> > > > > > > om/
> > > > > > 
> > > > > > Some devices effectively already have this because the device
> > > > > > advertises
> > > > > > a maximum number of descriptors via device-specific mechanisms like
> > > > > > the
> > > > > > struct virtio_blk_config seg_max field. But today these fields can
> > > > > > only
> > > > > > reduce the maximum descriptor chain length because the spec still
> > > > > > limits
> > > > > > the length to Queue Size.
> > > > > > 
> > > > > > We can build on this approach to raise the length above Queue Size.
> > > > > > This
> > > > > > approach has the advantage that the maximum number of segments isn't
> > > > > > per
> > > > > > device or per virtqueue, it's fine-grained. If the device supports
> > > > > > two
> > > > > > request types then different max descriptor chain limits could be
> > > > > > given
> > > > > > for them by introducing two separate configuration space fields.
> > > > > > 
> > > > > > Here are the corresponding spec changes:
> > > > > > 
> > > > > > 1. A new feature bit called VIRTIO_RING_F_LARGE_INDIRECT_DESC is
> > > > > > added
> > > > > > 
> > > > > >    to indicate that indirect descriptor table size and maximum
> > > > > >    descriptor chain length are not limited by Queue Size value.
> > > > > >    (Maybe
> > > > > >    there still needs to be a limit like 2^15?)
> > > > > 
> > > > > Sounds good to me!
> > > > > 
> > > > > AFAIK it is effectively limited to 2^16 because of vring_desc->next:
> > > > > 
> > > > > /* Virtio ring descriptors: 16 bytes.  These can chain together via
> > > > > "next". */ struct vring_desc {
> > > > > 
> > > > >         /* Address (guest-physical). */
> > > > >         __virtio64 addr;
> > > > >         /* Length. */
> > > > >         __virtio32 len;
> > > > >         /* The flags as indicated above. */
> > > > >         __virtio16 flags;
> > > > >         /* We chain unused descriptors via this, too */
> > > > >         __virtio16 next;
> > > > > 
> > > > > };
> > > > 
> > > > Yes, Split Virtqueues have a fundamental limit on indirect table size
> > > > due to the "next" field. Packed Virtqueue descriptors don't have a
> > > > "next" field so descriptor chains could be longer in theory (currently
> > > > forbidden by the spec).
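
For reference, the packed-ring descriptor (struct vring_packed_desc in Linux's
include/uapi/linux/virtio_ring.h) has no next field; chained descriptors
simply follow each other in ring order:

struct vring_packed_desc {
	/* Buffer Address. */
	__le64 addr;
	/* Buffer Length. */
	__le32 len;
	/* Buffer ID. */
	__le16 id;
	/* The flags depending on descriptor type. */
	__le16 flags;
};
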
> > > > 
> > > > > > One thing that's messy is that we've been discussing the maximum
> > > > > > descriptor chain length but 9p has the "msize" concept, which isn't
> > > > > > aware of contiguous memory. It may be necessary to extend the 9p
> > > > > > driver
> > > > > > code to size requests not just according to their length in bytes
> > > > > > but
> > > > > > also according to the descriptor chain length. That's how the Linux
> > > > > > block layer deals with queue limits (struct queue_limits
> > > > > > max_segments vs
> > > > > > max_hw_sectors).
> > > > > 
> > > > > Hmm, can't follow on that one. What would that be needed for in the
> > > > > case of 9p? My plan was to have the 9p client simply limit msize at
> > > > > session start to whatever the max. amount of virtio descriptors
> > > > > supported by the host is, using PAGE_SIZE as the size per descriptor,
> > > > > because that's what the 9p client actually does ATM (see above). So you
> > > > > think that should be changed to e.g. just one descriptor for 4MB, right?
> > > > 
> > > > Limiting msize to the 9p transport device's maximum number of
> > > > descriptors is conservative (i.e. 128 descriptors = 512 KB msize)
> > > > because it doesn't take advantage of contiguous memory. I suggest
> > > > leaving msize alone, adding a separate limit at which requests are split
> > > > according to the maximum descriptor chain length, and tweaking
> > > > pack_sg_list() to coalesce adjacent pages.
> > > > 
> > > > That way msize can be large without necessarily using lots of
> > > > descriptors (depending on the memory layout).
> > > 
> > > That was actually a tempting solution. Because it would neither require
> > > changes to the virtio specs (at least for a while) and it would also work
> > > with older QEMU versions. And for that pack_sg_list() portion of the code
> > > it would work well and easy as the buffer passed to pack_sg_list() is
> > > contiguous already.
> > > 
> > > However I just realized for the zero-copy version of the code that would
> > > be
> > > more tricky. The ZC version already uses individual pages (struct page,
> > > hence PAGE_SIZE each) which are pinned, i.e. it uses pack_sg_list_p() [1]
> > > in combination with p9_get_mapped_pages() [2]
> > > 
> > > [1]
> > > https://github.com/torvalds/linux/blob/7ddb58cb0ecae8e8b6181d736a87667cc9
> > > ab8389/net/9p/trans_virtio.c#L218 [2]
> > > https://github.com/torvalds/linux/blob/7ddb58cb0ecae8e8b6181d736a87667cc9
> > > ab8389/net/9p/trans_virtio.c#L309
> > > 
> > > So that would require much more work and code trying to sort and coalesce
> > > individual pages to contiguous physical memory for the sake of reducing
> > > virtio descriptors. And there is no guarantee that this is even possible.
> > > The kernel may simply return a non-contiguous set of pages which would
> > > eventually end up exceeding the virtio descriptor limit again.
> > 
> > Order must be preserved so pages cannot be sorted by physical address.
> > How about simply coalescing when pages are adjacent?
> 
> It would help, but not solve the issue we are talking about here: if 99% of 
> the cases could successfully merge descriptors to stay below the descriptor 
> count limit, but in 1% of the cases it could not, then this still constitutes a 
> severe runtime issue that could trigger at any time.
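
A rough sketch of that adjacent-page coalescing idea, assuming a pinned page
array as used by the zero-copy path and whole pages only (hypothetical helper,
not the actual net/9p code):

#include <linux/errno.h>
#include <linux/mm.h>
#include <linux/scatterlist.h>

/*
 * Sketch only: build scatterlist entries from a pinned page array,
 * merging runs of physically adjacent pages into one entry.
 */
static int pack_sg_coalesced(struct scatterlist *sg, int start, int limit,
			     struct page **pages, int nr_pages)
{
	int index = start;
	int i = 0;

	while (i < nr_pages) {
		int run = 1;

		/* extend the run while the next page is physically adjacent */
		while (i + run < nr_pages &&
		       page_to_pfn(pages[i + run]) ==
		       page_to_pfn(pages[i + run - 1]) + 1)
			run++;

		if (index >= limit)
			return -ENOSPC;		/* still too many segments */

		sg_set_page(&sg[index++], pages[i], run * PAGE_SIZE, 0);
		i += run;
	}
	return index - start;			/* sg entries actually used */
}
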
> 
> > > So looks like it was probably still easier and realistic to just add
> > > virtio
> > > capabilities for now for allowing to exceed current descriptor limit.
> > 
> > I'm still not sure why virtio-net, virtio-blk, virtio-fs, etc perform
> > fine under today's limits while virtio-9p needs a much higher limit to
> > achieve good performance. Maybe there is an issue in a layer above the
> > vring that's causing the virtio-9p performance you've observed?
> 
> Are you referring to (somewhat) recent benchmarks when saying those would all 
> still perform fine today?

I'm not referring to specific benchmark results. Just that none of those
devices needed to raise the descriptor chain length, so I'm surprised
that virtio-9p needs it because it's conceptually similar to these
devices.

> Vivek was running detailed benchmarks for virtiofs vs. 9p:
> https://lists.gnu.org/archive/html/qemu-devel/2020-12/msg02704.html
> 
> For the virtio aspect discussed here, only the benchmark configurations 
> without cache are relevant (9p-none, vtfs-none) and under this aspect the 
> situation seems to be quite similar between 9p and virtio-fs. You'll also note 
> that once DAX is enabled (vtfs-none-dax), it apparently boosts virtio-fs 
> performance significantly, which however seems to correlate with the numbers I 
> get when running 9p with msize > 300k. Note: Vivek was presumably running 9p 
> effectively with msize=300k, as this was the kernel limitation at that time.

Agreed, virtio-9p and virtiofs are similar without caching.

I think we shouldn't consider DAX here since it bypasses the virtqueue.

> To bring things into relation: there are known performance aspects in 9p that 
> can be improved, yes, both on Linux kernel side and on 9p server side in QEMU. 
> For instance 9p server uses coroutines [1] and currently dispatches between 
> worker thread(s) and main thread too often per request (partly addressed 
> already [2], but still WIP), which accumulates to overall latency. But Vivek 
> was actually using a 9p patch here which disabled coroutines entirely, which 
> suggests that the virtio transmission size limit still represents a 
> bottleneck.

These results were collected with 4k block size. Neither msize nor the
descriptor chain length limits will be stressed, so I don't think these
results are relevant here.

Maybe a more relevant comparison would be virtio-9p, virtiofs, and
virtio-blk when block size is large (e.g. 1M). The Linux block layer in
the guest will split virtio-blk requests when they exceed the block
queue limits.

Stefan

> 
> [1] https://wiki.qemu.org/Documentation/9p#Coroutines
> [2] https://wiki.qemu.org/Documentation/9p#Implementation_Plans
> 
> Best regards,
> Christian Schoenebeck
> 
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
  2021-11-10 10:05                                 ` [Virtio-fs] " Stefan Hajnoczi
@ 2021-11-10 13:14                                   ` Christian Schoenebeck
  -1 siblings, 0 replies; 97+ messages in thread
From: Christian Schoenebeck @ 2021-11-10 13:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: Stefan Hajnoczi, Kevin Wolf, Laurent Vivier, qemu-block,
	Michael S. Tsirkin, Jason Wang, Amit Shah, David Hildenbrand,
	Greg Kurz, virtio-fs, Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Marc-André Lureau, Paolo Bonzini, Fam Zheng,
	Raphael Norwitz, Dr. David Alan Gilbert

On Mittwoch, 10. November 2021 11:05:50 CET Stefan Hajnoczi wrote:
> > > > So looks like it was probably still easier and realistic to just add
> > > > virtio
> > > > capabilities for now for allowing to exceed current descriptor limit.
> > > 
> > > I'm still not sure why virtio-net, virtio-blk, virtio-fs, etc perform
> > > fine under today's limits while virtio-9p needs a much higher limit to
> > > achieve good performance. Maybe there is an issue in a layer above the
> > > vring that's causing the virtio-9p performance you've observed?
> > 
> > Are you referring to (somewhat) recent benchmarks when saying those would
> > all still perform fine today?
> 
> I'm not referring to specific benchmark results. Just that none of those
> devices needed to raise the descriptor chain length, so I'm surprised
> that virtio-9p needs it because it's conceptually similar to these
> devices.

I would not say virtio-net and virtio-blk were comparable with virtio-9p and 
virtio-fs. virtio-9p and virtio-fs are fully fledged file servers which must 
perform various controller tasks before handling the actually requested I/O 
task, which inevitably adds latency to each request, whereas virtio-net and 
virtio-blk are just very thin layers that do not have that controller task 
overhead per request. And a network device only needs to handle very small 
messages in the first place.

> > Vivek was running detailed benchmarks for virtiofs vs. 9p:
> > https://lists.gnu.org/archive/html/qemu-devel/2020-12/msg02704.html
> > 
> > For the virtio aspect discussed here, only the benchmark configurations
> > without cache are relevant (9p-none, vtfs-none) and under this aspect the
> > situation seems to be quite similar between 9p and virtio-fs. You'll also
> > note that once DAX is enabled (vtfs-none-dax), it apparently boosts virtio-fs
> > performance significantly, which however seems to correlate with the numbers
> > I get when running 9p with msize > 300k. Note: Vivek was presumably
> > running 9p effectively with msize=300k, as this was the kernel limitation
> > at that time.
> Agreed, virtio-9p and virtiofs are similar without caching.
> 
> I think we shouldn't consider DAX here since it bypasses the virtqueue.

DAX bypasses virtio, sure, but the performance boost you get with DAX actually 
shows the limiting factor with virtio pretty well.

> > To bring things into relation: there are known performance aspects in 9p
> > that can be improved, yes, both on Linux kernel side and on 9p server
> > side in QEMU. For instance 9p server uses coroutines [1] and currently
> > dispatches between worker thread(s) and main thread too often per request
> > (partly addressed already [2], but still WIP), which accumulates to
> > overall latency. But Vivek was actually using a 9p patch here which
> > disabled coroutines entirely, which suggests that the virtio transmission
> > size limit still represents a bottleneck.
> 
> These results were collected with 4k block size. Neither msize nor the
> descriptor chain length limits will be stressed, so I don't think these
> results are relevant here.
> 
> Maybe a more relevant comparison would be virtio-9p, virtiofs, and
> virtio-blk when block size is large (e.g. 1M). The Linux block layer in
> the guest will split virtio-blk requests when they exceed the block
> queue limits.

I am sorry, I cannot spend time on more benchmarks like that. To really make 
fair comparisons I would need to review all their code on both ends, 
adjust configuration/sources, etc.

I do think that I performed enough benchmarks and tests to show that 
increasing the transmission size can significantly improve performance with 
9p, and that allowing the queue size to be exceeded does make sense even for small 
transmission sizes (e.g. max. active requests on 9p server side vs. max. 
transmission size per request).

The reason for the performance gain is the minimum latency involved per 
request, and like I said, that can be improved to a certain extent with 9p, 
but that will take a long time, and it cannot be eliminated entirely.

As you are apparently reluctant to change the virtio specs, what about 
introducing those discussed virtio capabilities either as experimental ones 
without specs changes, or even just as 9p specific device capabilities for 
now. I mean those could be revoked on both sides at any time anyway.

Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Virtio-fs] [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
@ 2021-11-10 13:14                                   ` Christian Schoenebeck
  0 siblings, 0 replies; 97+ messages in thread
From: Christian Schoenebeck @ 2021-11-10 13:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	Jason Wang, Amit Shah, David Hildenbrand, Raphael Norwitz,
	virtio-fs, Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Paolo Bonzini, Marc-André Lureau, Fam Zheng

On Mittwoch, 10. November 2021 11:05:50 CET Stefan Hajnoczi wrote:
> > > > So looks like it was probably still easier and realistic to just add
> > > > virtio
> > > > capabilities for now for allowing to exceed current descriptor limit.
> > > 
> > > I'm still not sure why virtio-net, virtio-blk, virtio-fs, etc perform
> > > fine under today's limits while virtio-9p needs a much higher limit to
> > > achieve good performance. Maybe there is an issue in a layer above the
> > > vring that's causing the virtio-9p performance you've observed?
> > 
> > Are you referring to (somewhat) recent benchmarks when saying those would
> > all still perform fine today?
> 
> I'm not referring to specific benchmark results. Just that none of those
> devices needed to raise the descriptor chain length, so I'm surprised
> that virtio-9p needs it because it's conceptually similar to these
> devices.

I would not say virtio-net and virtio-blk were comparable with virtio-9p and 
virtio-fs. virtio-9p and virtio-fs are fully fledged file servers which must 
perform various controller tasks before handling the actually requested I/O 
task, which inevitably adds latency to each request, whereas virtio-net and 
virtio-blk are just very thin layers that do not have that controller task 
overhead per request. And a network device only needs to handle very small 
messages in the first place.

> > Vivek was running detailed benchmarks for virtiofs vs. 9p:
> > https://lists.gnu.org/archive/html/qemu-devel/2020-12/msg02704.html
> > 
> > For the virtio aspect discussed here, only the benchmark configurations
> > without cache are relevant (9p-none, vtfs-none) and under this aspect the
> > situation seems to be quite similar between 9p and virtio-fs. You'll also
> > note that once DAX is enabled (vtfs-none-dax), it apparently boosts virtio-fs
> > performance significantly, which however seems to correlate with the numbers
> > I get when running 9p with msize > 300k. Note: Vivek was presumably
> > running 9p effectively with msize=300k, as this was the kernel limitation
> > at that time.
> Agreed, virtio-9p and virtiofs are similar without caching.
> 
> I think we shouldn't consider DAX here since it bypasses the virtqueue.

DAX bypasses virtio, sure, but the performance boost you get with DAX actually 
shows the limiting factor with virtio pretty well.

> > To bring things into relation: there are known performance aspects in 9p
> > that can be improved, yes, both on Linux kernel side and on 9p server
> > side in QEMU. For instance 9p server uses coroutines [1] and currently
> > dispatches between worker thread(s) and main thread too often per request
> > (partly addressed already [2], but still WIP), which accumulates to
> > overall latency. But Vivek was actually using a 9p patch here which
> > disabled coroutines entirely, which suggests that the virtio transmission
> > size limit still represents a bottleneck.
> 
> These results were collected with 4k block size. Neither msize nor the
> descriptor chain length limits will be stressed, so I don't think these
> results are relevant here.
> 
> Maybe a more relevant comparison would be virtio-9p, virtiofs, and
> virtio-blk when block size is large (e.g. 1M). The Linux block layer in
> the guest will split virtio-blk requests when they exceed the block
> queue limits.

I am sorry, I cannot spend time on more benchmarks like that. To really make 
fair comparisons I would need to review all their code on both ends, 
adjust configuration/sources, etc.

I do think that I performed enough benchmarks and tests to show that 
increasing the transmission size can significantly improve performance with 
9p, and that allowing the queue size to be exceeded does make sense even for small 
transmission sizes (e.g. max. active requests on 9p server side vs. max. 
transmission size per request).

The reason for the performance gain is the minimum latency involved per 
request, and like I said, that can be improved to a certain extent with 9p, 
but that will take a long time, and it cannot be eliminated entirely.

As you are apparently reluctant to change the virtio specs, what about 
introducing those discussed virtio capabilities either as experimental ones 
without specs changes, or even just as 9p specific device capabilities for 
now. I mean those could be revoked on both sides at any time anyway.

Best regards,
Christian Schoenebeck



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
  2021-11-10 13:14                                   ` [Virtio-fs] " Christian Schoenebeck
@ 2021-11-10 15:14                                     ` Stefan Hajnoczi
  -1 siblings, 0 replies; 97+ messages in thread
From: Stefan Hajnoczi @ 2021-11-10 15:14 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	Jason Wang, Amit Shah, David Hildenbrand, qemu-devel, Greg Kurz,
	virtio-fs, Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Paolo Bonzini, Marc-André Lureau, Fam Zheng,
	Raphael Norwitz, Dr. David Alan Gilbert

[-- Attachment #1: Type: text/plain, Size: 1174 bytes --]

On Wed, Nov 10, 2021 at 02:14:43PM +0100, Christian Schoenebeck wrote:
> On Mittwoch, 10. November 2021 11:05:50 CET Stefan Hajnoczi wrote:
> As you are apparently reluctant to change the virtio specs, what about 
> introducing those discussed virtio capabilities either as experimental ones 
> without specs changes, or even just as 9p specific device capabilities for 
> now. I mean those could be revoked on both sides at any time anyway.

I would like to understand the root cause before making changes.

"It's faster when I do X" is useful information but it doesn't
necessarily mean doing X is the solution. The "it's faster when I do X
because Y" part is missing in my mind. Once there is evidence that shows
Y then it will be clearer if X is a good solution, if there's a more
general solution, or if it was just a side-effect.

I'm sorry for frustrating your efforts here. We have discussed a lot of
different ideas and maybe our perspectives are not that far apart
anymore.

Keep going with what you think is best. If I am still skeptical we can
ask someone else to review the patches instead of me so you have a
second opinion.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Virtio-fs] [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
@ 2021-11-10 15:14                                     ` Stefan Hajnoczi
  0 siblings, 0 replies; 97+ messages in thread
From: Stefan Hajnoczi @ 2021-11-10 15:14 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	Jason Wang, Amit Shah, David Hildenbrand, qemu-devel, virtio-fs,
	Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Paolo Bonzini, Marc-André Lureau, Fam Zheng,
	Raphael Norwitz

[-- Attachment #1: Type: text/plain, Size: 1174 bytes --]

On Wed, Nov 10, 2021 at 02:14:43PM +0100, Christian Schoenebeck wrote:
> On Mittwoch, 10. November 2021 11:05:50 CET Stefan Hajnoczi wrote:
> As you are apparently reluctant to change the virtio specs, what about 
> introducing those discussed virtio capabilities either as experimental ones 
> without specs changes, or even just as 9p specific device capabilities for 
> now. I mean those could be revoked on both sides at any time anyway.

I would like to understand the root cause before making changes.

"It's faster when I do X" is useful information but it doesn't
necessarily mean doing X is the solution. The "it's faster when I do X
because Y" part is missing in my mind. Once there is evidence that shows
Y then it will be clearer if X is a good solution, if there's a more
general solution, or if it was just a side-effect.

I'm sorry for frustrating your efforts here. We have discussed a lot of
different ideas and maybe our perspectives are not that far apart
anymore.

Keep going with what you think is best. If I am still skeptical we can
ask someone else to review the patches instead of me so you have a
second opinion.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
  2021-11-10 15:14                                     ` [Virtio-fs] " Stefan Hajnoczi
@ 2021-11-10 15:53                                       ` Christian Schoenebeck
  -1 siblings, 0 replies; 97+ messages in thread
From: Christian Schoenebeck @ 2021-11-10 15:53 UTC (permalink / raw)
  To: qemu-devel
  Cc: Stefan Hajnoczi, Kevin Wolf, Laurent Vivier, qemu-block,
	Michael S. Tsirkin, Jason Wang, Amit Shah, David Hildenbrand,
	Greg Kurz, virtio-fs, Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Paolo Bonzini, Marc-André Lureau, Fam Zheng,
	Raphael Norwitz, Dr. David Alan Gilbert

On Mittwoch, 10. November 2021 16:14:19 CET Stefan Hajnoczi wrote:
> On Wed, Nov 10, 2021 at 02:14:43PM +0100, Christian Schoenebeck wrote:
> > On Mittwoch, 10. November 2021 11:05:50 CET Stefan Hajnoczi wrote:
> > As you are apparently reluctant to change the virtio specs, what about
> > introducing those discussed virtio capabilities either as experimental
> > ones
> > without specs changes, or even just as 9p specific device capabilities for
> > now. I mean those could be revoked on both sides at any time anyway.
> 
> I would like to understand the root cause before making changes.
> 
> "It's faster when I do X" is useful information but it doesn't
> necessarily mean doing X is the solution. The "it's faster when I do X
> because Y" part is missing in my mind. Once there is evidence that shows
> Y then it will be clearer if X is a good solution, if there's a more
> general solution, or if it was just a side-effect.

I think I made it clear that the root cause of the observed performance gain 
with rising transmission size is latency (and also that performance is not the 
only reason for addressing this queue size issue).

Each request roundtrip has a certain minimum latency, the virtio ring alone 
has its latency, plus latency of the controller portion of the file server 
(e.g. permissions, sandbox checks, file IDs) that is executed with *every* 
request, plus latency of dispatching the request handling between threads 
several times back and forth (also for each request).

Therefore when you split large payloads (e.g. reading a large file) into 
n smaller chunks, then that individual latency per request 
accumulates to n times the individual latency, eventually leading to degraded 
transmission speed as those requests are serialized.
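
As a back-of-envelope illustration of that accumulation (the raw bandwidth and
per-request overhead below are assumed round numbers, not measured values):

#include <stdio.h>

/* Toy model of serialized request round trips; all constants here are
 * assumptions for illustration, not benchmark results. */
int main(void)
{
    const double payload_mb = 12 * 1024;   /* 12 GB file                  */
    const double bandwidth  = 4000.0;      /* MB/s, assumed raw data path */
    const double latency_ms = 0.2;         /* assumed per round trip      */
    const double chunk_mb[] = { 0.125, 0.5, 4.0, 128.0 };

    for (size_t i = 0; i < sizeof(chunk_mb) / sizeof(chunk_mb[0]); i++) {
        double n     = payload_mb / chunk_mb[i];              /* round trips */
        double total = payload_mb / bandwidth + n * latency_ms / 1000.0;
        printf("msize %7.3f MB -> ~%6.0f MB/s effective\n",
               chunk_mb[i], payload_mb / total);
    }
    return 0;
}

With such assumed numbers the model reproduces the general trend of the msize
measurements earlier in this thread: small chunks are dominated by the
accumulated per-request latency, large chunks approach the raw bandwidth.
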

> I'm sorry for frustrating your efforts here. We have discussed a lot of
> different ideas and maybe our perspectives are not that far apart
> anymore.
> 
> Keep going with what you think is best. If I am still skeptical we can
> ask someone else to review the patches instead of me so you have a
> second opinion.
> 
> Stefan

Thanks Stefan!

In the meantime I try to address your objections as far as I can. If there is 
more I can do (with reasonable effort) to resolve your doubts, just let me 
know.

Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Virtio-fs] [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
@ 2021-11-10 15:53                                       ` Christian Schoenebeck
  0 siblings, 0 replies; 97+ messages in thread
From: Christian Schoenebeck @ 2021-11-10 15:53 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	Jason Wang, Amit Shah, David Hildenbrand, Raphael Norwitz,
	virtio-fs, Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Marc-André Lureau, Paolo Bonzini, Fam Zheng

On Mittwoch, 10. November 2021 16:14:19 CET Stefan Hajnoczi wrote:
> On Wed, Nov 10, 2021 at 02:14:43PM +0100, Christian Schoenebeck wrote:
> > On Mittwoch, 10. November 2021 11:05:50 CET Stefan Hajnoczi wrote:
> > As you are apparently reluctant to change the virtio specs, what about
> > introducing those discussed virtio capabilities either as experimental
> > ones
> > without specs changes, or even just as 9p specific device capabilities for
> > now. I mean those could be revoked on both sides at any time anyway.
> 
> I would like to understand the root cause before making changes.
> 
> "It's faster when I do X" is useful information but it doesn't
> necessarily mean doing X is the solution. The "it's faster when I do X
> because Y" part is missing in my mind. Once there is evidence that shows
> Y then it will be clearer if X is a good solution, if there's a more
> general solution, or if it was just a side-effect.

I think I made it clear that the root cause of the observed performance gain 
with rising transmission size is latency (and also that performance is not the 
only reason for addressing this queue size issue).

Each request roundtrip has a certain minimum latency, the virtio ring alone 
has its latency, plus latency of the controller portion of the file server 
(e.g. permissions, sandbox checks, file IDs) that is executed with *every* 
request, plus latency of dispatching the request handling between threads 
several times back and forth (also for each request).

Therefore when you split large payloads (e.g. reading a large file) into 
n smaller chunks, then that individual latency per request 
accumulates to n times the individual latency, eventually leading to degraded 
transmission speed as those requests are serialized.

> I'm sorry for frustrating your efforts here. We have discussed a lot of
> different ideas and maybe our perspectives are not that far apart
> anymore.
> 
> Keep going with what you think is best. If I am still skeptical we can
> ask someone else to review the patches instead of me so you have a
> second opinion.
> 
> Stefan

Thanks Stefan!

In the meantime I try to address your objections as far as I can. If there is 
more I can do (with reasonable effort) to resolve your doubts, just let me 
know.

Best regards,
Christian Schoenebeck



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
  2021-11-10 15:53                                       ` [Virtio-fs] " Christian Schoenebeck
@ 2021-11-11 16:31                                         ` Stefan Hajnoczi
  -1 siblings, 0 replies; 97+ messages in thread
From: Stefan Hajnoczi @ 2021-11-11 16:31 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	Jason Wang, Amit Shah, David Hildenbrand, qemu-devel, Greg Kurz,
	virtio-fs, Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Marc-André Lureau, Paolo Bonzini, Fam Zheng,
	Raphael Norwitz, Dr. David Alan Gilbert

[-- Attachment #1: Type: text/plain, Size: 2746 bytes --]

On Wed, Nov 10, 2021 at 04:53:33PM +0100, Christian Schoenebeck wrote:
> On Mittwoch, 10. November 2021 16:14:19 CET Stefan Hajnoczi wrote:
> > On Wed, Nov 10, 2021 at 02:14:43PM +0100, Christian Schoenebeck wrote:
> > > On Mittwoch, 10. November 2021 11:05:50 CET Stefan Hajnoczi wrote:
> > > As you are apparently reluctant to change the virtio specs, what about
> > > introducing those discussed virtio capabilities either as experimental
> > > ones
> > > without specs changes, or even just as 9p specific device capabilities for
> > > now. I mean those could be revoked on both sides at any time anyway.
> > 
> > I would like to understand the root cause before making changes.
> > 
> > "It's faster when I do X" is useful information but it doesn't
> > necessarily mean doing X is the solution. The "it's faster when I do X
> > because Y" part is missing in my mind. Once there is evidence that shows
> > Y then it will be clearer if X is a good solution, if there's a more
> > general solution, or if it was just a side-effect.
> 
> I think I made it clear that the root cause of the observed performance gain 
> with rising transmission size is latency (and also that performance is not the 
> only reason for addressing this queue size issue).
> 
> Each request roundtrip has a certain minimum latency, the virtio ring alone 
> has its latency, plus latency of the controller portion of the file server 
> (e.g. permissions, sandbox checks, file IDs) that is executed with *every* 
> request, plus latency of dispatching the request handling between threads 
> several times back and forth (also for each request).
> 
> Therefore when you split large payloads (e.g. reading a large file) into 
> n smaller chunks, then that individual latency per request 
> accumulates to n times the individual latency, eventually leading to degraded 
> transmission speed as those requests are serialized.

It's easy to increase the blocksize in benchmarks, but real applications
offer less control over the I/O pattern. If latency in the device
implementation (QEMU) is the root cause then reduce the latency to speed
up all applications, even those that cannot send huge requests.

One idea is request merging on the QEMU side. If the application sends
10 sequential read or write requests, coalesce them together before the
main part of request processing begins in the device. Process a single
large request to spread the cost of the file server over the 10
requests. (virtio-blk has request merging to help with the cost of lots
of small qcow2 I/O requests.) The cool thing about this is that the
guest does not need to change its I/O pattern to benefit from the
optimization.
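
A rough sketch of that merging idea in isolation (hypothetical types; file
identity, write payloads and a maximum merged size are ignored; this is not
QEMU code):

#include <stdint.h>
#include <stdlib.h>

/* Hypothetical request type; real 9p requests would also carry a fid,
 * an operation type and data buffers, all ignored here. */
typedef struct Req {
    uint64_t offset;     /* byte offset of the read   */
    uint32_t count;      /* number of bytes requested */
    struct Req *next;    /* queued in arrival order   */
} Req;

/* Merge runs of requests that read contiguous ranges; absorbed nodes
 * are freed. No upper bound on the merged size is enforced here. */
static Req *merge_sequential(Req *head)
{
    Req *cur = head;

    while (cur && cur->next) {
        Req *nxt = cur->next;

        if (nxt->offset == cur->offset + cur->count) {
            cur->count += nxt->count;   /* grow the merged request */
            cur->next = nxt->next;      /* unlink the absorbed one */
            free(nxt);
        } else {
            cur = nxt;
        }
    }
    return head;
}
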

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Virtio-fs] [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
@ 2021-11-11 16:31                                         ` Stefan Hajnoczi
  0 siblings, 0 replies; 97+ messages in thread
From: Stefan Hajnoczi @ 2021-11-11 16:31 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	Jason Wang, Amit Shah, David Hildenbrand, qemu-devel, virtio-fs,
	Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Marc-André Lureau, Paolo Bonzini, Fam Zheng,
	Raphael Norwitz

[-- Attachment #1: Type: text/plain, Size: 2746 bytes --]

On Wed, Nov 10, 2021 at 04:53:33PM +0100, Christian Schoenebeck wrote:
> On Mittwoch, 10. November 2021 16:14:19 CET Stefan Hajnoczi wrote:
> > On Wed, Nov 10, 2021 at 02:14:43PM +0100, Christian Schoenebeck wrote:
> > > On Mittwoch, 10. November 2021 11:05:50 CET Stefan Hajnoczi wrote:
> > > As you are apparently reluctant to change the virtio specs, what about
> > > introducing those discussed virtio capabilities either as experimental
> > > ones
> > > without specs changes, or even just as 9p specific device capabilities for
> > > now. I mean those could be revoked on both sides at any time anyway.
> > 
> > I would like to understand the root cause before making changes.
> > 
> > "It's faster when I do X" is useful information but it doesn't
> > necessarily mean doing X is the solution. The "it's faster when I do X
> > because Y" part is missing in my mind. Once there is evidence that shows
> > Y then it will be clearer if X is a good solution, if there's a more
> > general solution, or if it was just a side-effect.
> 
> I think I made it clear that the root cause of the observed performance gain 
> with rising transmission size is latency (and also that performance is not the 
> only reason for addressing this queue size issue).
> 
> Each request roundtrip has a certain minimum latency, the virtio ring alone 
> has its latency, plus latency of the controller portion of the file server 
> (e.g. permissions, sandbox checks, file IDs) that is executed with *every* 
> request, plus latency of dispatching the request handling between threads 
> several times back and forth (also for each request).
> 
> Therefore when you split large payloads (e.g. reading a large file) into 
> n smaller chunks, then that individual latency per request 
> accumulates to n times the individual latency, eventually leading to degraded 
> transmission speed as those requests are serialized.

It's easy to increase the blocksize in benchmarks, but real applications
offer less control over the I/O pattern. If latency in the device
implementation (QEMU) is the root cause then reduce the latency to speed
up all applications, even those that cannot send huge requests.

One idea is request merging on the QEMU side. If the application sends
10 sequential read or write requests, coalesce them together before the
main part of request processing begins in the device. Process a single
large request to spread the cost of the file server over the 10
requests. (virtio-blk has request merging to help with the cost of lots
of small qcow2 I/O requests.) The cool thing about this is that the
guest does not need to change its I/O pattern to benefit from the
optimization.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
  2021-11-11 16:31                                         ` [Virtio-fs] " Stefan Hajnoczi
@ 2021-11-11 17:54                                           ` Christian Schoenebeck
  -1 siblings, 0 replies; 97+ messages in thread
From: Christian Schoenebeck @ 2021-11-11 17:54 UTC (permalink / raw)
  To: qemu-devel
  Cc: Stefan Hajnoczi, Kevin Wolf, Laurent Vivier, qemu-block,
	Michael S. Tsirkin, Jason Wang, Amit Shah, David Hildenbrand,
	Greg Kurz, virtio-fs, Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Marc-André Lureau, Paolo Bonzini, Fam Zheng,
	Raphael Norwitz, Dr. David Alan Gilbert

On Donnerstag, 11. November 2021 17:31:52 CET Stefan Hajnoczi wrote:
> On Wed, Nov 10, 2021 at 04:53:33PM +0100, Christian Schoenebeck wrote:
> > On Mittwoch, 10. November 2021 16:14:19 CET Stefan Hajnoczi wrote:
> > > On Wed, Nov 10, 2021 at 02:14:43PM +0100, Christian Schoenebeck wrote:
> > > > On Mittwoch, 10. November 2021 11:05:50 CET Stefan Hajnoczi wrote:
> > > > As you are apparently reluctant to change the virtio specs, what
> > > > about
> > > > introducing those discussed virtio capabilities either as experimental
> > > > ones
> > > > without specs changes, or even just as 9p specific device capabilities
> > > > for
> > > > now. I mean those could be revoked on both sides at any time anyway.
> > > 
> > > I would like to understand the root cause before making changes.
> > > 
> > > "It's faster when I do X" is useful information but it doesn't
> > > necessarily mean doing X is the solution. The "it's faster when I do X
> > > because Y" part is missing in my mind. Once there is evidence that shows
> > > Y then it will be clearer if X is a good solution, if there's a more
> > > general solution, or if it was just a side-effect.
> > 
> > I think I made it clear that the root cause of the observed performance
> > gain with rising transmission size is latency (and also that performance
> > is not the only reason for addressing this queue size issue).
> > 
> > Each request roundtrip has a certain minimum latency, the virtio ring
> > alone
> > has its latency, plus latency of the controller portion of the file server
> > (e.g. permissions, sandbox checks, file IDs) that is executed with *every*
> > request, plus latency of dispatching the request handling between threads
> > several times back and forth (also for each request).
> > 
> > Therefore when you split large payloads (e.g. reading a large file) into
> > n smaller chunks, then that individual latency per request
> > accumulates to n times the individual latency, eventually leading to
> > degraded transmission speed as those requests are serialized.
> 
> It's easy to increase the blocksize in benchmarks, but real applications
> offer less control over the I/O pattern. If latency in the device
> implementation (QEMU) is the root cause then reduce the latency to speed
> up all applications, even those that cannot send huge requests.

Which I did, still do, and also mentioned before, e.g.:

8d6cb100731c4d28535adbf2a3c2d1f29be3fef4 9pfs: reduce latency of Twalk
0c4356ba7dafc8ecb5877a42fc0d68d45ccf5951 9pfs: T_readdir latency optimization

Reducing overall latency is a process that is ongoing and will still take a 
very long development time. Not because of me, but because of lack of 
reviewers. And even then, it does not make the effort to support higher 
transmission sizes obsolete.

> One idea is request merging on the QEMU side. If the application sends
> 10 sequential read or write requests, coalesce them together before the
> main part of request processing begins in the device. Process a single
> large request to spread the cost of the file server over the 10
> requests. (virtio-blk has request merging to help with the cost of lots
> of small qcow2 I/O requests.) The cool thing about this is that the
> guest does not need to change its I/O pattern to benefit from the
> optimization.
> 
> Stefan

Ok, don't get me wrong: I appreciate that you are suggesting approaches that 
could improve things. But I could already hand you a huge list of mine. 
The limiting factor here is not the lack of ideas of what could be improved, 
but rather the lack of people helping out actively on 9p side:
https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg06452.html

The situation on kernel side is the same. I already have a huge list of what 
could & should be improved. But there is basically no reviewer for 9p patches 
on Linux kernel side either.

As much as I appreciate suggestions of what could be improved, I would 
appreciate it even more if there was *anybody* actively assisting as well. For the 
time being I have to work the list down in small patch chunks, priority based.

Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
  2021-11-11 17:54                                           ` [Virtio-fs] " Christian Schoenebeck
@ 2021-11-15 11:54                                             ` Stefan Hajnoczi
  -1 siblings, 0 replies; 97+ messages in thread
From: Stefan Hajnoczi @ 2021-11-15 11:54 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	Jason Wang, Amit Shah, David Hildenbrand, qemu-devel, Greg Kurz,
	virtio-fs, Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Paolo Bonzini, Marc-André Lureau, Fam Zheng,
	Raphael Norwitz, Dr. David Alan Gilbert

[-- Attachment #1: Type: text/plain, Size: 5011 bytes --]

On Thu, Nov 11, 2021 at 06:54:03PM +0100, Christian Schoenebeck wrote:
> On Donnerstag, 11. November 2021 17:31:52 CET Stefan Hajnoczi wrote:
> > On Wed, Nov 10, 2021 at 04:53:33PM +0100, Christian Schoenebeck wrote:
> > > On Mittwoch, 10. November 2021 16:14:19 CET Stefan Hajnoczi wrote:
> > > > On Wed, Nov 10, 2021 at 02:14:43PM +0100, Christian Schoenebeck wrote:
> > > > > On Mittwoch, 10. November 2021 11:05:50 CET Stefan Hajnoczi wrote:
> > > > > As you are apparently reluctant for changing the virtio specs, what
> > > > > about
> > > > > introducing those discussed virtio capabalities either as experimental
> > > > > ones
> > > > > without specs changes, or even just as 9p specific device capabilities
> > > > > for
> > > > > now. I mean those could be revoked on both sides at any time anyway.
> > > > 
> > > > I would like to understand the root cause before making changes.
> > > > 
> > > > "It's faster when I do X" is useful information but it doesn't
> > > > necessarily mean doing X is the solution. The "it's faster when I do X
> > > > because Y" part is missing in my mind. Once there is evidence that shows
> > > > Y then it will be clearer if X is a good solution, if there's a more
> > > > general solution, or if it was just a side-effect.
> > > 
> > > I think I made it clear that the root cause of the observed performance
> > > gain with rising transmission size is latency (and also that performance
> > > is not the only reason for addressing this queue size issue).
> > > 
> > > Each request roundtrip has a certain minimum latency, the virtio ring
> > > alone
> > > has its latency, plus latency of the controller portion of the file server
> > > (e.g. permissions, sandbox checks, file IDs) that is executed with *every*
> > > request, plus latency of dispatching the request handling between threads
> > > several times back and forth (also for each request).
> > > 
> > > Therefore when you split large payloads (e.g. reading a large file) into
> > > smaller n amount of chunks, then that individual latency per request
> > > accumulates to n times the individual latency, eventually leading to
> > > degraded transmission speed as those requests are serialized.
> > 
> > It's easy to increase the blocksize in benchmarks, but real applications
> > offer less control over the I/O pattern. If latency in the device
> > implementation (QEMU) is the root cause then reduce the latency to speed
> > up all applications, even those that cannot send huge requests.
> 
> Which I did, still do, and also mentioned before, e.g.:
> 
> 8d6cb100731c4d28535adbf2a3c2d1f29be3fef4 9pfs: reduce latency of Twalk
> 0c4356ba7dafc8ecb5877a42fc0d68d45ccf5951 9pfs: T_readdir latency optimization
> 
> Reducing overall latency is a process that is ongoing and will still take a 
> very long development time. Not because of me, but because of lack of 
> reviewers. And even then, it does not make the effort to support higher 
> transmission sizes obsolete.
> 
> > One idea is request merging on the QEMU side. If the application sends
> > 10 sequential read or write requests, coalesce them together before the
> > main part of request processing begins in the device. Process a single
> > large request to spread the cost of the file server over the 10
> > requests. (virtio-blk has request merging to help with the cost of lots
> > of small qcow2 I/O requests.) The cool thing about this is that the
> > guest does not need to change its I/O pattern to benefit from the
> > optimization.
> > 
> > Stefan
> 
> Ok, don't get me wrong: I appreciate that you are suggesting approaches that 
> could improve things. But I could already hand you over a huge list of mine. 
> The limiting factor here is not the lack of ideas of what could be improved, 
> but rather the lack of people helping out actively on 9p side:
> https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg06452.html
> 
> The situation on kernel side is the same. I already have a huge list of what 
> could & should be improved. But there is basically no reviewer for 9p patches 
> on Linux kernel side either.
> 
> The much I appreciate suggestions of what could be improved, I would 
> appreciate much more if there was *anybody* actively assisting as well. In the 
> time being I have to work the list down in small patch chunks, priority based.

I see request merging as an alternative to this patch series, not as an
additional idea.

My thinking behind this is that request merging is less work than this
patch series and more broadly applicable. It would be easy to merge into
QEMU's virtio-9p device implementation (though I have no idea how easy it
is to implement), it does not require changes across the stack, and it
benefits applications that can't change their I/O pattern to take advantage
of huge requests.
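
A rough sketch of the kind of coalescing meant here, using hypothetical
structures and helper names rather than QEMU's actual virtio-9p code:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical queued-request type; not QEMU's actual virtio-9p structures. */
struct io_req {
    uint64_t offset;
    uint32_t count;
    struct io_req *next;   /* requests in arrival order */
};

/*
 * Fold directly adjacent sequential reads into the head request, so the
 * file server pays its per-request overhead (permission checks, fid
 * lookup, thread hops) only once for the whole run.
 */
static unsigned coalesce_sequential(struct io_req *head)
{
    unsigned merged = 0;

    while (head->next && head->next->offset == head->offset + head->count) {
        struct io_req *victim = head->next;

        head->count += victim->count;
        head->next = victim->next;
        free(victim);
        merged++;
    }
    return merged;
}

int main(void)
{
    struct io_req *c = malloc(sizeof(*c));
    struct io_req *b = malloc(sizeof(*b));
    struct io_req a = { 0, 4096, b };

    *b = (struct io_req){ 4096, 4096, c };
    *c = (struct io_req){ 8192, 4096, NULL };

    printf("merged %u requests into one of %u bytes\n",
           coalesce_sequential(&a), (unsigned)a.count);
    return 0;
}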

There is a risk that request merging won't pan out; it could have worse
performance than submitting huge requests.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
  2021-11-15 11:54                                             ` [Virtio-fs] " Stefan Hajnoczi
@ 2021-11-15 14:32                                               ` Christian Schoenebeck
  -1 siblings, 0 replies; 97+ messages in thread
From: Christian Schoenebeck @ 2021-11-15 14:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Stefan Hajnoczi, Kevin Wolf, Laurent Vivier, qemu-block,
	Michael S. Tsirkin, Jason Wang, Amit Shah, David Hildenbrand,
	Greg Kurz, virtio-fs, Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Paolo Bonzini, Marc-André Lureau, Fam Zheng,
	Raphael Norwitz, Dr. David Alan Gilbert

On Monday, 15 November 2021 12:54:32 CET Stefan Hajnoczi wrote:
> On Thu, Nov 11, 2021 at 06:54:03PM +0100, Christian Schoenebeck wrote:
> > On Donnerstag, 11. November 2021 17:31:52 CET Stefan Hajnoczi wrote:
> > > On Wed, Nov 10, 2021 at 04:53:33PM +0100, Christian Schoenebeck wrote:
> > > > On Mittwoch, 10. November 2021 16:14:19 CET Stefan Hajnoczi wrote:
> > > > > On Wed, Nov 10, 2021 at 02:14:43PM +0100, Christian Schoenebeck 
wrote:
> > > > > > On Mittwoch, 10. November 2021 11:05:50 CET Stefan Hajnoczi wrote:
> > > > > > As you are apparently reluctant for changing the virtio specs,
> > > > > > what
> > > > > > about
> > > > > > introducing those discussed virtio capabalities either as
> > > > > > experimental
> > > > > > ones
> > > > > > without specs changes, or even just as 9p specific device
> > > > > > capabilities
> > > > > > for
> > > > > > now. I mean those could be revoked on both sides at any time
> > > > > > anyway.
> > > > > 
> > > > > I would like to understand the root cause before making changes.
> > > > > 
> > > > > "It's faster when I do X" is useful information but it doesn't
> > > > > necessarily mean doing X is the solution. The "it's faster when I do
> > > > > X
> > > > > because Y" part is missing in my mind. Once there is evidence that
> > > > > shows
> > > > > Y then it will be clearer if X is a good solution, if there's a more
> > > > > general solution, or if it was just a side-effect.
> > > > 
> > > > I think I made it clear that the root cause of the observed
> > > > performance
> > > > gain with rising transmission size is latency (and also that
> > > > performance
> > > > is not the only reason for addressing this queue size issue).
> > > > 
> > > > Each request roundtrip has a certain minimum latency, the virtio ring
> > > > alone
> > > > has its latency, plus latency of the controller portion of the file
> > > > server
> > > > (e.g. permissions, sandbox checks, file IDs) that is executed with
> > > > *every*
> > > > request, plus latency of dispatching the request handling between
> > > > threads
> > > > several times back and forth (also for each request).
> > > > 
> > > > Therefore when you split large payloads (e.g. reading a large file)
> > > > into
> > > > smaller n amount of chunks, then that individual latency per request
> > > > accumulates to n times the individual latency, eventually leading to
> > > > degraded transmission speed as those requests are serialized.
> > > 
> > > It's easy to increase the blocksize in benchmarks, but real applications
> > > offer less control over the I/O pattern. If latency in the device
> > > implementation (QEMU) is the root cause then reduce the latency to speed
> > > up all applications, even those that cannot send huge requests.
> > 
> > Which I did, still do, and also mentioned before, e.g.:
> > 
> > 8d6cb100731c4d28535adbf2a3c2d1f29be3fef4 9pfs: reduce latency of Twalk
> > 0c4356ba7dafc8ecb5877a42fc0d68d45ccf5951 9pfs: T_readdir latency
> > optimization
> > 
> > Reducing overall latency is a process that is ongoing and will still take
> > a
> > very long development time. Not because of me, but because of lack of
> > reviewers. And even then, it does not make the effort to support higher
> > transmission sizes obsolete.
> > 
> > > One idea is request merging on the QEMU side. If the application sends
> > > 10 sequential read or write requests, coalesce them together before the
> > > main part of request processing begins in the device. Process a single
> > > large request to spread the cost of the file server over the 10
> > > requests. (virtio-blk has request merging to help with the cost of lots
> > > of small qcow2 I/O requests.) The cool thing about this is that the
> > > guest does not need to change its I/O pattern to benefit from the
> > > optimization.
> > > 
> > > Stefan
> > 
> > Ok, don't get me wrong: I appreciate that you are suggesting approaches
> > that could improve things. But I could already hand you over a huge list
> > of mine. The limiting factor here is not the lack of ideas of what could
> > be improved, but rather the lack of people helping out actively on 9p
> > side:
> > https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg06452.html
> > 
> > The situation on kernel side is the same. I already have a huge list of
> > what could & should be improved. But there is basically no reviewer for
> > 9p patches on Linux kernel side either.
> > 
> > The much I appreciate suggestions of what could be improved, I would
> > appreciate much more if there was *anybody* actively assisting as well. In
> > the time being I have to work the list down in small patch chunks,
> > priority based.
> I see request merging as an alternative to this patch series, not as an
> additional idea.

It is not an alternative. Like I said before, even if your suggestion solved
the sequential I/O performance issue (without simply moving the problem
somewhere else), which I doubt, it would still not solve the semantic
conflict in virtio's "maximum queue size" terminology: i.e. max. active/
pending messages vs. max. transmission size. Denying that simply means
postponing an appropriate fix for this virtio issue.
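
To illustrate the coupling with some simple arithmetic (assuming 4 KiB pages
and roughly one descriptor per payload page, as the Linux 9p client submits
its buffers today), the single "Queue Size" value bounds both how many
buffers can be in flight and how large one descriptor chain, i.e. one
message, can get:

/* Each line shows how one "Queue Size" value simultaneously limits the
 * number of pending buffers and the maximum size of a single request. */
#include <stdio.h>

int main(void)
{
    const unsigned page_size = 4096;
    const unsigned queue_sizes[] = { 256, 1024, 32768 };

    for (int i = 0; i < 3; i++) {
        unsigned q = queue_sizes[i];
        printf("Queue Size %5u: up to %5u pending buffers, "
               "max. chain %5u descriptors = %3u MiB per message\n",
               q, q, q, q * page_size / (1024 * 1024));
    }
    return 0;
}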

The legitimate concerns you came up with can easily be addressed by two
virtio capabilities, making things clean and officially supported by both
ends; they could also be revoked at any time without breaking things if any
real-life issues actually came up on the virtio level in the future. The
rest is already prepared.
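
Purely as an illustration of what such an opt-in could look like on the
driver side, with hypothetical feature-bit and config-field names (this is
not the actual proposal under discussion):

#include <stdint.h>
#include <stdio.h>

/* Hypothetical names, for illustration only; not defined by the virtio spec. */
#define VIRTIO_F_LARGE_DESC_CHAIN 40u

struct vdev {
    uint64_t device_features;   /* offered by the device */
    uint64_t driver_features;   /* acknowledged by the driver */
    uint32_t max_chain_len;     /* hypothetical config field */
};

/* Use long descriptor chains only if the device explicitly opted in;
 * otherwise stay within the conservative per-ring limit. */
static uint32_t negotiated_chain_limit(struct vdev *d, uint32_t queue_size)
{
    if (d->device_features & (1ULL << VIRTIO_F_LARGE_DESC_CHAIN)) {
        d->driver_features |= 1ULL << VIRTIO_F_LARGE_DESC_CHAIN;
        return d->max_chain_len;
    }
    return queue_size;
}

int main(void)
{
    struct vdev legacy   = { 0, 0, 0 };
    struct vdev opted_in = { 1ULL << VIRTIO_F_LARGE_DESC_CHAIN, 0, 32768 };

    printf("legacy device:   limit %u\n",
           (unsigned)negotiated_chain_limit(&legacy, 1024));
    printf("opted-in device: limit %u\n",
           (unsigned)negotiated_chain_limit(&opted_in, 1024));
    return 0;
}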

> My thoughts behind this is that request merging is less work than this
> patch series and more broadly applicable.

It is definitely not less work. All I still have to do is add the two virtio
capabilities you suggested, either as official virtio ones or as 9p
device-specific ones. The other outstanding tasks on my side are independent
of this overall issue.

> It would be easy to merge (no
> idea how easy it is to implement though) in QEMU's virtio-9p device
> implementation, does not require changes across the stack, and benefits
> applications that can't change their I/O pattern to take advantage of
> huge requests.

And request merging means waiting on every single request to see whether more
sequential requests might come in at some point, i.e. it even increases
latency and worsens the situation on the random I/O side, increases the
probability of a full queue and of the client having to wait too often, piles
up more complex code, and so on.

Your idea of merging sequential requests on the QEMU side already fails at
the initial point: a sequential read is typically initiated by sequential
calls to read() by a guest application thread. However, each read() call must
return some data before the guest app thread is able to call read() again for
the subsequent chunk.
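
For illustration, the typical guest-side pattern looks like this; each read()
must return before the next one can even be issued, so the device never sees
the "next" sequential request early enough to merge it (hypothetical example
program, not taken from any particular application):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Read a file in fixed-size chunks; every iteration blocks on the reply to
 * the previous request before the next request is sent. */
int main(int argc, char **argv)
{
    char buf[128 * 1024];           /* chunk size chosen by the application */
    ssize_t n;
    long long total = 0;
    int fd = open(argc > 1 ? argv[1] : "/etc/hostname", O_RDONLY);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        total += n;     /* next read() starts only after this one returned */
    close(fd);
    printf("read %lld bytes sequentially\n", total);
    return n < 0 ? 1 : 0;
}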

> There is a risk that request merging won't pan out, it could have worse
> performance than submitting huge requests.

That's not a risk; it is not even feasible. Plus, my plan is to improve
performance for various use-case patterns, in particular for both sequential
and random I/O, not just one of them.

A more appropriate solution along the lines of what you suggested would be,
e.g., to extend the Linux 9p client's existing optional fs-cache feature with
an optional read-ahead feature. Again, that is one of many things on my TODO
list, but there is also still a bunch of work to do on fs-cache alone before
I can start on that.

We have discussed this issue for almost 2 months now. I think it is time to
move on. If you are still convinced of your ideas, please send your patches
and benchmark results.

I would appreciate it if you'd let me know whether I should propose the two
virtio capabilities we discussed as official virtio ones, or whether I should
go directly for 9p device-specific ones instead.

Thanks!

Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
  2021-11-15 14:32                                               ` [Virtio-fs] " Christian Schoenebeck
@ 2021-11-16 11:13                                                 ` Stefan Hajnoczi
  -1 siblings, 0 replies; 97+ messages in thread
From: Stefan Hajnoczi @ 2021-11-16 11:13 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Kevin Wolf, Laurent Vivier, qemu-block, Michael S. Tsirkin,
	Jason Wang, Amit Shah, David Hildenbrand, qemu-devel, Greg Kurz,
	virtio-fs, Eric Auger, Hanna Reitz, Gonglei (Arei),
	Gerd Hoffmann, Marc-André Lureau, Paolo Bonzini, Fam Zheng,
	Raphael Norwitz, Dr. David Alan Gilbert

[-- Attachment #1: Type: text/plain, Size: 413 bytes --]

On Mon, Nov 15, 2021 at 03:32:46PM +0100, Christian Schoenebeck wrote:
> I would appreciate if you'd let me know whether I should suggest the discussed 
> two virtio capabilities as official virtio ones, or whether I should directly 
> go for 9p device specific ones instead.

Please propose changes for increasing the maximum descriptor chain
length in the common virtqueue section of the spec.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 97+ messages in thread

end of thread, other threads:[~2021-11-16 11:17 UTC | newest]

Thread overview: 97+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-04 19:38 [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k Christian Schoenebeck
2021-10-04 19:38 ` [Virtio-fs] " Christian Schoenebeck
2021-10-04 19:38 ` [PATCH v2 1/3] virtio: turn VIRTQUEUE_MAX_SIZE into a variable Christian Schoenebeck
2021-10-04 19:38   ` [Virtio-fs] " Christian Schoenebeck
2021-10-05  7:36   ` Greg Kurz
2021-10-05  7:36     ` [Virtio-fs] " Greg Kurz
2021-10-05 12:45   ` Stefan Hajnoczi
2021-10-05 12:45     ` [Virtio-fs] " Stefan Hajnoczi
2021-10-05 13:15     ` Christian Schoenebeck
2021-10-05 13:15       ` [Virtio-fs] " Christian Schoenebeck
2021-10-05 15:10       ` Stefan Hajnoczi
2021-10-05 15:10         ` [Virtio-fs] " Stefan Hajnoczi
2021-10-05 16:32         ` Christian Schoenebeck
2021-10-05 16:32           ` [Virtio-fs] " Christian Schoenebeck
2021-10-06 11:06           ` Stefan Hajnoczi
2021-10-06 11:06             ` [Virtio-fs] " Stefan Hajnoczi
2021-10-06 12:50             ` Christian Schoenebeck
2021-10-06 12:50               ` [Virtio-fs] " Christian Schoenebeck
2021-10-06 14:42               ` Stefan Hajnoczi
2021-10-06 14:42                 ` [Virtio-fs] " Stefan Hajnoczi
2021-10-07 13:09                 ` Christian Schoenebeck
2021-10-07 13:09                   ` [Virtio-fs] " Christian Schoenebeck
2021-10-07 15:18                   ` Stefan Hajnoczi
2021-10-07 15:18                     ` [Virtio-fs] " Stefan Hajnoczi
2021-10-08 14:48                     ` Christian Schoenebeck
2021-10-08 14:48                       ` [Virtio-fs] " Christian Schoenebeck
2021-10-04 19:38 ` [PATCH v2 2/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k Christian Schoenebeck
2021-10-04 19:38   ` [Virtio-fs] " Christian Schoenebeck
2021-10-05  7:16   ` Michael S. Tsirkin
2021-10-05  7:16     ` [Virtio-fs] " Michael S. Tsirkin
2021-10-05  7:35     ` Greg Kurz
2021-10-05  7:35       ` [Virtio-fs] " Greg Kurz
2021-10-05 11:17     ` Christian Schoenebeck
2021-10-05 11:17       ` [Virtio-fs] " Christian Schoenebeck
2021-10-05 11:24       ` Michael S. Tsirkin
2021-10-05 11:24         ` [Virtio-fs] " Michael S. Tsirkin
2021-10-05 12:01         ` Christian Schoenebeck
2021-10-05 12:01           ` [Virtio-fs] " Christian Schoenebeck
2021-10-04 19:38 ` [PATCH v2 3/3] virtio-9p-device: switch to 32k max. transfer size Christian Schoenebeck
2021-10-04 19:38   ` [Virtio-fs] " Christian Schoenebeck
2021-10-05  7:38 ` [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k David Hildenbrand
2021-10-05  7:38   ` [Virtio-fs] " David Hildenbrand
2021-10-05 11:10   ` Christian Schoenebeck
2021-10-05 11:10     ` [Virtio-fs] " Christian Schoenebeck
2021-10-05 11:19     ` Michael S. Tsirkin
2021-10-05 11:19       ` [Virtio-fs] " Michael S. Tsirkin
2021-10-05 11:43       ` Christian Schoenebeck
2021-10-05 11:43         ` [Virtio-fs] " Christian Schoenebeck
2021-10-07  5:23 ` Stefan Hajnoczi
2021-10-07  5:23   ` [Virtio-fs] " Stefan Hajnoczi
2021-10-07 12:51   ` Christian Schoenebeck
2021-10-07 12:51     ` [Virtio-fs] " Christian Schoenebeck
2021-10-07 15:42     ` Stefan Hajnoczi
2021-10-07 15:42       ` [Virtio-fs] " Stefan Hajnoczi
2021-10-08  7:25       ` Greg Kurz
2021-10-08  7:25         ` [Virtio-fs] " Greg Kurz
2021-10-08  8:27         ` Greg Kurz
2021-10-08 14:24         ` Christian Schoenebeck
2021-10-08 14:24           ` [Virtio-fs] " Christian Schoenebeck
2021-10-08 16:08           ` Christian Schoenebeck
2021-10-08 16:08             ` [Virtio-fs] " Christian Schoenebeck
2021-10-21 15:39             ` Christian Schoenebeck
2021-10-21 15:39               ` [Virtio-fs] " Christian Schoenebeck
2021-10-25 10:30               ` Stefan Hajnoczi
2021-10-25 10:30                 ` [Virtio-fs] " Stefan Hajnoczi
2021-10-25 15:03                 ` Christian Schoenebeck
2021-10-25 15:03                   ` [Virtio-fs] " Christian Schoenebeck
2021-10-28  9:00                   ` Stefan Hajnoczi
2021-10-28  9:00                     ` [Virtio-fs] " Stefan Hajnoczi
2021-11-01 20:29                     ` Christian Schoenebeck
2021-11-01 20:29                       ` [Virtio-fs] " Christian Schoenebeck
2021-11-03 11:33                       ` Stefan Hajnoczi
2021-11-03 11:33                         ` [Virtio-fs] " Stefan Hajnoczi
2021-11-04 14:41                         ` Christian Schoenebeck
2021-11-04 14:41                           ` [Virtio-fs] " Christian Schoenebeck
2021-11-09 10:56                           ` Stefan Hajnoczi
2021-11-09 10:56                             ` [Virtio-fs] " Stefan Hajnoczi
2021-11-09 13:09                             ` Christian Schoenebeck
2021-11-09 13:09                               ` [Virtio-fs] " Christian Schoenebeck
2021-11-10 10:05                               ` Stefan Hajnoczi
2021-11-10 10:05                                 ` [Virtio-fs] " Stefan Hajnoczi
2021-11-10 13:14                                 ` Christian Schoenebeck
2021-11-10 13:14                                   ` [Virtio-fs] " Christian Schoenebeck
2021-11-10 15:14                                   ` Stefan Hajnoczi
2021-11-10 15:14                                     ` [Virtio-fs] " Stefan Hajnoczi
2021-11-10 15:53                                     ` Christian Schoenebeck
2021-11-10 15:53                                       ` [Virtio-fs] " Christian Schoenebeck
2021-11-11 16:31                                       ` Stefan Hajnoczi
2021-11-11 16:31                                         ` [Virtio-fs] " Stefan Hajnoczi
2021-11-11 17:54                                         ` Christian Schoenebeck
2021-11-11 17:54                                           ` [Virtio-fs] " Christian Schoenebeck
2021-11-15 11:54                                           ` Stefan Hajnoczi
2021-11-15 11:54                                             ` [Virtio-fs] " Stefan Hajnoczi
2021-11-15 14:32                                             ` Christian Schoenebeck
2021-11-15 14:32                                               ` [Virtio-fs] " Christian Schoenebeck
2021-11-16 11:13                                               ` Stefan Hajnoczi
2021-11-16 11:13                                                 ` [Virtio-fs] " Stefan Hajnoczi
