* [Qemu-devel] [PATCH v3 0/2] virtio-scsi: Optimizing request allocation
From: Fam Zheng @ 2014-09-16  7:20 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paolo Bonzini

v3: Small tweak on "cmd" in 1/2 and "sreq" in 2/2.

Zeroing is relatively expensive since the request structures are big:
VirtQueueElement (>48k!) and the sense buffer (256 bytes) are the two main places to look at.

This visibly reduces the overhead of request handling when testing with the
unmerged "null" driver and virtio-scsi dataplane. Before the change, the issue
is very obvious in perf top:

perf top -G -p `pidof qemu-system-x86_64`
-----------------------------------------
+  16.50%  libc-2.17.so             [.] __memset_sse2
+   2.28%  libc-2.17.so             [.] _int_malloc
+   2.25%  [vdso]                   [.] 0x0000000000000cd1
+   2.02%  [kernel]                 [k] _raw_spin_lock_irqsave
+   1.97%  libpthread-2.17.so       [.] pthread_mutex_lock
+   1.87%  libpthread-2.17.so       [.] pthread_mutex_unlock
+   1.81%  [kernel]                 [k] fget_light
+   1.70%  libc-2.17.so             [.] malloc

After the change, the high __memset_sse2 and _int_malloc entries are gone:

perf top -G -p `pidof qemu-system-x86_64`
-----------------------------------------
+   4.20%  [kernel]                 [k] vcpu_enter_guest
+   3.97%  [kernel]                 [k] vmx_vcpu_run
+   2.63%  [kernel]                 [k] _raw_spin_lock_irqsave
+   1.72%  [kernel]                 [k] native_read_msr_safe
+   1.65%  [kernel]                 [k] __srcu_read_lock
+   1.64%  [kernel]                 [k] _raw_spin_unlock_irqrestore
+   1.57%  [vdso]                   [.] 0x00000000000008d8
+   1.49%  libc-2.17.so             [.] _int_malloc
+   1.29%  libpthread-2.17.so       [.] pthread_mutex_unlock
+   1.26%  [kernel]                 [k] native_write_msr_safe
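
For a rough sense of the cost that disappears here, a standalone micro-benchmark
along the following lines (not part of this series; the buffer size and iteration
count are arbitrary, chosen only to approximate a ~48k VirtQueueElement) times one
full memset per simulated request:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define REQ_SIZE (48 * 1024)   /* roughly the size of a VirtQueueElement */
#define N_REQS   200000        /* arbitrary number of simulated requests */

int main(void)
{
    static uint8_t req[REQ_SIZE];
    unsigned long sink = 0;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N_REQS; i++) {
        memset(req, 0, sizeof(req));       /* the per-request zeroing that g_malloc0 implies */
        req[i % REQ_SIZE] = 1;             /* touch the buffer after zeroing it */
        sink += req[(i * 7) % REQ_SIZE];   /* keep the compiler from eliding the loop */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("%.0f ns per %d-byte memset (sink=%lu)\n", ns / N_REQS, REQ_SIZE, sink);
    return 0;
}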

See the commit message of patch 2 for some fio test data.

Thanks,
Fam


Fam Zheng (2):
  scsi: Optimize scsi_req_alloc
  virtio-scsi: Optimize virtio_scsi_init_req

 hw/scsi/scsi-bus.c     |  8 +++++---
 hw/scsi/virtio-scsi.c  | 24 +++++++++++++++++-------
 include/hw/scsi/scsi.h | 21 ++++++++++++++-------
 3 files changed, 36 insertions(+), 17 deletions(-)

-- 
1.9.3

* [Qemu-devel] [PATCH v3 1/2] scsi: Optimize scsi_req_alloc
From: Fam Zheng @ 2014-09-16  7:20 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paolo Bonzini

Zeroing the sense buffer for each SCSI request is not efficient; we can
leave it uninitialized because sense_len is set to 0.

Move the implicitly zeroed fields to the end of the structure and use a
partial memset.

The fields that are explicitly initialized (by scsi_req_alloc or
scsi_req_new) are moved to the beginning of the structure, before the sense
buffer, so that the memset can skip them.

Also change g_malloc0 to g_slice_alloc.
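
To show the same pattern in isolation, here is a minimal standalone sketch with
a hypothetical FooRequest in place of the real SCSIRequest: fields laid out
before the sense buffer are set explicitly by the caller, the buffer itself is
left uninitialized, and only the tail of the structure is cleared.

#include <glib.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct FooRequest {
    int      refcount;              /* initialized explicitly after allocation */
    uint8_t  sense[256];            /* deliberately left uninitialized */
    uint32_t sense_len;             /* cleared by the partial memset below */
    bool     enqueued;              /* cleared by the partial memset below */
} FooRequest;

/* size may be sizeof(FooRequest) or that of a larger derived request,
 * mirroring reqops->size in the patch below. */
static FooRequest *foo_req_alloc(size_t size)
{
    FooRequest *req;
    const size_t memset_off = offsetof(FooRequest, sense) + sizeof(req->sense);

    req = g_slice_alloc(size);                                 /* no implicit zeroing */
    memset((uint8_t *)req + memset_off, 0, size - memset_off); /* clear only the tail */
    req->refcount = 1;                                         /* leading fields set by hand */
    return req;
}

static void foo_req_free(FooRequest *req, size_t size)
{
    g_slice_free1(size, req);       /* must be given the same size used for allocation */
}

Note that g_slice_free1() must be given the original allocation size, which is
why scsi_req_unref() below passes req->ops->size when freeing.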

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 hw/scsi/scsi-bus.c     |  8 +++++---
 include/hw/scsi/scsi.h | 21 ++++++++++++++-------
 2 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
index 954c607..af293b5 100644
--- a/hw/scsi/scsi-bus.c
+++ b/hw/scsi/scsi-bus.c
@@ -551,8 +551,11 @@ SCSIRequest *scsi_req_alloc(const SCSIReqOps *reqops, SCSIDevice *d,
     SCSIRequest *req;
     SCSIBus *bus = scsi_bus_from_device(d);
     BusState *qbus = BUS(bus);
+    const int memset_off = offsetof(SCSIRequest, sense)
+                           + sizeof(req->sense);
 
-    req = g_malloc0(reqops->size);
+    req = g_slice_alloc(reqops->size);
+    memset((uint8_t *)req + memset_off, 0, reqops->size - memset_off);
     req->refcount = 1;
     req->bus = bus;
     req->dev = d;
@@ -560,7 +563,6 @@ SCSIRequest *scsi_req_alloc(const SCSIReqOps *reqops, SCSIDevice *d,
     req->lun = lun;
     req->hba_private = hba_private;
     req->status = -1;
-    req->sense_len = 0;
     req->ops = reqops;
     object_ref(OBJECT(d));
     object_ref(OBJECT(qbus->parent));
@@ -1603,7 +1605,7 @@ void scsi_req_unref(SCSIRequest *req)
         }
         object_unref(OBJECT(req->dev));
         object_unref(OBJECT(qbus->parent));
-        g_free(req);
+        g_slice_free1(req->ops->size, req);
     }
 }
 
diff --git a/include/hw/scsi/scsi.h b/include/hw/scsi/scsi.h
index 2e3a8f9..6271ad3 100644
--- a/include/hw/scsi/scsi.h
+++ b/include/hw/scsi/scsi.h
@@ -50,17 +50,24 @@ struct SCSIRequest {
     uint32_t          tag;
     uint32_t          lun;
     uint32_t          status;
+    void              *hba_private;
     size_t            resid;
     SCSICommand       cmd;
+
+    /* Note:
+     * - fields before sense are initialized by scsi_req_alloc;
+     * - sense[] is uninitialized;
+     * - fields after sense are memset to 0 by scsi_req_alloc.
+     * */
+
+    uint8_t           sense[SCSI_SENSE_BUF_SIZE];
+    uint32_t          sense_len;
+    bool              enqueued;
+    bool              io_canceled;
+    bool              retry;
+    bool              dma_started;
     BlockDriverAIOCB  *aiocb;
     QEMUSGList        *sg;
-    bool              dma_started;
-    uint8_t sense[SCSI_SENSE_BUF_SIZE];
-    uint32_t sense_len;
-    bool enqueued;
-    bool io_canceled;
-    bool retry;
-    void *hba_private;
     QTAILQ_ENTRY(SCSIRequest) next;
 };
 
-- 
1.9.3

* [Qemu-devel] [PATCH v3 2/2] virtio-scsi: Optimize virtio_scsi_init_req
From: Fam Zheng @ 2014-09-16  7:20 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paolo Bonzini

VirtQueueElement is a very big structure (>48k!). Since it will be
initialized by virtqueue_pop, we can skip the expensive zeroing here.

This saves a few microseconds per request in my test:

[fio-test]      rw         bs         iodepth    jobs       bw         iops       latency
--------------------------------------------------------------------------------------------
Before          read       4k         1          1          110        28269      34
After           read       4k         1          1          131        33745      28

For comparison, virtio-blk gives:

virtio-blk      read       4k         1          1          217        55673      16
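
The table above does not spell out units; assuming bw is fio's MB/s and latency
is in microseconds (consistent with "a few microseconds per request"), the
columns cross-check as follows, since at iodepth=1 with a single job only one
request is in flight at a time:

#include <stdio.h>

int main(void)
{
    const double bs = 4096.0;                       /* 4k block size, in bytes */
    const char  *name[] = { "Before", "After", "virtio-blk" };
    const double iops[] = { 28269, 33745, 55673 };

    for (int i = 0; i < 3; i++) {
        /* bandwidth = iops * block size; latency ~= 1 / iops at queue depth 1 */
        printf("%-10s  bw ~ %3.0f MB/s   latency ~ %2.0f us\n",
               name[i],
               iops[i] * bs / (1024.0 * 1024.0),    /* prints ~110, ~132, ~217 */
               1e6 / iops[i]);                      /* prints ~35, ~30, ~18; reported 34, 28, 16 */
    }
    return 0;
}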

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 hw/scsi/virtio-scsi.c | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index 86aba88..f0d21a3 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -24,12 +24,19 @@
 typedef struct VirtIOSCSIReq {
     VirtIOSCSI *dev;
     VirtQueue *vq;
-    VirtQueueElement elem;
     QEMUSGList qsgl;
+    QEMUIOVector resp_iov;
+
+    /* Note:
+     * - fields before elem are initialized by virtio_scsi_init_req;
+     * - elem is uninitialized at the time of allocation.
+     * - fields after elem are zeroed by virtio_scsi_init_req.
+     * */
+
+    VirtQueueElement elem;
     SCSIRequest *sreq;
     size_t resp_size;
     enum SCSIXferMode mode;
-    QEMUIOVector resp_iov;
     union {
         VirtIOSCSICmdResp     cmd;
         VirtIOSCSICtrlTMFResp tmf;
@@ -68,23 +75,26 @@ static inline SCSIDevice *virtio_scsi_device_find(VirtIOSCSI *s, uint8_t *lun)
 static VirtIOSCSIReq *virtio_scsi_init_req(VirtIOSCSI *s, VirtQueue *vq)
 {
     VirtIOSCSIReq *req;
-    VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(s);
-
-    req = g_malloc0(sizeof(*req) + vs->cdb_size);
+    VirtIOSCSICommon *vs = (VirtIOSCSICommon *)s;
+    const size_t zero_skip = offsetof(VirtIOSCSIReq, elem)
+                             + sizeof(VirtQueueElement);
 
+    req = g_slice_alloc(sizeof(*req) + vs->cdb_size);
     req->vq = vq;
     req->dev = s;
-    req->sreq = NULL;
     qemu_sglist_init(&req->qsgl, DEVICE(s), 8, &address_space_memory);
     qemu_iovec_init(&req->resp_iov, 1);
+    memset((uint8_t *)req + zero_skip, 0, sizeof(*req) - zero_skip);
     return req;
 }
 
 static void virtio_scsi_free_req(VirtIOSCSIReq *req)
 {
+    VirtIOSCSICommon *vs = (VirtIOSCSICommon *)req->dev;
+
     qemu_iovec_destroy(&req->resp_iov);
     qemu_sglist_destroy(&req->qsgl);
-    g_free(req);
+    g_slice_free1(sizeof(*req) + vs->cdb_size, req);
 }
 
 static void virtio_scsi_complete_req(VirtIOSCSIReq *req)
-- 
1.9.3

* Re: [Qemu-devel] [PATCH v3 0/2] virtio-scsi: Optimizing request allocation
From: Paolo Bonzini @ 2014-09-16  8:19 UTC (permalink / raw)
  To: Fam Zheng, qemu-devel

On 16/09/2014 09:20, Fam Zheng wrote:
> v3: Small tweak on "cmd" in 1/2 and "sreq" in 2/2.
> [...]
> See the commit message of patch 2 for some fio test data.

Thanks, applied to scsi-next.

Paolo
