* [Qemu-devel] [PATCH v2 0/2] virtio-scsi: Optimizing request allocation
@ 2014-09-15 5:23 Fam Zheng
2014-09-15 5:23 ` [Qemu-devel] [PATCH v2 1/2] scsi: Optimize scsi_req_alloc Fam Zheng
2014-09-15 5:23 ` [Qemu-devel] [PATCH v2 2/2] virtio-scsi: Optimize virtio_scsi_init_req Fam Zheng
0 siblings, 2 replies; 7+ messages in thread
From: Fam Zheng @ 2014-09-15 5:23 UTC (permalink / raw)
To: qemu-devel; +Cc: Paolo Bonzini
v2: Slight improvements according to Paolo's comments.
4k -> 48k.
Zeroing is relatively expensive since we have big request structures.
VirtQueueElement (>48k!) and sense_buf (256 bytes) are two points to look at.
This visibly reduces overhead of request handling when testing with the
unmerged "null" driver and virtio-scsi dataplane. Before, the issue is very
obvious with perf top:
perf top -G -p `pidof qemu-system-x86_64`
-----------------------------------------
+ 16.50% libc-2.17.so [.] __memset_sse2
+ 2.28% libc-2.17.so [.] _int_malloc
+ 2.25% [vdso] [.] 0x0000000000000cd1
+ 2.02% [kernel] [k] _raw_spin_lock_irqsave
+ 1.97% libpthread-2.17.so [.] pthread_mutex_lock
+ 1.87% libpthread-2.17.so [.] pthread_mutex_unlock
+ 1.81% [kernel] [k] fget_light
+ 1.70% libc-2.17.so [.] malloc
After, the high __memset_sse2 and _int_malloc is gone:
perf top -G -p `pidof qemu-system-x86_64`
-----------------------------------------
+ 4.20% [kernel] [k] vcpu_enter_guest
+ 3.97% [kernel] [k] vmx_vcpu_run
+ 2.63% [kernel] [k] _raw_spin_lock_irqsave
+ 1.72% [kernel] [k] native_read_msr_safe
+ 1.65% [kernel] [k] __srcu_read_lock
+ 1.64% [kernel] [k] _raw_spin_unlock_irqrestore
+ 1.57% [vdso] [.] 0x00000000000008d8
+ 1.49% libc-2.17.so [.] _int_malloc
+ 1.29% libpthread-2.17.so [.] pthread_mutex_unlock
+ 1.26% [kernel] [k] native_write_msr_safe
See the commit message of patch 2 for some fio test data.
Thanks,
Fam
Fam Zheng (2):
scsi: Optimize scsi_req_alloc
virtio-scsi: Optimize virtio_scsi_init_req
hw/scsi/scsi-bus.c | 8 +++++---
hw/scsi/virtio-scsi.c | 23 +++++++++++++++++------
include/hw/scsi/scsi.h | 21 ++++++++++++++-------
3 files changed, 36 insertions(+), 16 deletions(-)
--
1.9.3
^ permalink raw reply [flat|nested] 7+ messages in thread
* [Qemu-devel] [PATCH v2 1/2] scsi: Optimize scsi_req_alloc
2014-09-15 5:23 [Qemu-devel] [PATCH v2 0/2] virtio-scsi: Optimizing request allocation Fam Zheng
@ 2014-09-15 5:23 ` Fam Zheng
2014-09-15 10:15 ` Paolo Bonzini
2014-09-15 5:23 ` [Qemu-devel] [PATCH v2 2/2] virtio-scsi: Optimize virtio_scsi_init_req Fam Zheng
1 sibling, 1 reply; 7+ messages in thread
From: Fam Zheng @ 2014-09-15 5:23 UTC (permalink / raw)
To: qemu-devel; +Cc: Paolo Bonzini
Zeroing sense buffer for each scsi request is not efficient, we can just
leave it uninitialized because sense_len is set to 0.
Move the implicitly zeroed fields to the end of the structure and use a
partial memset.
The explicitly initialized fields (by scsi_req_alloc or scsi_req_new)
are moved to the beginning of the structure, before sense buffer, to
skip the memset.
Also change g_malloc0 to g_slice_alloc.
Signed-off-by: Fam Zheng <famz@redhat.com>
---
hw/scsi/scsi-bus.c | 8 +++++---
include/hw/scsi/scsi.h | 21 ++++++++++++++-------
2 files changed, 19 insertions(+), 10 deletions(-)
diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
index 954c607..af293b5 100644
--- a/hw/scsi/scsi-bus.c
+++ b/hw/scsi/scsi-bus.c
@@ -551,8 +551,11 @@ SCSIRequest *scsi_req_alloc(const SCSIReqOps *reqops, SCSIDevice *d,
SCSIRequest *req;
SCSIBus *bus = scsi_bus_from_device(d);
BusState *qbus = BUS(bus);
+ const int memset_off = offsetof(SCSIRequest, sense)
+ + sizeof(req->sense);
- req = g_malloc0(reqops->size);
+ req = g_slice_alloc(reqops->size);
+ memset((uint8_t *)req + memset_off, 0, reqops->size - memset_off);
req->refcount = 1;
req->bus = bus;
req->dev = d;
@@ -560,7 +563,6 @@ SCSIRequest *scsi_req_alloc(const SCSIReqOps *reqops, SCSIDevice *d,
req->lun = lun;
req->hba_private = hba_private;
req->status = -1;
- req->sense_len = 0;
req->ops = reqops;
object_ref(OBJECT(d));
object_ref(OBJECT(qbus->parent));
@@ -1603,7 +1605,7 @@ void scsi_req_unref(SCSIRequest *req)
}
object_unref(OBJECT(req->dev));
object_unref(OBJECT(qbus->parent));
- g_free(req);
+ g_slice_free1(req->ops->size, req);
}
}
diff --git a/include/hw/scsi/scsi.h b/include/hw/scsi/scsi.h
index 2e3a8f9..2a7d4e5 100644
--- a/include/hw/scsi/scsi.h
+++ b/include/hw/scsi/scsi.h
@@ -50,17 +50,24 @@ struct SCSIRequest {
uint32_t tag;
uint32_t lun;
uint32_t status;
+ void *hba_private;
size_t resid;
+
+ /* Note:
+ * - fields before sense are initialized by scsi_req_alloc;
+ * - sense[] is uninitialized;
+ * - fields after sense are memset to 0 by scsi_req_alloc.
+ * */
+
+ uint8_t sense[SCSI_SENSE_BUF_SIZE];
+ uint32_t sense_len;
+ bool enqueued;
+ bool io_canceled;
+ bool retry;
+ bool dma_started;
SCSICommand cmd;
BlockDriverAIOCB *aiocb;
QEMUSGList *sg;
- bool dma_started;
- uint8_t sense[SCSI_SENSE_BUF_SIZE];
- uint32_t sense_len;
- bool enqueued;
- bool io_canceled;
- bool retry;
- void *hba_private;
QTAILQ_ENTRY(SCSIRequest) next;
};
--
1.9.3
^ permalink raw reply related [flat|nested] 7+ messages in thread
* [Qemu-devel] [PATCH v2 2/2] virtio-scsi: Optimize virtio_scsi_init_req
2014-09-15 5:23 [Qemu-devel] [PATCH v2 0/2] virtio-scsi: Optimizing request allocation Fam Zheng
2014-09-15 5:23 ` [Qemu-devel] [PATCH v2 1/2] scsi: Optimize scsi_req_alloc Fam Zheng
@ 2014-09-15 5:23 ` Fam Zheng
2014-09-15 10:17 ` Paolo Bonzini
1 sibling, 1 reply; 7+ messages in thread
From: Fam Zheng @ 2014-09-15 5:23 UTC (permalink / raw)
To: qemu-devel; +Cc: Paolo Bonzini
The VirtQueueElement is a very big structure (>48k!), since it will be
initialzed by virtqueue_pop, we can save the expensive zeroing here.
This saves a few microseconds per request in my test:
[fio-test] rw bs iodepth jobs bw iops latency
--------------------------------------------------------------------------------------------
Before read 4k 1 1 110 28269 34
After read 4k 1 1 131 33745 28
Whereas,
virtio-blk read 4k 1 1 217 55673 16
Signed-off-by: Fam Zheng <famz@redhat.com>
---
hw/scsi/virtio-scsi.c | 23 +++++++++++++++++------
1 file changed, 17 insertions(+), 6 deletions(-)
diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index 86aba88..7bf03c4 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -24,12 +24,19 @@
typedef struct VirtIOSCSIReq {
VirtIOSCSI *dev;
VirtQueue *vq;
- VirtQueueElement elem;
QEMUSGList qsgl;
+ QEMUIOVector resp_iov;
+
+ /* Note:
+ * - fields before elem are initialized by virtio_scsi_init_req;
+ * - elem is uninitialized at the time of allocation.
+ * - fields after elem are zeroed by virtio_scsi_init_req.
+ * */
+
+ VirtQueueElement elem;
SCSIRequest *sreq;
size_t resp_size;
enum SCSIXferMode mode;
- QEMUIOVector resp_iov;
union {
VirtIOSCSICmdResp cmd;
VirtIOSCSICtrlTMFResp tmf;
@@ -68,23 +75,27 @@ static inline SCSIDevice *virtio_scsi_device_find(VirtIOSCSI *s, uint8_t *lun)
static VirtIOSCSIReq *virtio_scsi_init_req(VirtIOSCSI *s, VirtQueue *vq)
{
VirtIOSCSIReq *req;
- VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(s);
-
- req = g_malloc0(sizeof(*req) + vs->cdb_size);
+ VirtIOSCSICommon *vs = (VirtIOSCSICommon *)s;
+ const size_t zero_skip = offsetof(VirtIOSCSIReq, elem)
+ + sizeof(VirtQueueElement);
+ req = g_slice_alloc(sizeof(*req) + vs->cdb_size);
req->vq = vq;
req->dev = s;
req->sreq = NULL;
qemu_sglist_init(&req->qsgl, DEVICE(s), 8, &address_space_memory);
qemu_iovec_init(&req->resp_iov, 1);
+ memset((uint8_t *)req + zero_skip, 0, sizeof(*req) - zero_skip);
return req;
}
static void virtio_scsi_free_req(VirtIOSCSIReq *req)
{
+ VirtIOSCSICommon *vs = (VirtIOSCSICommon *)req->dev;
+
qemu_iovec_destroy(&req->resp_iov);
qemu_sglist_destroy(&req->qsgl);
- g_free(req);
+ g_slice_free1(sizeof(*req) + vs->cdb_size, req);
}
static void virtio_scsi_complete_req(VirtIOSCSIReq *req)
--
1.9.3
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [Qemu-devel] [PATCH v2 1/2] scsi: Optimize scsi_req_alloc
2014-09-15 5:23 ` [Qemu-devel] [PATCH v2 1/2] scsi: Optimize scsi_req_alloc Fam Zheng
@ 2014-09-15 10:15 ` Paolo Bonzini
0 siblings, 0 replies; 7+ messages in thread
From: Paolo Bonzini @ 2014-09-15 10:15 UTC (permalink / raw)
To: Fam Zheng, qemu-devel
Il 15/09/2014 07:23, Fam Zheng ha scritto:
> + uint8_t sense[SCSI_SENSE_BUF_SIZE];
> + uint32_t sense_len;
> + bool enqueued;
> + bool io_canceled;
> + bool retry;
> + bool dma_started;
> SCSICommand cmd;
> BlockDriverAIOCB *aiocb;
> QEMUSGList *sg;
> - bool dma_started;
> - uint8_t sense[SCSI_SENSE_BUF_SIZE];
> - uint32_t sense_len;
> - bool enqueued;
> - bool io_canceled;
> - bool retry;
> - void *hba_private;
> QTAILQ_ENTRY(SCSIRequest) next;
Similar to v1, cmd is filled in by scsi_req_new and can be moved to the
uninitialized area.
Paolo
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [Qemu-devel] [PATCH v2 2/2] virtio-scsi: Optimize virtio_scsi_init_req
2014-09-15 5:23 ` [Qemu-devel] [PATCH v2 2/2] virtio-scsi: Optimize virtio_scsi_init_req Fam Zheng
@ 2014-09-15 10:17 ` Paolo Bonzini
2014-09-16 7:16 ` Fam Zheng
0 siblings, 1 reply; 7+ messages in thread
From: Paolo Bonzini @ 2014-09-15 10:17 UTC (permalink / raw)
To: Fam Zheng, qemu-devel
Il 15/09/2014 07:23, Fam Zheng ha scritto:
> SCSIRequest *sreq;
> size_t resp_size;
> enum SCSIXferMode mode;
> - QEMUIOVector resp_iov;
> union {
> VirtIOSCSICmdResp cmd;
> VirtIOSCSICtrlTMFResp tmf;
> @@ -68,23 +75,27 @@ static inline SCSIDevice *virtio_scsi_device_find(VirtIOSCSI *s, uint8_t *lun)
> static VirtIOSCSIReq *virtio_scsi_init_req(VirtIOSCSI *s, VirtQueue *vq)
> {
> VirtIOSCSIReq *req;
> - VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(s);
> -
> - req = g_malloc0(sizeof(*req) + vs->cdb_size);
> + VirtIOSCSICommon *vs = (VirtIOSCSICommon *)s;
> + const size_t zero_skip = offsetof(VirtIOSCSIReq, elem)
> + + sizeof(VirtQueueElement);
>
> + req = g_slice_alloc(sizeof(*req) + vs->cdb_size);
Looks good, but why do you need to zero the union? You only need to
zero sreq, resp_size and mode, don't you (and at this point, memset
becomes superfluous)?
Paolo
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [Qemu-devel] [PATCH v2 2/2] virtio-scsi: Optimize virtio_scsi_init_req
2014-09-15 10:17 ` Paolo Bonzini
@ 2014-09-16 7:16 ` Fam Zheng
2014-09-16 8:15 ` Paolo Bonzini
0 siblings, 1 reply; 7+ messages in thread
From: Fam Zheng @ 2014-09-16 7:16 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: qemu-devel
On Mon, 09/15 12:17, Paolo Bonzini wrote:
> Il 15/09/2014 07:23, Fam Zheng ha scritto:
> > SCSIRequest *sreq;
> > size_t resp_size;
> > enum SCSIXferMode mode;
> > - QEMUIOVector resp_iov;
> > union {
> > VirtIOSCSICmdResp cmd;
> > VirtIOSCSICtrlTMFResp tmf;
> > @@ -68,23 +75,27 @@ static inline SCSIDevice *virtio_scsi_device_find(VirtIOSCSI *s, uint8_t *lun)
> > static VirtIOSCSIReq *virtio_scsi_init_req(VirtIOSCSI *s, VirtQueue *vq)
> > {
> > VirtIOSCSIReq *req;
> > - VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(s);
> > -
> > - req = g_malloc0(sizeof(*req) + vs->cdb_size);
> > + VirtIOSCSICommon *vs = (VirtIOSCSICommon *)s;
> > + const size_t zero_skip = offsetof(VirtIOSCSIReq, elem)
> > + + sizeof(VirtQueueElement);
> >
> > + req = g_slice_alloc(sizeof(*req) + vs->cdb_size);
>
> Looks good, but why do you need to zero the union? You only need to
> zero sreq, resp_size and mode, don't you (and at this point, memset
> becomes superfluous)?
>
The structures in unions are not zeroed by caller, also leaving them breaks
virtio-scsi in my test.
FWIW, I will remove the "req->sreq = NULL;" two lines below in v3. At this
point tuning these small fields are subtle optimization compared to the arrays,
I say let's just simply keep the memset so that adding more fields in the
future are also safe.
Fam
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [Qemu-devel] [PATCH v2 2/2] virtio-scsi: Optimize virtio_scsi_init_req
2014-09-16 7:16 ` Fam Zheng
@ 2014-09-16 8:15 ` Paolo Bonzini
0 siblings, 0 replies; 7+ messages in thread
From: Paolo Bonzini @ 2014-09-16 8:15 UTC (permalink / raw)
To: Fam Zheng; +Cc: qemu-devel
Il 16/09/2014 09:16, Fam Zheng ha scritto:
> On Mon, 09/15 12:17, Paolo Bonzini wrote:
>> Il 15/09/2014 07:23, Fam Zheng ha scritto:
>>> SCSIRequest *sreq;
>>> size_t resp_size;
>>> enum SCSIXferMode mode;
>>> - QEMUIOVector resp_iov;
>>> union {
>>> VirtIOSCSICmdResp cmd;
>>> VirtIOSCSICtrlTMFResp tmf;
>>> @@ -68,23 +75,27 @@ static inline SCSIDevice *virtio_scsi_device_find(VirtIOSCSI *s, uint8_t *lun)
>>> static VirtIOSCSIReq *virtio_scsi_init_req(VirtIOSCSI *s, VirtQueue *vq)
>>> {
>>> VirtIOSCSIReq *req;
>>> - VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(s);
>>> -
>>> - req = g_malloc0(sizeof(*req) + vs->cdb_size);
>>> + VirtIOSCSICommon *vs = (VirtIOSCSICommon *)s;
>>> + const size_t zero_skip = offsetof(VirtIOSCSIReq, elem)
>>> + + sizeof(VirtQueueElement);
>>>
>>> + req = g_slice_alloc(sizeof(*req) + vs->cdb_size);
>>
>> Looks good, but why do you need to zero the union? You only need to
>> zero sreq, resp_size and mode, don't you (and at this point, memset
>> becomes superfluous)?
>>
>
> The structures in unions are not zeroed by caller, also leaving them breaks
> virtio-scsi in my test.
>
> FWIW, I will remove the "req->sreq = NULL;" two lines below in v3. At this
> point tuning these small fields are subtle optimization compared to the arrays,
> I say let's just simply keep the memset so that adding more fields in the
> future are also safe.
Perhaps the response fields have to be zeroed? The request shouldn't
need it. It can be done separately though---the VirtQueueElement is the
big one that we have to fix.
Paolo
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2014-09-16 8:16 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-15 5:23 [Qemu-devel] [PATCH v2 0/2] virtio-scsi: Optimizing request allocation Fam Zheng
2014-09-15 5:23 ` [Qemu-devel] [PATCH v2 1/2] scsi: Optimize scsi_req_alloc Fam Zheng
2014-09-15 10:15 ` Paolo Bonzini
2014-09-15 5:23 ` [Qemu-devel] [PATCH v2 2/2] virtio-scsi: Optimize virtio_scsi_init_req Fam Zheng
2014-09-15 10:17 ` Paolo Bonzini
2014-09-16 7:16 ` Fam Zheng
2014-09-16 8:15 ` Paolo Bonzini
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.