From: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Subject: Re: [PATCH 3/3] xen: optimize xenbus driver for multiple concurrent xenstore accesses
To: Juergen Gross <jgross@suse.com>, linux-kernel@vger.kernel.org, xen-devel@lists.xenproject.org
References: <20170106150544.10836-1-jgross@suse.com> <20170106150544.10836-4-jgross@suse.com>
Date: Mon, 9 Jan 2017 16:17:11 -0500
In-Reply-To: <20170106150544.10836-4-jgross@suse.com>

On 01/06/2017 10:05 AM, Juergen Gross wrote:
> Handling of multiple concurrent Xenstore accesses through the xenbus
> driver, either from the kernel or from user land, is rather lame
> today: xenbus is capable of having only one access active at a time.
>
> Rewrite xenbus to handle multiple requests concurrently by making use
> of the request id of the Xenstore protocol. This requires the
> following changes:
>
> - Instead of blocking inside xb_read() when trying to read data from
>   the xenstore ring buffer, do so only in the main loop of
>   xenbus_thread().
>
> - Instead of doing writes to the xenstore ring buffer in the context
>   of the caller, just queue the request and do the write in the
>   dedicated xenbus thread.
>
> - Instead of just forwarding the request id specified by the caller of
>   xenbus to xenstore, use a xenbus-internal unique request id. This
>   will allow multiple outstanding requests.
>
> - Modify the locking scheme in order to allow multiple requests to be
>   active in parallel.
>
> - Instead of waiting for the reply to a user's xenstore request after
>   writing the request to the xenstore ring buffer, return directly to
>   the caller and do the waiting in the read path.
>
> Additionally, signal handling was optimized by avoiding waking up the
> xenbus thread or sending an event to Xenstore in case the addressed
> entity is known to be running already.
>
> As a result, communication with Xenstore is sped up by a factor of up
> to 5: depending on the request type (read or write) and the amount of
> data transferred, the gain was at least 20% (small reads) and went up
> to a factor of 5 for large writes.
>
> In the end, some more rough edges of xenbus have been smoothed out:
>
> - Handling of memory shortage when reading from the xenstore ring
>   buffer in the xenbus driver was not optimal: it was busy looping and
>   issuing a warning in each iteration.
>
> - In case of xenstore not running in dom0 but in a stubdom, we end up
>   with two xenbus threads running, as the initialization of xenbus in
>   dom0 expecting a local xenstored will be redone later when
>   connecting to the xenstore domain. Up to now this was no problem, as
>   locking would prevent the two xenbus threads from interfering with
>   each other, but it was just a waste of kernel resources.
>
> - An out-of-memory situation while writing to or reading from the
>   xenstore ring buffer will no longer lead to a possible loss of
>   synchronization with xenstore.
>
> - The user read and write parts are now interruptible by signals.
>
> Signed-off-by: Juergen Gross <jgross@suse.com>
> ---
> I'm aware that the changes are quite large. I thought about sending a
> version split into multiple patches, but a lot of lines would have
> been touched by more than one patch. I still have the multiple-patch
> variant lying around - this patch is split into 11 smaller ones. While
> all steps of this larger series are operational, some steps are not
> optimal, as they are even slower than the original version of xenbus.
>
> Nevertheless I can send the large series if there are requests for it.

I will comment only on the xen_comms changes for now, since otherwise I
am afraid it may be difficult to keep track of the conversation.

> diff --git a/drivers/xen/xenbus/xenbus_comms.c b/drivers/xen/xenbus/xenbus_comms.c
> index c21ec02..fa054ca 100644
> --- a/drivers/xen/xenbus/xenbus_comms.c
> +++ b/drivers/xen/xenbus/xenbus_comms.c
> @@ -34,6 +34,7 @@
>
>  #include <linux/wait.h>
>  #include <linux/interrupt.h>
> +#include <linux/kthread.h>
>  #include <linux/sched.h>
>  #include <linux/err.h>
>  #include <xen/xenbus.h>
> @@ -42,11 +43,40 @@
>  #include <xen/page.h>
>  #include "xenbus.h"
>
> +struct xs_thread_state_write {
> +        struct xb_req_data *req;
> +        int idx;
> +        unsigned int used;

"written" or "sent"?

> +};
> +
> +struct xs_thread_state_read {
> +        struct xsd_sockmsg msg;
> +        char *body;
> +        union {
> +                void *alloc;
> +                struct xs_watch_event *watch;
> +        };
> +        bool in_msg;
> +        bool in_hdr;

It may be better to keep track of which state we are in using a bitmap.
Otherwise it is easy to lose track of one or the other.

> +        unsigned int used;

"read" or "received"?

> +};

Both of these are private to process_msg()/process_writes(), so perhaps
they can be declared in those routines' scopes.

> +
> +/* A list of replies. Currently only one will ever be outstanding. */
> +LIST_HEAD(xs_reply_list);
> +
> +/* A list of write requests. */
> +LIST_HEAD(xb_write_list);
> +DECLARE_WAIT_QUEUE_HEAD(xb_waitq);
> +DEFINE_MUTEX(xb_write_mutex);
> +
> +/* Protect xenbus reader thread against save/restore. */
> +DEFINE_MUTEX(xs_response_mutex);
> +
>  static int xenbus_irq;
> +static struct task_struct *xenbus_task;
>
>  static DECLARE_WORK(probe_work, xenbus_probe);
>
> -static DECLARE_WAIT_QUEUE_HEAD(xb_waitq);
>
>  static irqreturn_t wake_waiting(int irq, void *unused)
>  {
> @@ -84,30 +114,31 @@ static const void *get_input_chunk(XENSTORE_RING_IDX cons,
>          return buf + MASK_XENSTORE_IDX(cons);
>  }
>
> +static int xb_data_to_write(void)
> +{
> +        struct xenstore_domain_interface *intf = xen_store_interface;
> +
> +        return (intf->req_prod - intf->req_cons) != XENSTORE_RING_SIZE &&
> +                !list_empty(&xb_write_list);
> +}
> +
>  /**
>   * xb_write - low level write
>   * @data: buffer to send
>   * @len: length of buffer
>   *
> - * Returns 0 on success, error otherwise.
> + * Returns number of bytes written or -err.
>   */
> -int xb_write(const void *data, unsigned len)
> +static int xb_write(const void *data, unsigned int len)
>  {
>          struct xenstore_domain_interface *intf = xen_store_interface;
>          XENSTORE_RING_IDX cons, prod;
> -        int rc;
> +        unsigned int bytes = 0;
>
>          while (len != 0) {
>                  void *dst;
>                  unsigned int avail;
>
> -                rc = wait_event_interruptible(
> -                        xb_waitq,
> -                        (intf->req_prod - intf->req_cons) !=
> -                        XENSTORE_RING_SIZE);
> -                if (rc < 0)
> -                        return rc;
> -
>                  /* Read indexes, then verify. */
>                  cons = intf->req_cons;
>                  prod = intf->req_prod;
> @@ -115,59 +146,57 @@ int xb_write(const void *data, unsigned len)
>                          intf->req_cons = intf->req_prod = 0;
>                          return -EIO;
>                  }
> -
> -                dst = get_output_chunk(cons, prod, intf->req, &avail);
> -                if (avail == 0)
> -                        continue;
> -                if (avail > len)
> -                        avail = len;
> +                if (!xb_data_to_write())
> +                        return bytes;
>
>                  /* Must write data /after/ reading the consumer index. */
>                  virt_mb();
>
> +                dst = get_output_chunk(cons, prod, intf->req, &avail);
> +                if (avail == 0)
> +                        continue;

Should we continue the loop here or return? We are waiting for the
reader to get stuff off the ring.

> +                if (avail > len)
> +                        avail = len;
> +
>                  memcpy(dst, data, avail);
>                  data += avail;
>                  len -= avail;
> +                bytes += avail;
>
>                  /* Other side must not see new producer until data is there. */
>                  virt_wmb();
>                  intf->req_prod += avail;
>
>                  /* Implies mb(): other side will see the updated producer. */
> -                notify_remote_via_evtchn(xen_store_evtchn);
> +                if (prod <= intf->req_cons)
> +                        notify_remote_via_evtchn(xen_store_evtchn);
>          }
>
> -        return 0;
> +        return bytes;
>  }
>
> -int xb_data_to_read(void)
> +static int xb_data_to_read(void)
>  {
>          struct xenstore_domain_interface *intf = xen_store_interface;
>          return (intf->rsp_cons != intf->rsp_prod);
>  }
>
> -int xb_wait_for_data_to_read(void)
> -{
> -        return wait_event_interruptible(xb_waitq, xb_data_to_read());
> -}
> -
> -int xb_read(void *data, unsigned len)
> +static int xb_read(void *data, unsigned int len)
>  {
>          struct xenstore_domain_interface *intf = xen_store_interface;
>          XENSTORE_RING_IDX cons, prod;
> -        int rc;
> +        unsigned int bytes = 0;
>
>          while (len != 0) {
>                  unsigned int avail;
>                  const char *src;
>
> -                rc = xb_wait_for_data_to_read();
> -                if (rc < 0)
> -                        return rc;
> -
>                  /* Read indexes, then verify. */
>                  cons = intf->rsp_cons;
>                  prod = intf->rsp_prod;
> +                if (cons == prod)
> +                        return bytes;
> +
>                  if (!check_indexes(cons, prod)) {
>                          intf->rsp_cons = intf->rsp_prod = 0;
>                          return -EIO;
> @@ -185,17 +214,229 @@ int xb_read(void *data, unsigned len)
>                  memcpy(data, src, avail);
>                  data += avail;
>                  len -= avail;
> +                bytes += avail;
>
>                  /* Other side must not see free space until we've copied out */
>                  virt_mb();
>                  intf->rsp_cons += avail;
>
> -                pr_debug("Finished read of %i bytes (%i to go)\n", avail, len);
> -
>                  /* Implies mb(): other side will see the updated consumer. */
> -                notify_remote_via_evtchn(xen_store_evtchn);
> +                if (intf->rsp_prod - cons >= XENSTORE_RING_SIZE)
> +                        notify_remote_via_evtchn(xen_store_evtchn);
>          }
>
> +        return bytes;
> +}
> +
> +static int process_msg(void)
> +{
> +        static struct xs_thread_state_read state;
> +        struct xb_req_data *req;
> +        int err;
> +        unsigned int len;
> +
> +        if (!state.in_msg) {
> +                state.in_msg = true;
> +                state.in_hdr = true;
> +                state.used = 0;
> +
> +                /*
> +                 * We must disallow save/restore while reading a message.
> +                 * A partial read across s/r leaves us out of sync with
> +                 * xenstored.
> +                 */
> +                mutex_lock(&xs_response_mutex);
> +
> +                if (!xb_data_to_read()) {
> +                        /* We raced with save/restore: pending data 'gone'. */
> +                        mutex_unlock(&xs_response_mutex);
> +                        state.in_msg = false;
> +                        return 0;
> +                }
> +        }
> +
> +        if (state.in_hdr) {
> +                if (state.used != sizeof(state.msg)) {
> +                        err = xb_read((void *)&state.msg + state.used,
> +                                      sizeof(state.msg) - state.used);
> +                        if (err < 0)
> +                                goto out;
> +                        state.used += err;
> +                        if (state.used != sizeof(state.msg))
> +                                return 0;

Would it be possible to do locking at the caller?
I understand that you are trying to hold the lock across multiple
invocations of this function, but it feels somewhat counter-intuitive
and bug-prone. If it's not possible then at least please add a comment
explaining the locking algorithm.

> +                        if (state.msg.len > XENSTORE_PAYLOAD_MAX) {
> +                                err = -EINVAL;
> +                                goto out;
> +                        }
> +                }
> +
> +                len = state.msg.len + 1;
> +                if (state.msg.type == XS_WATCH_EVENT)
> +                        len += sizeof(*state.watch);
> +
> +                state.alloc = kmalloc(len, GFP_NOIO | __GFP_HIGH);

Why can't you kmalloc to state.body only when type != XS_WATCH_EVENT?

> +                if (!state.alloc)
> +                        return -ENOMEM;
> +
> +                if (state.msg.type == XS_WATCH_EVENT)
> +                        state.body = state.watch->body;
> +                else
> +                        state.body = state.alloc;
> +                state.in_hdr = false;
> +                state.used = 0;
> +        }
> +
> +        err = xb_read(state.body + state.used, state.msg.len - state.used);
> +        if (err < 0)
> +                goto out;
> +
> +        state.used += err;
> +        if (state.used != state.msg.len)
> +                return 0;
> +
> +        state.body[state.msg.len] = '\0';
> +
> +        if (state.msg.type == XS_WATCH_EVENT) {
> +                state.watch->len = state.msg.len;
> +                err = xs_watch_msg(state.watch);
> +        } else {
> +                err = -ENOENT;
> +                mutex_lock(&xb_write_mutex);
> +                list_for_each_entry(req, &xs_reply_list, list) {
> +                        if (req->msg.req_id == state.msg.req_id) {
> +                                if (req->state == xb_req_state_wait_reply) {
> +                                        req->msg.type = state.msg.type;
> +                                        req->msg.len = state.msg.len;
> +                                        req->body = state.body;
> +                                        req->state = xb_req_state_got_reply;
> +                                        list_del(&req->list);
> +                                        req->cb(req);
> +                                } else {
> +                                        list_del(&req->list);
> +                                        kfree(req);
> +                                }
> +                                err = 0;
> +                                break;
> +                        }
> +                }
> +                mutex_unlock(&xb_write_mutex);
> +                if (err)
> +                        goto out;
> +        }
> +
> +        mutex_unlock(&xs_response_mutex);
> +
> +        state.in_msg = false;
> +        state.alloc = NULL;
> +        return err;
> +
> + out:
> +        mutex_unlock(&xs_response_mutex);
> +        state.in_msg = false;
> +        kfree(state.alloc);
> +        state.alloc = NULL;
> +        return err;
> +}
> +
> +static int process_writes(void)
> +{
> +        static struct xs_thread_state_write state;
> +        void *base;
> +        unsigned int len;
> +        int err = 0;
> +
> +        if (!xb_data_to_write())
> +                return 0;
> +
> +        mutex_lock(&xb_write_mutex);
> +
> +        if (!state.req) {
> +                state.req = list_first_entry(&xb_write_list,
> +                                             struct xb_req_data, list);
> +                state.idx = -1;
> +                state.used = 0;
> +        }
> +
> +        if (state.req->state == xb_req_state_aborted)
> +                goto out_err;
> +
> +        while (state.idx < state.req->num_vecs) {
> +                if (state.idx < 0) {
> +                        base = &state.req->msg;
> +                        len = sizeof(state.req->msg);
> +                } else {
> +                        base = state.req->vec[state.idx].iov_base;
> +                        len = state.req->vec[state.idx].iov_len;
> +                }
> +                err = xb_write(base + state.used, len - state.used);
> +                if (err < 0)
> +                        goto out_err;
> +                state.used += err;
> +                if (state.used != len)
> +                        goto out;
> +
> +                state.idx++;
> +                state.used = 0;
> +        }
> +
> +        /*
> +         * You would expect the following to be racy, but as the response is
> +         * being read by our thread there is no risk of req being freed
> +         * under our feet.
> +         */

I don't think I understand this (and it's missing a "so" or something
like that between "thread" and "there"). If this is not racy, why are
we doing this under xb_write_mutex?
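For what it's worth, the locking comment requested above might look
something like the sketch below. This is only my reading of the quoted
hunks, not text from the patch; the caller-side queueing step is an
assumption, since it is not visible in this excerpt:

  /*
   * Locking sketch (reviewer's reconstruction, not from the patch):
   *
   *  xb_write_mutex     protects xb_write_list, xs_reply_list and the
   *                     state transitions of struct xb_req_data.
   *  xs_response_mutex  is taken when the first byte of a response is
   *                     read and held until the complete message has
   *                     been consumed, so a save/restore cannot hit a
   *                     partially read reply.
   *
   * Request lifecycle (all list moves under xb_write_mutex):
   *
   *  caller:           queue the request on xb_write_list and wake
   *                    xb_waitq (assumed, not shown in this hunk)
   *  process_writes(): copy the request to the ring, move it to
   *                    xs_reply_list, state = xb_req_state_wait_reply
   *  process_msg():    on a matching req_id fill in the reply, set
   *                    state = xb_req_state_got_reply, list_del() and
   *                    complete via req->cb(), or free a request that
   *                    was aborted while waiting
   *
   * Both process_writes() and process_msg() run only in xenbus_task,
   * so a request parked on xs_reply_list cannot be completed and freed
   * while process_writes() is still referencing it.
   */

If that captures the intent, the in-code comment could point at the
lifecycle description instead of the "you would expect the following to
be racy" wording.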
> +        list_del(&state.req->list);
> +        state.req->state = xb_req_state_wait_reply;
> +        list_add_tail(&state.req->list, &xs_reply_list);
> +        state.req = NULL;
> +
> + out:
> +        mutex_unlock(&xb_write_mutex);
> +
> +        return 0;
> +
> + out_err:
> +        state.req->msg.type = XS_ERROR;
> +        state.req->err = err;

You don't seem to need this for xb_req_state_aborted since you are
freeing state.req. OTOH, why shouldn't aborted requests generate an
error reply as well?

> +        list_del(&state.req->list);
> +        if (state.req->state == xb_req_state_aborted)
> +                kfree(state.req);
> +        else {
> +                state.req->state = xb_req_state_got_reply;
> +                wake_up(&state.req->wq);
> +        }
> +
> +        mutex_unlock(&xb_write_mutex);
> +
> +        state.req = NULL;
> +
> +        return err;
> +}
> +
> +static int xb_thread_work(void)
> +{
> +        return xb_data_to_read() || xb_data_to_write();
> +}
> +
> +static int xenbus_thread(void *unused)
> +{
> +        int err;
> +
> +        while (!kthread_should_stop()) {
> +                if (wait_event_interruptible(xb_waitq, xb_thread_work()))
> +                        continue;
> +
> +                err = process_msg();
> +                if (err == -ENOMEM)
> +                        schedule();
> +                else if (err)
> +                        pr_warn("error %d while reading message\n", err);
> +
> +                err = process_writes();
> +                if (err)
> +                        pr_warn("error %d while writing message\n", err);

Is there a chance that errors are persistent and you then spam the log?

-boris

> +        }
> +
> +        xenbus_task = NULL;
>          return 0;
>  }
>
> @@ -223,6 +464,7 @@ int xb_init_comms(void)
>                  rebind_evtchn_irq(xen_store_evtchn, xenbus_irq);
>          } else {
>                  int err;
> +
>                  err = bind_evtchn_to_irqhandler(xen_store_evtchn, wake_waiting,
>                                                  0, "xenbus", &xb_waitq);
>                  if (err < 0) {
> @@ -231,6 +473,13 @@ int xb_init_comms(void)
>                  }
>
>                  xenbus_irq = err;
> +
> +                if (!xenbus_task) {
> +                        xenbus_task = kthread_run(xenbus_thread, NULL,
> +                                                  "xenbus");
> +                        if (IS_ERR(xenbus_task))
> +                                return PTR_ERR(xenbus_task);
> +                }
>          }
>
>          return 0;
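On the log-spam question above: if an error can persist across
iterations of the loop, one option is to rate-limit the warnings. A
minimal sketch of the same thread loop using pr_warn_ratelimited() (my
variation for illustration, not part of the patch):

  #include <linux/kthread.h>
  #include <linux/printk.h>
  #include <linux/sched.h>

  static int xenbus_thread(void *unused)
  {
          int err;

          while (!kthread_should_stop()) {
                  if (wait_event_interruptible(xb_waitq, xb_thread_work()))
                          continue;

                  err = process_msg();
                  if (err == -ENOMEM)
                          schedule();
                  else if (err)
                          /* bounded output even if the error persists */
                          pr_warn_ratelimited("error %d while reading message\n",
                                              err);

                  err = process_writes();
                  if (err)
                          pr_warn_ratelimited("error %d while writing message\n",
                                              err);
          }

          xenbus_task = NULL;
          return 0;
  }

pr_warn_ratelimited() keeps the diagnostic while limiting the output to
the printk ratelimit defaults, so a persistently failing ring would not
flood the console.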