* [Qemu-devel] Re: [PATCH 2/3] NBD library: add aio-compatible read/write function
       [not found] ` <1298033729-17945-2-git-send-email-nick@bytemark.co.uk>
@ 2011-02-21 12:37   ` Kevin Wolf
  2011-02-21 20:10     ` Stefan Hajnoczi
  0 siblings, 1 reply; 12+ messages in thread
From: Kevin Wolf @ 2011-02-21 12:37 UTC (permalink / raw)
  To: Nick Thomas; +Cc: stefanha, qemu-devel

Am 18.02.2011 13:55, schrieb Nick Thomas:
> Signed-off-by: Nicholas Thomas <nick@bytemark.co.uk>
> ---
>  nbd.c |   51 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  nbd.h |    2 ++
>  2 files changed, 53 insertions(+), 0 deletions(-)
> 
> diff --git a/nbd.c b/nbd.c
> index abe0ecb..83d3342 100644
> --- a/nbd.c
> +++ b/nbd.c
> @@ -107,6 +107,57 @@ size_t nbd_wr_sync(int fd, void *buffer, size_t size, bool do_read)
>      return offset;
>  }
>  
> +int nbd_wr_aio(int sockfd, struct iovec *iov, size_t len,  off_t offset,
> +               bool do_read)
> +{
> +    struct msghdr msg;
> +    int ret, diff;
> +
> +    memset(&msg, 0, sizeof(msg));
> +    msg.msg_iov = iov;
> +    msg.msg_iovlen = 1;
> +
> +    len += offset;
> +
> +    while (iov->iov_len < len) {
> +        len -= iov->iov_len;
> +
> +        iov++;
> +        msg.msg_iovlen++;
> +    }

Is there a reason why you don't use a QEMUIOVector here? It already has
all the information, like the number of entries. Throwing this
information away only to recalculate it here seems a bit pointless.
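
For illustration only, a minimal sketch (not part of the posted patch, and
the exact prototype is just an assumption) of how this could look if the
function took a QEMUIOVector that already covers exactly the bytes to
transfer:

    int nbd_wr_aio(int sockfd, QEMUIOVector *qiov, bool do_read)
    {
        struct msghdr msg;
        int ret;

        memset(&msg, 0, sizeof(msg));
        msg.msg_iov = qiov->iov;      /* entry count travels with the qiov */
        msg.msg_iovlen = qiov->niov;  /* no need to rescan the iovec array */

        if (do_read) {
            ret = recvmsg(sockfd, &msg, 0);
        } else {
            ret = sendmsg(sockfd, &msg, 0);
        }
        return ret;    /* EINTR/EAGAIN handling omitted here */
    }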

> +
> +    diff = iov->iov_len - len;
> +    iov->iov_len -= diff;
> +
> +    while (msg.msg_iov->iov_len <= offset) {
> +        offset -= msg.msg_iov->iov_len;
> +
> +        msg.msg_iov++;
> +        msg.msg_iovlen--;
> +    }
> +
> +    msg.msg_iov->iov_base = (char *) msg.msg_iov->iov_base + offset;
> +    msg.msg_iov->iov_len -= offset;

You could use qemu_iovec_copy to create a qiov that covers exactly the
part that you want to use.
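
A hypothetical caller, assuming the existing qemu_iovec_copy(dst, src, skip,
size) helper and the sketched prototype above, could then be:

    QEMUIOVector part;

    qemu_iovec_init(&part, acb->qiov->niov);
    /* copy only the iovec entries covering [offset, offset + len) */
    qemu_iovec_copy(&part, acb->qiov, offset, len);

    ret = nbd_wr_aio(s->sock, &part, do_read);

    qemu_iovec_destroy(&part);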

> +
> +retry:
> +    if (do_read) {
> +        ret = recvmsg(sockfd, &msg, 0);
> +    } else {
> +        ret = sendmsg(sockfd, &msg, 0);
> +    }
> +
> +    /* recoverable error */
> +    if (ret == -1 && (errno == EAGAIN || errno == EINTR)) {
> +        goto retry;
> +    }

Make this a do...while loop, no reason for goto here.
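
I.e. something like:

    do {
        if (do_read) {
            ret = recvmsg(sockfd, &msg, 0);
        } else {
            ret = sendmsg(sockfd, &msg, 0);
        }
    } while (ret == -1 && (errno == EAGAIN || errno == EINTR));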

> +
> +    msg.msg_iov->iov_base = (char *) msg.msg_iov->iov_base - offset;
> +    msg.msg_iov->iov_len += offset;
> +
> +    iov->iov_len += diff;
> +    return ret;

This wouldn't be necessary with a copied qiov either (but you'd need to
destroy the copy).

Kevin


* [Qemu-devel] Re: [PATCH 1/3] NBD library: whitespace changes
       [not found] <1298033729-17945-1-git-send-email-nick@bytemark.co.uk>
       [not found] ` <1298033729-17945-2-git-send-email-nick@bytemark.co.uk>
@ 2011-02-21 12:37 ` Kevin Wolf
       [not found] ` <1298033729-17945-3-git-send-email-nick@bytemark.co.uk>
  2 siblings, 0 replies; 12+ messages in thread
From: Kevin Wolf @ 2011-02-21 12:37 UTC (permalink / raw)
  To: Nick Thomas; +Cc: stefanha, qemu-devel

Am 18.02.2011 13:55, schrieb Nick Thomas:
> ---
>  nbd.c |  835 +++++++++++++++++++++++++++++++++--------------------------------
>  1 files changed, 418 insertions(+), 417 deletions(-)

Patch 1 and 3 are lacking a Signed-off-by.

Kevin


* [Qemu-devel] Re: [PATCH 3/3] block/nbd: Make the NBD block device use the AIO interface
       [not found] ` <1298033729-17945-3-git-send-email-nick@bytemark.co.uk>
@ 2011-02-21 12:59   ` Kevin Wolf
  2011-02-21 16:31     ` Nicholas Thomas
  2011-02-21 20:07   ` Stefan Hajnoczi
  1 sibling, 1 reply; 12+ messages in thread
From: Kevin Wolf @ 2011-02-21 12:59 UTC (permalink / raw)
  To: Nick Thomas; +Cc: stefanha, qemu-devel

Am 18.02.2011 13:55, schrieb Nick Thomas:
> This preserves the previous behaviour where the NBD server is
> unavailable or goes away during guest execution, but switches the
> NBD backend to present the AIO interface instead of the sync IO
> interface.
> 
> We also split write requests into 1 MiB blocks (minus request size).
> This is a hard limit in the NBD servers (including qemu-nbd), but
> never seemed to come up with the previous backend code.
> ---
>  block/nbd.c |  555 ++++++++++++++++++++++++++++++++++++++++++++++++++---------
>  1 files changed, 470 insertions(+), 85 deletions(-)
> 
> diff --git a/block/nbd.c b/block/nbd.c
> index c8dc763..e7b1f7e 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -1,11 +1,12 @@
>  /*
> - * QEMU Block driver for  NBD
> + * QEMU Block driver for  NBD - asynchronous IO
>   *
>   * Copyright (C) 2008 Bull S.A.S.
>   *     Author: Laurent Vivier <Laurent.Vivier@bull.net>
>   *
>   * Some parts:
>   *    Copyright (C) 2007 Anthony Liguori <anthony@codemonkey.ws>
> + *    Copyright (C) 2011 Nick Thomas <nick@bytemark.co.uk>
>   *
>   * Permission is hereby granted, free of charge, to any person obtaining a copy
>   * of this software and associated documentation files (the "Software"), to deal
> @@ -27,66 +28,135 @@
>   */
>  
>  #include "qemu-common.h"
> +#include "qemu_socket.h"
>  #include "nbd.h"
>  #include "module.h"
>  
>  #include <sys/types.h>
>  #include <unistd.h>
>  
> -#define EN_OPTSTR ":exportname="
> +#define EN_OPTSTR      ":exportname="
> +#define SECTOR_SIZE    512
> +
> +/* 1MiB minus request header size */
> +#define MAX_NBD_WRITE      ((1024*1024) - (4 + 4 + 8 + 8 + 4))

-EMAGIC :-)

I think we can add a uint32_t magic to struct nbd_request, make it
packed, and then replace this magic "4 + 4 + 8 + 8 + 4" thing with
"sizeof(struct nbd_request)" everywhere.

> +
> +/* #define DEBUG_NBD */
> +
> +#if defined(DEBUG_NBD)
> +#define logout(fmt, ...) \
> +                fprintf(stderr, "nbd\t%-24s" fmt, __func__, ##__VA_ARGS__)
> +#else
> +#define logout(fmt, ...) ((void)0)
> +#endif
> +
> +
> +typedef struct NBDAIOCB NBDAIOCB;
> +
> +typedef struct AIOReq {
> +    NBDAIOCB *aiocb;
> +    off_t iov_offset; /* Where on the iov does this req start? */
> +    off_t offset;     /* Starting point of the read */
> +
> +    size_t data_len;
> +    uint8_t flags;
> +    uint64_t handle;
> +
> +    QLIST_ENTRY(AIOReq) outstanding_aio_siblings;
> +    QLIST_ENTRY(AIOReq) aioreq_siblings;
> +} AIOReq;
> +
>  
>  typedef struct BDRVNBDState {
>      int sock;
>      off_t size;
>      size_t blocksize;
> +
> +    /* Filled in by nbd_config. Store host_spec because DNS may change  */
> +    bool tcp_conn;      /* True, we use TCP. False, UNIX domain sockets */

Should it be an enum then?
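
For example (hypothetical naming):

    typedef enum {
        NBD_CONN_TCP,
        NBD_CONN_UNIX,
    } NBDConnType;

    NBDConnType conn_type;  /* in BDRVNBDState, instead of bool tcp_conn */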

> +    char *export_name;  /* An NBD server may export several devices     */
> +    char *host_spec;    /* Path to socket (UNIX) or hostname/IP (TCP)   */
> +    uint16_t tcp_port;

Is this part really related to AIO? If not, it should be a patch of its own.

> +
> +    /* We use these for asynchronous I/O */
> +    uint64_t aioreq_seq_num;
> +    QLIST_HEAD(outstanding_aio_head, AIOReq) outstanding_aio_head;
>  } BDRVNBDState;
>  
> -static int nbd_open(BlockDriverState *bs, const char* filename, int flags)
> +enum AIOCBState {
> +    AIOCB_WRITE_UDATA,
> +    AIOCB_READ_UDATA,
> +};
> +
> +struct NBDAIOCB {
> +    BlockDriverAIOCB common;
> +
> +    QEMUIOVector *qiov;
> +
> +    int64_t sector_num;
> +    int nb_sectors;
> +
> +    int ret;
> +    enum AIOCBState aiocb_type;
> +
> +    QEMUBH *bh;
> +    void (*aio_done_func)(NBDAIOCB *);
> +
> +    int canceled;

bool?

> +
> +    QLIST_HEAD(aioreq_head, AIOReq) aioreq_head;
> +};
> +
> +static inline int free_aio_req(BDRVNBDState *s, AIOReq *aio_req)
>  {
> -    BDRVNBDState *s = bs->opaque;
> -    uint32_t nbdflags;
> +    NBDAIOCB *acb = aio_req->aiocb;
> +    QLIST_REMOVE(aio_req, outstanding_aio_siblings);
> +    QLIST_REMOVE(aio_req, aioreq_siblings);
> +    qemu_free(aio_req);
>  
> +    return !QLIST_EMPTY(&acb->aioreq_head);
> +}
> +
> +static int nbd_config(BDRVNBDState *s, const char* filename, int flags)
> +{
>      char *file;
> -    char *name;
> -    const char *host;
> +    char *export_name;
> +    const char *host_spec;
>      const char *unixpath;
> -    int sock;
> -    off_t size;
> -    size_t blocksize;
> -    int ret;
>      int err = -EINVAL;
>  
>      file = qemu_strdup(filename);
>  
> -    name = strstr(file, EN_OPTSTR);
> -    if (name) {
> -        if (name[strlen(EN_OPTSTR)] == 0) {
> +    export_name = strstr(file, EN_OPTSTR);
> +    if (export_name) {
> +        if (export_name[strlen(EN_OPTSTR)] == 0) {
>              goto out;
>          }
> -        name[0] = 0;
> -        name += strlen(EN_OPTSTR);
> +        export_name[0] = 0; /* truncate 'file' */
> +        export_name += strlen(EN_OPTSTR);
> +        s->export_name = qemu_strdup(export_name);
>      }
>  
> -    if (!strstart(file, "nbd:", &host)) {
> +    /* extract the host_spec - fail if it's not nbd:* */
> +    if (!strstart(file, "nbd:", &host_spec)) {
>          goto out;
>      }
>  
> -    if (strstart(host, "unix:", &unixpath)) {
> -
> -        if (unixpath[0] != '/') {
> +    /* are we a UNIX or TCP socket? */
> +    if (strstart(host_spec, "unix:", &unixpath)) {
> +        if (unixpath[0] != '/') { /* We demand  an absolute path*/
>              goto out;
>          }
> -
> -        sock = unix_socket_outgoing(unixpath);
> -
> +        s->tcp_conn = false;
> +        s->host_spec = qemu_strdup(unixpath);
>      } else {
> +        /* We should have an <IPv4 address>:<port> string to split up */
>          uint16_t port = NBD_DEFAULT_PORT;
>          char *p, *r;
>          char hostname[128];
>  
> -        pstrcpy(hostname, 128, host);
> -
> -        p = strchr(hostname, ':');
> +        pstrcpy(hostname, 128, host_spec);
> +        p = strchr(hostname, ':'); /* FIXME: IPv6 */
>          if (p != NULL) {
>              *p = '\0';
>              p++;
> @@ -96,121 +166,436 @@ static int nbd_open(BlockDriverState *bs, const char* filename, int flags)
>                  goto out;
>              }
>          }
> +        s->tcp_conn = true;
> +        s->host_spec = qemu_strdup(hostname);
> +        s->tcp_port = port;
> +    }
> +
> +    err = 0;
> +
> +out:
> +    qemu_free(file);
> +    if (err != 0) {
> +        if (s->export_name != NULL) {
> +            qemu_free(s->export_name);
> +        }
> +        if (s->host_spec != NULL) {
> +            qemu_free(s->host_spec);
> +        }

free (and qemu_free) works just fine with NULL. It's defined to do
nothing in this case.
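
So the error path could shrink to:

    out:
        qemu_free(file);
        if (err != 0) {
            qemu_free(s->export_name);  /* safe even if still NULL */
            qemu_free(s->host_spec);
        }
        return err;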

> +    }
> +    return err;
> +}
> +
> +static void aio_read_response(void *opaque)
> +{
> +    BDRVNBDState *s = opaque;
> +    struct nbd_reply reply;
> +
> +    AIOReq *aio_req = NULL;
> +    NBDAIOCB *acb;
> +    int rest;
> +
> +    if (QLIST_EMPTY(&s->outstanding_aio_head)) {
> +        return;
> +    }
> +
> +    /* read the header */
> +    if (nbd_receive_reply(s->sock, &reply) == -1) {
> +        logout("Failed to read response from socket\n");
> +        /* Having failed to read the reply header, we can't know which
> +         * aio_req this corresponds to - so we can't signal a failure.
> +         */
> +        return;
> +    }
> +
> +    /* find the right aio_req from the outstanding_aio list */
> +    QLIST_FOREACH(aio_req, &s->outstanding_aio_head, outstanding_aio_siblings) {
> +        if (aio_req->handle == reply.handle) {
> +            break;
> +        }
> +    }
> +
> +    if (!aio_req) {
> +        logout("cannot find aio_req for handle %lu\n", reply.handle);
> +        return;
> +    }
> +
> +    acb = aio_req->aiocb;
> +
> +    if (acb->aiocb_type == AIOCB_READ_UDATA) {
> +        off_t offset = 0;
> +        int ret = 0;
> +        size_t total = aio_req->data_len;
> +
> +        while (offset < total) {
> +            ret = nbd_wr_aio(s->sock, acb->qiov->iov, total - offset, offset,
> +                             true);
> +
> +            if (ret == -1) {
> +                logout("Error reading from NBD server: %i (%s)\n",
> +                       errno, strerror(errno));

I think you need to call the AIO callback with ret = -errno here.
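
One possible shape for that error path (an untested sketch, reusing the
helpers introduced by this patch):

    if (ret == -1) {
        acb->ret = -errno;
        logout("Error reading from NBD server: %i (%s)\n",
               errno, strerror(errno));
        if (!free_aio_req(s, aio_req)) {
            acb->aio_done_func(acb);  /* hands acb->ret to the AIO callback */
        }
        return;
    }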

> +                return;
> +            }
> +
> +            offset += ret;
> +        }
> +    }
>  
> -        sock = tcp_socket_outgoing(hostname, port);
> +    if (reply.error != 0) {
> +        acb->ret = -EIO;
> +        logout("NBD request resulted in error %i\n", reply.error);
>      }
>  
> +    rest = free_aio_req(s, aio_req);
> +    if (!rest) {
> +        acb->aio_done_func(acb);
> +    }
> +
> +    return;
> +}
> +
> +static int aio_flush_request(void *opaque)
> +{
> +    BDRVNBDState *s = opaque;
> +
> +    return !QLIST_EMPTY(&s->outstanding_aio_head);
> +}
> +
> +/*
> + * Connect to the NBD server specified in the state object
> + */
> +static int nbd_establish_connection(BlockDriverState *bs)
> +{
> +    BDRVNBDState *s = bs->opaque;
> +    int sock;
> +    int ret;
> +    off_t size;
> +    size_t blocksize;
> +    uint32_t nbdflags;
> +
> +    if (s->tcp_conn == true) {
> +        sock = tcp_socket_outgoing(s->host_spec, s->tcp_port);
> +    } else {
> +        sock = unix_socket_outgoing(s->host_spec);
> +    }
> +
> +    /* Failed to establish connection */
>      if (sock == -1) {
> -        err = -errno;
> -        goto out;
> +        logout("Failed to establish connection to NBD server\n");
> +        return -errno;
>      }
>  
> -    ret = nbd_receive_negotiate(sock, name, &nbdflags, &size, &blocksize);
> +    /* NBD handshake */
> +    ret = nbd_receive_negotiate(sock, s->export_name, &nbdflags, &size,
> +                                &blocksize);
>      if (ret == -1) {
> -        err = -errno;
> -        goto out;
> +        logout("Failed to negotiate with the NBD server\n");
> +        closesocket(sock);
> +        return -errno;
>      }
>  
> +    /* Now that we're connected, set the socket to be non-blocking */
> +    socket_set_nonblock(sock);
> +
>      s->sock = sock;
>      s->size = size;
>      s->blocksize = blocksize;
> -    err = 0;
>  
> -out:
> -    qemu_free(file);
> -    return err;
> +    /* Response handler. This is called when there is data to read */
> +    qemu_aio_set_fd_handler(sock, aio_read_response, NULL, aio_flush_request,
> +                            NULL, s);
> +    logout("Established connection with NBD server\n");
> +    return 0;
>  }
>  
> -static int nbd_read(BlockDriverState *bs, int64_t sector_num,
> -                    uint8_t *buf, int nb_sectors)
> +static void nbd_teardown_connection(BlockDriverState *bs)
>  {
> +    /* Send the final packet to the NBD server and close the socket */
>      BDRVNBDState *s = bs->opaque;
>      struct nbd_request request;
> -    struct nbd_reply reply;
>  
> -    request.type = NBD_CMD_READ;
> +    request.type = NBD_CMD_DISC;
>      request.handle = (uint64_t)(intptr_t)bs;
> -    request.from = sector_num * 512;;
> -    request.len = nb_sectors * 512;
> +    request.from = 0;
> +    request.len = 0;
> +    nbd_send_request(s->sock, &request);
>  
> -    if (nbd_send_request(s->sock, &request) == -1)
> -        return -errno;
> +    qemu_aio_set_fd_handler(s->sock, NULL, NULL, NULL, NULL, NULL);
> +    closesocket(s->sock);
> +    logout("Connection to NBD server closed\n");
> +    return;
> +}
>  
> -    if (nbd_receive_reply(s->sock, &reply) == -1)
> -        return -errno;
> +static int nbd_open(BlockDriverState *bs, const char* filename, int flags)
> +{
> +    BDRVNBDState *s = bs->opaque;
> +    int result;
>  
> -    if (reply.error !=0)
> -        return -reply.error;
> +    /* Pop the config into our state object. Exit if invalid. */
> +    result = nbd_config(s, filename, flags);
>  
> -    if (reply.handle != request.handle)
> -        return -EIO;
> +    if (result != 0) {
> +        return result;
> +    }
>  
> -    if (nbd_wr_sync(s->sock, buf, request.len, 1) != request.len)
> -        return -EIO;
> +    QLIST_INIT(&s->outstanding_aio_head);
> +
> +    /* establish TCP connection, return error if it fails
> +     * TODO: Configurable retry-until-timeout behaviour.
> +     */
> +    result = nbd_establish_connection(bs);
> +    if (result != 0) {
> +        return result;
> +    }
>  
>      return 0;
>  }
>  
> -static int nbd_write(BlockDriverState *bs, int64_t sector_num,
> -                     const uint8_t *buf, int nb_sectors)
> +static void nbd_close(BlockDriverState *bs)
>  {
> +    nbd_teardown_connection(bs);
>      BDRVNBDState *s = bs->opaque;
> +
> +    if (s->export_name != NULL) {
> +        qemu_free(s->export_name);
> +    }
> +    if (s->host_spec != NULL) {
> +        qemu_free(s->host_spec);
> +    }
> +
> +    return;
> +}
> +
> +static int add_aio_request(BDRVNBDState *s, AIOReq *aio_req, QEMUIOVector *qiov,
> +                           enum AIOCBState aiocb_type)
> +{
>      struct nbd_request request;
> -    struct nbd_reply reply;
>  
> -    request.type = NBD_CMD_WRITE;
> -    request.handle = (uint64_t)(intptr_t)bs;
> -    request.from = sector_num * 512;;
> -    request.len = nb_sectors * 512;
> +    request.from = aio_req->offset;
> +    request.len = aio_req->data_len;
> +    request.handle = aio_req->handle;
> +
> +    if (aiocb_type == AIOCB_READ_UDATA) {
> +        request.type = NBD_CMD_READ;
> +    } else {
> +        request.type = NBD_CMD_WRITE;
> +    }
>  
> -    if (nbd_send_request(s->sock, &request) == -1)
> +    /* Write the request to the socket. Header first. */
> +    if (nbd_send_request(s->sock, &request) == -1) {
> +        /* TODO: retry handling. This leads to -EIO and request cancellation */
> +        logout("writing request header to server failed\n");
>          return -errno;
> +    }
>  
> -    if (nbd_wr_sync(s->sock, (uint8_t*)buf, request.len, 0) != request.len)
> -        return -EIO;
> +    /* If this is a write, send the data too */
> +    if (aiocb_type == AIOCB_WRITE_UDATA) {
> +        int ret = 0;
> +        off_t offset = 0;
> +        size_t total = aio_req->data_len;
> +
> +        while (offset < total) {
> +            ret = nbd_wr_aio(s->sock, qiov->iov, total - offset,
> +                             offset + aio_req->iov_offset, false);
> +            if (ret == -1) {
> +                logout("Error writing request data to NBD server: %i (%s)\n",
> +                       errno, strerror(errno));
> +                return -EIO;

-errno

Or maybe nbd_wr_aio should already return -errno so that you can return
ret here.
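
I.e. inside nbd_wr_aio (sketch):

    if (ret == -1) {
        return -errno;  /* callers can then propagate the value unchanged */
    }
    return ret;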

> +            }
>  
> -    if (nbd_receive_reply(s->sock, &reply) == -1)
> -        return -errno;
> +            offset += ret;
> +        }
> +    }
>  
> -    if (reply.error !=0)
> -        return -reply.error;
> +    return 0;
> +}
> +
> +static inline AIOReq *alloc_aio_req(BDRVNBDState *s, NBDAIOCB *acb,
> +                                    size_t data_len,
> +                                    off_t offset,
> +                                    off_t iov_offset)
> +{
> +    AIOReq *aio_req;
>  
> -    if (reply.handle != request.handle)
> +    aio_req = qemu_malloc(sizeof(*aio_req));
> +    aio_req->aiocb = acb;
> +    aio_req->iov_offset = iov_offset;
> +    aio_req->offset = offset;
> +    aio_req->data_len = data_len;
> +    aio_req->handle = s->aioreq_seq_num++; /* FIXME: Trivially guessable */
> +
> +    QLIST_INSERT_HEAD(&s->outstanding_aio_head, aio_req,
> +                      outstanding_aio_siblings);
> +    QLIST_INSERT_HEAD(&acb->aioreq_head, aio_req, aioreq_siblings);
> +
> +    return aio_req;
> +}
> +
> +static void nbd_finish_aiocb(NBDAIOCB *acb)
> +{
> +    if (!acb->canceled) {
> +        acb->common.cb(acb->common.opaque, acb->ret);
> +    }
> +    qemu_aio_release(acb);
> +}
> +
> +
> +static void nbd_aio_cancel(BlockDriverAIOCB *blockacb)
> +{
> +    NBDAIOCB *acb = (NBDAIOCB *)blockacb;
> +
> +    /*
> +     * We cannot cancel the requests which are already sent to
> +     * the servers, so we just complete the request with -EIO here.
> +     */
> +    acb->common.cb(acb->common.opaque, -EIO);
> +    acb->canceled = 1;
> +}

I think you need to check for acb->canceled before you write to the
associated buffer when receiving the reply for a read request. The
buffer might not exist any more after the request is cancelled.



> +
> +static AIOPool nbd_aio_pool = {
> +    .aiocb_size = sizeof(NBDAIOCB),
> +    .cancel = nbd_aio_cancel,
> +};
> +
> +static NBDAIOCB *nbd_aio_setup(BlockDriverState *bs, QEMUIOVector *qiov,
> +                                   int64_t sector_num, int nb_sectors,
> +                                   BlockDriverCompletionFunc *cb, void *opaque)
> +{
> +    NBDAIOCB *acb;
> +
> +    acb = qemu_aio_get(&nbd_aio_pool, bs, cb, opaque);
> +
> +    acb->qiov = qiov;
> +
> +    acb->sector_num = sector_num;
> +    acb->nb_sectors = nb_sectors;
> +
> +    acb->aio_done_func = NULL;
> +    acb->canceled = 0;
> +    acb->bh = NULL;
> +    acb->ret = 0;
> +    QLIST_INIT(&acb->aioreq_head);
> +    return acb;
> +}
> +
> +static int nbd_schedule_bh(QEMUBHFunc *cb, NBDAIOCB *acb)
> +{
> +    if (acb->bh) {
> +        logout("bug: %d %d\n", acb->aiocb_type, acb->aiocb_type);
>          return -EIO;
> +    }
> +
> +    acb->bh = qemu_bh_new(cb, acb);
> +    if (!acb->bh) {
> +        logout("oom: %d %d\n", acb->aiocb_type, acb->aiocb_type);
> +        return -EIO;
> +    }
> +
> +    qemu_bh_schedule(acb->bh);
>  
>      return 0;
>  }
>  
> -static void nbd_close(BlockDriverState *bs)
> +/*
> + * Send I/O requests to the server.
> + *
> + * This function sends requests to the server, links the requests to
> + * the outstanding_list in BDRVNBDState, and exits without waiting for
> + * the response.  The responses are received in the `aio_read_response'
> + * function which is called from the main loop as a fd handler.
> + * If this is a write request and it's >1MB, split it into multiple AIOReqs
> + */
> +static void nbd_readv_writev_bh_cb(void *p)
>  {
> -    BDRVNBDState *s = bs->opaque;
> -    struct nbd_request request;
> +    NBDAIOCB *acb = p;
> +    int ret = 0;
>  
> -    request.type = NBD_CMD_DISC;
> -    request.handle = (uint64_t)(intptr_t)bs;
> -    request.from = 0;
> -    request.len = 0;
> -    nbd_send_request(s->sock, &request);
> +    size_t len, done = 0;
> +    size_t total = acb->nb_sectors * SECTOR_SIZE;
> +
> +    /* Where the read/write starts from */
> +    size_t offset = acb->sector_num * SECTOR_SIZE;
> +    BDRVNBDState *s = acb->common.bs->opaque;
> +
> +    AIOReq *aio_req;
>  
> -    close(s->sock);
> +    qemu_bh_delete(acb->bh);
> +    acb->bh = NULL;
> +
> +    while (done != total) {
> +        len = (total - done);
> +
> +        /* Split write requests into 1MiB segments */
> +        if(acb->aiocb_type == AIOCB_WRITE_UDATA && len > MAX_NBD_WRITE) {
> +          len = MAX_NBD_WRITE;
> +        }
> +
> +        aio_req = alloc_aio_req(s, acb, len, offset + done, done);
> +        ret = add_aio_request(s, aio_req, acb->qiov, acb->aiocb_type);
> +
> +        if (ret < 0) {
> +            free_aio_req(s, aio_req);
> +            acb->ret = -EIO;
> +            goto out;
> +        }
> +
> +        done += len;
> +    }
> +out:
> +    if (QLIST_EMPTY(&acb->aioreq_head)) {
> +        nbd_finish_aiocb(acb);
> +    }
>  }
>  
> +static BlockDriverAIOCB *nbd_aio_readv(BlockDriverState *bs,
> +        int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
> +        BlockDriverCompletionFunc *cb, void *opaque)
> +{
> +    NBDAIOCB *acb;
> +    int i;
> +
> +    acb = nbd_aio_setup(bs, qiov, sector_num, nb_sectors, cb, opaque);
> +    acb->aiocb_type = AIOCB_READ_UDATA;
> +    acb->aio_done_func = nbd_finish_aiocb;
> +
> +    for (i = 0; i < qiov->niov; i++) {
> +        memset(qiov->iov[i].iov_base, 0, qiov->iov[i].iov_len);
> +    }

qemu_iovec_memset?

What is this even for? Aren't these zeros overwritten anyway?

> +
> +    nbd_schedule_bh(nbd_readv_writev_bh_cb, acb);
> +    return &acb->common;
> +}
> +
> +static BlockDriverAIOCB *nbd_aio_writev(BlockDriverState *bs,
> +        int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
> +        BlockDriverCompletionFunc *cb, void *opaque)
> +{
> +    NBDAIOCB *acb;
> +
> +    acb = nbd_aio_setup(bs, qiov, sector_num, nb_sectors, cb, opaque);
> +    acb->aiocb_type = AIOCB_WRITE_UDATA;
> +    acb->aio_done_func = nbd_finish_aiocb;
> +
> +    nbd_schedule_bh(nbd_readv_writev_bh_cb, acb);
> +    return &acb->common;
> +}
> +
> +
>  static int64_t nbd_getlength(BlockDriverState *bs)
>  {
>      BDRVNBDState *s = bs->opaque;
> -
>      return s->size;
>  }
>  
>  static BlockDriver bdrv_nbd = {
> -    .format_name	= "nbd",
> -    .instance_size	= sizeof(BDRVNBDState),
> -    .bdrv_file_open	= nbd_open,
> -    .bdrv_read		= nbd_read,
> -    .bdrv_write		= nbd_write,
> -    .bdrv_close		= nbd_close,
> -    .bdrv_getlength	= nbd_getlength,
> -    .protocol_name	= "nbd",
> +    .format_name     = "nbd",
> +    .instance_size   = sizeof(BDRVNBDState),
> +    .bdrv_file_open  = nbd_open,
> +    .bdrv_aio_readv  = nbd_aio_readv,
> +    .bdrv_aio_writev = nbd_aio_writev,
> +    .bdrv_close      = nbd_close,
> +    .bdrv_getlength  = nbd_getlength,
> +    .protocol_name   = "nbd"
>  };
>  
>  static void bdrv_nbd_init(void)

Kevin


* [Qemu-devel] Re: [PATCH 3/3] block/nbd: Make the NBD block device use the AIO interface
  2011-02-21 12:59   ` [Qemu-devel] Re: [PATCH 3/3] block/nbd: Make the NBD block device use the AIO interface Kevin Wolf
@ 2011-02-21 16:31     ` Nicholas Thomas
  2011-02-21 16:48       ` Kevin Wolf
  0 siblings, 1 reply; 12+ messages in thread
From: Nicholas Thomas @ 2011-02-21 16:31 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: stefanha, qemu-devel

Hi again,

Thanks for looking through the patches. I'm just going through and
making the suggested changes now. I've also got qemu-nbd and block/nbd.c
working over IPv6 :) - hopefully I'll be able to provide patches in a
couple of days. Just a few questions about some of the changes...

Canceled requests: 
> > +
> > +
> > +static void nbd_aio_cancel(BlockDriverAIOCB *blockacb)
> > +{
> > +    NBDAIOCB *acb = (NBDAIOCB *)blockacb;
> > +
> > +    /*
> > +     * We cannot cancel the requests which are already sent to
> > +     * the servers, so we just complete the request with -EIO here.
> > +     */
> > +    acb->common.cb(acb->common.opaque, -EIO);
> > +    acb->canceled = 1;
> > +}
> 
> I think you need to check for acb->canceled before you write to the
> associated buffer when receiving the reply for a read request. The
> buffer might not exist any more after the request is cancelled.

I "borrowed" this code from block/sheepdog.c (along with a fair few
other bits ;) ) - which doesn't seem to do any special checking for
cancelled write requests. So if there is a potential SIGSEGV here, I
guess sheepdog is also vulnerable.

Presumably, in aio_read_response, I'd need to allocate a buffer of the
appropriate size, read the data from the socket, then deallocate the
buffer? Or would fsetpos(fd, ftell(fd)+data_len) be sufficient?

qemu-io doesn't seem to have any provisions for testing this particular
code path - can you think of any tests I could run to ensure
correctness? I've no idea under what situations an IO request might be
cancelled.


> > +static BlockDriverAIOCB *nbd_aio_readv(BlockDriverState *bs,
> > +        int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
> > +        BlockDriverCompletionFunc *cb, void *opaque)
> > +{
> > [...]
> > +    for (i = 0; i < qiov->niov; i++) {
> > +        memset(qiov->iov[i].iov_base, 0, qiov->iov[i].iov_len);
> > +    }
> 
> qemu_iovec_memset?
> 
> What is this even for? Aren't these zeros overwritten anyway?

Again, present in sheepdog - but it does seem to work fine without it.
I'll remove it from NBD.


* [Qemu-devel] Re: [PATCH 3/3] block/nbd: Make the NBD block device use the AIO interface
  2011-02-21 16:31     ` Nicholas Thomas
@ 2011-02-21 16:48       ` Kevin Wolf
  2011-02-22 16:06         ` MORITA Kazutaka
  0 siblings, 1 reply; 12+ messages in thread
From: Kevin Wolf @ 2011-02-21 16:48 UTC (permalink / raw)
  To: Nicholas Thomas; +Cc: stefanha, qemu-devel

Am 21.02.2011 17:31, schrieb Nicholas Thomas:
> Hi again,
> 
> Thanks for looking through the patches. I'm just going through and
> making the suggested changes now. I've also got qemu-nbd and block/nbd.c
> working over IPv6 :) - hopefully I'll be able to provide patches in a
> couple of days. Just a few questions about some of the changes...
> 
> Canceled requests: 
>>> +
>>> +
>>> +static void nbd_aio_cancel(BlockDriverAIOCB *blockacb)
>>> +{
>>> +    NBDAIOCB *acb = (NBDAIOCB *)blockacb;
>>> +
>>> +    /*
>>> +     * We cannot cancel the requests which are already sent to
>>> +     * the servers, so we just complete the request with -EIO here.
>>> +     */
>>> +    acb->common.cb(acb->common.opaque, -EIO);
>>> +    acb->canceled = 1;
>>> +}
>>
>> I think you need to check for acb->canceled before you write to the
>> associated buffer when receiving the reply for a read request. The
>> buffer might not exist any more after the request is cancelled.
> 
> I "borrowed" this code from block/sheepdog.c (along with a fair few
> other bits ;) ) - which doesn't seem to do any special checking for
> cancelled write requests. So if there is a potential SIGSEGV here, I
> guess sheepdog is also vulnerable.

Right, now that you mention it, I seem to remember this from Sheepdog. I
think I had a discussion with Stefan and he convinced me that we could
get away with it in Sheepdog because of some condition that Sheepdog
meets. Not sure any more what that condition was and if it applies to NBD.

Was it that Sheepdog has a bounce buffer for all requests?

> Presumably, in aio_read_response, I'd need to allocate a buffer of the
> appropriate size, read the data from the socket, then deallocate the
> buffer? Or would fsetpos(fd, ftell(fd)+data_len) be sufficient?

I don't think that fseek works with sockets, so you'd probably need to
read into a temporary buffer, yes.
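
For example, in aio_read_response (an untested sketch, not from the posted
patch; error checking omitted):

    if (acb->aiocb_type == AIOCB_READ_UDATA && acb->canceled) {
        /* The guest buffer may already be gone, so drain the payload into
         * a scratch buffer and discard it instead of touching acb->qiov. */
        uint8_t *scratch = qemu_malloc(aio_req->data_len);
        nbd_wr_sync(s->sock, scratch, aio_req->data_len, 1);
        qemu_free(scratch);
    }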

> qemu-io doesn't seem to have any provisions for testing this particular
> code path - can you think of any tests I could run to ensure
> correctness? I've no idea under what situations an IO request might be
> cancelled.

aio_cancel is hard to implement and hard to test, and I would be
surprised if any format more complex than raw got it completely right...

>>> +static BlockDriverAIOCB *nbd_aio_readv(BlockDriverState *bs,
>>> +        int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
>>> +        BlockDriverCompletionFunc *cb, void *opaque)
>>> +{
>>> [...]
>>> +    for (i = 0; i < qiov->niov; i++) {
>>> +        memset(qiov->iov[i].iov_base, 0, qiov->iov[i].iov_len);
>>> +    }
>>
>> qemu_iovec_memset?
>>
>> What is this even for? Aren't these zeros overwritten anyway?
> 
> Again, present in sheepdog - but it does seem to work fine without it.
> I'll remove it from NBD.

Maybe Sheepdog reads only partially from the server if blocks are
unallocated or something.

Kevin


* [Qemu-devel] Re: [PATCH 3/3] block/nbd: Make the NBD block device use the AIO interface
       [not found] ` <1298033729-17945-3-git-send-email-nick@bytemark.co.uk>
  2011-02-21 12:59   ` [Qemu-devel] Re: [PATCH 3/3] block/nbd: Make the NBD block device use the AIO interface Kevin Wolf
@ 2011-02-21 20:07   ` Stefan Hajnoczi
  2011-02-22 11:48     ` Nicholas Thomas
  1 sibling, 1 reply; 12+ messages in thread
From: Stefan Hajnoczi @ 2011-02-21 20:07 UTC (permalink / raw)
  To: Nick Thomas; +Cc: kwolf, qemu-devel

On Fri, Feb 18, 2011 at 12:55:29PM +0000, Nick Thomas wrote:
> +static inline AIOReq *alloc_aio_req(BDRVNBDState *s, NBDAIOCB *acb,
> +                                    size_t data_len,
> +                                    off_t offset,
> +                                    off_t iov_offset)
> +{
> +    AIOReq *aio_req;
>
> -    if (reply.handle != request.handle)
> +    aio_req = qemu_malloc(sizeof(*aio_req));
> +    aio_req->aiocb = acb;
> +    aio_req->iov_offset = iov_offset;
> +    aio_req->offset = offset;
> +    aio_req->data_len = data_len;
> +    aio_req->handle = s->aioreq_seq_num++; /* FIXME: Trivially guessable */

Does it matter at all that this number is guessable?  I think we can
drop the FIXME comment.

> +/*
> + * Send I/O requests to the server.
> + *
> + * This function sends requests to the server, links the requests to
> + * the outstanding_list in BDRVNBDState, and exits without waiting for
> + * the response.  The responses are received in the `aio_read_response'
> + * function which is called from the main loop as a fd handler.
> + * If this is a write request and it's >1MB, split it into multiple AIOReqs
> + */
> +static void nbd_readv_writev_bh_cb(void *p)
>  {
> -    BDRVNBDState *s = bs->opaque;
> -    struct nbd_request request;
> +    NBDAIOCB *acb = p;
> +    int ret = 0;
>
> -    request.type = NBD_CMD_DISC;
> -    request.handle = (uint64_t)(intptr_t)bs;
> -    request.from = 0;
> -    request.len = 0;
> -    nbd_send_request(s->sock, &request);
> +    size_t len, done = 0;
> +    size_t total = acb->nb_sectors * SECTOR_SIZE;
> +
> +    /* Where the read/write starts from */
> +    size_t offset = acb->sector_num * SECTOR_SIZE;

On 32-bit hosts size_t is only 32 bits wide.  It's not suitable for
storing the byte offset into the device.  Please use off_t.
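
I.e.:

    off_t total  = (off_t)acb->nb_sectors * SECTOR_SIZE;
    off_t offset = (off_t)acb->sector_num * SECTOR_SIZE;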

> +    BDRVNBDState *s = acb->common.bs->opaque;

I don't follow this.  You have no guarantee that acb->common.bs->opaque
is the BDRVNBDState *.

> +
> +    AIOReq *aio_req;
>
> -    close(s->sock);
> +    qemu_bh_delete(acb->bh);
> +    acb->bh = NULL;
> +
> +    while (done != total) {
> +        len = (total - done);
> +
> +        /* Split write requests into 1MiB segments */
> +        if(acb->aiocb_type == AIOCB_WRITE_UDATA && len > MAX_NBD_WRITE) {
> +          len = MAX_NBD_WRITE;
> +        }

We need to be more conservative than 1 MB because qemu-nbd.c checks the
following:

    if (request.len + NBD_REPLY_SIZE > data_size) {
        LOG("len (%u) is larger than max len (%u)",
            request.len + NBD_REPLY_SIZE, data_size);
        errno = EINVAL;
        return -1;
    }

and NBD_REPLY_SIZE is 16.

Stefan


* [Qemu-devel] Re: [PATCH 2/3] NBD library: add aio-compatible read/write function
  2011-02-21 12:37   ` [Qemu-devel] Re: [PATCH 2/3] NBD library: add aio-compatible read/write function Kevin Wolf
@ 2011-02-21 20:10     ` Stefan Hajnoczi
  2011-02-22 15:26       ` Nicholas Thomas
  0 siblings, 1 reply; 12+ messages in thread
From: Stefan Hajnoczi @ 2011-02-21 20:10 UTC (permalink / raw)
  To: Nick Thomas; +Cc: Kevin Wolf, qemu-devel

On Mon, Feb 21, 2011 at 12:37 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 18.02.2011 13:55, schrieb Nick Thomas:
>> +retry:
>> +    if (do_read) {
>> +        ret = recvmsg(sockfd, &msg, 0);
>> +    } else {
>> +        ret = sendmsg(sockfd, &msg, 0);
>> +    }
>> +
>> +    /* recoverable error */
>> +    if (ret == -1 && (errno == EAGAIN || errno == EINTR)) {
>> +        goto retry;
>> +    }
>
> Make this a do...while loop, no reason for goto here.

This is not asynchronous.

If we're writing and the in-kernel socket buffer is full then we'll spin
retrying - this burns CPU for no reason.  Instead we should get a
callback when the socket becomes writable again so we can fill it with
more data.

If we're reading and enough data hasn't arrived yet we will spin
retrying.  Again, we should set up a callback to be notified when the
socket becomes readable again.

It seems sheepdog suffers from the same problem.

Stefan


* Re: [Qemu-devel] Re: [PATCH 3/3] block/nbd: Make the NBD block device use the AIO interface
  2011-02-21 20:07   ` Stefan Hajnoczi
@ 2011-02-22 11:48     ` Nicholas Thomas
  2011-02-22 13:31       ` Stefan Hajnoczi
  0 siblings, 1 reply; 12+ messages in thread
From: Nicholas Thomas @ 2011-02-22 11:48 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: kwolf, qemu-devel


> > + * Send I/O requests to the server.
> > + *
> > + * This function sends requests to the server, links the requests to
> > + * the outstanding_list in BDRVNBDState, and exits without waiting for
> > + * the response.  The responses are received in the `aio_read_response'
> > + * function which is called from the main loop as a fd handler.
> > + * If this is a write request and it's >1MB, split it into multiple AIOReqs
> > + */
> > +static void nbd_readv_writev_bh_cb(void *p)
> >  {
> > -    BDRVNBDState *s = bs->opaque;
> > -    struct nbd_request request;
> > +    NBDAIOCB *acb = p;
> > +    int ret = 0;
> >
> > -    request.type = NBD_CMD_DISC;
> > -    request.handle = (uint64_t)(intptr_t)bs;
> > -    request.from = 0;
> > -    request.len = 0;
> > -    nbd_send_request(s->sock, &request);
> > +    size_t len, done = 0;
> > +    size_t total = acb->nb_sectors * SECTOR_SIZE;
> > +
> > +    /* Where the read/write starts from */
> > +    size_t offset = acb->sector_num * SECTOR_SIZE;
> 
> On 32-bit hosts size_t is only 32 bits wide.  It's not suitable for
> storing the byte offset into the device.  Please use off_t.
> 
> > +    BDRVNBDState *s = acb->common.bs->opaque;
> 
> I don't follow this.  You have no guarantee that acb->common.bs->opaque
> is the BDRVNBDState *.

Another idiom from sheepdog. Seems qed, qcow2, qcow & vdi do the same.

It seems the magic is being done in nbd_aio_setup & qemu_aio_get:
(gdb) bt
#0  nbd_aio_setup (bs=0xa69600, qiov=0x7fffffffdc00, sector_num=0, nb_sectors=4,
    cb=0x4041c6 <aio_rw_done>, opaque=0x7fffffffdbac) at block/nbd.c:443
#1  0x00000000004395c3 in nbd_aio_readv (bs=0xa69600, sector_num=0,
    qiov=0x7fffffffdc00, nb_sectors=4, cb=0x4041c6 <aio_rw_done>,
    opaque=0x7fffffffdbac) at block/nbd.c:538
#2  0x000000000041223c in bdrv_aio_readv (bs=0xa69600, sector_num=0,
    qiov=0x7fffffffdc00, nb_sectors=4, cb=0x4041c6 <aio_rw_done>,
    opaque=0x7fffffffdbac) at block.c:2158
[...]

(gdb) inspect acb
$17 = (NBDAIOCB *) 0x40f708

(gdb) inspect &acb->common
$21 = (BlockDriverAIOCB *) 0x40f708

(Sorry, the line numbers are unlikely to match yours).

It's some kind of simple polymorphism? Everything south of block/nbd.c
accesses the memory region as a BlockDriverAIOCB, while everything north
accesses it as an NBDAIOCB - with the additional fields that implies?
I've never come across something like this before - but it does seem to
work. 

acb->common.bs is set to the BlockDriverState we use everywhere - whose
opaque field is known. In what situations will this fail to work?
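
A condensed sketch of the pattern, for illustration:

    struct NBDAIOCB {
        BlockDriverAIOCB common;  /* must remain the first member */
        /* driver-private fields follow (qiov, sector_num, ...) */
    };

    /* qemu_aio_get() allocates nbd_aio_pool.aiocb_size bytes, i.e. a whole
     * NBDAIOCB, but only fills in the embedded BlockDriverAIOCB.  Because
     * "common" sits at offset 0, acb and &acb->common are the same address:
     * generic block code passes the BlockDriverAIOCB* around, and the
     * driver casts it back to NBDAIOCB* to reach its private fields. */
    NBDAIOCB *acb = qemu_aio_get(&nbd_aio_pool, bs, cb, opaque);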

/Nick


* Re: [Qemu-devel] Re: [PATCH 3/3] block/nbd: Make the NBD block device use the AIO interface
  2011-02-22 11:48     ` Nicholas Thomas
@ 2011-02-22 13:31       ` Stefan Hajnoczi
  0 siblings, 0 replies; 12+ messages in thread
From: Stefan Hajnoczi @ 2011-02-22 13:31 UTC (permalink / raw)
  To: Nicholas Thomas; +Cc: kwolf, qemu-devel

On Tue, Feb 22, 2011 at 11:48 AM, Nicholas Thomas <nick@bytemark.co.uk> wrote:
>
>> > + * Send I/O requests to the server.
>> > + *
>> > + * This function sends requests to the server, links the requests to
>> > + * the outstanding_list in BDRVNBDState, and exits without waiting for
>> > + * the response.  The responses are received in the `aio_read_response'
>> > + * function which is called from the main loop as a fd handler.
>> > + * If this is a write request and it's >1MB, split it into multiple AIOReqs
>> > + */
>> > +static void nbd_readv_writev_bh_cb(void *p)
>> >  {
>> > -    BDRVNBDState *s = bs->opaque;
>> > -    struct nbd_request request;
>> > +    NBDAIOCB *acb = p;
>> > +    int ret = 0;
>> >
>> > -    request.type = NBD_CMD_DISC;
>> > -    request.handle = (uint64_t)(intptr_t)bs;
>> > -    request.from = 0;
>> > -    request.len = 0;
>> > -    nbd_send_request(s->sock, &request);
>> > +    size_t len, done = 0;
>> > +    size_t total = acb->nb_sectors * SECTOR_SIZE;
>> > +
>> > +    /* Where the read/write starts from */
>> > +    size_t offset = acb->sector_num * SECTOR_SIZE;
>>
>> On 32-bit hosts size_t is only 32 bits wide.  It's not suitable for
>> storing the byte offset into the device.  Please use off_t.
>>
>> > +    BDRVNBDState *s = acb->common.bs->opaque;
>>
>> I don't follow this.  You have no guarantee that acb->common.bs->opaque
>> is the BDRVNBDState *.
>
> Another idiom from sheepdog. Seems qed, qcow2, qcow & vdi do the same.

My mistake, I ignored the "bs->".  This is fine but I read it as
acb->common.opaque.

Stefan


* Re: [Qemu-devel] Re: [PATCH 2/3] NBD library: add aio-compatible read/write function
  2011-02-21 20:10     ` Stefan Hajnoczi
@ 2011-02-22 15:26       ` Nicholas Thomas
  2011-02-22 20:14         ` Stefan Hajnoczi
  0 siblings, 1 reply; 12+ messages in thread
From: Nicholas Thomas @ 2011-02-22 15:26 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, qemu-devel

On Mon, 2011-02-21 at 20:10 +0000, Stefan Hajnoczi wrote:
> On Mon, Feb 21, 2011 at 12:37 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> > Am 18.02.2011 13:55, schrieb Nick Thomas:
> >> +retry:
> >> +    if (do_read) {
> >> +        ret = recvmsg(sockfd, &msg, 0);
> >> +    } else {
> >> +        ret = sendmsg(sockfd, &msg, 0);
> >> +    }
> >> +
> >> +    /* recoverable error */
> >> +    if (ret == -1 && (errno == EAGAIN || errno == EINTR)) {
> >> +        goto retry;
> >> +    }
> >
> > Make this a do...while loop, no reason for goto here.
> 
> This is not asynchronous.
> 
> If we're writing and the in-kernel socket buffer is full then we'll spin
> retrying - this burns CPU for no reason.  Instead we should get a
> callback when the socket becomes writable again so we can fill it with
> more data.
> 
> If we're reading and enough data hasn't arrived yet we will spin
> retrying.  Again, we should set up a callback to be notified when the
> socket becomes readable again.
> 
> It seems sheepdog suffers from the same problem.

Here, I'm a little concerned by the possibility of getting reads and
writes from different requests interleaved - so, writing half of the
data for one write request, returning, then getting another write
request through and starting that before the other one is completed. 

Obviously, I 'just' need to write clever code that avoids that ;)

Since sheepdog has the same issue, it might be worth me looking at this
as a separate issue in both code bases, starting from a patch that gets
nbd to the same kind of state as sheepdog in this area? The intermediate
stage is definitely better than the starting point, after all.

v3 of patches incoming, with this in mind.

/Nick


* Re: [Qemu-devel] Re: [PATCH 3/3] block/nbd: Make the NBD block device use the AIO interface
  2011-02-21 16:48       ` Kevin Wolf
@ 2011-02-22 16:06         ` MORITA Kazutaka
  0 siblings, 0 replies; 12+ messages in thread
From: MORITA Kazutaka @ 2011-02-22 16:06 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: stefanha, qemu-devel, Nicholas Thomas

At Mon, 21 Feb 2011 17:48:49 +0100,
Kevin Wolf wrote:
> 
> Am 21.02.2011 17:31, schrieb Nicholas Thomas:
> > Hi again,
> > 
> > Thanks for looking through the patches. I'm just going through and
> > making the suggested changes now. I've also got qemu-nbd and block/nbd.c
> > working over IPv6 :) - hopefully I'll be able to provide patches in a
> > couple of days. Just a few questions about some of the changes...
> > 
> > Canceled requests: 
> >>> +
> >>> +
> >>> +static void nbd_aio_cancel(BlockDriverAIOCB *blockacb)
> >>> +{
> >>> +    NBDAIOCB *acb = (NBDAIOCB *)blockacb;
> >>> +
> >>> +    /*
> >>> +     * We cannot cancel the requests which are already sent to
> >>> +     * the servers, so we just complete the request with -EIO here.
> >>> +     */
> >>> +    acb->common.cb(acb->common.opaque, -EIO);
> >>> +    acb->canceled = 1;
> >>> +}
> >>
> >> I think you need to check for acb->canceled before you write to the
> >> associated buffer when receiving the reply for a read request. The
> >> buffer might not exist any more after the request is cancelled.
> > 
> > I "borrowed" this code from block/sheepdog.c (along with a fair few
> > other bits ;) ) - which doesn't seem to do any special checking for
> > cancelled write requests. So if there is a potential SIGSEGV here, I
> > guess sheepdog is also vulnerable.
> 
> Right, now that you mention it, I seem to remember this from Sheepdog. I
> think I had a discussion with Stefan and he convinced me that we could
> get away with it in Sheepdog because of some condition that Sheepdog
> meets. Not sure any more what that condition was and if it applies to NBD.
> 
> Was it that Sheepdog has a bounce buffer for all requests?

Sheepdog doesn't use a bounce buffer for any requests, and to me, it
seems that Sheepdog also needs to check acb->canceled before reading
the response of a read request...


> >>> +static BlockDriverAIOCB *nbd_aio_readv(BlockDriverState *bs,
> >>> +        int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
> >>> +        BlockDriverCompletionFunc *cb, void *opaque)
> >>> +{
> >>> [...]
> >>> +    for (i = 0; i < qiov->niov; i++) {
> >>> +        memset(qiov->iov[i].iov_base, 0, qiov->iov[i].iov_len);
> >>> +    }
> >>
> >> qemu_iovec_memset?
> >>
> >> What is this even for? Aren't these zeros overwritten anyway?
> > 
> > Again, present in sheepdog - but it does seem to work fine without it.
> > I'll remove it from NBD.
> 
> Maybe Sheepdog reads only partially from the server if blocks are
> unallocated or something.

Yes, exactly.


Thanks,

Kazutaka


* Re: [Qemu-devel] Re: [PATCH 2/3] NBD library: add aio-compatible read/write function
  2011-02-22 15:26       ` Nicholas Thomas
@ 2011-02-22 20:14         ` Stefan Hajnoczi
  0 siblings, 0 replies; 12+ messages in thread
From: Stefan Hajnoczi @ 2011-02-22 20:14 UTC (permalink / raw)
  To: Nicholas Thomas; +Cc: Kevin Wolf, qemu-devel

On Tue, Feb 22, 2011 at 3:26 PM, Nicholas Thomas <nick@bytemark.co.uk> wrote:
> On Mon, 2011-02-21 at 20:10 +0000, Stefan Hajnoczi wrote:
>> On Mon, Feb 21, 2011 at 12:37 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> > Am 18.02.2011 13:55, schrieb Nick Thomas:
>> >> +retry:
>> >> +    if (do_read) {
>> >> +        ret = recvmsg(sockfd, &msg, 0);
>> >> +    } else {
>> >> +        ret = sendmsg(sockfd, &msg, 0);
>> >> +    }
>> >> +
>> >> +    /* recoverable error */
>> >> +    if (ret == -1 && (errno == EAGAIN || errno == EINTR)) {
>> >> +        goto retry;
>> >> +    }
>> >
>> > Make this a do...while loop, no reason for goto here.
>>
>> This is not asynchronous.
>>
>> If we're writing and the in-kernel socket buffer is full then we'll spin
>> retrying - this burns CPU for no reason.  Instead we should get a
>> callback when the socket becomes writable again so we can fill it with
>> more data.
>>
>> If we're reading and enough data hasn't arrived yet we will spin
>> retrying.  Again, we should set up a callback to be notified when the
>> socket becomes readable again.
>>
>> It seems sheepdog suffers from the same problem.
>
> Here, I'm a little concerned by the possibility of getting reads and
> writes from different requests interleaved - so, writing half of the
> data for one write request, returning, then getting another write
> request through and starting that before the other one is completed.
>
> Obviously, I 'just' need to write clever code that avoids that ;)
>
> Since sheepdog has the same issue, it might be worth me looking at this
> as a separate issue in both code bases, starting from a patch that gets
> nbd to the same kind of state as sheepdog in this area? The intermediate
> stage is definitely better than the starting point, after all.

Please try to solve it in the same patch series.  This is not a
limitation where we're being sub-optimal but well-behaved.  Spinning
is a bug and introducing it is a problem.

ui/vnc.c has a solution to this problem: it has a Buffer abstraction.
Data can be copied into the buffer.  It then drains the buffer to the
socket in vnc_client_write().  For reads the caller says how many
bytes they are expecting and their callback will not be invoked until
the buffer has filled at least that much.

For NBD I would look at something more like QEMUIOVector instead of
copying between a Buffer.  It should be possible to concatenate iovec
arrays that need to be transmitted without any interleaving issues.
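
A very rough sketch of that direction (hypothetical, and it assumes the
current qemu_iovec_concat(dst, src, size) signature):

    /* One pending-output qiov per connection: each request appends its
     * header and payload here, and a single write handler drains it to the
     * socket when it becomes writable, so partial writes from different
     * requests can never interleave. */
    typedef struct NBDOutQueue {
        QEMUIOVector out;  /* initialised once with qemu_iovec_init() */
    } NBDOutQueue;

    static void nbd_queue_request(NBDOutQueue *q, QEMUIOVector *qiov)
    {
        qemu_iovec_concat(&q->out, qiov, qiov->size);
    }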

Stefan

