* [Qemu-devel] NBD block device backend - 'improvements'
@ 2011-02-14 19:40 Nicholas Thomas
  2011-02-14 20:32 ` Stefan Hajnoczi
  0 siblings, 1 reply; 13+ messages in thread
From: Nicholas Thomas @ 2011-02-14 19:40 UTC (permalink / raw)
  To: qemu-devel


[Apologies for the cross-post - I originally sent this to the KVM ML -
obviously, it's far more appropriate here]

Hi,

I've been doing some work with /block/nbd.c with the aim of improving
its behaviour when the NBD server is inaccessible or goes away.

Current behaviour is to exit on startup if connecting to the NBD server
fails for any reason. If the connection dies once KVM is started, the
VM stays up but reads and writes fail. No attempt is made to
re-establish the connection.

I've written a patch that changes the behaviour - instead of exiting at
startup, we wait for the NBD connection to be established, and we hang
on reads and writes until the connection is re-established.

I'm interested in getting the changes merged upstream, so I thought I'd
get in early and ask if you'd be interested in the patch, in principle;
whether the old behaviour would need to be preserved, making the new
behaviour accessible via a config option ("-drive
file=nbd:127.0.0.1:5000:retry=forever,..." ?); and whether I'm going
about the changes in a sane way (I've attached the current version of
the patch).

Another thing I've noticed is that the nbd library ( /nbd.c ) is not
IPv6-compatible ( "-drive file=nbd:\:\:1:5000", for instance ) - I don't
have a patch for that yet, but I'm going to need to write one :) -
presumably you'd like that merged upstream too (and I should make the
library use the functions provided in qemu_socket.h) ?
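
For reference, a protocol-agnostic connect along the lines of what
qemu_socket.h could offer might look like the sketch below. This is purely
illustrative - the function name tcp_socket_outgoing_addr is invented here,
not an existing QEMU API - but getaddrinfo() resolves both address families,
so "::1" works as well as "127.0.0.1":

```c
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

/* Hypothetical helper: connect to host:port over IPv4 or IPv6.
 * getaddrinfo() handles both address families, unlike a hand-rolled
 * inet_aton()-based resolver.  Returns a connected fd or -1. */
static int tcp_socket_outgoing_addr(const char *host, const char *port)
{
    struct addrinfo hints, *res, *ai;
    int fd = -1;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;      /* v4 or v6, whatever resolves */
    hints.ai_socktype = SOCK_STREAM;

    if (getaddrinfo(host, port, &hints, &res) != 0) {
        return -1;
    }
    for (ai = res; ai; ai = ai->ai_next) {
        fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0) {
            continue;
        }
        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0) {
            break;                    /* connected */
        }
        close(fd);
        fd = -1;
    }
    freeaddrinfo(res);
    return fd;
}
```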

Regards,
-- 
Nick Thomas
Bytemark Computing



[-- Attachment #2: 01-make-nbd-retry.patch --]
[-- Type: text/x-patch, Size: 4761 bytes --]

diff --git a/block/nbd.c b/block/nbd.c
index c8dc763..87da07e 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -39,9 +39,11 @@ typedef struct BDRVNBDState {
     int sock;
     off_t size;
     size_t blocksize;
+    char *filename;
 } BDRVNBDState;
 
-static int nbd_open(BlockDriverState *bs, const char* filename, int flags)
+
+static int nbd_open(BlockDriverState *bs)
 {
     BDRVNBDState *s = bs->opaque;
     uint32_t nbdflags;
@@ -56,7 +58,7 @@ static int nbd_open(BlockDriverState *bs, const char* filename, int flags)
     int ret;
     int err = -EINVAL;
 
-    file = qemu_strdup(filename);
+    file = qemu_strdup(s->filename);
 
     name = strstr(file, EN_OPTSTR);
     if (name) {
@@ -121,32 +123,62 @@ out:
     return err;
 }
 
+// Puts the filename into the driver state and calls nbd_open - hanging until
+// it is successful.
+static int nbd_setup(BlockDriverState *bs, const char* filename, int flags)
+{
+    BDRVNBDState *s = bs->opaque;
+    int err = 1;
+
+    s->filename = qemu_strdup(filename);
+    while (err != 0)
+    {
+        err = nbd_open(bs);
+        // Avoid tight loops
+        if (err != 0)
+            sleep(1);
+    }
+
+    return err;
+}
+
 static int nbd_read(BlockDriverState *bs, int64_t sector_num,
                     uint8_t *buf, int nb_sectors)
 {
     BDRVNBDState *s = bs->opaque;
     struct nbd_request request;
     struct nbd_reply reply;
+    bool success = false;
 
     request.type = NBD_CMD_READ;
     request.handle = (uint64_t)(intptr_t)bs;
-    request.from = sector_num * 512;;
+    request.from = sector_num * 512;
     request.len = nb_sectors * 512;
 
-    if (nbd_send_request(s->sock, &request) == -1)
-        return -errno;
+    while (success == false)
+    {
+        if ( (nbd_send_request(s->sock, &request) == -1) ||
+             (nbd_receive_reply(s->sock, &reply) == -1)     )
+        {
+            // We hang here until the TCP session is established
+            close(s->sock);
+            while(nbd_open(bs) != 0)
+                sleep(1);
+            continue;
+        }
 
-    if (nbd_receive_reply(s->sock, &reply) == -1)
-        return -errno;
+        if (reply.error !=0)
+            return -reply.error;
 
-    if (reply.error !=0)
-        return -reply.error;
+        if (reply.handle != request.handle)
+            return -EIO;
 
-    if (reply.handle != request.handle)
-        return -EIO;
+        // If reading the actual data fails, we retry the whole request
+        if (nbd_wr_sync(s->sock, buf, request.len, 1) != request.len)
+            continue;
 
-    if (nbd_wr_sync(s->sock, buf, request.len, 1) != request.len)
-        return -EIO;
+        success = true;
+    }
 
     return 0;
 }
@@ -157,26 +189,39 @@ static int nbd_write(BlockDriverState *bs, int64_t sector_num,
     BDRVNBDState *s = bs->opaque;
     struct nbd_request request;
     struct nbd_reply reply;
+    bool success = false;
 
     request.type = NBD_CMD_WRITE;
     request.handle = (uint64_t)(intptr_t)bs;
     request.from = sector_num * 512;;
     request.len = nb_sectors * 512;
 
-    if (nbd_send_request(s->sock, &request) == -1)
-        return -errno;
+    while (success == false)
+    {
+        if ( (nbd_send_request(s->sock, &request) == -1) ||
+             (nbd_wr_sync(s->sock, (uint8_t*)buf, request.len, 0) != request.len) )
+        {
+            // We hang here until the TCP session is established
+            close(s->sock);
+            while(nbd_open(bs) != 0)
+                sleep(1);
+            continue;
+        }
+
+        // We didn't get a reply from the write, so try again
+        if (nbd_receive_reply(s->sock, &reply) == -1)
+            continue;
 
-    if (nbd_wr_sync(s->sock, (uint8_t*)buf, request.len, 0) != request.len)
-        return -EIO;
+        // Problem with the response itself
+        if (reply.error !=0)
+            return -reply.error;
 
-    if (nbd_receive_reply(s->sock, &reply) == -1)
-        return -errno;
+        if (reply.handle != request.handle)
+            return -EIO;
 
-    if (reply.error !=0)
-        return -reply.error;
+        success = true;
+    }
 
-    if (reply.handle != request.handle)
-        return -EIO;
 
     return 0;
 }
@@ -191,6 +236,7 @@ static void nbd_close(BlockDriverState *bs)
     request.from = 0;
     request.len = 0;
     nbd_send_request(s->sock, &request);
+    qemu_free(s->filename);
 
     close(s->sock);
 }
@@ -205,7 +251,7 @@ static int64_t nbd_getlength(BlockDriverState *bs)
 static BlockDriver bdrv_nbd = {
     .format_name	= "nbd",
     .instance_size	= sizeof(BDRVNBDState),
-    .bdrv_file_open	= nbd_open,
+    .bdrv_file_open	= nbd_setup,
     .bdrv_read		= nbd_read,
     .bdrv_write		= nbd_write,
     .bdrv_close		= nbd_close,

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] NBD block device backend - 'improvements'
  2011-02-14 19:40 [Qemu-devel] NBD block device backend - 'improvements' Nicholas Thomas
@ 2011-02-14 20:32 ` Stefan Hajnoczi
  2011-02-15 11:09   ` Kevin Wolf
  0 siblings, 1 reply; 13+ messages in thread
From: Stefan Hajnoczi @ 2011-02-14 20:32 UTC (permalink / raw)
  To: nick; +Cc: qemu-devel, Laurent Vivier

On Mon, Feb 14, 2011 at 7:40 PM, Nicholas Thomas <nick@lupine.me.uk> wrote:
> I've written a patch that changes the behaviour - instead of exiting at
> startup, we wait for the NBD connection to be established, and we hang
> on reads and writes until the connection is re-established.

Hi Nick,
I think reconnect is a useful feature.  For more info on submitting
patches to QEMU, please see
http://wiki.qemu.org/Contribute/SubmitAPatch.  It contains a few
points like sending patches inline instead of as an attachment (makes
review easier), using Signed-off-by:, and references to QEMU coding
style.

> I'm interested in getting the changes merged upstream, so I thought I'd
> get in early and ask if you'd be interested in the patch, in principle;
> whether the old behaviour would need to be preserved, making the new
> behaviour accessible via a config option ("-drive
> file=nbd:127.0.0.1:5000:retry=forever,..." ?); and whether I'm going
> about the changes in a sane way (I've attached the current version of
> the patch).

block/nbd.c needs to be made asynchronous in order for this change to
work.  Otherwise the only thing you can do is to block QEMU and the
VM, and even that shouldn't be done using sleep(2) because that could
stop the I/O thread which processes the QEMU monitor and other
external interfaces.  See below for more info.
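
[To make the "blocks QEMU" point concrete: sleep(1) inside nbd_read()
freezes the whole process. An asynchronous driver would instead keep a
small connection state machine and let the event loop (a timer or fd
callback) drive reconnection attempts. A toy sketch, entirely
hypothetical, of what such a state machine might track:]

```c
#include <stdbool.h>

/* Hypothetical sketch: connection state driven from event-loop
 * callbacks rather than a busy sleep() loop, so the monitor and
 * I/O thread stay responsive while the NBD server is down. */
enum nbd_conn_state { NBD_DISCONNECTED, NBD_CONNECTING, NBD_CONNECTED };

struct nbd_conn {
    enum nbd_conn_state state;
    int attempts;               /* reconnect attempts so far */
};

/* Called from a timer/fd callback, never from a blocking loop. */
static void nbd_conn_tick(struct nbd_conn *c, bool connect_succeeded)
{
    switch (c->state) {
    case NBD_DISCONNECTED:
        c->state = NBD_CONNECTING;  /* kick off a new attempt */
        c->attempts++;
        break;
    case NBD_CONNECTING:
        c->state = connect_succeeded ? NBD_CONNECTED : NBD_DISCONNECTED;
        break;
    case NBD_CONNECTED:
        break;                      /* nothing to do */
    }
}
```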

> Another thing I've noticed is that the nbd library ( /nbd.c ) is not
> IPv6-compatible ( "-drive file=nbd:\:\:1:5000", for instance ) - I don't
> have a patch for that yet, but I'm going to need to write one :) -
> presumably you'd like that merged upstream too (and I should make the
> library use the functions provided in qemu_socket.h) ?

IPv6 would be nice and if you can consolidate that in qemu_socket.h,
then that's a win for non-nbd socket users too.

I have inlined your patch below for easy reviewing:

> diff --git a/block/nbd.c b/block/nbd.c
> index c8dc763..87da07e 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -39,9 +39,11 @@ typedef struct BDRVNBDState {
>      int sock;
>      off_t size;
>      size_t blocksize;
> +    char *filename;

block_int.h:BlockDriverState->filename already has this.

>  } BDRVNBDState;
>
> -static int nbd_open(BlockDriverState *bs, const char* filename, int flags)
> +
> +static int nbd_open(BlockDriverState *bs)
>  {
>      BDRVNBDState *s = bs->opaque;
>      uint32_t nbdflags;
> @@ -56,7 +58,7 @@ static int nbd_open(BlockDriverState *bs, const char* filename, int flags)
>      int ret;
>      int err = -EINVAL;
>
> -    file = qemu_strdup(filename);
> +    file = qemu_strdup(s->filename);
>
>      name = strstr(file, EN_OPTSTR);
>      if (name) {
> @@ -121,32 +123,62 @@ out:
>      return err;
>  }
>
> +// Puts the filename into the driver state and calls nbd_open - hanging until
> +// it is successful.

Please use /* */ comments instead, QEMU coding style.

> +static int nbd_setup(BlockDriverState *bs, const char* filename, int flags)
> +{
> +    BDRVNBDState *s = bs->opaque;
> +    int err = 1;
> +
> +    s->filename = qemu_strdup(filename);
> +    while (err != 0)
> +    {
> +        err = nbd_open(bs);
> +        // Avoid tight loops
> +        if (err != 0)
> +            sleep(1);

Please use {} even for single statement blocks.

> +    }
> +
> +    return err;
> +}

Waiting until the connection succeeds is a policy decision.  It might
be equally useful to abort starting QEMU or to start with the nbd
device unavailable (either returning I/O errors or not completing
requests until connection).  There should be a way to configure this,
otherwise we might get a user who depended on the old way having to
file a bug report.
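
[If the policy does become configurable, parsing something like
retry=forever / retry=<seconds> / retry=off out of the -drive string is
cheap. A hedged sketch - the option names are only the ones floated in
this thread, nothing official:]

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical retry policy parsed from a "retry=..." drive option:
 *  -1 -> retry forever,
 *   0 -> fail immediately (the old behaviour, and the default),
 *   N -> keep retrying for N seconds. */
static int parse_retry_option(const char *value)
{
    if (value == NULL || strcmp(value, "off") == 0) {
        return 0;               /* preserve old behaviour by default */
    }
    if (strcmp(value, "forever") == 0) {
        return -1;
    }
    return atoi(value);         /* e.g. "30" -> retry for 30 seconds */
}
```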

> +
>  static int nbd_read(BlockDriverState *bs, int64_t sector_num,
>                      uint8_t *buf, int nb_sectors)
>  {
>      BDRVNBDState *s = bs->opaque;
>      struct nbd_request request;
>      struct nbd_reply reply;
> +    bool success = false;
>
>      request.type = NBD_CMD_READ;
>      request.handle = (uint64_t)(intptr_t)bs;
> -    request.from = sector_num * 512;;
> +    request.from = sector_num * 512;
>      request.len = nb_sectors * 512;
>
> -    if (nbd_send_request(s->sock, &request) == -1)
> -        return -errno;
> +    while (success == false)
> +    {
> +        if ( (nbd_send_request(s->sock, &request) == -1) ||
> +             (nbd_receive_reply(s->sock, &reply) == -1)     )
> +        {
> +            // We hang here until the TCP session is established
> +            close(s->sock);
> +            while(nbd_open(bs) != 0)
> +                sleep(1);
> +            continue;

This hangs the VM.  It will literally stop executing code and be
frozen.  Doing this without hanging the VM involves changing
nbd_read()/nbd_write() to work asynchronously.

Also, on ENOSPC QEMU can already pause the VM using a different
mechanism.  That may not be appropriate here but it could be
interesting to see how that code works.

> +        }
>
> -    if (nbd_receive_reply(s->sock, &reply) == -1)
> -        return -errno;
> +        if (reply.error !=0)
> +            return -reply.error;
>
> -    if (reply.error !=0)
> -        return -reply.error;
> +        if (reply.handle != request.handle)
> +            return -EIO;
>
> -    if (reply.handle != request.handle)
> -        return -EIO;
> +        // If reading the actual data fails, we retry the whole request
> +        if (nbd_wr_sync(s->sock, buf, request.len, 1) != request.len)
> +            continue;
>
> -    if (nbd_wr_sync(s->sock, buf, request.len, 1) != request.len)
> -        return -EIO;
> +        success = true;
> +    }
>
>      return 0;
>  }
> @@ -157,26 +189,39 @@ static int nbd_write(BlockDriverState *bs, int64_t sector_num,
>      BDRVNBDState *s = bs->opaque;
>      struct nbd_request request;
>      struct nbd_reply reply;
> +    bool success = false;
>
>      request.type = NBD_CMD_WRITE;
>      request.handle = (uint64_t)(intptr_t)bs;
>      request.from = sector_num * 512;;
>      request.len = nb_sectors * 512;
>
> -    if (nbd_send_request(s->sock, &request) == -1)
> -        return -errno;
> +    while (success == false)
> +    {
> +        if ( (nbd_send_request(s->sock, &request) == -1) ||
> +             (nbd_wr_sync(s->sock, (uint8_t*)buf, request.len, 0) != request.len) )
> +        {
> +            // We hang here until the TCP session is established
> +            close(s->sock);
> +            while(nbd_open(bs) != 0)
> +                sleep(1);
> +            continue;
> +        }
> +
> +        // We didn't get a reply from the write, so try again
> +        if (nbd_receive_reply(s->sock, &reply) == -1)
> +            continue;
>
> -    if (nbd_wr_sync(s->sock, (uint8_t*)buf, request.len, 0) != request.len)
> -        return -EIO;
> +        // Problem with the response itself
> +        if (reply.error !=0)
> +            return -reply.error;
>
> -    if (nbd_receive_reply(s->sock, &reply) == -1)
> -        return -errno;
> +        if (reply.handle != request.handle)
> +            return -EIO;
>
> -    if (reply.error !=0)
> -        return -reply.error;
> +        success = true;
> +    }
>
> -    if (reply.handle != request.handle)
> -        return -EIO;
>
>      return 0;
>  }
> @@ -191,6 +236,7 @@ static void nbd_close(BlockDriverState *bs)
>      request.from = 0;
>      request.len = 0;
>      nbd_send_request(s->sock, &request);
> +    qemu_free(s->filename);
>
>      close(s->sock);
>  }
> @@ -205,7 +251,7 @@ static int64_t nbd_getlength(BlockDriverState *bs)
>  static BlockDriver bdrv_nbd = {
>      .format_name	= "nbd",
>      .instance_size	= sizeof(BDRVNBDState),
> -    .bdrv_file_open	= nbd_open,
> +    .bdrv_file_open	= nbd_setup,
>      .bdrv_read		= nbd_read,
>      .bdrv_write		= nbd_write,
>      .bdrv_close		= nbd_close,


* Re: [Qemu-devel] NBD block device backend - 'improvements'
  2011-02-14 20:32 ` Stefan Hajnoczi
@ 2011-02-15 11:09   ` Kevin Wolf
  2011-02-15 21:26     ` Nicholas Thomas
  0 siblings, 1 reply; 13+ messages in thread
From: Kevin Wolf @ 2011-02-15 11:09 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: nick, qemu-devel, Laurent Vivier

Am 14.02.2011 21:32, schrieb Stefan Hajnoczi:
> On Mon, Feb 14, 2011 at 7:40 PM, Nicholas Thomas <nick@lupine.me.uk> wrote:
>> I've written a patch that changes the behaviour - instead of exiting at
>> startup, we wait for the NBD connection to be established, and we hang
>> on reads and writes until the connection is re-established.
> 
> Hi Nick,
> I think reconnect is a useful feature.  For more info on submitting
> patches to QEMU, please see
> http://wiki.qemu.org/Contribute/SubmitAPatch.  It contains a few
> points like sending patches inline instead of as an attachment (makes
> review easier), using Signed-off-by:, and references to QEMU coding
> style.
> 
>> I'm interested in getting the changes merged upstream, so I thought I'd
>> get in early and ask if you'd be interested in the patch, in principle;
>> whether the old behaviour would need to be preserved, making the new
>> behaviour accessible via a config option ("-drive
>> file=nbd:127.0.0.1:5000:retry=forever,..." ?); and whether I'm going
>> about the changes in a sane way (I've attached the current version of
>> the patch).
> 
> block/nbd.c needs to be made asynchronous in order for this change to
> work.  

And even then it's not free of problem: For example qemu_aio_flush()
will hang. We're having all kinds of fun with NFS servers that go away
and let requests hang indefinitely.

So maybe what we should add is a timeout option which defaults to 0
(fail immediately, like today)
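
[The timeout semantics described here - 0 meaning fail immediately, a
negative value presumably meaning retry forever - reduce to a small
deadline check. A minimal sketch, with the function name invented:]

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical: decide whether a reconnect loop should keep going.
 * timeout < 0 means retry forever; timeout == 0 means fail at once
 * (today's behaviour); otherwise give up after `timeout` seconds. */
static bool nbd_should_retry(int timeout, time_t elapsed)
{
    if (timeout < 0) {
        return true;            /* retry=forever */
    }
    return elapsed < timeout;   /* timeout == 0 -> false immediately */
}
```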

> Otherwise the only thing you can do is to block QEMU and the
> VM, and even that shouldn't be done using sleep(2) because that could
> stop the I/O thread which processes the QEMU monitor and other
> external interfaces.  See below for more info.

Unconditionally stopping the VM from a block driver sounds wrong to me.
If you want to have this behaviour, the block driver should return an
error and you should use werror=stop.

>> Another thing I've noticed is that the nbd library ( /nbd.c ) is not
>> IPv6-compatible ( "-drive file=nbd:\:\:1:5000", for instance ) - I don't
>> have a patch for that yet, but I'm going to need to write one :) -
>> presumably you'd like that merged upstream too (and I should make the
>> library use the functions provided in qemu_socket.h) ?
> 
> IPv6 would be nice and if you can consolidate that in qemu_socket.h,
> then that's a win for non-nbd socket users too.

Agreed.

Kevin


* Re: [Qemu-devel] NBD block device backend - 'improvements'
  2011-02-15 11:09   ` Kevin Wolf
@ 2011-02-15 21:26     ` Nicholas Thomas
  2011-02-16 12:00       ` Kevin Wolf
  0 siblings, 1 reply; 13+ messages in thread
From: Nicholas Thomas @ 2011-02-15 21:26 UTC (permalink / raw)
  To: qemu-devel

Hi Kevin, Stefan.

On Tue, 2011-02-15 at 12:09 +0100, Kevin Wolf wrote:
> Am 14.02.2011 21:32, schrieb Stefan Hajnoczi:
[...]
> > block/nbd.c needs to be made asynchronous in order for this change to
> > work.  
> 
> And even then it's not free of problem: For example qemu_aio_flush()
> will hang. We're having all kinds of fun with NFS servers that go away
> and let requests hang indefinitely.
> 
> So maybe what we should add is a timeout option which defaults to 0
> (fail immediately, like today)

Noted, so long as we can have -1 as "forever". I'm currently spending
time reworking block/nbd.c to be asynchronous, following the model in
block/sheepdog.c

There does seem to be a lot of scope for code duplication (setting up
the TCP connection, taking it down, the mechanics of actually reading /
writing bytes using the aio interface, etc) between the two, and
presumably for rbd as well. 

Reading http://www.mail-archive.com/qemu-devel@nongnu.org/msg36479.html
suggests it should be possible to have a "tcp" (+ "unix") protocol /
transport, which nbd+sheepdog could stack on top of (curl+rbd seem to
depend on their own libraries for managing the TCP part of the
connection).

They would implement talking the actual protocol, while the tcp/unix
transports would have the duplicatable bits.

I've not investigated it in code yet - it's possible I'm just letting my
appetite for abstraction get away with me. Thoughts?
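
[To sketch what that stacking might mean in code: the protocol drivers
(nbd, sheepdog) would program against a small table of transport hooks,
while tcp/unix backends own the socket lifecycle. All names below are
invented for illustration - nothing like this exists in the tree:]

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical transport interface: protocol drivers speak their wire
 * format through these hooks; the backend owns connect/reconnect/close. */
typedef struct BlockTransport {
    void *opaque;                          /* per-connection state */
    ssize_t (*xfer_read)(void *opaque, void *buf, size_t len);
    ssize_t (*xfer_write)(void *opaque, const void *buf, size_t len);
} BlockTransport;

/* Toy in-memory backend, just to show the dispatch. */
struct mem_conn { char buf[64]; size_t used; };

static ssize_t mem_write(void *opaque, const void *buf, size_t len)
{
    struct mem_conn *m = opaque;
    memcpy(m->buf + m->used, buf, len);
    m->used += len;
    return (ssize_t)len;
}

static ssize_t mem_read(void *opaque, void *buf, size_t len)
{
    struct mem_conn *m = opaque;
    if (len > m->used) {
        len = m->used;          /* short read: only what was written */
    }
    memcpy(buf, m->buf, len);
    return (ssize_t)len;
}
```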

> Unconditionally stopping the VM from a block driver sounds wrong to me.
> If you want to have this behaviour, the block driver should return an
> error and you should use werror=stop.

Unconditional? - if the socket manages to re-establish, the process
continues on its way (I guess we'd see the same behaviour if a send/recv
happened to take an unconscionably long time with the current code).

Making just the I/O hang until the network comes back, keeping guest
execution and qemu monitor working, is obviously better than that
(although not /strictly/ necessary for our particular use case), so I
hope to be able to offer an AIO NBD patch for review "soon". 

> > IPv6 would be nice and if you can consolidate that in qemu_socket.h,
> > then that's a win for non-nbd socket users too.
> 
> Agreed.

We'd get it for free with a unified TCP transport, as described above
(sheepdog already uses getaddrinfo and friends) - but if that's not
feasible, I'll be happy to supply a patch just for this. Much easier
than aio! :)

/Nick


* Re: [Qemu-devel] NBD block device backend - 'improvements'
  2011-02-15 21:26     ` Nicholas Thomas
@ 2011-02-16 12:00       ` Kevin Wolf
  2011-02-17 16:27         ` [Qemu-devel] " Nicholas Thomas
                           ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Kevin Wolf @ 2011-02-16 12:00 UTC (permalink / raw)
  To: Nicholas Thomas; +Cc: Stefan Hajnoczi, qemu-devel

Hi Nick,

[ Please use "reply to all" so that the CC list is kept intact,
re-adding Stefan ]

Am 15.02.2011 22:26, schrieb Nicholas Thomas:
> Hi Kevin, Stefan.
> 
> On Tue, 2011-02-15 at 12:09 +0100, Kevin Wolf wrote:
>> Am 14.02.2011 21:32, schrieb Stefan Hajnoczi:
> [...]
>>> block/nbd.c needs to be made asynchronous in order for this change to
>>> work.  
>>
>> And even then it's not free of problem: For example qemu_aio_flush()
>> will hang. We're having all kinds of fun with NFS servers that go away
>> and let requests hang indefinitely.
>>
>> So maybe what we should add is a timeout option which defaults to 0
>> (fail immediately, like today)
> 
> Noted, so long as we can have -1 as "forever". 

As long as it's optional, that's fine with me.

> I'm currently spending
> time reworking block/nbd.c to be asynchronous, following the model in
> block/sheepdog.c
> 
> There does seem to be a lot of scope for code duplication (setting up
> the TCP connection, taking it down, the mechanics of actually reading /
> writing bytes using the aio interface, etc) between the two, and
> presumably for rbd as well. 
> 
> Reading http://www.mail-archive.com/qemu-devel@nongnu.org/msg36479.html
> suggests it should be possible to have a "tcp" (+ "unix") protocol /
> transport, which nbd+sheepdog could stack on top of (curl+rbd seem to
> depend on their own libraries for managing the TCP part of the
> connection).
> 
> They would implement talking the actual protocol, while the tcp/unix
> transports would have the duplicatable bits.
> 
> I've not investigated it in code yet - it's possible I'm just letting my
> appetite for abstraction get away with me. Thoughts?

I'm not sure about how much duplication there actually is, but if you
can take a closer look and think it's worthwhile, we should probably
consider it.

>> Unconditionally stopping the VM from a block driver sounds wrong to me.
>> If you want to have this behaviour, the block driver should return an
>> error and you should use werror=stop.
> 
> Unconditional? - if the socket manages to re-establish, the process
> continues on its way (I guess we'd see the same behaviour if a send/recv
> happened to take an unconscionably long time with the current code).
> 
> Making just the I/O hang until the network comes back, keeping guest
> execution and qemu monitor working, is obviously better than that
> (although not /strictly/ necessary for our particular use case), so I
> hope to be able to offer an AIO NBD patch for review "soon". 

Maybe I wasn't very clear. I was talking about Stefan's suggestion to
completely stop the VM, like we already can do for I/O errors (see the
werror and rerror options for -drive). I think it's not what you're
looking for, you just need the timeout=-1 thing.

>>> IPv6 would be nice and if you can consolidate that in qemu_socket.h,
>>> then that's a win for non-nbd socket users too.
>>
>> Agreed.
> 
> We'd get it for free with a unified TCP transport, as described above
> (sheepdog already uses getaddrinfo and friends) - but if that's not
> feasible, I'll be happy to supply a patch just for this. Much easier
> than aio! :)

Sure, I think it would be a good thing to have. And even if you
implemented this unified TCP transport (I'm not sure yet what it would
look like), I think the basic support could still be in qemu_socket.h,
so that users outside the block layer can benefit from it, too.

Kevin


* [Qemu-devel] Re: NBD block device backend - 'improvements'
  2011-02-16 12:00       ` Kevin Wolf
@ 2011-02-17 16:27         ` Nicholas Thomas
  2011-02-17 16:34         ` [Qemu-devel] [PATCH 1/3] NBD library: whitespace changes Nicholas Thomas
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: Nicholas Thomas @ 2011-02-17 16:27 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Stefan Hajnoczi, qemu-devel

Hi again,

On Wed, 2011-02-16 at 13:00 +0100, Kevin Wolf wrote:
> Am 15.02.2011 22:26, schrieb Nicholas Thomas:
> > On Tue, 2011-02-15 at 12:09 +0100, Kevin Wolf wrote:
> >> Am 14.02.2011 21:32, schrieb Stefan Hajnoczi:
> I'm not sure about how much duplication there actually is, but if you
> can take a closer look and think it's worthwhile, we should probably
> consider it.

It's a couple of hundred lines, I'd guess - on reflection, it's probably
not enough to be too bothered about - so I haven't. I'll submit a
patchset in a moment implementing the first part of what we've been
talking about (converting the NBD driver to use the aio interface).

Assuming we can get this merged, I'll submit an IPv6 and a timeout=
patch too. It might also be worth adding aio support to qemu-nbd - we
use the canonical nbd-server binary, so it won't affect us directly, but
it seems a shame for the server to be behind the client's capabilities.

/Nick


* [Qemu-devel] [PATCH 1/3] NBD library: whitespace changes
  2011-02-16 12:00       ` Kevin Wolf
  2011-02-17 16:27         ` [Qemu-devel] " Nicholas Thomas
@ 2011-02-17 16:34         ` Nicholas Thomas
  2011-02-17 16:34         ` [Qemu-devel] [PATCH 2/3] NBD library: add aio-compatible read/write function Nicholas Thomas
  2011-02-17 16:34         ` [Qemu-devel] [PATCH 3/3] block/nbd: Make the NBD block device use the AIO interface Nicholas Thomas
  3 siblings, 0 replies; 13+ messages in thread
From: Nicholas Thomas @ 2011-02-17 16:34 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Stefan Hajnoczi, qemu-devel

Replace an entertaining mixture of tabs and spaces with four-space
indents.

Signed-off-by: Nick Thomas <nick@bytemark.co.uk>
---
 nbd.c |  835 +++++++++++++++++++++++++++++++++--------------------------------
 1 files changed, 418 insertions(+), 417 deletions(-)

diff --git a/nbd.c b/nbd.c
index d8ebc42..abe0ecb 100644
--- a/nbd.c
+++ b/nbd.c
@@ -49,7 +49,7 @@
 
 /* This is all part of the "official" NBD API */
 
-#define NBD_REPLY_SIZE		(4 + 4 + 8)
+#define NBD_REPLY_SIZE          (4 + 4 + 8)
 #define NBD_REQUEST_MAGIC       0x25609513
 #define NBD_REPLY_MAGIC         0x67446698
 
@@ -59,11 +59,11 @@
 #define NBD_DO_IT               _IO(0xab, 3)
 #define NBD_CLEAR_SOCK          _IO(0xab, 4)
 #define NBD_CLEAR_QUE           _IO(0xab, 5)
-#define NBD_PRINT_DEBUG	        _IO(0xab, 6)
-#define NBD_SET_SIZE_BLOCKS	_IO(0xab, 7)
+#define NBD_PRINT_DEBUG         _IO(0xab, 6)
+#define NBD_SET_SIZE_BLOCKS     _IO(0xab, 7)
 #define NBD_DISCONNECT          _IO(0xab, 8)
 
-#define NBD_OPT_EXPORT_NAME	(1 << 0)
+#define NBD_OPT_EXPORT_NAME     (1 << 0)
 
 /* That's all folks */
 
@@ -273,241 +273,241 @@ int unix_socket_outgoing(const char *path)
 
 int nbd_negotiate(int csock, off_t size)
 {
-	char buf[8 + 8 + 8 + 128];
-
-	/* Negotiate
-	   [ 0 ..   7]   passwd   ("NBDMAGIC")
-	   [ 8 ..  15]   magic    (0x00420281861253)
-	   [16 ..  23]   size
-	   [24 .. 151]   reserved (0)
-	 */
-
-	TRACE("Beginning negotiation.");
-	memcpy(buf, "NBDMAGIC", 8);
-	cpu_to_be64w((uint64_t*)(buf + 8), 0x00420281861253LL);
-	cpu_to_be64w((uint64_t*)(buf + 16), size);
-	memset(buf + 24, 0, 128);
-
-	if (write_sync(csock, buf, sizeof(buf)) != sizeof(buf)) {
-		LOG("write failed");
-		errno = EINVAL;
-		return -1;
-	}
-
-	TRACE("Negotation succeeded.");
-
-	return 0;
+    char buf[8 + 8 + 8 + 128];
+
+    /* Negotiate
+        [ 0 ..   7]   passwd   ("NBDMAGIC")
+        [ 8 ..  15]   magic    (0x00420281861253)
+        [16 ..  23]   size
+        [24 .. 151]   reserved (0)
+     */
+
+    TRACE("Beginning negotiation.");
+    memcpy(buf, "NBDMAGIC", 8);
+    cpu_to_be64w((uint64_t*)(buf + 8), 0x00420281861253LL);
+    cpu_to_be64w((uint64_t*)(buf + 16), size);
+    memset(buf + 24, 0, 128);
+
+    if (write_sync(csock, buf, sizeof(buf)) != sizeof(buf)) {
+        LOG("write failed");
+        errno = EINVAL;
+        return -1;
+    }
+
+    TRACE("Negotation succeeded.");
+
+    return 0;
 }
 
 int nbd_receive_negotiate(int csock, const char *name, uint32_t *flags,
                           off_t *size, size_t *blocksize)
 {
-	char buf[256];
-	uint64_t magic, s;
-	uint16_t tmp;
-
-	TRACE("Receiving negotation.");
-
-	if (read_sync(csock, buf, 8) != 8) {
-		LOG("read failed");
-		errno = EINVAL;
-		return -1;
-	}
-
-	buf[8] = '\0';
-	if (strlen(buf) == 0) {
-		LOG("server connection closed");
-		errno = EINVAL;
-		return -1;
-	}
-
-	TRACE("Magic is %c%c%c%c%c%c%c%c",
-	      qemu_isprint(buf[0]) ? buf[0] : '.',
-	      qemu_isprint(buf[1]) ? buf[1] : '.',
-	      qemu_isprint(buf[2]) ? buf[2] : '.',
-	      qemu_isprint(buf[3]) ? buf[3] : '.',
-	      qemu_isprint(buf[4]) ? buf[4] : '.',
-	      qemu_isprint(buf[5]) ? buf[5] : '.',
-	      qemu_isprint(buf[6]) ? buf[6] : '.',
-	      qemu_isprint(buf[7]) ? buf[7] : '.');
-
-	if (memcmp(buf, "NBDMAGIC", 8) != 0) {
-		LOG("Invalid magic received");
-		errno = EINVAL;
-		return -1;
-	}
-
-	if (read_sync(csock, &magic, sizeof(magic)) != sizeof(magic)) {
-		LOG("read failed");
-		errno = EINVAL;
-		return -1;
-	}
-	magic = be64_to_cpu(magic);
-	TRACE("Magic is 0x%" PRIx64, magic);
-
-	if (name) {
-		uint32_t reserved = 0;
-		uint32_t opt;
-		uint32_t namesize;
-
-		TRACE("Checking magic (opts_magic)");
-		if (magic != 0x49484156454F5054LL) {
-			LOG("Bad magic received");
-			errno = EINVAL;
-			return -1;
-		}
-		if (read_sync(csock, &tmp, sizeof(tmp)) != sizeof(tmp)) {
-			LOG("flags read failed");
-			errno = EINVAL;
-			return -1;
-		}
-		*flags = be16_to_cpu(tmp) << 16;
-		/* reserved for future use */
-		if (write_sync(csock, &reserved, sizeof(reserved)) !=
-		    sizeof(reserved)) {
-			LOG("write failed (reserved)");
-			errno = EINVAL;
-			return -1;
-		}
-		/* write the export name */
-		magic = cpu_to_be64(magic);
-		if (write_sync(csock, &magic, sizeof(magic)) != sizeof(magic)) {
-			LOG("write failed (magic)");
-			errno = EINVAL;
-			return -1;
-		}
-		opt = cpu_to_be32(NBD_OPT_EXPORT_NAME);
-		if (write_sync(csock, &opt, sizeof(opt)) != sizeof(opt)) {
-			LOG("write failed (opt)");
-			errno = EINVAL;
-			return -1;
-		}
-		namesize = cpu_to_be32(strlen(name));
-		if (write_sync(csock, &namesize, sizeof(namesize)) !=
-		    sizeof(namesize)) {
-			LOG("write failed (namesize)");
-			errno = EINVAL;
-			return -1;
-		}
-		if (write_sync(csock, (char*)name, strlen(name)) != strlen(name)) {
-			LOG("write failed (name)");
-			errno = EINVAL;
-			return -1;
-		}
-	} else {
-		TRACE("Checking magic (cli_magic)");
-
-		if (magic != 0x00420281861253LL) {
-			LOG("Bad magic received");
-			errno = EINVAL;
-			return -1;
-		}
-	}
-
-	if (read_sync(csock, &s, sizeof(s)) != sizeof(s)) {
-		LOG("read failed");
-		errno = EINVAL;
-		return -1;
-	}
-	*size = be64_to_cpu(s);
-	*blocksize = 1024;
-	TRACE("Size is %" PRIu64, *size);
-
-	if (!name) {
-		if (read_sync(csock, flags, sizeof(*flags)) != sizeof(*flags)) {
-			LOG("read failed (flags)");
-			errno = EINVAL;
-			return -1;
-		}
-		*flags = be32_to_cpup(flags);
-	} else {
-		if (read_sync(csock, &tmp, sizeof(tmp)) != sizeof(tmp)) {
-			LOG("read failed (tmp)");
-			errno = EINVAL;
-			return -1;
-		}
-		*flags |= be32_to_cpu(tmp);
-	}
-	if (read_sync(csock, &buf, 124) != 124) {
-		LOG("read failed (buf)");
-		errno = EINVAL;
-		return -1;
-	}
+    char buf[256];
+    uint64_t magic, s;
+    uint16_t tmp;
+
+    TRACE("Receiving negotiation.");
+
+    if (read_sync(csock, buf, 8) != 8) {
+        LOG("read failed");
+        errno = EINVAL;
+        return -1;
+    }
+
+    buf[8] = '\0';
+    if (strlen(buf) == 0) {
+        LOG("server connection closed");
+        errno = EINVAL;
+        return -1;
+    }
+
+    TRACE("Magic is %c%c%c%c%c%c%c%c",
+          qemu_isprint(buf[0]) ? buf[0] : '.',
+          qemu_isprint(buf[1]) ? buf[1] : '.',
+          qemu_isprint(buf[2]) ? buf[2] : '.',
+          qemu_isprint(buf[3]) ? buf[3] : '.',
+          qemu_isprint(buf[4]) ? buf[4] : '.',
+          qemu_isprint(buf[5]) ? buf[5] : '.',
+          qemu_isprint(buf[6]) ? buf[6] : '.',
+          qemu_isprint(buf[7]) ? buf[7] : '.');
+
+    if (memcmp(buf, "NBDMAGIC", 8) != 0) {
+        LOG("Invalid magic received");
+        errno = EINVAL;
+        return -1;
+    }
+
+    if (read_sync(csock, &magic, sizeof(magic)) != sizeof(magic)) {
+        LOG("read failed");
+        errno = EINVAL;
+        return -1;
+    }
+    magic = be64_to_cpu(magic);
+    TRACE("Magic is 0x%" PRIx64, magic);
+
+    if (name) {
+        uint32_t reserved = 0;
+        uint32_t opt;
+        uint32_t namesize;
+
+        TRACE("Checking magic (opts_magic)");
+        if (magic != 0x49484156454F5054LL) {
+            LOG("Bad magic received");
+            errno = EINVAL;
+            return -1;
+        }
+        if (read_sync(csock, &tmp, sizeof(tmp)) != sizeof(tmp)) {
+            LOG("flags read failed");
+            errno = EINVAL;
+            return -1;
+        }
+        *flags = be16_to_cpu(tmp) << 16;
+        /* reserved for future use */
+        if (write_sync(csock, &reserved, sizeof(reserved)) !=
+            sizeof(reserved)) {
+            LOG("write failed (reserved)");
+            errno = EINVAL;
+            return -1;
+        }
+        /* write the export name */
+        magic = cpu_to_be64(magic);
+        if (write_sync(csock, &magic, sizeof(magic)) != sizeof(magic)) {
+            LOG("write failed (magic)");
+            errno = EINVAL;
+            return -1;
+        }
+        opt = cpu_to_be32(NBD_OPT_EXPORT_NAME);
+        if (write_sync(csock, &opt, sizeof(opt)) != sizeof(opt)) {
+            LOG("write failed (opt)");
+            errno = EINVAL;
+            return -1;
+        }
+        namesize = cpu_to_be32(strlen(name));
+        if (write_sync(csock, &namesize, sizeof(namesize)) !=
+            sizeof(namesize)) {
+            LOG("write failed (namesize)");
+            errno = EINVAL;
+            return -1;
+        }
+        if (write_sync(csock, (char*)name, strlen(name)) != strlen(name)) {
+            LOG("write failed (name)");
+            errno = EINVAL;
+            return -1;
+        }
+    } else {
+        TRACE("Checking magic (cli_magic)");
+
+        if (magic != 0x00420281861253LL) {
+            LOG("Bad magic received");
+            errno = EINVAL;
+            return -1;
+        }
+    }
+
+    if (read_sync(csock, &s, sizeof(s)) != sizeof(s)) {
+        LOG("read failed");
+        errno = EINVAL;
+        return -1;
+    }
+    *size = be64_to_cpu(s);
+    *blocksize = 1024;
+    TRACE("Size is %" PRIu64, *size);
+
+    if (!name) {
+        if (read_sync(csock, flags, sizeof(*flags)) != sizeof(*flags)) {
+            LOG("read failed (flags)");
+            errno = EINVAL;
+            return -1;
+        }
+        *flags = be32_to_cpup(flags);
+    } else {
+        if (read_sync(csock, &tmp, sizeof(tmp)) != sizeof(tmp)) {
+            LOG("read failed (tmp)");
+            errno = EINVAL;
+            return -1;
+        }
+        *flags |= be32_to_cpu(tmp);
+    }
+    if (read_sync(csock, &buf, 124) != 124) {
+        LOG("read failed (buf)");
+        errno = EINVAL;
+        return -1;
+    }
         return 0;
 }
 
 #ifndef _WIN32
 int nbd_init(int fd, int csock, off_t size, size_t blocksize)
 {
-	TRACE("Setting block size to %lu", (unsigned long)blocksize);
+    TRACE("Setting block size to %lu", (unsigned long)blocksize);
 
-	if (ioctl(fd, NBD_SET_BLKSIZE, blocksize) == -1) {
-		int serrno = errno;
-		LOG("Failed setting NBD block size");
-		errno = serrno;
-		return -1;
-	}
+    if (ioctl(fd, NBD_SET_BLKSIZE, blocksize) == -1) {
+        int serrno = errno;
+        LOG("Failed setting NBD block size");
+        errno = serrno;
+        return -1;
+    }
 
         TRACE("Setting size to %zd block(s)", (size_t)(size / blocksize));
 
-	if (ioctl(fd, NBD_SET_SIZE_BLOCKS, size / blocksize) == -1) {
-		int serrno = errno;
-		LOG("Failed setting size (in blocks)");
-		errno = serrno;
-		return -1;
-	}
+    if (ioctl(fd, NBD_SET_SIZE_BLOCKS, size / blocksize) == -1) {
+        int serrno = errno;
+        LOG("Failed setting size (in blocks)");
+        errno = serrno;
+        return -1;
+    }
 
-	TRACE("Clearing NBD socket");
+    TRACE("Clearing NBD socket");
 
-	if (ioctl(fd, NBD_CLEAR_SOCK) == -1) {
-		int serrno = errno;
-		LOG("Failed clearing NBD socket");
-		errno = serrno;
-		return -1;
-	}
+    if (ioctl(fd, NBD_CLEAR_SOCK) == -1) {
+        int serrno = errno;
+        LOG("Failed clearing NBD socket");
+        errno = serrno;
+        return -1;
+    }
 
-	TRACE("Setting NBD socket");
+    TRACE("Setting NBD socket");
 
-	if (ioctl(fd, NBD_SET_SOCK, csock) == -1) {
-		int serrno = errno;
-		LOG("Failed to set NBD socket");
-		errno = serrno;
-		return -1;
-	}
+    if (ioctl(fd, NBD_SET_SOCK, csock) == -1) {
+        int serrno = errno;
+        LOG("Failed to set NBD socket");
+        errno = serrno;
+        return -1;
+    }
 
-	TRACE("Negotiation ended");
+    TRACE("Negotiation ended");
 
-	return 0;
+    return 0;
 }
 
 int nbd_disconnect(int fd)
 {
-	ioctl(fd, NBD_CLEAR_QUE);
-	ioctl(fd, NBD_DISCONNECT);
-	ioctl(fd, NBD_CLEAR_SOCK);
-	return 0;
+    ioctl(fd, NBD_CLEAR_QUE);
+    ioctl(fd, NBD_DISCONNECT);
+    ioctl(fd, NBD_CLEAR_SOCK);
+    return 0;
 }
 
 int nbd_client(int fd)
 {
-	int ret;
-	int serrno;
+    int ret;
+    int serrno;
 
-	TRACE("Doing NBD loop");
+    TRACE("Doing NBD loop");
 
-	ret = ioctl(fd, NBD_DO_IT);
-	serrno = errno;
+    ret = ioctl(fd, NBD_DO_IT);
+    serrno = errno;
 
-	TRACE("NBD loop returned %d: %s", ret, strerror(serrno));
+    TRACE("NBD loop returned %d: %s", ret, strerror(serrno));
 
-	TRACE("Clearing NBD queue");
-	ioctl(fd, NBD_CLEAR_QUE);
+    TRACE("Clearing NBD queue");
+    ioctl(fd, NBD_CLEAR_QUE);
 
-	TRACE("Clearing NBD socket");
-	ioctl(fd, NBD_CLEAR_SOCK);
+    TRACE("Clearing NBD socket");
+    ioctl(fd, NBD_CLEAR_SOCK);
 
-	errno = serrno;
-	return ret;
+    errno = serrno;
+    return ret;
 }
 #else
 int nbd_init(int fd, int csock, off_t size, size_t blocksize)
@@ -531,235 +531,236 @@ int nbd_client(int fd)
 
 int nbd_send_request(int csock, struct nbd_request *request)
 {
-	uint8_t buf[4 + 4 + 8 + 8 + 4];
-
-	cpu_to_be32w((uint32_t*)buf, NBD_REQUEST_MAGIC);
-	cpu_to_be32w((uint32_t*)(buf + 4), request->type);
-	cpu_to_be64w((uint64_t*)(buf + 8), request->handle);
-	cpu_to_be64w((uint64_t*)(buf + 16), request->from);
-	cpu_to_be32w((uint32_t*)(buf + 24), request->len);
-
-	TRACE("Sending request to client");
-
-	if (write_sync(csock, buf, sizeof(buf)) != sizeof(buf)) {
-		LOG("writing to socket failed");
-		errno = EINVAL;
-		return -1;
-	}
-	return 0;
-}
+    uint8_t buf[4 + 4 + 8 + 8 + 4];
+
+    cpu_to_be32w((uint32_t*)buf, NBD_REQUEST_MAGIC);
+    cpu_to_be32w((uint32_t*)(buf + 4), request->type);
+    cpu_to_be64w((uint64_t*)(buf + 8), request->handle);
+    cpu_to_be64w((uint64_t*)(buf + 16), request->from);
+    cpu_to_be32w((uint32_t*)(buf + 24), request->len);
 
+    TRACE("Sending request to client: "
+          "{ .from = %" PRIu64", .len = %u, .handle = %" PRIu64", .type=%i}",
+          request->from, request->len, request->handle, request->type);
+
+    if (write_sync(csock, buf, sizeof(buf)) != sizeof(buf)) {
+        LOG("writing to socket failed");
+        errno = EINVAL;
+        return -1;
+    }
+    return 0;
+}
 
 static int nbd_receive_request(int csock, struct nbd_request *request)
 {
-	uint8_t buf[4 + 4 + 8 + 8 + 4];
-	uint32_t magic;
-
-	if (read_sync(csock, buf, sizeof(buf)) != sizeof(buf)) {
-		LOG("read failed");
-		errno = EINVAL;
-		return -1;
-	}
-
-	/* Request
-	   [ 0 ..  3]   magic   (NBD_REQUEST_MAGIC)
-	   [ 4 ..  7]   type    (0 == READ, 1 == WRITE)
-	   [ 8 .. 15]   handle
-	   [16 .. 23]   from
-	   [24 .. 27]   len
-	 */
-
-	magic = be32_to_cpup((uint32_t*)buf);
-	request->type  = be32_to_cpup((uint32_t*)(buf + 4));
-	request->handle = be64_to_cpup((uint64_t*)(buf + 8));
-	request->from  = be64_to_cpup((uint64_t*)(buf + 16));
-	request->len   = be32_to_cpup((uint32_t*)(buf + 24));
-
-	TRACE("Got request: "
-	      "{ magic = 0x%x, .type = %d, from = %" PRIu64" , len = %u }",
-	      magic, request->type, request->from, request->len);
-
-	if (magic != NBD_REQUEST_MAGIC) {
-		LOG("invalid magic (got 0x%x)", magic);
-		errno = EINVAL;
-		return -1;
-	}
-	return 0;
+    uint8_t buf[4 + 4 + 8 + 8 + 4];
+    uint32_t magic;
+
+    if (read_sync(csock, buf, sizeof(buf)) != sizeof(buf)) {
+        LOG("read failed");
+        errno = EINVAL;
+        return -1;
+    }
+
+    /* Request
+       [ 0 ..  3]   magic   (NBD_REQUEST_MAGIC)
+       [ 4 ..  7]   type    (0 == READ, 1 == WRITE)
+       [ 8 .. 15]   handle
+       [16 .. 23]   from
+       [24 .. 27]   len
+     */
+
+    magic = be32_to_cpup((uint32_t*)buf);
+    request->type  = be32_to_cpup((uint32_t*)(buf + 4));
+    request->handle = be64_to_cpup((uint64_t*)(buf + 8));
+    request->from  = be64_to_cpup((uint64_t*)(buf + 16));
+    request->len   = be32_to_cpup((uint32_t*)(buf + 24));
+
+    TRACE("Got request: "
+          "{ magic = 0x%x, .type = %d, from = %" PRIu64" , len = %u }",
+          magic, request->type, request->from, request->len);
+
+    if (magic != NBD_REQUEST_MAGIC) {
+        LOG("invalid magic (got 0x%x)", magic);
+        errno = EINVAL;
+        return -1;
+    }
+    return 0;
 }
 
 int nbd_receive_reply(int csock, struct nbd_reply *reply)
 {
-	uint8_t buf[NBD_REPLY_SIZE];
-	uint32_t magic;
-
-	memset(buf, 0xAA, sizeof(buf));
-
-	if (read_sync(csock, buf, sizeof(buf)) != sizeof(buf)) {
-		LOG("read failed");
-		errno = EINVAL;
-		return -1;
-	}
-
-	/* Reply
-	   [ 0 ..  3]    magic   (NBD_REPLY_MAGIC)
-	   [ 4 ..  7]    error   (0 == no error)
-	   [ 7 .. 15]    handle
-	 */
-
-	magic = be32_to_cpup((uint32_t*)buf);
-	reply->error  = be32_to_cpup((uint32_t*)(buf + 4));
-	reply->handle = be64_to_cpup((uint64_t*)(buf + 8));
-
-	TRACE("Got reply: "
-	      "{ magic = 0x%x, .error = %d, handle = %" PRIu64" }",
-	      magic, reply->error, reply->handle);
-
-	if (magic != NBD_REPLY_MAGIC) {
-		LOG("invalid magic (got 0x%x)", magic);
-		errno = EINVAL;
-		return -1;
-	}
-	return 0;
+    uint8_t buf[NBD_REPLY_SIZE];
+    uint32_t magic;
+
+    memset(buf, 0xAA, sizeof(buf));
+
+    if (read_sync(csock, buf, sizeof(buf)) != sizeof(buf)) {
+        LOG("read failed");
+        errno = EINVAL;
+        return -1;
+    }
+
+    /* Reply
+       [ 0 ..  3]    magic   (NBD_REPLY_MAGIC)
+       [ 4 ..  7]    error   (0 == no error)
+       [ 7 .. 15]    handle
+     */
+
+    magic = be32_to_cpup((uint32_t*)buf);
+    reply->error  = be32_to_cpup((uint32_t*)(buf + 4));
+    reply->handle = be64_to_cpup((uint64_t*)(buf + 8));
+
+    TRACE("Got reply: "
+          "{ magic = 0x%x, .error = %d, handle = %" PRIu64" }",
+          magic, reply->error, reply->handle);
+
+    if (magic != NBD_REPLY_MAGIC) {
+        LOG("invalid magic (got 0x%x)", magic);
+        errno = EINVAL;
+        return -1;
+    }
+    return 0;
 }
 
 static int nbd_send_reply(int csock, struct nbd_reply *reply)
 {
-	uint8_t buf[4 + 4 + 8];
-
-	/* Reply
-	   [ 0 ..  3]    magic   (NBD_REPLY_MAGIC)
-	   [ 4 ..  7]    error   (0 == no error)
-	   [ 7 .. 15]    handle
-	 */
-	cpu_to_be32w((uint32_t*)buf, NBD_REPLY_MAGIC);
-	cpu_to_be32w((uint32_t*)(buf + 4), reply->error);
-	cpu_to_be64w((uint64_t*)(buf + 8), reply->handle);
-
-	TRACE("Sending response to client");
-
-	if (write_sync(csock, buf, sizeof(buf)) != sizeof(buf)) {
-		LOG("writing to socket failed");
-		errno = EINVAL;
-		return -1;
-	}
-	return 0;
+    uint8_t buf[4 + 4 + 8];
+
+    /* Reply
+       [ 0 ..  3]    magic   (NBD_REPLY_MAGIC)
+       [ 4 ..  7]    error   (0 == no error)
+       [ 7 .. 15]    handle
+     */
+    cpu_to_be32w((uint32_t*)buf, NBD_REPLY_MAGIC);
+    cpu_to_be32w((uint32_t*)(buf + 4), reply->error);
+    cpu_to_be64w((uint64_t*)(buf + 8), reply->handle);
+
+    TRACE("Sending response to client");
+
+    if (write_sync(csock, buf, sizeof(buf)) != sizeof(buf)) {
+        LOG("writing to socket failed");
+        errno = EINVAL;
+        return -1;
+    }
+    return 0;
 }
 
 int nbd_trip(BlockDriverState *bs, int csock, off_t size, uint64_t dev_offset,
              off_t *offset, bool readonly, uint8_t *data, int data_size)
 {
-	struct nbd_request request;
-	struct nbd_reply reply;
-
-	TRACE("Reading request.");
-
-	if (nbd_receive_request(csock, &request) == -1)
-		return -1;
-
-	if (request.len + NBD_REPLY_SIZE > data_size) {
-		LOG("len (%u) is larger than max len (%u)",
-		    request.len + NBD_REPLY_SIZE, data_size);
-		errno = EINVAL;
-		return -1;
-	}
-
-	if ((request.from + request.len) < request.from) {
-		LOG("integer overflow detected! "
-		    "you're probably being attacked");
-		errno = EINVAL;
-		return -1;
-	}
-
-	if ((request.from + request.len) > size) {
-	        LOG("From: %" PRIu64 ", Len: %u, Size: %" PRIu64
-		    ", Offset: %" PRIu64 "\n",
+    struct nbd_request request;
+    struct nbd_reply reply;
+
+    TRACE("Reading request.");
+
+    if (nbd_receive_request(csock, &request) == -1)
+        return -1;
+
+    if (request.len + NBD_REPLY_SIZE > data_size) {
+        LOG("len (%u) is larger than max len (%u)",
+            request.len + NBD_REPLY_SIZE, data_size);
+        errno = EINVAL;
+        return -1;
+    }
+
+    if ((request.from + request.len) < request.from) {
+        LOG("integer overflow detected! "
+            "you're probably being attacked");
+        errno = EINVAL;
+        return -1;
+    }
+
+    if ((request.from + request.len) > size) {
+            LOG("From: %" PRIu64 ", Len: %u, Size: %" PRIu64
+            ", Offset: %" PRIu64 "\n",
                    request.from, request.len, (uint64_t)size, dev_offset);
-		LOG("requested operation past EOF--bad client?");
-		errno = EINVAL;
-		return -1;
-	}
-
-	TRACE("Decoding type");
-
-	reply.handle = request.handle;
-	reply.error = 0;
-
-	switch (request.type) {
-	case NBD_CMD_READ:
-		TRACE("Request type is READ");
-
-		if (bdrv_read(bs, (request.from + dev_offset) / 512,
-			      data + NBD_REPLY_SIZE,
-			      request.len / 512) == -1) {
-			LOG("reading from file failed");
-			errno = EINVAL;
-			return -1;
-		}
-		*offset += request.len;
-
-		TRACE("Read %u byte(s)", request.len);
-
-		/* Reply
-		   [ 0 ..  3]    magic   (NBD_REPLY_MAGIC)
-		   [ 4 ..  7]    error   (0 == no error)
-		   [ 7 .. 15]    handle
-		 */
-
-		cpu_to_be32w((uint32_t*)data, NBD_REPLY_MAGIC);
-		cpu_to_be32w((uint32_t*)(data + 4), reply.error);
-		cpu_to_be64w((uint64_t*)(data + 8), reply.handle);
-
-		TRACE("Sending data to client");
-
-		if (write_sync(csock, data,
-			       request.len + NBD_REPLY_SIZE) !=
-			       request.len + NBD_REPLY_SIZE) {
-			LOG("writing to socket failed");
-			errno = EINVAL;
-			return -1;
-		}
-		break;
-	case NBD_CMD_WRITE:
-		TRACE("Request type is WRITE");
-
-		TRACE("Reading %u byte(s)", request.len);
-
-		if (read_sync(csock, data, request.len) != request.len) {
-			LOG("reading from socket failed");
-			errno = EINVAL;
-			return -1;
-		}
-
-		if (readonly) {
-			TRACE("Server is read-only, return error");
-			reply.error = 1;
-		} else {
-			TRACE("Writing to device");
-
-			if (bdrv_write(bs, (request.from + dev_offset) / 512,
-				       data, request.len / 512) == -1) {
-				LOG("writing to file failed");
-				errno = EINVAL;
-				return -1;
-			}
-
-			*offset += request.len;
-		}
-
-		if (nbd_send_reply(csock, &reply) == -1)
-			return -1;
-		break;
-	case NBD_CMD_DISC:
-		TRACE("Request type is DISCONNECT");
-		errno = 0;
-		return 1;
-	default:
-		LOG("invalid request type (%u) received", request.type);
-		errno = EINVAL;
-		return -1;
-	}
-
-	TRACE("Request/Reply complete");
-
-	return 0;
+        LOG("requested operation past EOF--bad client?");
+        errno = EINVAL;
+        return -1;
+    }
+
+    TRACE("Decoding type");
+
+    reply.handle = request.handle;
+    reply.error = 0;
+
+    switch (request.type) {
+    case NBD_CMD_READ:
+        TRACE("Request type is READ");
+
+        if (bdrv_read(bs, (request.from + dev_offset) / 512,
+                  data + NBD_REPLY_SIZE,
+                  request.len / 512) == -1) {
+            LOG("reading from file failed");
+            errno = EINVAL;
+            return -1;
+        }
+        *offset += request.len;
+
+        TRACE("Read %u byte(s)", request.len);
+
+        /* Reply
+           [ 0 ..  3]    magic   (NBD_REPLY_MAGIC)
+           [ 4 ..  7]    error   (0 == no error)
+           [ 7 .. 15]    handle
+         */
+
+        cpu_to_be32w((uint32_t*)data, NBD_REPLY_MAGIC);
+        cpu_to_be32w((uint32_t*)(data + 4), reply.error);
+        cpu_to_be64w((uint64_t*)(data + 8), reply.handle);
+
+        TRACE("Sending data to client");
+
+        if (write_sync(csock, data,
+                   request.len + NBD_REPLY_SIZE) !=
+                   request.len + NBD_REPLY_SIZE) {
+            LOG("writing to socket failed");
+            errno = EINVAL;
+            return -1;
+        }
+        break;
+    case NBD_CMD_WRITE:
+        TRACE("Request type is WRITE");
+
+        TRACE("Reading %u byte(s)", request.len);
+
+        if (read_sync(csock, data, request.len) != request.len) {
+            LOG("reading from socket failed");
+            errno = EINVAL;
+            return -1;
+        }
+
+        if (readonly) {
+            TRACE("Server is read-only, return error");
+            reply.error = 1;
+        } else {
+            TRACE("Writing to device");
+
+            if (bdrv_write(bs, (request.from + dev_offset) / 512,
+                       data, request.len / 512) == -1) {
+                LOG("writing to file failed");
+                errno = EINVAL;
+                return -1;
+            }
+
+            *offset += request.len;
+        }
+
+        if (nbd_send_reply(csock, &reply) == -1)
+            return -1;
+        break;
+    case NBD_CMD_DISC:
+        TRACE("Request type is DISCONNECT");
+        errno = 0;
+        return 1;
+    default:
+        LOG("invalid request type (%u) received", request.type);
+        errno = EINVAL;
+        return -1;
+    }
+
+    TRACE("Request/Reply complete");
+
+    return 0;
 }
-- 
1.7.0.4

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [Qemu-devel] [PATCH 2/3] NBD library: add aio-compatible read/write function
  2011-02-16 12:00       ` Kevin Wolf
  2011-02-17 16:27         ` [Qemu-devel] " Nicholas Thomas
  2011-02-17 16:34         ` [Qemu-devel] [PATCH 1/3] NBD library: whitespace changes Nicholas Thomas
@ 2011-02-17 16:34         ` Nicholas Thomas
  2011-02-17 16:34         ` [Qemu-devel] [PATCH 3/3] block/nbd: Make the NBD block device use the AIO interface Nicholas Thomas
  3 siblings, 0 replies; 13+ messages in thread
From: Nicholas Thomas @ 2011-02-17 16:34 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Stefan Hajnoczi, qemu-devel

Signed-off-by: Nick Thomas <nick@bytemark.co.uk>
---
 nbd.c |   51 +++++++++++++++++++++++++++++++++++++++++++++++++++
 nbd.h |    2 ++
 2 files changed, 53 insertions(+), 0 deletions(-)

diff --git a/nbd.c b/nbd.c
index abe0ecb..83d3342 100644
--- a/nbd.c
+++ b/nbd.c
@@ -107,6 +107,57 @@ size_t nbd_wr_sync(int fd, void *buffer, size_t size, bool do_read)
     return offset;
 }
 
+int nbd_wr_aio(int sockfd, struct iovec *iov, size_t len,  off_t offset,
+               bool do_read)
+{
+    struct msghdr msg;
+    int ret, diff;
+
+    memset(&msg, 0, sizeof(msg));
+    msg.msg_iov = iov;
+    msg.msg_iovlen = 1;
+
+    len += offset;
+
+    while (iov->iov_len < len) {
+        len -= iov->iov_len;
+
+        iov++;
+        msg.msg_iovlen++;
+    }
+
+    diff = iov->iov_len - len;
+    iov->iov_len -= diff;
+
+    while (msg.msg_iov->iov_len <= offset) {
+        offset -= msg.msg_iov->iov_len;
+
+        msg.msg_iov++;
+        msg.msg_iovlen--;
+    }
+
+    msg.msg_iov->iov_base = (char *) msg.msg_iov->iov_base + offset;
+    msg.msg_iov->iov_len -= offset;
+
+retry:
+    if (do_read) {
+        ret = recvmsg(sockfd, &msg, 0);
+    } else {
+        ret = sendmsg(sockfd, &msg, 0);
+    }
+
+    /* recoverable error */
+    if (ret == -1 && (errno == EAGAIN || errno == EINTR)) {
+        goto retry;
+    }
+
+    msg.msg_iov->iov_base = (char *) msg.msg_iov->iov_base - offset;
+    msg.msg_iov->iov_len += offset;
+
+    iov->iov_len += diff;
+    return ret;
+}
+
 int tcp_socket_outgoing(const char *address, uint16_t port)
 {
     int s;
diff --git a/nbd.h b/nbd.h
index fc3a594..872218c 100644
--- a/nbd.h
+++ b/nbd.h
@@ -45,6 +45,8 @@ enum {
 #define NBD_DEFAULT_PORT	10809
 
 size_t nbd_wr_sync(int fd, void *buffer, size_t size, bool do_read);
+int nbd_wr_aio(int sockfd, struct iovec *iov, size_t len,  off_t offset,
+               bool do_read);
 int tcp_socket_outgoing(const char *address, uint16_t port);
 int tcp_socket_incoming(const char *address, uint16_t port);
 int unix_socket_outgoing(const char *path);
-- 
1.7.0.4

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [Qemu-devel] [PATCH 3/3] block/nbd: Make the NBD block device use the AIO interface
  2011-02-16 12:00       ` Kevin Wolf
                           ` (2 preceding siblings ...)
  2011-02-17 16:34         ` [Qemu-devel] [PATCH 2/3] NBD library: add aio-compatible read/write function Nicholas Thomas
@ 2011-02-17 16:34         ` Nicholas Thomas
  2011-02-17 19:28           ` Nicholas Thomas
  3 siblings, 1 reply; 13+ messages in thread
From: Nicholas Thomas @ 2011-02-17 16:34 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Stefan Hajnoczi, qemu-devel

Signed-off-by: Nick Thomas <nick@bytemark.co.uk>
---
 block/nbd.c |  549 ++++++++++++++++++++++++++++++++++++++++++++++++++---------
 1 files changed, 464 insertions(+), 85 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index c8dc763..1387227 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -1,11 +1,12 @@
 /*
- * QEMU Block driver for  NBD
+ * QEMU Block driver for  NBD - asynchronous IO
  *
  * Copyright (C) 2008 Bull S.A.S.
  *     Author: Laurent Vivier <Laurent.Vivier@bull.net>
  *
  * Some parts:
  *    Copyright (C) 2007 Anthony Liguori <anthony@codemonkey.ws>
+ *    Copyright (C) 2011 Nick Thomas <nick@bytemark.co.uk>
  *
  * Permission is hereby granted, free of charge, to any person obtaining a copy
  * of this software and associated documentation files (the "Software"), to deal
@@ -27,66 +28,132 @@
  */
 
 #include "qemu-common.h"
+#include "qemu_socket.h"
 #include "nbd.h"
 #include "module.h"
 
 #include <sys/types.h>
 #include <unistd.h>
 
-#define EN_OPTSTR ":exportname="
+#define EN_OPTSTR      ":exportname="
+#define SECTOR_SIZE    512
+
+#define DEBUG_NBD
+
+#if defined(DEBUG_NBD)
+#define logout(fmt, ...) \
+                fprintf(stderr, "nbd\t%-24s" fmt, __func__, ##__VA_ARGS__)
+#else
+#define logout(fmt, ...) ((void)0)
+#endif
+
+
+typedef struct NBDAIOCB NBDAIOCB;
+
+typedef struct AIOReq {
+    NBDAIOCB *aiocb;
+    unsigned int iov_offset;
+
+    off_t offset;
+    size_t data_len;
+    uint8_t flags;
+    uint64_t handle;
+
+    QLIST_ENTRY(AIOReq) outstanding_aio_siblings;
+    QLIST_ENTRY(AIOReq) aioreq_siblings;
+} AIOReq;
+
 
 typedef struct BDRVNBDState {
     int sock;
     off_t size;
     size_t blocksize;
+
+    /* Filled in by nbd_config. Store host_spec because DNS may change */
+    bool tcp_conn;      /* True, we use TCP. False, UNIX domain sockets */
+    char *export_name;  /* An NBD server may export several devices */
+    char *host_spec;    /* Path to socket (UNIX) or hostname/IP (TCP) */
+    uint16_t tcp_port;
+
+    /* We use these for asynchronous I/O */
+    uint64_t aioreq_seq_num;
+    QLIST_HEAD(outstanding_aio_head, AIOReq) outstanding_aio_head;
 } BDRVNBDState;
 
-static int nbd_open(BlockDriverState *bs, const char* filename, int flags)
+enum AIOCBState {
+    AIOCB_WRITE_UDATA,
+    AIOCB_READ_UDATA,
+};
+
+struct NBDAIOCB {
+    BlockDriverAIOCB common;
+
+    QEMUIOVector *qiov;
+
+    int64_t sector_num;
+    int nb_sectors;
+
+    int ret;
+    enum AIOCBState aiocb_type;
+
+    QEMUBH *bh;
+    void (*aio_done_func)(NBDAIOCB *);
+
+    int canceled;
+
+    QLIST_HEAD(aioreq_head, AIOReq) aioreq_head;
+};
+
+static inline int free_aio_req(BDRVNBDState *s, AIOReq *aio_req)
 {
-    BDRVNBDState *s = bs->opaque;
-    uint32_t nbdflags;
+    NBDAIOCB *acb = aio_req->aiocb;
+    QLIST_REMOVE(aio_req, outstanding_aio_siblings);
+    QLIST_REMOVE(aio_req, aioreq_siblings);
+    qemu_free(aio_req);
+
+    return !QLIST_EMPTY(&acb->aioreq_head);
+}
 
+static int nbd_config(BDRVNBDState *s, const char* filename, int flags)
+{
     char *file;
-    char *name;
-    const char *host;
+    char *export_name;
+    const char *host_spec;
     const char *unixpath;
-    int sock;
-    off_t size;
-    size_t blocksize;
-    int ret;
     int err = -EINVAL;
 
     file = qemu_strdup(filename);
 
-    name = strstr(file, EN_OPTSTR);
-    if (name) {
-        if (name[strlen(EN_OPTSTR)] == 0) {
+    export_name = strstr(file, EN_OPTSTR);
+    if (export_name) {
+        if (export_name[strlen(EN_OPTSTR)] == 0) {
             goto out;
         }
-        name[0] = 0;
-        name += strlen(EN_OPTSTR);
+        export_name[0] = 0; /* truncate 'file' */
+        export_name += strlen(EN_OPTSTR);
+        s->export_name = qemu_strdup(export_name);
     }
 
-    if (!strstart(file, "nbd:", &host)) {
+    /* extract the host_spec - fail if it's not nbd:* */
+    if (!strstart(file, "nbd:", &host_spec)) {
         goto out;
     }
 
-    if (strstart(host, "unix:", &unixpath)) {
-
-        if (unixpath[0] != '/') {
+    /* are we a UNIX or TCP socket? */
+    if (strstart(host_spec, "unix:", &unixpath)) {
+        if (unixpath[0] != '/') { /* We demand an absolute path */
             goto out;
         }
-
-        sock = unix_socket_outgoing(unixpath);
-
+        s->tcp_conn = false;
+        s->host_spec = qemu_strdup(unixpath);
     } else {
+        /* We should have an <IPv4 address>:<port> string to split up */
         uint16_t port = NBD_DEFAULT_PORT;
         char *p, *r;
         char hostname[128];
 
-        pstrcpy(hostname, 128, host);
-
-        p = strchr(hostname, ':');
+        pstrcpy(hostname, 128, host_spec);
+        p = strchr(hostname, ':'); /* FIXME: IPv6 */
         if (p != NULL) {
             *p = '\0';
             p++;
@@ -96,121 +163,433 @@ static int nbd_open(BlockDriverState *bs, const char* filename, int flags)
                 goto out;
             }
         }
+        s->tcp_conn = true;
+        s->host_spec = qemu_strdup(hostname);
+        s->tcp_port = port;
+    }
 
-        sock = tcp_socket_outgoing(hostname, port);
+    err = 0;
+
+out:
+    qemu_free(file);
+    if (err != 0) {
+        if (s->export_name != NULL) {
+            qemu_free(s->export_name);
+        }
+        if (s->host_spec != NULL) {
+            qemu_free(s->host_spec);
+        }
+    }
+    return err;
+}
+
+static void aio_read_response(void *opaque)
+{
+    BDRVNBDState *s = opaque;
+    struct nbd_reply reply;
+
+    AIOReq *aio_req = NULL;
+    NBDAIOCB *acb;
+    int rest;
+
+    if (QLIST_EMPTY(&s->outstanding_aio_head)) {
+        return;
     }
 
+    /* read the header */
+    if (nbd_receive_reply(s->sock, &reply) == -1) {
+        logout("Failed to read response from socket\n");
+        /* Having failed to read the reply header, we can't know which
+         * aio_req this corresponds to - so we can't signal a failure.
+         */
+        return;
+    }
+
+    /* find the right aio_req from the outstanding_aio list */
+    QLIST_FOREACH(aio_req, &s->outstanding_aio_head, outstanding_aio_siblings) {
+        if (aio_req->handle == reply.handle) {
+            break;
+        }
+    }
+
+    if (!aio_req) {
+        logout("cannot find aio_req for handle %lu\n", reply.handle);
+        return;
+    }
+
+    acb = aio_req->aiocb;
+
+    if (acb->aiocb_type == AIOCB_READ_UDATA) {
+        off_t offset = 0;
+        int ret = 0;
+        size_t total = aio_req->data_len;
+
+        while (offset < total) {
+            ret = nbd_wr_aio(s->sock, acb->qiov->iov, total - offset, offset,
+                             true);
+
+            if (ret == -1) {
+                logout("Error reading from NBD server: %i (%s)\n",
+                       errno, strerror(errno));
+                return;
+            }
+
+            offset += ret;
+        }
+    }
+
+    if (reply.error != 0) {
+        acb->ret = -EIO;
+        logout("NBD request resulted in error %i\n", reply.error);
+    }
+
+    rest = free_aio_req(s, aio_req);
+    if (!rest) {
+        acb->aio_done_func(acb);
+    }
+
+    return;
+}
+
+static int aio_flush_request(void *opaque)
+{
+    BDRVNBDState *s = opaque;
+
+    return !QLIST_EMPTY(&s->outstanding_aio_head);
+}
+
+/*
+ * Connect to the NBD server specified in the state object
+ */
+static int nbd_establish_connection(BlockDriverState *bs)
+{
+    BDRVNBDState *s = bs->opaque;
+    int sock;
+    int ret;
+    off_t size;
+    size_t blocksize;
+    uint32_t nbdflags;
+
+    if (s->tcp_conn == true) {
+        sock = tcp_socket_outgoing(s->host_spec, s->tcp_port);
+    } else {
+        sock = unix_socket_outgoing(s->host_spec);
+    }
+
+    /* Failed to establish connection */
     if (sock == -1) {
-        err = -errno;
-        goto out;
+        logout("Failed to establish connection to NBD server\n");
+        return -errno;
     }
 
-    ret = nbd_receive_negotiate(sock, name, &nbdflags, &size, &blocksize);
+    /* NBD handshake */
+    ret = nbd_receive_negotiate(sock, s->export_name, &nbdflags, &size,
+                                &blocksize);
     if (ret == -1) {
-        err = -errno;
-        goto out;
+        logout("Failed to negotiate with the NBD server\n");
+        closesocket(sock);
+        return -errno;
     }
 
+    /* Now that we're connected, set the socket to be non-blocking */
+    socket_set_nonblock(sock);
+
     s->sock = sock;
     s->size = size;
     s->blocksize = blocksize;
-    err = 0;
 
-out:
-    qemu_free(file);
-    return err;
+    /* Response handler. This is called when there is data to read */
+    qemu_aio_set_fd_handler(sock, aio_read_response, NULL, aio_flush_request,
+                            NULL, s);
+    logout("Established connection with NBD server\n");
+    return 0;
 }
 
-static int nbd_read(BlockDriverState *bs, int64_t sector_num,
-                    uint8_t *buf, int nb_sectors)
+static void nbd_teardown_connection(BlockDriverState *bs)
 {
+    /* Send the final packet to the NBD server and close the socket */
     BDRVNBDState *s = bs->opaque;
     struct nbd_request request;
-    struct nbd_reply reply;
 
-    request.type = NBD_CMD_READ;
+    request.type = NBD_CMD_DISC;
     request.handle = (uint64_t)(intptr_t)bs;
-    request.from = sector_num * 512;;
-    request.len = nb_sectors * 512;
+    request.from = 0;
+    request.len = 0;
+    nbd_send_request(s->sock, &request);
 
-    if (nbd_send_request(s->sock, &request) == -1)
-        return -errno;
+    qemu_aio_set_fd_handler(s->sock, NULL, NULL, NULL, NULL, NULL);
+    closesocket(s->sock);
+    logout("Connection to NBD server closed\n");
+    return;
+}
 
-    if (nbd_receive_reply(s->sock, &reply) == -1)
-        return -errno;
+static int nbd_open(BlockDriverState *bs, const char* filename, int flags)
+{
+    BDRVNBDState *s = bs->opaque;
+    int result;
 
-    if (reply.error !=0)
-        return -reply.error;
+    /* Pop the config into our state object. Exit if invalid. */
+    result = nbd_config(s, filename, flags);
 
-    if (reply.handle != request.handle)
-        return -EIO;
+    if (result != 0) {
+        return result;
+    }
 
-    if (nbd_wr_sync(s->sock, buf, request.len, 1) != request.len)
-        return -EIO;
+    QLIST_INIT(&s->outstanding_aio_head);
+
+    /* establish TCP connection, return error if it fails
+     * TODO: Configurable retry-until-timeout behaviour.
+     */
+    result = nbd_establish_connection(bs);
+    if (result != 0) {
+        return result;
+    }
 
     return 0;
 }
 
-static int nbd_write(BlockDriverState *bs, int64_t sector_num,
-                     const uint8_t *buf, int nb_sectors)
+static void nbd_close(BlockDriverState *bs)
 {
+    nbd_teardown_connection(bs);
     BDRVNBDState *s = bs->opaque;
+
+    if (s->export_name != NULL) {
+        qemu_free(s->export_name);
+    }
+    if (s->host_spec != NULL) {
+        qemu_free(s->host_spec);
+    }
+
+    return;
+}
+
+static int add_aio_request(BDRVNBDState *s, AIOReq *aio_req, QEMUIOVector *qiov,
+                           enum AIOCBState aiocb_type)
+{
     struct nbd_request request;
-    struct nbd_reply reply;
 
-    request.type = NBD_CMD_WRITE;
-    request.handle = (uint64_t)(intptr_t)bs;
-    request.from = sector_num * 512;;
-    request.len = nb_sectors * 512;
+    request.from = aio_req->offset;
+    request.len = aio_req->data_len;
+    request.handle = aio_req->handle;
+
+    if (aiocb_type == AIOCB_READ_UDATA) {
+        request.type = NBD_CMD_READ;
+    } else {
+        request.type = NBD_CMD_WRITE;
+    }
 
-    if (nbd_send_request(s->sock, &request) == -1)
+    /* Write the request to the socket. Header first. */
+    if (nbd_send_request(s->sock, &request) == -1) {
+    /* TODO: retry handling. This leads to -EIO and request cancellation */
+        logout("writing request header to server failed\n");
         return -errno;
+    }
 
-    if (nbd_wr_sync(s->sock, (uint8_t*)buf, request.len, 0) != request.len)
-        return -EIO;
+    /* If this is a write, send the data too */
+    if (aiocb_type == AIOCB_WRITE_UDATA) {
+        int ret = 0;
+        off_t offset = 0;
+        size_t total = aio_req->data_len;
+
+        while (offset < total) {
+            ret = nbd_wr_aio(s->sock, qiov->iov, total - offset,
+                             offset, false);
+            if (ret == -1) {
+                logout("Error writing request data to NBD server: %i (%s)\n",
+                       errno, strerror(errno));
+                return -EIO;
+            }
 
-    if (nbd_receive_reply(s->sock, &reply) == -1)
-        return -errno;
+            offset += ret;
+        }
+    }
+
+    return 0;
+}
+
+static inline AIOReq *alloc_aio_req(BDRVNBDState *s, NBDAIOCB *acb,
+                                    unsigned int data_len,
+                                    uint64_t offset, uint8_t flags,
+                                    unsigned int iov_offset)
+{
+    AIOReq *aio_req;
+
+    aio_req = qemu_malloc(sizeof(*aio_req));
+    aio_req->aiocb = acb;
+    aio_req->iov_offset = iov_offset;
+    aio_req->offset = offset;
+    aio_req->data_len = data_len;
+    aio_req->flags = flags;
+    aio_req->handle = s->aioreq_seq_num++; /* FIXME: Trivially guessable */
+
+    QLIST_INSERT_HEAD(&s->outstanding_aio_head, aio_req,
+                      outstanding_aio_siblings);
+    QLIST_INSERT_HEAD(&acb->aioreq_head, aio_req, aioreq_siblings);
+
+    return aio_req;
+}
+
+static void nbd_finish_aiocb(NBDAIOCB *acb)
+{
+    if (!acb->canceled) {
+        acb->common.cb(acb->common.opaque, acb->ret);
+    }
+    qemu_aio_release(acb);
+}
+
+
+static void nbd_aio_cancel(BlockDriverAIOCB *blockacb)
+{
+    NBDAIOCB *acb = (NBDAIOCB *)blockacb;
+
+    /*
+     * We cannot cancel the requests which are already sent to
+     * the servers, so we just complete the request with -EIO here.
+     */
+    acb->common.cb(acb->common.opaque, -EIO);
+    acb->canceled = 1;
+}
+
+static AIOPool nbd_aio_pool = {
+    .aiocb_size = sizeof(NBDAIOCB),
+    .cancel = nbd_aio_cancel,
+};
+
+static NBDAIOCB *nbd_aio_setup(BlockDriverState *bs, QEMUIOVector *qiov,
+                                   int64_t sector_num, int nb_sectors,
+                                   BlockDriverCompletionFunc *cb, void *opaque)
+{
+    NBDAIOCB *acb;
+
+    acb = qemu_aio_get(&nbd_aio_pool, bs, cb, opaque);
 
-    if (reply.error !=0)
-        return -reply.error;
+    acb->qiov = qiov;
+
+    acb->sector_num = sector_num;
+    acb->nb_sectors = nb_sectors;
+
+    acb->aio_done_func = NULL;
+    acb->canceled = 0;
+    acb->bh = NULL;
+    acb->ret = 0;
+    QLIST_INIT(&acb->aioreq_head);
+    return acb;
+}
 
-    if (reply.handle != request.handle)
+static int nbd_schedule_bh(QEMUBHFunc *cb, NBDAIOCB *acb)
+{
+    if (acb->bh) {
+        logout("bug: %d %d\n", acb->aiocb_type, acb->aiocb_type);
+        return -EIO;
+    }
+
+    acb->bh = qemu_bh_new(cb, acb);
+    if (!acb->bh) {
+        logout("oom: %d %d\n", acb->aiocb_type, acb->aiocb_type);
         return -EIO;
+    }
+
+    qemu_bh_schedule(acb->bh);
 
     return 0;
 }
 
-static void nbd_close(BlockDriverState *bs)
+/*
+ * Send I/O requests to the server.
+ *
+ * This function sends requests to the server, links the requests to
+ * the outstanding_list in BDRVNBDState, and exits without waiting for
+ * the response.  The responses are received in the `aio_read_response'
+ * function which is called from the main loop as a fd handler.
+ */
+static void nbd_readv_writev_bh_cb(void *p)
 {
-    BDRVNBDState *s = bs->opaque;
-    struct nbd_request request;
+    NBDAIOCB *acb = p;
+    int ret = 0;
+    unsigned long len, done = 0, total = acb->nb_sectors * SECTOR_SIZE;
+    unsigned long idx = acb->sector_num;
 
-    request.type = NBD_CMD_DISC;
-    request.handle = (uint64_t)(intptr_t)bs;
-    request.from = 0;
-    request.len = 0;
-    nbd_send_request(s->sock, &request);
+    uint64_t offset = acb->sector_num * SECTOR_SIZE;
+    BDRVNBDState *s = acb->common.bs->opaque;
+
+    AIOReq *aio_req;
 
-    close(s->sock);
+    qemu_bh_delete(acb->bh);
+    acb->bh = NULL;
+
+    while (done != total) {
+        uint8_t flags = 0;
+
+        len = total - done;
+
+        aio_req = alloc_aio_req(s, acb, len, offset, flags, done);
+
+        ret = add_aio_request(s, aio_req, acb->qiov, acb->aiocb_type);
+
+        if (ret < 0) {
+            free_aio_req(s, aio_req);
+            acb->ret = -EIO;
+            goto out;
+        }
+
+        offset = 0;
+        idx++;
+        done += len;
+    }
+out:
+    if (QLIST_EMPTY(&acb->aioreq_head)) {
+        nbd_finish_aiocb(acb);
+    }
 }
 
+static BlockDriverAIOCB *nbd_aio_readv(BlockDriverState *bs,
+        int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
+        BlockDriverCompletionFunc *cb, void *opaque)
+{
+    NBDAIOCB *acb;
+    int i;
+
+    acb = nbd_aio_setup(bs, qiov, sector_num, nb_sectors, cb, opaque);
+    acb->aiocb_type = AIOCB_READ_UDATA;
+    acb->aio_done_func = nbd_finish_aiocb;
+
+    for (i = 0; i < qiov->niov; i++) {
+        memset(qiov->iov[i].iov_base, 0, qiov->iov[i].iov_len);
+    }
+
+    nbd_schedule_bh(nbd_readv_writev_bh_cb, acb);
+    return &acb->common;
+}
+
+static BlockDriverAIOCB *nbd_aio_writev(BlockDriverState *bs,
+        int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
+        BlockDriverCompletionFunc *cb, void *opaque)
+{
+    NBDAIOCB *acb;
+
+    acb = nbd_aio_setup(bs, qiov, sector_num, nb_sectors, cb, opaque);
+    acb->aiocb_type = AIOCB_WRITE_UDATA;
+    acb->aio_done_func = nbd_finish_aiocb;
+
+    nbd_schedule_bh(nbd_readv_writev_bh_cb, acb);
+    return &acb->common;
+}
+
+
 static int64_t nbd_getlength(BlockDriverState *bs)
 {
     BDRVNBDState *s = bs->opaque;
-
     return s->size;
 }
 
 static BlockDriver bdrv_nbd = {
-    .format_name	= "nbd",
-    .instance_size	= sizeof(BDRVNBDState),
-    .bdrv_file_open	= nbd_open,
-    .bdrv_read		= nbd_read,
-    .bdrv_write		= nbd_write,
-    .bdrv_close		= nbd_close,
-    .bdrv_getlength	= nbd_getlength,
-    .protocol_name	= "nbd",
+    .format_name     = "nbd",
+    .instance_size   = sizeof(BDRVNBDState),
+    .bdrv_file_open  = nbd_open,
+    .bdrv_aio_readv  = nbd_aio_readv,
+    .bdrv_aio_writev = nbd_aio_writev,
+    .bdrv_close      = nbd_close,
+    .bdrv_getlength  = nbd_getlength,
+    .protocol_name   = "nbd"
 };
 
 static void bdrv_nbd_init(void)
-- 
1.7.0.4


* Re: [Qemu-devel] [PATCH 3/3] block/nbd: Make the NBD block device use the AIO interface
  2011-02-17 16:34         ` [Qemu-devel] [PATCH 3/3] block/nbd: Make the NBD block device use the AIO interface Nicholas Thomas
@ 2011-02-17 19:28           ` Nicholas Thomas
  2011-02-18 12:16             ` [Qemu-devel] [PATCH 3/3 v2] " Nicholas Thomas
  0 siblings, 1 reply; 13+ messages in thread
From: Nicholas Thomas @ 2011-02-17 19:28 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Stefan Hajnoczi, qemu-devel

Ho hum.

On Thu, 2011-02-17 at 16:34 +0000, Nicholas Thomas wrote:
> Signed-off-by: Nick Thomas <nick@bytemark.co.uk>
> ---
>  block/nbd.c |  549 ++++++++++++++++++++++++++++++++++++++++++++++++++---------
>  1 files changed, 464 insertions(+), 85 deletions(-)

Additional testing has revealed that this code breaks the stock
nbd-server (the one on sourceforge) when large (well, 1.3MiB) write
requests are sent to it.

The server process receives a valid-looking NBD write request header
followed by the first ~200K of the write data (sent from
add_aio_request). It then exits (errcode 1).

add_aio_request returns the error, and subsequently, aio_read_response
picks up the aioreq and loops forever, trying to read a response from a
closed socket.

Reads and small writes seem to work fine, however.

I'll debug the server tomorrow and try to see what's breaking it - no
good having a technically compliant client if it breaks the most common
server out there ;)

/Nick


* Re: [Qemu-devel] [PATCH 3/3 v2] block/nbd: Make the NBD block device use the AIO interface
  2011-02-17 19:28           ` Nicholas Thomas
@ 2011-02-18 12:16             ` Nicholas Thomas
  2011-02-18 12:23               ` Kevin Wolf
  0 siblings, 1 reply; 13+ messages in thread
From: Nicholas Thomas @ 2011-02-18 12:16 UTC (permalink / raw)
  To: kwolf; +Cc: stefanha, qemu-devel

On Thu, 2011-02-17 at 19:28 +0000, Nicholas Thomas wrote:

> Additional testing has revealed that this code breaks the stock
> nbd-server (the one on sourceforge) when large (well, 1.3MiB) write
> requests are sent to it.

....NBD has a limit of 1MB on the size of write requests.
NBD_BUFFER_SIZE in qemu-nbd.c - and I'm sure that's what's knocking out
the standard NBD server too.

I didn't see any option to tell QEMU to split up writes to a certain
size before handing them off to the block driver, so I split the writes
up into multiple acbs. Reworked patch:


This preserves the previous behaviour when the NBD server is
unavailable or goes away during guest execution, but switches the
NBD backend to present the AIO interface instead of the synchronous
IO interface.

We also split write requests into 1 MiB blocks (minus the request
header size). This is a hard limit in the NBD servers (including
qemu-nbd), but it never came up with the previous backend code.
---
 block/nbd.c |  555 ++++++++++++++++++++++++++++++++++++++++++++++++++---------
 1 files changed, 470 insertions(+), 85 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index c8dc763..59de69d 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -1,11 +1,12 @@
 /*
- * QEMU Block driver for  NBD
+ * QEMU Block driver for  NBD - asynchronous IO
  *
  * Copyright (C) 2008 Bull S.A.S.
  *     Author: Laurent Vivier <Laurent.Vivier@bull.net>
  *
  * Some parts:
  *    Copyright (C) 2007 Anthony Liguori <anthony@codemonkey.ws>
+ *    Copyright (C) 2011 Nick Thomas <nick@bytemark.co.uk>
  *
  * Permission is hereby granted, free of charge, to any person obtaining a copy
  * of this software and associated documentation files (the "Software"), to deal
@@ -27,66 +28,135 @@
  */
 
 #include "qemu-common.h"
+#include "qemu_socket.h"
 #include "nbd.h"
 #include "module.h"
 
 #include <sys/types.h>
 #include <unistd.h>
 
-#define EN_OPTSTR ":exportname="
+#define EN_OPTSTR      ":exportname="
+#define SECTOR_SIZE    512
+
+/* 1MiB minus request header size */
+#define MAX_NBD_WRITE      ((1024*1024) - (4 + 4 + 8 + 8 + 4))
+
+#define DEBUG_NBD
+
+#if defined(DEBUG_NBD)
+#define logout(fmt, ...) \
+                fprintf(stderr, "nbd\t%-24s" fmt, __func__, ##__VA_ARGS__)
+#else
+#define logout(fmt, ...) ((void)0)
+#endif
+
+
+typedef struct NBDAIOCB NBDAIOCB;
+
+typedef struct AIOReq {
+    NBDAIOCB *aiocb;
+    off_t iov_offset; /* Where on the iov does this req start? */
+    off_t offset;     /* Starting point of the read */
+
+    size_t data_len;
+    uint8_t flags;
+    uint64_t handle;
+
+    QLIST_ENTRY(AIOReq) outstanding_aio_siblings;
+    QLIST_ENTRY(AIOReq) aioreq_siblings;
+} AIOReq;
+
 
 typedef struct BDRVNBDState {
     int sock;
     off_t size;
     size_t blocksize;
+
+    /* Filled in by nbd_config. Store host_spec because DNS may change */
+    bool tcp_conn;      /* True, we use TCP. False, UNIX domain sockets */
+    char *export_name;  /* An NBD server may export several devices */
+    char *host_spec;    /* Path to socket (UNIX) or hostname/IP (TCP) */
+    uint16_t tcp_port;
+
+    /* We use these for asynchronous I/O */
+    uint64_t aioreq_seq_num;
+    QLIST_HEAD(outstanding_aio_head, AIOReq) outstanding_aio_head;
 } BDRVNBDState;
 
-static int nbd_open(BlockDriverState *bs, const char* filename, int flags)
+enum AIOCBState {
+    AIOCB_WRITE_UDATA,
+    AIOCB_READ_UDATA,
+};
+
+struct NBDAIOCB {
+    BlockDriverAIOCB common;
+
+    QEMUIOVector *qiov;
+
+    int64_t sector_num;
+    int nb_sectors;
+
+    int ret;
+    enum AIOCBState aiocb_type;
+
+    QEMUBH *bh;
+    void (*aio_done_func)(NBDAIOCB *);
+
+    int canceled;
+
+    QLIST_HEAD(aioreq_head, AIOReq) aioreq_head;
+};
+
+static inline int free_aio_req(BDRVNBDState *s, AIOReq *aio_req)
 {
-    BDRVNBDState *s = bs->opaque;
-    uint32_t nbdflags;
+    NBDAIOCB *acb = aio_req->aiocb;
+    QLIST_REMOVE(aio_req, outstanding_aio_siblings);
+    QLIST_REMOVE(aio_req, aioreq_siblings);
+    qemu_free(aio_req);
 
+    return !QLIST_EMPTY(&acb->aioreq_head);
+}
+
+static int nbd_config(BDRVNBDState *s, const char* filename, int flags)
+{
     char *file;
-    char *name;
-    const char *host;
+    char *export_name;
+    const char *host_spec;
     const char *unixpath;
-    int sock;
-    off_t size;
-    size_t blocksize;
-    int ret;
     int err = -EINVAL;
 
     file = qemu_strdup(filename);
 
-    name = strstr(file, EN_OPTSTR);
-    if (name) {
-        if (name[strlen(EN_OPTSTR)] == 0) {
+    export_name = strstr(file, EN_OPTSTR);
+    if (export_name) {
+        if (export_name[strlen(EN_OPTSTR)] == 0) {
             goto out;
         }
-        name[0] = 0;
-        name += strlen(EN_OPTSTR);
+        export_name[0] = 0; /* truncate 'file' */
+        export_name += strlen(EN_OPTSTR);
+        s->export_name = qemu_strdup(export_name);
     }
 
-    if (!strstart(file, "nbd:", &host)) {
+    /* extract the host_spec - fail if it's not nbd:* */
+    if (!strstart(file, "nbd:", &host_spec)) {
         goto out;
     }
 
-    if (strstart(host, "unix:", &unixpath)) {
-
-        if (unixpath[0] != '/') {
+    /* are we a UNIX or TCP socket? */
+    if (strstart(host_spec, "unix:", &unixpath)) {
+        if (unixpath[0] != '/') { /* We demand an absolute path */
             goto out;
         }
-
-        sock = unix_socket_outgoing(unixpath);
-
+        s->tcp_conn = false;
+        s->host_spec = qemu_strdup(unixpath);
     } else {
+        /* We should have an <IPv4 address>:<port> string to split up */
         uint16_t port = NBD_DEFAULT_PORT;
         char *p, *r;
         char hostname[128];
 
-        pstrcpy(hostname, 128, host);
-
-        p = strchr(hostname, ':');
+        pstrcpy(hostname, 128, host_spec);
+        p = strchr(hostname, ':'); /* FIXME: IPv6 */
         if (p != NULL) {
             *p = '\0';
             p++;
@@ -96,121 +166,436 @@ static int nbd_open(BlockDriverState *bs, const char* filename, int flags)
                 goto out;
             }
         }
+        s->tcp_conn = true;
+        s->host_spec = qemu_strdup(hostname);
+        s->tcp_port = port;
+    }
+
+    err = 0;
+
+out:
+    qemu_free(file);
+    if (err != 0) {
+        if (s->export_name != NULL) {
+            qemu_free(s->export_name);
+        }
+        if (s->host_spec != NULL) {
+            qemu_free(s->host_spec);
+        }
+    }
+    return err;
+}
+
+static void aio_read_response(void *opaque)
+{
+    BDRVNBDState *s = opaque;
+    struct nbd_reply reply;
+
+    AIOReq *aio_req = NULL;
+    NBDAIOCB *acb;
+    int rest;
+
+    if (QLIST_EMPTY(&s->outstanding_aio_head)) {
+        return;
+    }
+
+    /* read the header */
+    if (nbd_receive_reply(s->sock, &reply) == -1) {
+        logout("Failed to read response from socket\n");
+        /* Having failed to read the reply header, we can't know which
+         * aio_req this corresponds to - so we can't signal a failure.
+         */
+        return;
+    }
+
+    /* find the right aio_req from the outstanding_aio list */
+    QLIST_FOREACH(aio_req, &s->outstanding_aio_head, outstanding_aio_siblings) {
+        if (aio_req->handle == reply.handle) {
+            break;
+        }
+    }
+
+    if (!aio_req) {
+        logout("cannot find aio_req for handle %lu\n", reply.handle);
+        return;
+    }
+
+    acb = aio_req->aiocb;
+
+    if (acb->aiocb_type == AIOCB_READ_UDATA) {
+        off_t offset = 0;
+        int ret = 0;
+        size_t total = aio_req->data_len;
+
+        while (offset < total) {
+            ret = nbd_wr_aio(s->sock, acb->qiov->iov, total - offset, offset,
+                             true);
+
+            if (ret == -1) {
+                logout("Error reading from NBD server: %i (%s)\n",
+                       errno, strerror(errno));
+                return;
+            }
+
+            offset += ret;
+        }
+    }
 
-        sock = tcp_socket_outgoing(hostname, port);
+    if (reply.error != 0) {
+        acb->ret = -EIO;
+        logout("NBD request resulted in error %i\n", reply.error);
     }
 
+    rest = free_aio_req(s, aio_req);
+    if (!rest) {
+        acb->aio_done_func(acb);
+    }
+
+    return;
+}
+
+static int aio_flush_request(void *opaque)
+{
+    BDRVNBDState *s = opaque;
+
+    return !QLIST_EMPTY(&s->outstanding_aio_head);
+}
+
+/*
+ * Connect to the NBD server specified in the state object
+ */
+static int nbd_establish_connection(BlockDriverState *bs)
+{
+    BDRVNBDState *s = bs->opaque;
+    int sock;
+    int ret;
+    off_t size;
+    size_t blocksize;
+    uint32_t nbdflags;
+
+    if (s->tcp_conn == true) {
+        sock = tcp_socket_outgoing(s->host_spec, s->tcp_port);
+    } else {
+        sock = unix_socket_outgoing(s->host_spec);
+    }
+
+    /* Failed to establish connection */
     if (sock == -1) {
-        err = -errno;
-        goto out;
+        logout("Failed to establish connection to NBD server\n");
+        return -errno;
     }
 
-    ret = nbd_receive_negotiate(sock, name, &nbdflags, &size, &blocksize);
+    /* NBD handshake */
+    ret = nbd_receive_negotiate(sock, s->export_name, &nbdflags, &size,
+                                &blocksize);
     if (ret == -1) {
-        err = -errno;
-        goto out;
+        logout("Failed to negotiate with the NBD server\n");
+        closesocket(sock);
+        return -errno;
     }
 
+    /* Now that we're connected, set the socket to be non-blocking */
+    socket_set_nonblock(sock);
+
     s->sock = sock;
     s->size = size;
     s->blocksize = blocksize;
-    err = 0;
 
-out:
-    qemu_free(file);
-    return err;
+    /* Response handler. This is called when there is data to read */
+    qemu_aio_set_fd_handler(sock, aio_read_response, NULL, aio_flush_request,
+                            NULL, s);
+    logout("Established connection with NBD server\n");
+    return 0;
 }
 
-static int nbd_read(BlockDriverState *bs, int64_t sector_num,
-                    uint8_t *buf, int nb_sectors)
+static void nbd_teardown_connection(BlockDriverState *bs)
 {
+    /* Send the final packet to the NBD server and close the socket */
     BDRVNBDState *s = bs->opaque;
     struct nbd_request request;
-    struct nbd_reply reply;
 
-    request.type = NBD_CMD_READ;
+    request.type = NBD_CMD_DISC;
     request.handle = (uint64_t)(intptr_t)bs;
-    request.from = sector_num * 512;;
-    request.len = nb_sectors * 512;
+    request.from = 0;
+    request.len = 0;
+    nbd_send_request(s->sock, &request);
 
-    if (nbd_send_request(s->sock, &request) == -1)
-        return -errno;
+    qemu_aio_set_fd_handler(s->sock, NULL, NULL, NULL, NULL, NULL);
+    closesocket(s->sock);
+    logout("Connection to NBD server closed\n");
+    return;
+}
 
-    if (nbd_receive_reply(s->sock, &reply) == -1)
-        return -errno;
+static int nbd_open(BlockDriverState *bs, const char* filename, int flags)
+{
+    BDRVNBDState *s = bs->opaque;
+    int result;
 
-    if (reply.error !=0)
-        return -reply.error;
+    /* Pop the config into our state object. Exit if invalid. */
+    result = nbd_config(s, filename, flags);
 
-    if (reply.handle != request.handle)
-        return -EIO;
+    if (result != 0) {
+        return result;
+    }
 
-    if (nbd_wr_sync(s->sock, buf, request.len, 1) != request.len)
-        return -EIO;
+    QLIST_INIT(&s->outstanding_aio_head);
+
+    /* establish TCP connection, return error if it fails
+     * TODO: Configurable retry-until-timeout behaviour.
+     */
+    result = nbd_establish_connection(bs);
+    if (result != 0) {
+        return result;
+    }
 
     return 0;
 }
 
-static int nbd_write(BlockDriverState *bs, int64_t sector_num,
-                     const uint8_t *buf, int nb_sectors)
+static void nbd_close(BlockDriverState *bs)
 {
+    nbd_teardown_connection(bs);
     BDRVNBDState *s = bs->opaque;
+
+    if (s->export_name != NULL) {
+        qemu_free(s->export_name);
+    }
+    if (s->host_spec != NULL) {
+        qemu_free(s->host_spec);
+    }
+
+    return;
+}
+
+static int add_aio_request(BDRVNBDState *s, AIOReq *aio_req, QEMUIOVector *qiov,
+                           enum AIOCBState aiocb_type)
+{
     struct nbd_request request;
-    struct nbd_reply reply;
 
-    request.type = NBD_CMD_WRITE;
-    request.handle = (uint64_t)(intptr_t)bs;
-    request.from = sector_num * 512;;
-    request.len = nb_sectors * 512;
+    request.from = aio_req->offset;
+    request.len = aio_req->data_len;
+    request.handle = aio_req->handle;
+
+    if (aiocb_type == AIOCB_READ_UDATA) {
+        request.type = NBD_CMD_READ;
+    } else {
+        request.type = NBD_CMD_WRITE;
+    }
 
-    if (nbd_send_request(s->sock, &request) == -1)
+    /* Write the request to the socket. Header first. */
+    if (nbd_send_request(s->sock, &request) == -1) {
+    /* TODO: retry handling. This leads to -EIO and request cancellation */
+        logout("writing request header to server failed\n");
         return -errno;
+    }
 
-    if (nbd_wr_sync(s->sock, (uint8_t*)buf, request.len, 0) != request.len)
-        return -EIO;
+    /* If this is a write, send the data too */
+    if (aiocb_type == AIOCB_WRITE_UDATA) {
+        int ret = 0;
+        off_t offset = 0;
+        size_t total = aio_req->data_len;
+
+        while (offset < total) {
+            ret = nbd_wr_aio(s->sock, qiov->iov, total - offset,
+                             offset + aio_req->iov_offset, false);
+            if (ret == -1) {
+                logout("Error writing request data to NBD server: %i (%s)\n",
+                       errno, strerror(errno));
+                return -EIO;
+            }
 
-    if (nbd_receive_reply(s->sock, &reply) == -1)
-        return -errno;
+            offset += ret;
+        }
+    }
 
-    if (reply.error !=0)
-        return -reply.error;
+    return 0;
+}
+
+static inline AIOReq *alloc_aio_req(BDRVNBDState *s, NBDAIOCB *acb,
+                                    size_t data_len,
+                                    off_t offset,
+                                    off_t iov_offset)
+{
+    AIOReq *aio_req;
 
-    if (reply.handle != request.handle)
+    aio_req = qemu_malloc(sizeof(*aio_req));
+    aio_req->aiocb = acb;
+    aio_req->iov_offset = iov_offset;
+    aio_req->offset = offset;
+    aio_req->data_len = data_len;
+    aio_req->handle = s->aioreq_seq_num++; /* FIXME: Trivially guessable */
+
+    QLIST_INSERT_HEAD(&s->outstanding_aio_head, aio_req,
+                      outstanding_aio_siblings);
+    QLIST_INSERT_HEAD(&acb->aioreq_head, aio_req, aioreq_siblings);
+
+    return aio_req;
+}
+
+static void nbd_finish_aiocb(NBDAIOCB *acb)
+{
+    if (!acb->canceled) {
+        acb->common.cb(acb->common.opaque, acb->ret);
+    }
+    qemu_aio_release(acb);
+}
+
+
+static void nbd_aio_cancel(BlockDriverAIOCB *blockacb)
+{
+    NBDAIOCB *acb = (NBDAIOCB *)blockacb;
+
+    /*
+     * We cannot cancel the requests which are already sent to
+     * the servers, so we just complete the request with -EIO here.
+     */
+    acb->common.cb(acb->common.opaque, -EIO);
+    acb->canceled = 1;
+}
+
+static AIOPool nbd_aio_pool = {
+    .aiocb_size = sizeof(NBDAIOCB),
+    .cancel = nbd_aio_cancel,
+};
+
+static NBDAIOCB *nbd_aio_setup(BlockDriverState *bs, QEMUIOVector *qiov,
+                                   int64_t sector_num, int nb_sectors,
+                                   BlockDriverCompletionFunc *cb, void *opaque)
+{
+    NBDAIOCB *acb;
+
+    acb = qemu_aio_get(&nbd_aio_pool, bs, cb, opaque);
+
+    acb->qiov = qiov;
+
+    acb->sector_num = sector_num;
+    acb->nb_sectors = nb_sectors;
+
+    acb->aio_done_func = NULL;
+    acb->canceled = 0;
+    acb->bh = NULL;
+    acb->ret = 0;
+    QLIST_INIT(&acb->aioreq_head);
+    return acb;
+}
+
+static int nbd_schedule_bh(QEMUBHFunc *cb, NBDAIOCB *acb)
+{
+    if (acb->bh) {
+        logout("bug: %d %d\n", acb->aiocb_type, acb->aiocb_type);
         return -EIO;
+    }
+
+    acb->bh = qemu_bh_new(cb, acb);
+    if (!acb->bh) {
+        logout("oom: %d %d\n", acb->aiocb_type, acb->aiocb_type);
+        return -EIO;
+    }
+
+    qemu_bh_schedule(acb->bh);
 
     return 0;
 }
 
-static void nbd_close(BlockDriverState *bs)
+/*
+ * Send I/O requests to the server.
+ *
+ * This function sends requests to the server, links the requests to
+ * the outstanding_list in BDRVNBDState, and exits without waiting for
+ * the response.  The responses are received in the `aio_read_response'
+ * function which is called from the main loop as a fd handler.
+ * If this is a write request and it's >1MB, split it into multiple AIOReqs
+ */
+static void nbd_readv_writev_bh_cb(void *p)
 {
-    BDRVNBDState *s = bs->opaque;
-    struct nbd_request request;
+    NBDAIOCB *acb = p;
+    int ret = 0;
 
-    request.type = NBD_CMD_DISC;
-    request.handle = (uint64_t)(intptr_t)bs;
-    request.from = 0;
-    request.len = 0;
-    nbd_send_request(s->sock, &request);
+    size_t len, done = 0;
+    size_t total = acb->nb_sectors * SECTOR_SIZE;
+
+    /* Where the read/write starts from */
+    size_t offset = acb->sector_num * SECTOR_SIZE;
+    BDRVNBDState *s = acb->common.bs->opaque;
+
+    AIOReq *aio_req;
 
-    close(s->sock);
+    qemu_bh_delete(acb->bh);
+    acb->bh = NULL;
+
+    while (done != total) {
+        len = (total - done);
+
+        /* Split write requests into 1MiB segments */
+        if (acb->aiocb_type == AIOCB_WRITE_UDATA && len > MAX_NBD_WRITE) {
+            len = MAX_NBD_WRITE;
+        }
+
+        aio_req = alloc_aio_req(s, acb, len, offset + done, done);
+        ret = add_aio_request(s, aio_req, acb->qiov, acb->aiocb_type);
+
+        if (ret < 0) {
+            free_aio_req(s, aio_req);
+            acb->ret = -EIO;
+            goto out;
+        }
+
+        done += len;
+    }
+out:
+    if (QLIST_EMPTY(&acb->aioreq_head)) {
+        nbd_finish_aiocb(acb);
+    }
 }
 
+static BlockDriverAIOCB *nbd_aio_readv(BlockDriverState *bs,
+        int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
+        BlockDriverCompletionFunc *cb, void *opaque)
+{
+    NBDAIOCB *acb;
+    int i;
+
+    acb = nbd_aio_setup(bs, qiov, sector_num, nb_sectors, cb, opaque);
+    acb->aiocb_type = AIOCB_READ_UDATA;
+    acb->aio_done_func = nbd_finish_aiocb;
+
+    for (i = 0; i < qiov->niov; i++) {
+        memset(qiov->iov[i].iov_base, 0, qiov->iov[i].iov_len);
+    }
+
+    nbd_schedule_bh(nbd_readv_writev_bh_cb, acb);
+    return &acb->common;
+}
+
+static BlockDriverAIOCB *nbd_aio_writev(BlockDriverState *bs,
+        int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
+        BlockDriverCompletionFunc *cb, void *opaque)
+{
+    NBDAIOCB *acb;
+
+    acb = nbd_aio_setup(bs, qiov, sector_num, nb_sectors, cb, opaque);
+    acb->aiocb_type = AIOCB_WRITE_UDATA;
+    acb->aio_done_func = nbd_finish_aiocb;
+
+    nbd_schedule_bh(nbd_readv_writev_bh_cb, acb);
+    return &acb->common;
+}
+
+
 static int64_t nbd_getlength(BlockDriverState *bs)
 {
     BDRVNBDState *s = bs->opaque;
-
     return s->size;
 }
 
 static BlockDriver bdrv_nbd = {
-    .format_name	= "nbd",
-    .instance_size	= sizeof(BDRVNBDState),
-    .bdrv_file_open	= nbd_open,
-    .bdrv_read		= nbd_read,
-    .bdrv_write		= nbd_write,
-    .bdrv_close		= nbd_close,
-    .bdrv_getlength	= nbd_getlength,
-    .protocol_name	= "nbd",
+    .format_name     = "nbd",
+    .instance_size   = sizeof(BDRVNBDState),
+    .bdrv_file_open  = nbd_open,
+    .bdrv_aio_readv  = nbd_aio_readv,
+    .bdrv_aio_writev = nbd_aio_writev,
+    .bdrv_close      = nbd_close,
+    .bdrv_getlength  = nbd_getlength,
+    .protocol_name   = "nbd"
 };
 
 static void bdrv_nbd_init(void)
-- 
1.7.0.4


* Re: [Qemu-devel] [PATCH 3/3 v2] block/nbd: Make the NBD block device use the AIO interface
  2011-02-18 12:16             ` [Qemu-devel] [PATCH 3/3 v2] " Nicholas Thomas
@ 2011-02-18 12:23               ` Kevin Wolf
  2011-02-18 12:55                 ` Nicholas Thomas
  0 siblings, 1 reply; 13+ messages in thread
From: Kevin Wolf @ 2011-02-18 12:23 UTC (permalink / raw)
  To: Nicholas Thomas; +Cc: stefanha, qemu-devel

Am 18.02.2011 13:16, schrieb Nicholas Thomas:
> On Thu, 2011-02-17 at 19:28 +0000, Nicholas Thomas wrote:
> 
>> Additional testing has revealed that this code breaks the stock
>> nbd-server (the one on sourceforge) when large (well, 1.3MiB) write
>> requests are sent to it.
> 
> ....NBD has a limit of 1MB on the size of write requests.
> NBD_BUFFER_SIZE in qemu-nbd.c - and I'm sure that's what's knocking out
> the standard NBD server too.
> 
> I didn't see any option to tell QEMU to split up writes to a certain
> size before handing them off to the block driver, so I split the writes
> up into multiple acbs. Reworked patch:

Splitting it up manually sounds right.

I haven't had a close look at your patches yet, but one thing that I
noticed is that your patches are corrupted by line wraps. Please
consider using git-send-email to avoid this kind of trouble or configure
your mailer so that it stops doing this.

Kevin


* Re: [Qemu-devel] [PATCH 3/3 v2] block/nbd: Make the NBD block device use the AIO interface
  2011-02-18 12:23               ` Kevin Wolf
@ 2011-02-18 12:55                 ` Nicholas Thomas
  0 siblings, 0 replies; 13+ messages in thread
From: Nicholas Thomas @ 2011-02-18 12:55 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: stefanha, qemu-devel

On Fri, 2011-02-18 at 13:23 +0100, Kevin Wolf wrote:
> I haven't had a close look at your patches yet, but one thing that I
> noticed is that your patches are corrupted by line wraps. Please
> consider using git-send-email to avoid this kind of trouble or configure
> your mailer so that it stops doing this.

Argh. Right. Resending them now...

/Nick

