[Qemu-devel] [RFC PATCH 0/2] GlusterFS support in QEMU

* [Qemu-devel] [RFC PATCH 0/2] GlusterFS support in QEMU - v2
@ 2012-07-21  8:29 Bharata B Rao
  2012-07-21  8:30 ` [Qemu-devel] [RFC PATCH 1/2] qemu: Add a config option for GlusterFS as block backend Bharata B Rao
                   ` (4 more replies)
  0 siblings, 5 replies; 20+ messages in thread
From: Bharata B Rao @ 2012-07-21  8:29 UTC (permalink / raw)
  To: qemu-devel; +Cc: Anand Avati, Amar Tumballi, Vijay Bellur

Hi,

Here is the v2 patchset for supporting GlusterFS protocol from QEMU.

This set of patches enables QEMU to boot VM images from gluster volumes.
This is achieved by adding gluster as a new block backend driver in QEMU.
Its already possible to boot from VM images on gluster volumes, but this
patchset provides the ability to boot VM images from gluster volumes by
by-passing the FUSE layer in gluster. In case the image is present on the
local system, it is possible to even bypass client and server translator and
hence the RPC overhead.

The major change in this version is to not implement libglusterfs based
gluster backend within QEMU but instead use libgfapi. libgfapi library
from GlusterFS project provides APIs to access gluster volumes directly.
With the use of libgfapi, the specification of gluster backend from QEMU
matches more closely with the GlusterFS's way of specifying volumes. We now
specify the gluster backed image like this:

-drive file=gluster:server@port:volname:image

- Here 'gluster' is the protocol.
- 'server@port' specifies the server where the volume file specification for
  the given volume resides. 'port' is the port number on which gluster
  management daemon (glusterd) is listening. This is optional and if not
  specified, QEMU will send 0 which will make libgfapi to use the default
  port.
- 'volname' is the name of the gluster volume which contains the VM image.
- 'image' is the path to the actual VM image in the gluster volume.

Note that we are no longer using volfiles directly and use volume names
instead. For this to work, gluster management daemon (glusterd) needs to
be running on the QEMU node. This limits the QEMU user to access the volumes by
the default volfiles that are generated by gluster CLI. This should be
fine as long as gluster CLI provides the capability to generate or regenerate
volume files for a given volume with the xlator set that QEMU user is
interested in. GlusterFS developers tell me that this can be provided with
some enhancements to Gluster CLI/glusterd. Note that the custom volume files
is typically needed when GlusterFS server is co-located with QEMU in
which case it would  be beneficial to get rid of client-server overhead and
RPC communication overhead.

Using the patches
=================
- GlusterFS backend is enabled by specifying --enable-glusterfs with the
  configure script.
- You need to have installed GlusterFS from latest gluster git which provides
  the required libgfapi library. (git://git.gluster.com/glusterfs.git)
- As mentioned above, the VM image on gluster volume can be specified like
  this:
	-drive file=gluster:localhost:testvol:/F17,format=gluster

  Note that format=gluster is not needed ideally and its a work around I have
  until libgfapi provides a working connection cleanup routine (glfs_fini()).
  When the format isn't specified, QEMU figures out the format by doing
  find_image_format that results in one open and close before opening the
  image file long term for standard read and write. Gluster connection
  initialization is done from open and connection termination is done from
  close. But since glfs_fini() isn't working yet, I am bypassing
  find_image_format by specifying format=gluster directly which results in
  just one open and hence I am not limited by glfs_fini().

Changes for v2
==============
- Removed the libglusterfs based gluster backend implementation within
  QEMU and instead using the APIs from libgfapi.
- Change in the specification from file=gluster:volfile:image to
  file=gluster:server@port:volname:image. format=gluster is no longer
  specified.
- Passing iovectors obtained from generic block layer of QEMU directly to
  libgfapi instead of converting them to linear buffers.
- Processing the aio call back directly from the read thread of event handler
  pipe instead of scheduling a BH and delegating the processing to that BH.
- Added async flush (fsync) support.
- Refusal to handle partial read/writes in gluster block driver.
- Other minor cleanups based on review comments from v1 post. There is one
  comment that is yet to be addressed (using a local argument vs using a
  field of bs - Stefan), I will address them in the next iteration.

v1
==
lists.nongnu.org/archive/html/qemu-devel/2012-06/msg01745.html

fio benchmark numbers
=====================
Environment
-----------
Dual core x86_64 laptop
QEMU (c0958559b1a58)
GlusterFS (35810fb2a7a12)
Guest: Fedora 16 (kernel 3.1.0-7.fc16.x86_64)
Host: Fedora 16 (kernel 3.4)
fio-HEAD-47ea504

fio jobfile
-----------
# cat aio-read-direct-seq 
; Read 4 files with aio at different depths
[global]
ioengine=libaio
direct=1
rw=read
bs=128k
size=512m
directory=/data1

[file1]
iodepth=4

[file2]
iodepth=32

[file3]
iodepth=8

[file4]
iodepth=16

Scenarios
---------
Base: QEMU boots from directly from image on gluster brick.
Fuse mount: QEMU boots from VM image on gluster FUSE mount.
Fuse bypass: QEMU uses gluster protocol and uses standard client volfile.
Fuse bypass custom: QEMU uses gluster protocol and uses a minimal client
	volfile that just has client xlator.
RPC bypass: QEMU uses just posix xlator and doesn't depend on gluster server.

Numbers (aggrb, minb and maxb in kB/s. mint and maxt in msec)
-------
			aggrb	minb	maxb	mint	maxt
Base			63076	15769	17488	29979	33248
Fuse mount		29392	7348	9266	56581	71350
Fuse bypass		53609	13402	14909	35164	39119
Fuse bypass custom	62968	15742	17962	29188	33305
RPC bypass		63505	15876	18534	28287	33023

All the scenarios used if=virtio and cache=none options.

^ permalink raw reply	[flat|nested] 20+ messages in thread