From: Gerd Hoffmann <kraxel@redhat.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Anthony Liguori <anthony@codemonkey.ws>,
	kvm-devel <kvm@vger.kernel.org>,
	qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [RFC] Replace posix-aio with custom thread pool
Date: Thu, 11 Dec 2008 17:11:08 +0100
Message-ID: <49413B9C.3030703@redhat.com>
In-Reply-To: <20081211155335.GE14908@random.random>

[-- Attachment #1: Type: text/plain, Size: 1260 bytes --]

Andrea Arcangeli wrote:
>>   * It can't handle block allocation.  Kernel handles that by doing
>>     such writes synchronously via VFS layer (instead of the separate
>>     aio code paths).  Leads to horrible performance and bug reports
>>     such as "installs on sparse files are very slow".
> 
> I think here you mean O_DIRECT regardless of aio/sync API,

Yes.  But kernel aio requires O_DIRECT, so aio users are affected
nevertheless.

> So in kernels that don't support IOCB_CMD_READV/WRITEV, we've simply
> to submit an array of iocb through io_submit (i.e. to convert the iov into a
> vector of iocb, instead of a single iocb pointing to the
> iov). Internally to io_submit a single dma command should be generated
> and the same sg list should be built the same as if we used
> READV/WRITEV. In theory READV/WRITEV should be just a cpu saving
> feature, it shouldn't influence disk bandwidth. If it does, it means
> the bio layer is broken and needs fixing.

Haven't tested that.  Could be it isn't a big problem, extra code size
for the two modes aside.
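
As an illustration of the fallback described above (one iocb per iovec
element instead of a single IOCB_CMD_PREADV/PWRITEV request), here is a
minimal sketch.  It is not code from this thread: the helper name
iov_to_iocbs() and its parameters are made up for the example, and it only
builds the request array, assuming consecutive file offsets.  The result
would then be handed to io_submit(ctx, iovcnt, cbp).

```c
#include <assert.h>
#include <linux/aio_abi.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

/*
 * Hypothetical helper: convert an iovec array into one iocb per element.
 * Each iocb covers the file range that follows the previous one, so the
 * whole batch describes the same transfer a single vectored request would.
 * cbs[] and cbp[] must each have room for iovcnt entries.
 */
static void iov_to_iocbs(int fd, int is_write,
                         const struct iovec *iov, int iovcnt,
                         long long pos,
                         struct iocb *cbs, struct iocb **cbp)
{
    assert(iovcnt >= 0);

    for (int i = 0; i < iovcnt; i++) {
        memset(&cbs[i], 0, sizeof(cbs[i]));
        cbs[i].aio_fildes = fd;
        cbs[i].aio_lio_opcode = is_write ? IOCB_CMD_PWRITE : IOCB_CMD_PREAD;
        cbs[i].aio_buf = (uint64_t)(uintptr_t)iov[i].iov_base;
        cbs[i].aio_nbytes = iov[i].iov_len;
        cbs[i].aio_offset = pos;
        cbs[i].aio_data = (uint64_t)i;  /* lets the caller match completions */
        cbp[i] = &cbs[i];               /* io_submit takes an array of pointers */
        pos += (long long)iov[i].iov_len;
    }
}
```

Whether the block layer merges such a batch into one request is exactly
the untested part, so this only shows the bookkeeping on the submitting side.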

>>> ahem: http://www.daemon-systems.org/man/preadv.2.html
> 
> Too bad nobody implemented it yet...

Kernel side looks easy: attached patch + syscall table wire-up in all
archs ...

cheers,
  Gerd

[-- Attachment #2: preadv.diff --]
[-- Type: text/plain, Size: 1390 bytes --]

diff --git a/fs/read_write.c b/fs/read_write.c
index 969a6d9..d1ea2fd 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -701,6 +701,54 @@ sys_writev(unsigned long fd, const struct iovec __user *vec, unsigned long vlen)
 	return ret;
 }
 
+asmlinkage ssize_t sys_preadv(unsigned int fd, const struct iovec __user *vec,
+                              unsigned long vlen, loff_t pos)
+{
+	struct file *file;
+	ssize_t ret = -EBADF;
+	int fput_needed;
+
+	if (pos < 0)
+		return -EINVAL;
+
+	file = fget_light(fd, &fput_needed);
+	if (file) {
+		ret = -ESPIPE;
+		if (file->f_mode & FMODE_PREAD)
+			ret = vfs_readv(file, vec, vlen, &pos);
+		fput_light(file, fput_needed);
+	}
+
+	if (ret > 0)
+		add_rchar(current, ret);
+	inc_syscr(current);
+	return ret;
+}
+
+asmlinkage ssize_t sys_pwritev(unsigned int fd, const struct iovec __user *vec,
+                              unsigned long vlen, loff_t pos)
+{
+	struct file *file;
+	ssize_t ret = -EBADF;
+	int fput_needed;
+
+	if (pos < 0)
+		return -EINVAL;
+
+	file = fget_light(fd, &fput_needed);
+	if (file) {
+		ret = -ESPIPE;
+		if (file->f_mode & FMODE_PWRITE)
+			ret = vfs_writev(file, vec, vlen, &pos);
+		fput_light(file, fput_needed);
+	}
+
+	if (ret > 0)
+		add_wchar(current, ret);
+	inc_syscw(current);
+	return ret;
+}
+
 static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
 			   size_t count, loff_t max)
 {
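
For comparison with the kernel-side patch above, userspace running on a
kernel without these syscalls would have to emulate them.  The following is
a rough sketch under that assumption, not part of the patch: the names
emulated_preadv() and demo_emulated_preadv() are invented for the example.
It issues one pread() per iovec element, so unlike a real preadv it is not
atomic with respect to concurrent writers.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical fallback: positional vectored read built from pread(). */
static ssize_t emulated_preadv(int fd, const struct iovec *iov, int iovcnt,
                               off_t pos)
{
    ssize_t total = 0;

    assert(iov != NULL || iovcnt == 0);
    if (pos < 0) {              /* same check as in the patch */
        errno = EINVAL;
        return -1;
    }
    for (int i = 0; i < iovcnt; i++) {
        ssize_t n = pread(fd, iov[i].iov_base, iov[i].iov_len, pos + total);
        if (n < 0)              /* report bytes already read, if any */
            return total > 0 ? total : -1;
        total += n;
        if ((size_t)n < iov[i].iov_len)
            break;              /* short read: EOF or partial transfer */
    }
    return total;
}

/* Usage sketch: read "wor" and "ld!" from offset 6 of "hello world!". */
static int demo_emulated_preadv(void)
{
    char path[] = "/tmp/preadv-demo-XXXXXX";
    char a[3], b[3];
    struct iovec iov[2] = {
        { .iov_base = a, .iov_len = sizeof(a) },
        { .iov_base = b, .iov_len = sizeof(b) },
    };
    int fd = mkstemp(path);
    int ok;

    if (fd < 0)
        return -1;
    unlink(path);               /* file disappears once fd is closed */
    if (write(fd, "hello world!", 12) != 12) {
        close(fd);
        return -1;
    }
    ok = emulated_preadv(fd, iov, 2, 6) == 6 &&
         memcmp(a, "wor", 3) == 0 && memcmp(b, "ld!", 3) == 0;
    close(fd);
    return ok ? 0 : -1;
}
```

The extra syscall per element is the CPU cost a native preadv/pwritev
would avoid.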
