From: Gerd Hoffmann <kraxel@redhat.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Anthony Liguori <anthony@codemonkey.ws>,
	kvm-devel <kvm@vger.kernel.org>,
	qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [RFC] Replace posix-aio with custom thread pool
Date: Thu, 11 Dec 2008 17:11:08 +0100
Message-ID: <49413B9C.3030703@redhat.com>
In-Reply-To: <20081211155335.GE14908@random.random>

[-- Attachment #1: Type: text/plain, Size: 1260 bytes --]

Andrea Arcangeli wrote:
>> * It can't handle block allocation.  Kernel handles that by doing
>>   such writes synchronously via VFS layer (instead of the separate
>>   aio code paths).  Leads to horrible performance and bug reports
>>   such as "installs on sparse files are very slow".
>
> I think here you mean O_DIRECT regardless of aio/sync API,

Yes.  But kernel aio requires O_DIRECT, so aio users are affected
nevertheless.

> So in kernels that don't support IOCB_CMD_READV/WRITEV, we've simply
> to an array of iocb through io_submit (i.e. to conver the iov into a
> vector of iocb, instead of a single iocb pointing to the
> iov). Internally to io_submit a single dma command should be generated
> and the same sg list should be built the same as if we used
> READV/WRITEV. In theory READV/WRITEV should be just a cpu saving
> feature, it shouldn't influence disk bandwidth. If it does, it means
> the bio layer is broken and needs fixing.

Haven't tested that.  Could be it isn't a big problem, extra code size
for the two modes aside.

> > > ahem: http://www.daemon-systems.org/man/preadv.2.html
> >
> > Too bad nobody implemented it yet...

Kernel side looks easy, attached patch + syscall table windup in all
archs ...

cheers,
  Gerd

[-- Attachment #2: preadv.diff --]
[-- Type: text/plain, Size: 1390 bytes --]

diff --git a/fs/read_write.c b/fs/read_write.c
index 969a6d9..d1ea2fd 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -701,6 +701,54 @@ sys_writev(unsigned long fd, const struct iovec __user *vec, unsigned long vlen)
 	return ret;
 }
 
+asmlinkage ssize_t sys_preadv(unsigned int fd, const struct iovec __user *vec,
+                              unsigned long vlen, loff_t pos)
+{
+	struct file *file;
+	ssize_t ret = -EBADF;
+	int fput_needed;
+
+	if (pos < 0)
+		return -EINVAL;
+
+	file = fget_light(fd, &fput_needed);
+	if (file) {
+		ret = -ESPIPE;
+		if (file->f_mode & FMODE_PREAD)
+			ret = vfs_readv(file, vec, vlen, &pos);
+		fput_light(file, fput_needed);
+	}
+
+	if (ret > 0)
+		add_rchar(current, ret);
+	inc_syscr(current);
+	return ret;
+}
+
+asmlinkage ssize_t sys_pwritev(unsigned int fd, const struct iovec __user *vec,
+                               unsigned long vlen, loff_t pos)
+{
+	struct file *file;
+	ssize_t ret = -EBADF;
+	int fput_needed;
+
+	if (pos < 0)
+		return -EINVAL;
+
+	file = fget_light(fd, &fput_needed);
+	if (file) {
+		ret = -ESPIPE;
+		if (file->f_mode & FMODE_PWRITE)
+			ret = vfs_writev(file, vec, vlen, &pos);
+		fput_light(file, fput_needed);
+	}
+
+	if (ret > 0)
+		add_wchar(current, ret);
+	inc_syscw(current);
+	return ret;
+}
+
 static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
 			   size_t count, loff_t max)
 {
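
For comparison with the kernel-side patch above, here is a minimal userspace sketch of what a caller has to do when no positional vectored read is available: issue one pread() per iovec element. The names emulated_preadv and emulated_preadv_demo are hypothetical and not part of the patch. Unlike the proposed sys_preadv, this emulation is not atomic across segments and costs one syscall per element.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* for pread() and mkstemp() under strict -std= modes */
#endif
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: emulate a positional vectored read by calling
 * pread() once (or more, on short reads) per iovec element.  The file
 * offset of fd is left untouched, as with pread(). */
static ssize_t emulated_preadv(int fd, const struct iovec *iov, int iovcnt,
                               off_t pos)
{
    ssize_t total = 0;

    if (pos < 0 || iovcnt < 0) {
        errno = EINVAL;
        return -1;
    }

    for (int i = 0; i < iovcnt; i++) {
        size_t done = 0;

        while (done < iov[i].iov_len) {
            ssize_t n = pread(fd, (char *)iov[i].iov_base + done,
                              iov[i].iov_len - done, pos + total);
            if (n < 0) {
                if (errno == EINTR)
                    continue;               /* retry interrupted read */
                return total > 0 ? total : -1;
            }
            if (n == 0)
                return total;               /* end of file: short result */
            done += (size_t)n;
            total += n;
        }
    }
    return total;
}

/* Hypothetical self-check: write 13 bytes to a temporary file, then
 * scatter the 11 bytes starting at offset 2 into two buffers.
 * Returns 0 on success, -1 on failure. */
static int emulated_preadv_demo(void)
{
    char path[] = "/tmp/preadv-demo-XXXXXX";
    char a[5], b[6];
    struct iovec iov[2] = {
        { .iov_base = a, .iov_len = sizeof a },
        { .iov_base = b, .iov_len = sizeof b },
    };
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);                           /* file vanishes on close */

    ssize_t w = write(fd, "xxhello world", 13);
    assert(w == 13);

    ssize_t n = emulated_preadv(fd, iov, 2, 2);
    close(fd);

    return (n == 11 && memcmp(a, "hello", 5) == 0 &&
            memcmp(b, " world", 6) == 0) ? 0 : -1;
}
```

A caller would use emulated_preadv(fd, iov, iovcnt, pos) wherever a single vectored positional read is wanted; the per-element loop is exactly the overhead a native preadv syscall would remove.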