From: Anthony Liguori <anthony@codemonkey.ws>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>,
	qemu-devel@nongnu.org, kvm-devel <kvm@vger.kernel.org>
Subject: Re: [Qemu-devel] [RFC] Replace posix-aio with custom thread pool
Date: Fri, 12 Dec 2008 11:25:55 -0600
Message-ID: <49429EA3.8070008@codemonkey.ws>
In-Reply-To: <20081212170916.GO6809@random.random>

Andrea Arcangeli wrote:
> On Fri, Dec 12, 2008 at 10:49:45AM -0600, Anthony Liguori wrote:
>   
>> I meant, if you wanted to pass a file descriptor as a raw device.  So:
>>
>> qemu -hda raw:fd=4
>>
>> Or something like that.  We don't support this today.
>>     
>
> ah ok.
>
>   
>> I think bouncing the iov and just using pread/pwrite may be our best bet.  
>> It means memory allocation but we can cap it.  Since we're using threads, 
>>     
>
> It's already capped. However currently it generates an iovec, but
> we've simply to check the iovcnt to be 1, if it's 1 we pread from
> iov.iov_base, iov.iov_len. The dma api will take care to enforce
> iovcnt to be 1 for the iovec if preadv/pwritev isn't detected at
> compile time.
>   

Hrm, that's more complex than I was expecting.  I was thinking the bdrv 
aio infrastructure would always take an iovec.  Any details about the 
underlying host's ability to handle the iovec would be insulated.

>> we just can force a thread to sleep until memory becomes available so it's 
>> actually pretty straight forward.
>>     
>
> There's no way to detect that and wait for memory,

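If we artificially cap at, say, 50MB, then you do something like:

/* caller holds `lock`, the mutex paired with the `more_memory` condition */
while (buffer == NULL) {
   buffer = try_to_bounce(offset, iov, iovcnt, &size);
   if (buffer == NULL && errno == ENOMEM) {
      /* sleep until bounce_free() broadcasts that memory was released */
      pthread_cond_wait(&more_memory, &lock);
   }
}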

try_to_bounce allocs with malloc() but if you exceed 50MB, then you fail 
with an error of ENOMEM.  In your bounce_free() function, you do a 
pthread_cond_broadcast() to wake up any threads potentially waiting to 
allocate memory.
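
To make the cap-and-wait idea concrete, here is a minimal sketch of such an allocator. It is not code from QEMU or from any patch in this thread: bounce_alloc(), bounce_lock, bounce_cond, bounce_used and BOUNCE_CAP are all hypothetical names, and the sketch folds the wait loop into the allocator itself rather than leaving it to the caller.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical cap, taken from the 50MB figure mentioned above. */
#define BOUNCE_CAP ((size_t)50 * 1024 * 1024)

static pthread_mutex_t bounce_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t bounce_cond = PTHREAD_COND_INITIALIZER;
static size_t bounce_used;   /* bytes currently reserved, guarded by bounce_lock */

/* Reserve `size` bytes under the cap, sleeping while the pool is full.
 * Returns NULL with EINVAL if the request can never fit, or with ENOMEM
 * if malloc() itself fails. */
static void *bounce_alloc(size_t size)
{
    void *buf;

    if (size > BOUNCE_CAP) {
        errno = EINVAL;
        return NULL;
    }

    pthread_mutex_lock(&bounce_lock);
    while (bounce_used + size > BOUNCE_CAP)
        pthread_cond_wait(&bounce_cond, &bounce_lock);
    bounce_used += size;
    pthread_mutex_unlock(&bounce_lock);

    buf = malloc(size);
    if (buf == NULL) {
        /* Hand the reservation back so other waiters can make progress. */
        pthread_mutex_lock(&bounce_lock);
        bounce_used -= size;
        pthread_cond_broadcast(&bounce_cond);
        pthread_mutex_unlock(&bounce_lock);
        errno = ENOMEM;
    }
    return buf;
}

/* Release a buffer and wake every thread waiting for room under the cap. */
static void bounce_free(void *buf, size_t size)
{
    free(buf);
    pthread_mutex_lock(&bounce_lock);
    assert(bounce_used >= size);
    bounce_used -= size;
    pthread_cond_broadcast(&bounce_cond);
    pthread_mutex_unlock(&bounce_lock);
}
```

Because the accounting is done against our own counter rather than against what the kernel will actually hand out, the wait never depends on malloc() failing, which sidesteps the overcommit concern as long as the cap is chosen conservatively.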

This lets us expose a preadv/pwritev function that actually works.  The 
expectation is that bouncing will outperform just doing pread/pwrite of 
each vector.  Of course, you could get smart and, if try_to_bounce fails, 
fall back to pread/pwrite of each vector.  Likewise, you can fast-path the 
case of a single iovec to avoid bouncing entirely.
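
A rough sketch of what the read side could look like, with the single-iovec fast path and the per-vector fallback. The names emulated_preadv() and pread_each() are made up for illustration, and plain malloc() stands in for a capped bounce allocator; this is an assumption-laden outline, not a proposed patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Fallback: pread() each vector in turn, stopping at the first short read. */
static ssize_t pread_each(int fd, const struct iovec *iov, int iovcnt,
                          off_t offset)
{
    ssize_t total = 0;

    for (int i = 0; i < iovcnt; i++) {
        ssize_t n = pread(fd, iov[i].iov_base, iov[i].iov_len, offset + total);
        if (n < 0)
            return total > 0 ? total : -1;
        total += n;
        if ((size_t)n < iov[i].iov_len)
            break;
    }
    return total;
}

/* Emulate preadv() with one pread() into a bounce buffer, then scatter. */
static ssize_t emulated_preadv(int fd, const struct iovec *iov, int iovcnt,
                               off_t offset)
{
    size_t size = 0, copied = 0;
    char *buffer;
    ssize_t n;

    assert(iov != NULL || iovcnt == 0);

    /* Fast path: a single vector needs no bouncing at all. */
    if (iovcnt == 1)
        return pread(fd, iov[0].iov_base, iov[0].iov_len, offset);

    for (int i = 0; i < iovcnt; i++)
        size += iov[i].iov_len;

    buffer = malloc(size);
    if (buffer == NULL)
        return pread_each(fd, iov, iovcnt, offset);

    n = pread(fd, buffer, size, offset);

    /* Copy whatever was read back out into the caller's vectors. */
    for (int i = 0; n > 0 && i < iovcnt && copied < (size_t)n; i++) {
        size_t len = iov[i].iov_len;
        if (len > (size_t)n - copied)
            len = (size_t)n - copied;
        memcpy(iov[i].iov_base, buffer + copied, len);
        copied += len;
    }

    free(buffer);
    return n;
}
```

The write side would be the mirror image: gather the vectors into the bounce buffer first, then issue a single pwrite().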

Regards,

Anthony Liguori

>  it'd sigkill before
> you can check... at least with the default overcommit. The way the dma
> api works, is that it doesn't send a mega large writev, but send it in
> pieces capped by the max buffer size, with many iovecs with iovcnt = 1.
>
>   
>> We can use libaio on older Linux's to simulate preadv/pwritev.  Use the 
>> proper syscalls on newer kernels, on BSDs, and bounce everything else.
>>     
>
> Given READV/WRITEV aren't available in not very recent kernels and
> given that without O_DIRECT each iocb will become synchronous, we
> can't use the libaio. Also once they fix linux-aio, if we do that, the
> iocb logic would need to be largely refactored. So I'm not sure if it
> worth it as it can't handle 2.6.16-18 when O_DIRECT is disabled (when
> O_DIRECT is enabled we could just build an array of linear iocb).
>   

