* [Qemu-devel] [PATCH] honor IDE_DMA_BUF_SECTORS
@ 2009-03-25 13:45 Stefano Stabellini
  2009-03-25 15:22 ` Avi Kivity
  0 siblings, 1 reply; 29+ messages in thread
From: Stefano Stabellini @ 2009-03-25 13:45 UTC (permalink / raw)
  To: qemu-devel

Hi all,
after the recent introduction of dma_buf_prepare we stopped honoring
IDE_DMA_BUF_SECTORS (the guest can issue dma requests with a greater
total length than IDE_DMA_BUF_SECTORS).
This patch adds the IDE_DMA_BUF_SECTORS limit back in place.

Comments are welcome.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>

---

diff --git a/hw/ide.c b/hw/ide.c
index 96bc176..ef1356d 100644
--- a/hw/ide.c
+++ b/hw/ide.c
@@ -207,6 +207,7 @@
 #define MAX_MULT_SECTORS 16
 
 #define IDE_DMA_BUF_SECTORS 256
+#define IDE_DMA_BUF_BYTES (IDE_DMA_BUF_SECTORS * 512)
 
 #if (IDE_DMA_BUF_SECTORS < MAX_MULT_SECTORS)
 #error "IDE_DMA_BUF_SECTORS must be bigger or equal to MAX_MULT_SECTORS"
@@ -877,9 +878,10 @@ static int dma_buf_prepare(BMDMAState *bm, int is_write)
         uint32_t addr;
         uint32_t size;
     } prd;
-    int l, len;
+    int l, len, n;
 
-    qemu_sglist_init(&s->sg, s->nsector / (TARGET_PAGE_SIZE/512) + 1);
+    n = s->nsector <= IDE_DMA_BUF_SECTORS ? s->nsector : IDE_DMA_BUF_SECTORS;
+    qemu_sglist_init(&s->sg, n / (TARGET_PAGE_SIZE/512) + 1);
     s->io_buffer_size = 0;
     for(;;) {
         if (bm->cur_prd_len == 0) {
@@ -900,6 +902,13 @@ static int dma_buf_prepare(BMDMAState *bm, int is_write)
         }
         l = bm->cur_prd_len;
         if (l > 0) {
+            if (l > IDE_DMA_BUF_BYTES)
+                l = IDE_DMA_BUF_BYTES;
+            if (s->io_buffer_size + l > IDE_DMA_BUF_BYTES) {
+                l = IDE_DMA_BUF_BYTES - s->io_buffer_size;
+                if (!l)
+                    return s->io_buffer_size != 0;
+            }
             qemu_sglist_add(&s->sg, bm->cur_prd_addr, l);
             bm->cur_prd_addr += l;
             bm->cur_prd_len -= l;


* Re: [Qemu-devel] [PATCH] honor IDE_DMA_BUF_SECTORS
  2009-03-25 13:45 [Qemu-devel] [PATCH] honor IDE_DMA_BUF_SECTORS Stefano Stabellini
@ 2009-03-25 15:22 ` Avi Kivity
  2009-03-25 16:19   ` Stefano Stabellini
  0 siblings, 1 reply; 29+ messages in thread
From: Avi Kivity @ 2009-03-25 15:22 UTC (permalink / raw)
  To: qemu-devel

Stefano Stabellini wrote:
> Hi all,
> after the recent introduction of dma_buf_prepare we stopped honoring
> IDE_DMA_BUF_SECTORS (the guest can issue dma requests with a greater
> total length than IDE_DMA_BUF_SECTORS).
> This patch adds the IDE_DMA_BUF_SECTORS limit back in place.
>
> Comments are welcome.
>   

Aren't 64K sector requests legal?

Looking at ide_cmd_lba48_transform().
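
For readers following the reference: ide_cmd_lba48_transform() is where the
ATA "0 means maximum" sector-count convention is applied.  Roughly,
reconstructed from memory of the hw/ide.c of that era (details may differ):

static void ide_cmd_lba48_transform(IDEState *s, int lba48)
{
    s->lba48 = lba48;

    /* nsector == 0 means the largest transfer the addressing mode allows */
    if (!s->lba48) {
        if (!s->nsector)
            s->nsector = 256;                    /* CHS/LBA28: 256 sectors */
    } else {
        if (!s->nsector && !s->hob_nsector)
            s->nsector = 65536;                  /* LBA48: 64K sectors */
        else
            s->nsector = (s->hob_nsector << 8) | s->nsector;
    }
}

So with LBA48 a single READ/WRITE DMA EXT command may legitimately ask for
up to 65536 sectors, which is the point of the question above.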

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [PATCH] honor IDE_DMA_BUF_SECTORS
  2009-03-25 15:22 ` Avi Kivity
@ 2009-03-25 16:19   ` Stefano Stabellini
  2009-03-25 16:45     ` Avi Kivity
  2009-03-25 16:46     ` Samuel Thibault
  0 siblings, 2 replies; 29+ messages in thread
From: Stefano Stabellini @ 2009-03-25 16:19 UTC (permalink / raw)
  To: qemu-devel

Avi Kivity wrote:

> Stefano Stabellini wrote:
>> Hi all,
>> after the recent introduction of dma_buf_prepare we stopped honoring
>> IDE_DMA_BUF_SECTORS (the guest can issue dma requests with a greater
>> total length than IDE_DMA_BUF_SECTORS).
>> This patch adds the IDE_DMA_BUF_SECTORS limit back in place.
>>
>> Comments are welcome.
>>   
> 
> Aren't 64K sector requests legal?
> 
> Looking at ide_cmd_lba48_transform().
> 


I am not sure about that, but I would think not, since we define MAX_MULT_SECTORS as 16.
If I change MAX_MULT_SECTORS, am I supposed to see smaller DMA requests?


* Re: [Qemu-devel] [PATCH] honor IDE_DMA_BUF_SECTORS
  2009-03-25 16:19   ` Stefano Stabellini
@ 2009-03-25 16:45     ` Avi Kivity
  2009-03-25 16:50       ` Stefano Stabellini
  2009-03-25 16:46     ` Samuel Thibault
  1 sibling, 1 reply; 29+ messages in thread
From: Avi Kivity @ 2009-03-25 16:45 UTC (permalink / raw)
  To: qemu-devel

Stefano Stabellini wrote:
> Avi Kivity wrote:
>
>   
>> Stefano Stabellini wrote:
>>     
>>> Hi all,
>>> after the recent introduction of dma_buf_prepare we stopped honoring
>>> IDE_DMA_BUF_SECTORS (the guest can issue dma requests with a greater
>>> total length than IDE_DMA_BUF_SECTORS).
>>> This patch adds the IDE_DMA_BUF_SECTORS limit back in place.
>>>
>>> Comments are welcome.
>>>   
>>>       
>> Aren't 64K sector requests legal?
>>
>> Looking at ide_cmd_lba48_transform().
>>
>>     
>
>
> I am not sure about that, but I would think not since we define MAX_MULT_SECTORS 16
> If I change MAX_MULT_SECTORS, am I supposed to see smaller dma requests?
>   

I think that's for PIO, or maybe older DMA?  Certainly you wouldn't 
expect to be limited to 16-sector DMA.

Anyway I think IDE_DMA_BUF_SECTORS was a qemu limitation and did not 
model real hardware behaviour.

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [PATCH] honor IDE_DMA_BUF_SECTORS
  2009-03-25 16:19   ` Stefano Stabellini
  2009-03-25 16:45     ` Avi Kivity
@ 2009-03-25 16:46     ` Samuel Thibault
  1 sibling, 0 replies; 29+ messages in thread
From: Samuel Thibault @ 2009-03-25 16:46 UTC (permalink / raw)
  To: qemu-devel

Stefano Stabellini, le Wed 25 Mar 2009 16:19:58 +0000, a écrit :
> Avi Kivity wrote:
> 
> > Stefano Stabellini wrote:
> >> Hi all,
> >> after the recent introduction of dma_buf_prepare we stopped honoring
> >> IDE_DMA_BUF_SECTORS (the guest can issue dma requests with a greater
> >> total length than IDE_DMA_BUF_SECTORS).
> >> This patch adds the IDE_DMA_BUF_SECTORS limit back in place.
> >>
> >> Comments are welcome.
> >>   
> > 
> > Aren't 64K sector requests legal?
> > 
> > Looking at ide_cmd_lba48_transform().
> > 
> 
> 
> I am not sure about that, but I would think not since we define MAX_MULT_SECTORS 16
> If I change MAX_MULT_SECTORS, am I supposed to see smaller dma requests?

Do not confuse things. MAX_MULT_SECTORS is for multi-sector PIO, not
for DMA.
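
For reference, MAX_MULT_SECTORS only bounds what a guest may program via
SET MULTIPLE MODE, i.e. how many sectors are moved per data interrupt in
PIO READ/WRITE MULTIPLE; it says nothing about the total length of a DMA
transfer.  A rough sketch of the corresponding check in hw/ide.c (from
memory, the exact code may differ):

    case WIN_SETMULT:
        /* the guest asks for s->nsector sectors per PIO data interrupt */
        if ((s->nsector & 0xff) > MAX_MULT_SECTORS ||
            (s->nsector & (s->nsector - 1)) != 0) {
            ide_abort_command(s);     /* too large or not a power of two */
        } else {
            s->mult_sectors = s->nsector & 0xff;
            s->status = READY_STAT | SEEK_STAT;
        }
        ide_set_irq(s);
        break;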

Samuel


* Re: [Qemu-devel] [PATCH] honor IDE_DMA_BUF_SECTORS
  2009-03-25 16:45     ` Avi Kivity
@ 2009-03-25 16:50       ` Stefano Stabellini
  2009-03-25 17:47         ` Stefano Stabellini
  0 siblings, 1 reply; 29+ messages in thread
From: Stefano Stabellini @ 2009-03-25 16:50 UTC (permalink / raw)
  To: qemu-devel

Avi Kivity wrote:

> I think that's for pio, or maybe older DMA?  Certainly you wouldn't 
> expect to be limited to 16 sectors DMA.


Thanks for clarifying.

 
> Anyway I think IDE_DMA_BUF_SECTORS was a qemu limitation and did not 
> model real hardware behaviour.
> 


Therefore I must assume that we expect qemu to be able to handle any
kind of dma request coming from the guest, which at the moment is
limited to 64K sectors, am I right?
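
For scale, the two limits under discussion work out to:

    256 sectors   * 512 bytes = 128 KiB   (the old IDE_DMA_BUF_SECTORS cap)
    65536 sectors * 512 bytes =  32 MiB   (LBA48 worst case, one request)

so "any kind of dma request" can mean a single 32 MiB transfer that qemu
has to map or bounce somewhere.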


* Re: [Qemu-devel] [PATCH] honor IDE_DMA_BUF_SECTORS
  2009-03-25 16:50       ` Stefano Stabellini
@ 2009-03-25 17:47         ` Stefano Stabellini
  2009-03-26 10:23           ` Avi Kivity
  0 siblings, 1 reply; 29+ messages in thread
From: Stefano Stabellini @ 2009-03-25 17:47 UTC (permalink / raw)
  To: qemu-devel

Stefano Stabellini wrote:

>> Anyway I think IDE_DMA_BUF_SECTORS was a qemu limitation and did not 
>> model real hardware behaviour.
>>
> 
> 
> Therefore I must assume that we expect qemu to be able to handle any
> kind of dma request coming from the guest, which at the moment is
> limited to 64K sectors, am I right?

I checked the ide driver in the kernel and it assumes that the max
sectors is either 256 or 64K depending on lba support, exactly as qemu does.


So now my question is: if I want to reduce the maximum DMA request size
inside qemu, given that I must correctly fill the guest-provided sg
list, is it OK to use IDE_DMA_BUF_SECTORS in dma_buf_prepare as I have
done in my patch?

I don't see any other possible solution, but if you have any other
suggestion you are welcome to let me know.


* Re: [Qemu-devel] [PATCH] honor IDE_DMA_BUF_SECTORS
  2009-03-25 17:47         ` Stefano Stabellini
@ 2009-03-26 10:23           ` Avi Kivity
  2009-03-26 10:31             ` Stefano Stabellini
  0 siblings, 1 reply; 29+ messages in thread
From: Avi Kivity @ 2009-03-26 10:23 UTC (permalink / raw)
  To: qemu-devel

Stefano Stabellini wrote:
> I checked the ide driver in the kernel and it assumes that the max
> sectors is either 256 or 64K depending on lba support, exactly as qemu does.
>
>
> So now my question is: if I want to reduce the maximum dma request size
> inside qemu, given that I must fill correctly the guest provided sg
> list, is it OK to use IDE_DMA_BUF_SECTORS in dma_buf_prepare as I have
> done in my patch?
>
> I don't see any other possible solution, but if you have any other
> suggestion you are welcome to let me know.
>   

Look at the DMA API (dma-helpers.c) which already knows how to split 
large dma requests.  Splitting is controlled by 
cpu_physical_memory_map() (which I'm guessing is your real limitation), 
so you might want to look at that.

The advantage of this approach is that it will apply to scsi and virtio 
once they are ported to use the DMA API.
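
The splitting lives in dma_bdrv_cb(); a trimmed sketch of that loop,
simplified from the dma-helpers.c of the time (from memory, details may
differ):

    /* map as much of the guest sg list as possible; cpu_physical_memory_map()
     * returning NULL (e.g. no bounce buffer available) ends this batch */
    while (dbs->sg_cur_index < dbs->sg->nsg) {
        cur_addr = dbs->sg->sg[dbs->sg_cur_index].base + dbs->sg_cur_byte;
        cur_len = dbs->sg->sg[dbs->sg_cur_index].len - dbs->sg_cur_byte;
        mem = cpu_physical_memory_map(cur_addr, &cur_len, !dbs->is_write);
        if (!mem)
            break;
        qemu_iovec_add(&dbs->iov, mem, cur_len);
        dbs->sg_cur_byte += cur_len;
        if (dbs->sg_cur_byte == dbs->sg->sg[dbs->sg_cur_index].len) {
            dbs->sg_cur_byte = 0;
            ++dbs->sg_cur_index;
        }
    }

    /* submit whatever was mapped; the callback re-enters dma_bdrv_cb,
     * which unmaps, advances sector_num and continues with the rest */
    if (dbs->is_write) {
        dbs->acb = bdrv_aio_writev(dbs->bs, dbs->sector_num, &dbs->iov,
                                   dbs->iov.size / 512, dma_bdrv_cb, dbs);
    } else {
        dbs->acb = bdrv_aio_readv(dbs->bs, dbs->sector_num, &dbs->iov,
                                  dbs->iov.size / 512, dma_bdrv_cb, dbs);
    }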


-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [PATCH] honor IDE_DMA_BUF_SECTORS
  2009-03-26 10:23           ` Avi Kivity
@ 2009-03-26 10:31             ` Stefano Stabellini
  2009-03-26 10:57               ` Avi Kivity
  0 siblings, 1 reply; 29+ messages in thread
From: Stefano Stabellini @ 2009-03-26 10:31 UTC (permalink / raw)
  To: qemu-devel

Avi Kivity wrote:

> Stefano Stabellini wrote:
>> I checked the ide driver in the kernel and it assumes that the max
>> sectors is either 256 or 64K depending on lba support, exactly as qemu does.
>>
>>
>> So now my question is: if I want to reduce the maximum dma request size
>> inside qemu, given that I must fill correctly the guest provided sg
>> list, is it OK to use IDE_DMA_BUF_SECTORS in dma_buf_prepare as I have
>> done in my patch?
>>
>> I don't see any other possible solution, but if you have any other
>> suggestion you are welcome to let me know.
>>   
> 
> Look at the DMA API (dma-helpers.c) which already knows how to split 
> large dma requests.  Splitting is controlled by 
> cpu_physical_memory_map() (which I'm guessing is your real limitation), 
> so you might want to look at that.
> 
> The advantage of this approach is that it will apply to scsi and virtio 
> once they are ported to use the DMA API.
> 
> 



Unfortunately that is not really helpful: after the split done by
cpu_physical_memory_map the iovector is converted into a buffer in
bdrv_aio_rw_vector, and then the full length of the buffer is passed on
to bdrv_aio_write/read for the DMA operation.

I need a way to set a maximum limit for the total number of sectors in
the dma operation, much like blk_queue_max_phys_segments in the kernel.

This could also be useful to make sure that we don't allocate bounce
buffers bigger than a predetermined limit.
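
For comparison, this is the kind of up-front limit a 2.6-era Linux block
driver advertises on its request queue; the calls below are quoted from
memory and only meant to illustrate the analogy (q is the driver's
struct request_queue *):

    blk_queue_max_sectors(q, 128);          /* at most 128 sectors (64 KiB) per request */
    blk_queue_max_phys_segments(q, 32);     /* at most 32 physical segments */
    blk_queue_max_hw_segments(q, 32);       /* and 32 segments after merging */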


* Re: [Qemu-devel] [PATCH] honor IDE_DMA_BUF_SECTORS
  2009-03-26 10:31             ` Stefano Stabellini
@ 2009-03-26 10:57               ` Avi Kivity
  2009-03-26 11:45                 ` Stefano Stabellini
  0 siblings, 1 reply; 29+ messages in thread
From: Avi Kivity @ 2009-03-26 10:57 UTC (permalink / raw)
  To: qemu-devel

Stefano Stabellini wrote:
> Unfortunately that is not really helpful: after the split done by
> cpu_physical_memory_map the iovector is converted in a buffer in
> bdrv_aio_rw_vector and then the full length of the buffer is passed on
> to the bdrv_aio_write\read for the dma operation.
>
> I need a way to set a maximum limit for the total number of sectors in
> the dma operation, much like blk_queue_max_phys_segments in the kernel.
>
> This could also be useful to make sure that we don't allocate bounce
> buffers bigger than a predetermined limit.
>   

If cpu_physical_memory_map() returns NULL, then dma-helpers.c will stop 
collecting sg entries and submit the I/O.  Tuning that will control how 
vectored requests are submitted.

If your problem is specifically with the bdrv_aio_rw_vector bounce 
buffer, then note that this is a temporary measure until vectored aio is 
in place, through preadv/pwritev and/or linux-aio IO_CMD_PREADV.  You 
should either convert to that when it is merged, or implement request 
splitting in bdrv_aio_rw_vector.

Can you explain your problem in more detail?

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [PATCH] honor IDE_DMA_BUF_SECTORS
  2009-03-26 10:57               ` Avi Kivity
@ 2009-03-26 11:45                 ` Stefano Stabellini
  2009-03-26 12:10                   ` Avi Kivity
  2009-03-26 22:42                   ` Christoph Hellwig
  0 siblings, 2 replies; 29+ messages in thread
From: Stefano Stabellini @ 2009-03-26 11:45 UTC (permalink / raw)
  To: qemu-devel

Avi Kivity wrote:

> If cpu_physical_memory_map() returns NULL, then dma-helpers.c will stop 
> collecting sg entries and submit the I/O.  Tuning that will control how 
> vectored requests are submitted.
> 


 
I understand your suggestion now, something like:

---

diff --git a/dma-helpers.c b/dma-helpers.c
index 96a120c..6c43b97 100644
--- a/dma-helpers.c
+++ b/dma-helpers.c
@@ -96,6 +96,11 @@ static void dma_bdrv_cb(void *opaque, int ret)
     while (dbs->sg_cur_index < dbs->sg->nsg) {
         cur_addr = dbs->sg->sg[dbs->sg_cur_index].base + dbs->sg_cur_byte;
         cur_len = dbs->sg->sg[dbs->sg_cur_index].len - dbs->sg_cur_byte;
+        if (dbs->iov.size + cur_len > DMA_LIMIT) {
+            cur_len = DMA_LIMIT - dbs->iov.size;
+            if (cur_len <= 0)
+                break;
+        }
         mem = cpu_physical_memory_map(cur_addr, &cur_len, !dbs->is_write);
         if (!mem)
             break;

---

would work for me.

However it is difficult to put that code inside cpu_physical_memory_map
since I don't have any reference to link together all the mapping
requests related to the same dma transfer.


> If you problem is specifically with the bdrv_aio_rw_vector bounce 
> buffer, then note that this is a temporary measure until vectored aio is 
> in place, through preadv/pwritev and/or linux-aio IO_CMD_PREADV.  You 
> should either convert to that when it is merged, or implement request 
> splitting in bdrv_aio_rw_vector.
> 
> Can you explain your problem in more detail?



My problem is that my block driver has a size limit for read and write
operations.
When preadv/pwritev are in place I could limit the transfer size
directly in raw_aio_preadv\pwritev but I would also have to update the
iovector size field to reflect that, and I think that is a little bit ugly.


* Re: [Qemu-devel] [PATCH] honor IDE_DMA_BUF_SECTORS
  2009-03-26 11:45                 ` Stefano Stabellini
@ 2009-03-26 12:10                   ` Avi Kivity
  2009-03-26 12:28                     ` Stefano Stabellini
  2009-03-26 12:47                     ` Samuel Thibault
  2009-03-26 22:42                   ` Christoph Hellwig
  1 sibling, 2 replies; 29+ messages in thread
From: Avi Kivity @ 2009-03-26 12:10 UTC (permalink / raw)
  To: qemu-devel

Stefano Stabellini wrote:
> Avi Kivity wrote:
>
>   
>> If cpu_physical_memory_map() returns NULL, then dma-helpers.c will stop 
>> collecting sg entries and submit the I/O.  Tuning that will control how 
>> vectored requests are submitted.
>>
>>     
>
>
>  
> I understand your suggestion now, something like:
>
> ---
>
> diff --git a/dma-helpers.c b/dma-helpers.c
> index 96a120c..6c43b97 100644
> --- a/dma-helpers.c
> +++ b/dma-helpers.c
> @@ -96,6 +96,11 @@ static void dma_bdrv_cb(void *opaque, int ret)
>      while (dbs->sg_cur_index < dbs->sg->nsg) {
>          cur_addr = dbs->sg->sg[dbs->sg_cur_index].base + dbs->sg_cur_byte;
>          cur_len = dbs->sg->sg[dbs->sg_cur_index].len - dbs->sg_cur_byte;
> +        if (dbs->iov.size + cur_len > DMA_LIMIT) {
> +            cur_len = DMA_LIMIT - dbs->iov.size;
> +            if (cur_len <= 0)
> +                break;
> +        }
>          mem = cpu_physical_memory_map(cur_addr, &cur_len, !dbs->is_write);
>          if (!mem)
>              break;
>
> ---
>
> would work for me.
>
> However it is difficult to put that code inside cpu_physical_memory_map
> since I don't have any reference to link together all the mapping
> requests related to the same dma transfer.
>   

It would be fine here, but see below.

>> If you problem is specifically with the bdrv_aio_rw_vector bounce 
>> buffer, then note that this is a temporary measure until vectored aio is 
>> in place, through preadv/pwritev and/or linux-aio IO_CMD_PREADV.  You 
>> should either convert to that when it is merged, or implement request 
>> splitting in bdrv_aio_rw_vector.
>>
>> Can you explain your problem in more detail?
>>     
>
>
>
> My problem is that my block driver has a size limit for read and write
> operations.
>   

Then I think the place to split the requests is in your block format 
driver, not the generic code.  If the limit lived in the generic code, 
then running one device on the limited block format driver and another 
device on an unlimited block format driver would leave the second, 
unlimited device constrained by the first driver's limitation.

I realize your use case will probably not trigger this, but it does 
indicate you're limiting at the wrong layer.  It places the burden on 
all callers of block format drivers instead of centralizing it.

> When preadv/pwritev are in place I could limit the transfer size
> directly in raw_aio_preadv\pwritev but I would also have to update the
> iovector size field to reflect that and I think is a little bit ugly.
>   

Just copy the iovec for each sub-request.
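
A minimal sketch of what "copy the iovec for each sub-request" could look
like, using a hypothetical helper over plain struct iovec (iov_sub and its
parameters are made up for illustration, not an existing qemu API):

#include <stddef.h>
#include <sys/uio.h>

/* Build a sub-iovec covering at most 'limit' bytes of 'src', starting at
 * byte 'offset'.  'dst' must have room for 'src_cnt' entries.  Returns the
 * number of entries written; *sub_len receives the bytes actually covered. */
static int iov_sub(struct iovec *dst, const struct iovec *src, int src_cnt,
                   size_t offset, size_t limit, size_t *sub_len)
{
    int i, n = 0;
    size_t done = 0;

    for (i = 0; i < src_cnt && done < limit; i++) {
        size_t len = src[i].iov_len;
        if (offset >= len) {            /* entry lies before the window */
            offset -= len;
            continue;
        }
        len -= offset;
        if (len > limit - done)
            len = limit - done;
        dst[n].iov_base = (char *)src[i].iov_base + offset;
        dst[n].iov_len  = len;
        done += len;
        offset = 0;
        n++;
    }
    *sub_len = done;
    return n;
}

The caller issues one preadv/pwritev per sub-iovec, advancing offset by
*sub_len until the whole original vector has been covered.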

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [PATCH] honor IDE_DMA_BUF_SECTORS
  2009-03-26 12:10                   ` Avi Kivity
@ 2009-03-26 12:28                     ` Stefano Stabellini
  2009-03-26 12:47                     ` Samuel Thibault
  1 sibling, 0 replies; 29+ messages in thread
From: Stefano Stabellini @ 2009-03-26 12:28 UTC (permalink / raw)
  To: qemu-devel

Avi Kivity wrote:

>> When preadv/pwritev are in place I could limit the transfer size
>> directly in raw_aio_preadv\pwritev but I would also have to update the
>> iovector size field to reflect that and I think is a little bit ugly.
>>   
> 
> Just copy the iovec for each sub-request.
> 


OK, I'll see about that when the preadv and pwritev are actually
implemented.
Meanwhile I'll keep the patch I sent before in our tree only (unless
someone else is interested in a way to limit dma transfer size for any
reason).


* Re: [Qemu-devel] [PATCH] honor IDE_DMA_BUF_SECTORS
  2009-03-26 12:10                   ` Avi Kivity
  2009-03-26 12:28                     ` Stefano Stabellini
@ 2009-03-26 12:47                     ` Samuel Thibault
  2009-03-26 12:58                       ` Avi Kivity
  1 sibling, 1 reply; 29+ messages in thread
From: Samuel Thibault @ 2009-03-26 12:47 UTC (permalink / raw)
  To: qemu-devel

Avi Kivity, le Thu 26 Mar 2009 14:10:20 +0200, a écrit :
> I realize your use case will probably not trigger this, but it does 
> indicate you're limiting at the wrong layer.  It places the burden on 
> all callers of block format drivers instead of centralizing it.

Then it should be centralized in the block layer instead of placing the
burden on all block format drivers ;)

One thing, for instance, that has still been overlooked although patches
have been sent is block-raw-posix' read/write_pread_aligned(), which
treats partial reads/writes as an error.  That's a bug.
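
The usual fix is a retry loop, since POSIX allows pread()/pwrite() to
transfer fewer bytes than requested; a minimal sketch (not the actual
block-raw-posix code, just the shape of the fix):

#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

static ssize_t full_pread(int fd, void *buf, size_t count, off_t offset)
{
    size_t done = 0;

    while (done < count) {
        ssize_t ret = pread(fd, (char *)buf + done, count - done,
                            offset + done);
        if (ret < 0) {
            if (errno == EINTR)
                continue;               /* interrupted, retry */
            return -1;                  /* real error */
        }
        if (ret == 0)
            break;                      /* EOF: return what we got */
        done += ret;
    }
    return done;
}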

Samuel


* Re: [Qemu-devel] [PATCH] honor IDE_DMA_BUF_SECTORS
  2009-03-26 12:47                     ` Samuel Thibault
@ 2009-03-26 12:58                       ` Avi Kivity
  2009-03-26 15:30                         ` Samuel Thibault
  0 siblings, 1 reply; 29+ messages in thread
From: Avi Kivity @ 2009-03-26 12:58 UTC (permalink / raw)
  To: qemu-devel

Samuel Thibault wrote:
> Avi Kivity, le Thu 26 Mar 2009 14:10:20 +0200, a écrit :
>   
>> I realize your use case will probably not trigger this, but it does 
>> indicate you're limiting at the wrong layer.  It places the burden on 
>> all callers of block format drivers instead of centralizing it.
>>     
>
> Then it should be centralized in the block layer instead of placing the
> burden on all block format drivers ;)
>   

If other drivers need to do that, certainly.  However I expect that most 
drivers are interested in increasing dma sizes rather than decreasing them.

> One thing for instance that still have been overlooked although patches
> have been sent is block-raw-posix' read/write_pread_aligned() that
> consider partial read/writes as an error.  That's a bug.
>   

Right.  Unrelated topic though?

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [PATCH] honor IDE_DMA_BUF_SECTORS
  2009-03-26 12:58                       ` Avi Kivity
@ 2009-03-26 15:30                         ` Samuel Thibault
  2009-03-26 18:32                           ` Avi Kivity
  0 siblings, 1 reply; 29+ messages in thread
From: Samuel Thibault @ 2009-03-26 15:30 UTC (permalink / raw)
  To: qemu-devel

Avi Kivity, le Thu 26 Mar 2009 14:58:55 +0200, a écrit :
> Samuel Thibault wrote:
> >Avi Kivity, le Thu 26 Mar 2009 14:10:20 +0200, a écrit :
> >  
> >>I realize your use case will probably not trigger this, but it does 
> >>indicate you're limiting at the wrong layer.  It places the burden on 
> >>all callers of block format drivers instead of centralizing it.
> >>    
> >
> >Then it should be centralized in the block layer instead of placing the
> >burden on all block format drivers ;)
> >  
> 
> If other drivers need to do that, certainly.

In our case the other driver is specific to Xen.

> >One thing for instance that still have been overlooked although patches
> >have been sent is block-raw-posix' read/write_pread_aligned() that
> >consider partial read/writes as an error.  That's a bug.
> >  
> 
> Right.  Unrelated topic though?

Nope.  It's exactly the issue: read/write() may not be able to perform
the whole operation in just one go, and qemu should continue in that
case.

Samuel


* Re: [Qemu-devel] [PATCH] honor IDE_DMA_BUF_SECTORS
  2009-03-26 15:30                         ` Samuel Thibault
@ 2009-03-26 18:32                           ` Avi Kivity
  2009-03-26 18:48                             ` Samuel Thibault
  0 siblings, 1 reply; 29+ messages in thread
From: Avi Kivity @ 2009-03-26 18:32 UTC (permalink / raw)
  To: qemu-devel

Samuel Thibault wrote:
> Avi Kivity, le Thu 26 Mar 2009 14:58:55 +0200, a écrit :
>   
>> Samuel Thibault wrote:
>>     
>>> Avi Kivity, le Thu 26 Mar 2009 14:10:20 +0200, a écrit :
>>>  
>>>       
>>>> I realize your use case will probably not trigger this, but it does 
>>>> indicate you're limiting at the wrong layer.  It places the burden on 
>>>> all callers of block format drivers instead of centralizing it.
>>>>    
>>>>         
>>> Then it should be centralized in the block layer instead of placing the
>>> burden on all block format drivers ;)
>>>  
>>>       
>> If other drivers need to do that, certainly.
>>     
>
> In our case the other driver is specific to Xen.
>   

I'm confused.  I can only count one driver which has limited dma size.

>   
>>> One thing for instance that still have been overlooked although patches
>>> have been sent is block-raw-posix' read/write_pread_aligned() that
>>> consider partial read/writes as an error.  That's a bug.
>>>  
>>>       
>> Right.  Unrelated topic though?
>>     
>
> Nope.  It's exactly the issue: read/write() may not be able to perform
> the whole operation in just one go, and qemu should continue in that
> case.
>   

Oh, you're overloading block-raw-posix?  Isn't it more natural for you 
to implement block-raw-xen-pv-block-frontend?  You'd be able to use 
asynchronous requests instead of a thread pool (much like 
block-raw-linux-aio).


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* Re: [Qemu-devel] [PATCH] honor IDE_DMA_BUF_SECTORS
  2009-03-26 18:32                           ` Avi Kivity
@ 2009-03-26 18:48                             ` Samuel Thibault
  2009-03-26 19:40                               ` Avi Kivity
  0 siblings, 1 reply; 29+ messages in thread
From: Samuel Thibault @ 2009-03-26 18:48 UTC (permalink / raw)
  To: qemu-devel

Avi Kivity, le Thu 26 Mar 2009 20:32:28 +0200, a écrit :
> Samuel Thibault wrote:
> >Avi Kivity, le Thu 26 Mar 2009 14:58:55 +0200, a écrit :
> >>Samuel Thibault wrote:
> >>>it should be centralized in the block layer instead of placing the
> >>>burden on all block format drivers ;)
> >>>      
> >>If other drivers need to do that, certainly.
> >
> >In our case the other driver is specific to Xen.
> 
> I'm confused.  I can only count one driver which has limited dma size.

Then you are not looking at the same place as I am. The Xen tree is at
http://xenbits.xensource.com/git-http/qemu-xen-unstable.git

> >>>One thing for instance that still have been overlooked although patches
> >>>have been sent is block-raw-posix' read/write_pread_aligned() that
> >>>consider partial read/writes as an error.  That's a bug.
> >>>      
> >>Right.  Unrelated topic though?
> >
> >Nope.  It's exactly the issue: read/write() may not be able to perform
> >the whole operation in just one go, and qemu should continue in that
> >case.
> 
> Oh, you're overloading block-raw-posix?

I'm not.  Actually, I am _also_ implementing the read/write() functions,
but that's another matter.  In the xen tree, there is an additional
block-vbd.c driver.  I'm here just pointing out that the problem is not
_only_ in the xen-specific driver, but also in the posix driver, on any
OS that doesn't necessarily do all the work the caller asked for (which
is _allowed_ by POSIX).

> Isn't it more natural for you to implement
> block-raw-xen-pv-block-frontend?

That's it.

> You'd be able to use asynchronous requests instead of a thread pool
> (much like block-raw-linux-aio).

That's it.  But the Xen ring protocol limits the amount of data
transferred at a time, hence the limitation.
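
Concretely, the blkif ring protocol caps every request at a fixed number
of page-sized segments; as far as I recall the interface header, the
numbers are roughly:

/* from xen/io/blkif.h (approximate, for illustration) */
#define BLKIF_MAX_SEGMENTS_PER_REQUEST 11
/* each segment maps at most one granted 4 KiB page, so a single blkif
 * request carries at most 11 * 4096 = 45056 bytes, i.e. 88 sectors */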

Samuel


* Re: [Qemu-devel] [PATCH] honor IDE_DMA_BUF_SECTORS
  2009-03-26 18:48                             ` Samuel Thibault
@ 2009-03-26 19:40                               ` Avi Kivity
  2009-03-26 23:18                                 ` Samuel Thibault
  0 siblings, 1 reply; 29+ messages in thread
From: Avi Kivity @ 2009-03-26 19:40 UTC (permalink / raw)
  To: qemu-devel

Samuel Thibault wrote:
>>>>> it should be centralized in the block layer instead of placing the
>>>>> burden on all block format drivers ;)
>>>>>      
>>>>>           
>>>> If other drivers need to do that, certainly.
>>>>         
>>> In our case the other driver is specific to Xen.
>>>       
>> I'm confused.  I can only count one driver which has limited dma size.
>>     
>
> Then you are not looking at the same place as I am. The Xen tree is at
> http://xenbits.xensource.com/git-http/qemu-xen-unstable.git
>   

I wasn't looking at any tree.  I count one driver with limited DMA sizes 
- block-vbd.  What's the other one?  Forgive me for not cloning and 
rummaging in qemu-xen.

>>>>> One thing for instance that still have been overlooked although patches
>>>>> have been sent is block-raw-posix' read/write_pread_aligned() that
>>>>> consider partial read/writes as an error.  That's a bug.
>>>>>      
>>>>>           
>>>> Right.  Unrelated topic though?
>>>>         
>>> Nope.  It's exactly the issue: read/write() may not be able to perform
>>> the whole operation in just one go, and qemu should continue in that
>>> case.
>>>       
>> Oh, you're overloading block-raw-posix?
>>     
>
> I'm not.  Actually, I am _also_ implementing the read/write() functions,
> but that's another matter.  In the xen tree, there is an addition
> block-vbd.c driver.  I'm here just pointing out that the problem is not
> _only_ in the xen-specific driver, but also in the posix driver, on any
> OS that doesn't necessarily do all the work the caller asked for (which
> is _allowed_ by POSIX).
>   

But that's not limited DMA (or at least, not limited up-front).  And 
it's easily corrected: place a while loop around preadv/pwritev, no need 
to split a request a priori somewhere up the stack.  IDE_MAX_DMA_BUFFER 
(or however it's called) wouldn't help here.

And it wouldn't be right for block-vbd - you should split your requests 
as late as possible, IMO.


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* Re: [Qemu-devel] [PATCH] honor IDE_DMA_BUF_SECTORS
  2009-03-26 11:45                 ` Stefano Stabellini
  2009-03-26 12:10                   ` Avi Kivity
@ 2009-03-26 22:42                   ` Christoph Hellwig
  2009-03-26 23:22                     ` Samuel Thibault
  1 sibling, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2009-03-26 22:42 UTC (permalink / raw)
  To: qemu-devel

On Thu, Mar 26, 2009 at 11:45:59AM +0000, Stefano Stabellini wrote:
> My problem is that my block driver has a size limit for read and write
> operations.

So fix your driver.  I assume the driver in question is a qemu
BlockDriver and not a kernel driver behind block-raw-posix.c?  At least
for Linux I couldn't think of a way to introduce such an arbitrary limit
anyway.


* Re: [Qemu-devel] [PATCH] honor IDE_DMA_BUF_SECTORS
  2009-03-26 19:40                               ` Avi Kivity
@ 2009-03-26 23:18                                 ` Samuel Thibault
  2009-03-27  9:52                                   ` Avi Kivity
  0 siblings, 1 reply; 29+ messages in thread
From: Samuel Thibault @ 2009-03-26 23:18 UTC (permalink / raw)
  To: qemu-devel

Avi Kivity, le Thu 26 Mar 2009 21:40:18 +0200, a écrit :
> Samuel Thibault wrote:
> >>>>>it should be centralized in the block layer instead of placing the
> >>>>>burden on all block format drivers ;)
> >>>>>     
> >>>>>          
> >>>>If other drivers need to do that, certainly.
> >>>>        
> >>>In our case the other driver is specific to Xen.
> >>>      
> >>I'm confused.  I can only count one driver which has limited dma size.
> >>    
> >
> >Then you are not looking at the same place as I am. The Xen tree is at
> >http://xenbits.xensource.com/git-http/qemu-xen-unstable.git
> >  
> 
> I wasn't looking at any tree.  I count one driver with limited DMA sizes 
> - block-vbd.  What's the other one?

Ah, I thought you understood that the posix driver has the same kind of
limitation (and qemu is actually _bugged_ in that regard).

> >I'm here just pointing out that the problem is not
> >_only_ in the xen-specific driver, but also in the posix driver, on any
> >OS that doesn't necessarily do all the work the caller asked for (which
> >is _allowed_ by POSIX).
> >  
> 
> But that's not limited DMA (or at least, not limited up-front).  And 
> it's easily corrected, place a while loop around preadv/pwritev, no need 
> to split a request a priori somewhere up the stack.

Sure, and I could do the same in the block-vbd driver, hence my
original remark "it should be centralized in the block layer instead of
placing the burden on all block format drivers".  Just to make sure: I'm
_not_ saying that should be done in the DMA code.  I said it should be
done in the block layer, shared by all block drivers.

> And it wouldn't be right for block-vbd - you should split your requests 
> as late as possible, IMO.

Why make it "late"?  Exposing the lower limits to let upper layers
decide how they should manage fragmentation usually gets better
performance.  (Note that in my case there is no system involved, so it's
really _not_ costly to do the fragmentation on the qemu side).

Samuel


* Re: [Qemu-devel] [PATCH] honor IDE_DMA_BUF_SECTORS
  2009-03-26 22:42                   ` Christoph Hellwig
@ 2009-03-26 23:22                     ` Samuel Thibault
  2009-03-27 10:02                       ` Avi Kivity
  0 siblings, 1 reply; 29+ messages in thread
From: Samuel Thibault @ 2009-03-26 23:22 UTC (permalink / raw)
  To: qemu-devel

Christoph Hellwig, le Thu 26 Mar 2009 23:42:29 +0100, a écrit :
> On Thu, Mar 26, 2009 at 11:45:59AM +0000, Stefano Stabellini wrote:
> > My problem is that my block driver has a size limit for read and write
> > operations.
> 
> So fix your driver.

And introduce code that could be shared with posix' fix for partial
reads/writes?

> I assume that driver in question is a qemu BlockDriver and not a
> kernel driver behind block-raw-posix.c?

Yes.

> At least for Linux I couldn't think of a way to introduce such an
> arbitrary limit anyway.

Nothing in Linux makes filesystems and devices systematically read/write
the amount of data that was requested.

Samuel


* Re: [Qemu-devel] [PATCH] honor IDE_DMA_BUF_SECTORS
  2009-03-26 23:18                                 ` Samuel Thibault
@ 2009-03-27  9:52                                   ` Avi Kivity
  2009-03-27 10:32                                     ` Samuel Thibault
  0 siblings, 1 reply; 29+ messages in thread
From: Avi Kivity @ 2009-03-27  9:52 UTC (permalink / raw)
  To: qemu-devel

Samuel Thibault wrote:
> Ah, I thought you understood that the posix driver has the same kind of
> limitation 

It's not the same limitation.  The posix driver has no limit on DMA 
size; it will happily transfer a gigabyte of data if you ask it to.


> (and qemu is actually _bugged_ in that regard).
>   

It has a bug in that it does not correctly interpret the return value of 
pread()/pwrite().  It's a minor bug, since no system supported by qemu 
will actually return a short read or write (I think), and in any case we 
hope disk errors are rare.  Nevertheless it should be fixed (it's an 
easy fix too).  However, implementing DMA limits like you propose 
(IDE_DMA_BUF_SECTORS) will not fix the bug, only reduce performance.

>   
>>> I'm here just pointing out that the problem is not
>>> _only_ in the xen-specific driver, but also in the posix driver, on any
>>> OS that doesn't necessarily do all the work the caller asked for (which
>>> is _allowed_ by POSIX).
>>>  
>>>       
>> But that's not limited DMA (or at least, not limited up-front).  And 
>> it's easily corrected, place a while loop around preadv/pwritev, no need 
>> to split a request a priori somewhere up the stack.
>>     
>
> Sure, and I could do the same in the block-vbd driver, thus then my
> original remark "it should be centralized in the block layer instead of
> placing the burden on all block format drivers".  Just to make sure: I'm
> _not_ saying that should be done in the DMA code.  I said it should be
> done in the block layer, shared by all block drivers.
>   

A generic fix will have to issue a new aio request.  block-raw-posix 
need not do that, just a while loop.


>> And it wouldn't be right for block-vbd - you should split your requests 
>> as late as possible, IMO.
>>     
>
> Why making it "late"?  Exposing the lower limits to let upper layers
> decide how they should manage fragmentation usually gets better
> performance.  (Note that in my case there is no system involved, so it's
> really _not_ costly to do the fragmentation on the qemu side).
>   

If ring entries can be more than a page (if the request is contiguous), 
then the limit can be expanded.  In other words, it's a worst-case 
limit, not a hard limit.  Exposing the worst case limit will lead to 
pessimistic choices.

That's how virtio-blk works; I don't know about xen vbd (it might not 
work due to the need to transfer grants?)

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* Re: [Qemu-devel] [PATCH] honor IDE_DMA_BUF_SECTORS
  2009-03-26 23:22                     ` Samuel Thibault
@ 2009-03-27 10:02                       ` Avi Kivity
  2009-03-27 10:36                         ` Samuel Thibault
  0 siblings, 1 reply; 29+ messages in thread
From: Avi Kivity @ 2009-03-27 10:02 UTC (permalink / raw)
  To: qemu-devel

Samuel Thibault wrote:
> Nothing in Linux makes filesystems and devices systematically read/write
> the amount of data that was requested.
>
>   

We're talking about the qemu block layer, not Linux.  You're proposing 
to take the posix API rules and apply them to the qemu block layer.  But 
posix read/write does not fit how DMA works.  A request should either 
complete fully, or fail, leaving the destination (disk or memory) in an 
undefined state.  Partial completions are impossible to implement 
efficiently, and are a needless complication.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* Re: [Qemu-devel] [PATCH] honor IDE_DMA_BUF_SECTORS
  2009-03-27  9:52                                   ` Avi Kivity
@ 2009-03-27 10:32                                     ` Samuel Thibault
  2009-03-27 10:53                                       ` Avi Kivity
  0 siblings, 1 reply; 29+ messages in thread
From: Samuel Thibault @ 2009-03-27 10:32 UTC (permalink / raw)
  To: qemu-devel

Avi Kivity, le Fri 27 Mar 2009 12:52:15 +0300, a écrit :
> However implementing DMA limits like you propose 
> (IDE_DMA_BUF_SECTORS) will not fix the bug, only reduce performance.

Have you read what I wrote? v

> >Just to make sure: I'm _not_ saying that should be done in the DMA
> >code.  I said it should be done in the block layer, shared by all
> >block drivers.
> 
> A generic fix will have to issue a new aio request.

Ah, right, that's different from the case below.

> block-raw-posix need not do that, just a while loop.

For the synchronous interface, yes. For the asynchronous interface,
short reads could happen too... (unfortunately we only become aware of
that too late to queue other requests efficiently).

Samuel


* Re: [Qemu-devel] [PATCH] honor IDE_DMA_BUF_SECTORS
  2009-03-27 10:02                       ` Avi Kivity
@ 2009-03-27 10:36                         ` Samuel Thibault
  2009-03-27 10:58                           ` Avi Kivity
  0 siblings, 1 reply; 29+ messages in thread
From: Samuel Thibault @ 2009-03-27 10:36 UTC (permalink / raw)
  To: qemu-devel

Avi Kivity, le Fri 27 Mar 2009 13:02:49 +0300, a écrit :
> Samuel Thibault wrote:
> >Nothing in Linux makes filesystems and devices systematically read/write
> >the amount of data that was requested.
> 
> We're talking about the qemu block layer, not Linux.

Well, in the part I quoted before my sentence:

> At least for Linux I couldn't think of a way to introduce such an
> arbitrary limit anyway.

I believed you were talking about Linux never making short reads.

> You're proposing to take the posix API rules and apply them to the
> qemu block layer.

No.  I'm proposing that, because of the posix API rules, we should fix
our block layer.

> But posix read/write does not fit how DMA works.  A request should
> either complete fully, or fail, leaving the destination (disk or
> memory) in an undefined state.

Sure.  Again, please read what _I_ wrote: I'm not proposing to expose
that to the DMA code, just to have it handled generically in the block layer.

Samuel


* Re: [Qemu-devel] [PATCH] honor IDE_DMA_BUF_SECTORS
  2009-03-27 10:32                                     ` Samuel Thibault
@ 2009-03-27 10:53                                       ` Avi Kivity
  2009-03-27 13:45                                         ` Samuel Thibault
  0 siblings, 1 reply; 29+ messages in thread
From: Avi Kivity @ 2009-03-27 10:53 UTC (permalink / raw)
  To: qemu-devel

Samuel Thibault wrote:
> Avi Kivity, le Fri 27 Mar 2009 12:52:15 +0300, a écrit :
>   
>> However implementing DMA limits like you propose 
>> (IDE_DMA_BUF_SECTORS) will not fix the bug, only reduce performance.
>>     
>
> Have you read what I wrote? v
>   

You can't have a good flamewar if you read what people actually write.  
Maybe I'm confused by $subject.

>> block-raw-posix need not do that, just a while loop.
>>     
>
> For the synchronous interface, yes. For the asynchronous interface,
> short reads could happen too... (unfortunately we are aware of that too
> late for queuing other requests and getting efficiency).
>   

Currently block-raw-posix uses synchronous I/O even for asynchronous 
requests (using threads) so the fix is the same in both cases.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* Re: [Qemu-devel] [PATCH] honor IDE_DMA_BUF_SECTORS
  2009-03-27 10:36                         ` Samuel Thibault
@ 2009-03-27 10:58                           ` Avi Kivity
  0 siblings, 0 replies; 29+ messages in thread
From: Avi Kivity @ 2009-03-27 10:58 UTC (permalink / raw)
  To: qemu-devel

Samuel Thibault wrote:
>> But posix read/write does not fit how DMA works.  A request should
>> either complete fully, or fail, leaving the destination (disk or
>> memory) in an undefined state.
>>     
>
> Sure.  Again, please read what _I_ wrote: I'm not proposing to expose
> that into DMA, just make it genericly handled in the block layer.

Okay.  I oppose making short reads part of the API.  Callers should 
expect the return code to be either zero, or a failure indication.

If you want generic block layer support, it could come in the form of 
helpers; that is, a block format driver could call 
bdrv_continue_partial_request(lots, of, parameters) to have the helper 
advance the iovec and reissue the partially completed request.
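
No such helper exists today; purely as a sketch of the idea, it might
look something like this (the name comes from the paragraph above, the
parameter list and types are guesses):

/* Reissue the remainder of a request that completed only partially.
 * 'done' is the number of bytes already transferred; the helper skips
 * that much of 'qiov', submits a new aio request for the rest, and
 * calls 'cb' once the whole original request has been covered. */
BlockDriverAIOCB *bdrv_continue_partial_request(BlockDriverState *bs,
                                                int64_t sector_num,
                                                QEMUIOVector *qiov,
                                                size_t done, int is_write,
                                                BlockDriverCompletionFunc *cb,
                                                void *opaque);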

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* Re: [Qemu-devel] [PATCH] honor IDE_DMA_BUF_SECTORS
  2009-03-27 10:53                                       ` Avi Kivity
@ 2009-03-27 13:45                                         ` Samuel Thibault
  0 siblings, 0 replies; 29+ messages in thread
From: Samuel Thibault @ 2009-03-27 13:45 UTC (permalink / raw)
  To: qemu-devel

Avi Kivity, le Fri 27 Mar 2009 13:53:31 +0300, a écrit :
> Maybe I'm confused by $subject.

Ah, ok.

> >>block-raw-posix need not do that, just a while loop.
> >
> >For the synchronous interface, yes. For the asynchronous interface,
> >short reads could happen too... (unfortunately we are aware of that too
> >late for queuing other requests and getting efficiency).
> 
> Currently block-raw-posix uses synchronous I/O even for asynchronous 
> requests (using threads) so the fix is the same in both cases.

Ow, right, we are not using any native AIO interface actually.

(not only related to qemu: I'm depressed when seeing how AIO is
standardized/supported/used in the Unix world...)

Samuel

