* Re: [Lse-tech] Re: Preliminary results of using multiblock raw I/O
@ 2001-10-23 14:12 Shailabh Nagar
  2001-10-23 18:10 ` Jens Axboe
  0 siblings, 1 reply; 10+ messages in thread
From: Shailabh Nagar @ 2001-10-23 14:12 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Martin Frey, 'Reto Baettig', lse-tech, linux-kernel




>On Tue, Oct 23 2001, Martin Frey wrote:
>> >I haven't seen the SGI rawio patch, but I'm assuming it used kiobufs to
>> >pass a single unit of 1 meg down at the time. Yes currently we do incur
>> >significant overhead compared to that approach.
>> >
>> Yes, it used kiobufs to get a gather list, set up a gather DMA out
>> of that list and submitted it to the SCSI layer. Depending on
>> the controller, 1 MB could be transferred with 0 memcopies, 1 DMA,
>> 1 interrupt. 200 MB/s with 10% CPU load was really impressive.
>
>Let me repeat that the only difference between the kiobuf and the
>current approach is the overhead incurred on multiple __make_request
>calls. Given the current short queues, this isn't as bad as it used to
>be. Of course it isn't free, though.

The patch below attempts to address exactly that - reducing the number of
submit_bh/__make_request() calls made for raw I/O. The basic idea is to do
a major part of the I/O in page-sized blocks.

Comments on the idea?
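
For concreteness, here is a small standalone sketch of the offset arithmetic
the patch uses (illustrative only, not part of the patch; it assumes 4 KB
pages and a 512-byte sector size, and all names are made up):

#include <stdio.h>

#define SECTOR_SIZE 512UL
#define PAGE_SIZE   4096UL
#define PAGE_MASK   (~(PAGE_SIZE - 1))

/* Split a raw I/O request [off, off+size) into an unaligned head done in
 * sectors, a page-aligned middle done in PAGE_SIZE blocks, and an
 * unaligned tail done in sectors again. */
int main(void)
{
    unsigned long off  = 3 * SECTOR_SIZE;   /* 1536: not page aligned */
    unsigned long size = 3 * PAGE_SIZE;     /* 12288 bytes requested  */

    unsigned long startpg = (off + PAGE_SIZE - 1) & PAGE_MASK;
    unsigned long endpg   = (off + size) & PAGE_MASK;

    if (startpg >= endpg) {
        printf("all %lu bytes done in %lu-byte sectors\n", size, SECTOR_SIZE);
        return 0;
    }
    printf("head: %5lu bytes in %lu-byte sectors\n", startpg - off, SECTOR_SIZE);
    printf("mid : %5lu bytes in %lu-byte blocks\n", endpg - startpg, PAGE_SIZE);
    printf("tail: %5lu bytes in %lu-byte sectors\n", off + size - endpg, SECTOR_SIZE);
    return 0;
}

With off = 1536 and size = 12288 this reports a 2560-byte head, an 8192-byte
page-aligned middle and a 1536-byte tail, which together cover the request.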


diff -Naur linux-2.4.10-v/drivers/char/raw.c linux-2.4.10-rawvar/drivers/char/raw.c
--- linux-2.4.10-v/drivers/char/raw.c    Sat Sep 22 23:35:43 2001
+++ linux-2.4.10-rawvar/drivers/char/raw.c    Wed Oct 17 16:31:43 2001
@@ -283,6 +283,9 @@

     int       sector_size, sector_bits, sector_mask;
     int       max_sectors;
+
+    int             cursector_size, cursector_bits;
+    loff_t          startpg,endpg ;

     /*
      * First, a few checks on device size limits
@@ -304,8 +307,8 @@
     }

     dev = to_kdev_t(raw_devices[minor].binding->bd_dev);
-    sector_size = raw_devices[minor].sector_size;
-    sector_bits = raw_devices[minor].sector_bits;
+    sector_size = cursector_size = raw_devices[minor].sector_size;
+    sector_bits = cursector_bits = raw_devices[minor].sector_bits;
     sector_mask = sector_size- 1;
     max_sectors = KIO_MAX_SECTORS >> (sector_bits - 9);

@@ -325,6 +328,23 @@
     if ((*offp >> sector_bits) >= limit)
          goto out_free;

+    /* Using multiple I/O granularities
+       Divide <size> into <initial> <pagealigned> <final>
+       <initial> and <final> are done at sector_size granularity
+       <pagealigned> is done at PAGE_SIZE granularity
+       startpg, endpg define the boundaries of <pagealigned>.
+       They also serve as flags on whether PAGE_SIZE I/O is
+       done at all (its unnecessary if <size> is sufficiently small)
+    */
+
+    startpg = (*offp + (loff_t)(PAGE_SIZE - 1)) & (loff_t)PAGE_MASK ;
+    endpg = (*offp + (loff_t) size) & (loff_t)PAGE_MASK ;
+
+    if ((startpg == endpg) || (sector_size == PAGE_SIZE))
+         /* PAGE_SIZE I/O either unnecessary or being done anyway */
+         /* impossible values make startpg,endpg act as flags     */
+         startpg = endpg = ~(loff_t)0 ;
+
     /*
      * Split the IO into KIO_MAX_SECTORS chunks, mapping and
      * unmapping the single kiobuf as we go to perform each chunk of
@@ -332,9 +352,23 @@
      */

     transferred = 0;
-    blocknr = *offp >> sector_bits;
     while (size > 0) {
-         blocks = size >> sector_bits;
+
+         if (*offp  == startpg) {
+              cursector_size = PAGE_SIZE ;
+              cursector_bits = PAGE_SHIFT ;
+         }
+         else if (*offp == endpg) {
+              cursector_size = sector_size ;
+              cursector_bits = sector_bits ;
+         }
+
+         blocknr = *offp >> cursector_bits ;
+         max_sectors = KIO_MAX_SECTORS >> (cursector_bits - 9) ;
+         if (limit != INT_MAX)
+              limit = (((loff_t) blk_size[MAJOR(dev)][MINOR(dev)]) << BLOCK_SIZE_BITS) >> cursector_bits ;
+
+         blocks = size >> cursector_bits;
          if (blocks > max_sectors)
               blocks = max_sectors;
          if (blocks > limit - blocknr)
@@ -342,7 +376,7 @@
          if (!blocks)
               break;

-         iosize = blocks << sector_bits;
+         iosize = blocks << cursector_bits;

          err = map_user_kiobuf(rw, iobuf, (unsigned long) buf, iosize);
          if (err)
@@ -351,7 +385,7 @@
          for (i=0; i < blocks; i++)
               iobuf->blocks[i] = blocknr++;

-         err = brw_kiovec(rw, 1, &iobuf, dev, iobuf->blocks, sector_size);
+         err = brw_kiovec(rw, 1, &iobuf, dev, iobuf->blocks, cursector_size);

          if (rw == READ && err > 0)
               mark_dirty_kiobuf(iobuf, err);
@@ -360,6 +394,7 @@
               transferred += err;
               size -= err;
               buf += err;
+              *offp += err ;
          }

          unmap_kiobuf(iobuf);
@@ -369,7 +404,6 @@
     }

     if (transferred) {
-         *offp += transferred;
          err = transferred;
     }





* Re: [Lse-tech] Re: Preliminary results of using multiblock raw I/O
  2001-10-23 14:12 [Lse-tech] Re: Preliminary results of using multiblock raw I/O Shailabh Nagar
@ 2001-10-23 18:10 ` Jens Axboe
  0 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2001-10-23 18:10 UTC (permalink / raw)
  To: Shailabh Nagar
  Cc: Martin Frey, 'Reto Baettig', lse-tech, linux-kernel

On Tue, Oct 23 2001, Shailabh Nagar wrote:
> >On Tue, Oct 23 2001, Martin Frey wrote:
> >> >I haven't seen the SGI rawio patch, but I'm assuming it used kiobufs to
> >> >pass a single unit of 1 meg down at the time. Yes currently we do incur
> >> >significant overhead compared to that approach.
> >> >
> >> Yes, it used kiobufs to get a gather list, set up a gather DMA out
> >> of that list and submitted it to the SCSI layer. Depending on
> >> the controller, 1 MB could be transferred with 0 memcopies, 1 DMA,
> >> 1 interrupt. 200 MB/s with 10% CPU load was really impressive.
> >
> >Let me repeat that the only difference between the kiobuf and the
> >current approach is the overhead incurred on multiple __make_request
> >calls. Given the current short queues, this isn't as bad as it used to
> >be. Of course it isn't free, though.
> 
> The patch below attempts to address exactly that - reducing the number of
> submit_bh/__make_request() calls made for raw I/O. The basic idea is to do
> a major part of the I/O in page-sized blocks.
> 
> Comments on the idea?

Looks fine to me.

-- 
Jens Axboe



* Re: [Lse-tech] Re: Preliminary results of using multiblock raw I/O
  2001-10-23 17:49     ` Jens Axboe
@ 2001-10-23 18:04       ` Alan Cox
  0 siblings, 0 replies; 10+ messages in thread
From: Alan Cox @ 2001-10-23 18:04 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Alan Cox, Shailabh Nagar, Reto Baettig, lse-tech, linux-kernel

> Fine with me, the major reason for doing 255 sectors and not 256 was IDE
> of course... So feel free to change the default host max_sectors to 256.

The -ac tree uses 128 for IDE currently, I believe. I will double-check
before I tweak anything.


* Re: [Lse-tech] Re: Preliminary results of using multiblock raw I/O
  2001-10-23 16:23   ` Alan Cox
@ 2001-10-23 17:49     ` Jens Axboe
  2001-10-23 18:04       ` Alan Cox
  0 siblings, 1 reply; 10+ messages in thread
From: Jens Axboe @ 2001-10-23 17:49 UTC (permalink / raw)
  To: Alan Cox; +Cc: Shailabh Nagar, Reto Baettig, lse-tech, linux-kernel

On Tue, Oct 23 2001, Alan Cox wrote:
> > request the lower level driver can handle. This is typically 127kB, for
> > SCSI it can be as much as 512kB currently and depending on the SCSI
> 
> We really btw should make SCSI default to 128K - otherwise all the raid
> stuff tends to go 127K, 1K, 127K, 1K and has to handle partial stripe
> read/writes.

Fine with me, the major reason for doing 255 sectors and not 256 was IDE
of course... So feel free to change the default host max_sectors to 256.

-- 
Jens Axboe



* Re: [Lse-tech] Re: Preliminary results of using multiblock raw I/O
  2001-10-23  6:42 ` Jens Axboe
  2001-10-23  9:59   ` Martin Frey
@ 2001-10-23 16:23   ` Alan Cox
  2001-10-23 17:49     ` Jens Axboe
  1 sibling, 1 reply; 10+ messages in thread
From: Alan Cox @ 2001-10-23 16:23 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Shailabh Nagar, Reto Baettig, lse-tech, linux-kernel

> request the lower level driver can handle. This is typically 127kB, for
> SCSI it can be as much as 512kB currently and depending on the SCSI

We really btw should make SCSI default to 128K - otherwise all the raid
stuff tends to go 127K, 1K, 127K, 1K and has to handle partial stripe
read/writes.
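
To make the partial-stripe point concrete, here is a rough sketch (the 1K
block granularity and the 64K RAID chunk size are assumed examples; the
255/256-sector caps are the ones discussed above):

#include <stdio.h>

/* Requests are built from 1K blocks here (an assumption), capped by the
 * driver's max_sectors. A 255-sector cap fits only 127 whole 1K blocks,
 * so a 128K write goes down as 127K + 1K and straddles an assumed 64K
 * RAID chunk; a 256-sector cap keeps it one aligned 128K request. */
int main(void)
{
    const unsigned block = 1024;
    const unsigned chunk = 64 * 1024;
    const unsigned io    = 128 * 1024;
    const unsigned caps[] = { 255 * 512, 256 * 512 };

    for (int i = 0; i < 2; i++) {
        unsigned max = (caps[i] / block) * block;   /* whole blocks only */
        unsigned pos = 0;
        printf("cap %6u bytes:", caps[i]);
        while (pos < io) {
            unsigned len = (io - pos < max) ? io - pos : max;
            printf(" %uK%s", len / 1024,
                   (pos % chunk || len % chunk) ? "*" : "");
            pos += len;
        }
        printf("   (* = partial chunk -> read-modify-write)\n");
    }
    return 0;
}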



* Re: [Lse-tech] Re: Preliminary results of using multiblock raw I/O
@ 2001-10-23 14:05 Shailabh Nagar
  0 siblings, 0 replies; 10+ messages in thread
From: Shailabh Nagar @ 2001-10-23 14:05 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Reto Baettig, lse-tech, linux-kernel



>On Mon, Oct 22 2001, Shailabh Nagar wrote:
>>
>>
>> Unlike the SGI patch, the multiple block size patch continues to use buffer
>> heads. So the biggest atomic transfer request that can be seen by a device
>> driver with the multiblocksize patch is still 1 page.
>
>Not so. Given a 1MB contiguous request broken into 256 pages, even if
>submitted in these chunks it will be merged into the biggest possible
>request the lower level driver can handle. This is typically 127kB, for
>SCSI it can be as much as 512kB currently and depending on the SCSI
>driver even more maybe.

My mistake - by device driver I wasn't referring only to the lowest-level
drivers; I was also including the merging functionality.

>
>I haven't seen the SGI rawio patch, but I'm assuming it used kiobufs to
>pass a single unit of 1 meg down at the time. Yes currently we do incur
>significant overhead compared to that approach.
>
>> Getting bigger transfers would require a single buffer head to be able to
>> point to a multipage buffer or not use buffer heads at all.
>> The former would obviously be a major change and suitable only for 2.5
>> (perhaps as part of the much-awaited rewrite of the block I/O
>
>Ongoing effort.
>
>> subsystem). The use of multipage transfers using a single buffer head would
>> also help non-raw I/O transfers. I don't know if anyone is working along
>> those lines.
>
>It is being worked on.


Could you give some idea of what is being discussed/proposed?
It would be nice to know some of the details as they are being worked on.

Thanks,
Shailabh Nagar






* Re: [Lse-tech] Re: Preliminary results of using multiblock raw I/O
  2001-10-23  9:59   ` Martin Frey
@ 2001-10-23 10:02     ` Jens Axboe
  0 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2001-10-23 10:02 UTC (permalink / raw)
  To: Martin Frey
  Cc: 'Shailabh Nagar', 'Reto Baettig', lse-tech, linux-kernel

On Tue, Oct 23 2001, Martin Frey wrote:
> >I haven't seen the SGI rawio patch, but I'm assuming it used kiobufs to
> >pass a single unit of 1 meg down at the time. Yes currently we do incur
> >significant overhead compared to that approach.
> >
> Yes, it used kiobufs to get a gather list, set up a gather DMA out
> of that list and submitted it to the SCSI layer. Depending on
> the controller, 1 MB could be transferred with 0 memcopies, 1 DMA,
> 1 interrupt. 200 MB/s with 10% CPU load was really impressive.

Let me repeat that the only difference between the kiobuf and the
current approach is the overhead incurred on multiple __make_request
calls. Given the current short queues, this isn't as bad as it used to
be. Of course it isn't free, though.

It's still 0 mem copies, and it can be completed with 1 interrupt and 1 DMA
operation.

-- 
Jens Axboe



* RE: [Lse-tech] Re: Preliminary results of using multiblock raw I/O
  2001-10-23  6:42 ` Jens Axboe
@ 2001-10-23  9:59   ` Martin Frey
  2001-10-23 10:02     ` Jens Axboe
  2001-10-23 16:23   ` Alan Cox
  1 sibling, 1 reply; 10+ messages in thread
From: Martin Frey @ 2001-10-23  9:59 UTC (permalink / raw)
  To: 'Jens Axboe', 'Shailabh Nagar'
  Cc: 'Reto Baettig', lse-tech, linux-kernel

>I haven't seen the SGI rawio patch, but I'm assuming it used kiobufs to
>pass a single unit of 1 meg down at the time. Yes currently we do incur
>significant overhead compared to that approach.
>
Yes, it used kiobufs to get a gather list, set up a gather DMA out
of that list and submitted it to the SCSI layer. Depending on
the controller, 1 MB could be transferred with 0 memcopies, 1 DMA,
1 interrupt. 200 MB/s with 10% CPU load was really impressive.
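
As a toy model of what building the gather list amounts to (this is not the
kernel's kiobuf or scatterlist API, just the idea; all names are made up):

#include <stdio.h>

#define PAGE_SIZE 4096UL

/* Describe a virtually contiguous user buffer as per-page (page, offset,
 * length) entries; a controller can then DMA the whole list in one go,
 * so no bounce copies are needed. */
struct gather_entry {
    unsigned long page;     /* stand-in for a physical page number */
    unsigned int  offset;   /* where the data starts in that page  */
    unsigned int  length;   /* bytes of the transfer in that page  */
};

static int build_gather(unsigned long uaddr, unsigned long len,
                        struct gather_entry *ge, int max)
{
    int n = 0;
    while (len && n < max) {
        unsigned int off   = uaddr & (PAGE_SIZE - 1);
        unsigned int chunk = PAGE_SIZE - off;
        if (chunk > len)
            chunk = len;
        ge[n].page   = uaddr / PAGE_SIZE;   /* real code looks up the
                                               physical page here */
        ge[n].offset = off;
        ge[n].length = chunk;
        uaddr += chunk;
        len   -= chunk;
        n++;
    }
    return n;
}

int main(void)
{
    static struct gather_entry ge[260];
    int n = build_gather(0x40001200UL, 1024 * 1024, ge, 260);
    printf("1 MB request described by %d gather entries\n", n);
    return 0;
}

For a 1 MB buffer that starts mid-page this yields 257 entries, which is
roughly what the controller then consumes as a single scatter/gather DMA.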

Regards, Martin


* Re: [Lse-tech] Re: Preliminary results of using multiblock raw I/O
  2001-10-22 19:08 Shailabh Nagar
@ 2001-10-23  6:42 ` Jens Axboe
  2001-10-23  9:59   ` Martin Frey
  2001-10-23 16:23   ` Alan Cox
  0 siblings, 2 replies; 10+ messages in thread
From: Jens Axboe @ 2001-10-23  6:42 UTC (permalink / raw)
  To: Shailabh Nagar; +Cc: Reto Baettig, lse-tech, linux-kernel

On Mon, Oct 22 2001, Shailabh Nagar wrote:
> 
> 
> Unlike the SGI patch, the multiple block size patch continues to use buffer
> heads. So the biggest atomic transfer request that can be seen by a device
> driver with the multiblocksize patch is still 1 page.

Not so. Given a 1MB contiguous request broken into 256 pages, even if
submitted in these chunks it will be merged into the biggest possible
request the lower level driver can handle. This is typically 127kB, for
SCSI it can be as much as 512kB currently and depending on the SCSI
driver even more maybe.
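
As rough arithmetic (a sketch; 255 sectors and 512kB are the example limits
mentioned above):

#include <stdio.h>

/* 256 page-sized submissions for a 1 MB transfer still coalesce into a
 * handful of requests, bounded by the driver's max_sectors. */
int main(void)
{
    const unsigned long total = 1024 * 1024;        /* 1 MB             */
    const unsigned max_sectors[] = { 255, 1024 };   /* ~127.5kB, 512kB  */

    for (int i = 0; i < 2; i++) {
        unsigned long max  = (unsigned long)max_sectors[i] * 512;
        unsigned long nreq = (total + max - 1) / max;
        printf("max_sectors=%4u -> up to %6lu bytes per request, "
               "%lu requests per 1 MB\n", max_sectors[i], max, nreq);
    }
    return 0;
}

So the transfer leaves the block layer as about 9 requests at the typical
limit, or 2 at the larger SCSI limit, rather than as 256 page-sized ones.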

I haven't seen the SGI rawio patch, but I'm assuming it used kiobufs to
pass a single unit of 1 meg down at the time. Yes currently we do incur
significant overhead compared to that approach.

> Getting bigger transfers would require a single buffer head to be able to
> point to a multipage buffer or not use buffer heads at all.
> The former would obviously be a major change and suitable only for 2.5
> (perhaps as part of the much-awaited rewrite of the block I/O

Ongoing effort.

> subsystem). The use of multipage transfers using a single buffer head would
> also help non-raw I/O transfers. I don't know if anyone is working along
> those lines.

It is being worked on.

-- 
Jens Axboe



* Re: [Lse-tech] Re: Preliminary results of using multiblock raw I/O
@ 2001-10-22 19:08 Shailabh Nagar
  2001-10-23  6:42 ` Jens Axboe
  0 siblings, 1 reply; 10+ messages in thread
From: Shailabh Nagar @ 2001-10-22 19:08 UTC (permalink / raw)
  To: Reto Baettig; +Cc: lse-tech, linux-kernel



Unlike the SGI patch, the multiple block size patch continues to use buffer
heads. So the biggest atomic transfer request that can be seen by a device
driver with the multiblocksize patch is still 1 page.

Getting bigger transfers would require a single buffer head to be able to
point to a multipage buffer or not use buffer heads at all.
The former would obviously be a major change and suitable only for 2.5
(perhaps as part of the much-awaited rewrite of the block I/O
subsystem). The use of multipage transfers using a single buffer head would
also help non-raw I/O transfers. I don't know if anyone is working along
those lines.
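
Purely as a sketch of the "not use buffer heads at all" option (this is not
the 2.4 buffer_head and not any real 2.5 interface, just the shape of the
idea; every name here is hypothetical):

/* A descriptor that carries a vector of per-page segments instead of one
 * buffer_head per block, so a single submission can describe many pages. */
struct page;                        /* kernel page frame, opaque here */

struct io_segment {
    struct page *page;              /* one page of the transfer        */
    unsigned int offset;            /* where the data starts in it     */
    unsigned int length;            /* how many bytes of it are used   */
};

struct io_desc {
    int                rw;          /* READ or WRITE                   */
    unsigned long long sector;      /* starting device sector          */
    unsigned int       nr_segs;     /* entries in segs[]               */
    struct io_segment *segs;        /* scatter/gather-style vector     */
    void (*end_io)(struct io_desc *desc, int error); /* completion     */
};

With something like this, the block layer would no longer have to glue
page-sized buffer heads back together to build large requests.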

Incidentally, the multiple block size patch doesn't check whether the
device driver can handle large requests - that's on the todo list of
changes.


Shailabh Nagar
Enterprise Linux Group, IBM TJ Watson Research Center
(914) 945 2851, T/L 862 2851


Reto Baettig <baettig@scs.ch> wrote on 10/22/2001 03:50:16 AM:
Hi!

We had 200MB/s on 2.2.18 with the SGI raw patch and CPU-Load
approximately 10%.
On 2.4.3-12, we get 100MB/s with 100% CPU-Load. Is there a way of
getting even bigger transfers than one page for the aligned part? With
the SGI patch, there was much less waiting for I/O completion because
we could transfer 1 MB in one chunk. I'm sorry, but I don't have time at
the moment to test the patch; I will send you our numbers as soon as
we have some time.

Good to see somebody working on it! Thanks!

Reto

Shailabh Nagar wrote:
>
> Following up on the previous mail with patches for doing multiblock raw I/O:
>
> Experiments on a 2-way, 850MHz PIII, 256K cache, 256M memory
> Running bonnie (modified to allow specification of O_DIRECT option,
> target file etc.)
> Only the block tests (rewrite,read,write) have been run. All tests
> are single threaded.
>
> BW  = bandwidth in kB/s
> cpu = %CPU use
> abs = size of each I/O request
>       (NOT blocksize used by underlying raw I/O mechanism!)
>
> pre2 = using kernel 2.4.13-pre2aa1
> multi = 2.4.13-pre2aa1 kernel with multiblock raw I/O patches applied
>         (both /dev/raw and O_DIRECT)
>
>                   /dev/raw (uses 512 byte blocks)
>                ===============================
>
>          rewrite              write                   read
> ------------------------------------------------------------------
>      pre2      multi       pre2     multi         pre2     multi
> ------------------------------------------------------------------
> abs BW  cpu   BW  cpu     BW  cpu   BW  cpu      BW  cpu   BW  cpu
> ------------------------------------------------------------------
>  4k 884 0.5   882 0.1    1609 0.3  1609 0.2     9841 1.5  9841 0.9
>  6k 884 0.5   882 0.2    1609 0.5  1609 0.1     9841 1.8  9841 1.2
> 16k 884 0.6   882 0.2    1609 0.3  1609 0.0     9841 2.7  9841 1.4
> 18k 884 0.4   882 0.2    1609 0.4  1607 0.1     9841 2.4  9829 1.2
> 64k 883 0.5   882 0.1    1609 0.4  1609 0.3     9841 2.0  9841 0.6
> 66k 883 0.5   882 0.2    1609 0.5  1609 0.2     9829 3.4  9829 1.0
>
>                O_DIRECT : on filesystem with 1K blocksize
>             ===========================================
>
>          rewrite              write                   read
> ------------------------------------------------------------------
>      pre2      multi       pre2     multi         pre2     multi
> ------------------------------------------------------------------
> abs BW  cpu   BW  cpu     BW  cpu   BW  cpu      BW  cpu   BW  cpu
> ------------------------------------------------------------------
>  4k 854 0.8   880 0.4    1527 0.5  1607 0.1     9731 2.5  9780 1.3
>  6k 856 0.4   882 0.3    1527 0.4  1607 0.1     9732 1.6  9780 0.7
> 16k 857 0.4   881 0.1    1527 0.3  1608 0.0     9732 2.2  9780 1.2
> 18k 857 0.3   882 0.2    1527 0.4  1607 0.1     9731 1.9  9780 1.0
> 64k 857 0.3   881 0.1    1526 0.4  1607 0.2     9732 1.6  9780 1.6
> 66k 856 0.4   882 0.2    1527 0.4  1607 0.2     9731 2.7  9780 1.2
>




