linux-kernel.vger.kernel.org archive mirror
* Warning - running *really* short on DMA buffers while doing file transfers
@ 2002-09-26  3:27 Mathieu Chouquet-Stringer
  2002-09-26  6:14 ` Jens Axboe
  0 siblings, 1 reply; 60+ messages in thread
From: Mathieu Chouquet-Stringer @ 2002-09-26  3:27 UTC (permalink / raw)
  To: linux-scsi, linux-kernel

	  Hello!

I upgraded to 2.4.19 a while ago and my box has been happy for the last 52
days (it's a dual PIII). Tonight, while going through my logs, I found
these:

Sep 25 22:18:41 bigip kernel: Warning - running *really* short on DMA buffers
Sep 25 22:18:47 bigip last message repeated 55 times
Sep 25 22:19:41 bigip last message repeated 71 times

I know where it's coming from (drivers/scsi/scsi_merge.c):
        /* scsi_malloc can only allocate in chunks of 512 bytes so
         * round it up.
         */
        SCpnt->sglist_len = (SCpnt->sglist_len + 511) & ~511;

        sgpnt = (struct scatterlist *) scsi_malloc(SCpnt->sglist_len);

	/*
         * Now fill the scatter-gather table.
         */
        if (!sgpnt) {
                /*
                 * If we cannot allocate the scatter-gather table, then
                 * simply write the first buffer all by itself.
                 */
                printk("Warning - running *really* short on DMA
		buffers\n");
                this_count = SCpnt->request.current_nr_sectors;
                goto single_segment;
        }

So I know that scsi_malloc failed; why, I don't know. I guess either the
sanity check if (len % SECTOR_SIZE != 0 || len > PAGE_SIZE) rejected the
request, or the freelist scan never found a free run, i.e.
if ((dma_malloc_freelist[i] & (mask << j)) == 0) never matched (both tests
come from scsi_dma.c).
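
If I understand it correctly, scsi_malloc hands out 512-byte sectors from a
small fixed DMA pool tracked by a per-page bitmap, so it can fail even when
/proc/meminfo shows plenty of free memory, simply because that pool is
exhausted. A simplified sketch of that kind of allocator (the names and the
pool size are mine, this is not the real scsi_dma.c code):

#define SECTOR_SIZE       512
#define PAGE_SIZE         4096                  /* assume 4K pages for the sketch */
#define SECTORS_PER_PAGE  (PAGE_SIZE / SECTOR_SIZE)
#define POOL_PAGES        16                    /* made-up pool size */

static unsigned long pool_freelist[POOL_PAGES]; /* one bit per sector, 1 = in use */
static char pool_pages[POOL_PAGES][PAGE_SIZE];  /* the DMA-able pool itself */

static void *toy_dma_malloc(unsigned int len)
{
        unsigned int nbits = len / SECTOR_SIZE;
        unsigned long mask = (1UL << nbits) - 1;
        int i, j;

        if (len == 0 || len % SECTOR_SIZE != 0 || len > PAGE_SIZE)
                return NULL;            /* request is malformed or too big */

        for (i = 0; i < POOL_PAGES; i++)
                for (j = 0; j <= SECTORS_PER_PAGE - (int) nbits; j++)
                        if ((pool_freelist[i] & (mask << j)) == 0) {
                                /* found nbits free sectors, mark them used */
                                pool_freelist[i] |= mask << j;
                                return pool_pages[i] + j * SECTOR_SIZE;
                        }

        return NULL;    /* pool exhausted -> "running *really* short" */
}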

It is easily reproducible though: I just have to start a huge file transfer
from a crappy ide drive (i.e. low throughput) to a scsi one and the result is
almost guaranteed.

I'm running 2.4.19 + freeswan, but I can easily get rid of freeswan if you
want me to (the kernel was compiled with gcc 2.96 20000731). The kernel has
nothing weird enabled (it runs netfilter and nfs).
I've got both ide and scsi drives with ext3 as the only fs. The scsi card
is an Adaptec AHA-2940U2/U2W (OEM version) and has only 3 devices
connected:
Attached devices: 
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: QUANTUM  Model: ATLAS 10K 18SCA  Rev: UCIE
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi0 Channel: 00 Id: 05 Lun: 00
  Vendor: TEAC     Model: CD-R55S          Rev: 1.0R
  Type:   CD-ROM                           ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 06 Lun: 00
  Vendor: HP       Model: C1537A           Rev: L706
  Type:   Sequential-Access                ANSI SCSI revision: 02

The ide chip is the regular Intel Corp. 82801AA IDE (rev 02).

Here is the output of /proc/slabinfo a few minutes after the last line in the
logs:

slabinfo - version: 1.1 (SMP)
kmem_cache            80     80    244    5    5    1 :  252  126
fib6_nodes             9    226     32    2    2    1 :  252  126
ip6_dst_cache         16     40    192    2    2    1 :  252  126
ndisc_cache            1     30    128    1    1    1 :  252  126
ip_conntrack          84    132    352   12   12    1 :  124   62
tcp_tw_bucket          1     30    128    1    1    1 :  252  126
tcp_bind_bucket       22    226     32    2    2    1 :  252  126
tcp_open_request       0      0     96    0    0    1 :  252  126
inet_peer_cache        3     59     64    1    1    1 :  252  126
ip_fib_hash           23    226     32    2    2    1 :  252  126
ip_dst_cache         150    216    160    9    9    1 :  252  126
arp_cache              6     60    128    2    2    1 :  252  126
blkdev_requests      896    920     96   23   23    1 :  252  126
journal_head         293    390     48    5    5    1 :  252  126
revoke_table           8    253     12    1    1    1 :  252  126
revoke_record          0      0     32    0    0    1 :  252  126
dnotify cache          0      0     20    0    0    1 :  252  126
file lock cache      168    168     92    4    4    1 :  252  126
fasync cache           0      0     16    0    0    1 :  252  126
uid_cache             10    226     32    2    2    1 :  252  126
skbuff_head_cache    392    720    160   30   30    1 :  252  126
sock                  97    148    928   37   37    1 :  124   62
sigqueue              29     29    132    1    1    1 :  252  126
cdev_cache            17    295     64    5    5    1 :  252  126
bdev_cache            12    118     64    2    2    1 :  252  126
mnt_cache             22    118     64    2    2    1 :  252  126
inode_cache         1390   2176    480  272  272    1 :  124   62
dentry_cache        1440   4620    128  154  154    1 :  252  126
filp                1455   1560    128   52   52    1 :  252  126
names_cache            2      2   4096    2    2    1 :   60   30
buffer_head        76691  77160     96 1929 1929    1 :  252  126
mm_struct            194    264    160   11   11    1 :  252  126
vm_area_struct      4303   4840     96  121  121    1 :  252  126
fs_cache             194    236     64    4    4    1 :  252  126
files_cache          130    153    416   17   17    1 :  124   62
signal_act           114    114   1312   38   38    1 :   60   30
size-131072(DMA)       0      0 131072    0    0   32 :    0    0
size-131072            0      0 131072    0    0   32 :    0    0
size-65536(DMA)        0      0  65536    0    0   16 :    0    0
size-65536             1      1  65536    1    1   16 :    0    0
size-32768(DMA)        0      0  32768    0    0    8 :    0    0
size-32768             3      3  32768    3    3    8 :    0    0
size-16384(DMA)        0      0  16384    0    0    4 :    0    0
size-16384            11     12  16384   11   12    4 :    0    0
size-8192(DMA)         0      0   8192    0    0    2 :    0    0
size-8192              9     10   8192    9   10    2 :    0    0
size-4096(DMA)         0      0   4096    0    0    1 :   60   30
size-4096             63     63   4096   63   63    1 :   60   30
size-2048(DMA)         0      0   2048    0    0    1 :   60   30
size-2048            252    282   2048  139  141    1 :   60   30
size-1024(DMA)         0      0   1024    0    0    1 :  124   62
size-1024            175    176   1024   44   44    1 :  124   62
size-512(DMA)          0      0    512    0    0    1 :  124   62
size-512             448    448    512   56   56    1 :  124   62
size-256(DMA)          0      0    256    0    0    1 :  252  126
size-256             263    270    256   18   18    1 :  252  126
size-128(DMA)          0      0    128    0    0    1 :  252  126
size-128            2128   2670    128   89   89    1 :  252  126
size-64(DMA)           0      0     64    0    0    1 :  252  126
size-64              718   2537     64   43   43    1 :  252  126
size-32(DMA)           0      0     32    0    0    1 :  252  126
size-32             1495  11413     32  101  101    1 :  252  126

And /proc/meminfo
        total:    used:    free:  shared: buffers:  cached:
Mem:  394948608 389390336  5558272        0 15466496 310026240
Swap: 806068224 33927168 772141056
MemTotal:       385692 kB
MemFree:          5428 kB
MemShared:           0 kB
Buffers:         15104 kB
Cached:         298916 kB
SwapCached:       3844 kB
Active:          22192 kB
Inactive:       335208 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       385692 kB
LowFree:          5428 kB
SwapTotal:      787176 kB
SwapFree:       754044 kB


So my question: known issue (like memory fragmentation) or bug (in which case
I would be glad to test any patches you want me to, or to give you
anything missing from this email)?

Regards, Mathieu.

Oh BTW, just one thing, I wanted to give the throughput of the ide drive
but it failed:
Sep 25 23:18:32 bigip kernel: hdb: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Sep 25 23:18:32 bigip kernel: hdb: dma_intr: error=0x40 { UncorrectableError }, LBAsect=102882, sector=102784

I read my logs every day, so I know for sure these messages are new (damn, it
doesn't look good)...

-- 
Mathieu Chouquet-Stringer              E-Mail : mathieu@newview.com
    It is exactly because a man cannot do a thing that he is a
                      proper judge of it.
                      -- Oscar Wilde


* Re: Warning - running *really* short on DMA buffers while doing file transfers
  2002-09-26  3:27 Warning - running *really* short on DMA buffers while doing file transfers Mathieu Chouquet-Stringer
@ 2002-09-26  6:14 ` Jens Axboe
  2002-09-26  7:04   ` Pedro M. Rodrigues
  0 siblings, 1 reply; 60+ messages in thread
From: Jens Axboe @ 2002-09-26  6:14 UTC (permalink / raw)
  To: Mathieu Chouquet-Stringer, linux-scsi, linux-kernel

On Wed, Sep 25 2002, Mathieu Chouquet-Stringer wrote:
> 	  Hello!
> 
> I upgraded to 2.4.19 a while ago and my box has been happy for the last 52
> days (it's a dual PIII). Tonight, while going through my logs, I found
> these:
> 
> Sep 25 22:18:41 bigip kernel: Warning - running *really* short on DMA buffers
> Sep 25 22:18:47 bigip last message repeated 55 times
> Sep 25 22:19:41 bigip last message repeated 71 times

This is fixed in 2.4.20-pre

> Oh BTW, just one thing, I wanted to give the throughput of the ide drive
> but it failed:
> Sep 25 23:18:32 bigip kernel: hdb: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> Sep 25 23:18:32 bigip kernel: hdb: dma_intr: error=0x40 { UncorrectableError }, LBAsect=102882, sector=102784

Yep, looks like the end of the road for that drive.

-- 
Jens Axboe



* Re: Warning - running *really* short on DMA buffers while doing file transfers
  2002-09-26  6:14 ` Jens Axboe
@ 2002-09-26  7:04   ` Pedro M. Rodrigues
  2002-09-26 15:31     ` Justin T. Gibbs
  0 siblings, 1 reply; 60+ messages in thread
From: Pedro M. Rodrigues @ 2002-09-26  7:04 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Mathieu Chouquet-Stringer, linux-scsi, linux-kernel

Jens Axboe wrote:
> On Wed, Sep 25 2002, Mathieu Chouquet-Stringer wrote:
> 
>>	  Hello!
>>
>>I upgraded to 2.4.19 a while ago and my box has been happy for the last 52
>>days (it's a dual PIII). Tonight, while going through my logs, I found
>>these:
>>
>>Sep 25 22:18:41 bigip kernel: Warning - running *really* short on DMA buffers
>>Sep 25 22:18:47 bigip last message repeated 55 times
>>Sep 25 22:19:41 bigip last message repeated 71 times
> 
> 
> This is fixed in 2.4.20-pre
> 
> 

    I reported this same problem some weeks ago - 
http://marc.theaimsgroup.com/?l=linux-kernel&m=103069116227685&w=2 . 
2.4.20pre kernels solved the error messages flooding the console, and 
improved things a bit, but system load still got very high and disk read 
and write performance was lousy. Adding more memory and using a 
completely different machine didn't help. What did? Changing the Adaptec 
scsi driver to aic7xxx_old. The performance was up 50% for writes and 
90% for reads, and the system load was acceptable. And I didn't even have 
to change the RedHat kernel (2.4.18-10) for a custom one. The storage 
was two external Arena raid boxes, btw.


Regards,
Pedro




* Re: Warning - running *really* short on DMA buffers while doing file transfers
  2002-09-26  7:04   ` Pedro M. Rodrigues
@ 2002-09-26 15:31     ` Justin T. Gibbs
  2002-09-27  6:13       ` Jens Axboe
  2002-09-27 12:28       ` Pedro M. Rodrigues
  0 siblings, 2 replies; 60+ messages in thread
From: Justin T. Gibbs @ 2002-09-26 15:31 UTC (permalink / raw)
  To: Pedro M. Rodrigues, Jens Axboe
  Cc: Mathieu Chouquet-Stringer, linux-scsi, linux-kernel

>     I reported this same problem some weeks ago -
> http://marc.theaimsgroup.com/?l=linux-kernel&m=103069116227685&w=2 .
> 2.4.20pre kernels solved the error messages flooding the console, and
> improved things a bit, but system load still got very high and disk read
> and write performance was lousy. Adding more memory and using a
> completely different machine didn't help. What did? Changing the Adaptec
> scsi driver to aic7xxx_old. The performance was up 50% for writes and
> 90% for reads, and the system load was acceptable. And I didn't even have
> to change the RedHat kernel (2.4.18-10) for a custom one. The storage was
> two external Arena raid boxes, btw.

I would be interested in knowing if reducing the maximum tag depth for
the driver improves things for you.  There is a large difference in the
defaults between the two drivers.  It has only recently come to my
attention that the SCSI layer per-transaction overhead is so high that
you can completely starve the kernel of resources if this setting is too
high.  For example, a 4GB system installing RedHat 7.3 could not even
complete an install on a 20 drive system with the default of 253 commands.
The latest version of the aic7xxx driver already sent to Marcelo drops the
default to 32.
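
(For anyone who wants to experiment without rebuilding, the per-target depth
can also be capped from the kernel command line with something along the
lines of

        aic7xxx=tag_info:{{32,32,32,32,32,32,32,32}}

- one value per target ID - but check the driver documentation shipped with
your kernel for the exact syntax and defaults.)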

--
Justin



* Re: Warning - running *really* short on DMA buffers while doing file transfers
  2002-09-26 15:31     ` Justin T. Gibbs
@ 2002-09-27  6:13       ` Jens Axboe
  2002-09-27  6:33         ` Matthew Jacob
  2002-09-27 12:28       ` Pedro M. Rodrigues
  1 sibling, 1 reply; 60+ messages in thread
From: Jens Axboe @ 2002-09-27  6:13 UTC (permalink / raw)
  To: Justin T. Gibbs
  Cc: Pedro M. Rodrigues, Mathieu Chouquet-Stringer, linux-scsi, linux-kernel

On Thu, Sep 26 2002, Justin T. Gibbs wrote:
> >     I reported this same problem some weeks ago -
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=103069116227685&w=2 .
> > 2.4.20pre kernels solved the error messages flooding the console, and
> > improved things a bit, but system load still got very high and disk read
> > and write performance was lousy. Adding more memory and using a
> > completely different machine didn't help. What did? Changing the Adaptec
> > scsi driver to aic7xxx_old. The performance was up 50% for writes and
> > 90% for reads, and the system load was acceptable. And I didn't even have
> > to change the RedHat kernel (2.4.18-10) for a custom one. The storage was
> > two external Arena raid boxes, btw.
> 
> I would be interested in knowing if reducing the maximum tag depth for
> the driver improves things for you.  There is a large difference in the
> defaults between the two drivers.  It has only recently come to my
> attention that the SCSI layer per-transaction overhead is so high that
> you can completely starve the kernel of resources if this setting is too
> high.  For example, a 4GB system installing RedHat 7.3 could not even
> complete an install on a 20 drive system with the default of 253 commands.
> The latest version of the aic7xxx driver already sent to Marcelo drops the
> default to 32.

2.4 layer is most horrible there, 2.5 at least gets rid of the old
scsi_dma crap. That said, 253 default depth is a bit over the top, no?

-- 
Jens Axboe, who always uses 4



* Re: Warning - running *really* short on DMA buffers while doing file transfers
  2002-09-27  6:13       ` Jens Axboe
@ 2002-09-27  6:33         ` Matthew Jacob
  2002-09-27  6:36           ` Jens Axboe
  0 siblings, 1 reply; 60+ messages in thread
From: Matthew Jacob @ 2002-09-27  6:33 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Justin T. Gibbs, Pedro M. Rodrigues, Mathieu Chouquet-Stringer,
	linux-scsi, linux-kernel


> 2.4 layer is most horrible there, 2.5 at least gets rid of the old
> scsi_dma crap. That said, 253 default depth is a bit over the top, no?

Why? Something like a large Hitachi 9*** storage system can take ~1000
tags w/o wincing.




* Re: Warning - running *really* short on DMA buffers while doing file transfers
  2002-09-27  6:33         ` Matthew Jacob
@ 2002-09-27  6:36           ` Jens Axboe
  2002-09-27  6:50             ` Matthew Jacob
  0 siblings, 1 reply; 60+ messages in thread
From: Jens Axboe @ 2002-09-27  6:36 UTC (permalink / raw)
  To: Matthew Jacob
  Cc: Justin T. Gibbs, Pedro M. Rodrigues, Mathieu Chouquet-Stringer,
	linux-scsi, linux-kernel

On Thu, Sep 26 2002, Matthew Jacob wrote:
> 
> > 2.4 layer is most horrible there, 2.5 at least gets rid of the old
> > scsi_dma crap. That said, 253 default depth is a bit over the top, no?
> 
> Why? Something like a large Hitachi 9*** storage system can take ~1000
> tags w/o wincing.

Yeah, I bet that most of the devices attached to aic7xxx controllers are
exactly such beasts.

I didn't say that 253 is a silly default for _everything_, I think it's
a silly default for most users.

-- 
Jens Axboe



* Re: Warning - running *really* short on DMA buffers while doing file transfers
  2002-09-27  6:36           ` Jens Axboe
@ 2002-09-27  6:50             ` Matthew Jacob
  2002-09-27  6:56               ` Jens Axboe
  0 siblings, 1 reply; 60+ messages in thread
From: Matthew Jacob @ 2002-09-27  6:50 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Justin T. Gibbs, Pedro M. Rodrigues, Mathieu Chouquet-Stringer,
	linux-scsi, linux-kernel


> > > scsi_dma crap. That said, 253 default depth is a bit over the top, no?
> > 
> > Why? Something like a large Hitachi 9*** storage system can take ~1000
> > tags w/o wincing.
> 
> Yeah, I bet that most of the devices attached to aic7xxx controllers are
> exactly such beasts.
> 
> I didn't say that 253 is a silly default for _everything_, I think it's
> a silly default for most users.
> 

Well, no, I'm not sure I agree. In the expected lifetime of this
particular set of software that gets shipped out, the next generation of
100GB or better disk drives will be attached, and they'll likely eat all
of that many tags too, and usefully, considering the speed and bit
density of drives. For example, the current U160 Fujitsu drives will
take ~130 tags before sending back a QFULL.

On the other hand, we can also find a large class of existing devices
and situations where anything over 4 tags is overload too.

With some perspective on this, I'd have to say that in the last 25 years
I've seen more errors on the side of 'too conservative' limits than
the opposite.

That said, the only problem with allowing such generous limits is the
impact on the system, which allows you to saturate as it does. Getting
that fixed is more important than saying a driver writer's choice for
limits is 'over the top'.



* Re: Warning - running *really* short on DMA buffers while doing file transfers
  2002-09-27  6:50             ` Matthew Jacob
@ 2002-09-27  6:56               ` Jens Axboe
  2002-09-27  7:18                 ` Matthew Jacob
  0 siblings, 1 reply; 60+ messages in thread
From: Jens Axboe @ 2002-09-27  6:56 UTC (permalink / raw)
  To: Matthew Jacob
  Cc: Justin T. Gibbs, Pedro M. Rodrigues, Mathieu Chouquet-Stringer,
	linux-scsi, linux-kernel

On Thu, Sep 26 2002, Matthew Jacob wrote:
> 
> > > > scsi_dma crap. That said, 253 default depth is a bit over the top, no?
> > > 
> > > Why? Something like a large Hitachi 9*** storage system can take ~1000
> > > tags w/o wincing.
> > 
> > Yeah, I bet that most of the devices attached to aic7xxx controllers are
> > exactly such beasts.
> > 
> > I didn't say that 253 is a silly default for _everything_, I think it's
> > a silly default for most users.
> > 
> 
> Well, no, I'm not sure I agree. In the expected lifetime of this
> particular set of software that gets shipped out, the next generation of
> 100GB or better disk drives will be attached, and they'll likely eat all
> of that many tags too, and usefully, considering the speed and bit
> density of drives. For example, the current U160 Fujitsu drives will
> take ~130 tags before sending back a QFULL.

Just because a device can eat XXX tags definitely does _not_ make doing
so a good idea. At least not if you care the slightest bit about
latency.

> On the other hand, we can also find a large class of existing devices
> and situations where anything over 4 tags is overload too.
> 
> With some perspective on this, I'd have to say that in the last 25 years
> I've seen more errors on the side of 'too conservative' for limits
> rather than the opposite.

At least for this tagging discussion, I'm of the exact opposite
opinion. What's the worst that can happen with a tag setting that is
too low? Theoretical loss of disk bandwidth. I say theoretical, because
it's not even given that tags are that much faster than the Linux io
scheduler. More tags might even give you _worse_ throughput, because you
end up leaving the io scheduler with much less to work on (if you have a
253 depth to your device, you have 3 requests left for the queue...).

So I think the 'more tags the better!' belief is very much bogus, at
least for the common case.

-- 
Jens Axboe



* Re: Warning - running *really* short on DMA buffers while doing file transfers
  2002-09-27  6:56               ` Jens Axboe
@ 2002-09-27  7:18                 ` Matthew Jacob
  2002-09-27  7:24                   ` Jens Axboe
  0 siblings, 1 reply; 60+ messages in thread
From: Matthew Jacob @ 2002-09-27  7:18 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Justin T. Gibbs, Pedro M. Rodrigues, Mathieu Chouquet-Stringer,
	linux-scsi, linux-kernel

> 
> So I think the 'more tags the better!' belief is very much bogus, at
> least for the common case.

Well, that's one theory.




* Re: Warning - running *really* short on DMA buffers while doing file transfers
  2002-09-27  7:18                 ` Matthew Jacob
@ 2002-09-27  7:24                   ` Jens Axboe
  2002-09-27  7:29                     ` Matthew Jacob
  0 siblings, 1 reply; 60+ messages in thread
From: Jens Axboe @ 2002-09-27  7:24 UTC (permalink / raw)
  To: Matthew Jacob
  Cc: Justin T. Gibbs, Pedro M. Rodrigues, Mathieu Chouquet-Stringer,
	linux-scsi, linux-kernel

On Fri, Sep 27 2002, Matthew Jacob wrote:
> > 
> > So I think the 'more tags the better!' belief is very much bogus, at
> > least for the common case.
> 
> Well, that's one theory.

Numbers talk, theory spinning walks

Both Andrew and I did latency numbers for even small depths of tagging,
and the result was not pretty. Sure this is just your regular plaino
SCSI drives, however that's also what I care most about. People with
big-ass hardware tend to find a way to tweak them as well, I'd like the
typical systems to run fine out of the box though.

-- 
Jens Axboe



* Re: Warning - running *really* short on DMA buffers while doing file transfers
  2002-09-27  7:24                   ` Jens Axboe
@ 2002-09-27  7:29                     ` Matthew Jacob
  2002-09-27  7:34                       ` Matthew Jacob
  0 siblings, 1 reply; 60+ messages in thread
From: Matthew Jacob @ 2002-09-27  7:29 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Justin T. Gibbs, Pedro M. Rodrigues, Mathieu Chouquet-Stringer,
	linux-scsi, linux-kernel



On Fri, 27 Sep 2002, Jens Axboe wrote:

> On Fri, Sep 27 2002, Matthew Jacob wrote:
> > > 
> > > So I think the 'more tags the better!' belief is very much bogus, at
> > > least for the common case.
> > 
> > Well, that's one theory.
> 
> Numbers talk, theory spinning walks
> 
> Both Andrew and I did latency numbers for even small depths of tagging,
> and the result was not pretty. Sure this is just your regular plaino
> SCSI drives, however that's also what I care most about. People with
> big-ass hardware tend to find a way to tweak them as well, I'd like the
> typical systems to run fine out of the box though.
> 

Fair enough. 






* Re: Warning - running *really* short on DMA buffers while doing file transfers
  2002-09-27  7:29                     ` Matthew Jacob
@ 2002-09-27  7:34                       ` Matthew Jacob
  2002-09-27  7:45                         ` Jens Axboe
  0 siblings, 1 reply; 60+ messages in thread
From: Matthew Jacob @ 2002-09-27  7:34 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Justin T. Gibbs, Pedro M. Rodrigues, Mathieu Chouquet-Stringer,
	linux-scsi, linux-kernel


The issue here is not whether it's appropriate to oversaturate the
'standard' SCSI drive- it isn't- I never suggested it was.

I'd just suggest that it's asinine to criticise an HBA for running up to
reasonable limits when it's the non-toy OS that will do sensible I/O
scheduling. So point your gums elsewhere.



On Fri, 27 Sep 2002, Matthew Jacob wrote:

> 
> 
> On Fri, 27 Sep 2002, Jens Axboe wrote:
> 
> > On Fri, Sep 27 2002, Matthew Jacob wrote:
> > > > 
> > > > So I think the 'more tags the better!' belief is very much bogus, at
> > > > least for the common case.
> > > 
> > > Well, that's one theory.
> > 
> > Numbers talk, theory spinning walks
> > 
> > Both Andrew and I did latency numbers for even small depths of tagging,
> > and the result was not pretty. Sure this is just your regular plaino
> > SCSI drives, however that's also what I care most about. People with
> > big-ass hardware tend to find a way to tweak them as well, I'd like the
> > typical systems to run fine out of the box though.
> > 
> 
> Fair enough. 
> 
> 
> 
> 
> 



* Re: Warning - running *really* short on DMA buffers while doing file transfers
  2002-09-27  7:34                       ` Matthew Jacob
@ 2002-09-27  7:45                         ` Jens Axboe
  2002-09-27  8:37                           ` Matthew Jacob
  0 siblings, 1 reply; 60+ messages in thread
From: Jens Axboe @ 2002-09-27  7:45 UTC (permalink / raw)
  To: Matthew Jacob
  Cc: Justin T. Gibbs, Pedro M. Rodrigues, Mathieu Chouquet-Stringer,
	linux-scsi, linux-kernel

On Fri, Sep 27 2002, Matthew Jacob wrote:
> 
> The issue here is not whether it's appropriate to oversaturate the
> 'standard' SCSI drive- it isn't- I never suggested it was.

Ok so we agree. I think our oversaturation thresholds are different,
though.

> I'd just suggest that it's asinine to criticise an HBA for running up to
> reasonable limits when it's the non-toy OS that will do sensible I/O
> scheduling. So point your gums elsewhere.

Well I don't think 253 is a reasonable limit, that was the whole point.
How can sane io scheduling ever prevent starvation in that case? I can't
point my gums elsewhere, this is where I'm seeing starvation.

-- 
Jens Axboe



* Re: Warning - running *really* short on DMA buffers while doing file transfers
  2002-09-27  7:45                         ` Jens Axboe
@ 2002-09-27  8:37                           ` Matthew Jacob
  2002-09-27 10:25                             ` Jens Axboe
  0 siblings, 1 reply; 60+ messages in thread
From: Matthew Jacob @ 2002-09-27  8:37 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Justin T. Gibbs, Pedro M. Rodrigues, Mathieu Chouquet-Stringer,
	linux-scsi, linux-kernel



> On Fri, Sep 27 2002, Matthew Jacob wrote:
> > 
> > The issue here is not whether it's appropriate to oversaturate the
> > 'standard' SCSI drive- it isn't- I never suggested it was.
> 
> Ok so we agree. I think our oversaturation thresholds are different,
> though.

I think we simply disagree as to where to put them. See below.

> 
> > I'd just suggest that it's asinine to criticise an HBA for running up to
> > reasonable limits when it's the non-toy OS that will do sensible I/O
> > scheduling. So point your gums elsewhere.
> 
> Well I don't think 253 is a reasonable limit, that was the whole point.
> How can sane io scheduling ever prevent starvation in that case? I can't
> point my gums elsewhere, this is where I'm seeing starvation.

You're in starvation because the I/O midlayer and buffer cache are
allowing you to build enough transactions on one bus to impact system
response times. This is an old problem with Linux that comes and goes
(as it has come and gone for most systems). There are a number of
possible solutions to this problem- but because this is in 2.4 perhaps
the most sensible one is to limit how much you *ask* from an HBA,
perhaps based upon even as vague a set of parameters as CPU speed and
available memory divided by the number of <n-scsibus/total spindles>.

It's the job of the HBA driver to manage resources on the HBA and on
the bus the HBA interfaces to. If the HBA and its driver can efficiently
manage 1000 concurrent commands per lun and 16384 luns per target and
500 'targets' in a fabric, let it.

Let oversaturation of a *spindle* be informed by device quirks and the
rate of QFULLs received, or even, if you will, by finding the knee in
the per-command latency curve (if you can and you think that it's
meaningful). Let oversaturation of the system be done elsewhere- let the
buffer cache manager and system policy knobs decide whether it matters
that the AIC driver is so busy moving I/O that the user can't get window
focus onto the window in NP-complete time to kill the runaway tar.

Sorry- an overlong response. It *is* easier to just say "well, 'fix' the
HBA driver so it doesn't allow the system to get too busy or
overloaded". But it seems to me that this is even best solved in the
midlayer which should, in fact, know best (better than a single HBA).

-matt




* Re: Warning - running *really* short on DMA buffers while doing file transfers
  2002-09-27  8:37                           ` Matthew Jacob
@ 2002-09-27 10:25                             ` Jens Axboe
  2002-09-27 12:18                               ` Matthew Jacob
  2002-09-27 13:30                               ` Justin T. Gibbs
  0 siblings, 2 replies; 60+ messages in thread
From: Jens Axboe @ 2002-09-27 10:25 UTC (permalink / raw)
  To: Matthew Jacob
  Cc: Justin T. Gibbs, Pedro M. Rodrigues, Mathieu Chouquet-Stringer,
	linux-scsi, linux-kernel

On Fri, Sep 27 2002, Matthew Jacob wrote:
> > > I'd just suggest that it's asinine to criticise an HBA for running up to
> > > reasonable limits when it's the non-toy OS that will do sensible I/O
> > > scheduling. So point your gums elsewhere.
> > 
> > Well I don't think 253 is a reasonable limit, that was the whole point.
> > How can sane io scheduling ever prevent starvation in that case? I can't
> > point my gums elsewhere, this is where I'm seeing starvation.
> 
> You're in starvation because the I/O midlayer and buffer cache are
> allowing you to build enough transactions on one bus to impact system
> response times. This is an old problem with Linux that comes and goes

That's one type of starvation, but that's one that can be easily
prevented by the os. Not a problem.

The starvation I'm talking about is the drive starving requests. Just
keep it moderately busy (10-30 outstanding tags), and a read can take a
long time to complete. The hba can try to prevent this from happening by
issuing every Xth request as an ordered tag, however the fact that we
even need to consider doing this suggests to me that something is
broken. That a drive can starve a single request for that long is _bad_.
Issuing every 1024th request as ordered helps absolutely zip for good
interactive behaviour. It might help on a single request basis, but good
latency feel typically requires more than that.

We _can_ try and prevent this type of starvation. If we encounter a
read, don't queue any more writes to the drive before it completes.
That's some simple logic that will probably help a lot. This is the type
of thing the deadline io scheduler tries to do. This is working around
broken drive firmware in my opinion, the drive shouldn't be starving
requests like that.
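
A toy sketch of the kind of dispatch rule I mean, just to illustrate the
idea (this is not the actual deadline io scheduler code, the names are made
up):

enum { PICK_NONE, PICK_READ, PICK_WRITE };

struct toy_queues {
        int queued_reads;       /* reads waiting to be dispatched */
        int queued_writes;      /* writes waiting to be dispatched */
        int reads_in_flight;    /* reads sent to the drive, not yet completed */
};

static int toy_pick_next(const struct toy_queues *q)
{
        /* reads always win over queued writes */
        if (q->queued_reads)
                return PICK_READ;

        /* hold writes back until the outstanding reads have completed */
        if (q->reads_in_flight)
                return PICK_NONE;

        if (q->queued_writes)
                return PICK_WRITE;

        return PICK_NONE;
}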

However, it's stupid to try and work around a problem if we can simply
prevent the problem in the first place. What is the problem? It's lots
of tags causing the drive to starve requests. Do we need lots of tags?
In my experience 4 tags is more than plenty for a typical drive, there's
simply _no_ benefit from going beyond that. It doesn't buy you extra
throughput, it doesn't buy you better io scheduling (au contraire). All
it gets you is lots of extra latency. So why would I want lots of tags
on a typical scsi drive? I don't.

> (as it has come and gone for most systems). There are a number of
> possible solutions to this problem- but because this is in 2.4 perhaps
> the most sensible one is to limit how much you *ask* from an HBA,
> perhaps based upon even as vague a set of parameters as CPU speed and
> available memory divided by the number of <n-scsibus/total spindles>.

This doesn't make much sense to me. Why would the CPU speed and
available memory impact this at all? We don't want to deplete system
resources (the case Justin mentioned), of course, but beyond that I
don't think it makes much sense.

> It's the job of the HBA driver to manage resources on the HBA and on
> the bus the HBA interfaces to. If the HBA and its driver can efficiently
> manage 1000 concurrent commands per lun and 16384 luns per target and
> 500 'targets' in a fabric, let it.

Yes, if the hba and its driver _and_ the target can _efficiently_ handle
it, I'm all for it. Again you seem to be comparing the typical scsi hard
drive to more esoteric devices. I'll note again that I'm not talking
about such devices.

> Let oversaturation of a *spindle* be informed by device quirks and the
> rate of QFULLs received, or even, if you will, by finding the knee in

If device quirks are that 90% (pulling this number out of my ass) of
scsi drives use pure internal sptf scheduling and thus heavily starve
requests, then why bother? Queue full contains no latency information.

> the per-command latency curve (if you can and you think that it's

I can try to get a decent default. 253 clearly isn't it, far from it.

> meaningful). Let oversaturation of the system be done elsewhere- let the
> buffer cache manager and system policy knobs decide whether it matters
> that the AIC driver is so busy moving I/O that the user can't get window
> focus onto the window in NP-complete time to kill the runaway tar.

This is not a problem with the vm flooding a spindle. We want it to be
flooded, the more we can shove into the io scheduler to work with, the
better chance it has of doing a good job.

> Sorry- an overlong response. It *is* easier to just say "well, 'fix' the
> HBA driver so it doesn't allow the system to get too busy or
> overloaded". But it seems to me that this is even best solved in the
> midlayer which should, in fact, know best (better than a single HBA).

Agrh. Who's saying 'fix' the hba driver? Either I'm not expressing
myself very clearly, or you are simply not reading what I write.

-- 
Jens Axboe



* Re: Warning - running *really* short on DMA buffers while doing file transfers
  2002-09-27 10:25                             ` Jens Axboe
@ 2002-09-27 12:18                               ` Matthew Jacob
  2002-09-27 12:54                                 ` Jens Axboe
  2002-09-27 13:30                               ` Justin T. Gibbs
  1 sibling, 1 reply; 60+ messages in thread
From: Matthew Jacob @ 2002-09-27 12:18 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Justin T. Gibbs, Pedro M. Rodrigues, Mathieu Chouquet-Stringer,
	linux-scsi, linux-kernel


[ .. all sorts of nice discussion, but not on our argument point ]
> 
> Agrh. Who's saying 'fix' the hba driver? Either I'm not expressing
> myself very clearly, or you are simply not reading what I write.

I (foolishly) leapt in when you said "253 is 'over the top'". You seemed
to imply that the aic7xxx driver was at fault and should be limiting the
amount it is sending out. My (mostly) only beef with what you've written
is with that implication- mainly as "don't send so many damned commands
if you think they're too many". If the finger pointing at aic7xxx is not
what you're implying, then this has been a waste of email bandwidth-
sorry.

-matt




* Re: Warning - running *really* short on DMA buffers while doing file transfers
  2002-09-26 15:31     ` Justin T. Gibbs
  2002-09-27  6:13       ` Jens Axboe
@ 2002-09-27 12:28       ` Pedro M. Rodrigues
  1 sibling, 0 replies; 60+ messages in thread
From: Pedro M. Rodrigues @ 2002-09-27 12:28 UTC (permalink / raw)
  To: Justin T. Gibbs
  Cc: Jens Axboe, Mathieu Chouquet-Stringer, linux-scsi, linux-kernel

Justin T. Gibbs wrote:
>>    I reported this same problem some weeks ago -
>>http://marc.theaimsgroup.com/?l=linux-kernel&m=103069116227685&w=2 .
>>2.4.20pre kernels solved the error messages flooding the console, and
>>improved things a bit, but system load still got very high and disk read
>>and write performance was lousy. Adding more memory and using a
>>completely different machine didn't help. What did? Changing the Adaptec
>>scsi driver to aic7xxx_old. The performance was up 50% for writes and
>>90% for reads, and the system load was acceptable. And I didn't even have
>>to change the RedHat kernel (2.4.18-10) for a custom one. The storage was
>>two external Arena raid boxes, btw.
> 
> 
> I would be interested in knowing if reducing the maximum tag depth for
> the driver improves things for you.  There is a large difference in the
> defaults between the two drivers.  It has only recently come to my
> attention that the SCSI layer per-transaction overhead is so high that
> you can completely starve the kernel of resources if this setting is too
> high.  For example, a 4GB system installing RedHat 7.3 could not even
> complete an install on a 20 drive system with the default of 253 commands.
> The latest version of the aic7xxx driver already sent to Marcelo drops the
> default to 32.
> 
> --
> Justin
> 
> 
> 

    I have a server available to test it, but the storage in question is 
already deployed. Yet, by luck (irony aside), I have a maintenance window 
this weekend for tuning and other matters, so I can decrease the maximum 
number of TCQ commands per device in the proper aic7xxx driver to 32 and 
report on the results. While trying to solve this problem I browsed 
RedHat's bugzilla, and there were several people burned by this 
problem. Hope this sorts it out for them.


/Pedro



* Re: Warning - running *really* short on DMA buffers while doing file transfers
  2002-09-27 12:18                               ` Matthew Jacob
@ 2002-09-27 12:54                                 ` Jens Axboe
  0 siblings, 0 replies; 60+ messages in thread
From: Jens Axboe @ 2002-09-27 12:54 UTC (permalink / raw)
  To: Matthew Jacob
  Cc: Justin T. Gibbs, Pedro M. Rodrigues, Mathieu Chouquet-Stringer,
	linux-scsi, linux-kernel

On Fri, Sep 27 2002, Matthew Jacob wrote:
> 
> [ .. all sorts of nice discussion, but not on our argument point ]
> > 
> > Agrh. Who's saying 'fix' the hba driver? Either I'm not expressing
> > myself very clearly, or you are simply not reading what I write.
> 
> I (foolishly) leapt in when you said "253 is 'over the top'". You seemed
> to imply that the aic7xxx driver was at fault and should be limiting the
> amount it is sending out. My (mostly) only beef with what you've written
> is with that implication- mainly as "don't send so many damned commands
> if you think they're too many". If the finger pointing at aic7xx is not
> what you're implying, then this has been a waste of email bandwidth-
> sorry.

It's not aimed at any specific hba driver, it could be any. 253 would be
over the top for any of them, it just so happens that aic7xxx has this
as the default :-)

So while it is definitely not the aic7xxx driver doing the starvation
(it's the device), the aic7xxx driver is (in my opinion) somewhat at
fault for setting it so high _as a default_.

Hopefully that's the end of this thread :)

-- 
Jens Axboe



* Re: Warning - running *really* short on DMA buffers while doing file transfers
  2002-09-27 10:25                             ` Jens Axboe
  2002-09-27 12:18                               ` Matthew Jacob
@ 2002-09-27 13:30                               ` Justin T. Gibbs
  2002-09-27 14:26                                 ` James Bottomley
                                                   ` (3 more replies)
  1 sibling, 4 replies; 60+ messages in thread
From: Justin T. Gibbs @ 2002-09-27 13:30 UTC (permalink / raw)
  To: Jens Axboe, Matthew Jacob
  Cc: Pedro M. Rodrigues, Mathieu Chouquet-Stringer, linux-scsi, linux-kernel

> The starvation I'm talking about is the drive starving requests. Just
> keep it moderately busy (10-30 outstanding tags), and a read can take a
> long time to complete.

As I tried to explain to Andrew just the other day, this is neither a
drive nor HBA problem.  You've essentially constructed a benchmark where
a single process can get so far ahead of the I/O subsystem in terms of
buffered writes that there is no choice but for there to be a long delay
for the device to handle your read.  Consider that because you are queuing,
the drive will completely fill its cache with write data that is pending
to hit the platters.  The number of transactions in the cache is marginally
dependent on the number of tags in use since that will affect the ability
of the controller to saturate the drive cache with write data.  Depending
on your drive, mode page settings, etc, the drive may allow your read to
pass the write, but many do not.  So you have to wait for the cache to
at least have space to handle your read and perhaps even have additional
write data flush before your read can even be started.  If you don't like
this behavior, which actually maximizes the throughput of the device, have
the I/O scheduler hold back a single process from creating such a large
backlog.  You can also read the SCSI spec and tune your disk to behave
differently.

Now consider the read case.  I maintain that any reasonable drive will
*always* outperform the OS's transaction reordering/elevator algorithms
for seek reduction.  This is the whole point of having high tag depths.
In all I/O studies that have been performed to date, reads far outnumber
writes *unless* you are creating an ISO image on your disk.  In my opinion
it is much more important to optimize for the more common, concurrent
read case, than it is for the sequential write case with intermittent
reads.  Of course, you can fix the latter case too without any change to
the driver's queue depth as outlined above.  Why not have your cake and
eat it too?

--
Justin


* Re: Warning - running *really* short on DMA buffers while doing file  transfers
  2002-09-27 13:30                               ` Justin T. Gibbs
@ 2002-09-27 14:26                                 ` James Bottomley
  2002-09-27 14:33                                   ` Jens Axboe
  2002-09-27 16:26                                   ` Justin T. Gibbs
  2002-09-27 14:30                                 ` Warning - running *really* short on DMA buffers while doing file transfers Jens Axboe
                                                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 60+ messages in thread
From: James Bottomley @ 2002-09-27 14:26 UTC (permalink / raw)
  To: Justin T. Gibbs
  Cc: Jens Axboe, Matthew Jacob, Pedro M. Rodrigues,
	Mathieu Chouquet-Stringer, linux-scsi, linux-kernel

gibbs@scsiguy.com said:
> Now consider the read case.  I maintain that any reasonable drive will
> *always* outperform the OS's transaction reordering/elevator
> algorithms for seek reduction.  This is the whole point of having high
> tag depths. In all I/O studies that have been performed to date, reads
> far outnumber writes *unless* you are creating an ISO image on your
> disk.  In my opinion it is much more important to optimize for the
> more common, concurrent read case, than it is for the sequential write
> case with intermittent reads.  Of course, you can fix the latter case
> too without any change to the driver's queue depth as outlined above.
> Why not have your cake and eat it too? 

But it's not just the drive's elevator that we depend on.  You have to 
transfer the data to the drive as well.  The worst case is SCSI-2 where all 
phases of the transfer except data are narrow and asynchronous.  We get 
abysmal performance in SCSI-2 if the OS gives us 16 contiguous 4k data chunks 
instead of one 64k one because of the high command setup overhead.

Even the protocols which can transfer the header at the same speed, like FC, 
benefit from having large data to header ratios in their frames.

Therefore, it is in SCSI's interest to have the OS merge requests if it can 
purely from the transport efficiency point of view.  Once we accept the 
necessity of having the OS do some elevator work it becomes detrimental to 
have this work repeated in the drive firmware.
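
Schematically, the merge the OS does boils down to a contiguity check on
adjacent requests; a toy version (not the real block layer code, the names
are made up):

struct toy_request {
        unsigned long sector;           /* start sector on the device */
        unsigned long nr_sectors;       /* length in sectors */
};

/* true if 'next' starts exactly where 'req' ends, so the two can be
 * issued to the drive as one larger transfer */
static int toy_back_mergeable(const struct toy_request *req,
                              const struct toy_request *next)
{
        return req->sector + req->nr_sectors == next->sector;
}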

I guess, however, that this issue will evaporate substantially once the 
aic7xxx driver uses ordered tags to represent the transaction integrity since 
the barriers will force the drive seek algorithm to follow the tag 
transmission order much more closely.

James




* Re: Warning - running *really* short on DMA buffers while doing file transfers
  2002-09-27 13:30                               ` Justin T. Gibbs
  2002-09-27 14:26                                 ` James Bottomley
@ 2002-09-27 14:30                                 ` Jens Axboe
  2002-09-27 17:19                                   ` Justin T. Gibbs
  2002-09-27 14:56                                 ` Rik van Riel
  2002-09-27 15:34                                 ` Matthew Jacob
  3 siblings, 1 reply; 60+ messages in thread
From: Jens Axboe @ 2002-09-27 14:30 UTC (permalink / raw)
  To: Justin T. Gibbs
  Cc: Matthew Jacob, Pedro M. Rodrigues, Mathieu Chouquet-Stringer,
	linux-scsi, linux-kernel

On Fri, Sep 27 2002, Justin T. Gibbs wrote:
> > The starvation I'm talking about is the drive starving requests. Just
> > keep it moderately busy (10-30 outstanding tags), and a read can take a
> > long time to complete.
> 
> As I tried to explain to Andrew just the other day, this is neither a
> drive nor HBA problem.  You've essentially constructed a benchmark where
> a single process can get so far ahead of the I/O subsystem in terms of
> buffered writes that there is no choice but for there to be a long delay
> for the device to handle your read.  Consider that because you are queuing,
> the drive will completely fill its cache with write data that is pending
> to hit the platters.  The number of transactions in the cache is marginally
> dependent on the number of tags in use since that will affect the ability
> of the controller to saturate the drive cache with write data.  Depending
> on your drive, mode page settings, etc, the drive may allow your read to
> pass the write, but many do not.  So you have to wait for the cache to
> at least have space to handle your read and perhaps even have additional
> write data flush before your read can even be started.  If you don't like
> this behavior, which actually maximizes the throughput of the device, have
> the I/O scheduler hold back a single process from creating such a large
> backlog.  You can also read the SCSI spec and tune your disk to behave
> differently.

If the above is what has been observed in the real world, then there
would be no problem. Let's say I have 32 tags pending, all writes. Now I
issue a read. Then I go ahead and throw my writes at the drive,
basically keeping it at 32 tags all the time. When will this read
complete? The answer is, well it might not within any reasonable time,
because the drive happily starves the read to get the best write
throughput.

The size of the dirty cache backlog, or whatever you want to call it,
does not matter _at all_. I don't know why both you and Matt keep
bringing that point up. The 'backlog' is just that, it will be
processed in due time. If a read comes in, the io scheduler will decide
it's the most important thing on earth. So I may have 1 gig of dirty
cache waiting to be flushed to disk, that _does not_ mean that the read
that now comes in has to wait for the 1 gig to be flushed first.

> Now consider the read case.  I maintain that any reasonable drive will
> *always* outperform the OS's transaction reordering/elevator algorithms
> for seek reduction.  This is the whole point of having high tag depths.

Well given that the drive has intimate knowledge of itself, then yes of
course it is the only one that can order any number of pending requests
most optimally. So the drive might provide the best layout of requests
when it comes to total seek time spent, and throughput. But
often at the cost of increased (sometimes much, see the trivial
examples given) latency.

However, I maintain that going beyond any reasonable number of tags for
a standard drive is *stupid*. The Linux io scheduler gets very good
performance without any queueing at all. Going from 4 to 64 tags gets
you very very little increase in performance, if any at all.

> In all I/O studies that have been performed to date, reads far outnumber
> writes *unless* you are creating an ISO image on your disk.  In my opinion

Well it's my experience that it's pretty balanced, at least for my own
workload. atime updates and compiles etc put a nice load on writes.

> it is much more important to optimize for the more common, concurrent
> read case, than it is for the sequential write case with intermittent
> reads.  Of course, you can fix the latter case too without any change to
> the driver's queue depth as outlined above.  Why not have your cake and
> eat it too?

If you care to show me this cake, I'd be happy to devour it. I see
nothing even resembling a solution to this problem in your email, except
from you above saying I should ignore it and optimize for 'the common'
concurrent read case.

It's pointless to argue that tagging is oh so great and always
outperforms the os io scheduler, and that we should just use 253 tags
because the drive knows best, when several examples have shown that this
is _not the case_.

-- 
Jens Axboe



* Re: Warning - running *really* short on DMA buffers while doing file transfers
  2002-09-27 14:26                                 ` James Bottomley
@ 2002-09-27 14:33                                   ` Jens Axboe
  2002-09-27 16:26                                   ` Justin T. Gibbs
  1 sibling, 0 replies; 60+ messages in thread
From: Jens Axboe @ 2002-09-27 14:33 UTC (permalink / raw)
  To: James Bottomley
  Cc: Justin T. Gibbs, Matthew Jacob, Pedro M. Rodrigues,
	Mathieu Chouquet-Stringer, linux-scsi, linux-kernel

On Fri, Sep 27 2002, James Bottomley wrote:
> Therefore, it is in SCSI's interest to have the OS merge requests if it can 
> purely from the transport efficiency point of view.  Once we accept the 
> necessity of having the OS do some elevator work it becomes detrimental to 
> have this work repeated in the drive firmware.

Hear, hear. And given that the os io scheduler (I prefer to call it
that, elevator is pretty far from the truth :-) gets so close to a drive's
optimal performance in most cases, a small tag depth makes sense and
protects us from the latency concerns.

> I guess, however, that this issue will evaporate substantially once
> the aic7xxx driver uses ordered tags to represent the transaction
> integrity since the barriers will force the drive seek algorithm to
> follow the tag transmission order much more closely.

Depends on how often you issue these ordered tags, but yes I hope so
too.

-- 
Jens Axboe



* Re: Warning - running *really* short on DMA buffers while doing file transfers
  2002-09-27 13:30                               ` Justin T. Gibbs
  2002-09-27 14:26                                 ` James Bottomley
  2002-09-27 14:30                                 ` Warning - running *really* short on DMA buffers while doing file transfers Jens Axboe
@ 2002-09-27 14:56                                 ` Rik van Riel
  2002-09-27 15:34                                 ` Matthew Jacob
  3 siblings, 0 replies; 60+ messages in thread
From: Rik van Riel @ 2002-09-27 14:56 UTC (permalink / raw)
  To: Justin T. Gibbs
  Cc: Jens Axboe, Matthew Jacob, Pedro M. Rodrigues,
	Mathieu Chouquet-Stringer, linux-scsi, linux-kernel

On Fri, 27 Sep 2002, Justin T. Gibbs wrote:

> writes *unless* you are creating an ISO image on your disk.  In my opinion
> it is much more important to optimize for the more common, concurrent
> read case, than it is for the sequential write case with intermittent
> reads.

You're missing the point.  The only reason the reads are
intermittent is that the application can't proceed until
the read is done and the read is being starved by writes.

If the read was serviced immediately, the next read could
get scheduled quickly and they wouldn't be intermittent.

Rik
-- 
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/		http://distro.conectiva.com/

Spamtraps of the month:  september@surriel.com trac@trac.org



* Re: Warning - running *really* short on DMA buffers while doing file transfers
  2002-09-27 13:30                               ` Justin T. Gibbs
                                                   ` (2 preceding siblings ...)
  2002-09-27 14:56                                 ` Rik van Riel
@ 2002-09-27 15:34                                 ` Matthew Jacob
  2002-09-27 15:37                                   ` Jens Axboe
  3 siblings, 1 reply; 60+ messages in thread
From: Matthew Jacob @ 2002-09-27 15:34 UTC (permalink / raw)
  To: Justin T. Gibbs
  Cc: Jens Axboe, Pedro M. Rodrigues, Mathieu Chouquet-Stringer,
	linux-scsi, linux-kernel


> If you don't like this behavior, which actually maximizes the
> throughput of the device, have the I/O scheduler hold back a single
> process from creating such a large backlog.


Justin and I are (for once) in 100% agreement.




* Re: Warning - running *really* short on DMA buffers while doing file transfers
  2002-09-27 15:34                                 ` Matthew Jacob
@ 2002-09-27 15:37                                   ` Jens Axboe
  2002-09-27 17:20                                     ` Justin T. Gibbs
  0 siblings, 1 reply; 60+ messages in thread
From: Jens Axboe @ 2002-09-27 15:37 UTC (permalink / raw)
  To: Matthew Jacob
  Cc: Justin T. Gibbs, Pedro M. Rodrigues, Mathieu Chouquet-Stringer,
	linux-scsi, linux-kernel

On Fri, Sep 27 2002, Matthew Jacob wrote:
> 
> > If you don't like this behavior, which actually maximizes the
> > throughput of the device, have the I/O scheduler hold back a single
> > process from creating such a large backlog.
> 
> 
> Justin and I are (for once) in 100% agreement.

Well Justin and you are both, it seems, missing the point.

I'm now saying for the 3rd time that there's zero problem in having a
huge dirty cache backlog. This is not the problem, please disregard any
reference to that. Count only the time spent servicing a read
request, _from when it enters the drive_ until it completes. The IO
scheduler is _not_ involved.

-- 
Jens Axboe



* Re: Warning - running *really* short on DMA buffers while doing file  transfers
  2002-09-27 14:26                                 ` James Bottomley
  2002-09-27 14:33                                   ` Jens Axboe
@ 2002-09-27 16:26                                   ` Justin T. Gibbs
  2002-09-27 17:21                                     ` James Bottomley
  2002-09-27 18:59                                     ` Warning - running *really* short on DMA buffers while doingfile transfers Andrew Morton
  1 sibling, 2 replies; 60+ messages in thread
From: Justin T. Gibbs @ 2002-09-27 16:26 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jens Axboe, Matthew Jacob, Pedro M. Rodrigues,
	Mathieu Chouquet-Stringer, linux-scsi, linux-kernel

> But it's not just the drive's elevator that we depend on.  You have to 
> transfer the data to the drive as well.  The worst case is SCSI-2 where
> all  phases of the transfer except data are narrow and asynchronous.  We
> get  abysmal performance in SCSI-2 if the OS gives us 16 contiguous 4k
> data chunks  instead of one 64k one because of the high command setup
> overhead.

Which part of the OS are you talking about?  In the case of writes,
the VM/Buffer cache should be deferring the retiring of dirty buffers
in the hopes that the writes become irrelevant.  That typically gives
ample time for writes to be combined.  I also do not believe that the
command overhead is as significant as you suggest.  I've personally seen
a non-packetized SCSI bus perform over 15K transactions per-second.  The
number moves to ~40-50k when you start using packetized transfers.  The
drives do this combining for you too, so other than command overhead
and perhaps having a cheap drive with a really slow IOP on it, this
really isn't an issue.

For reads, the OS is supposed to be doing read-ahead and the application
or the kernel should be performing async reads where appropriate.
Most applications have output that depends on input, but not input
decisions that rely on previous input so async I/O or I/O hints (madvise)
can be easily used.  Because of read-ahead, the OS should never send
16 4k contiguous reads to the I/O layer for the same application.
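
As a minimal userspace illustration of the kind of I/O hint referred to
here (madvise), a reader that maps a file can ask the kernel to start
read-ahead on the whole range before touching it.  This is a generic
sketch, not tied to any particular kernel version:

/*
 * Minimal sketch (illustrative only): map a file and hint the kernel
 * to read it ahead of use with madvise(MADV_WILLNEED).
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	struct stat st;
	void *p;
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s file\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0 || fstat(fd, &st) < 0) {
		perror(argv[1]);
		return 1;
	}
	p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	madvise(p, st.st_size, MADV_WILLNEED);	/* start read-ahead now */
	/* ... touch the data later, ideally after doing other work ... */
	munmap(p, st.st_size);
	close(fd);
	return 0;
}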
 
> Even the protocols which can transfer the header at the same speed, like
> FC,  benefit from having large data to header ratios in their frames.

Yes, small transactions require more processing overhead, but you can
only combine transactions that are contiguous.  See above on how the
OS should be optimizing the contiguous case anyway.

> Therefore, it is in SCSI's interest to have the OS merge requests if it
> can  purely from the transport efficiency point of view.  Once we accept
> the  necessity of having the OS do some elevator work it becomes
> detrimental to  have this work repeated in the drive firmware.

The OS elevator will never know all of the device characteristics that
the device knows.  This is why the device's elevator will always
outperform the OS's, assuming the OS isn't stupid about overcommitting writes.
That's what the argument is here.  Linux is aggressively committing writes
when it shouldn't.
 
> I guess, however, that this issue will evaporate substantially once the 
> aic7xxx driver uses ordered tags to represent the transaction integrity
> since  the barriers will force the drive seek algorithm to follow the tag 
> transmission order much more closely.

Hooks for sending ordered tags have been in the aic7xxx driver, at least
in FreeBSD's version, since '97.  As soon as the Linux cmd blocks have
such information it will be trivial to have the aic7xxx driver issue
the appropriate tag types.  But this misses the point.  Andrew's original
speculation was that writes were "passing reads" once the read was
submitted to the drive.  I would like to understand the evidence behind
that assertion since all drives I've worked with automatically give
a higher priority to read traffic than writes since writes can be buffered
but reads cannot.  Ordered tags only help if the driver is already not
doing what you want or if your writes must have a specific order for
data integrity.

--
Justin

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Warning - running *really* short on DMA buffers while doing file transfers
  2002-09-27 14:30                                 ` Warning - running *really* short on DMA buffers while doing file transfers Jens Axboe
@ 2002-09-27 17:19                                   ` Justin T. Gibbs
  2002-09-27 18:29                                     ` Rik van Riel
  0 siblings, 1 reply; 60+ messages in thread
From: Justin T. Gibbs @ 2002-09-27 17:19 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Matthew Jacob, Pedro M. Rodrigues, Mathieu Chouquet-Stringer,
	linux-scsi, linux-kernel

> If the above is what has been observed in the real world, then there
> would be no problem. Lets say I have 32 tags pending, all writes. Now I
> issue a read. Then I go ahead and through my writes at the drive,
> basically keeping it at 32 tags all the time. When will this read
> complete? The answer is, well it might not within any reasonable time,
> because the drive happily starves the read to get the best write
> throughput.

Just because you use 32 or 4 or 8 or whatever tags, you cannot know the
number of commands still in the drive's cache.  Have you turned off
the WCE bit on your drive and retested your latency numbers?

> The size of the dirty cache back log, or whatever you want to call it,
> does not matter _at all_. I don't know why both you and Matt keep
> bringing that point up. The 'back log' is just that, it will be
> processed in due time. If a read comes in, the io scheduler will decide
> it's the most important thing on earth. So I may have 1 gig of dirty
> cache waiting to be flushed to disk, that _does not_ mean that the read
> that now comes in has to wait for the 1 gig to be flushed first.

But it does matter.  If a single process can fill the drive's or array's
cache with silly write data as well as have all outstanding tags busy
on its writes, you will incur a significant delay.  No single process
should be allowed to do that.  It doesn't matter that the read becomes
the most important thing on earth to the OS, you can't take back what
you've already issued to the device.  Sorry.  It doesn't work that
way.

> However, I maintain that going beyond any reasonable number of tags for
> a standard drive is *stupid*. The Linux io scheduler gets very good
> performance without any queueing at all. Going from 4 to 64 tags gets
> you very very little increase in performance, if any at all.

Under what benchmarks?  HTTP load?  Squid, News, or mail simulations?
All I've seen are examples crafted to prove your point, and I don't
think they mirror real-world workloads.

>> In all I/O studies that have been performed to date, reads far outnumber
>> writes *unless* you are creating an ISO image on your disk.  In my
>> opinion
> 
> Well it's my experience that it's pretty balanced, at least for my own
> workload. atime updates and compiles etc put a nice load on writes.

These are very different from the "benchmark" I've seen used in this
discussion:

dd if=/dev/zero of=somefile bs=1M &
cat somefile

Have you actually timed some of your common activities (say a full
build of the Linux kernel w/modules) at different tag depths, with
or without write caching enabled, etc?
 
> If you care to show me this cake, I'd be happy to devour it. I see
> nothing even resembling a solution to this problem in your email, except
> from you above saying I should ignore it and optimize for 'the common'
> concurrent read case.

Take a look inside Tru64 (I believe there are a few papers about this)
to see how to use command response times to modulate device workload.
FreeBSD has several algorithms in its VM to prevent a single process
from holding onto too many dirty buffers.  FreeBSD, Solaris, Tru64,
even WindowsNT have effective algorithms for sanely retiring dirty
buffers without saturating the system.  All of these algorithms have
been discussed at length in conference papers.  You just need to go
do a google search.  None of these issues are new and the solutions
are not novel.
 
> It's pointless to argue that tagging is oh so great and always
> outperforms the os io scheduler, and that we should just use 253 tags
> because the drive knows best, when several examples have shown that this
> is _not the case_.

You are trying to solve these problems at the wrong level.

--
Justin

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Warning - running *really* short on DMA buffers while doing file transfers
  2002-09-27 15:37                                   ` Jens Axboe
@ 2002-09-27 17:20                                     ` Justin T. Gibbs
  0 siblings, 0 replies; 60+ messages in thread
From: Justin T. Gibbs @ 2002-09-27 17:20 UTC (permalink / raw)
  To: Jens Axboe, Matthew Jacob
  Cc: Pedro M. Rodrigues, Mathieu Chouquet-Stringer, linux-scsi, linux-kernel

> I'm now saying for the 3rd time, that there's zero problem in having a
> huge dirty cache backlog. This is not the problem, please disregard any
> reference to that. Count only the time spent for servicing a read
> request, _from when it enters the drive_ and until it completes. IO
> scheduler is _not_ involved.

On the drive?  That's all I've been saying.

--
Justin

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Warning - running *really* short on DMA buffers while doing file  transfers
  2002-09-27 16:26                                   ` Justin T. Gibbs
@ 2002-09-27 17:21                                     ` James Bottomley
  2002-09-27 18:56                                       ` Justin T. Gibbs
  2002-09-27 20:58                                       ` Warning - running *really* short on DMA buffers while doing file transfers Justin T. Gibbs
  2002-09-27 18:59                                     ` Warning - running *really* short on DMA buffers while doingfile transfers Andrew Morton
  1 sibling, 2 replies; 60+ messages in thread
From: James Bottomley @ 2002-09-27 17:21 UTC (permalink / raw)
  To: Justin T. Gibbs
  Cc: James Bottomley, Jens Axboe, Matthew Jacob, Pedro M. Rodrigues,
	Mathieu Chouquet-Stringer, linux-scsi, linux-kernel

> Which part of the OS are you talking about?

I'm not, I'm talking about the pure physical characteristics of the transport 
bus.

> I also do not believe that the command overhead is as significant as
> you suggest.  I've personally seen a non-packetized SCSI bus perform
> over 15K transactions per-second.

Well, let's assume the simplest setup possible: select + tag msg + 10 byte 
command + disconnect + reselect + status; that's 17 bytes async.  The maximum 
bus speed async narrow is about 4MB/s, so those 17 bytes take around 4us to 
transmit.  On a wide Ultra2 bus, the data rate is about 80MB/s so it takes 
50us to transmit 4k or 800us to transmit 64k.  However, the major killer in 
this model is going to be disconnection delay at around 200us (dwarfing 
arbitration delay, bus settle time etc).  For 4k packets you spend about 3 
times longer arbitrating for the bus than you do transmitting data.  For 64k 
packets it's 25% of your data transmit time in arbitration.  Your theoretical 
throughput for 4k packets is thus 20MB/s.  In my book that's a significant 
loss on an 80MB/s bus.
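
Plugging those same rough figures (17 command bytes async, a ~200us
disconnection delay, an 80MB/s data rate) into a few lines of throwaway
C reproduces the shape of the argument; the constants below are the
estimates above, not measurements:

/* Back-of-the-envelope sketch of the overhead argument above.
 * All constants are the rough figures quoted, not measured values. */
#include <stdio.h>

int main(void)
{
	const double async_rate = 4e6;		/* narrow async, bytes/s  */
	const double data_rate  = 80e6;		/* wide Ultra2, bytes/s   */
	const double cmd_bytes  = 17;		/* select+tag+CDB+status  */
	const double disc_delay = 200e-6;	/* disconnection delay, s */
	const double sizes[]    = { 4096, 65536 };
	int i;

	for (i = 0; i < 2; i++) {
		double overhead = cmd_bytes / async_rate + disc_delay;
		double data     = sizes[i] / data_rate;
		double thruput  = sizes[i] / (overhead + data);
		printf("%6.0fB: data %.0fus, overhead %.0fus, ~%.1f MB/s\n",
		       sizes[i], data * 1e6, overhead * 1e6, thruput / 1e6);
	}
	return 0;
}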

On Fabric busses, you move to the network model and collision probabilities 
which increase as the packet size goes down.

gibbs@scsiguy.com said:
> Because of read-ahead, the OS should never send 16 4k contiguous reads
> to the I/O layer for the same application.

Read-ahead is basically a very simplistic form of I/O scheduling.

> Hooks for sending ordered tags have been in the aic7xxx driver, at
> least in FreeBSD's version, since '97.  As soon as the Linux cmd
> blocks have such information it will be trivial to have the aic7xxx
> driver issue the appropriate tag types.

They already do in 2.5, see scsi_populate_tag_msg() in scsi.h.  This assumes 
you're using the generic tag queueing, which the aic7xxx doesn't, but you 
could easily key the tag type off REQ_BARRIER.
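
For a driver that builds its own tag messages, keying the tag type off
the barrier flag could look roughly like the fragment below; the helper
itself is hypothetical and the field and flag names are from memory of
the 2.5 tree:

/*
 * Hypothetical helper, sketch only: choose the SCSI tag message byte
 * for a command.  SIMPLE_QUEUE_TAG (0x20) and ORDERED_QUEUE_TAG (0x22)
 * are the standard message codes from the SCSI headers; the
 * REQ_BARRIER test reflects the 2.5-era request flags.
 */
static inline u8 mydrv_tag_type(Scsi_Cmnd *cmd)
{
	if (cmd->request->flags & REQ_BARRIER)
		return ORDERED_QUEUE_TAG;
	return SIMPLE_QUEUE_TAG;
}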

> But this misses the point.  Andrew's original speculation was that
> writes were "passing reads" once the read was submitted to the drive.

The speculation is based on the observation that for transactions consisting 
of multiple writes and small reads, the reads take a long time to complete.  
That translates to starving a read in favour of a bunch of contiguous writes.  
I'm sure we've all seen SCSI drives indulge in this type of unfair behaviour 
(it does make sense to keep servicing writes if they're direct follow-ons 
from the previously serviced ones).

> I would like to understand the evidence behind that assertion since
> all drives I've worked with automatically give a higher priority to
> read traffic than writes since writes can be buffered but reads
> cannot.

The evidence is here:

http://marc.theaimsgroup.com/?l=linux-kernel&m=103302456113997&w=1

James



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Warning - running *really* short on DMA buffers while doing file transfers
  2002-09-27 17:19                                   ` Justin T. Gibbs
@ 2002-09-27 18:29                                     ` Rik van Riel
  0 siblings, 0 replies; 60+ messages in thread
From: Rik van Riel @ 2002-09-27 18:29 UTC (permalink / raw)
  To: Justin T. Gibbs
  Cc: Jens Axboe, Matthew Jacob, Pedro M. Rodrigues,
	Mathieu Chouquet-Stringer, linux-scsi, linux-kernel

On Fri, 27 Sep 2002, Justin T. Gibbs wrote:

> FreeBSD has several algorithms in its VM to prevent a single process
> from holding onto too many dirty buffers.  FreeBSD, Solaris, Tru64,
> even WindowsNT have effective algorithms for sanely retiring dirty
> buffers without saturating the system.

I guess those must be bad for dbench, bonnie or other critical
server applications ;)

*runs like hell*

Rik
-- 
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/		http://distro.conectiva.com/

Spamtraps of the month:  september@surriel.com trac@trac.org


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Warning - running *really* short on DMA buffers while doing file  transfers
  2002-09-27 17:21                                     ` James Bottomley
@ 2002-09-27 18:56                                       ` Justin T. Gibbs
  2002-09-27 19:07                                         ` Warning - running *really* short on DMA buffers while doingfile transfers Andrew Morton
  2002-09-27 20:58                                       ` Warning - running *really* short on DMA buffers while doing file transfers Justin T. Gibbs
  1 sibling, 1 reply; 60+ messages in thread
From: Justin T. Gibbs @ 2002-09-27 18:56 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jens Axboe, Matthew Jacob, Pedro M. Rodrigues,
	Mathieu Chouquet-Stringer, linux-scsi, linux-kernel

> Well, let's assume the simplest setup possible: select + tag msg + 10 byte 
> command + disconnect + reselect + status; that's 17 bytes async.  The
> maximum bus speed async narrow is about 4MB/s, so those 17 bytes take
> around 4us to transmit.  On a wide Ultra2 bus, the data rate is about
> 80MB/s so it takes 50us to transmit 4k or 800us to transmit 64k.
> However, the major killer in this model is going to be disconnection
> delay at around 200us (dwarfing arbitration delay, bus settle time etc).
> For 4k packets you spend about 3 times longer arbitrating for the bus
> than you do transmitting data.  For 64k packets it's 25% of your data
> transmit time in arbitration.  Your theoretical throughput for 4k
> packets is thus 20MB/s.  In my book that's a significant loss on an
> 80MB/s bus.

This only matters if your benchmark is dependent on round-trip latency
(no read-ahead or write behind and no command overlap) or if you have
saturated the bus.  None of these are the case with the single drive
I/O benchmarks that have been talked about in this thread.  I suppose
I should have been more specific in saying, "the command overhead is
not a factor in the issues raised by this thread".  Now if you want
to use command overhead as a reason to use tagged queuing to mitigate
that overhead, by all means, go right ahead.

>> Hooks for sending ordered tags have been in the aic7xxx driver, at
>> least in FreeBSD's version, since '97.  As soon as the Linux cmd
>> blocks have such information it will be trivial to have the aic7xxx
>> driver issue the appropriate tag types.
> 
> They already do in 2.5, see scsi_populate_tag_msg() in scsi.h.  This
> assumes  you're using the generic tag queueing, which the aic7xxx
> doesn't, but you  could easily key the tag type off REQ_BARRIER.

Okay.

>> But this misses the point.  Andrew's original speculation was that
>> writes were "passing reads" once the read was submitted to the drive.
> 
> The speculation is based on the observation that for transactions
> consisting  of multiple writes and small reads, the reads take a long
> time to complete.

I've seen evidence that a series of reads takes a long time to complete,
but nothing that indicates that every read is starved beyond what you
would expect to see if a huge number of writes were issued between each
read.

> That translates to starving a read in favour of a
> bunch of contiguous writes.   I'm sure we've all seen SCSI drives indulge
> in this type of unfair behaviour  (it does make sense to keep servicing
> writes if they're direct follow-ons from the previously serviced ones).

Actually I haven't.  The closest I can come to this is a single read way
off on the far side of the disk starved by a continuous stream of reads
on the other side of the platter.  This behavior was fixed by all major
drive manufacturers that I know of back in 97 or 98.

>> I would like to understand the evidence behind that assertion since
>> all drives I've worked with automatically give a higher priority to
>> read traffic than writes since writes can be buffered but reads
>> cannot.
> 
> The evidence is here:
> 
> http://marc.theaimsgroup.com/?l=linux-kernel&m=103302456113997&w=1

Which unfortunately characterizes only a single symptom without breaking
it down on a transaction by transaction basis.  We need to understand
how many writes were queued by the OS to the drive between each read to
know if the drive is actually allowing writes to pass reads or not.

--
Justin

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Warning - running *really* short on DMA buffers while doingfile   transfers
  2002-09-27 16:26                                   ` Justin T. Gibbs
  2002-09-27 17:21                                     ` James Bottomley
@ 2002-09-27 18:59                                     ` Andrew Morton
  1 sibling, 0 replies; 60+ messages in thread
From: Andrew Morton @ 2002-09-27 18:59 UTC (permalink / raw)
  To: Justin T. Gibbs
  Cc: James Bottomley, Jens Axboe, Matthew Jacob, Pedro M. Rodrigues,
	Mathieu Chouquet-Stringer, linux-scsi, linux-kernel

"Justin T. Gibbs" wrote:
> 
> ...
> The OS elevator will never know all of the device characteristics that
> the device knows.  This is why the device's elevator will always
> outperform the OS's, assuming the OS isn't stupid about overcommitting writes.
> That's what the argument is here.  Linux is aggressively committing writes
> when it shouldn't.

The VM really doesn't want to strangle itself because it might be
talking to a braindead SCSI drive.

I have a Fujitsu disk which allows newly submitted writes to
bypass already-submitted reads.  A read went in, and did not
come back for three seconds.  Ninety megabytes of writes went
into the disk during those three seconds.

From a whole-system performance viewpoint that is completely
broken behaviour.  It's unmanageable from the VM point of view.
At least, I don't want to have to manage it at that level.

We may be able to work around it by adding kludges to the IO
scheduler but it's easier to just set the tag depth to zero and
mutter rude words about clueless firmware developers.

> > I guess, however, that this issue will evaporate substantially once the
> > aic7xxx driver uses ordered tags to represent the transaction integrity
> > since  the barriers will force the drive seek algorithm to follow the tag
> > transmission order much more closely.
> 
> Hooks for sending ordered tags have been in the aic7xxx driver, at least
> in FreeBSD's version, since '97.  As soon as the Linux cmd blocks have
> such information it will be trivial to have the aic7xxx driver issue
> the appropriate tag types.  But this misses the point.  Andrew's original
> speculation was that writes were "passing reads" once the read was
> submitted to the drive.  I would like to understand the evidence behind
> that assertion since all drives I've worked with automatically give
> a higher priority to read traffic than writes since writes can be buffered
> but reads cannot.

Could be that the Fujitsu is especially broken.  I observed the three
second read latency with 253 tags (OK, that's 128 megabytes worth).
But with the driver limited to four tags, latency was two seconds.
Hence my speculation.

>  Ordered tags only help if the driver is already not
> doing what you want or if your writes must have a specific order for
> data integrity.

Is it possible to add a tag to a read which says "may not be bypassed
by writes"?  That would be OK, as long as the driver is only set up
to use a tag depth of four or so.

To use larger tag depths, we'd need to be able to tag newly incoming
reads with a "do this before servicing already-submitted writes"
attribute.  Is anything like that available?

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Warning - running *really* short on DMA buffers while doingfile   transfers
  2002-09-27 18:56                                       ` Justin T. Gibbs
@ 2002-09-27 19:07                                         ` Andrew Morton
  2002-09-27 19:16                                           ` Justin T. Gibbs
  0 siblings, 1 reply; 60+ messages in thread
From: Andrew Morton @ 2002-09-27 19:07 UTC (permalink / raw)
  To: Justin T. Gibbs
  Cc: James Bottomley, Jens Axboe, Matthew Jacob, Pedro M. Rodrigues,
	Mathieu Chouquet-Stringer, linux-scsi, linux-kernel

"Justin T. Gibbs" wrote:
> 
> ...
> > The evidence is here:
> >
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=103302456113997&w=1
> 
> Which unfortunately characterizes only a single symptom without breaking
> it down on a transaction by transaction basis.  We need to understand
> how many writes were queued by the OS to the drive between each read to
> know if the drive is actually allowing writes to pass reads or not.
> 

Given that I measured a two-second read latency with four tags,
that would be about 60 megabytes of write traffic after the
read was submitted.  Say, 120 requests.  That's with a tag
depth of four.

Not sure how old the disk is.  It's a 36G Fujitsu SCA-2.  Manufactured
in 2000, perhaps??

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Warning - running *really* short on DMA buffers while doingfile transfers
  2002-09-27 19:07                                         ` Warning - running *really* short on DMA buffers while doingfile transfers Andrew Morton
@ 2002-09-27 19:16                                           ` Justin T. Gibbs
  2002-09-27 19:36                                             ` Warning - running *really* short on DMA buffers while doingfiletransfers Andrew Morton
  0 siblings, 1 reply; 60+ messages in thread
From: Justin T. Gibbs @ 2002-09-27 19:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: James Bottomley, Jens Axboe, Matthew Jacob, Pedro M. Rodrigues,
	Mathieu Chouquet-Stringer, linux-scsi, linux-kernel

>> Which unfortunately characterizes only a single symptom without breaking
>> it down on a transaction by transaction basis.  We need to understand
>> how many writes were queued by the OS to the drive between each read to
>> know if the drive is actually allowing writes to pass reads or not.
>> 
> 
> Given that I measured a two-second read latency with four tags,
> that would be about 60 megabytes of write traffic after the
> read was submitted.  Say, 120 requests.  That's with a tag
> depth of four.

I still don't follow your reasoning.  Your benchmark indicates the
latency for several reads (cat kernel/*.c), not the per-read latency.
The two are quite different and unless you know the per-read latency and
whether it was affected by filling the drive's entire cache with
pent up writes (again these are writes that are above and beyond
those still assigned tags) you are still speculating that writes
are passing reads.

If you can tell me exactly how you ran your benchmark, I'll find the
information I want by using a SCSI bus analyzer to sniff the traffic
on the bus.

--
Justin

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Warning - running *really* short on DMA buffers while  doingfiletransfers
  2002-09-27 19:16                                           ` Justin T. Gibbs
@ 2002-09-27 19:36                                             ` Andrew Morton
  2002-09-27 19:52                                               ` Justin T. Gibbs
  2002-09-27 19:58                                               ` Andrew Morton
  0 siblings, 2 replies; 60+ messages in thread
From: Andrew Morton @ 2002-09-27 19:36 UTC (permalink / raw)
  To: Justin T. Gibbs
  Cc: James Bottomley, Jens Axboe, Matthew Jacob, Pedro M. Rodrigues,
	Mathieu Chouquet-Stringer, linux-scsi, linux-kernel

"Justin T. Gibbs" wrote:
> 
> >> Which unfortunately characterizes only a single symptom without breaking
> >> it down on a transaction by transaction basis.  We need to understand
> >> how many writes were queued by the OS to the drive between each read to
> >> know if the drive is actually allowing writes to pass reads or not.
> >>
> >
> > Given that I measured a two-second read latency with four tags,
> > that would be about 60 megabytes of write traffic after the
> > read was submitted.  Say, 120 requests.  That's with a tag
> > depth of four.
> 
> I still don't follow your reasoning.  Your benchmark indicates the
> latency for several reads (cat kernel/*.c), not the per-read latency.
> The two are quite different and unless you know the per-read latency and
> whether it was affected by filling the drive's entire cache with
> pent up writes (again these are writes that are above and beyond
> those still assigned tags) you are still speculating that writes
> are passing reads.
> 
> If you can tell me exactly how you ran your benchmark, I'll find the
> information I want by using a SCSI bus analyzer to sniff the traffic
> on the bus.

I did it by tracing.  4 meg printk buffer, teach printk to timestamp
its output, add tracing printk's, then stare at the output.

The patches are at

http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.38/2.5.38-mm1/experimental/printk-big-buf.patch
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.38/2.5.38-mm1/experimental/elevator-debug.patch

One sample trace is at
http://www.zip.com.au/~akpm/linux/patches/trace.txt

Watch the read of sector 528598.  It was inserted into the
elevator at 24989.185 seconds, was taken off the elevator by
the driver at 24989.186 seconds and was completed in bio_endio()
at 24992.273 seconds.  That trace was taken with 253 tags.  I
don't have a 4-tag trace handy but it was much the same, with
a two-second lag.

I am assuming that the driver submits the request to the disk
shortly after calling elv_next_request().  If I'm wrong, and
the driver itself is hanging onto the request for a significant
amount of time then the disk is not the source of the delay.
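
The trace points themselves amount to little more than a timestamped
printk at each stage; a rough fragment along these lines (a sketch,
not the actual patches linked above) conveys the idea:

/*
 * Sketch only, not the elevator-debug patch itself: log a timestamp,
 * an event name and the starting sector at insert, dispatch and
 * completion so read/write interleaving can be followed in the log.
 */
#include <linux/blkdev.h>
#include <linux/kernel.h>
#include <linux/sched.h>

static void trace_rq(const char *event, struct request *rq)
{
	unsigned long j = jiffies;

	printk("%5lu.%03lu %-8s sector %lu\n",
	       j / HZ, (j % HZ) * 1000 / HZ,
	       event, (unsigned long)rq->sector);
}

/* illustrative call sites:
 *	trace_rq("insert",   rq);	when the request enters the elevator
 *	trace_rq("dispatch", rq);	in elv_next_request()
 *	trace_rq("complete", rq);	at end_that_request_last()
 */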

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Warning - running *really* short on DMA buffers while  doingfiletransfers
  2002-09-27 19:36                                             ` Warning - running *really* short on DMA buffers while doingfiletransfers Andrew Morton
@ 2002-09-27 19:52                                               ` Justin T. Gibbs
  2002-09-27 21:13                                                 ` James Bottomley
  2002-09-27 19:58                                               ` Andrew Morton
  1 sibling, 1 reply; 60+ messages in thread
From: Justin T. Gibbs @ 2002-09-27 19:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: James Bottomley, Jens Axboe, Matthew Jacob, Pedro M. Rodrigues,
	Mathieu Chouquet-Stringer, linux-scsi, linux-kernel

> Watch the read of sector 528598.  It was inserted into the
> elevator at 24989.185 seconds, was taken off the elevator by
> the driver at 24989.186 seconds and was completed in bio_endio()
> at 24992.273 seconds.  That trace was taken with 253 tags.  I
> don't have a 4-tag trace handy but it was much the same, with
> a two-second lag.
> 
> I am assuming that the driver submits the request to the disk
> shortly after calling elv_next_request().  If I'm wrong, and
> the driver itself is hanging onto the request for a significant
> amount of time then the disk is not the source of the delay.

Since your drive cannot handle 253 tags, when saturated with commands,
a new command is never submitted to the drive directly.  Instead the
command waits in the aic7xxx driver's queue until space is available
on the device.  In FreeBSD, this never happens as tag depth is known
to, and adjusted by, the mid-layer.  In Linux I must report the
queue depth without having sufficient load or history with the device
to know anything about its capabilities so I have no choice but to
throttle internally should the device support fewer tags than initially
reported to the OS.  You can determine the actual device queue
depth from "cat /proc/scsi/aic7xxx/#".  Run a bunch of I/O first so
that the tag depth gets locked.
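
In outline (hypothetical names, not the actual aic7xxx code), the
throttling being described boils down to locking the advertised depth
down to whatever the device has actually accepted:

/*
 * Sketch only; every identifier here is made up for illustration.
 * Commands beyond the depth the device has actually accepted wait on
 * a driver-private list; a QUEUE FULL locks the depth down.
 */
#include <linux/list.h>

struct mydrv_dev {
	unsigned int	 active;	/* tags currently in the device   */
	unsigned int	 tag_depth;	/* depth learned from QUEUE FULLs */
	struct list_head deferred;	/* commands held back internally  */
};

static void mydrv_queue_full(struct mydrv_dev *dev)
{
	/* the device refused a tag: remember the depth it really
	 * accepted rather than the depth advertised to the OS */
	if (dev->active < dev->tag_depth)
		dev->tag_depth = dev->active ? dev->active : 1;
}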

--
Justin



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Warning - running *really* short on DMA buffers while  doingfiletransfers
  2002-09-27 19:36                                             ` Warning - running *really* short on DMA buffers while doingfiletransfers Andrew Morton
  2002-09-27 19:52                                               ` Justin T. Gibbs
@ 2002-09-27 19:58                                               ` Andrew Morton
  1 sibling, 0 replies; 60+ messages in thread
From: Andrew Morton @ 2002-09-27 19:58 UTC (permalink / raw)
  To: Justin T. Gibbs, James Bottomley, Jens Axboe, Matthew Jacob,
	Pedro M. Rodrigues, Mathieu Chouquet-Stringer, linux-scsi,
	linux-kernel

Andrew Morton wrote:
> 
> ...
> One sample trace is at
> http://www.zip.com.au/~akpm/linux/patches/trace.txt
> 

Another thing to note from that trace is that many writes
went through the entire submit_bio/elv_next_request/bio_endio
cycle between the submission and completion of the read.

So there was:

	submit_bio(the read)
	elv_next_request(the read)
	...
	submit_bio(a write)
	elv_next_request(that write)
	bio_endio(that write)
	...
	bio_endio(the read)

For many writes.  I'm fairly (but not 100%) sure that the same
behaviour was seen with four tags.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Warning - running *really* short on DMA buffers while doing file  transfers
  2002-09-27 17:21                                     ` James Bottomley
  2002-09-27 18:56                                       ` Justin T. Gibbs
@ 2002-09-27 20:58                                       ` Justin T. Gibbs
  2002-09-27 21:38                                         ` Patrick Mansfield
  1 sibling, 1 reply; 60+ messages in thread
From: Justin T. Gibbs @ 2002-09-27 20:58 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jens Axboe, Matthew Jacob, Pedro M. Rodrigues,
	Mathieu Chouquet-Stringer, linux-scsi, linux-kernel

>> Hooks for sending ordered tags have been in the aic7xxx driver, at
>> least in FreeBSD's version, since '97.  As soon as the Linux cmd
>> blocks have such information it will be trivial to have the aic7xxx
>> driver issue the appropriate tag types.
> 
> They already do in 2.5, see scsi_populate_tag_msg() in scsi.h.  This
> assumes  you're using the generic tag queueing, which the aic7xxx
> doesn't, but you  could easily key the tag type off REQ_BARRIER.

If anyone wants to play with the updated aic7xxx and aic79xx drivers
(new port to 2.5, plus it honors the otag stuff), you can pick it up
from here:



http://people.FreeBSD.org/~gibbs/linux/linux-2.5-aic79xxx.tar.gz

--
Justin

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Warning - running *really* short on DMA buffers while  doingfiletransfers
  2002-09-27 19:52                                               ` Justin T. Gibbs
@ 2002-09-27 21:13                                                 ` James Bottomley
  2002-09-27 21:18                                                   ` Matthew Jacob
  2002-09-27 21:28                                                   ` Justin T. Gibbs
  0 siblings, 2 replies; 60+ messages in thread
From: James Bottomley @ 2002-09-27 21:13 UTC (permalink / raw)
  To: Justin T. Gibbs
  Cc: Andrew Morton, James Bottomley, Jens Axboe, Matthew Jacob,
	Pedro M. Rodrigues, Mathieu Chouquet-Stringer, linux-scsi,
	linux-kernel

> Since your drive cannot handle 253 tags, when saturated with commands,
> a new command is never submitted to the drive directly.  Instead the
> command waits in the aic7xxx driver's queue until space is available
> on the device.  In FreeBSD, this never happens as tag depth is known
> to, and adjusted by, the mid-layer.  In Linux I must report the queue
> depth without having sufficient load or history with the device to
> know anything about its capabilities so I have no choice but to
> throttle internally should the device support fewer tags than
> initially reported to the OS.  You can determine the actual device
> queue depth from "cat /proc/scsi/aic7xxx/#".  Run a bunch of I/O first
> so that the tag depth gets locked. 

Linux is perfectly happy just to have you return 1 in queuecommand if the 
device won't accept the tag.  The can_queue parameter represents the maximum 
number of outstanding commands the mid-layer will ever send.  The mid-layer is 
happy to re-queue I/O below this limit if it cannot be accepted by the drive.  
In fact, that's more or less what queue plugging is about.

The only problem occurs if you return 1 from queuecommand with no other 
outstanding I/O for the device.

There should be no reason in 2.5 for a driver to have to implement an internal 
queue.
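
In sketch form that convention is simply the following (hypothetical
driver names, not any real HBA driver):

/*
 * Hypothetical queuecommand sketch: if the device has no tag slot
 * free, return 1 and let the mid-layer requeue and retry the command
 * when outstanding I/O completes.
 */
static int mydrv_queuecommand(Scsi_Cmnd *cmd, void (*done)(Scsi_Cmnd *))
{
	struct mydrv_dev *dev = (struct mydrv_dev *)cmd->device->hostdata;

	if (dev->active >= dev->accepted_depth)
		return 1;		/* device busy, please requeue */

	cmd->scsi_done = done;
	dev->active++;
	mydrv_send_to_hardware(dev, cmd);
	return 0;
}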

James



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Warning - running *really* short on DMA buffers while  doingfiletransfers
  2002-09-27 21:13                                                 ` James Bottomley
@ 2002-09-27 21:18                                                   ` Matthew Jacob
  2002-09-27 21:23                                                     ` James Bottomley
  2002-09-27 21:28                                                   ` Justin T. Gibbs
  1 sibling, 1 reply; 60+ messages in thread
From: Matthew Jacob @ 2002-09-27 21:18 UTC (permalink / raw)
  To: James Bottomley
  Cc: Justin T. Gibbs, Andrew Morton, Jens Axboe, Pedro M. Rodrigues,
	Mathieu Chouquet-Stringer, linux-scsi, linux-kernel

> 
> Linux is perfectly happy just to have you return 1 in queuecommand if the 
> device won't accept the tag.  The can_queue parameter represents the maximum 
> number of outstanding commands the mid-layer will ever send.  The mid-layer is 
> happy to re-queue I/O below this limit if it cannot be accepted by the drive.  
> In fact, that's more or less what queue plugging is about.
> 
> The only problem occurs if you return 1 from queuecommand with no other 
> outstanding I/O for the device.

Duh. There had been race conditions in the past which caused all of us
HBA writers to in fact start swallowing things like QFULL and maintaining
internal queues.

> 
> There should be no reason in 2.5 for a driver to have to implement an internal 
> queue.

That'd be swell.



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Warning - running *really* short on DMA buffers while  doingfiletransfers
  2002-09-27 21:18                                                   ` Matthew Jacob
@ 2002-09-27 21:23                                                     ` James Bottomley
  2002-09-27 21:29                                                       ` Justin T. Gibbs
                                                                         ` (3 more replies)
  0 siblings, 4 replies; 60+ messages in thread
From: James Bottomley @ 2002-09-27 21:23 UTC (permalink / raw)
  To: mjacob
  Cc: James Bottomley, Justin T. Gibbs, Andrew Morton, Jens Axboe,
	Pedro M. Rodrigues, Mathieu Chouquet-Stringer, linux-scsi,
	linux-kernel

mjacob@feral.com said:
> Duh. There had been race conditions in the past which caused all of us
> HBA writers to in fact start swallowing things like QFULL and
> maintaining internal queues. 

That was true of 2.2, 2.3 (and I think early 2.4) but it isn't true of late 
2.4 and 2.5

James



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Warning - running *really* short on DMA buffers while  doingfiletransfers
  2002-09-27 21:13                                                 ` James Bottomley
  2002-09-27 21:18                                                   ` Matthew Jacob
@ 2002-09-27 21:28                                                   ` Justin T. Gibbs
  2002-09-28 15:52                                                     ` James Bottomley
  2002-09-30 23:54                                                     ` Doug Ledford
  1 sibling, 2 replies; 60+ messages in thread
From: Justin T. Gibbs @ 2002-09-27 21:28 UTC (permalink / raw)
  To: James Bottomley
  Cc: Andrew Morton, Jens Axboe, Matthew Jacob, Pedro M. Rodrigues,
	Mathieu Chouquet-Stringer, linux-scsi, linux-kernel

> Linux is perfectly happy just to have you return 1 in queuecommand if the 
> device won't accept the tag.  The can_queue parameter represents the
> maximum  number of outstanding commands the mid-layer will ever send.
> The mid-layer is  happy to re-queue I/O below this limit if it cannot be
> accepted by the drive.   In fact, that's more or less what queue plugging
> is about.
> 
> The only problem occurs if you return 1 from queuecommand with no other 
> outstanding I/O for the device.
> 
> There should be no reason in 2.5 for a driver to have to implement an
> internal  queue.

Did this really get fixed in 2.5?  The internal queuing was completely
broken in 2.4.  Some of the known breakages were:

1) Device returns queue full with no outstanding commands from us
   (usually occurs in multi-initiator environments).

2) No delay after busy status so devices that will continually
   report BUSY if you hammer them with commands never come ready.

3) Queue is restarted as soon as any command completes even if
   you really need to throttle down the number of tags supported
   by the device.

4) No tag throttling.  If tag throttling is in 2.5, does it ever
   increment the tag depth to handle devices that report temporary
   resource shortages (Atlas II and III do this all the time, other
   devices usually do this only in multi-initiator environments).

5) Proper transaction ordering across a queue full.  The aic7xxx
   driver "requeues" all transactions that have not yet been sent
   to the device, replacing the transaction that experienced the queue
   full back at the head, so that ordering is maintained.

No thought was put into any of these issues in 2.4, so I decided not
to even think about trusting the mid-layer for this functionality.

--
Justin

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Warning - running *really* short on DMA buffers while  doingfiletransfers
  2002-09-27 21:23                                                     ` James Bottomley
@ 2002-09-27 21:29                                                       ` Justin T. Gibbs
  2002-09-27 21:32                                                       ` Matthew Jacob
                                                                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 60+ messages in thread
From: Justin T. Gibbs @ 2002-09-27 21:29 UTC (permalink / raw)
  To: James Bottomley, mjacob
  Cc: Andrew Morton, Jens Axboe, Pedro M. Rodrigues,
	Mathieu Chouquet-Stringer, linux-scsi, linux-kernel

> That was true of 2.2, 2.3 (and I think early 2.4) but it isn't true of
> late  2.4 and 2.5

I have seen 0 changes in 2.4 that indicate that it is safe to have the
mid-layer do queuing.

--
Justin


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Warning - running *really* short on DMA buffers while  doingfiletransfers
  2002-09-27 21:23                                                     ` James Bottomley
  2002-09-27 21:29                                                       ` Justin T. Gibbs
@ 2002-09-27 21:32                                                       ` Matthew Jacob
  2002-09-27 22:08                                                       ` Mike Anderson
  2002-09-30 23:49                                                       ` Doug Ledford
  3 siblings, 0 replies; 60+ messages in thread
From: Matthew Jacob @ 2002-09-27 21:32 UTC (permalink / raw)
  To: James Bottomley
  Cc: Justin T. Gibbs, Andrew Morton, Jens Axboe, Pedro M. Rodrigues,
	Mathieu Chouquet-Stringer, linux-scsi, linux-kernel


> mjacob@feral.com said:
> > Duh. There had been race conditions in the past which caused all of us
> > HBA writers to in fact start swallowing things like QFULL and
> > maintaining internal queues. 
> 
> That was true of 2.2, 2.3 (and I think early 2.4) but it isn't true of late 
> 2.4 and 2.5

Probably. But I'll likely leave the 2.4 driver alone. I'm about to fork my
bk repository into 2.2, 2.4 and 2.5 streams and put the 2.4 version into
maintenance mode. 

It turns out that there are other reasons why I maintain an internal
queue that have more to do with hiding Fibre Channel issues. For
example, if I get a LIP or an RSCN, I have to go out and re-evaluate the
loop/fabric and make sure I've tracked any changes in the identity of
the devices. The cleanest way to handle this right now for Linux is to
accept commands, disable the SCSI timer on them, and restart them once I
get things sorted out again. Maybe this will change for 2.5.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Warning - running *really* short on DMA buffers while doing file  transfers
  2002-09-27 20:58                                       ` Warning - running *really* short on DMA buffers while doing file transfers Justin T. Gibbs
@ 2002-09-27 21:38                                         ` Patrick Mansfield
  2002-09-27 22:08                                           ` Justin T. Gibbs
  0 siblings, 1 reply; 60+ messages in thread
From: Patrick Mansfield @ 2002-09-27 21:38 UTC (permalink / raw)
  To: Justin T. Gibbs
  Cc: James Bottomley, Jens Axboe, Matthew Jacob, Pedro M. Rodrigues,
	Mathieu Chouquet-Stringer, linux-scsi, linux-kernel

On Fri, Sep 27, 2002 at 02:58:15PM -0600, Justin T. Gibbs wrote:
> >> Hooks for sending ordered tags have been in the aic7xxx driver, at
> >> least in FreeBSD's version, since '97.  As soon as the Linux cmd
> >> blocks have such information it will be trivial to have the aic7xxx
> >> driver issue the appropriate tag types.
> > 
> > They already do in 2.5, see scsi_populate_tag_msg() in scsi.h.  This
> > assumes  you're using the generic tag queueing, which the aic7xxx
> > doesn't, but you  could easily key the tag type off REQ_BARRIER.
> 
> If anyone wants to play with the updated aic7xxx and aic79xx drivers
> (new port to 2.5, plus it honors the otag stuff), you can pick it up
> from here:
> 
> 
> http://people.FreeBSD.org/~gibbs/linux/linux-2.5-aic79xxx.tar.gz
> 
> --
> Justin

Any 2.5 patch for the above? Or aic7xxx/Config.in and
aic7xxx/Makefile for 2.5?

Thanks.

-- Patrick Mansfield

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Warning - running *really* short on DMA buffers while doing file  transfers
  2002-09-27 21:38                                         ` Patrick Mansfield
@ 2002-09-27 22:08                                           ` Justin T. Gibbs
  2002-09-27 22:28                                             ` Patrick Mansfield
  0 siblings, 1 reply; 60+ messages in thread
From: Justin T. Gibbs @ 2002-09-27 22:08 UTC (permalink / raw)
  To: Patrick Mansfield
  Cc: James Bottomley, Jens Axboe, Matthew Jacob, Pedro M. Rodrigues,
	Mathieu Chouquet-Stringer, linux-scsi, linux-kernel

>> http://people.FreeBSD.org/~gibbs/linux/linux-2.5-aic79xxx.tar.gz
>> 
>> --
>> Justin
> 
> Any 2.5 patch for the above? Or aic7xxx/Config.in and
> aic7xxx/Makefile for 2.5?

Try it now.

--
Justin

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Warning - running *really* short on DMA buffers while doingfiletransfers
  2002-09-27 21:23                                                     ` James Bottomley
  2002-09-27 21:29                                                       ` Justin T. Gibbs
  2002-09-27 21:32                                                       ` Matthew Jacob
@ 2002-09-27 22:08                                                       ` Mike Anderson
  2002-09-30 23:49                                                       ` Doug Ledford
  3 siblings, 0 replies; 60+ messages in thread
From: Mike Anderson @ 2002-09-27 22:08 UTC (permalink / raw)
  To: James Bottomley
  Cc: mjacob, Justin T. Gibbs, Andrew Morton, Jens Axboe,
	Pedro M. Rodrigues, Mathieu Chouquet-Stringer, linux-scsi,
	linux-kernel


James Bottomley [James.Bottomley@SteelEye.com] wrote:
> mjacob@feral.com said:
> > Duh. There had been race conditions in the past which caused all of us
> > HBA writers to in fact start swalloing things like QFULL and
> > maintaining internal queues. 
> 
> That was true of 2.2, 2.3 (and I think early 2.4) but it isn't true of late 
> 2.4 and 2.5
> 

The current model appears not to be ideal. We go through the process of
starting a cmd only to find out the adapter already knew we could not
start this command. We then put this request back on the head of the
queue while it is holding resources (a request that could possibly have
had more merging done, and memory from the scsi_sg_pools).

I thought there was discussion previously on mid-layer queue
adjustments during the (? attach patch ?) but I am having trouble
finding it.

-andmike
--
Michael Anderson
andmike@us.ibm.com


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Warning - running *really* short on DMA buffers while doing file  transfers
  2002-09-27 22:08                                           ` Justin T. Gibbs
@ 2002-09-27 22:28                                             ` Patrick Mansfield
  2002-09-27 22:48                                               ` Justin T. Gibbs
  0 siblings, 1 reply; 60+ messages in thread
From: Patrick Mansfield @ 2002-09-27 22:28 UTC (permalink / raw)
  To: Justin T. Gibbs
  Cc: James Bottomley, Jens Axboe, Matthew Jacob, Pedro M. Rodrigues,
	Mathieu Chouquet-Stringer, linux-scsi, linux-kernel

On Fri, Sep 27, 2002 at 04:08:22PM -0600, Justin T. Gibbs wrote:
> >> http://people.FreeBSD.org/~gibbs/linux/linux-2.5-aic79xxx.tar.gz
> >> 
> >> --
> >> Justin
> > 
> > Any 2.5 patch for the above? Or aic7xxx/Config.in and
> > aic7xxx/Makefile for 2.5?
> 
> Try it now.
> 

Great! It boots up fine on my IBM netfinity system with 2.5.37.

I see:

[ boot up stuff deleted ] 

scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.10
        <Adaptec aic7896/97 Ultra2 SCSI adapter>
        aic7896/97: Ultra2 Wide Channel A, SCSI Id=7, 32/253 SCBs

scsi1 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.10
        <Adaptec aic7896/97 Ultra2 SCSI adapter>
        aic7896/97: Ultra2 Wide Channel B, SCSI Id=7, 32/253 SCBs

I turned on the debug flags; there were a bunch of odd messages
in there, but otherwise it seems to be working fine. My .config
has the following AIC config options:

CONFIG_SCSI_AIC7XXX=y
CONFIG_AIC7XXX_CMDS_PER_DEVICE=253
CONFIG_AIC7XXX_RESET_DELAY_MS=15000
CONFIG_AIC7XXX_ALLOW_MEMIO=y
# CONFIG_AIC7XXX_PROBE_EISA_VL is not set
# CONFIG_AIC7XXX_BUILD_FIRMWARE is not set
CONFIG_AIC7XXX_DEBUG_ENABLE=y
CONFIG_AIC7XXX_DEBUG_MASK=0
CONFIG_AIC7XXX_REG_PRETTY_PRINT=y
# CONFIG_SCSI_AIC79XX is not set

Weird boot time messages:

INITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUT<5>  Vendor: IBM-PSG   Model: ST318203LC    !#  Rev: B222
  Type:   Direct-Access                      ANSI SCSI revision: 02
INITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUT(scsi0:A:0): 80.000MB/s transfers (40.000MHz, offset 15, 16bit)
INITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUT<5>  Vendor: IBM-PSG   Model: ST318203LC    !#  Rev: B222
  Type:   Direct-Access                      ANSI SCSI revision: 02
INITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUT(scsi0:A:1): 80.000MB/s transfers (40.000MHz, offset 15, 16bit)
  Vendor: IBM       Model: LN V1.2Rack       Rev: B004
  Type:   Processor                          ANSI SCSI revision: 02
scsi0:A:0:0: Tagged Queuing enabled.  Depth 253
scsi0:A:1:0: Tagged Queuing enabled.  Depth 253
st: Version 20020822, fixed bufsize 32768, wrt 30720, s/g segs 256
Attached scsi disk sda at scsi0, channel 0, id 0, lun 0
Attached scsi disk sdb at scsi0, channel 0, id 1, lun 0
INITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUT<5>SCSI device sda: 35548320 512-byte hdwr sectors (18201 MB)
 sda: sda1 sda2
INITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUTINITIATOR_MSG_OUT<5>SCSI device sdb: 35548320 512-byte hdwr sectors (18201 MB)
 sdb: sdb1 sdb2
Attached scsi generic sg2 at scsi0, channel 0, id 15, lun 0,  type 3
mice: PS/2 mouse device common for all mice
input: PS/2 Generic Mouse on isa0060/serio1
serio: i8042 AUX port at 0x60,0x64 irq 12
input: AT Set 2 keyboard on isa0060/serio0
serio: i8042 KBD port at 0x60,0x64 irq 1
NET4: Linux TCP/IP 1.0 for NET4.0
IP Protocols: ICMP, UDP, TCP, IGMP
IP: routing cache hash table of 2048 buckets, 32Kbytes
TCP: Hash tables configured (established 16384 bind 21845)
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
kjournald starting.  Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
VFS: Mounted root (ext3 filesystem) readonly.
Freeing unused kernel memory: 96k freed
INIT: version 2.78 booting

[ more boot up stuff ]

-- Patrick Mansfield

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Warning - running *really* short on DMA buffers while doing file  transfers
  2002-09-27 22:28                                             ` Patrick Mansfield
@ 2002-09-27 22:48                                               ` Justin T. Gibbs
  0 siblings, 0 replies; 60+ messages in thread
From: Justin T. Gibbs @ 2002-09-27 22:48 UTC (permalink / raw)
  To: Patrick Mansfield
  Cc: James Bottomley, Jens Axboe, Matthew Jacob, Pedro M. Rodrigues,
	Mathieu Chouquet-Stringer, linux-scsi, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 382 bytes --]

> I turned on the debug flags, there were a bunch of odd messages
> in there, but otherwise it seems to be working fine. My .config
> has the following AIC config options:

<sigh>
I always run with debugging turned on with the message flags enabled,
so I missed this in my testing.  I just updated the tarfile.  The
following patch is all you need to shut the driver up.

--
Justin

[-- Attachment #2: diff --]
[-- Type: text/plain, Size: 1375 bytes --]

Change 1419 by gibbs@bitkeeper-linux-2.5 on 2002/09/27 16:44:04

	Add a missing pair of curly braces to a conditional debug
	statement.  This ensures that debug code doesn't trigger if
	it isn't enabled. <blush>

Affected files ...

... //depot/aic7xxx/aic7xxx/aic7xxx.c#80 edit

Differences ...

==== //depot/aic7xxx/aic7xxx/aic7xxx.c#80 (ktext) ====

***************
*** 37,43 ****
   * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
   * POSSIBILITY OF SUCH DAMAGES.
   *
!  * $Id: //depot/aic7xxx/aic7xxx/aic7xxx.c#79 $
   *
   * $FreeBSD$
   */
--- 37,43 ----
   * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
   * POSSIBILITY OF SUCH DAMAGES.
   *
!  * $Id: //depot/aic7xxx/aic7xxx/aic7xxx.c#80 $
   *
   * $FreeBSD$
   */
***************
*** 2475,2483 ****
  			panic("HOST_MSG_LOOP interrupt with no active message");
  
  #ifdef AHC_DEBUG
! 		if ((ahc_debug & AHC_SHOW_MESSAGES) != 0)
  			ahc_print_devinfo(ahc, &devinfo);
  			printf("INITIATOR_MSG_OUT");
  #endif
  		phasemis = bus_phase != P_MESGOUT;
  		if (phasemis) {
--- 2475,2484 ----
  			panic("HOST_MSG_LOOP interrupt with no active message");
  
  #ifdef AHC_DEBUG
! 		if ((ahc_debug & AHC_SHOW_MESSAGES) != 0) {
  			ahc_print_devinfo(ahc, &devinfo);
  			printf("INITIATOR_MSG_OUT");
+ 		}
  #endif
  		phasemis = bus_phase != P_MESGOUT;
  		if (phasemis) {

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Warning - running *really* short on DMA buffers while  doingfiletransfers
  2002-09-27 21:28                                                   ` Justin T. Gibbs
@ 2002-09-28 15:52                                                     ` James Bottomley
  2002-09-28 23:25                                                       ` Luben Tuikov
  2002-09-29  4:00                                                       ` Justin T. Gibbs
  2002-09-30 23:54                                                     ` Doug Ledford
  1 sibling, 2 replies; 60+ messages in thread
From: James Bottomley @ 2002-09-28 15:52 UTC (permalink / raw)
  To: Justin T. Gibbs
  Cc: James Bottomley, Andrew Morton, Jens Axboe, Matthew Jacob,
	Pedro M. Rodrigues, Mathieu Chouquet-Stringer, linux-scsi,
	linux-kernel

gibbs@scsiguy.com said:
> 1) Device returns queue full with no outstanding commands from us
>    (usually occurs in multi-initiator environments). 

That's another manifestation of the problem I referred to.

> 2) No delay after busy status so devices that will continually
>    report BUSY if you hammer them with commands never come ready. 

I think Eric did that because the spec makes BUSY look less severe than QUEUE 
FULL.  We can easily treat busy as QUEUE FULL.  That will cause a short delay 
as the cmd goes back into the block queue and gets reissued.
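
(As a rough illustration only - standalone C with invented names, not the
actual scsi_lib/scsi_error code - the decision being described amounts to
giving BUSY the same disposition as QUEUE FULL:)

/*
 * Toy model: BUSY is handled exactly like QUEUE FULL - the command is
 * requeued to be reissued later rather than escalated to error handling.
 * All names here are invented for illustration.
 */
#include <stdio.h>

enum scsi_status { STAT_GOOD, STAT_BUSY, STAT_QUEUE_FULL, STAT_CHECK_CONDITION };
enum disposition { DISP_COMPLETE, DISP_REQUEUE, DISP_ERROR_HANDLING };

static enum disposition decide(enum scsi_status status)
{
        switch (status) {
        case STAT_GOOD:
                return DISP_COMPLETE;
        case STAT_BUSY:         /* treated the same as ...        */
        case STAT_QUEUE_FULL:   /* ... QUEUE FULL: requeue it     */
                return DISP_REQUEUE;
        default:
                return DISP_ERROR_HANDLING;
        }
}

int main(void)
{
        printf("BUSY  -> %s\n", decide(STAT_BUSY) == DISP_REQUEUE ? "requeue" : "other");
        printf("QFULL -> %s\n", decide(STAT_QUEUE_FULL) == DISP_REQUEUE ? "requeue" : "other");
        return 0;
}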

> 3) Queue is restarted as soon as any command completes even if
>    you really need to throttle down the number of tags supported
>    by the device. 

That's a valid flow control response.  Given the variability of queue depths, 
particularly in multi-initiator/FC environments, it's not clear that 
attempting to implement high/low water marks would buy anything.

> 4) No tag throttling.  If tag throttling is in 2.5, does it ever
>    increment the tag depth to handle devices that report temporary
>    resource shortages (Atlas II and III do this all the time, other
>    devices usually do this only in multi-initiator environments). 

That depends on the tag philosophy, which is partly what this thread is all 
about.  If you regard tags as simply a transport engine to the device and tend 
to keep the number of tags much less than the number the device could accept, 
then this isn't necessary.

Since this feature is one you particularly want for the aic, send us some code 
and it can probably go in the mid-layer. (or actually, if you want to talk to 
Jens about it, the block layer).

> 5) Proper transaction ordering across a queue full.  The aic7xxx
>    driver "requeues" all transactions that have not yet been sent
>    to the device replacing the transaction that experienced the queue
>    full back at the head so that ordering is maintained. 

I'm lost here.  We currently implement TCQ with simple tags which have no 
guarantee of execution order in the drive I/O scheduler.  Why would we want to 
bother preserving the order of what will become essentially an unordered queue?

This will become an issue when (or more likely if) journalled fs rely on the 
barrier being implemented down to the medium, and the mid layer does do 
requeueing in the correct order in that case, except for the tiny race where
the command following the queue full could be accepted by the device before 
the queue is blocked.

> No thought was put into any of these issues in 2.4, so I decided not
> to even think about trusting the mid-layer for this functionality. 

Apart from the TCQ pieces, these are all edge cases which are rarely (if ever) 
seen.  They afflict all drivers and the only one that causes any problems is 
the mid-layer assumption that all devices can accept at least one command.

By not using any of the mid-layer queueing, you've got into a catch-22 
situation where we don't have any bug reports for these problems and you don't 
see them because you don't use the generic infrastructure.

How about I look at fixing the above and you look at using the generic 
infrastructure?

I might even think about how to do dynamic tags in the blk code...

James




^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Warning - running *really* short on DMA buffers while  doingfiletransfers
  2002-09-28 15:52                                                     ` James Bottomley
@ 2002-09-28 23:25                                                       ` Luben Tuikov
  2002-09-29  2:48                                                         ` James Bottomley
  2002-09-30  8:34                                                         ` Jens Axboe
  2002-09-29  4:00                                                       ` Justin T. Gibbs
  1 sibling, 2 replies; 60+ messages in thread
From: Luben Tuikov @ 2002-09-28 23:25 UTC (permalink / raw)
  To: James Bottomley
  Cc: Justin T. Gibbs, Andrew Morton, Jens Axboe, Matthew Jacob,
	Pedro M. Rodrigues, Mathieu Chouquet-Stringer, linux-scsi,
	linux-kernel

James Bottomley wrote:
> 
> > 5) Proper transaction ordering across a queue full.  The aic7xxx
> >    driver "requeues" all transactions that have not yet been sent
> >    to the device replacing the transaction that experienced the queue
> >    full back at the head so that ordering is maintained.
> 
> I'm lost here.  We currently implement TCQ with simple tags which have no
> guarantee of execution order in the drive I/O scheduler.  Why would we want to
> bother preserving the order of what will become essentially an unordered queue?
> 
> This will become an issue when (or more likely if) journalled fs rely on the
> barrier being implemented down to the medium, and the mid layer does do
> requeueing in the correct order in that case, except for the tiny race where
> the command following the queue full could be accepted by the device before
> the queue is blocked.

Justin has the right idea.

TCQ goes hand in hand with Task Attributes. I.e. a tagged task
is an I_T_L_Q nexus and has a task attribute (Simple, Ordered, Head
of Queue, ACA; cf. SAM-3, 4.9.1).

Maybe the generator of tags (block layer, user process through sg, etc)
should also set the tag attribute of the task, as per SAM-3.
Most often (as currently implicitly) this would be a Simple task attribute.
Why shouldn't the block layer borrow the idea from SAM-3?  I see IDE only
coming closer to SCSI...

This way there'd be no need for explicit barriers.  They can be implemented
through Ordered and Head of Queue tasks; everything else is a Simple
attribute task (the IO scheduler can play with those as it wishes).

This would provide a more general basis for the whole game (IO scheduling,
TCQ, IO barriers, etc.).

If the device is not SCSI or doesn't provide for those (the applicable
bits in the INQUIRY data and mode pages), then they can be successfully
simulated in the kernel, but at the lowest level, as I mentioned before.
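
(A minimal sketch of that idea in standalone C, with invented flag and
function names rather than real block-layer code: whoever generates the
tag also picks the SAM task attribute, and a barrier simply becomes an
Ordered task.)

#include <stdio.h>

/* SAM task attributes mentioned above (ACA omitted for brevity). */
enum task_attr { ATTR_SIMPLE, ATTR_ORDERED, ATTR_HEAD_OF_QUEUE };

/* Invented request flags standing in for what the tag generator knows. */
#define RQ_BARRIER  (1u << 0)   /* must not be reordered around           */
#define RQ_URGENT   (1u << 1)   /* e.g. recovery command, jump the queue  */

static enum task_attr task_attr_for(unsigned int rq_flags)
{
        if (rq_flags & RQ_URGENT)
                return ATTR_HEAD_OF_QUEUE;
        if (rq_flags & RQ_BARRIER)
                return ATTR_ORDERED;    /* the barrier becomes an Ordered task   */
        return ATTR_SIMPLE;             /* IO scheduler/device may reorder these */
}

int main(void)
{
        printf("plain write -> %d (simple)\n",  task_attr_for(0));
        printf("barrier     -> %d (ordered)\n", task_attr_for(RQ_BARRIER));
        return 0;
}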

-- 
Luben

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Warning - running *really* short on DMA buffers while  doingfiletransfers
  2002-09-28 23:25                                                       ` Luben Tuikov
@ 2002-09-29  2:48                                                         ` James Bottomley
  2002-09-30  8:34                                                         ` Jens Axboe
  1 sibling, 0 replies; 60+ messages in thread
From: James Bottomley @ 2002-09-29  2:48 UTC (permalink / raw)
  To: Luben Tuikov; +Cc: James Bottomley, Justin T. Gibbs, linux-scsi, linux-kernel

luben@splentec.com said:
> TCQ goes hand in hand with Task Attributes. I.e. a tagged task is an
> I_T_L_Q nexus and has a task attribute (Simple, Ordered, Head of
> Queue, ACA; cf. SAM-3, 4.9.1). 

I believe the point I was making is that our only current expectation is 
simple tag, which is unordered.

> Maybe the generator of tags (block layer, user process through sg, etc)
> should also set the tag attribute of the task, as per SAM-3.  Most
> often (as currently implicitly) this would be a Simple task attribute.
> Why shouldn't the block layer borrow the idea from SAM-3?  I see IDE only
> coming closer to SCSI...

> This way there'd be no need for explicit barriers.  They can be
> implemented through Ordered and Head of Queue tasks; everything else
> is a Simple attribute task (the IO scheduler can play with those as it
> wishes).

> This would provide a more general basis for the whole game (IO
> scheduling, TCQ, IO barriers, etc.).

That would be rather the wrong approach.  As the layers move up from the 
physical hardware, the level of abstraction becomes greater, so the current 
proposal is (descending the abstractions):

journal transaction->REQ_BARRIER->cache synchronise (ide) or ordered tag (scsi)

Most of the implementation is in ll_rw_blk.c, if you want to take a look.
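
(Sketched very roughly in standalone C with made-up names - the real code
lives in ll_rw_blk.c and the transport drivers - the descending chain above
is just a per-transport lowering of the barrier:)

#include <stdio.h>

enum transport { TRANSPORT_SCSI, TRANSPORT_IDE, TRANSPORT_OTHER };

/* What a REQ_BARRIER-style request turns into one level further down. */
static const char *lower_barrier(enum transport t)
{
        switch (t) {
        case TRANSPORT_SCSI:
                return "issue with an ordered tag";
        case TRANSPORT_IDE:
                return "drain the queue, then synchronise the write cache";
        default:
                return "drain the queue and hope for the best";
        }
}

int main(void)
{
        /* journal transaction -> REQ_BARRIER -> per-transport lowering */
        printf("scsi: %s\n", lower_barrier(TRANSPORT_SCSI));
        printf("ide:  %s\n", lower_barrier(TRANSPORT_IDE));
        return 0;
}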

James



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Warning - running *really* short on DMA buffers while  doingfiletransfers
  2002-09-28 15:52                                                     ` James Bottomley
  2002-09-28 23:25                                                       ` Luben Tuikov
@ 2002-09-29  4:00                                                       ` Justin T. Gibbs
  2002-09-29 15:45                                                         ` James Bottomley
  1 sibling, 1 reply; 60+ messages in thread
From: Justin T. Gibbs @ 2002-09-29  4:00 UTC (permalink / raw)
  To: James Bottomley
  Cc: Andrew Morton, Jens Axboe, Matthew Jacob, Pedro M. Rodrigues,
	Mathieu Chouquet-Stringer, linux-scsi, linux-kernel

> gibbs@scsiguy.com said:
>> 1) Device returns queue full with no outstanding commands from us
>>    (usually occurs in multi-initiator environments). 
> 
> That's another manifestation of the problem I referred to.
> 
>> 2) No delay after busy status so devices that will continually
>>    report BUSY if you hammer them with commands never come ready. 
> 
> I think Eric did that because the spec makes BUSY look less severe than
> QUEUE  FULL.  We can easily treat busy as QUEUE FULL.  That will cause a
> short delay  as the cmd goes back into the block queue and gets reissued.

The delay should be on the order of 500ms.  The turn around time for
re-issuing the command is not a sufficient delay.
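
(A toy model of that point in standalone C, with invented names rather than
the real mid-layer retry path: on BUSY the command records an earliest-retry
time instead of being put straight back on the queue.)

#include <stdio.h>

#define BUSY_BACKOFF_MS 500L    /* the minimum delay being suggested */

struct cmd {
        long earliest_retry_ms; /* do not reissue before this time   */
};

/* On BUSY, schedule the retry instead of reissuing immediately. */
static void handle_busy(struct cmd *c, long now_ms)
{
        c->earliest_retry_ms = now_ms + BUSY_BACKOFF_MS;
}

static int may_reissue(const struct cmd *c, long now_ms)
{
        return now_ms >= c->earliest_retry_ms;
}

int main(void)
{
        struct cmd c = { 0 };
        handle_busy(&c, 1000);
        printf("retry at t=1001ms? %s\n", may_reissue(&c, 1001) ? "yes" : "no");
        printf("retry at t=1500ms? %s\n", may_reissue(&c, 1500) ? "yes" : "no");
        return 0;
}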

>> 3) Queue is restarted as soon as any command completes even if
>>    you really need to throttle down the number of tags supported
>>    by the device. 
> 
> That's a valid flow control response.  Given the variability of queue
> depths,  particularly in multi-initiator/FC environments, it's not clear
> that  attempting to implement high/low water marks would buy anything.

It is exactly because of the variability of the queue depth that you must
implement a policy that will not only lower the depth but also raise it,
so that soft/transient limits (e.g. an Atlas II whose write cache fills
and whose tag depth drops to 1, or bursty behavior by another initiator)
don't prevent you from maximizing concurrency on the device.  There can be
no low-water mark, only a high-water mark based on repeated queue fulls at
a certain level, so that you optimize the common case where the device has
a hard limit and you are operating in a single-initiator environment.
Queue fulls are not free.  This is even more the case if you actually
handle the retransmission ordering case correctly (it may require QErr or
ECA/ACA recovery).
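
(The policy being argued for can be modelled, very roughly and with invented
field names, along these lines: drop the usable depth on QUEUE FULL, creep it
back up after a run of clean completions, and only pin it once the same depth
has drawn repeated queue fulls.)

#include <stdio.h>

struct tag_throttle {
        int dev_max;            /* depth the device advertised              */
        int cur_depth;          /* depth we currently allow                 */
        int hard_limit;         /* learned limit, 0 = none learned yet      */
        int qfull_hits;         /* queue fulls seen at the current depth    */
        int good_streak;        /* completions since the last queue full    */
};

static void on_queue_full(struct tag_throttle *t, int outstanding)
{
        if (outstanding > 0 && outstanding < t->cur_depth) {
                t->cur_depth = outstanding;     /* throttle down            */
                t->qfull_hits = 1;
        } else if (++t->qfull_hits >= 3) {
                t->hard_limit = t->cur_depth;   /* repeated at this depth:
                                                   treat it as a hard cap   */
        }
        t->good_streak = 0;
}

static void on_completion(struct tag_throttle *t)
{
        if (++t->good_streak < 64)
                return;                         /* not confident yet        */
        t->good_streak = 0;
        if (t->hard_limit == 0 && t->cur_depth < t->dev_max)
                t->cur_depth++;                 /* transient shortage: raise
                                                   the depth again          */
}

int main(void)
{
        struct tag_throttle t = { 64, 64, 0, 0, 0 };
        int i;

        on_queue_full(&t, 12);
        printf("after QUEUE FULL at 12 outstanding: depth=%d\n", t.cur_depth);
        for (i = 0; i < 64; i++)
                on_completion(&t);
        printf("after 64 clean completions: depth=%d\n", t.cur_depth);
        return 0;
}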
 
>> 4) No tag throttling.  If tag throttling is in 2.5, does it ever
>>    increment the tag depth to handle devices that report temporary
>>    resource shortages (Atlas II and III do this all the time, other
>>    devices usually do this only in multi-initiator environments). 
> 
> That depends on the tag philosophy, which is partly what this thread is
> all  about.  If you regard tags as simply a transport engine to the
> device and tend  to keep the number of tags much less than the number the
> device could accept,  then this isn't necessary.

There are complaints about read latency, and speculations about the cause.
You can't really argue "tag philosophy" without more information on why one
philosophy would perform differently in the given situation.

> Since this feature is one you particularly want for the aic, send us some
> code  and it can probably go in the mid-layer.

I think you misunderstand me.  If the aic drivers behave as best they
can in Linux, then I'm mostly happy.  I've already written one OpenSource
SCSI mid-layer, given presentations on how to fix the Linux mid-layer, and
try to discuss these issues with Linux developers.  I just don't have the
energy to go implement a real solution for Linux only to have it thrown
away.  Life's too short.  8-)

> (or actually, if you want
> to talk to  Jens about it, the block layer).

I don't believe that much of the stuff that has recently been put into the
block layer has any reason to be there, but I'm not going to press my
"philosophical differences" in that area. 8-)

>> 5) Proper transaction ordering across a queue full.  The aic7xxx
>>    driver "requeues" all transactions that have not yet been sent
>>    to the device replacing the transaction that experienced the queue
>>    full back at the head so that ordering is maintained. 
> 
> I'm lost here.  We currently implement TCQ with simple tags which have no 
> guarantee of execution order in the drive I/O scheduler.

Do you run all of your devices with a queue algorithm modifier of 0?  If
not, then there certainly are guarantees on "effective ordering" even
in the simple queue task case.  For example, writes and reads
to the same location must never occur out of order from the viewpoint of
the initiator - a sync cache command will only flush the commands that
have occurred before it, etc, etc.

As you note, this is even more important in the case of implementing
barriers.  Since you basically told me I should implement this support
in my drivers, I figured that ordering must be important. 8-)
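
(For point 5, a toy standalone-C model of the requeue-at-head idea, using a
plain array as the queue; everything here is invented for illustration.)

#include <stdio.h>
#include <string.h>

#define QMAX 16

struct queue {
        int cmds[QMAX];         /* not-yet-issued commands, oldest first */
        int count;
};

/*
 * Put the command that drew QUEUE FULL back at the head, ahead of the
 * commands that have not been sent yet, so the original submission
 * order survives the requeue.
 */
static void requeue_at_head(struct queue *q, int failed_cmd)
{
        if (q->count >= QMAX)
                return;         /* no room; a real driver would handle this */
        memmove(&q->cmds[1], &q->cmds[0], q->count * sizeof(q->cmds[0]));
        q->cmds[0] = failed_cmd;
        q->count++;
}

int main(void)
{
        struct queue q = { { 11, 12, 13 }, 3 }; /* 11..13 not yet sent  */
        int i;

        requeue_at_head(&q, 10);                /* 10 hit QUEUE FULL    */
        for (i = 0; i < q.count; i++)
                printf("%d ", q.cmds[i]);       /* prints: 10 11 12 13  */
        printf("\n");
        return 0;
}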

>> No thought was put into any of these issues in 2.4, so I decided not
>> to even think about trusting the mid-layer for this functionality. 
> 
> Apart from the TCQ pieces, these are all edge cases which are rarely (if
> ever)  seen.

Handling the edge cases is what makes an OS enterprise-worthy.  The
edge cases do happen; they are tested for (by Adaptec and the OEMs
that it supports), and the expectation is that they will be handled
gracefully.

> They afflict all drivers and the only one that causes any
> problems is  the mid-layer assumption that all devices can accept at
> least one command.

Actually, they affect very few "important" drivers because they implement
their own queuing (sym, aic7*, isp*, AdvanSys).

> By not using any of the mid-layer queueing, you've got into a catch-22 
> situation where we don't have any bug reports for these problems and you
> don't  see them because you don't use the generic infrastructure.

Oh come on.  These bugs have been known and talked about for at least three
years now.  The consensus has always been to continue to hack around this
stuff or just brush it off as "edge cases".  That is why the situation has
never improved.

> How about I look at fixing the above and you look at using the generic 
> infrastructure?

Once the generic infrastructure handles these cases and does proper
throttling (code for this is pretty easy to steal out of the aic7xxx
driver if you're interested) I'd be more than happy to rip out this extra
logic in my driver.  It's really sad to have to constantly lie to the
mid-layer in order to get reasonable results, but right now there is no
other option.

--
Justin

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Warning - running *really* short on DMA buffers while  doingfiletransfers
  2002-09-29  4:00                                                       ` Justin T. Gibbs
@ 2002-09-29 15:45                                                         ` James Bottomley
  2002-09-29 16:49                                                           ` [ getting OT ] " Matthew Jacob
  2002-09-30 19:06                                                           ` Luben Tuikov
  0 siblings, 2 replies; 60+ messages in thread
From: James Bottomley @ 2002-09-29 15:45 UTC (permalink / raw)
  To: Justin T. Gibbs; +Cc: James Bottomley, linux-scsi, linux-kernel

gibbs@scsiguy.com said:
> The delay should be on the order of 500ms.  The turn around time for
> re-issuing the command is not a sufficient delay. 

That's not what the spec says.  It just says "reissue at a later time".  
SCSI-2 also implies BUSY is fairly interchangeable with QUEUE FULL.  SAM-3 
clarifies that BUSY should only be returned if the target doesn't have any 
pending tasks for the initiator; otherwise TASK SET FULL (the renamed QUEUE FULL) 
should be returned.

Half a second's delay on BUSY would kill performance for any SCSI-2 device 
using this return instead of QUEUE FULL.

It sounds more like an individual device problem which could be handled in an 
exception table.  What device is this and why does it require 0.5s backoff?
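
(Such an exception table could look roughly like the sketch below; the entry
and flag are made up, and this is not the real SCSI device blacklist code.)

#include <stdio.h>
#include <string.h>

#define QUIRK_BUSY_BACKOFF  0x01u   /* device wants a long delay after BUSY */

struct dev_quirk {
        const char   *vendor;       /* INQUIRY vendor string                */
        const char   *model;        /* INQUIRY model string                 */
        unsigned int  flags;
};

/* Per-device exceptions; the single entry below is entirely made up. */
static const struct dev_quirk quirks[] = {
        { "ACME", "SLOWDISK 9000", QUIRK_BUSY_BACKOFF },
};

static unsigned int quirk_flags(const char *vendor, const char *model)
{
        unsigned int i;

        for (i = 0; i < sizeof(quirks) / sizeof(quirks[0]); i++)
                if (strcmp(quirks[i].vendor, vendor) == 0 &&
                    strcmp(quirks[i].model, model) == 0)
                        return quirks[i].flags;
        return 0;
}

int main(void)
{
        printf("needs BUSY backoff: %s\n",
               (quirk_flags("ACME", "SLOWDISK 9000") & QUIRK_BUSY_BACKOFF)
                       ? "yes" : "no");
        return 0;
}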

> Do you run all of your devices with a queue algorithm modifier of 0?
> If not, then there certainly are guarantees on "effective ordering"
> even in the simple queue task case.  For example, writes and reads to
> the same location must never occur out of order from the viewpoint of
> the initiator - a sync cache command will only flush the commands that
> have occurred before it, etc, etc.

I run with the defaults (which are algorithm 0, Qerr 0).  However, what the 
drive thinks it's doing is not relevant to this discussion.  The question is 
"does the OS have any ordering expectations?".  The answer for Linux currently 
is "no".  In future, it may be "yes" and this whole area will have to be 
revisited, but for now it is "no" and no benefit is gained from being careful 
to preserve the ordering.

> I've already written one OpenSource SCSI mid-layer, given
> presentations on how to fix the Linux mid-layer, and try to discuss
> these issues with Linux developers.  I just don't have the energy to
> go implement a real solution for Linux only to have it thrown away.
> Life's too short.  8-)

What can I say?  I've always found the life of an open source developer to be
a pretty thankless one, filled with bug reports, irate complaints about feature
breakage and tossed code.  The worst, I think, is "This code looks fine, now why
don't you <insert feature requiring a complete re-write of the proposed code>".

I can certainly sympathise with anyone not wanting to work in this 
environment.  I just don't see it changing soon.

James



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [ getting OT ] Re: Warning - running *really* short on DMA buffers while  doingfiletransfers
  2002-09-29 15:45                                                         ` James Bottomley
@ 2002-09-29 16:49                                                           ` Matthew Jacob
  2002-09-30 19:06                                                           ` Luben Tuikov
  1 sibling, 0 replies; 60+ messages in thread
From: Matthew Jacob @ 2002-09-29 16:49 UTC (permalink / raw)
  To: James Bottomley; +Cc: Justin T. Gibbs, linux-scsi, linux-kernel

> > I've already written one OpenSource SCSI mid-layer, given
> > presentations on how to fix the Linux mid-layer, and try to discuss
> > these issues with Linux developers.  I just don't have the energy to
> > go implement a real solution for Linux only to have it thrown away.
> > Life's too short.  8-)
> 
> What can I say?  I've always found the life of an open source developer to be
> a pretty thankless one, filled with bug reports, irate complaints about feature
> breakage and tossed code.  The worst, I think, is "This code looks fine, now why
> don't you <insert feature requiring a complete re-write of the proposed code>".
> 
> I can certainly sympathise with anyone not wanting to work in this
> environment.  I just don't see it changing soon.

Justin, and all of us, are quite content to work in an Open Source
environment, I believe.  It is the true inheritor of the original Unix
philosophies.

But it's difficult to commit to an effort that one often feels is a
waste of time from the get-go.  This is one of the bootstrapping problems
of the Linux environment: pretty much everyone expects you to produce a
working prototype of a problem solution *before* people will accept it -
how else can they evaluate it, hmm?

But major amounts of work would have to be expended before you would do
something like present 'CAM in Linux' for review.  That makes for a
natural tendency to try and assess *beforehand* whether there's even a
point in trying.  I think that the subtext of Justin's comment, to put
words in his mouth which he can later deny if he likes, is that there's a
sense that some of the solutions he'd propose or implement would never be
accepted, so why spend an effort that is sure to be wasted?


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Warning - running *really* short on DMA buffers while doingfiletransfers
  2002-09-28 23:25                                                       ` Luben Tuikov
  2002-09-29  2:48                                                         ` James Bottomley
@ 2002-09-30  8:34                                                         ` Jens Axboe
  1 sibling, 0 replies; 60+ messages in thread
From: Jens Axboe @ 2002-09-30  8:34 UTC (permalink / raw)
  To: Luben Tuikov
  Cc: James Bottomley, Justin T. Gibbs, Andrew Morton, Matthew Jacob,
	Pedro M. Rodrigues, Mathieu Chouquet-Stringer, linux-scsi,
	linux-kernel

On Sat, Sep 28 2002, Luben Tuikov wrote:
> James Bottomley wrote:
> > 
> > > 5) Proper transaction ordering across a queue full.  The aic7xxx
> > >    driver "requeues" all transactions that have not yet been sent
> > >    to the device replacing the transaction that experienced the queue
> > >    full back at the head so that ordering is maintained.
> > 
> > I'm lost here.  We currently implement TCQ with simple tags which have no
> > guarantee of execution order in the drive I/O scheduler.  Why would we want to
> > bother preserving the order of what will become essentially an unordered queue?
> > 
> > This will become an issue when (or more likely if) journalled fs rely on the
> > barrier being implemented down to the medium, and the mid layer does do
> > requeueing in the correct order in that case, except for the tiny race where
> > the command following the queue full could be accepted by the device before
> > the queue is blocked.
> 
> Justin has the right idea.
> 
> TCQ goes hand in hand with Task Attributes. I.e. a tagged task
> is an I_T_L_Q nexus and has a task attribute (Simple, Ordered, Head
> of Queue, ACA; cf. SAM-3, 4.9.1).
> 
> Maybe the generator of tags (block layer, user process through sg, etc)
> should also set the tag attribute of the task, as per SAM-3.
> Most often (as currently implicitly) this would be a Simple task attribute.
> Why shouldn't the block layer borrow the idea from SAM-3?  I see IDE only
> coming closer to SCSI...

With the block layer being the tag generator (and manager), yes, I agree
100% that if we extend the support for block-level tagging slightly it is
very possible to simply generate the type of tag required (as I already
replied to Patrick, REQ_BARRIER would be an ordered task attribute, etc.).

Currently IDE TCQ has no notion of tag options, unfortunately.

> This way there'd be no need for explicit barriers.  They can be implemented
> through Ordered and Head of Queue tasks; everything else is a Simple
> attribute task (the IO scheduler can play with those as it wishes).
> 
> This would provide a more general basis for the whole game (IO scheduling,
> TCQ, IO barriers, etc.).

Agree

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Warning - running *really* short on DMA buffers while  doingfiletransfers
  2002-09-29 15:45                                                         ` James Bottomley
  2002-09-29 16:49                                                           ` [ getting OT ] " Matthew Jacob
@ 2002-09-30 19:06                                                           ` Luben Tuikov
  1 sibling, 0 replies; 60+ messages in thread
From: Luben Tuikov @ 2002-09-30 19:06 UTC (permalink / raw)
  To: James Bottomley; +Cc: Justin T. Gibbs, linux-scsi, linux-kernel

James Bottomley wrote:
> 
> I run with the defaults (which are algorithm 0, Qerr 0).  However, what the
> drive thinks it's doing is not relevant to this discussion.  The question is
> "does the OS have any ordering expectations?".  The answer for Linux currently
> is "no".  In future, it may be "yes" and this whole area will have to be
> revisited, but for now it is "no" and no benefit is gained from being careful
> to preserve the ordering.
> 

Of course it will be ``yes''; it's just the next natural step for
an ever-maturing (great!) OS.  ``Ordering'' is implicitly preserved
in the buffer cache (read/write(2)), but DBs/journalled filesystems would
want finer access to the device (imagination use req'd here).

And there's plenty of (old) research on queueing models
which would provide that kind of flexibility for that future time
when there'd be ``oui'' :-).

> > I've already written one OpenSource SCSI mid-layer, given
> > presentations on how to fix the Linux mid-layer, and try to discuss
> > these issues with Linux developers.  I just don't have the energy to
> > go implement a real solution for Linux only to have it thrown away.
> > Life's too short.  8-)
> 
> What can I say?  I've always found the life of an open source developer to be
> a pretty thankless one, filled with bug reports, irate complaints about feature
> breakage and tossed code.  The worst, I think, is "This code looks fine, now why
> don't you <insert feature requiring a complete re-write of the proposed code>".
> 
> I can certainly sympathise with anyone not wanting to work in this
> environment.  I just don't see it changing soon.

Oh, c'mon James. You know we all appreciate you very much, no need for this.

The reason some of us have not started meddling with the SCSI core is that
same appreciation of the other people working on it, the trust and good rapport.

Linux SCSI is one of the great environments, cf. <you know where'd I refer>.
I'm happy and proud to be part of this environment.

Probably what Justin meant goes back to some past events in the history of Linux,
and the best course of action would be a LaTeX/PS/ASCII presentation/layout/blurb
of what he has in mind, so we can start from there.

-- 
Luben

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Warning - running *really* short on DMA buffers while doingfiletransfers
  2002-09-27 21:23                                                     ` James Bottomley
                                                                         ` (2 preceding siblings ...)
  2002-09-27 22:08                                                       ` Mike Anderson
@ 2002-09-30 23:49                                                       ` Doug Ledford
  3 siblings, 0 replies; 60+ messages in thread
From: Doug Ledford @ 2002-09-30 23:49 UTC (permalink / raw)
  To: James Bottomley
  Cc: mjacob, Justin T. Gibbs, Andrew Morton, Jens Axboe,
	Pedro M. Rodrigues, Mathieu Chouquet-Stringer, linux-scsi,
	linux-kernel

On Fri, Sep 27, 2002 at 05:23:10PM -0400, James Bottomley wrote:
> mjacob@feral.com said:
> > Duh. There had been race conditions in the past which caused all of us
> > HBA writers to in fact start swalloing things like QFULL and
> > maintaining internal queues. 
> 
> That was true of 2.2, 2.3 (and I think early 2.4) but it isn't true of late 
> 2.4 and 2.5

Oh, it's true of current 2.4 (as of 2.4.19).  It's broken for both new and
old eh drivers in 2.4.  Hell, it's still broken for new eh drivers in 2.5
as well.

-- 
  Doug Ledford <dledford@redhat.com>     919-754-3700 x44233
         Red Hat, Inc. 
         1801 Varsity Dr.
         Raleigh, NC 27606
  

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Warning - running *really* short on DMA buffers while doingfiletransfers
  2002-09-27 21:28                                                   ` Justin T. Gibbs
  2002-09-28 15:52                                                     ` James Bottomley
@ 2002-09-30 23:54                                                     ` Doug Ledford
  1 sibling, 0 replies; 60+ messages in thread
From: Doug Ledford @ 2002-09-30 23:54 UTC (permalink / raw)
  To: Justin T. Gibbs
  Cc: James Bottomley, Andrew Morton, Jens Axboe, Matthew Jacob,
	Pedro M. Rodrigues, Mathieu Chouquet-Stringer, linux-scsi,
	linux-kernel

On Fri, Sep 27, 2002 at 03:28:47PM -0600, Justin T. Gibbs wrote:
> > Linux is perfectly happy just to have you return 1 in queuecommand if the 
> > device won't accept the tag.  The can_queue parameter represents the
> > maximum  number of outstanding commands the mid-layer will ever send.
> > The mid-layer is  happy to re-queue I/O below this limit if it cannot be
> > accepted by the drive.   In fact, that's more or less what queue plugging
> > is about.
> > 
> > The only problem occurs if you return 1 from queuecommand with no other 
> > outstanding I/O for the device.
> > 
> > There should be no reason in 2.5 for a driver to have to implement an
> > internal  queue.
> 
> Did this really get fixed in 2.5?  The internal queuing was completely
> broken in 2.4.  Some of the known breakages were:
> 
> 1) Device returns queue full with no outstanding commands from us
>    (usually occurs in multi-initiator environments).

This may be fixed.

> 2) No delay after busy status so devices that will continually
>    report BUSY if you hammer them with commands never come ready.

This is still broken.  Plus, it has a limited number of retries before it
simply returns an I/O error: it basically hammers the device (so it
can't get unbusy) until the set number of retries has completed, then it
returns an I/O error, giving all sorts of false I/O errors on devices that
use BUSY status.
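
(A toy simulation of that failure mode in standalone C, all numbers invented:
if the device stays busy for two seconds and retries are immediate, a fixed
retry budget is burned long before the device recovers, while a 500ms backoff
survives it.)

#include <stdio.h>

#define RETRIES        5        /* retry budget after the first attempt      */
#define BUSY_UNTIL_MS  2000L    /* device stays busy for the first 2 seconds */

/* Returns 1 if the command eventually completes, 0 if we give up. */
static int issue_with_retries(long backoff_ms)
{
        long now = 0;
        int  tries;

        for (tries = 0; tries <= RETRIES; tries++) {
                if (now >= BUSY_UNTIL_MS)
                        return 1;               /* device finally accepted it  */
                now += backoff_ms + 1;          /* 1ms turn-around, then retry */
        }
        return 0;                               /* budget exhausted: I/O error */
}

int main(void)
{
        printf("immediate retries: %s\n",
               issue_with_retries(0)   ? "ok" : "false I/O error");
        printf("500ms backoff:     %s\n",
               issue_with_retries(500) ? "ok" : "false I/O error");
        return 0;
}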

> 3) Queue is restarted as soon as any command completes even if
>    you really need to throttle down the number of tags supported
>    by the device.
> 
> 4) No tag throttling.  If tag throttling is in 2.5, does it ever
>    increment the tag depth to handle devices that report temporary
>    resource shortages (Atlas II and III do this all the time, other
>    devices usually do this only in multi-initiator environments).

The current 2.5 mid layer is still tag stupid.  It has no concept of tag 
depth adjustment (although I have a patch here that implements this bit, 
it really needs updating to the current kernels).

> 5) Proper transaction ordering across a queue full.  The aic7xxx
>    driver "requeues" all transactions that have not yet been sent
>    to the device replacing the transaction that experienced the queue
>    full back at the head so that ordering is maintained.
> 
> No thought was put into any of these issues in 2.4, so I decided not
> to even think about trusting the mid-layer for this functionality.

No, you still can't yet, but we hope to have that fixed before the next 
stable kernel series.

-- 
  Doug Ledford <dledford@redhat.com>     919-754-3700 x44233
         Red Hat, Inc. 
         1801 Varsity Dr.
         Raleigh, NC 27606
  

^ permalink raw reply	[flat|nested] 60+ messages in thread

end of thread, other threads:[~2002-09-30 23:48 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-09-26  3:27 Warning - running *really* short on DMA buffers while doing file transfers Mathieu Chouquet-Stringer
2002-09-26  6:14 ` Jens Axboe
2002-09-26  7:04   ` Pedro M. Rodrigues
2002-09-26 15:31     ` Justin T. Gibbs
2002-09-27  6:13       ` Jens Axboe
2002-09-27  6:33         ` Matthew Jacob
2002-09-27  6:36           ` Jens Axboe
2002-09-27  6:50             ` Matthew Jacob
2002-09-27  6:56               ` Jens Axboe
2002-09-27  7:18                 ` Matthew Jacob
2002-09-27  7:24                   ` Jens Axboe
2002-09-27  7:29                     ` Matthew Jacob
2002-09-27  7:34                       ` Matthew Jacob
2002-09-27  7:45                         ` Jens Axboe
2002-09-27  8:37                           ` Matthew Jacob
2002-09-27 10:25                             ` Jens Axboe
2002-09-27 12:18                               ` Matthew Jacob
2002-09-27 12:54                                 ` Jens Axboe
2002-09-27 13:30                               ` Justin T. Gibbs
2002-09-27 14:26                                 ` James Bottomley
2002-09-27 14:33                                   ` Jens Axboe
2002-09-27 16:26                                   ` Justin T. Gibbs
2002-09-27 17:21                                     ` James Bottomley
2002-09-27 18:56                                       ` Justin T. Gibbs
2002-09-27 19:07                                         ` Warning - running *really* short on DMA buffers while doingfile transfers Andrew Morton
2002-09-27 19:16                                           ` Justin T. Gibbs
2002-09-27 19:36                                             ` Warning - running *really* short on DMA buffers while doingfiletransfers Andrew Morton
2002-09-27 19:52                                               ` Justin T. Gibbs
2002-09-27 21:13                                                 ` James Bottomley
2002-09-27 21:18                                                   ` Matthew Jacob
2002-09-27 21:23                                                     ` James Bottomley
2002-09-27 21:29                                                       ` Justin T. Gibbs
2002-09-27 21:32                                                       ` Matthew Jacob
2002-09-27 22:08                                                       ` Mike Anderson
2002-09-30 23:49                                                       ` Doug Ledford
2002-09-27 21:28                                                   ` Justin T. Gibbs
2002-09-28 15:52                                                     ` James Bottomley
2002-09-28 23:25                                                       ` Luben Tuikov
2002-09-29  2:48                                                         ` James Bottomley
2002-09-30  8:34                                                         ` Jens Axboe
2002-09-29  4:00                                                       ` Justin T. Gibbs
2002-09-29 15:45                                                         ` James Bottomley
2002-09-29 16:49                                                           ` [ getting OT ] " Matthew Jacob
2002-09-30 19:06                                                           ` Luben Tuikov
2002-09-30 23:54                                                     ` Doug Ledford
2002-09-27 19:58                                               ` Andrew Morton
2002-09-27 20:58                                       ` Warning - running *really* short on DMA buffers while doing file transfers Justin T. Gibbs
2002-09-27 21:38                                         ` Patrick Mansfield
2002-09-27 22:08                                           ` Justin T. Gibbs
2002-09-27 22:28                                             ` Patrick Mansfield
2002-09-27 22:48                                               ` Justin T. Gibbs
2002-09-27 18:59                                     ` Warning - running *really* short on DMA buffers while doingfile transfers Andrew Morton
2002-09-27 14:30                                 ` Warning - running *really* short on DMA buffers while doing file transfers Jens Axboe
2002-09-27 17:19                                   ` Justin T. Gibbs
2002-09-27 18:29                                     ` Rik van Riel
2002-09-27 14:56                                 ` Rik van Riel
2002-09-27 15:34                                 ` Matthew Jacob
2002-09-27 15:37                                   ` Jens Axboe
2002-09-27 17:20                                     ` Justin T. Gibbs
2002-09-27 12:28       ` Pedro M. Rodrigues

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).