linux-kernel.vger.kernel.org archive mirror
* Linux 2.4.21-rc6
@ 2003-05-29  0:55 Marcelo Tosatti
  2003-05-29  1:22 ` Con Kolivas
                   ` (3 more replies)
  0 siblings, 4 replies; 109+ messages in thread
From: Marcelo Tosatti @ 2003-05-29  0:55 UTC (permalink / raw)
  To: lkml


Hi,

Here goes -rc6. I've decided to delay 2.4.21 a bit and try Andrew's fix
for the IO stalls/deadlocks.

Please test it.


Summary of changes from v2.4.21-rc5 to v2.4.21-rc6
==================================================

<c-d.hailfinger.kernel.2003@gmx.net>:
  o IDE config.in correctness

Andi Kleen <ak@muc.de>:
  o x86-64 fix for the ioport problem

Andrew Morton <akpm@digeo.com>:
  o Fix IO stalls and deadlocks

Marcelo Tosatti <marcelo@freak.distro.conectiva>:
  o Add missing via82xxx PCI ID
  o Backout erroneous fsync on last opener at close()
  o Changed EXTRAVERSION to -rc6


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: Linux 2.4.21-rc6
  2003-05-29  0:55 Linux 2.4.21-rc6 Marcelo Tosatti
@ 2003-05-29  1:22 ` Con Kolivas
  2003-05-29  5:24   ` Marc Wilson
  2003-05-29 10:02 ` Con Kolivas
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 109+ messages in thread
From: Con Kolivas @ 2003-05-29  1:22 UTC (permalink / raw)
  To: lkml

On Thu, 29 May 2003 10:55, Marcelo Tosatti wrote:
> Here goes -rc6. I've decided to delay 2.4.21 a bit and try Andrew's fix
> for the IO stalls/deadlocks.

Good for you. Well done Marcelo!

> Please test it.

Yes, everyone who gets these stalls, please test it also!

> Andrew Morton <akpm@digeo.com>:
>   o Fix IO stalls and deadlocks

For those interested, these are patches 1 and 2 from akpm's proposed fixes in
the looong thread discussing this problem.

Con

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: Linux 2.4.21-rc6
  2003-05-29  1:22 ` Con Kolivas
@ 2003-05-29  5:24   ` Marc Wilson
  2003-05-29  5:34     ` Riley Williams
  0 siblings, 1 reply; 109+ messages in thread
From: Marc Wilson @ 2003-05-29  5:24 UTC (permalink / raw)
  To: lkml

On Thu, May 29, 2003 at 11:22:20AM +1000, Con Kolivas wrote:
> On Thu, 29 May 2003 10:55, Marcelo Tosatti wrote:
> > Andrew Morton <akpm@digeo.com>:
> >   o Fix IO stalls and deadlocks
> 
> For those interested these are patches 1 and 2 from akpm's proposed fixes in 
> the looong thread discussing this problem.

Are you sure?  I'm no C programmer, but it looks to me like all three
patches are in 21-rc6.

And I still see the stalls, although they're much reduced. :(  I just had mutt
freeze cold on me though for ~15 sec when it tried to open my debian-devel
mbox (rather large file) while brag was beating on the drive.

<whimper>

-- 
 Marc Wilson |     You have had a long-term stimulation relative to
 msw@cox.net |     business.

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: Linux 2.4.21-rc6
  2003-05-29  5:24   ` Marc Wilson
@ 2003-05-29  5:34     ` Riley Williams
  2003-05-29  5:57       ` Marc Wilson
  0 siblings, 1 reply; 109+ messages in thread
From: Riley Williams @ 2003-05-29  5:34 UTC (permalink / raw)
  To: Marc Wilson, lkml

Hi Marc.

 > I just had mutt freeze cold on me though for ~15 sec when
 > it tried to open my debian-devel mbox (rather large file)
 > while brag was beating on the drive.
 >
 > <whimper>

I used to get the same effect when I asked pine to open the
Linux-Kernel mailbox on my system. I long since cured that by
having procmail split Linux-Kernel mail into multiple mailboxes,
one for each calendar week.

The basic problem there is that any mail client needs to know
just how many messages are in a particular folder to handle that
folder, and the only way to do this is to count them all. That's
what takes the time when one opens a large folder.
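
With a classic mbox, that count means one sequential pass over the whole
file looking for the "From " separator lines (real clients such as pine or
mutt also parse headers as they go, which is what makes a huge folder slow
to open). A rough stand-alone illustration of just the counting pass, in C;
the mbox handling is deliberately simplified:

#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
        FILE *f = fopen(argc > 1 ? argv[1] : "mbox", "r");
        char line[4096];
        long count = 0;

        if (!f) {
                perror("fopen");
                return 1;
        }
        while (fgets(line, sizeof(line), f))
                if (strncmp(line, "From ", 5) == 0)  /* mbox message separator */
                        count++;
        fclose(f);
        printf("%ld messages\n", count);
        return 0;
}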

Best wishes from Riley.
---
 * Nothing as pretty as a smile, nothing as ugly as a frown.


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: Linux 2.4.21-rc6
  2003-05-29  5:34     ` Riley Williams
@ 2003-05-29  5:57       ` Marc Wilson
  2003-05-29  7:15         ` Riley Williams
                           ` (2 more replies)
  0 siblings, 3 replies; 109+ messages in thread
From: Marc Wilson @ 2003-05-29  5:57 UTC (permalink / raw)
  To: lkml

On Thu, May 29, 2003 at 06:34:48AM +0100, Riley Williams wrote:
> The basic problem there is that any mail client needs to know
> just how many messages are in a particular folder to handle that
> folder, and the only way to do this is to count them all. That's
> what takes the time when one opens a large folder.

No, the basic problem there is that the kernel is deadlocking.  Read the
VERY long thread for the details.

I think I have enough on the ball to be able to tell the difference between
mutt opening a folder and counting messages, with a counter and percentage
indicator advancing, and mutt sitting there deadlocked with the HD activity
light stuck on and all the rest of X stuck tight.

And it just happened again, so -rc6 is no sure fix.  What did y'all who
reported that the problem had gone away do, patch -rc4 with the akpm patches?
^_^

-- 
 Marc Wilson |     Fortune favors the lucky.
 msw@cox.net |

^ permalink raw reply	[flat|nested] 109+ messages in thread

* RE: Linux 2.4.21-rc6
  2003-05-29  5:57       ` Marc Wilson
@ 2003-05-29  7:15         ` Riley Williams
  2003-05-29  8:38         ` Willy Tarreau
  2003-06-03 16:02         ` Marcelo Tosatti
  2 siblings, 0 replies; 109+ messages in thread
From: Riley Williams @ 2003-05-29  7:15 UTC (permalink / raw)
  To: Marc Wilson; +Cc: Linux Kernel List

Hi Marc.

 >> The basic problem there is that any mail client needs to know
 >> just how many messages are in a particular folder to handle that
 >> folder, and the only way to do this is to count them all. That's
 >> what takes the time when one opens a large folder.

 > No, the basic problem there is that the kernel is deadlocking.
 > Read the VERY long thread for the details.
 >
 > I think I have enough on the ball to be able to tell the difference
 > between mutt opening a folder and counting messages, with a counter
 > and percentage indicator advancing, and mutt sitting there
 > deadlocked with the HD activity light stuck on and all the rest of
 > X stuck tight.

I thought I was on the ball when a similar situation happened to me.
What I observed was that the counters and percentage indicators were
NOT advancing for about 30 seconds, and both would then jump up by
about 70 messages and the relevant percent rather than counting
smoothly through. It was only when I noticed those jumps that I went
back to basics and analysed the folder rather than the kernel.

However, I apologise profusely for assuming that my experience in what
to me appear to be similar circumstances to yours could have any sort
of bearing on the problem you are seeing.

 > And it just happened again, so -rc6 is no sure fix. What did y'all
 > who reported that the problem had gone away do, patch -rc4 with the
 > akpm patches?

In my case, I fixed the problem by splitting the relevant folder up,
as stated in my previous message. However, such a solution apparently
doesn't work for you, so I'm unable to help any further.

Best wishes from Riley.
---
 * Nothing as pretty as a smile, nothing as ugly as a frown.


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: Linux 2.4.21-rc6
  2003-05-29  5:57       ` Marc Wilson
  2003-05-29  7:15         ` Riley Williams
@ 2003-05-29  8:38         ` Willy Tarreau
  2003-05-29  8:40           ` Willy Tarreau
  2003-06-03 16:02         ` Marcelo Tosatti
  2 siblings, 1 reply; 109+ messages in thread
From: Willy Tarreau @ 2003-05-29  8:38 UTC (permalink / raw)
  To: lkml

Hi !

On Wed, May 28, 2003 at 10:57:35PM -0700, Marc Wilson wrote:
> No, the basic problem there is that the kernel is deadlocking.  Read the
> VERY long thread for the details.

I didn't follow this thread; what's its subject, please?
 
> I think I have enough on the ball to be able to tell the difference between
> mutt opening a folder and counting messages, with a counter and percentage
> indicator advancing, and mutt sitting there deadlocked with the HD activity
> light stuck on and all the rest of X stuck tight.

Even on -rc3, I don't observe this behaviour. I tried from a cold cache, and
mutt took a little less than 3 seconds to open LKML's May folder (35 MB), and
progressed very smoothly. Since it's on my Alpha file server, I can't test
with X. But the I/O bandwidth and scheduler frequency (1024 HZ) may have an
impact.

Cheers,
Willy


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: Linux 2.4.21-rc6
  2003-05-29  8:38         ` Willy Tarreau
@ 2003-05-29  8:40           ` Willy Tarreau
  0 siblings, 0 replies; 109+ messages in thread
From: Willy Tarreau @ 2003-05-29  8:40 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: lkml

On Thu, May 29, 2003 at 10:38:04AM +0200, Willy Tarreau wrote:
> Hi !
> 
> On Wed, May 28, 2003 at 10:57:35PM -0700, Marc Wilson wrote:
> > No, the basic problem there is that the kernel is deadlocking.  Read the
> > VERY long thread for the details.
> 
> I didn't follow this thread, what's its subject, please ?

Hmmm, never mind, I easily found it (yes, VERY long)!

Cheers,
Willy


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: Linux 2.4.21-rc6
  2003-05-29  0:55 Linux 2.4.21-rc6 Marcelo Tosatti
  2003-05-29  1:22 ` Con Kolivas
@ 2003-05-29 10:02 ` Con Kolivas
  2003-05-29 18:00 ` Georg Nikodym
  2003-06-03 19:45 ` Config issue (CONFIG_X86_TSC) " Paul
  3 siblings, 0 replies; 109+ messages in thread
From: Con Kolivas @ 2003-05-29 10:02 UTC (permalink / raw)
  To: Marcelo Tosatti, lkml

On Thu, 29 May 2003 10:55, Marcelo Tosatti wrote:
> Hi,
>
> Here goes -rc6. I've decided to delay 2.4.21 a bit and try Andrew's fix
> for the IO stalls/deadlocks.
>
> Please test it.
>
>

> Andrew Morton <akpm@digeo.com>:
>   o Fix IO stalls and deadlocks

As this is only patches 1 and 2 from akpm's suggested changes, I was wondering
if my report got lost in the huge thread, so I've included it here:

OK, the patch combination final score for me is as follows in the presence of a
large continuous write:
1 No change
2 No change
3 improvement++; minor hangs with reads
1+2 improvement+++; minor pauses with switching applications
1+2+3 improvement++++; no pauses

Applications may start up slowly; that's fine. With 1+2+3, though, the mouse
cursor keeps spinning and responding at all times, which it hasn't done in 2.4
for a year or so.

Is there a reason the 3rd patch was omitted?

Con

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: Linux 2.4.21-rc6
  2003-05-29  0:55 Linux 2.4.21-rc6 Marcelo Tosatti
  2003-05-29  1:22 ` Con Kolivas
  2003-05-29 10:02 ` Con Kolivas
@ 2003-05-29 18:00 ` Georg Nikodym
  2003-05-29 19:11   ` -rc7 " Marcelo Tosatti
  2003-06-03 19:45 ` Config issue (CONFIG_X86_TSC) " Paul
  3 siblings, 1 reply; 109+ messages in thread
From: Georg Nikodym @ 2003-05-29 18:00 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: lkml

[-- Attachment #1: Type: text/plain, Size: 555 bytes --]

On Wed, 28 May 2003 21:55:39 -0300 (BRT)
Marcelo Tosatti <marcelo@conectiva.com.br> wrote:

> Here goes -rc6. I've decided to delay 2.4.21 a bit and try Andrew's
> fix for the IO stalls/deadlocks.

While others may be dubious about the efficacy of this patch, I've been
running -rc6 on my laptop now since sometime last night and have seen
nothing odd.

In case anybody cares, I'm using both ide and a ieee1394 (for a large
external drive [which implies scsi]) and I do a _lot_ of big work with
BK so I was seeing the problem within hours previously.

-g

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 109+ messages in thread

* -rc7   Re: Linux 2.4.21-rc6
  2003-05-29 18:00 ` Georg Nikodym
@ 2003-05-29 19:11   ` Marcelo Tosatti
  2003-05-29 19:56     ` Krzysiek Taraszka
  2003-06-04 10:22     ` Andrea Arcangeli
  0 siblings, 2 replies; 109+ messages in thread
From: Marcelo Tosatti @ 2003-05-29 19:11 UTC (permalink / raw)
  To: Georg Nikodym; +Cc: lkml



On Thu, 29 May 2003, Georg Nikodym wrote:

> On Wed, 28 May 2003 21:55:39 -0300 (BRT)
> Marcelo Tosatti <marcelo@conectiva.com.br> wrote:
>
> > Here goes -rc6. I've decided to delay 2.4.21 a bit and try Andrew's
> > fix for the IO stalls/deadlocks.
>
> While others may be dubious about the efficacy of this patch, I've been
> running -rc6 on my laptop now since sometime last night and have seen
> nothing odd.
>
> In case anybody cares, I'm using both ide and a ieee1394 (for a large
> external drive [which implies scsi]) and I do a _lot_ of big work with
> BK so I was seeing the problem within hours previously.

Great!

-rc7 will have to be released due to some problems :(

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: -rc7   Re: Linux 2.4.21-rc6
  2003-05-29 19:11   ` -rc7 " Marcelo Tosatti
@ 2003-05-29 19:56     ` Krzysiek Taraszka
  2003-05-29 20:18       ` Krzysiek Taraszka
  2003-06-04 10:22     ` Andrea Arcangeli
  1 sibling, 1 reply; 109+ messages in thread
From: Krzysiek Taraszka @ 2003-05-29 19:56 UTC (permalink / raw)
  To: Marcelo Tosatti, Georg Nikodym; +Cc: lkml

[-- Attachment #1: Type: text/plain, Size: 4242 bytes --]

On Thu, 29 May 2003 at 21:11, Marcelo Tosatti wrote:
> On Thu, 29 May 2003, Georg Nikodym wrote:
> > On Wed, 28 May 2003 21:55:39 -0300 (BRT)
> >
> > Marcelo Tosatti <marcelo@conectiva.com.br> wrote:
> > > Here goes -rc6. I've decided to delay 2.4.21 a bit and try Andrew's
> > > fix for the IO stalls/deadlocks.
> >
> > While others may be dubious about the efficacy of this patch, I've been
> > running -rc6 on my laptop now since sometime last night and have seen
> > nothing odd.
> >
> > In case anybody cares, I'm using both ide and a ieee1394 (for a large
> > external drive [which implies scsi]) and I do a _lot_ of big work with
> > BK so I was seeing the problem within hours previously.
>
> Great!
>
> -rc7 will have to be released due to some problems :(


Hmm, it seems the IDE modules and others are broken. I'm looking for the reason
why... Here are the depmod errors and my .config file:

make[1]: Nothing to be done for `modules_install'.
make[1]: Leaving directory `/home/users/dzimi/rpm/BUILD/linux-2.4.20/arch/i386/lib'
cd /lib/modules/2.4.21-rc6; \
mkdir -p pcmcia; \
find kernel -path '*/pcmcia/*' -name '*.o' | xargs -i -r ln -sf ../{} pcmcia
if [ -r System.map ]; then /sbin/depmod -ae -F System.map  2.4.21-rc6; fi
depmod: *** Unresolved symbols in /lib/modules/2.4.21-rc6/kernel/drivers/ide/ide-disk.o
depmod:         proc_ide_read_geometry
depmod:         ide_remove_proc_entries
depmod: *** Unresolved symbols in /lib/modules/2.4.21-rc6/kernel/drivers/ide/ide-floppy.o
depmod:         proc_ide_read_geometry
depmod:         ide_remove_proc_entries
depmod: *** Unresolved symbols in /lib/modules/2.4.21-rc6/kernel/drivers/ide/ide-probe.o
depmod:         do_ide_request
depmod:         ide_add_generic_settings
depmod:         create_proc_ide_interfaces
depmod: *** Unresolved symbols in /lib/modules/2.4.21-rc6/kernel/drivers/ide/ide-tape.o
depmod:         ide_remove_proc_entries
depmod: *** Unresolved symbols in /lib/modules/2.4.21-rc6/kernel/drivers/ide/ide.o
depmod:         ide_release_dma
depmod:         ide_add_proc_entries
depmod:         cmd640_vlb
depmod:         ide_probe_for_cmd640x
depmod:         ide_scan_pcibus
depmod:         proc_ide_read_capacity
depmod:         proc_ide_create
depmod:         ide_remove_proc_entries
depmod:         destroy_proc_ide_drives
depmod:         proc_ide_destroy
depmod:         create_proc_ide_interfaces
depmod: *** Unresolved symbols in /lib/modules/2.4.21-rc6/kernel/drivers/net/wan/comx.o
depmod:         proc_get_inode
depmod: *** Unresolved symbols in /lib/modules/2.4.21-rc6/kernel/net/atm/common.o
depmod:         free_atm_vcc_sk
depmod:         atm_init_aal34
depmod:         alloc_atm_vcc_sk
depmod:         atm_init_aal0
depmod:         atm_devs
depmod: *** Unresolved symbols in /lib/modules/2.4.21-rc6/kernel/net/atm/pvc.o
depmod:         atm_getsockopt
depmod:         atm_recvmsg
depmod:         atm_release
depmod:         atm_ioctl
depmod:         atm_create
depmod:         atm_sendmsg
depmod:         atm_poll
depmod:         atm_connect
depmod:         atm_proc_init
depmod:         atm_setsockopt
depmod: *** Unresolved symbols in /lib/modules/2.4.21-rc6/kernel/net/atm/resources.o
depmod:         atm_proc_dev_deregister
depmod:         atm_proc_dev_register
depmod: *** Unresolved symbols in /lib/modules/2.4.21-rc6/kernel/net/atm/signaling.o
depmod:         nodev_vccs
depmod:         atm_devs
depmod: *** Unresolved symbols in /lib/modules/2.4.21-rc6/kernel/net/atm/svc.o
depmod:         atm_getsockopt
depmod:         atm_recvmsg
depmod:         free_atm_vcc_sk
depmod:         atm_ioctl
depmod:         atm_create
depmod:         atm_sendmsg
depmod:         atm_poll
depmod:         atm_connect
depmod:         atm_release_vcc_sk
depmod:         atm_setsockopt

My .config (it's the distro config; I'm the PLD kernel packager) is included.
OK, I'm going to fix those trivial (?) problems. When I worked on 2.2.x I
made some hacks on ksyms.c; was that correct?
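
For reference, the 2.2-era trick of hand-editing kernel/ksyms.c is not the
usual 2.4 way: a symbol becomes visible to modules when the file that defines
it is actually built and tags it with EXPORT_SYMBOL(). A minimal sketch of the
idiom (the file name and the empty argument list are illustrative, not the
actual 2.4.21 source):

#include <linux/module.h>

/* drivers/ide/ide-proc.c (sketch): define the symbol, then export it */
void ide_remove_proc_entries(void)
{
        /* ... tear down this driver's /proc/ide entries ... */
}
EXPORT_SYMBOL(ide_remove_proc_entries);  /* lets ide-disk.o, ide-tape.o, ... resolve it */

No export list helps here, though: the depmod output above means the object
that defines these symbols was never built into the modular IDE core at all,
so the fix belongs in the IDE config/Makefile rather than in ksyms.c.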

-- 
Krzysiek Taraszka			(dzimi@pld.org.pl)
http://cyborg.kernel.pl/~dzimi/

[-- Attachment #2: .config --]
[-- Type: text/plain, Size: 39697 bytes --]

#
# Automatically generated by make menuconfig: don't edit
#
CONFIG_X86=y
# CONFIG_SBUS is not set
CONFIG_UID16=y

#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y

#
# Loadable module support
#
CONFIG_MODULES=y
# CONFIG_MODVERSIONS is not set
CONFIG_KMOD=y

#
# Processor type and features
#
# CONFIG_M386 is not set
# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
# CONFIG_M586MMX is not set
# CONFIG_M686 is not set
# CONFIG_MPENTIUMIII is not set
# CONFIG_MPENTIUM4 is not set
# CONFIG_MK6 is not set
CONFIG_MK7=y
# CONFIG_MK8 is not set
# CONFIG_MELAN is not set
# CONFIG_MCRUSOE is not set
# CONFIG_MWINCHIPC6 is not set
# CONFIG_MWINCHIP2 is not set
# CONFIG_MWINCHIP3D is not set
# CONFIG_MCYRIXIII is not set
# CONFIG_MVIAC3_2 is not set
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INVLPG=y
CONFIG_X86_CMPXCHG=y
CONFIG_X86_XADD=y
CONFIG_X86_BSWAP=y
CONFIG_X86_POPAD_OK=y
# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_HAS_TSC=y
CONFIG_X86_GOOD_APIC=y
CONFIG_X86_USE_3DNOW=y
CONFIG_X86_PGE=y
CONFIG_X86_USE_PPRO_CHECKSUM=y
CONFIG_X86_F00F_WORKS_OK=y
CONFIG_X86_MCE=y
CONFIG_TOSHIBA=m
CONFIG_I8K=m
CONFIG_MICROCODE=m
CONFIG_X86_MSR=m
CONFIG_X86_CPUID=m
# CONFIG_NOHIGHMEM is not set
CONFIG_HIGHMEM4G=y
# CONFIG_HIGHMEM64G is not set
CONFIG_HIGHMEM=y
CONFIG_HIGHIO=y
# CONFIG_MATH_EMULATION is not set
CONFIG_MTRR=y
# CONFIG_SMP is not set
CONFIG_X86_UP_APIC=y
CONFIG_X86_UP_IOAPIC=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
# CONFIG_X86_TSC_DISABLE is not set
CONFIG_X86_TSC=y

#
# General setup
#
CONFIG_NET=y
CONFIG_PCI=y
# CONFIG_PCI_GOBIOS is not set
# CONFIG_PCI_GODIRECT is not set
CONFIG_PCI_GOANY=y
CONFIG_PCI_BIOS=y
CONFIG_PCI_DIRECT=y
CONFIG_ISA=y
CONFIG_PCI_NAMES=y
CONFIG_EISA=y
CONFIG_MCA=y
CONFIG_HOTPLUG=y

#
# PCMCIA/CardBus support
#
CONFIG_PCMCIA=m
CONFIG_CARDBUS=y
CONFIG_TCIC=y
CONFIG_I82092=y
CONFIG_I82365=y

#
# PCI Hotplug Support
#
CONFIG_HOTPLUG_PCI=m
CONFIG_HOTPLUG_PCI_COMPAQ=m
# CONFIG_HOTPLUG_PCI_COMPAQ_NVRAM is not set
CONFIG_HOTPLUG_PCI_IBM=m
CONFIG_HOTPLUG_PCI_ACPI=m
CONFIG_SYSVIPC=y
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_SYSCTL=y
CONFIG_KCORE_ELF=y
# CONFIG_KCORE_AOUT is not set
CONFIG_BINFMT_AOUT=m
CONFIG_BINFMT_ELF=y
CONFIG_BINFMT_MISC=m
CONFIG_PM=y
CONFIG_ACPI=y
# CONFIG_ACPI_DEBUG is not set
CONFIG_ACPI_BUSMGR=m
CONFIG_ACPI_SYS=m
CONFIG_ACPI_CPU=m
CONFIG_ACPI_BUTTON=m
CONFIG_ACPI_AC=m
CONFIG_ACPI_EC=m
CONFIG_ACPI_CMBATT=m
CONFIG_ACPI_THERMAL=m
CONFIG_APM=m
# CONFIG_APM_IGNORE_USER_SUSPEND is not set
# CONFIG_APM_DO_ENABLE is not set
# CONFIG_APM_CPU_IDLE is not set
# CONFIG_APM_DISPLAY_BLANK is not set
CONFIG_APM_RTC_IS_GMT=y
# CONFIG_APM_ALLOW_INTS is not set
CONFIG_APM_REAL_MODE_POWER_OFF=y

#
# Memory Technology Devices (MTD)
#
CONFIG_MTD=m
# CONFIG_MTD_DEBUG is not set
CONFIG_MTD_PARTITIONS=m
CONFIG_MTD_CONCAT=m
CONFIG_MTD_REDBOOT_PARTS=m
# CONFIG_MTD_CMDLINE_PARTS is not set
CONFIG_MTD_CHAR=m
CONFIG_MTD_BLOCK=m
CONFIG_MTD_BLOCK_RO=m
CONFIG_FTL=m
CONFIG_NFTL=m
CONFIG_NFTL_RW=y

#
# RAM/ROM/Flash chip drivers
#
CONFIG_MTD_CFI=m
CONFIG_MTD_JEDECPROBE=m
CONFIG_MTD_GEN_PROBE=m
CONFIG_MTD_CFI_ADV_OPTIONS=y
CONFIG_MTD_CFI_NOSWAP=y
# CONFIG_MTD_CFI_BE_BYTE_SWAP is not set
# CONFIG_MTD_CFI_LE_BYTE_SWAP is not set
# CONFIG_MTD_CFI_GEOMETRY is not set
CONFIG_MTD_CFI_INTELEXT=m
CONFIG_MTD_CFI_AMDSTD=m
# CONFIG_MTD_CFI_STAA is not set
CONFIG_MTD_RAM=m
CONFIG_MTD_ROM=m
CONFIG_MTD_ABSENT=m
# CONFIG_MTD_OBSOLETE_CHIPS is not set
# CONFIG_MTD_AMDSTD is not set
# CONFIG_MTD_SHARP is not set
# CONFIG_MTD_JEDEC is not set

#
# Mapping drivers for chip access
#
CONFIG_MTD_PHYSMAP=m
CONFIG_MTD_PHYSMAP_START=8000000
CONFIG_MTD_PHYSMAP_LEN=4000000
CONFIG_MTD_PHYSMAP_BUSWIDTH=2
CONFIG_MTD_PNC2000=m
CONFIG_MTD_SC520CDP=m
CONFIG_MTD_NETSC520=m
CONFIG_MTD_SBC_GXX=m
CONFIG_MTD_ELAN_104NC=m
CONFIG_MTD_DILNETPC=m
CONFIG_MTD_DILNETPC_BOOTSIZE=80000
# CONFIG_MTD_MIXMEM is not set
# CONFIG_MTD_OCTAGON is not set
# CONFIG_MTD_VMAX is not set
# CONFIG_MTD_SCx200_DOCFLASH is not set
CONFIG_MTD_L440GX=m
# CONFIG_MTD_AMD76XROM is not set
CONFIG_MTD_ICH2ROM=m
# CONFIG_MTD_NETtel is not set
# CONFIG_MTD_SCB2_FLASH is not set
CONFIG_MTD_PCI=m
# CONFIG_MTD_PCMCIA is not set

#
# Self-contained MTD device drivers
#
CONFIG_MTD_PMC551=m
CONFIG_MTD_PMC551_BUGFIX=y
# CONFIG_MTD_PMC551_DEBUG is not set
CONFIG_MTD_SLRAM=m
CONFIG_MTD_MTDRAM=m
CONFIG_MTDRAM_TOTAL_SIZE=4096
CONFIG_MTDRAM_ERASE_SIZE=128
CONFIG_MTD_BLKMTD=m
CONFIG_MTD_DOC1000=m
CONFIG_MTD_DOC2000=m
CONFIG_MTD_DOC2001=m
CONFIG_MTD_DOCPROBE=m
CONFIG_MTD_DOCPROBE_ADVANCED=y
CONFIG_MTD_DOCPROBE_ADDRESS=0000
CONFIG_MTD_DOCPROBE_HIGH=y
CONFIG_MTD_DOCPROBE_55AA=y

#
# NAND Flash Device Drivers
#
CONFIG_MTD_NAND=m
CONFIG_MTD_NAND_VERIFY_WRITE=y
CONFIG_MTD_NAND_IDS=m

#
# Parallel port support
#
CONFIG_PARPORT=m
CONFIG_PARPORT_PC=m
CONFIG_PARPORT_PC_CML1=m
CONFIG_PARPORT_SERIAL=m
CONFIG_PARPORT_PC_FIFO=y
CONFIG_PARPORT_PC_SUPERIO=y
CONFIG_PARPORT_PC_PCMCIA=m
# CONFIG_PARPORT_AMIGA is not set
# CONFIG_PARPORT_MFC3 is not set
# CONFIG_PARPORT_ATARI is not set
# CONFIG_PARPORT_GSC is not set
# CONFIG_PARPORT_SUNBPP is not set
# CONFIG_PARPORT_OTHER is not set
CONFIG_PARPORT_1284=y

#
# Plug and Play configuration
#
CONFIG_PNP=m
CONFIG_ISAPNP=m

#
# Block devices
#
CONFIG_BLK_DEV_FD=m
CONFIG_BLK_DEV_PS2=m
CONFIG_BLK_DEV_XD=m
CONFIG_PARIDE=m
CONFIG_PARIDE_PARPORT=m
CONFIG_PARIDE_PD=m
CONFIG_PARIDE_PCD=m
CONFIG_PARIDE_PF=m
CONFIG_PARIDE_PT=m
CONFIG_PARIDE_PG=m
CONFIG_PARIDE_ATEN=m
CONFIG_PARIDE_BPCK=m
CONFIG_PARIDE_BPCK6=m
CONFIG_PARIDE_COMM=m
CONFIG_PARIDE_DSTR=m
CONFIG_PARIDE_FIT2=m
CONFIG_PARIDE_FIT3=m
CONFIG_PARIDE_EPAT=m
CONFIG_PARIDE_EPATC8=y
CONFIG_PARIDE_EPIA=m
CONFIG_PARIDE_FRIQ=m
CONFIG_PARIDE_FRPW=m
CONFIG_PARIDE_KBIC=m
CONFIG_PARIDE_KTTI=m
CONFIG_PARIDE_ON20=m
CONFIG_PARIDE_ON26=m
CONFIG_BLK_CPQ_DA=m
CONFIG_BLK_CPQ_CISS_DA=m
CONFIG_CISS_SCSI_TAPE=y
CONFIG_BLK_DEV_DAC960=m
CONFIG_BLK_DEV_UMEM=m
CONFIG_BLK_DEV_LOOP=m
CONFIG_BLK_DEV_NBD=m
CONFIG_BLK_DEV_RAM=y
CONFIG_BLK_DEV_RAM_SIZE=4096
CONFIG_BLK_DEV_INITRD=y
CONFIG_BLK_STATS=y

#
# Multi-device support (RAID and LVM)
#
CONFIG_MD=y
CONFIG_BLK_DEV_MD=m
CONFIG_MD_LINEAR=m
CONFIG_MD_RAID0=m
CONFIG_MD_RAID1=m
CONFIG_MD_RAID5=m
CONFIG_MD_MULTIPATH=m
CONFIG_BLK_DEV_LVM=m

#
# Networking options
#
CONFIG_PACKET=m
CONFIG_PACKET_MMAP=y
CONFIG_NETLINK_DEV=y
CONFIG_NETFILTER=y
# CONFIG_NETFILTER_DEBUG is not set
CONFIG_FILTER=y
CONFIG_UNIX=m
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_IP_ROUTE_FWMARK=y
CONFIG_IP_ROUTE_NAT=y
CONFIG_IP_ROUTE_MULTIPATH=y
CONFIG_IP_ROUTE_TOS=y
CONFIG_IP_ROUTE_VERBOSE=y
CONFIG_IP_ROUTE_LARGE_TABLES=y
# CONFIG_IP_PNP is not set
CONFIG_NET_IPIP=m
CONFIG_NET_IPGRE=m
CONFIG_NET_IPGRE_BROADCAST=y
CONFIG_IP_MROUTE=y
CONFIG_IP_PIMSM_V1=y
CONFIG_IP_PIMSM_V2=y
# CONFIG_ARPD is not set
# CONFIG_INET_ECN is not set
CONFIG_SYN_COOKIES=y

#
#   IP: Netfilter Configuration
#
CONFIG_IP_NF_CONNTRACK=m
CONFIG_IP_NF_FTP=m
# CONFIG_IP_NF_AMANDA is not set
# CONFIG_IP_NF_TFTP is not set
CONFIG_IP_NF_IRC=m
CONFIG_IP_NF_QUEUE=m
CONFIG_IP_NF_IPTABLES=m
CONFIG_IP_NF_MATCH_LIMIT=m
CONFIG_IP_NF_MATCH_MAC=m
CONFIG_IP_NF_MATCH_PKTTYPE=m
CONFIG_IP_NF_MATCH_MARK=m
CONFIG_IP_NF_MATCH_MULTIPORT=m
CONFIG_IP_NF_MATCH_TOS=m
CONFIG_IP_NF_MATCH_ECN=m
CONFIG_IP_NF_MATCH_DSCP=m
CONFIG_IP_NF_MATCH_AH_ESP=m
CONFIG_IP_NF_MATCH_LENGTH=m
CONFIG_IP_NF_MATCH_TTL=m
CONFIG_IP_NF_MATCH_TCPMSS=m
CONFIG_IP_NF_MATCH_HELPER=m
CONFIG_IP_NF_MATCH_STATE=m
CONFIG_IP_NF_MATCH_CONNTRACK=m
CONFIG_IP_NF_MATCH_UNCLEAN=m
CONFIG_IP_NF_MATCH_OWNER=m
CONFIG_IP_NF_FILTER=m
CONFIG_IP_NF_TARGET_REJECT=m
CONFIG_IP_NF_TARGET_MIRROR=m
CONFIG_IP_NF_NAT=m
CONFIG_IP_NF_NAT_NEEDED=y
CONFIG_IP_NF_TARGET_MASQUERADE=m
CONFIG_IP_NF_TARGET_REDIRECT=m
CONFIG_IP_NF_NAT_LOCAL=y
CONFIG_IP_NF_NAT_SNMP_BASIC=m
CONFIG_IP_NF_NAT_IRC=m
CONFIG_IP_NF_NAT_FTP=m
CONFIG_IP_NF_MANGLE=m
CONFIG_IP_NF_TARGET_TOS=m
CONFIG_IP_NF_TARGET_ECN=m
CONFIG_IP_NF_TARGET_DSCP=m
CONFIG_IP_NF_TARGET_MARK=m
CONFIG_IP_NF_TARGET_LOG=m
CONFIG_IP_NF_TARGET_ULOG=m
CONFIG_IP_NF_TARGET_TCPMSS=m
CONFIG_IP_NF_ARPTABLES=m
CONFIG_IP_NF_ARPFILTER=m
CONFIG_IP_NF_COMPAT_IPCHAINS=m
CONFIG_IP_NF_NAT_NEEDED=y
CONFIG_IP_NF_COMPAT_IPFWADM=m
CONFIG_IP_NF_NAT_NEEDED=y
CONFIG_IPV6=m

#
#   IPv6: Netfilter Configuration
#
CONFIG_IP6_NF_QUEUE=m
CONFIG_IP6_NF_IPTABLES=m
CONFIG_IP6_NF_MATCH_LIMIT=m
CONFIG_IP6_NF_MATCH_MAC=m
# CONFIG_IP6_NF_MATCH_RT is not set
# CONFIG_IP6_NF_MATCH_OPTS is not set
# CONFIG_IP6_NF_MATCH_FRAG is not set
# CONFIG_IP6_NF_MATCH_HL is not set
CONFIG_IP6_NF_MATCH_MULTIPORT=m
CONFIG_IP6_NF_MATCH_OWNER=m
CONFIG_IP6_NF_MATCH_MARK=m
# CONFIG_IP6_NF_MATCH_IPV6HEADER is not set
# CONFIG_IP6_NF_MATCH_AHESP is not set
CONFIG_IP6_NF_MATCH_LENGTH=m
CONFIG_IP6_NF_MATCH_EUI64=m
CONFIG_IP6_NF_FILTER=m
CONFIG_IP6_NF_TARGET_LOG=m
CONFIG_IP6_NF_MANGLE=m
CONFIG_IP6_NF_TARGET_MARK=m
CONFIG_KHTTPD=m
CONFIG_ATM=m
CONFIG_VLAN_8021Q=m
CONFIG_IPX=m
CONFIG_IPX_INTERN=y
CONFIG_ATALK=m

#
# Appletalk devices
#
CONFIG_DEV_APPLETALK=y
CONFIG_LTPC=m
CONFIG_COPS=m
CONFIG_COPS_DAYNA=y
CONFIG_COPS_TANGENT=y
CONFIG_IPDDP=m
CONFIG_IPDDP_ENCAP=y
CONFIG_IPDDP_DECAP=y
CONFIG_DECNET=m
CONFIG_DECNET_SIOCGIFCONF=y
CONFIG_DECNET_ROUTER=y
# CONFIG_DECNET_ROUTE_FWMARK is not set
CONFIG_BRIDGE=m
CONFIG_X25=m
CONFIG_LAPB=m
CONFIG_LLC=y
CONFIG_NET_DIVERT=y
CONFIG_ECONET=m
CONFIG_ECONET_AUNUDP=y
CONFIG_ECONET_NATIVE=y
CONFIG_WAN_ROUTER=m
# CONFIG_NET_FASTROUTE is not set
# CONFIG_NET_HW_FLOWCONTROL is not set

#
# QoS and/or fair queueing
#
CONFIG_NET_SCHED=y
CONFIG_NET_SCH_CBQ=m
CONFIG_NET_SCH_HTB=m
CONFIG_NET_SCH_CSZ=m
CONFIG_NET_SCH_PRIO=m
CONFIG_NET_SCH_RED=m
CONFIG_NET_SCH_SFQ=m
CONFIG_NET_SCH_TEQL=m
CONFIG_NET_SCH_TBF=m
CONFIG_NET_SCH_GRED=m
CONFIG_NET_SCH_DSMARK=m
CONFIG_NET_SCH_INGRESS=m
CONFIG_NET_QOS=y
CONFIG_NET_ESTIMATOR=y
CONFIG_NET_CLS=y
CONFIG_NET_CLS_TCINDEX=m
CONFIG_NET_CLS_ROUTE4=m
CONFIG_NET_CLS_ROUTE=y
CONFIG_NET_CLS_FW=m
CONFIG_NET_CLS_U32=m
CONFIG_NET_CLS_RSVP=m
CONFIG_NET_CLS_RSVP6=m
CONFIG_NET_CLS_POLICE=y

#
# Network testing
#
CONFIG_NET_PKTGEN=m

#
# Telephony Support
#
CONFIG_PHONE=m
CONFIG_PHONE_IXJ=m
CONFIG_PHONE_IXJ_PCMCIA=m

#
# ATA/IDE/MFM/RLL support
#
CONFIG_IDE=m

#
# IDE, ATA and ATAPI Block devices
#
CONFIG_BLK_DEV_IDE=m
# CONFIG_BLK_DEV_HD_IDE is not set
# CONFIG_BLK_DEV_HD is not set
CONFIG_BLK_DEV_IDEDISK=m
# CONFIG_IDEDISK_MULTI_MODE is not set
CONFIG_IDEDISK_STROKE=y
CONFIG_BLK_DEV_IDECS=m
CONFIG_BLK_DEV_IDECD=m
CONFIG_BLK_DEV_IDETAPE=m
CONFIG_BLK_DEV_IDEFLOPPY=m
CONFIG_BLK_DEV_IDESCSI=m
CONFIG_IDE_TASK_IOCTL=y
CONFIG_BLK_DEV_CMD640=y
# CONFIG_BLK_DEV_CMD640_ENHANCED is not set
# CONFIG_BLK_DEV_ISAPNP is not set
CONFIG_BLK_DEV_IDEPCI=y
# CONFIG_BLK_DEV_GENERIC is not set
CONFIG_IDEPCI_SHARE_IRQ=y
CONFIG_BLK_DEV_IDEDMA_PCI=y
CONFIG_BLK_DEV_OFFBOARD=y
# CONFIG_BLK_DEV_IDEDMA_FORCED is not set
# CONFIG_IDEDMA_PCI_AUTO is not set
# CONFIG_IDEDMA_ONLYDISK is not set
CONFIG_BLK_DEV_IDEDMA=y
# CONFIG_IDEDMA_PCI_WIP is not set
# CONFIG_BLK_DEV_ADMA100 is not set
CONFIG_BLK_DEV_AEC62XX=y
CONFIG_BLK_DEV_ALI15X3=y
# CONFIG_WDC_ALI15X3 is not set
CONFIG_BLK_DEV_AMD74XX=y
# CONFIG_AMD74XX_OVERRIDE is not set
CONFIG_BLK_DEV_CMD64X=y
# CONFIG_BLK_DEV_TRIFLEX is not set
CONFIG_BLK_DEV_CY82C693=y
CONFIG_BLK_DEV_CS5530=y
CONFIG_BLK_DEV_HPT34X=y
# CONFIG_HPT34X_AUTODMA is not set
CONFIG_BLK_DEV_HPT366=y
CONFIG_BLK_DEV_PIIX=y
CONFIG_BLK_DEV_NS87415=y
CONFIG_BLK_DEV_OPTI621=y
# CONFIG_BLK_DEV_PDC202XX_OLD is not set
# CONFIG_PDC202XX_BURST is not set
# CONFIG_BLK_DEV_PDC202XX_NEW is not set
CONFIG_BLK_DEV_RZ1000=y
# CONFIG_BLK_DEV_SC1200 is not set
CONFIG_BLK_DEV_SVWKS=y
# CONFIG_BLK_DEV_SIIMAGE is not set
CONFIG_BLK_DEV_SIS5513=y
CONFIG_BLK_DEV_SLC90E66=y
CONFIG_BLK_DEV_TRM290=y
CONFIG_BLK_DEV_VIA82CXXX=y
CONFIG_IDE_CHIPSETS=y
CONFIG_BLK_DEV_4DRIVES=y
CONFIG_BLK_DEV_ALI14XX=m
CONFIG_BLK_DEV_DTC2278=m
CONFIG_BLK_DEV_HT6560B=m
# CONFIG_BLK_DEV_PDC4030 is not set
CONFIG_BLK_DEV_QD65XX=m
CONFIG_BLK_DEV_UMC8672=m
# CONFIG_IDEDMA_AUTO is not set
# CONFIG_IDEDMA_IVB is not set
# CONFIG_DMA_NONPCI is not set
CONFIG_BLK_DEV_IDE_MODES=y
CONFIG_BLK_DEV_ATARAID=m
CONFIG_BLK_DEV_ATARAID_PDC=m
CONFIG_BLK_DEV_ATARAID_HPT=m
# CONFIG_BLK_DEV_ATARAID_SII is not set

#
# SCSI support
#
CONFIG_SCSI=m
CONFIG_BLK_DEV_SD=m
CONFIG_SD_EXTRA_DEVS=64
CONFIG_CHR_DEV_ST=m
CONFIG_CHR_DEV_OSST=m
CONFIG_BLK_DEV_SR=m
CONFIG_BLK_DEV_SR_VENDOR=y
CONFIG_SR_EXTRA_DEVS=4
CONFIG_CHR_DEV_SG=m
# CONFIG_SCSI_DEBUG_QUEUES is not set
CONFIG_SCSI_MULTI_LUN=y
CONFIG_SCSI_CONSTANTS=y
CONFIG_SCSI_LOGGING=y

#
# SCSI low-level drivers
#
CONFIG_BLK_DEV_3W_XXXX_RAID=m
CONFIG_SCSI_7000FASST=m
CONFIG_SCSI_ACARD=m
CONFIG_SCSI_AHA152X=m
CONFIG_SCSI_AHA1542=m
CONFIG_SCSI_AHA1740=m
CONFIG_SCSI_AACRAID=m
CONFIG_SCSI_AIC7XXX=m
CONFIG_AIC7XXX_CMDS_PER_DEVICE=253
CONFIG_AIC7XXX_RESET_DELAY_MS=15000
CONFIG_AIC7XXX_PROBE_EISA_VL=y
# CONFIG_AIC7XXX_BUILD_FIRMWARE is not set
# CONFIG_SCSI_AIC79XX is not set
CONFIG_SCSI_AIC7XXX_OLD=m
CONFIG_AIC7XXX_OLD_TCQ_ON_BY_DEFAULT=y
CONFIG_AIC7XXX_OLD_CMDS_PER_DEVICE=128
CONFIG_AIC7XXX_OLD_PROC_STATS=y
CONFIG_SCSI_DPT_I2O=m
CONFIG_SCSI_ADVANSYS=m
CONFIG_SCSI_IN2000=m
CONFIG_SCSI_AM53C974=m
CONFIG_SCSI_MEGARAID=m
CONFIG_SCSI_BUSLOGIC=m
# CONFIG_SCSI_OMIT_FLASHPOINT is not set
CONFIG_SCSI_CPQFCTS=m
CONFIG_SCSI_DMX3191D=m
CONFIG_SCSI_DTC3280=m
CONFIG_SCSI_EATA=m
CONFIG_SCSI_EATA_TAGGED_QUEUE=y
# CONFIG_SCSI_EATA_LINKED_COMMANDS is not set
CONFIG_SCSI_EATA_MAX_TAGS=16
CONFIG_SCSI_EATA_DMA=m
CONFIG_SCSI_EATA_PIO=m
CONFIG_SCSI_FUTURE_DOMAIN=m
CONFIG_SCSI_FD_MCS=m
CONFIG_SCSI_GDTH=m
CONFIG_SCSI_GENERIC_NCR5380=m
# CONFIG_SCSI_GENERIC_NCR53C400 is not set
CONFIG_SCSI_G_NCR5380_PORT=y
# CONFIG_SCSI_G_NCR5380_MEM is not set
CONFIG_SCSI_IBMMCA=m
CONFIG_IBMMCA_SCSI_ORDER_STANDARD=y
# CONFIG_IBMMCA_SCSI_DEV_RESET is not set
CONFIG_SCSI_IPS=m
CONFIG_SCSI_INITIO=m
CONFIG_SCSI_INIA100=m
CONFIG_SCSI_PPA=m
CONFIG_SCSI_IMM=m
# CONFIG_SCSI_IZIP_EPP16 is not set
# CONFIG_SCSI_IZIP_SLOW_CTR is not set
CONFIG_SCSI_NCR53C406A=m
CONFIG_SCSI_NCR_D700=m
CONFIG_53C700_IO_MAPPED=y
CONFIG_SCSI_NCR53C7xx=m
# CONFIG_SCSI_NCR53C7xx_sync is not set
CONFIG_SCSI_NCR53C7xx_FAST=y
CONFIG_SCSI_NCR53C7xx_DISCONNECT=y
CONFIG_SCSI_SYM53C8XX_2=m
CONFIG_SCSI_SYM53C8XX_DMA_ADDRESSING_MODE=1
CONFIG_SCSI_SYM53C8XX_DEFAULT_TAGS=16
CONFIG_SCSI_SYM53C8XX_MAX_TAGS=64
CONFIG_SCSI_SYM53C8XX_IOMAPPED=y
CONFIG_SCSI_NCR53C8XX=m
CONFIG_SCSI_SYM53C8XX=m
CONFIG_SCSI_NCR53C8XX_DEFAULT_TAGS=8
CONFIG_SCSI_NCR53C8XX_MAX_TAGS=32
CONFIG_SCSI_NCR53C8XX_SYNC=20
# CONFIG_SCSI_NCR53C8XX_PROFILE is not set
# CONFIG_SCSI_NCR53C8XX_IOMAPPED is not set
CONFIG_SCSI_NCR53C8XX_PQS_PDS=y
# CONFIG_SCSI_NCR53C8XX_SYMBIOS_COMPAT is not set
CONFIG_SCSI_MCA_53C9X=m
CONFIG_SCSI_PAS16=m
CONFIG_SCSI_PCI2000=m
CONFIG_SCSI_PCI2220I=m
CONFIG_SCSI_PSI240I=m
CONFIG_SCSI_QLOGIC_FAS=m
CONFIG_SCSI_QLOGIC_ISP=m
CONFIG_SCSI_QLOGIC_FC=m
# CONFIG_SCSI_QLOGIC_FC_FIRMWARE is not set
CONFIG_SCSI_QLOGIC_1280=m
CONFIG_SCSI_SEAGATE=m
CONFIG_SCSI_SIM710=m
CONFIG_SCSI_SYM53C416=m
CONFIG_SCSI_DC390T=m
# CONFIG_SCSI_DC390T_NOGENSUPP is not set
CONFIG_SCSI_T128=m
CONFIG_SCSI_U14_34F=m
# CONFIG_SCSI_U14_34F_LINKED_COMMANDS is not set
CONFIG_SCSI_U14_34F_MAX_TAGS=8
CONFIG_SCSI_ULTRASTOR=m
# CONFIG_SCSI_NSP32 is not set
CONFIG_SCSI_DEBUG=m

#
# PCMCIA SCSI adapter support
#
CONFIG_SCSI_PCMCIA=y
CONFIG_PCMCIA_AHA152X=m
CONFIG_PCMCIA_FDOMAIN=m
CONFIG_PCMCIA_NINJA_SCSI=m
CONFIG_PCMCIA_QLOGIC=m

#
# Fusion MPT device support
#
CONFIG_FUSION=m
# CONFIG_FUSION_BOOT is not set
CONFIG_FUSION_MAX_SGE=40
CONFIG_FUSION_ISENSE=m
CONFIG_FUSION_CTL=m
CONFIG_FUSION_LAN=m
CONFIG_NET_FC=y

#
# IEEE 1394 (FireWire) support (EXPERIMENTAL)
#
CONFIG_IEEE1394=m
CONFIG_IEEE1394_PCILYNX=m
CONFIG_IEEE1394_OHCI1394=m
CONFIG_IEEE1394_VIDEO1394=m
CONFIG_IEEE1394_SBP2=m
CONFIG_IEEE1394_SBP2_PHYS_DMA=y
CONFIG_IEEE1394_ETH1394=m
CONFIG_IEEE1394_DV1394=m
CONFIG_IEEE1394_RAWIO=m
CONFIG_IEEE1394_CMP=m
CONFIG_IEEE1394_AMDTP=m
# CONFIG_IEEE1394_VERBOSEDEBUG is not set

#
# I2O device support
#
CONFIG_I2O=m
CONFIG_I2O_PCI=m
CONFIG_I2O_BLOCK=m
CONFIG_I2O_LAN=m
CONFIG_I2O_SCSI=m
CONFIG_I2O_PROC=m

#
# Network device support
#
CONFIG_NETDEVICES=y

#
# ARCnet devices
#
CONFIG_ARCNET=m
CONFIG_ARCNET_1201=m
CONFIG_ARCNET_1051=m
CONFIG_ARCNET_RAW=m
CONFIG_ARCNET_COM90xx=m
CONFIG_ARCNET_COM90xxIO=m
CONFIG_ARCNET_RIM_I=m
CONFIG_ARCNET_COM20020=m
CONFIG_ARCNET_COM20020_ISA=m
CONFIG_ARCNET_COM20020_PCI=m
CONFIG_DUMMY=m
CONFIG_BONDING=m
CONFIG_EQUALIZER=m
CONFIG_TUN=m
CONFIG_ETHERTAP=m
CONFIG_NET_SB1000=m

#
# Ethernet (10 or 100Mbit)
#
CONFIG_NET_ETHERNET=y
# CONFIG_SUNLANCE is not set
CONFIG_HAPPYMEAL=m
# CONFIG_SUNBMAC is not set
# CONFIG_SUNQE is not set
CONFIG_SUNGEM=m
CONFIG_NET_VENDOR_3COM=y
CONFIG_EL1=m
CONFIG_EL2=m
CONFIG_ELPLUS=m
CONFIG_EL16=m
CONFIG_EL3=m
CONFIG_3C515=m
CONFIG_ELMC=m
CONFIG_ELMC_II=m
CONFIG_VORTEX=m
# CONFIG_TYPHOON is not set
CONFIG_LANCE=m
CONFIG_NET_VENDOR_SMC=y
CONFIG_WD80x3=m
CONFIG_ULTRAMCA=m
CONFIG_ULTRA=m
CONFIG_ULTRA32=m
CONFIG_SMC9194=m
CONFIG_NET_VENDOR_RACAL=y
CONFIG_NI5010=m
CONFIG_NI52=m
CONFIG_NI65=m
CONFIG_AT1700=m
CONFIG_DEPCA=m
CONFIG_HP100=m
CONFIG_NET_ISA=y
CONFIG_E2100=m
CONFIG_EWRK3=m
CONFIG_EEXPRESS=m
CONFIG_EEXPRESS_PRO=m
CONFIG_HPLAN_PLUS=m
CONFIG_HPLAN=m
CONFIG_LP486E=m
CONFIG_ETH16I=m
CONFIG_NE2000=m
CONFIG_SKMC=m
CONFIG_NE2_MCA=m
CONFIG_IBMLANA=m
CONFIG_NET_PCI=y
CONFIG_PCNET32=m
# CONFIG_AMD8111_ETH is not set
CONFIG_ADAPTEC_STARFIRE=m
CONFIG_AC3200=m
CONFIG_APRICOT=m
CONFIG_CS89x0=m
CONFIG_TULIP=m
CONFIG_TULIP_MWI=y
CONFIG_TULIP_MMIO=y
CONFIG_DE4X5=m
CONFIG_DGRS=m
CONFIG_DM9102=m
CONFIG_EEPRO100=m
# CONFIG_EEPRO100_PIO is not set
CONFIG_E100=m
CONFIG_LNE390=m
CONFIG_FEALNX=m
CONFIG_NATSEMI=m
CONFIG_NE2K_PCI=m
CONFIG_NE3210=m
CONFIG_ES3210=m
CONFIG_8139CP=m
CONFIG_8139TOO=m
# CONFIG_8139TOO_PIO is not set
# CONFIG_8139TOO_TUNE_TWISTER is not set
CONFIG_8139TOO_8129=y
# CONFIG_8139_OLD_RX_RESET is not set
CONFIG_SIS900=m
CONFIG_EPIC100=m
CONFIG_SUNDANCE=m
# CONFIG_SUNDANCE_MMIO is not set
CONFIG_TLAN=m
CONFIG_TC35815=m
CONFIG_VIA_RHINE=m
# CONFIG_VIA_RHINE_MMIO is not set
CONFIG_WINBOND_840=m
CONFIG_NET_POCKET=y
CONFIG_ATP=m
CONFIG_DE600=m
CONFIG_DE620=m

#
# Ethernet (1000 Mbit)
#
CONFIG_ACENIC=m
# CONFIG_ACENIC_OMIT_TIGON_I is not set
CONFIG_DL2K=m
CONFIG_E1000=m
# CONFIG_MYRI_SBUS is not set
CONFIG_NS83820=m
CONFIG_HAMACHI=m
CONFIG_YELLOWFIN=m
# CONFIG_R8169 is not set
CONFIG_SK98LIN=m
CONFIG_TIGON3=m
CONFIG_FDDI=y
CONFIG_DEFXX=m
CONFIG_SKFP=m
CONFIG_HIPPI=y
CONFIG_ROADRUNNER=m
# CONFIG_ROADRUNNER_LARGE_RINGS is not set
CONFIG_PLIP=m
CONFIG_PPP=m
CONFIG_PPP_MULTILINK=y
CONFIG_PPP_FILTER=y
CONFIG_PPP_ASYNC=m
CONFIG_PPP_SYNC_TTY=m
CONFIG_PPP_DEFLATE=m
CONFIG_PPP_BSDCOMP=m
CONFIG_PPPOE=m
CONFIG_PPPOATM=m
CONFIG_SLIP=m
CONFIG_SLIP_COMPRESSED=y
CONFIG_SLIP_SMART=y
CONFIG_SLIP_MODE_SLIP6=y

#
# Wireless LAN (non-hamradio)
#
CONFIG_NET_RADIO=y
CONFIG_STRIP=m
CONFIG_WAVELAN=m
CONFIG_ARLAN=m
CONFIG_AIRONET4500=m
CONFIG_AIRONET4500_NONCS=m
CONFIG_AIRONET4500_PNP=y
CONFIG_AIRONET4500_PCI=y
# CONFIG_AIRONET4500_ISA is not set
# CONFIG_AIRONET4500_I365 is not set
CONFIG_AIRONET4500_PROC=m
CONFIG_AIRO=m
CONFIG_HERMES=m
CONFIG_PLX_HERMES=m
CONFIG_PCI_HERMES=m
CONFIG_PCMCIA_HERMES=m
CONFIG_AIRO_CS=m
CONFIG_NET_WIRELESS=y

#
# Token Ring devices
#
CONFIG_TR=y
CONFIG_IBMTR=m
CONFIG_IBMOL=m
CONFIG_IBMLS=m
CONFIG_3C359=m
CONFIG_TMS380TR=m
CONFIG_TMSPCI=m
CONFIG_TMSISA=m
CONFIG_ABYSS=m
CONFIG_MADGEMC=m
CONFIG_SMCTR=m
CONFIG_NET_FC=y
CONFIG_IPHASE5526=m
CONFIG_RCPCI=m
CONFIG_SHAPER=m

#
# Wan interfaces
#
CONFIG_WAN=y
CONFIG_HOSTESS_SV11=m
CONFIG_COSA=m
CONFIG_COMX=m
CONFIG_COMX_HW_COMX=m
CONFIG_COMX_HW_LOCOMX=m
CONFIG_COMX_HW_MIXCOM=m
CONFIG_COMX_HW_MUNICH=m
CONFIG_COMX_PROTO_PPP=m
CONFIG_COMX_PROTO_LAPB=m
CONFIG_COMX_PROTO_FR=m
CONFIG_DSCC4=m
CONFIG_LANMEDIA=m
CONFIG_ATI_XX20=m
CONFIG_SEALEVEL_4021=m
CONFIG_SYNCLINK_SYNCPPP=m
CONFIG_HDLC=m
# CONFIG_HDLC_RAW is not set
# CONFIG_HDLC_CISCO is not set
# CONFIG_HDLC_FR is not set
CONFIG_HDLC_PPP=y
CONFIG_HDLC_X25=y
CONFIG_N2=m
CONFIG_C101=m
CONFIG_FARSYNC=m
# CONFIG_HDLC_DEBUG_PKT is not set
# CONFIG_HDLC_DEBUG_HARD_HEADER is not set
# CONFIG_HDLC_DEBUG_ECN is not set
# CONFIG_HDLC_DEBUG_RINGS is not set
CONFIG_DLCI=m
CONFIG_DLCI_COUNT=24
CONFIG_DLCI_MAX=8
CONFIG_SDLA=m
CONFIG_WAN_ROUTER_DRIVERS=y
CONFIG_VENDOR_SANGOMA=m
CONFIG_WANPIPE_CHDLC=y
CONFIG_WANPIPE_FR=y
CONFIG_WANPIPE_X25=y
CONFIG_WANPIPE_PPP=y
CONFIG_WANPIPE_MULTPPP=y
CONFIG_CYCLADES_SYNC=m
CONFIG_CYCLOMX_X25=y
CONFIG_LAPBETHER=m
CONFIG_X25_ASY=m
CONFIG_SBNI=m
# CONFIG_SBNI_MULTILINE is not set

#
# PCMCIA network device support
#
CONFIG_NET_PCMCIA=y
CONFIG_PCMCIA_3C589=m
CONFIG_PCMCIA_3C574=m
CONFIG_PCMCIA_FMVJ18X=m
CONFIG_PCMCIA_PCNET=m
CONFIG_PCMCIA_AXNET=m
CONFIG_PCMCIA_NMCLAN=m
CONFIG_PCMCIA_SMC91C92=m
CONFIG_PCMCIA_XIRC2PS=m
CONFIG_ARCNET_COM20020_CS=m
CONFIG_PCMCIA_IBMTR=m
CONFIG_PCMCIA_XIRCOM=m
CONFIG_PCMCIA_XIRTULIP=m
CONFIG_NET_PCMCIA_RADIO=y
CONFIG_PCMCIA_RAYCS=m
CONFIG_PCMCIA_NETWAVE=m
CONFIG_PCMCIA_WAVELAN=m
CONFIG_AIRONET4500_CS=m

#
# Amateur Radio support
#
CONFIG_HAMRADIO=y
CONFIG_AX25=m
CONFIG_AX25_DAMA_SLAVE=y
CONFIG_NETROM=m
CONFIG_ROSE=m

#
# AX.25 network device drivers
#
CONFIG_MKISS=m
CONFIG_6PACK=m
CONFIG_BPQETHER=m
CONFIG_DMASCC=m
CONFIG_SCC=m
# CONFIG_SCC_DELAY is not set
# CONFIG_SCC_TRXECHO is not set
CONFIG_BAYCOM_SER_FDX=m
CONFIG_BAYCOM_SER_HDX=m
CONFIG_BAYCOM_PAR=m
CONFIG_BAYCOM_EPP=m
CONFIG_SOUNDMODEM=m
CONFIG_SOUNDMODEM_SBC=y
CONFIG_SOUNDMODEM_WSS=y
CONFIG_SOUNDMODEM_AFSK1200=y
CONFIG_SOUNDMODEM_AFSK2400_7=y
CONFIG_SOUNDMODEM_AFSK2400_8=y
CONFIG_SOUNDMODEM_AFSK2666=y
CONFIG_SOUNDMODEM_HAPN4800=y
CONFIG_SOUNDMODEM_PSK4800=y
CONFIG_SOUNDMODEM_FSK9600=y
CONFIG_YAM=m

#
# IrDA (infrared) support
#
CONFIG_IRDA=m
CONFIG_IRLAN=m
CONFIG_IRNET=m
CONFIG_IRCOMM=m
CONFIG_IRDA_ULTRA=y
CONFIG_IRDA_CACHE_LAST_LSAP=y
CONFIG_IRDA_FAST_RR=y
CONFIG_IRDA_DEBUG=y

#
# Infrared-port device drivers
#
CONFIG_IRTTY_SIR=m
CONFIG_IRPORT_SIR=m
CONFIG_DONGLE=y
CONFIG_ESI_DONGLE=m
CONFIG_ACTISYS_DONGLE=m
CONFIG_TEKRAM_DONGLE=m
CONFIG_GIRBIL_DONGLE=m
CONFIG_LITELINK_DONGLE=m
CONFIG_MCP2120_DONGLE=m
CONFIG_OLD_BELKIN_DONGLE=m
CONFIG_ACT200L_DONGLE=m
CONFIG_MA600_DONGLE=m
CONFIG_USB_IRDA=m
CONFIG_NSC_FIR=m
CONFIG_WINBOND_FIR=m
# CONFIG_TOSHIBA_OLD is not set
CONFIG_TOSHIBA_FIR=m
CONFIG_SMC_IRCC_FIR=m
CONFIG_ALI_FIR=m
CONFIG_VLSI_FIR=m

#
# ISDN subsystem
#
CONFIG_ISDN=m
CONFIG_ISDN_BOOL=y
CONFIG_ISDN_PPP=y
CONFIG_ISDN_PPP_VJ=y
CONFIG_ISDN_MPP=y
CONFIG_ISDN_PPP_BSDCOMP=m
CONFIG_ISDN_AUDIO=y
CONFIG_ISDN_TTY_FAX=y
CONFIG_ISDN_X25=y

#
# ISDN feature submodules
#
CONFIG_ISDN_DRV_LOOP=m
CONFIG_ISDN_DIVERSION=m

#
# Passive ISDN cards
#
CONFIG_ISDN_DRV_HISAX=m
CONFIG_ISDN_HISAX=y
CONFIG_HISAX_EURO=y
CONFIG_DE_AOC=y
# CONFIG_HISAX_NO_SENDCOMPLETE is not set
# CONFIG_HISAX_NO_LLC is not set
# CONFIG_HISAX_NO_KEYPAD is not set
CONFIG_HISAX_1TR6=y
CONFIG_HISAX_NI1=y
CONFIG_HISAX_MAX_CARDS=8
CONFIG_HISAX_16_0=y
CONFIG_HISAX_16_3=y
CONFIG_HISAX_AVM_A1=y
CONFIG_HISAX_IX1MICROR2=y
CONFIG_HISAX_ASUSCOM=y
CONFIG_HISAX_TELEINT=y
CONFIG_HISAX_HFCS=y
CONFIG_HISAX_SPORTSTER=y
CONFIG_HISAX_MIC=y
CONFIG_HISAX_ISURF=y
CONFIG_HISAX_HSTSAPHIR=y
CONFIG_HISAX_TELESPCI=y
CONFIG_HISAX_S0BOX=y
CONFIG_HISAX_FRITZPCI=y
CONFIG_HISAX_AVM_A1_PCMCIA=y
CONFIG_HISAX_ELSA=y
CONFIG_HISAX_DIEHLDIVA=y
CONFIG_HISAX_SEDLBAUER=y
CONFIG_HISAX_NETJET=y
CONFIG_HISAX_NETJET_U=y
CONFIG_HISAX_NICCY=y
CONFIG_HISAX_BKM_A4T=y
CONFIG_HISAX_SCT_QUADRO=y
CONFIG_HISAX_GAZEL=y
CONFIG_HISAX_HFC_PCI=y
CONFIG_HISAX_W6692=y
CONFIG_HISAX_HFC_SX=y
# CONFIG_HISAX_ENTERNOW_PCI is not set
# CONFIG_HISAX_DEBUG is not set
CONFIG_HISAX_SEDLBAUER_CS=m
CONFIG_HISAX_ELSA_CS=m
CONFIG_HISAX_AVM_A1_CS=m
CONFIG_HISAX_ST5481=m
CONFIG_HISAX_FRITZ_PCIPNP=m
# CONFIG_USB_AUERISDN is not set

#
# Active ISDN cards
#
CONFIG_ISDN_DRV_ICN=m
CONFIG_ISDN_DRV_PCBIT=m
CONFIG_ISDN_DRV_SC=m
CONFIG_ISDN_DRV_ACT2000=m
CONFIG_ISDN_DRV_EICON=y
CONFIG_ISDN_DRV_EICON_DIVAS=m
CONFIG_ISDN_DRV_EICON_OLD=m
CONFIG_ISDN_DRV_EICON_PCI=y
CONFIG_ISDN_DRV_EICON_ISA=y
CONFIG_ISDN_DRV_TPAM=m
CONFIG_ISDN_CAPI=m
CONFIG_ISDN_DRV_AVMB1_VERBOSE_REASON=y
CONFIG_ISDN_CAPI_MIDDLEWARE=y
CONFIG_ISDN_CAPI_CAPI20=m
CONFIG_ISDN_CAPI_CAPIFS_BOOL=y
CONFIG_ISDN_CAPI_CAPIFS=m
CONFIG_ISDN_CAPI_CAPIDRV=m
CONFIG_ISDN_DRV_AVMB1_B1ISA=m
CONFIG_ISDN_DRV_AVMB1_B1PCI=m
CONFIG_ISDN_DRV_AVMB1_B1PCIV4=y
CONFIG_ISDN_DRV_AVMB1_T1ISA=m
CONFIG_ISDN_DRV_AVMB1_B1PCMCIA=m
CONFIG_ISDN_DRV_AVMB1_AVM_CS=m
CONFIG_ISDN_DRV_AVMB1_T1PCI=m
CONFIG_ISDN_DRV_AVMB1_C4=m
CONFIG_HYSDN=m
CONFIG_HYSDN_CAPI=y

#
# Old CD-ROM drivers (not SCSI, not IDE)
#
CONFIG_CD_NO_IDESCSI=y
CONFIG_AZTCD=m
CONFIG_GSCD=m
CONFIG_SBPCD=m
CONFIG_MCD=m
CONFIG_MCD_IRQ=11
CONFIG_MCD_BASE=300
CONFIG_MCDX=m
CONFIG_OPTCD=m
CONFIG_CM206=m
CONFIG_SJCD=m
CONFIG_ISP16_CDI=m
CONFIG_CDU31A=m
CONFIG_CDU535=m

#
# Input core support
#
CONFIG_INPUT=m
CONFIG_INPUT_KEYBDEV=m
CONFIG_INPUT_MOUSEDEV=m
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
CONFIG_INPUT_JOYDEV=m
CONFIG_INPUT_EVDEV=m

#
# Character devices
#
CONFIG_VT=y
CONFIG_VT_CONSOLE=y
CONFIG_SERIAL=y
CONFIG_SERIAL_CONSOLE=y
# CONFIG_SERIAL_EXTENDED is not set
CONFIG_SERIAL_NONSTANDARD=y
CONFIG_COMPUTONE=m
CONFIG_ROCKETPORT=m
CONFIG_CYCLADES=m
CONFIG_CYZ_INTR=y
CONFIG_DIGIEPCA=m
CONFIG_ESPSERIAL=m
CONFIG_MOXA_INTELLIO=m
CONFIG_MOXA_SMARTIO=m
CONFIG_ISI=m
CONFIG_SYNCLINK=m
CONFIG_SYNCLINKMP=m
CONFIG_N_HDLC=m
CONFIG_RISCOM8=m
CONFIG_SPECIALIX=m
# CONFIG_SPECIALIX_RTSCTS is not set
CONFIG_SX=m
CONFIG_RIO=m
# CONFIG_RIO_OLDPCI is not set
# CONFIG_STALDRV is not set
CONFIG_UNIX98_PTYS=y
CONFIG_UNIX98_PTY_COUNT=512
CONFIG_PRINTER=m
CONFIG_LP_CONSOLE=y
CONFIG_PPDEV=m
# CONFIG_TIPAR is not set

#
# I2C support
#
CONFIG_I2C=m
CONFIG_I2C_ALGOBIT=m
CONFIG_I2C_PHILIPSPAR=m
CONFIG_I2C_ELV=m
CONFIG_I2C_VELLEMAN=m
# CONFIG_SCx200_I2C is not set
# CONFIG_SCx200_ACB is not set
CONFIG_I2C_ALGOPCF=m
CONFIG_I2C_ELEKTOR=m
CONFIG_I2C_CHARDEV=m
CONFIG_I2C_PROC=m

#
# Mice
#
CONFIG_BUSMOUSE=m
CONFIG_ATIXL_BUSMOUSE=m
CONFIG_LOGIBUSMOUSE=m
CONFIG_MS_BUSMOUSE=m
CONFIG_MOUSE=m
CONFIG_PSMOUSE=y
CONFIG_82C710_MOUSE=m
CONFIG_PC110_PAD=m
CONFIG_MK712_MOUSE=m

#
# Joysticks
#
CONFIG_INPUT_GAMEPORT=m
CONFIG_INPUT_NS558=m
CONFIG_INPUT_LIGHTNING=m
CONFIG_INPUT_PCIGAME=m
CONFIG_INPUT_CS461X=m
CONFIG_INPUT_EMU10K1=m
CONFIG_INPUT_SERIO=m
CONFIG_INPUT_SERPORT=m
CONFIG_INPUT_ANALOG=m
CONFIG_INPUT_A3D=m
CONFIG_INPUT_ADI=m
CONFIG_INPUT_COBRA=m
CONFIG_INPUT_GF2K=m
CONFIG_INPUT_GRIP=m
CONFIG_INPUT_INTERACT=m
CONFIG_INPUT_TMDC=m
CONFIG_INPUT_SIDEWINDER=m
CONFIG_INPUT_IFORCE_USB=m
CONFIG_INPUT_IFORCE_232=m
CONFIG_INPUT_WARRIOR=m
CONFIG_INPUT_MAGELLAN=m
CONFIG_INPUT_SPACEORB=m
CONFIG_INPUT_SPACEBALL=m
CONFIG_INPUT_STINGER=m
CONFIG_INPUT_DB9=m
CONFIG_INPUT_GAMECON=m
CONFIG_INPUT_TURBOGRAFX=m
CONFIG_QIC02_TAPE=m
CONFIG_QIC02_DYNCONF=y
# CONFIG_IPMI_HANDLER is not set
# CONFIG_IPMI_PANIC_EVENT is not set
# CONFIG_IPMI_DEVICE_INTERFACE is not set
# CONFIG_IPMI_KCS is not set
# CONFIG_IPMI_WATCHDOG is not set

#
# Watchdog Cards
#
CONFIG_WATCHDOG=y
# CONFIG_WATCHDOG_NOWAYOUT is not set
CONFIG_ACQUIRE_WDT=m
CONFIG_ADVANTECH_WDT=m
# CONFIG_ALIM1535_WDT is not set
CONFIG_ALIM7101_WDT=m
CONFIG_SC520_WDT=m
CONFIG_PCWATCHDOG=m
CONFIG_EUROTECH_WDT=m
CONFIG_IB700_WDT=m
CONFIG_WAFER_WDT=m
CONFIG_I810_TCO=m
CONFIG_MIXCOMWD=m
CONFIG_60XX_WDT=m
CONFIG_SC1200_WDT=m
# CONFIG_SCx200_WDT is not set
CONFIG_SOFT_WATCHDOG=m
CONFIG_W83877F_WDT=m
CONFIG_WDT=m
CONFIG_WDTPCI=m
CONFIG_WDT_501=y
# CONFIG_WDT_501_FAN is not set
CONFIG_MACHZ_WDT=m
CONFIG_AMD7XX_TCO=m
# CONFIG_SCx200_GPIO is not set
CONFIG_AMD_RNG=m
CONFIG_INTEL_RNG=m
CONFIG_AMD_PM768=m
CONFIG_NVRAM=m
CONFIG_RTC=m
CONFIG_DTLK=m
CONFIG_R3964=m
CONFIG_APPLICOM=m
CONFIG_SONYPI=m

#
# Ftape, the floppy tape device driver
#
CONFIG_FTAPE=m
CONFIG_ZFTAPE=m
CONFIG_ZFT_DFLT_BLK_SZ=10240
CONFIG_ZFT_COMPRESSOR=m
CONFIG_FT_NR_BUFFERS=3
CONFIG_FT_PROC_FS=y
CONFIG_FT_NORMAL_DEBUG=y
# CONFIG_FT_FULL_DEBUG is not set
# CONFIG_FT_NO_TRACE is not set
# CONFIG_FT_NO_TRACE_AT_ALL is not set
CONFIG_FT_STD_FDC=y
# CONFIG_FT_MACH2 is not set
# CONFIG_FT_PROBE_FC10 is not set
# CONFIG_FT_ALT_FDC is not set
CONFIG_FT_FDC_THR=8
CONFIG_FT_FDC_MAX_RATE=2000
CONFIG_FT_ALPHA_CLOCK=0
CONFIG_AGP=m
CONFIG_AGP_INTEL=y
CONFIG_AGP_I810=y
CONFIG_AGP_VIA=y
CONFIG_AGP_AMD=y
CONFIG_AGP_AMD_8151=y
CONFIG_AGP_SIS=y
CONFIG_AGP_ALI=y
CONFIG_AGP_SWORKS=y
CONFIG_DRM=y
# CONFIG_DRM_OLD is not set
CONFIG_DRM_NEW=y
CONFIG_DRM_TDFX=m
CONFIG_DRM_R128=m
CONFIG_DRM_RADEON=m
CONFIG_DRM_I810=m
# CONFIG_DRM_I810_XFREE_41 is not set
CONFIG_DRM_I830=m
CONFIG_DRM_MGA=m
CONFIG_DRM_SIS=m

#
# PCMCIA character devices
#
CONFIG_PCMCIA_SERIAL_CS=m
CONFIG_SYNCLINK_CS=m
CONFIG_MWAVE=m

#
# Multimedia devices
#
CONFIG_VIDEO_DEV=m

#
# Video For Linux
#
CONFIG_VIDEO_PROC_FS=y
CONFIG_I2C_PARPORT=m
CONFIG_VIDEO_BT848=m
CONFIG_VIDEO_PMS=m
CONFIG_VIDEO_BWQCAM=m
CONFIG_VIDEO_CQCAM=m
CONFIG_VIDEO_W9966=m
CONFIG_VIDEO_CPIA=m
CONFIG_VIDEO_CPIA_PP=m
CONFIG_VIDEO_CPIA_USB=m
CONFIG_VIDEO_SAA5249=m
CONFIG_TUNER_3036=m
CONFIG_VIDEO_STRADIS=m
CONFIG_VIDEO_ZORAN=m
CONFIG_VIDEO_ZORAN_BUZ=m
CONFIG_VIDEO_ZORAN_DC10=m
CONFIG_VIDEO_ZORAN_LML33=m
CONFIG_VIDEO_ZR36120=m
CONFIG_VIDEO_MEYE=m

#
# Radio Adapters
#
CONFIG_RADIO_CADET=m
CONFIG_RADIO_RTRACK=m
CONFIG_RADIO_RTRACK2=m
CONFIG_RADIO_AZTECH=m
CONFIG_RADIO_GEMTEK=m
CONFIG_RADIO_GEMTEK_PCI=m
CONFIG_RADIO_MAXIRADIO=m
CONFIG_RADIO_MAESTRO=m
CONFIG_RADIO_MIROPCM20=m
CONFIG_RADIO_MIROPCM20_RDS=m
CONFIG_RADIO_SF16FMI=m
# CONFIG_RADIO_SF16FMR2 is not set
CONFIG_RADIO_TERRATEC=m
CONFIG_RADIO_TRUST=m
CONFIG_RADIO_TYPHOON=m
CONFIG_RADIO_TYPHOON_PROC_FS=y
CONFIG_RADIO_ZOLTRIX=m

#
# File systems
#
CONFIG_QUOTA=y
CONFIG_AUTOFS_FS=m
CONFIG_AUTOFS4_FS=m
CONFIG_REISERFS_FS=m
# CONFIG_REISERFS_CHECK is not set
CONFIG_REISERFS_PROC_INFO=y
CONFIG_ADFS_FS=m
CONFIG_ADFS_FS_RW=y
CONFIG_AFFS_FS=m
CONFIG_HFS_FS=m
CONFIG_BEFS_FS=m
# CONFIG_BEFS_DEBUG is not set
CONFIG_BFS_FS=m
CONFIG_EXT3_FS=m
CONFIG_JBD=m
# CONFIG_JBD_DEBUG is not set
CONFIG_FAT_FS=m
CONFIG_MSDOS_FS=m
CONFIG_UMSDOS_FS=m
CONFIG_VFAT_FS=m
CONFIG_EFS_FS=m
CONFIG_JFFS_FS=m
CONFIG_JFFS_FS_VERBOSE=0
CONFIG_JFFS_PROC_FS=y
CONFIG_JFFS2_FS=m
CONFIG_JFFS2_FS_DEBUG=0
CONFIG_CRAMFS=m
CONFIG_TMPFS=y
CONFIG_RAMFS=y
CONFIG_ISO9660_FS=m
CONFIG_JOLIET=y
CONFIG_ZISOFS=y
CONFIG_JFS_FS=m
# CONFIG_JFS_DEBUG is not set
CONFIG_JFS_STATISTICS=y
CONFIG_MINIX_FS=m
CONFIG_VXFS_FS=m
CONFIG_NTFS_FS=m
CONFIG_NTFS_RW=y
CONFIG_HPFS_FS=m
CONFIG_PROC_FS=y
CONFIG_DEVFS_FS=y
# CONFIG_DEVFS_MOUNT is not set
# CONFIG_DEVFS_DEBUG is not set
CONFIG_DEVPTS_FS=y
CONFIG_QNX4FS_FS=m
CONFIG_QNX4FS_RW=y
CONFIG_ROMFS_FS=y
CONFIG_EXT2_FS=m
CONFIG_SYSV_FS=m
CONFIG_UDF_FS=m
CONFIG_UDF_RW=y
CONFIG_UFS_FS=m
CONFIG_UFS_FS_WRITE=y

#
# Network File Systems
#
CONFIG_CODA_FS=m
CONFIG_INTERMEZZO_FS=m
CONFIG_NFS_FS=m
CONFIG_NFS_V3=y
# CONFIG_ROOT_NFS is not set
CONFIG_NFSD=m
CONFIG_NFSD_V3=y
CONFIG_NFSD_TCP=y
CONFIG_SUNRPC=m
CONFIG_LOCKD=m
CONFIG_LOCKD_V4=y
CONFIG_SMB_FS=m
CONFIG_SMB_NLS_DEFAULT=y
CONFIG_SMB_NLS_REMOTE="cp437"
CONFIG_NCP_FS=m
CONFIG_NCPFS_PACKET_SIGNING=y
CONFIG_NCPFS_IOCTL_LOCKING=y
CONFIG_NCPFS_STRONG=y
CONFIG_NCPFS_NFS_NS=y
CONFIG_NCPFS_OS2_NS=y
CONFIG_NCPFS_SMALLDOS=y
CONFIG_NCPFS_NLS=y
CONFIG_NCPFS_EXTRAS=y
CONFIG_ZISOFS_FS=m

#
# Partition Types
#
CONFIG_PARTITION_ADVANCED=y
CONFIG_ACORN_PARTITION=y
CONFIG_ACORN_PARTITION_ICS=y
CONFIG_ACORN_PARTITION_ADFS=y
# CONFIG_ACORN_PARTITION_POWERTEC is not set
CONFIG_ACORN_PARTITION_RISCIX=y
CONFIG_OSF_PARTITION=y
CONFIG_AMIGA_PARTITION=y
CONFIG_ATARI_PARTITION=y
CONFIG_MAC_PARTITION=y
CONFIG_MSDOS_PARTITION=y
CONFIG_BSD_DISKLABEL=y
CONFIG_MINIX_SUBPARTITION=y
CONFIG_SOLARIS_X86_PARTITION=y
CONFIG_UNIXWARE_DISKLABEL=y
CONFIG_LDM_PARTITION=y
# CONFIG_LDM_DEBUG is not set
CONFIG_SGI_PARTITION=y
CONFIG_ULTRIX_PARTITION=y
CONFIG_SUN_PARTITION=y
CONFIG_EFI_PARTITION=y
CONFIG_SMB_NLS=y
CONFIG_NLS=y

#
# Native Language Support
#
CONFIG_NLS_DEFAULT="iso8859-1"
CONFIG_NLS_CODEPAGE_437=m
CONFIG_NLS_CODEPAGE_737=m
CONFIG_NLS_CODEPAGE_775=m
CONFIG_NLS_CODEPAGE_850=m
CONFIG_NLS_CODEPAGE_852=m
CONFIG_NLS_CODEPAGE_855=m
CONFIG_NLS_CODEPAGE_857=m
CONFIG_NLS_CODEPAGE_860=m
CONFIG_NLS_CODEPAGE_861=m
CONFIG_NLS_CODEPAGE_862=m
CONFIG_NLS_CODEPAGE_863=m
CONFIG_NLS_CODEPAGE_864=m
CONFIG_NLS_CODEPAGE_865=m
CONFIG_NLS_CODEPAGE_866=m
CONFIG_NLS_CODEPAGE_869=m
CONFIG_NLS_CODEPAGE_936=m
CONFIG_NLS_CODEPAGE_950=m
CONFIG_NLS_CODEPAGE_932=m
CONFIG_NLS_CODEPAGE_949=m
CONFIG_NLS_CODEPAGE_874=m
CONFIG_NLS_ISO8859_8=m
CONFIG_NLS_CODEPAGE_1250=m
CONFIG_NLS_CODEPAGE_1251=m
CONFIG_NLS_ISO8859_1=m
CONFIG_NLS_ISO8859_2=m
CONFIG_NLS_ISO8859_3=m
CONFIG_NLS_ISO8859_4=m
CONFIG_NLS_ISO8859_5=m
CONFIG_NLS_ISO8859_6=m
CONFIG_NLS_ISO8859_7=m
CONFIG_NLS_ISO8859_9=m
CONFIG_NLS_ISO8859_13=m
CONFIG_NLS_ISO8859_14=m
CONFIG_NLS_ISO8859_15=m
CONFIG_NLS_KOI8_R=m
CONFIG_NLS_KOI8_U=m
CONFIG_NLS_UTF8=m

#
# Console drivers
#
CONFIG_VGA_CONSOLE=y
CONFIG_VIDEO_SELECT=y
CONFIG_MDA_CONSOLE=m

#
# Frame-buffer support
#
CONFIG_FB=y
CONFIG_DUMMY_CONSOLE=y
CONFIG_FB_RIVA=m
CONFIG_FB_CLGEN=m
CONFIG_FB_PM2=m
# CONFIG_FB_PM2_FIFO_DISCONNECT is not set
# CONFIG_FB_PM2_PCI is not set
CONFIG_FB_PM3=m
CONFIG_FB_CYBER2000=m
CONFIG_FB_VESA=y
CONFIG_FB_VGA16=m
CONFIG_FB_HGA=m
CONFIG_VIDEO_SELECT=y
CONFIG_FB_MATROX=m
CONFIG_FB_MATROX_MILLENIUM=y
CONFIG_FB_MATROX_MYSTIQUE=y
CONFIG_FB_MATROX_G450=m
CONFIG_FB_MATROX_I2C=m
CONFIG_FB_MATROX_MAVEN=m
# CONFIG_FB_MATROX_PROC is not set
CONFIG_FB_MATROX_MULTIHEAD=y
CONFIG_FB_ATY=m
CONFIG_FB_ATY_GX=y
CONFIG_FB_ATY_CT=y
CONFIG_FB_RADEON=m
CONFIG_FB_ATY128=m
# CONFIG_FB_INTEL is not set
CONFIG_FB_SIS=m
CONFIG_FB_SIS_300=y
CONFIG_FB_SIS_315=y
CONFIG_FB_NEOMAGIC=m
CONFIG_FB_3DFX=m
CONFIG_FB_VOODOO1=m
CONFIG_FB_TRIDENT=m
CONFIG_FB_VIRTUAL=m
CONFIG_FBCON_ADVANCED=y
CONFIG_FBCON_MFB=m
CONFIG_FBCON_CFB2=m
CONFIG_FBCON_CFB4=m
CONFIG_FBCON_CFB8=y
CONFIG_FBCON_CFB16=y
CONFIG_FBCON_CFB24=y
CONFIG_FBCON_CFB32=m
# CONFIG_FBCON_AFB is not set
# CONFIG_FBCON_ILBM is not set
# CONFIG_FBCON_IPLAN2P2 is not set
# CONFIG_FBCON_IPLAN2P4 is not set
# CONFIG_FBCON_IPLAN2P8 is not set
# CONFIG_FBCON_MAC is not set
CONFIG_FBCON_VGA_PLANES=m
CONFIG_FBCON_VGA=m
CONFIG_FBCON_HGA=m
# CONFIG_FBCON_FONTWIDTH8_ONLY is not set
CONFIG_FBCON_FONTS=y
CONFIG_FONT_8x8=y
CONFIG_FONT_8x16=y
# CONFIG_FONT_SUN8x16 is not set
# CONFIG_FONT_SUN12x22 is not set
# CONFIG_FONT_6x11 is not set
# CONFIG_FONT_PEARL_8x8 is not set
# CONFIG_FONT_ACORN_8x8 is not set

#
# Sound
#
CONFIG_SOUND=m
CONFIG_SOUND_ALI5455=m
CONFIG_SOUND_BT878=m
CONFIG_SOUND_CMPCI=m
CONFIG_SOUND_CMPCI_FM=y
CONFIG_SOUND_CMPCI_FMIO=388
CONFIG_SOUND_CMPCI_FMIO=388
CONFIG_SOUND_CMPCI_MIDI=y
CONFIG_SOUND_CMPCI_MPUIO=330
CONFIG_SOUND_CMPCI_JOYSTICK=y
CONFIG_SOUND_CMPCI_CM8738=y
CONFIG_SOUND_CMPCI_SPDIFINVERSE=y
# CONFIG_SOUND_CMPCI_SPDIFLOOP is not set
CONFIG_SOUND_CMPCI_SPEAKERS=2
CONFIG_SOUND_EMU10K1=m
CONFIG_MIDI_EMU10K1=y
CONFIG_SOUND_FUSION=m
CONFIG_SOUND_CS4281=m
CONFIG_SOUND_ES1370=m
CONFIG_SOUND_ES1371=m
CONFIG_SOUND_ESSSOLO1=m
CONFIG_SOUND_MAESTRO=m
CONFIG_SOUND_MAESTRO3=m
CONFIG_SOUND_FORTE=m
CONFIG_SOUND_ICH=m
CONFIG_SOUND_RME96XX=m
CONFIG_SOUND_SONICVIBES=m
CONFIG_SOUND_TRIDENT=m
CONFIG_SOUND_MSNDCLAS=m
# CONFIG_MSNDCLAS_HAVE_BOOT is not set
CONFIG_MSNDCLAS_INIT_FILE="/etc/sound/msndinit.bin"
CONFIG_MSNDCLAS_PERM_FILE="/etc/sound/msndperm.bin"
CONFIG_SOUND_MSNDPIN=m
# CONFIG_MSNDPIN_HAVE_BOOT is not set
CONFIG_MSNDPIN_INIT_FILE="/etc/sound/pndspini.bin"
CONFIG_MSNDPIN_PERM_FILE="/etc/sound/pndsperm.bin"
CONFIG_SOUND_VIA82CXXX=m
CONFIG_MIDI_VIA82CXXX=y
CONFIG_SOUND_OSS=m
# CONFIG_SOUND_TRACEINIT is not set
# CONFIG_SOUND_DMAP is not set
CONFIG_SOUND_AD1816=m
# CONFIG_SOUND_AD1889 is not set
CONFIG_SOUND_SGALAXY=m
CONFIG_SOUND_ADLIB=m
CONFIG_SOUND_ACI_MIXER=m
CONFIG_SOUND_CS4232=m
CONFIG_SOUND_SSCAPE=m
CONFIG_SOUND_GUS=m
# CONFIG_SOUND_GUS16 is not set
# CONFIG_SOUND_GUSMAX is not set
CONFIG_SOUND_VMIDI=m
CONFIG_SOUND_TRIX=m
CONFIG_SOUND_MSS=m
CONFIG_SOUND_MPU401=m
CONFIG_SOUND_NM256=m
CONFIG_SOUND_MAD16=m
CONFIG_MAD16_OLDCARD=y
CONFIG_SOUND_PAS=m
# CONFIG_PAS_JOYSTICK is not set
CONFIG_SOUND_PSS=m
# CONFIG_PSS_MIXER is not set
# CONFIG_PSS_HAVE_BOOT is not set
CONFIG_SOUND_SB=m
CONFIG_SOUND_AWE32_SYNTH=m
# CONFIG_SOUND_KAHLUA is not set
CONFIG_SOUND_WAVEFRONT=m
CONFIG_SOUND_MAUI=m
CONFIG_SOUND_YM3812=m
CONFIG_SOUND_OPL3SA1=m
CONFIG_SOUND_OPL3SA2=m
CONFIG_SOUND_YMFPCI=m
# CONFIG_SOUND_YMFPCI_LEGACY is not set
CONFIG_SOUND_UART6850=m
CONFIG_SOUND_AEDSP16=m
CONFIG_SC6600=y
# CONFIG_SC6600_JOY is not set
CONFIG_SC6600_CDROM=4
CONFIG_SC6600_CDROMBASE=0
CONFIG_AEDSP16_SBPRO=y
CONFIG_AEDSP16_MPU401=y
CONFIG_SOUND_TVMIXER=m

#
# USB support
#
CONFIG_USB=m
CONFIG_USB_DEBUG=y
CONFIG_USB_DEVICEFS=y
CONFIG_USB_BANDWIDTH=y
CONFIG_USB_EHCI_HCD=m
CONFIG_USB_UHCI=m
CONFIG_USB_UHCI_ALT=m
CONFIG_USB_OHCI=m
CONFIG_USB_AUDIO=m
CONFIG_USB_EMI26=m
CONFIG_USB_MIDI=m
CONFIG_USB_STORAGE=m
CONFIG_USB_STORAGE_DEBUG=y
CONFIG_USB_STORAGE_DATAFAB=y
CONFIG_USB_STORAGE_FREECOM=y
CONFIG_USB_STORAGE_ISD200=y
CONFIG_USB_STORAGE_DPCM=y
CONFIG_USB_STORAGE_HP8200e=y
CONFIG_USB_STORAGE_SDDR09=y
CONFIG_USB_STORAGE_SDDR55=y
CONFIG_USB_STORAGE_JUMPSHOT=y
CONFIG_USB_ACM=m
CONFIG_USB_PRINTER=m
CONFIG_USB_HID=m
CONFIG_USB_HIDINPUT=y
CONFIG_USB_HIDDEV=y
CONFIG_USB_KBD=m
CONFIG_USB_MOUSE=m
CONFIG_USB_AIPTEK=m
CONFIG_USB_WACOM=m
# CONFIG_USB_KBTAB is not set
# CONFIG_USB_POWERMATE is not set
CONFIG_USB_DC2XX=m
CONFIG_USB_MDC800=m
CONFIG_USB_SCANNER=m
CONFIG_USB_MICROTEK=m
CONFIG_USB_HPUSBSCSI=m
CONFIG_USB_IBMCAM=m
# CONFIG_USB_KONICAWC is not set
CONFIG_USB_OV511=m
CONFIG_USB_PWC=m
CONFIG_USB_SE401=m
CONFIG_USB_STV680=m
CONFIG_USB_VICAM=m
CONFIG_USB_DSBR=m
CONFIG_USB_DABUSB=m
CONFIG_USB_PEGASUS=m
CONFIG_USB_RTL8150=m
CONFIG_USB_KAWETH=m
CONFIG_USB_CATC=m
CONFIG_USB_CDCETHER=m
CONFIG_USB_USBNET=m
CONFIG_USB_USS720=m

#
# USB Serial Converter support
#
CONFIG_USB_SERIAL=m
# CONFIG_USB_SERIAL_DEBUG is not set
CONFIG_USB_SERIAL_GENERIC=y
CONFIG_USB_SERIAL_BELKIN=m
CONFIG_USB_SERIAL_WHITEHEAT=m
CONFIG_USB_SERIAL_DIGI_ACCELEPORT=m
CONFIG_USB_SERIAL_EMPEG=m
CONFIG_USB_SERIAL_FTDI_SIO=m
CONFIG_USB_SERIAL_VISOR=m
CONFIG_USB_SERIAL_IPAQ=m
CONFIG_USB_SERIAL_IR=m
CONFIG_USB_SERIAL_EDGEPORT=m
CONFIG_USB_SERIAL_EDGEPORT_TI=m
CONFIG_USB_SERIAL_KEYSPAN_PDA=m
CONFIG_USB_SERIAL_KEYSPAN=m
CONFIG_USB_SERIAL_KEYSPAN_USA28=y
CONFIG_USB_SERIAL_KEYSPAN_USA28X=y
CONFIG_USB_SERIAL_KEYSPAN_USA28XA=y
CONFIG_USB_SERIAL_KEYSPAN_USA28XB=y
CONFIG_USB_SERIAL_KEYSPAN_USA19=y
CONFIG_USB_SERIAL_KEYSPAN_USA18X=y
CONFIG_USB_SERIAL_KEYSPAN_USA19W=y
CONFIG_USB_SERIAL_KEYSPAN_USA19QW=y
CONFIG_USB_SERIAL_KEYSPAN_USA19QI=y
# CONFIG_USB_SERIAL_KEYSPAN_MPR is not set
CONFIG_USB_SERIAL_KEYSPAN_USA49W=y
# CONFIG_USB_SERIAL_KEYSPAN_USA49WLC is not set
CONFIG_USB_SERIAL_MCT_U232=m
CONFIG_USB_SERIAL_KLSI=m
# CONFIG_USB_SERIAL_KOBIL_SCT is not set
CONFIG_USB_SERIAL_PL2303=m
CONFIG_USB_SERIAL_CYBERJACK=m
CONFIG_USB_SERIAL_XIRCOM=m
CONFIG_USB_SERIAL_OMNINET=m
CONFIG_USB_RIO500=m
CONFIG_USB_AUERSWALD=m
CONFIG_USB_TIGL=m
CONFIG_USB_BRLVGER=m
CONFIG_USB_LCD=m

#
# Bluetooth support
#
CONFIG_BLUEZ=m
CONFIG_BLUEZ_L2CAP=m
CONFIG_BLUEZ_SCO=m
# CONFIG_BLUEZ_RFCOMM is not set
CONFIG_BLUEZ_BNEP=m
# CONFIG_BLUEZ_BNEP_MC_FILTER is not set
# CONFIG_BLUEZ_BNEP_PROTO_FILTER is not set

#
# Bluetooth device drivers
#
CONFIG_BLUEZ_HCIUSB=m
# CONFIG_BLUEZ_USB_SCO is not set
CONFIG_BLUEZ_USB_ZERO_PACKET=y
CONFIG_BLUEZ_HCIUART=m
CONFIG_BLUEZ_HCIUART_H4=y
# CONFIG_BLUEZ_HCIUART_BCSP is not set
# CONFIG_BLUEZ_HCIUART_BCSP_TXCRC is not set
CONFIG_BLUEZ_HCIDTL1=m
CONFIG_BLUEZ_HCIBT3C=m
CONFIG_BLUEZ_HCIBLUECARD=m
# CONFIG_BLUEZ_HCIBTUART is not set
CONFIG_BLUEZ_HCIVHCI=m

#
# Kernel hacking
#
CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_STACKOVERFLOW=y
# CONFIG_DEBUG_HIGHMEM is not set
# CONFIG_DEBUG_SLAB is not set
# CONFIG_DEBUG_IOVIRT is not set
CONFIG_MAGIC_SYSRQ=y
# CONFIG_DEBUG_SPINLOCK is not set
# CONFIG_FRAME_POINTER is not set

#
# Library routines
#
CONFIG_ZLIB_INFLATE=m
CONFIG_ZLIB_DEFLATE=m

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: -rc7   Re: Linux 2.4.21-rc6
  2003-05-29 19:56     ` Krzysiek Taraszka
@ 2003-05-29 20:18       ` Krzysiek Taraszka
  2003-06-04 18:17         ` Marcelo Tosatti
  0 siblings, 1 reply; 109+ messages in thread
From: Krzysiek Taraszka @ 2003-05-29 20:18 UTC (permalink / raw)
  To: Marcelo Tosatti, Georg Nikodym; +Cc: lkml

On Thursday, 29 May 2003 21:56, Krzysiek Taraszka wrote:
> On Thursday, 29 May 2003 21:11, Marcelo Tosatti wrote:
> > On Thu, 29 May 2003, Georg Nikodym wrote:
> > > On Wed, 28 May 2003 21:55:39 -0300 (BRT)
> > >
> > > Marcelo Tosatti <marcelo@conectiva.com.br> wrote:
> > > > Here goes -rc6. I've decided to delay 2.4.21 a bit and try Andrew's
> > > > fix for the IO stalls/deadlocks.
> > >
> > > While others may be dubious about the efficacy of this patch, I've been
> > > running -rc6 on my laptop now since sometime last night and have seen
> > > nothing odd.
> > >
> > > In case anybody cares, I'm using both ide and a ieee1394 (for a large
> > > external drive [which implies scsi]) and I do a _lot_ of big work with
> > > BK so I was seeing the problem within hours previously.
> >
> > Great!
> >
> > -rc7 will have to be released due to some problems :(
>
> hmm, seems to ide modules and others are broken. Im looking for reason why

hmm, for the IDE subsystem, ide-proc.o wasn't built for CONFIG_BLK_DEV_IDE=m ...
is anyone going to fix it, or shall I prepare and send my own patch here?

-- 
Krzysiek Taraszka			(dzimi@pld.org.pl)
http://cyborg.kernel.pl/~dzimi/

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: Linux 2.4.21-rc6
  2003-05-29  5:57       ` Marc Wilson
  2003-05-29  7:15         ` Riley Williams
  2003-05-29  8:38         ` Willy Tarreau
@ 2003-06-03 16:02         ` Marcelo Tosatti
  2003-06-03 16:13           ` Marc-Christian Petersen
                             ` (2 more replies)
  2 siblings, 3 replies; 109+ messages in thread
From: Marcelo Tosatti @ 2003-06-03 16:02 UTC (permalink / raw)
  To: Marc Wilson; +Cc: lkml



On Wed, 28 May 2003, Marc Wilson wrote:

> On Thu, May 29, 2003 at 06:34:48AM +0100, Riley Williams wrote:
> > The basic problem there is that any mail client needs to know
> > just how many messages are in a particular folder to handle that
> > folder, and the only way to do this is to count them all. That's
> > what takes the time when one opens a large folder.
>
> No, the basic problem there is that the kernel is deadlocking.  Read the
> VERY long thread for the details.
>
> I think I have enough on the ball to be able to tell the difference between
> mutt opening a folder and counting messages, with a counter and percentage
> indicator advancing, and mutt sitting there deadlocked with the HD activity
> light stuck on and all the rest of X stuck tight.
>
> And it just happened again, so -rc6 is no sure fix.  What did y'all that
> reported the problem had gone away do, patch -rc4 with the akpm patches?
> ^_^

Ok, so you can reproduce the hangs reliably EVEN with -rc6, Marc?

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: Linux 2.4.21-rc6
  2003-06-03 16:02         ` Marcelo Tosatti
@ 2003-06-03 16:13           ` Marc-Christian Petersen
  2003-06-04 21:54             ` Pavel Machek
  2003-06-03 16:30           ` Michael Frank
  2003-06-04  4:04           ` Marc Wilson
  2 siblings, 1 reply; 109+ messages in thread
From: Marc-Christian Petersen @ 2003-06-03 16:13 UTC (permalink / raw)
  To: Marcelo Tosatti, Marc Wilson; +Cc: lkml

On Tuesday 03 June 2003 18:02, Marcelo Tosatti wrote:

Hi Marcelo,

> Ok, so you can reproduce the hangs reliably EVEN with -rc6, Marc?
well, even if you mean Marc Wilson, I also have something to say (as I 
wrote in my previous email some days ago).

The pauses/stops are _a lot_ less than w/o the fix but they are _not_ gone. 
Tested with 2.4.21-rc6.

ciao, Marc


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: Linux 2.4.21-rc6
  2003-06-03 16:02         ` Marcelo Tosatti
  2003-06-03 16:13           ` Marc-Christian Petersen
@ 2003-06-03 16:30           ` Michael Frank
  2003-06-03 16:53             ` Matthias Mueller
                               ` (2 more replies)
  2003-06-04  4:04           ` Marc Wilson
  2 siblings, 3 replies; 109+ messages in thread
From: Michael Frank @ 2003-06-03 16:30 UTC (permalink / raw)
  To: Marcelo Tosatti, Marc Wilson; +Cc: lkml

On Wednesday 04 June 2003 00:02, Marcelo Tosatti wrote:
> On Wed, 28 May 2003, Marc Wilson wrote:
> > On Thu, May 29, 2003 at 06:34:48AM +0100, Riley Williams wrote:
> > > The basic problem there is that any mail client needs to know
> > > just how many messages are in a particular folder to handle that
> > > folder, and the only way to do this is to count them all. That's
> > > what takes the time when one opens a large folder.
> >
> > No, the basic problem there is that the kernel is deadlocking.  Read the
> > VERY long thread for the details.
> >
> > I think I have enough on the ball to be able to tell the difference
> > between mutt opening a folder and counting messages, with a counter and
> > percentage indicator advancing, and mutt sitting there deadlocked with
> > the HD activity light stuck on and all the rest of X stuck tight.
> >
> > And it just happened again, so -rc6 is no sure fix.  What did y'all that
> > reported the problem had gone away do, patch -rc4 with the akpm patches?
> > ^_^
>
> Ok, so you can reproduce the hangs reliably EVEN with -rc6, Marc?

-rc6 is better - comparable to 2.4.18 in what I have seen with my script.  

After the long obscure problems since 2.4.19x, -rc6 could use serious 
stress-testing. 

User level testing is not sufficient here - it's just like playing roulette.

By serious stress-testing I mean:

Everyone testing comes up with one dedicated "tough test" 
which _must_ be reproducible (program, script) along his line of 
expertise/application.

Two or more of these independent tests are run in combination.

This method should increase the coverage drastically.
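
A minimal sketch of the "run in combination" part (illustrative only; the
two script names below are placeholders, not tests anyone has proposed
here) would be a small driver that starts both tests at once and waits
for them:

/*
 * Sketch only: run two independent "tough tests" in combination and
 * wait for both.  The commands are placeholders; put your own
 * reproducible tests here.
 */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static const char *tests[] = {
	"./io-tough-test.sh",	/* placeholder: your reproducible test #1 */
	"./vm-tough-test.sh",	/* placeholder: your reproducible test #2 */
};

#define NTESTS (sizeof(tests) / sizeof(tests[0]))

int main(void)
{
	pid_t pids[NTESTS];
	unsigned int i;

	for (i = 0; i < NTESTS; i++) {
		pids[i] = fork();
		if (pids[i] == 0) {
			/* child: hand the test command to a shell */
			execl("/bin/sh", "sh", "-c", tests[i], (char *) NULL);
			_exit(127);
		}
	}
	for (i = 0; i < NTESTS; i++) {
		int status;

		waitpid(pids[i], &status, 0);
		printf("test %u exited with status %d\n", i, WEXITSTATUS(status));
	}
	return 0;
}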

Regards
Michael



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: Linux 2.4.21-rc6
  2003-06-03 16:30           ` Michael Frank
@ 2003-06-03 16:53             ` Matthias Mueller
  2003-06-03 16:59             ` Marc-Christian Petersen
  2003-06-04 14:56             ` Jakob Oestergaard
  2 siblings, 0 replies; 109+ messages in thread
From: Matthias Mueller @ 2003-06-03 16:53 UTC (permalink / raw)
  To: Michael Frank; +Cc: Marcelo Tosatti, Marc Wilson, lkml

On Wed, Jun 04, 2003 at 12:30:27AM +0800, Michael Frank wrote:
> On Wednesday 04 June 2003 00:02, Marcelo Tosatti wrote:
> -rc6 is better - comparable to 2.4.18 in what I have seen with my script.  
> 
> After the long obscure problems since 2.4.19x, -rc6 could use serious 
> stress-testing. 
> 
> User level testing is not sufficient here - it's just like playing roulette.
> 
> By serious stress-testing I mean:
> 
> Everone testing comes up with  one dedicated "tough test" 
> which _must_ be reproducible (program, script) along his line of 
> expertise/application.
> 
> Two or more of these independent tests are run in combination.

Agreed, and I'm willing to run test scripts on my system, which has these
hangs (long ones with 2.4.19-pre1 to 2.4.21-rc5 and only short ones with
2.4.21-rc6). But at the moment I have neither the time nor enough knowledge
to write a test to reproduce it.

So if someone comes up with a suitable test script, I'm happy to try it
and use it on different kernel versions.

Bye,
Matthias
-- 
Matthias.Mueller@rz.uni-karlsruhe.de
Rechenzentrum Universitaet Karlsruhe
Abteilung Netze

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: Linux 2.4.21-rc6
  2003-06-03 16:30           ` Michael Frank
  2003-06-03 16:53             ` Matthias Mueller
@ 2003-06-03 16:59             ` Marc-Christian Petersen
  2003-06-03 17:03               ` Marc-Christian Petersen
  2003-06-03 17:23               ` Michael Frank
  2003-06-04 14:56             ` Jakob Oestergaard
  2 siblings, 2 replies; 109+ messages in thread
From: Marc-Christian Petersen @ 2003-06-03 16:59 UTC (permalink / raw)
  To: Michael Frank, Marcelo Tosatti, Marc Wilson; +Cc: lkml

On Tuesday 03 June 2003 18:30, Michael Frank wrote:

Hi Michael,

> > Ok, so you can reproduce the hangs reliably EVEN with -rc6, Marc?
> -rc6 is better - comparable to 2.4.18 in what I have seen with my script.
> After the long obscure problems since 2.4.19x, -rc6 could use serious
> stress-testing.
> User level testing is not sufficient here - it's just like playing
> roulette.
> By serious stress-testing I mean:
> Everone testing comes up with  one dedicated "tough test"
> which _must_ be reproducible (program, script) along his line of
> expertise/application.

well, very easy one:

dd if=/dev/zero of=/home/largefile bs=16384 count=131072

then use your mouse, your apps, switch between them, use them, _w/o_ pauses, 
delays, stops or anything like that. If _that_ works flawlessly for everyone, 
then it is fixed; if not, it _needs_ to be fixed.
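
For those who want numbers instead of a feeling, a rough probe like the
sketch below (untested, just to illustrate; /tmp/stallprobe and the
one-second threshold are arbitrary choices) can run next to the dd: once a
second it appends a small block to a scratch file, fsync()s it, and reports
any iteration that takes longer than a second.

/*
 * Rough stall probe (sketch only): once a second, append a small block
 * to a scratch file and fsync() it; report any iteration that takes
 * longer than a second.  Run it while the dd above is writing.
 */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/time.h>

int main(void)
{
	char buf[4096];
	int fd = open("/tmp/stallprobe", O_WRONLY | O_CREAT | O_TRUNC, 0600);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	memset(buf, 0, sizeof(buf));

	for (;;) {
		struct timeval t0, t1;
		double ms;

		gettimeofday(&t0, NULL);
		write(fd, buf, sizeof(buf));
		fsync(fd);
		gettimeofday(&t1, NULL);

		ms = (t1.tv_sec - t0.tv_sec) * 1000.0 +
		     (t1.tv_usec - t0.tv_usec) / 1000.0;
		if (ms > 1000.0)
			printf("stall: write+fsync took %.0f ms\n", ms);

		sleep(1);
	}
}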

ciao, Marc


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: Linux 2.4.21-rc6
  2003-06-03 16:59             ` Marc-Christian Petersen
@ 2003-06-03 17:03               ` Marc-Christian Petersen
  2003-06-03 18:02                 ` Anders Karlsson
  2003-06-03 17:23               ` Michael Frank
  1 sibling, 1 reply; 109+ messages in thread
From: Marc-Christian Petersen @ 2003-06-03 17:03 UTC (permalink / raw)
  To: Michael Frank, Marcelo Tosatti, Marc Wilson; +Cc: lkml

On Tuesday 03 June 2003 18:59, Marc-Christian Petersen wrote:

Hi again,

> well, very easy one:
> dd if=/dev/zero of=/home/largefile bs=16384 count=131072
> then use your mouse, your apps, switch between them, use them, _w/o_
> pauses, delay, stops or kinda that. If _that_ will work flawlessly for
> everyone, then it is fixed, if not, it _needs_ to be fixed.
I forgot to mention: if you have more than 2GB of free memory (the above 
command will create a 2GB file), the test is useless.

Have less memory free, so the machine will swap; it doesn't matter whether 
it swaps to the same disk, another one, or whatever!

ciao, Marc


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: Linux 2.4.21-rc6
  2003-06-03 16:59             ` Marc-Christian Petersen
  2003-06-03 17:03               ` Marc-Christian Petersen
@ 2003-06-03 17:23               ` Michael Frank
  1 sibling, 0 replies; 109+ messages in thread
From: Michael Frank @ 2003-06-03 17:23 UTC (permalink / raw)
  To: Marc-Christian Petersen, Marcelo Tosatti, Marc Wilson; +Cc: lkml

On Wednesday 04 June 2003 00:59, Marc-Christian Petersen wrote:
> On Tuesday 03 June 2003 18:30, Michael Frank wrote:
> well, very easy one:
>
> dd if=/dev/zero of=/home/largefile bs=16384 count=131072

Got that already - more flexible:

http://www.ussg.iu.edu/hypermail/linux/kernel/0305.3/1291.html

Breaks anything >= 2.4.19 < rc6 in no time.

We need more - any ideas?

Regards
Michael


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: Linux 2.4.21-rc6
  2003-06-03 17:03               ` Marc-Christian Petersen
@ 2003-06-03 18:02                 ` Anders Karlsson
  2003-06-03 21:12                   ` J.A. Magallon
  0 siblings, 1 reply; 109+ messages in thread
From: Anders Karlsson @ 2003-06-03 18:02 UTC (permalink / raw)
  To: Marc-Christian Petersen; +Cc: Michael Frank, Marcelo Tosatti, Marc Wilson, LKML

[-- Attachment #1: Type: text/plain, Size: 1276 bytes --]

Good Evening,

On Tue, 2003-06-03 at 18:03, Marc-Christian Petersen wrote:
> On Tuesday 03 June 2003 18:59, Marc-Christian Petersen wrote:
> 
> Hi again,
> 
> > well, very easy one:
> > dd if=/dev/zero of=/home/largefile bs=16384 count=131072
> > then use your mouse, your apps, switch between them, use them, _w/o_
> > pauses, delay, stops or kinda that. If _that_ will work flawlessly for
> > everyone, then it is fixed, if not, it _needs_ to be fixed.
> I forgot to mention. If you have more than 2GB free memory (the above one will 
> create a 2GB file), the test is useless.
> 
> Have less memory free, so the machine will swap, doesn't matter if the same 
> disk or another or whatever!

Would it count if I said I run 2.4.21-rc6-ac1 with 768MB RAM, ended up
using about 250MB of swap, and when I suspended VMware today and closed a
few gnome-terminals, Galeon and Evolution, the mouse cursor would not
move, then jump halfway across the screen after a second, then 'stick'
again before doing another jump?

I thought it sounded a little like what you are describing. If more
details are required, let me know and I will try and collect what is
asked for.

Regards,

-- 
Anders Karlsson <anders@trudheim.com>
Trudheim Technology Limited

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Config issue (CONFIG_X86_TSC) Re: Linux 2.4.21-rc6
  2003-05-29  0:55 Linux 2.4.21-rc6 Marcelo Tosatti
                   ` (2 preceding siblings ...)
  2003-05-29 18:00 ` Georg Nikodym
@ 2003-06-03 19:45 ` Paul
  2003-06-03 20:18   ` Jan-Benedict Glaw
  3 siblings, 1 reply; 109+ messages in thread
From: Paul @ 2003-06-03 19:45 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: lkml

Marcelo Tosatti <marcelo@conectiva.com.br>, on Wed May 28, 2003 [09:55:39 PM] said:
> 
> Hi,
> 
> Here goes -rc6. I've decided to delay 2.4.21 a bit and try Andrew's fix
> for the IO stalls/deadlocks.
> 
> Please test it.
> 
	Hi;

	It seems that if I run 'make menuconfig' and the only change
I make is to change the processor type from its default to
486, "CONFIG_X86_TSC=y" remains in the .config, which results
in a kernel that won't boot on a 486.
	Running 'make oldconfig' seems to fix it up, though...

Paul
set@pobox.com

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: Config issue (CONFIG_X86_TSC) Re: Linux 2.4.21-rc6
  2003-06-03 19:45 ` Config issue (CONFIG_X86_TSC) " Paul
@ 2003-06-03 20:18   ` Jan-Benedict Glaw
  0 siblings, 0 replies; 109+ messages in thread
From: Jan-Benedict Glaw @ 2003-06-03 20:18 UTC (permalink / raw)
  To: lkml; +Cc: Paul, Marcelo Tosatti

[-- Attachment #1: Type: text/plain, Size: 1136 bytes --]

On Tue, 2003-06-03 15:45:37 -0400, Paul <set@pobox.com>
wrote in message <20030603194537.GO22874@squish.home.loc>:
> Marcelo Tosatti <marcelo@conectiva.com.br>, on Wed May 28, 2003 [09:55:39 PM] said:
> > Here goes -rc6. I've decided to delay 2.4.21 a bit and try Andrew's fix
> > for the IO stalls/deadlocks.
> > 
> > Please test it.
> > 
> 	Hi;
> 
> 	It seems if I run 'make menuconfig', and the only change
> I make is to change the processor type from its default to
> 486, "CONFIG_X86_TSC=y", remains in the .config, which results
> in a kernel that wont boot on a 486.
> 	Running 'make oldconfig' seems to fix it up, though...

Yeah, that's a bug hitting i80386 also; I had sent a patch for that to LKML
some time ago. There's simply a CONFIG_X86_TSC=n missing in the case of
i386 and i486.

MfG, JBG

-- 
   Jan-Benedict Glaw       jbglaw@lug-owl.de    . +49-172-7608481
   "Eine Freie Meinung in  einem Freien Kopf    | Gegen Zensur | Gegen Krieg
    fuer einen Freien Staat voll Freier Bürger" | im Internet! |   im Irak!
      ret = do_actions((curr | FREE_SPEECH) & ~(IRAQ_WAR_2 | DRM | TCPA));

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: Linux 2.4.21-rc6
  2003-06-03 18:02                 ` Anders Karlsson
@ 2003-06-03 21:12                   ` J.A. Magallon
  2003-06-03 21:18                     ` Marc-Christian Petersen
  0 siblings, 1 reply; 109+ messages in thread
From: J.A. Magallon @ 2003-06-03 21:12 UTC (permalink / raw)
  To: Anders Karlsson
  Cc: Marc-Christian Petersen, Michael Frank, Marcelo Tosatti,
	Marc Wilson, LKML


On 06.03, Anders Karlsson wrote:
> Good Evening,
> 
> On Tue, 2003-06-03 at 18:03, Marc-Christian Petersen wrote:
> > On Tuesday 03 June 2003 18:59, Marc-Christian Petersen wrote:
> > 
> > Hi again,
> > 
> > > well, very easy one:
> > > dd if=/dev/zero of=/home/largefile bs=16384 count=131072
> > > then use your mouse, your apps, switch between them, use them, _w/o_
> > > pauses, delay, stops or kinda that. If _that_ will work flawlessly for
> > > everyone, then it is fixed, if not, it _needs_ to be fixed.
> > I forgot to mention. If you have more than 2GB free memory (the above one will 
> > create a 2GB file), the test is useless.
> > 
> > Have less memory free, so the machine will swap, doesn't matter if the same 
> > disk or another or whatever!
> 
> Would it count if I said I run 2.4.21-rc6-ac1 and had 768MB RAM, ended
> up using about 250MB swap and when I today suspended VMware and closed a
> few gnome-terminals, Galeon and Evolution, the mouse cursor would not
> move, then jump half way across the screen after a second, then 'stick'
> again before doing another jump.
> 

One vote in the opposite sense (I know, nobody uses plain rc6 ???)
I am using a -jam kernel (-aa with some additional patches), and I did
not notice anything. Dual PII box with 900 MB; as buffers were filling
memory, no stalls. Just a very small (less than half a second) jump in the
cursor under gnome when the memory got full, and then smooth again.
I use pointer-focus and was rapidly moving the pointer from window to
window to change focus and response was ok. Launching an aterm was instant.

-- 
J.A. Magallon <jamagallon@able.es>      \                 Software is like sex:
werewolf.able.es                         \           It's better when it's free
Mandrake Linux release 9.2 (Cooker) for i586
Linux 2.4.21-rc6-jam1 (gcc 3.2.3 (Mandrake Linux 9.2 3.2.3-1mdk))

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: Linux 2.4.21-rc6
  2003-06-03 21:12                   ` J.A. Magallon
@ 2003-06-03 21:18                     ` Marc-Christian Petersen
  0 siblings, 0 replies; 109+ messages in thread
From: Marc-Christian Petersen @ 2003-06-03 21:18 UTC (permalink / raw)
  To: J.A. Magallon, Anders Karlsson
  Cc: Michael Frank, Marcelo Tosatti, Marc Wilson, LKML

On Tuesday 03 June 2003 23:12, J.A. Magallon wrote:

Hi J.A.,

> One vote in the opposite sense (I know, nobody uses plain rc6 ???)
> I am using a -jam kernel (-aa with some additional patches), and I did
> not notice anything. Dual PII box with 900 Mb, as buffers were filling
> memory, no stalls. Just a very small (less than half a second) jump in the
> cursor under gnome when the memory got full, and then smooth again.
> I use pointer-focus and was rapidly moving the pointer from window to
> window to change focus and response was ok. Launching an aterm was instant.
once again for you ;-)

-aa is using the low-latency elevator! Pauses/stops are much less frequent with it.

ciao, Marc


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: Linux 2.4.21-rc6
  2003-06-03 16:02         ` Marcelo Tosatti
  2003-06-03 16:13           ` Marc-Christian Petersen
  2003-06-03 16:30           ` Michael Frank
@ 2003-06-04  4:04           ` Marc Wilson
  2 siblings, 0 replies; 109+ messages in thread
From: Marc Wilson @ 2003-06-04  4:04 UTC (permalink / raw)
  To: lkml; +Cc: Marcelo Tosatti

On Tue, Jun 03, 2003 at 01:02:45PM -0300, Marcelo Tosatti wrote:
> Ok, so you can reproduce the hangs reliably EVEN with -rc6, Marc?

Yes, with -rc6, and this:

rei $ dd if=/dev/zero of=/home/mwilson/largefile bs=16384 count=131072

The mouse starts skipping soon after the box starts swapping.  It
eventually catches up, but then when I start up another application, it
starts again.

I have the test running as I type this e-mail in mutt (with vim as the
editor), and there are noticeable pauses while I'm typing when nothing
happens on the screen.

It's *much* better than it was with my prior kernel (-rc2), but it's most
definitely still there.

Anyone got any other test they want me to make on the box?

-- 
 Marc Wilson |     You're a card which will have to be dealt with.
 msw@cox.net |

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: -rc7   Re: Linux 2.4.21-rc6
  2003-05-29 19:11   ` -rc7 " Marcelo Tosatti
  2003-05-29 19:56     ` Krzysiek Taraszka
@ 2003-06-04 10:22     ` Andrea Arcangeli
  2003-06-04 10:35       ` Marc-Christian Petersen
  1 sibling, 1 reply; 109+ messages in thread
From: Andrea Arcangeli @ 2003-06-04 10:22 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Georg Nikodym, lkml

On Thu, May 29, 2003 at 04:11:12PM -0300, Marcelo Tosatti wrote:
> 
> 
> On Thu, 29 May 2003, Georg Nikodym wrote:
> 
> > On Wed, 28 May 2003 21:55:39 -0300 (BRT)
> > Marcelo Tosatti <marcelo@conectiva.com.br> wrote:
> >
> > > Here goes -rc6. I've decided to delay 2.4.21 a bit and try Andrew's
> > > fix for the IO stalls/deadlocks.
> >
> > While others may be dubious about the efficacy of this patch, I've been
> > running -rc6 on my laptop now since sometime last night and have seen
> > nothing odd.
> >
> > In case anybody cares, I'm using both ide and a ieee1394 (for a large
> > external drive [which implies scsi]) and I do a _lot_ of big work with
> > BK so I was seeing the problem within hours previously.
> 
> Great!

are you really sure that it is the right fix?

I mean, the batching has a basic problem (I was discussing it with Jens
two days ago and he said he has already addressed it in 2.5; I wonder if
that could also have an influence on the fact that 2.5 is so much better
in fairness)

the issue with batching in 2.4 is that it blocks at 0 and wakes at
batch_requests, but it doesn't block new get_request callers from eating
requests on the way back from 0 to batch_requests. I mean, there are two
directions: when we move from batch_requests down to 0, get_request should
return requests; on the way back from 0 up to batch_requests, get_request
should block (and it doesn't in 2.4, that is the problem)
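
To make the two directions concrete, here is a rough user-space model
(illustrative only, this is not the kernel code; the real fix would do it
under the io_request_lock with the exclusive wait queue, and the value 32
for batch_requests is arbitrary here):

/*
 * Rough user-space model of the two directions (sketch only, not kernel
 * code).  Going down: once the free count hits 0 the queue is marked
 * full.  Going back up: allocators stay blocked until the count has
 * climbed back to batch_requests, instead of grabbing requests as soon
 * as a single one is freed.
 */
#include <stdio.h>

#define BATCH_REQUESTS 32

static int count = BATCH_REQUESTS;	/* free requests */
static int full;			/* set at 0, cleared at batch_requests */

static int can_get_request(void)
{
	if (count == 0)
		full = 1;		/* downward direction: we hit empty */
	return !full;			/* upward direction: blocked until batch */
}

static void release_request(void)
{
	count++;
	if (count >= BATCH_REQUESTS)
		full = 0;		/* only now let waiters allocate again */
}

int main(void)
{
	count = 1;
	printf("count=1, not full:   can get? %d\n", can_get_request()); /* 1 */
	count = 0;
	printf("count=0:             can get? %d\n", can_get_request()); /* 0 */
	release_request();
	printf("count=1, still full: can get? %d\n", can_get_request()); /* 0 */
	return 0;
}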

> 
> -rc7 will have to be released due to some problems :(
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


Andrea

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: -rc7   Re: Linux 2.4.21-rc6
  2003-06-04 10:22     ` Andrea Arcangeli
@ 2003-06-04 10:35       ` Marc-Christian Petersen
  2003-06-04 10:42         ` Jens Axboe
  2003-06-04 10:43         ` -rc7 Re: Linux 2.4.21-rc6 Andrea Arcangeli
  0 siblings, 2 replies; 109+ messages in thread
From: Marc-Christian Petersen @ 2003-06-04 10:35 UTC (permalink / raw)
  To: Andrea Arcangeli, Marcelo Tosatti; +Cc: Georg Nikodym, lkml

On Wednesday 04 June 2003 12:22, Andrea Arcangeli wrote:

Hi Andrea,

> are you really sure that it is the right fix?
> I mean, the batching has a basic problem (I was discussing it with Jens
> two days ago and he said he's already addressed in 2.5, I wonder if that
> could also have an influence on the fact 2.5 is so much better in
> fariness)
> the issue with batching in 2.4, is that it is blocking at 0 and waking
> at batch_requests. But it's not blocking new get_request to eat requests
> in the way back from 0 to batch_requests. I mean, there are two
> directions, when we move from batch_requests to 0 get_requests should
> return requests. in the way back from 0 to batch_requests the
> get_request should block (and it doesn't in 2.4, that is the problem)
do you see a chance to fix this up in 2.4?

ciao, Marc


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: -rc7   Re: Linux 2.4.21-rc6
  2003-06-04 10:35       ` Marc-Christian Petersen
@ 2003-06-04 10:42         ` Jens Axboe
  2003-06-04 10:46           ` Marc-Christian Petersen
  2003-06-04 10:43         ` -rc7 Re: Linux 2.4.21-rc6 Andrea Arcangeli
  1 sibling, 1 reply; 109+ messages in thread
From: Jens Axboe @ 2003-06-04 10:42 UTC (permalink / raw)
  To: Marc-Christian Petersen
  Cc: Andrea Arcangeli, Marcelo Tosatti, Georg Nikodym, lkml

On Wed, Jun 04 2003, Marc-Christian Petersen wrote:
> On Wednesday 04 June 2003 12:22, Andrea Arcangeli wrote:
> 
> Hi Andrea,
> 
> > are you really sure that it is the right fix?
> > I mean, the batching has a basic problem (I was discussing it with Jens
> > two days ago and he said he's already addressed in 2.5, I wonder if that
> > could also have an influence on the fact 2.5 is so much better in
> > fariness)
> > the issue with batching in 2.4, is that it is blocking at 0 and waking
> > at batch_requests. But it's not blocking new get_request to eat requests
> > in the way back from 0 to batch_requests. I mean, there are two
> > directions, when we move from batch_requests to 0 get_requests should
> > return requests. in the way back from 0 to batch_requests the
> > get_request should block (and it doesn't in 2.4, that is the problem)
> do you see a chance to fix this up in 2.4?

Nick posted a patch to do so the other day and asked people to test.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: -rc7   Re: Linux 2.4.21-rc6
  2003-06-04 10:35       ` Marc-Christian Petersen
  2003-06-04 10:42         ` Jens Axboe
@ 2003-06-04 10:43         ` Andrea Arcangeli
  2003-06-04 11:01           ` Marc-Christian Petersen
  1 sibling, 1 reply; 109+ messages in thread
From: Andrea Arcangeli @ 2003-06-04 10:43 UTC (permalink / raw)
  To: Marc-Christian Petersen; +Cc: Marcelo Tosatti, Georg Nikodym, lkml

On Wed, Jun 04, 2003 at 12:35:07PM +0200, Marc-Christian Petersen wrote:
> On Wednesday 04 June 2003 12:22, Andrea Arcangeli wrote:
> 
> Hi Andrea,
> 
> > are you really sure that it is the right fix?
> > I mean, the batching has a basic problem (I was discussing it with Jens
> > two days ago and he said he's already addressed in 2.5, I wonder if that
> > could also have an influence on the fact 2.5 is so much better in
> > fariness)
> > the issue with batching in 2.4, is that it is blocking at 0 and waking
> > at batch_requests. But it's not blocking new get_request to eat requests
> > in the way back from 0 to batch_requests. I mean, there are two
> > directions, when we move from batch_requests to 0 get_requests should
> > return requests. in the way back from 0 to batch_requests the
> > get_request should block (and it doesn't in 2.4, that is the problem)
> do you see a chance to fix this up in 2.4?

sure, it's just a matter of adding a bit to the blkdev structure.
However I'm not 100% sure that it is the real thing that could make the
difference, but overall the exclusive-wakeup FIFO in theory should
provide an even higher degree of fairness, so at the very least "fix" 2
from Andrew makes very little sense to me, and it seems just a hack
meant to hide a real problem in the algorithm.

I mean, going wakeall (LIFO btw) rather than wake-one FIFO should, if
anything, make things worse unless it is hiding some other issue.

As for 1 and 3 they were just included in my tree for ages.

BTW, Chris recently spotted a nearly impossible-to-trigger SMP-only race
in the fix-pausing patch [great spotting, Chris] (triggering it would
need an intersection of two races at the same time). It'll be fixed in
my next tree; however, nobody has ever reproduced it and you can certainly
ignore it in practice, so it can't explain any issue.

Andrea

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: -rc7   Re: Linux 2.4.21-rc6
  2003-06-04 10:42         ` Jens Axboe
@ 2003-06-04 10:46           ` Marc-Christian Petersen
  2003-06-04 10:48             ` Andrea Arcangeli
  0 siblings, 1 reply; 109+ messages in thread
From: Marc-Christian Petersen @ 2003-06-04 10:46 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Andrea Arcangeli, Marcelo Tosatti, Georg Nikodym, lkml

On Wednesday 04 June 2003 12:42, Jens Axboe wrote:

Hi Jens,

> > > the issue with batching in 2.4, is that it is blocking at 0 and waking
> > > at batch_requests. But it's not blocking new get_request to eat
> > > requests in the way back from 0 to batch_requests. I mean, there are
> > > two directions, when we move from batch_requests to 0 get_requests
> > > should return requests. in the way back from 0 to batch_requests the
> > > get_request should block (and it doesn't in 2.4, that is the problem)
> > do you see a chance to fix this up in 2.4?
> Nick posted a patch to do so the other day and asked people to test.
Silly mcp. His mail was CC'ed to me :( ... F*ck huge inbox.

ciao, Marc


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: -rc7   Re: Linux 2.4.21-rc6
  2003-06-04 10:46           ` Marc-Christian Petersen
@ 2003-06-04 10:48             ` Andrea Arcangeli
  2003-06-04 11:57               ` Nick Piggin
  0 siblings, 1 reply; 109+ messages in thread
From: Andrea Arcangeli @ 2003-06-04 10:48 UTC (permalink / raw)
  To: Marc-Christian Petersen; +Cc: Jens Axboe, Marcelo Tosatti, Georg Nikodym, lkml

On Wed, Jun 04, 2003 at 12:46:33PM +0200, Marc-Christian Petersen wrote:
> On Wednesday 04 June 2003 12:42, Jens Axboe wrote:
> 
> Hi Jens,
> 
> > > > the issue with batching in 2.4, is that it is blocking at 0 and waking
> > > > at batch_requests. But it's not blocking new get_request to eat
> > > > requests in the way back from 0 to batch_requests. I mean, there are
> > > > two directions, when we move from batch_requests to 0 get_requests
> > > > should return requests. in the way back from 0 to batch_requests the
> > > > get_request should block (and it doesn't in 2.4, that is the problem)
> > > do you see a chance to fix this up in 2.4?
> > Nick posted a patch to do so the other day and asked people to test.
> Silly mcp. His mail was CC'ed to me :( ... F*ck huge inbox.

I was probably not CC'ed; I'll search for the email (I was
travelling the last few days, so I haven't read every single l-k email yet,
sorry ;)

Andrea

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: -rc7   Re: Linux 2.4.21-rc6
  2003-06-04 10:43         ` -rc7 Re: Linux 2.4.21-rc6 Andrea Arcangeli
@ 2003-06-04 11:01           ` Marc-Christian Petersen
  0 siblings, 0 replies; 109+ messages in thread
From: Marc-Christian Petersen @ 2003-06-04 11:01 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Marcelo Tosatti, Georg Nikodym, lkml

On Wednesday 04 June 2003 12:43, Andrea Arcangeli wrote:

Hi Andrea,

> sure, it's just a matter of adding a bit to the blkdev structure.
> However I'm not 100% sure that it is the real thing that could make the
> difference, but overall the exclusive wakeup FIFO in theory should
> provide even an higher degree of fariness, so at the very least the
> "fix" 2 from Andrew makes very little sense to me, and it seems just an
> hack meant to hide a real problem in the algorithm.
well, at least it reduces pauses/stops ;)

> As for 1 and 3 they were just included in my tree for ages.
err, 1 yes, but I don't see that 3 is in your tree. Well, ok, it's a bit different. 
But hey, with your 1+3 there are still pauses ;)

> BTW, Chris recently spotted a nearly impossible to trigger SMP-only race
> in the fix pausing patch [great spotting Chris] (to trigger it would
Cool Chris!

> need an intersection of two races at the same time), it'll be fixed in
> my next tree, however nobody ever reproduced it and you certainly can
> ignore it in practice so it can't explain any issue.
Good to know. Thanks.

ciao, Marc


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: -rc7   Re: Linux 2.4.21-rc6
  2003-06-04 10:48             ` Andrea Arcangeli
@ 2003-06-04 11:57               ` Nick Piggin
  2003-06-04 12:00                 ` Jens Axboe
                                   ` (2 more replies)
  0 siblings, 3 replies; 109+ messages in thread
From: Nick Piggin @ 2003-06-04 11:57 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Marc-Christian Petersen, Jens Axboe, Marcelo Tosatti,
	Georg Nikodym, lkml, Matthias Mueller

[-- Attachment #1: Type: text/plain, Size: 2100 bytes --]

Andrea Arcangeli wrote:

>On Wed, Jun 04, 2003 at 12:46:33PM +0200, Marc-Christian Petersen wrote:
>
>>On Wednesday 04 June 2003 12:42, Jens Axboe wrote:
>>
>>Hi Jens,
>>
>>
>>>>>the issue with batching in 2.4, is that it is blocking at 0 and waking
>>>>>at batch_requests. But it's not blocking new get_request to eat
>>>>>requests in the way back from 0 to batch_requests. I mean, there are
>>>>>two directions, when we move from batch_requests to 0 get_requests
>>>>>should return requests. in the way back from 0 to batch_requests the
>>>>>get_request should block (and it doesn't in 2.4, that is the problem)
>>>>>
>>>>do you see a chance to fix this up in 2.4?
>>>>
>>>Nick posted a patch to do so the other day and asked people to test.
>>>
>>Silly mcp. His mail was CC'ed to me :( ... F*ck huge inbox.
>>
>
>I was probably not CC'ed, I'll search for the email (and I was
>travelling the last few days so I didn't read every single l-k email yet
>sorry ;)
>
>
The patch I sent is actually against 2.4.20, contrary to my
babbling. Reports I have had say it helps, but maybe not as
much as Andrew's fixes. Then Matthias Mueller ported my patch
to 2.4.21-rc6, which includes Andrew's fixes.

It seems that they might be fixing two different problems.
It looks promising though.

My patch would not affect read IO throughput for a smallish
number of readers, because the queue should never fill up.
More than one writer, or a lot of readers, could see some throughput
drop due to the patch making the queue more FIFO at high loads.

I have attached the patch again. It's against 2.4.20.

Nick

Matthias Mueller wrote:

>Currently I'm running 2.4.21-rc6 with your patch and the patch from Andrew 
>and it looks very promising.  Both patches seem to address two different 
>problems, combined I can have 2 parallel dds running and play music with 
>xmms and notice no sounddrops (actually i had one, but that was during 
>very high cpu load). Your patch seems to lower IO-throughput, but I 
>haven't tested this, so no real numbers, just my personal feelings and 
>the numbers 'time dd ...' gave me.
>  
>


[-- Attachment #2: blk-fair-batches-24 --]
[-- Type: text/plain, Size: 2612 bytes --]

--- linux-2.4/include/linux/blkdev.h.orig	2003-06-02 21:59:06.000000000 +1000
+++ linux-2.4/include/linux/blkdev.h	2003-06-02 22:39:57.000000000 +1000
@@ -118,13 +118,21 @@ struct request_queue
 	/*
 	 * Boolean that indicates whether this queue is plugged or not.
 	 */
-	char			plugged;
+	int			plugged:1;
 
 	/*
 	 * Boolean that indicates whether current_request is active or
 	 * not.
 	 */
-	char			head_active;
+	int			head_active:1;
+
+	/*
+	 * Booleans that indicate whether the queue's free requests have
+	 * been exhausted and is waiting to drop below the batch_requests
+	 * threshold
+	 */
+	int			read_full:1;
+	int			write_full:1;
 
 	unsigned long		bounce_pfn;
 
@@ -140,6 +148,30 @@ struct request_queue
 	wait_queue_head_t	wait_for_requests[2];
 };
 
+static inline void set_queue_full(request_queue_t *q, int rw)
+{
+	if (rw == READ)
+		q->read_full = 1;
+	else
+		q->write_full = 1;
+}
+
+static inline void clear_queue_full(request_queue_t *q, int rw)
+{
+	if (rw == READ)
+		q->read_full = 0;
+	else
+		q->write_full = 0;
+}
+
+static inline int queue_full(request_queue_t *q, int rw)
+{
+	if (rw == READ)
+		return q->read_full;
+	else
+		return q->write_full;
+}
+
 extern unsigned long blk_max_low_pfn, blk_max_pfn;
 
 #define BLK_BOUNCE_HIGH		(blk_max_low_pfn << PAGE_SHIFT)
--- linux-2.4/drivers/block/ll_rw_blk.c.orig	2003-06-02 21:56:54.000000000 +1000
+++ linux-2.4/drivers/block/ll_rw_blk.c	2003-06-02 22:17:13.000000000 +1000
@@ -513,7 +513,10 @@ static struct request *get_request(reque
 	struct request *rq = NULL;
 	struct request_list *rl = q->rq + rw;
 
-	if (!list_empty(&rl->free)) {
+	if (list_empty(&rl->free))
+		set_queue_full(q, rw);
+	
+	if (!queue_full(q, rw)) {
 		rq = blkdev_free_rq(&rl->free);
 		list_del(&rq->queue);
 		rl->count--;
@@ -594,7 +597,7 @@ static struct request *__get_request_wai
 	add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
 	do {
 		set_current_state(TASK_UNINTERRUPTIBLE);
-		if (q->rq[rw].count == 0)
+		if (queue_full(q, rw))
 			schedule();
 		spin_lock_irq(&io_request_lock);
 		rq = get_request(q, rw);
@@ -829,9 +832,14 @@ void blkdev_release_request(struct reque
 	 */
 	if (q) {
 		list_add(&req->queue, &q->rq[rw].free);
-		if (++q->rq[rw].count >= q->batch_requests &&
-				waitqueue_active(&q->wait_for_requests[rw]))
-			wake_up(&q->wait_for_requests[rw]);
+		q->rq[rw].count++;
+		if (q->rq[rw].count >= q->batch_requests) {
+			if (q->rq[rw].count == q->batch_requests) 
+				clear_queue_full(q, rw);
+
+			if (waitqueue_active(&q->wait_for_requests[rw]))
+				wake_up(&q->wait_for_requests[rw]);
+		}
 	}
 }
 

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: -rc7   Re: Linux 2.4.21-rc6
  2003-06-04 11:57               ` Nick Piggin
@ 2003-06-04 12:00                 ` Jens Axboe
  2003-06-04 12:09                   ` Andrea Arcangeli
  2003-06-04 12:11                   ` Nick Piggin
  2003-06-04 12:35                 ` Miquel van Smoorenburg
  2003-06-09 21:39                 ` [PATCH] io stalls (was: -rc7 Re: Linux 2.4.21-rc6) Chris Mason
  2 siblings, 2 replies; 109+ messages in thread
From: Jens Axboe @ 2003-06-04 12:00 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrea Arcangeli, Marc-Christian Petersen, Marcelo Tosatti,
	Georg Nikodym, lkml, Matthias Mueller

On Wed, Jun 04 2003, Nick Piggin wrote:
> Andrea Arcangeli wrote:
> 
> >On Wed, Jun 04, 2003 at 12:46:33PM +0200, Marc-Christian Petersen wrote:
> >
> >>On Wednesday 04 June 2003 12:42, Jens Axboe wrote:
> >>
> >>Hi Jens,
> >>
> >>
> >>>>>the issue with batching in 2.4, is that it is blocking at 0 and waking
> >>>>>at batch_requests. But it's not blocking new get_request to eat
> >>>>>requests in the way back from 0 to batch_requests. I mean, there are
> >>>>>two directions, when we move from batch_requests to 0 get_requests
> >>>>>should return requests. in the way back from 0 to batch_requests the
> >>>>>get_request should block (and it doesn't in 2.4, that is the problem)
> >>>>>
> >>>>do you see a chance to fix this up in 2.4?
> >>>>
> >>>Nick posted a patch to do so the other day and asked people to test.
> >>>
> >>Silly mcp. His mail was CC'ed to me :( ... F*ck huge inbox.
> >>
> >
> >I was probably not CC'ed, I'll search for the email (and I was
> >travelling the last few days so I didn't read every single l-k email yet
> >sorry ;)
> >
> >
> The patch I sent is actually against 2.4.20, contrary to my
> babling. Reports I have had say it helps, but maybe not so
> much as Andrew'ss fixes. Then Matthias Mueller ported my patch
> to 2.4.21-rc6 which include Andrew's fixes.
> 
> It seems that they might be fixing two different problems.
> It looks promising though.

It is a different problem, I think; yours will fix the starvation of
writers (and of readers, though writer starvation is much, much easier to
trigger), where someone will repeatedly get cheated by the request
allocator.
The other problem is still not clear to anyone. I doubt this patch would
make any difference (apart from a psychological one) in this case,
since you have a single writer and maybe a reader or two. The single
writer cannot starve anyone else.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: -rc7   Re: Linux 2.4.21-rc6
  2003-06-04 12:00                 ` Jens Axboe
@ 2003-06-04 12:09                   ` Andrea Arcangeli
  2003-06-04 12:20                     ` Jens Axboe
  2003-06-04 12:11                   ` Nick Piggin
  1 sibling, 1 reply; 109+ messages in thread
From: Andrea Arcangeli @ 2003-06-04 12:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Nick Piggin, Marc-Christian Petersen, Marcelo Tosatti,
	Georg Nikodym, lkml, Matthias Mueller

On Wed, Jun 04, 2003 at 02:00:53PM +0200, Jens Axboe wrote:
> since you have a single writer and maybe a reader or two. The single
> writer cannot starve anyone else.

unless you're changing an atime and you have to mark_buffer_dirty or
similar (balance_dirty will then write stuff the same way from cp and the
reader).

Maybe we can get some stack trace with kgdb to be sure where the reader
is blocking.

Andrea

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: -rc7   Re: Linux 2.4.21-rc6
  2003-06-04 12:00                 ` Jens Axboe
  2003-06-04 12:09                   ` Andrea Arcangeli
@ 2003-06-04 12:11                   ` Nick Piggin
  1 sibling, 0 replies; 109+ messages in thread
From: Nick Piggin @ 2003-06-04 12:11 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Andrea Arcangeli, Marc-Christian Petersen, Marcelo Tosatti,
	Georg Nikodym, lkml, Matthias Mueller



Jens Axboe wrote:

>On Wed, Jun 04 2003, Nick Piggin wrote:
>
>>Andrea Arcangeli wrote:
>>
>>
>>>On Wed, Jun 04, 2003 at 12:46:33PM +0200, Marc-Christian Petersen wrote:
>>>
>>>
>>>>On Wednesday 04 June 2003 12:42, Jens Axboe wrote:
>>>>
>>>>Hi Jens,
>>>>
>>>>
>>>>
>>>>>>>the issue with batching in 2.4, is that it is blocking at 0 and waking
>>>>>>>at batch_requests. But it's not blocking new get_request to eat
>>>>>>>requests in the way back from 0 to batch_requests. I mean, there are
>>>>>>>two directions, when we move from batch_requests to 0 get_requests
>>>>>>>should return requests. in the way back from 0 to batch_requests the
>>>>>>>get_request should block (and it doesn't in 2.4, that is the problem)
>>>>>>>
>>>>>>>
>>>>>>do you see a chance to fix this up in 2.4?
>>>>>>
>>>>>>
>>>>>Nick posted a patch to do so the other day and asked people to test.
>>>>>
>>>>>
>>>>Silly mcp. His mail was CC'ed to me :( ... F*ck huge inbox.
>>>>
>>>>
>>>I was probably not CC'ed, I'll search for the email (and I was
>>>travelling the last few days so I didn't read every single l-k email yet
>>>sorry ;)
>>>
>>>
>>>
>>The patch I sent is actually against 2.4.20, contrary to my
>>babling. Reports I have had say it helps, but maybe not so
>>much as Andrew'ss fixes. Then Matthias Mueller ported my patch
>>to 2.4.21-rc6 which include Andrew's fixes.
>>
>>It seems that they might be fixing two different problems.
>>It looks promising though.
>>
>
>It is a different problem I think, yours will fix the starvation of
>writers (of readers, writers is much much easier to trigger though)
>where someone will repeatedly get cheaten by the request allocator.
>
>The other problem is still not clear to anyone. I doubt this patch would
>make any difference (apart from a psychological one) in this case,
>since you have a single writer and maybe a reader or two. The single
>writer cannot starve anyone else.
>

You are right about what the patch does. It wouldn't surprise me
if there are still other problems, but it could be that the reader
has to write some swap or other dirty buffers when trying to get
memory itself.

I have had 3 or so reports all saying similar things, but it could
be psychological I guess.


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: -rc7   Re: Linux 2.4.21-rc6
  2003-06-04 12:09                   ` Andrea Arcangeli
@ 2003-06-04 12:20                     ` Jens Axboe
  2003-06-04 20:50                       ` Rob Landley
  0 siblings, 1 reply; 109+ messages in thread
From: Jens Axboe @ 2003-06-04 12:20 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nick Piggin, Marc-Christian Petersen, Marcelo Tosatti,
	Georg Nikodym, lkml, Matthias Mueller

On Wed, Jun 04 2003, Andrea Arcangeli wrote:
> On Wed, Jun 04, 2003 at 02:00:53PM +0200, Jens Axboe wrote:
> > since you have a single writer and maybe a reader or two. The single
> > writer cannot starve anyone else.
> 
> unless you're changing an atime and you've to mark_buffer_dirty or
> similar (balance_dirty will write stuff the same way from cp and the
> reader then).

Yes you are right, could be.

But the whole thing still smells fishy. Read starvation causing mouse
stalls, hmm.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: -rc7   Re: Linux 2.4.21-rc6
  2003-06-04 11:57               ` Nick Piggin
  2003-06-04 12:00                 ` Jens Axboe
@ 2003-06-04 12:35                 ` Miquel van Smoorenburg
  2003-06-09 21:39                 ` [PATCH] io stalls (was: -rc7 Re: Linux 2.4.21-rc6) Chris Mason
  2 siblings, 0 replies; 109+ messages in thread
From: Miquel van Smoorenburg @ 2003-06-04 12:35 UTC (permalink / raw)
  To: linux-kernel

In article <3EDDDEBB.4080209@cyberone.com.au>,
Nick Piggin  <piggin@cyberone.com.au> wrote:
>-	char			plugged;
>+	int			plugged:1;

This is dangerous:

#include <stdio.h>

struct foo {
        int     bla:1;  /* 1-bit field; signed with gcc, so it holds only 0 and -1 */
};
 
int main()
{
        struct foo      f;
 
        f.bla = 1;
        printf("%d\n", f.bla);
        return 0;
}


$ ./a.out
-1

If you want to put "0" and "1" in a 1-bit field, use "unsigned int bla:1".

Mike.
-- 
.. somehow I have a feeling the hurting hasn't even begun yet
	-- Bill, "The Terrible Thunderlizards"


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: Linux 2.4.21-rc6
  2003-06-03 16:30           ` Michael Frank
  2003-06-03 16:53             ` Matthias Mueller
  2003-06-03 16:59             ` Marc-Christian Petersen
@ 2003-06-04 14:56             ` Jakob Oestergaard
  2 siblings, 0 replies; 109+ messages in thread
From: Jakob Oestergaard @ 2003-06-04 14:56 UTC (permalink / raw)
  To: Michael Frank; +Cc: Marcelo Tosatti, Marc Wilson, lkml

On Wed, Jun 04, 2003 at 12:30:27AM +0800, Michael Frank wrote:
...
> >
> > Ok, so you can reproduce the hangs reliably EVEN with -rc6, Marc?
> 
> -rc6 is better - comparable to 2.4.18 in what I have seen with my script.  

I've run 2.4.20 for a long time, and have been seriously plagued with
the I/O stalls.

On a file server (details below) here I upgraded to 2.4.21-rc6
yesterday.

The I/O stalls have *almost* gone away.

Best of all, we still have our data intact  ;)

Server data:
  ~130 GB data on a ~150 GB ext3fs with >1 million files
  Software RAID-0+1 on four IDE disks
  Two promise controllers 1x20262 1x20269
  1x Intel eepro100, 1x Intel e1000
  dual PIII, half a gig of memory
  NFS server (mainly v3, many different clients)

> 
> After the long obscure problems since 2.4.19x, -rc6 could use serious 
> stress-testing. 

This server rarely has load below 1, but frequently above 15.  It may
run some compilers and linkers locally, but most of the load comes from
NFS serving.

So far it's been running for 28 hours with that kind of load.  Nothing
suspicious in the dmesg yet.

I will of course let you all know if it falls on its knees.


So far it's all thumbs-up from me!   There may still be an occasional
stall here and there, but compared to 2.4.20 this is heaven (it really
was unbelievably annoying having your emacs stall for 10 seconds every
30 seconds when someone was linking on the cluster)  :)

A big *thank*you* to Marcelo for deciding to include a fix for the I/O
stalls!

-- 
................................................................
:   jakob@unthought.net   : And I see the elder races,         :
:.........................: putrid forms of man                :
:   Jakob Østergaard      : See him rise and claim the earth,  :
:        OZ9ABN           : his downfall is at hand.           :
:.........................:............{Konkhra}...............:

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: -rc7   Re: Linux 2.4.21-rc6
  2003-05-29 20:18       ` Krzysiek Taraszka
@ 2003-06-04 18:17         ` Marcelo Tosatti
  2003-06-04 21:41           ` Krzysiek Taraszka
  0 siblings, 1 reply; 109+ messages in thread
From: Marcelo Tosatti @ 2003-06-04 18:17 UTC (permalink / raw)
  To: Krzysiek Taraszka; +Cc: Georg Nikodym, lkml



On Thu, 29 May 2003, Krzysiek Taraszka wrote:

> On Thursday, 29 May 2003 21:56, Krzysiek Taraszka wrote:
> > On Thursday, 29 May 2003 21:11, Marcelo Tosatti wrote:
> > > On Thu, 29 May 2003, Georg Nikodym wrote:
> > > > On Wed, 28 May 2003 21:55:39 -0300 (BRT)
> > > >
> > > > Marcelo Tosatti <marcelo@conectiva.com.br> wrote:
> > > > > Here goes -rc6. I've decided to delay 2.4.21 a bit and try Andrew's
> > > > > fix for the IO stalls/deadlocks.
> > > >
> > > > While others may be dubious about the efficacy of this patch, I've been
> > > > running -rc6 on my laptop now since sometime last night and have seen
> > > > nothing odd.
> > > >
> > > > In case anybody cares, I'm using both ide and a ieee1394 (for a large
> > > > external drive [which implies scsi]) and I do a _lot_ of big work with
> > > > BK so I was seeing the problem within hours previously.
> > >
> > > Great!
> > >
> > > -rc7 will have to be released due to some problems :(
> >
> > hmm, seems to ide modules and others are broken. Im looking for reason why
>
> hmm, for IDE subsystem the ide-proc.o was't made for CONFIG_BLK_DEV_IDE=m ...
> anyone goes to fix it ? or shall I prepare and send here my own patch ?

Feel free to send your own patch, please :)

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: -rc7   Re: Linux 2.4.21-rc6
  2003-06-04 12:20                     ` Jens Axboe
@ 2003-06-04 20:50                       ` Rob Landley
  0 siblings, 0 replies; 109+ messages in thread
From: Rob Landley @ 2003-06-04 20:50 UTC (permalink / raw)
  To: Jens Axboe, Andrea Arcangeli; +Cc: lkml

[-- Attachment #1: Type: text/plain, Size: 1497 bytes --]

On Wednesday 04 June 2003 08:20, Jens Axboe wrote:
> On Wed, Jun 04 2003, Andrea Arcangeli wrote:
> > On Wed, Jun 04, 2003 at 02:00:53PM +0200, Jens Axboe wrote:
> > > since you have a single writer and maybe a reader or two. The single
> > > writer cannot starve anyone else.
> >
> > unless you're changing an atime and you've to mark_buffer_dirty or
> > similar (balance_dirty will write stuff the same way from cp and the
> > reader then).
>
> Yes you are right, could be.
>
> But the whole thing still smells fishy. Read starvation causing mouse
> stalls, hmm.

If reads from swap get starved, you can have interactive dropouts in just 
about anything.

My desktop is usually pretty deep into swap.  I upgrade to machines with four 
times as much memory, but that usually means the graphics resolution went up 
and it just lets me keep more windows open in more desktops.  (Currently 
six.)

My record was driving the system so deep into swapping frenzy it was still 
swapping when I came back from lunch.  Really.  This was under 2.4.4, though.  
On RH 9/2.4.20-? my record is a little under five minutes of "frozen 
thrashing on swap" before I got control of the system back.  That's just a 
"go for a soda" break.  And at least the mouse cursor never froze for more 
than a couple seconds at a time during that, even if the desktop was ignoring 
me... :)

Haven't tried 2.5 on anything but servers yet, but it's on my to-do list...

Rob

(I am the VM subsystem's worst nightmare.  Bwahaha.)

[-- Attachment #2: typescript --]
[-- Type: text/plain, Size: 4531 bytes --]

Script started on Wed 04 Jun 2003 04:25:29 PM EDT
[landley@localhost landley]$ cat /proc/meminfo
        total:    used:    free:  shared: buffers:  cached:
Mem:  261390336 247234560 14155776        0  9351168 80461824
Swap: 542859264 276152320 266706944
MemTotal:       255264 kB
MemFree:         13824 kB
MemShared:           0 kB
Buffers:          9132 kB
Cached:          43372 kB
SwapCached:      35204 kB
Active:         182324 kB
ActiveAnon:     131940 kB
ActiveCache:     50384 kB
Inact_dirty:     19164 kB
Inact_laundry:   14400 kB
Inact_clean:      3512 kB
Inact_target:    43880 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       255264 kB
LowFree:         13824 kB
SwapTotal:      530136 kB
SwapFree:       260456 kB
[landley@localhost landley]$ cat /proc/slabinfo
slabinfo - version: 1.1
kmem_cache            65     70    108    2    2    1
ip_fib_hash           11    112     32    1    1    1
urb_priv               0      0     64    0    0    1
journal_head          57    770     48    1   10    1
revoke_table           2    250     12    1    1    1
revoke_record          0    112     32    0    1    1
clip_arp_cache         0      0    128    0    0    1
ip_mrt_cache           0      0    128    0    0    1
tcp_tw_bucket          0     90    128    0    3    1
tcp_bind_bucket        4    224     32    1    2    1
tcp_open_request       0     30    128    0    1    1
inet_peer_cache        0     58     64    0    1    1
ip_dst_cache           5     75    256    1    5    1
arp_cache              2     30    128    1    1    1
blkdev_requests      256    270    128    9    9    1
dnotify_cache          0      0     20    0    0    1
file_lock_cache        0     41     92    0    1    1
fasync_cache           2    200     16    1    1    1
uid_cache              2    112     32    1    1    1
skbuff_head_cache    176   2265    256   32  151    1
sock                 589    720   1280  220  240    1
sigqueue               0     29    132    0    1    1
kiobuf                 0      0     64    0    0    1
cdev_cache            26    232     64    2    4    1
bdev_cache             4     58     64    1    1    1
mnt_cache             13     58     64    1    1    1
inode_cache         2395   3647    512  519  521    1
dentry_cache        2477   4050    128  135  135    1
dquot                  0      0    128    0    0    1
filp                2364   2370    128   79   79    1
names_cache            0     14   4096    0   14    1
buffer_head        16649  30360    128  789 1012    1
mm_struct            173    210    256   14   14    1
vm_area_struct      5840   7770    128  238  259    1
fs_cache              78    116     64    2    2    1
files_cache           78    112    512   15   16    1
signal_cache         243    290     64    5    5    1
sighand_cache        235    253   1408   22   23    4
task_struct            0      0   1792    0    0    1
pte_chain           1958   7590    128   83  253    1
size-131072(DMA)       0      0 131072    0    0   32
size-131072            0      0 131072    0    0   32
size-65536(DMA)        0      0  65536    0    0   16
size-65536             0      0  65536    0    0   16
size-32768(DMA)        0      0  32768    0    0    8
size-32768             0      0  32768    0    0    8
size-16384(DMA)        0      0  16384    0    0    4
size-16384             0     16  16384    0   16    4
size-8192(DMA)         0      0   8192    0    0    2
size-8192              4     19   8192    4   19    2
size-4096(DMA)         0      0   4096    0    0    1
size-4096             35     75   4096   35   75    1
size-2048(DMA)         0      0   2048    0    0    1
size-2048              8     86   2048    5   43    1
size-1024(DMA)         0      0   1024    0    0    1
size-1024             59    124   1024   18   31    1
size-512(DMA)          0      0    512    0    0    1
size-512              43    200    512   11   25    1
size-256(DMA)          0      0    256    0    0    1
size-256              43   1200    256    8   80    1
size-128(DMA)          1     30    128    1    1    1
size-128             707   3240    128   33  108    1
size-64(DMA)           0      0    128    0    0    1
size-64              377   1170    128   30   39    1
size-32(DMA)          17     58     64    1    1    1
size-32              397    754     64   10   13    1
[landley@localhost landley]$ 
Script done on Wed 04 Jun 2003 04:25:42 PM EDT

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: -rc7   Re: Linux 2.4.21-rc6
  2003-06-04 18:17         ` Marcelo Tosatti
@ 2003-06-04 21:41           ` Krzysiek Taraszka
  2003-06-04 22:37             ` Alan Cox
  0 siblings, 1 reply; 109+ messages in thread
From: Krzysiek Taraszka @ 2003-06-04 21:41 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Georg Nikodym, lkml

On Wednesday, 04 June 2003 20:17, Marcelo Tosatti wrote:
> On Thu, 29 May 2003, Krzysiek Taraszka wrote:
> > On Thursday, 29 May 2003 21:56, Krzysiek Taraszka wrote:
> > > On Thursday, 29 May 2003 21:11, Marcelo Tosatti wrote:
> > > > On Thu, 29 May 2003, Georg Nikodym wrote:
> > > > > On Wed, 28 May 2003 21:55:39 -0300 (BRT)
> > > > >
> > > > > Marcelo Tosatti <marcelo@conectiva.com.br> wrote:
> > > > > > Here goes -rc6. I've decided to delay 2.4.21 a bit and try
> > > > > > Andrew's fix for the IO stalls/deadlocks.
> > > > >
> > > > > While others may be dubious about the efficacy of this patch, I've
> > > > > been running -rc6 on my laptop now since sometime last night and
> > > > > have seen nothing odd.
> > > > >
> > > > > In case anybody cares, I'm using both ide and a ieee1394 (for a
> > > > > large external drive [which implies scsi]) and I do a _lot_ of big
> > > > > work with BK so I was seeing the problem within hours previously.
> > > >
> > > > Great!
> > > >
> > > > -rc7 will have to be released due to some problems :(
> > >
> > > hmm, it seems the ide modules and others are broken. I'm looking for
> > > the reason why
> >
> > hmm, for the IDE subsystem ide-proc.o wasn't built for CONFIG_BLK_DEV_IDE=m
> > ... is anyone going to fix it, or shall I prepare and send my own patch
> > here?
>
> Feel free to send your own patch, please :)

Hm, I sent it a few days ago (a reply to Andrzej Krzysztofowicz's post, something
with -rc3 in the subject :)) with other fixes, but without the cmd640 fixes.
Alan made almost the same changes, but his -ac tree still has a broken
cmd640 modular driver (cmd640_vlb is still unresolved).
I tried to hack on it ... but I dropped it ... maybe tomorrow I'll get back
to that code ... or someone else can fix it (maybe Alan?)

-- 
Krzysiek Taraszka			(dzimi@pld.org.pl)
http://cyborg.kernel.pl/~dzimi/


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: Linux 2.4.21-rc6
  2003-06-03 16:13           ` Marc-Christian Petersen
@ 2003-06-04 21:54             ` Pavel Machek
  2003-06-05  2:10               ` Michael Frank
  0 siblings, 1 reply; 109+ messages in thread
From: Pavel Machek @ 2003-06-04 21:54 UTC (permalink / raw)
  To: Marc-Christian Petersen; +Cc: Marcelo Tosatti, Marc Wilson, lkml

Hi!

> > Ok, so you can reproduce the hangs reliably EVEN with -rc6, Marc?
> well, even if you mean Marc Wilson, I also have to say something (as I've 
> written in my previous email some days ago)
> 
> The pauses/stops are _a lot_ less than w/o the fix but they are _not_ gone. 
> Tested with 2.4.21-rc6.

If hangs are not worse than 2.4.20, then I'd go ahead with release....

									Pavel
-- 
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: -rc7   Re: Linux 2.4.21-rc6
  2003-06-04 21:41           ` Krzysiek Taraszka
@ 2003-06-04 22:37             ` Alan Cox
  0 siblings, 0 replies; 109+ messages in thread
From: Alan Cox @ 2003-06-04 22:37 UTC (permalink / raw)
  To: Krzysiek Taraszka; +Cc: Marcelo Tosatti, Georg Nikodym, lkml

On Wed, 2003-06-04 at 22:41, Krzysiek Taraszka wrote: 
> -rc3 in the subject :)) with other fixes, but without the cmd640 fixes.
> Alan made almost the same changes, but his -ac tree still has a broken
> cmd640 modular driver (cmd640_vlb is still unresolved).
> I tried to hack on it ... but I dropped it ... maybe tomorrow I'll get back
> to that code ... or someone else can fix it (maybe Alan?)

cmd640_vlb is gone from the core code in the -ac tree, so that surprises
me. Adrian Bunk sent me some more patches to look at. I'm not 100%
convinced by them, but there are a few cases left and some of his stuff
certainly fixes real problems.


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: Linux 2.4.21-rc6
  2003-06-04 21:54             ` Pavel Machek
@ 2003-06-05  2:10               ` Michael Frank
  0 siblings, 0 replies; 109+ messages in thread
From: Michael Frank @ 2003-06-05  2:10 UTC (permalink / raw)
  To: Pavel Machek, Marc-Christian Petersen; +Cc: Marcelo Tosatti, Marc Wilson, lkml

On Thursday 05 June 2003 05:54, Pavel Machek wrote:
> Hi!
>
> > > Ok, so you can reproduce the hangs reliably EVEN with -rc6, Marc?
> >
> > well, even if you mean Marc Wilson, I also have to say something (as I've
> > written in my previous email some days ago)
> >
> > The pauses/stops are _a lot_ less than w/o the fix but they are _not_
> > gone. Tested with 2.4.21-rc6.
>
> If hangs are not worse than 2.4.20, then I'd go ahead with release....
>
> 									
I have had -rc6 running on a P4 for a few days, doing the test script,
compiles, and Opera, and have found it to be comparable to 2.4.18.

It also does well on slower machines with about 1/4 the CPU and disk
bandwidth.

IMHO, interactivity is reasonable (again just IMHO), and others
may disagree.

-- 
Powered by linux-2.5.70-mm3
My current linux related activities in rough order of priority:
- Testing of 2.4/2.5 kernel interactivity
- Testing of Swsusp for 2.4
- Testing of Opera 7.11 emphasizing interactivity
- Research of NFS i/o errors during transfer 2.4>2.5
- Learning 2.5 series kernel debugging with kgdb - it's in the -mm tree
- Studying 2.5 series serial and ide drivers, ACPI, S3
* Input and feedback is always welcome *


^ permalink raw reply	[flat|nested] 109+ messages in thread

* [PATCH] io stalls (was: -rc7   Re: Linux 2.4.21-rc6)
  2003-06-04 11:57               ` Nick Piggin
  2003-06-04 12:00                 ` Jens Axboe
  2003-06-04 12:35                 ` Miquel van Smoorenburg
@ 2003-06-09 21:39                 ` Chris Mason
  2003-06-09 22:19                   ` Andrea Arcangeli
                                     ` (2 more replies)
  2 siblings, 3 replies; 109+ messages in thread
From: Chris Mason @ 2003-06-09 21:39 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrea Arcangeli, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller

Ok, there are lots of different problems here, and I've spent a little
while trying to get some numbers with the __get_request_wait stats patch
I posted before.  This is all on ext2, since I wanted to rule out
interactions with the journal flavors.

Basically a dbench 90 run on ext2 rc6 vanilla kernels can generate
latencies of over 2700 jiffies in __get_request_wait, with an average
latency over 250 jiffies.

No, most desktop workloads aren't dbench 90, but between balance_dirty()
and the way we send stuff to disk during memory allocations, just about
any process can get stuck submitting dirty buffers even if you've just
got one process doing a dd if=/dev/zero of=foo.

So, for the moment I'm going to pretend people seeing stalls in X are
stuck in atime updates or memory allocations, or reading proc or some
other silly spot.  

For the SMP corner cases, I've merged Andrea's fix-pausing patch into
rc7, along with an altered form of Nick Piggin's queue_full patch to try
and fix the latency problems.

The major difference from Nick's patch is that once the queue is marked
full, I don't clear the full flag until the wait queue is empty.  This
means new io can't steal available requests until every existing waiter
has been granted a request.
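
To make that concrete, here is a condensed sketch of the logic, taken
straight from the patch below (get_request(), __get_request_wait() and
blkdev_release_request() in ll_rw_blk.c); it is only an outline, the full
context is in the diff:

	/* sketch: new callers are turned away once the queue is marked full */
	static struct request *get_request(request_queue_t *q, int rw)
	{
		if (queue_full(q, rw))
			return NULL;
		return __get_request(q, rw);	/* sets the full flag on failure */
	}

	/* sketch: in __get_request_wait(), the flag is only cleared once the
	 * last sleeper has been handed a request, so newcomers cannot steal
	 * requests from tasks that are already waiting */
	if (!waitqueue_active(&q->wait_for_requests[rw]))
		clear_queue_full(q, rw);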

The latency results are better, with average time spent in
__get_request_wait being around 28 jiffies, and a max of 170 jiffies. 
The cost is throughput; further benchmarking needs to be done, but I
wanted to get this out for review and testing.  It should at least help
us decide if the request allocation code really is causing our problems.

The patch below also includes the __get_request_wait latency stats.  If
people try this and still see stalls, please run elvtune /dev/xxxx and
send along the resulting console output.

I haven't yet compared this to Andrea's elevator latency code, but the
stat patch was originally developed on top of his 2.4.21pre3aa1, where
the average wait was 97 jiffies and the max was 318.

Anyway, less talk, more code.  Treat this with care, it has only been
lightly tested.  Thanks to Andrea and Nick whose patches this is largely
based on:

--- 1.9/drivers/block/blkpg.c	Sat Mar 30 06:58:05 2002
+++ edited/drivers/block/blkpg.c	Mon Jun  9 12:17:24 2003
@@ -261,6 +261,7 @@
 			return blkpg_ioctl(dev, (struct blkpg_ioctl_arg *) arg);
 			
 		case BLKELVGET:
+			blk_print_stats(dev);
 			return blkelvget_ioctl(&blk_get_queue(dev)->elevator,
 					       (blkelv_ioctl_arg_t *) arg);
 		case BLKELVSET:
--- 1.45/drivers/block/ll_rw_blk.c	Wed May 28 03:50:02 2003
+++ edited/drivers/block/ll_rw_blk.c	Mon Jun  9 17:13:16 2003
@@ -429,6 +429,8 @@
 	q->rq[READ].count = 0;
 	q->rq[WRITE].count = 0;
 	q->nr_requests = 0;
+	q->read_full = 0;
+	q->write_full = 0;
 
 	si_meminfo(&si);
 	megs = si.totalram >> (20 - PAGE_SHIFT);
@@ -442,6 +444,56 @@
 	spin_lock_init(&q->queue_lock);
 }
 
+void blk_print_stats(kdev_t dev) 
+{
+	request_queue_t *q;
+	unsigned long avg_wait;
+	unsigned long min_wait;
+	unsigned long high_wait;
+	unsigned long *d;
+
+	q = blk_get_queue(dev);
+	if (!q)
+		return;
+
+	min_wait = q->min_wait;
+	if (min_wait == ~0UL)
+		min_wait = 0;
+	if (q->num_wait) 
+		avg_wait = q->total_wait / q->num_wait;
+	else
+		avg_wait = 0;
+	printk("device %s: num_req %lu, total jiffies waited %lu\n", 
+	       kdevname(dev), q->num_req, q->total_wait);
+	printk("\t%lu forced to wait\n", q->num_wait);
+	printk("\t%lu min wait, %lu max wait\n", min_wait, q->max_wait);
+	printk("\t%lu average wait\n", avg_wait);
+	d = q->deviation;
+	printk("\t%lu < 100, %lu < 200, %lu < 300, %lu < 400, %lu < 500\n",
+               d[0], d[1], d[2], d[3], d[4]);
+	high_wait = d[0] + d[1] + d[2] + d[3] + d[4];
+	high_wait = q->num_wait - high_wait;
+	printk("\t%lu waits longer than 500 jiffies\n", high_wait);
+}
+
+static void reset_stats(request_queue_t *q)
+{
+	q->max_wait		= 0;
+	q->min_wait		= ~0UL;
+	q->total_wait		= 0;
+	q->num_req		= 0;
+	q->num_wait		= 0;
+	memset(q->deviation, 0, sizeof(q->deviation));
+}
+void blk_reset_stats(kdev_t dev) 
+{
+	request_queue_t *q;
+	q = blk_get_queue(dev);
+	if (!q)
+	    return;
+	printk("reset latency stats on device %s\n", kdevname(dev));
+	reset_stats(q);
+}
 static int __make_request(request_queue_t * q, int rw, struct buffer_head * bh);
 
 /**
@@ -491,6 +543,9 @@
 	q->plug_tq.routine	= &generic_unplug_device;
 	q->plug_tq.data		= q;
 	q->plugged        	= 0;
+
+	reset_stats(q);
+
 	/*
 	 * These booleans describe the queue properties.  We set the
 	 * default (and most common) values here.  Other drivers can
@@ -508,7 +563,7 @@
  * Get a free request. io_request_lock must be held and interrupts
  * disabled on the way in.  Returns NULL if there are no free requests.
  */
-static struct request *get_request(request_queue_t *q, int rw)
+static struct request *__get_request(request_queue_t *q, int rw)
 {
 	struct request *rq = NULL;
 	struct request_list *rl = q->rq + rw;
@@ -521,10 +576,17 @@
 		rq->cmd = rw;
 		rq->special = NULL;
 		rq->q = q;
-	}
+	} else
+		set_queue_full(q, rw);
 
 	return rq;
 }
+static struct request *get_request(request_queue_t *q, int rw)
+{
+	if (queue_full(q, rw))
+		return NULL;
+	return __get_request(q, rw);
+}
 
 /*
  * Here's the request allocation design:
@@ -588,23 +650,57 @@
 static struct request *__get_request_wait(request_queue_t *q, int rw)
 {
 	register struct request *rq;
+	int waited = 0;
+	unsigned long wait_start = jiffies;
+	unsigned long time_waited;
 	DECLARE_WAITQUEUE(wait, current);
 
-	add_wait_queue(&q->wait_for_requests[rw], &wait);
+	add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
+
 	do {
 		set_current_state(TASK_UNINTERRUPTIBLE);
-		generic_unplug_device(q);
-		if (q->rq[rw].count == 0)
-			schedule();
 		spin_lock_irq(&io_request_lock);
-		rq = get_request(q, rw);
+		if ((!waited && queue_full(q, rw)) || q->rq[rw].count == 0) {
+			__generic_unplug_device(q);
+			spin_unlock_irq(&io_request_lock);
+			schedule();
+			spin_lock_irq(&io_request_lock);
+			waited = 1;
+		}
+		rq = __get_request(q, rw);
 		spin_unlock_irq(&io_request_lock);
 	} while (rq == NULL);
 	remove_wait_queue(&q->wait_for_requests[rw], &wait);
 	current->state = TASK_RUNNING;
+
+	if (!waitqueue_active(&q->wait_for_requests[rw]))
+		clear_queue_full(q, rw);
+
+	time_waited = jiffies - wait_start;
+	if (time_waited > q->max_wait)
+		q->max_wait = time_waited;
+	if (time_waited && time_waited < q->min_wait)
+		q->min_wait = time_waited;
+	q->total_wait += time_waited;
+	q->num_wait++;
+	if (time_waited < 500) {
+		q->deviation[time_waited/100]++;
+	}
+
 	return rq;
 }
 
+static void get_request_wait_wakeup(request_queue_t *q, int rw)
+{
+	/*
+	 * avoid losing an unplug if a second __get_request_wait did the
+	 * generic_unplug_device while our __get_request_wait was running
+	 * w/o the queue_lock held and w/ our request out of the queue.
+	 */
+	if (waitqueue_active(&q->wait_for_requests[rw]))
+		wake_up(&q->wait_for_requests[rw]);
+}
+
 /* RO fail safe mechanism */
 
 static long ro_bits[MAX_BLKDEV][8];
@@ -829,8 +925,14 @@
 	 */
 	if (q) {
 		list_add(&req->queue, &q->rq[rw].free);
-		if (++q->rq[rw].count >= q->batch_requests)
-			wake_up(&q->wait_for_requests[rw]);
+		q->rq[rw].count++;
+		if (q->rq[rw].count >= q->batch_requests) {
+			smp_mb();
+			if (waitqueue_active(&q->wait_for_requests[rw]))
+				wake_up(&q->wait_for_requests[rw]);
+			else
+				clear_queue_full(q, rw);
+		}
 	}
 }
 
@@ -948,7 +1050,6 @@
 	 */
 	max_sectors = get_max_sectors(bh->b_rdev);
 
-again:
 	req = NULL;
 	head = &q->queue_head;
 	/*
@@ -957,6 +1058,7 @@
 	 */
 	spin_lock_irq(&io_request_lock);
 
+again:
 	insert_here = head->prev;
 	if (list_empty(head)) {
 		q->plug_device_fn(q, bh->b_rdev); /* is atomic */
@@ -1042,6 +1144,9 @@
 			if (req == NULL) {
 				spin_unlock_irq(&io_request_lock);
 				freereq = __get_request_wait(q, rw);
+				head = &q->queue_head;
+				spin_lock_irq(&io_request_lock);
+				get_request_wait_wakeup(q, rw);
 				goto again;
 			}
 		}
@@ -1063,6 +1168,7 @@
 	req->rq_dev = bh->b_rdev;
 	req->start_time = jiffies;
 	req_new_io(req, 0, count);
+	q->num_req++;
 	blk_started_io(count);
 	add_request(q, req, insert_here);
 out:
@@ -1196,8 +1302,15 @@
 	bh->b_rdev = bh->b_dev;
 	bh->b_rsector = bh->b_blocknr * count;
 
+	get_bh(bh);
 	generic_make_request(rw, bh);
 
+	/* fix race condition with wait_on_buffer() */
+	smp_mb(); /* spin_unlock may have inclusive semantics */
+	if (waitqueue_active(&bh->b_wait))
+		wake_up(&bh->b_wait);
+
+	put_bh(bh);
 	switch (rw) {
 		case WRITE:
 			kstat.pgpgout += count;
--- 1.83/fs/buffer.c	Wed May 14 12:51:00 2003
+++ edited/fs/buffer.c	Mon Jun  9 13:55:22 2003
@@ -153,10 +153,23 @@
 	get_bh(bh);
 	add_wait_queue(&bh->b_wait, &wait);
 	do {
-		run_task_queue(&tq_disk);
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 		if (!buffer_locked(bh))
 			break;
+		/*
+		 * We must read tq_disk in TQ_ACTIVE after the
+		 * add_wait_queue effect is visible to other cpus.
+		 * We could unplug some line above it wouldn't matter
+		 * but we can't do that right after add_wait_queue
+		 * without an smp_mb() in between because spin_unlock
+		 * has inclusive semantics.
+		 * Doing it here is the most efficient place so we
+		 * don't do a suprious unplug if we get a racy
+		 * wakeup that make buffer_locked to return 0, and
+		 * doing it here avoids an explicit smp_mb() we
+		 * rely on the implicit one in set_task_state.
+		 */
+		run_task_queue(&tq_disk);
 		schedule();
 	} while (buffer_locked(bh));
 	tsk->state = TASK_RUNNING;
@@ -1507,6 +1520,9 @@
 
 	/* Done - end_buffer_io_async will unlock */
 	SetPageUptodate(page);
+
+	wakeup_page_waiters(page);
+
 	return 0;
 
 out:
@@ -1538,6 +1554,7 @@
 	} while (bh != head);
 	if (need_unlock)
 		UnlockPage(page);
+	wakeup_page_waiters(page);
 	return err;
 }
 
@@ -1765,6 +1782,8 @@
 		else
 			submit_bh(READ, bh);
 	}
+
+	wakeup_page_waiters(page);
 	
 	return 0;
 }
@@ -2378,6 +2397,7 @@
 		submit_bh(rw, bh);
 		bh = next;
 	} while (bh != head);
+	wakeup_page_waiters(page);
 	return 0;
 }
 
--- 1.49/fs/super.c	Wed Dec 18 21:34:24 2002
+++ edited/fs/super.c	Mon Jun  9 12:17:24 2003
@@ -726,6 +726,7 @@
 	if (!fs_type->read_super(s, data, flags & MS_VERBOSE ? 1 : 0))
 		goto Einval;
 	s->s_flags |= MS_ACTIVE;
+	blk_reset_stats(dev);
 	path_release(&nd);
 	return s;
 
--- 1.45/fs/reiserfs/inode.c	Thu May 22 16:35:02 2003
+++ edited/fs/reiserfs/inode.c	Mon Jun  9 12:17:24 2003
@@ -2048,6 +2048,7 @@
     */
     if (nr) {
         submit_bh_for_writepage(arr, nr) ;
+	wakeup_page_waiters(page);
     } else {
         UnlockPage(page) ;
     }
--- 1.23/include/linux/blkdev.h	Fri Nov 29 17:03:01 2002
+++ edited/include/linux/blkdev.h	Mon Jun  9 17:31:18 2003
@@ -126,6 +126,14 @@
 	 */
 	char			head_active;
 
+	/*
+	 * Booleans that indicate whether the queue's free requests have
+	 * been exhausted and is waiting to drop below the batch_requests
+	 * threshold
+	 */
+	char			read_full;
+	char			write_full;
+
 	unsigned long		bounce_pfn;
 
 	/*
@@ -138,8 +146,17 @@
 	 * Tasks wait here for free read and write requests
 	 */
 	wait_queue_head_t	wait_for_requests[2];
+	unsigned long           max_wait;
+	unsigned long           min_wait;
+	unsigned long           total_wait;
+	unsigned long           num_req;
+	unsigned long           num_wait;
+	unsigned long           deviation[5];
 };
 
+void blk_reset_stats(kdev_t dev);
+void blk_print_stats(kdev_t dev);
+
 #define blk_queue_plugged(q)	(q)->plugged
 #define blk_fs_request(rq)	((rq)->cmd == READ || (rq)->cmd == WRITE)
 #define blk_queue_empty(q)	list_empty(&(q)->queue_head)
@@ -156,6 +173,33 @@
 	}
 }
 
+static inline void set_queue_full(request_queue_t *q, int rw)
+{
+	wmb();
+	if (rw == READ)
+		q->read_full = 1;
+	else
+		q->write_full = 1;
+}
+
+static inline void clear_queue_full(request_queue_t *q, int rw)
+{
+	wmb();
+	if (rw == READ)
+		q->read_full = 0;
+	else
+		q->write_full = 0;
+}
+
+static inline int queue_full(request_queue_t *q, int rw)
+{
+	rmb();
+	if (rw == READ)
+		return q->read_full;
+	else
+		return q->write_full;
+}
+
 extern unsigned long blk_max_low_pfn, blk_max_pfn;
 
 #define BLK_BOUNCE_HIGH		(blk_max_low_pfn << PAGE_SHIFT)
@@ -217,6 +261,7 @@
 extern void generic_make_request(int rw, struct buffer_head * bh);
 extern inline request_queue_t *blk_get_queue(kdev_t dev);
 extern void blkdev_release_request(struct request *);
+extern void blk_print_stats(kdev_t dev);
 
 /*
  * Access functions for manipulating queue properties
--- 1.19/include/linux/pagemap.h	Sun Aug 25 15:32:11 2002
+++ edited/include/linux/pagemap.h	Mon Jun  9 14:47:11 2003
@@ -97,6 +97,8 @@
 		___wait_on_page(page);
 }
 
+extern void FASTCALL(wakeup_page_waiters(struct page * page));
+
 /*
  * Returns locked page at given index in given cache, creating it if needed.
  */
--- 1.68/kernel/ksyms.c	Fri May 23 17:40:47 2003
+++ edited/kernel/ksyms.c	Mon Jun  9 12:17:24 2003
@@ -295,6 +295,7 @@
 EXPORT_SYMBOL(filemap_fdatawait);
 EXPORT_SYMBOL(lock_page);
 EXPORT_SYMBOL(unlock_page);
+EXPORT_SYMBOL(wakeup_page_waiters);
 
 /* device registration */
 EXPORT_SYMBOL(register_chrdev);
--- 1.77/mm/filemap.c	Thu Apr 24 11:05:10 2003
+++ edited/mm/filemap.c	Mon Jun  9 12:17:24 2003
@@ -812,6 +812,20 @@
 	return &wait[hash];
 }
 
+/*
+ * This must be called after every submit_bh with end_io
+ * callbacks that would result into the blkdev layer waking
+ * up the page after a queue unplug.
+ */
+void wakeup_page_waiters(struct page * page)
+{
+	wait_queue_head_t * head;
+
+	head = page_waitqueue(page);
+	if (waitqueue_active(head))
+		wake_up(head);
+}
+
 /* 
  * Wait for a page to get unlocked.
  *






^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls (was: -rc7   Re: Linux 2.4.21-rc6)
  2003-06-09 21:39                 ` [PATCH] io stalls (was: -rc7 Re: Linux 2.4.21-rc6) Chris Mason
@ 2003-06-09 22:19                   ` Andrea Arcangeli
  2003-06-10  0:27                     ` Chris Mason
  2003-06-10 23:13                     ` Chris Mason
  2003-06-09 23:51                   ` [PATCH] io stalls Nick Piggin
  2003-06-11  0:33                   ` [PATCH] io stalls (was: -rc7 Re: Linux 2.4.21-rc6) Andrea Arcangeli
  2 siblings, 2 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2003-06-09 22:19 UTC (permalink / raw)
  To: Chris Mason
  Cc: Nick Piggin, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller

[-- Attachment #1: Type: text/plain, Size: 3069 bytes --]

Hi,

On Mon, Jun 09, 2003 at 05:39:23PM -0400, Chris Mason wrote:
> Ok, there are lots of different problems here, and I've spent a little
> while trying to get some numbers with the __get_request_wait stats patch
> I posted before.  This is all on ext2, since I wanted to rule out
> interactions with the journal flavors.
> 
> Basically a dbench 90 run on ext2 rc6 vanilla kernels can generate
> latencies of over 2700 jiffies in __get_request_wait, with an average
> latency over 250 jiffies.
> 
> No, most desktop workloads aren't dbench 90, but between balance_dirty()
> and the way we send stuff to disk during memory allocations, just about
> any process can get stuck submitting dirty buffers even if you've just
> got one process doing a dd if=/dev/zero of=foo.
> 
> So, for the moment I'm going to pretend people seeing stalls in X are
> stuck in atime updates or memory allocations, or reading proc or some
> other silly spot.  
> 
> For the SMP corner cases, I've merged Andrea's fix-pausing patch into
> rc7, along with an altered form of Nick Piggin's queue_full patch to try
> and fix the latency problems.
> 
> The major difference from Nick's patch is that once the queue is marked
> full, I don't clear the full flag until the wait queue is empty.  This
> means new io can't steal available requests until every existing waiter
> has been granted a request.
> 
> The latency results are better, with average time spent in
> __get_request_wait being around 28 jiffies, and a max of 170 jiffies. 
> The cost is throughput; further benchmarking needs to be done, but I
> wanted to get this out for review and testing.  It should at least help
> us decide if the request allocation code really is causing our problems.
> 
> The patch below also includes the __get_request_wait latency stats.  If
> people try this and still see stalls, please run elvtune /dev/xxxx and
> send along the resulting console output.
> 
> I haven't yet compared this to Andrea's elevator latency code, but the
> stat patch was originally developed on top of his 2.4.21pre3aa1, where
> the average wait was 97 jiffies and the max was 318.
> 
> Anyway, less talk, more code.  Treat this with care, it has only been
> lightly tested.  Thanks to Andrea and Nick whose patches this is largely
> based on:

I spent last Saturday working on this too. This is the status of my
current patches; it would be interesting to compare them. They're not
very well tested yet, though.

They would obsolete the old fix-pausing and the old elevator-lowlatency
patches (I was going to release a new tree today, but I delayed it so I
could fix uml today first [tested with skas and w/o skas]).

Those back out the rc7 interactivity changes (the only one that wasn't
in my tree was the add_wait_queue_exclusive change, which IMHO should
stay for scalability reasons).

Of course I would be very interested to know if those two patches (or
Chris's, which also retains the exclusive wakeup) are still greatly
improved by removing the _exclusive wakeups and going wake-all (in
theory they shouldn't be).
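
For reference, this is the distinction in question, using the names from
the fix-pausing variant above (just a sketch of the 2.4 wait queue
semantics, not part of either patch: wake_up() wakes every non-exclusive
sleeper but at most one exclusive sleeper, and exclusive sleepers are
queued at the tail):

	DECLARE_WAITQUEUE(wait, current);

	/* wake-all: every sleeper wakes on each wake_up() and races
	 * with the others for the freed request */
	add_wait_queue(&q->wait_for_requests[rw], &wait);

	/* exclusive: only one sleeper is woken per wake_up(), which avoids
	 * the thundering herd but lets a newly arrived task compete with it */
	add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);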

Andrea

[-- Attachment #2: 9980_fix-pausing-4 --]
[-- Type: text/plain, Size: 8297 bytes --]

diff -urNp --exclude CVS --exclude BitKeeper xx-ref/drivers/block/ll_rw_blk.c xx/drivers/block/ll_rw_blk.c
--- xx-ref/drivers/block/ll_rw_blk.c	2003-06-07 15:22:23.000000000 +0200
+++ xx/drivers/block/ll_rw_blk.c	2003-06-07 15:22:27.000000000 +0200
@@ -596,12 +596,20 @@ static struct request *__get_request_wai
 	register struct request *rq;
 	DECLARE_WAITQUEUE(wait, current);
 
-	add_wait_queue(&q->wait_for_requests[rw], &wait);
+	add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
 	do {
 		set_current_state(TASK_UNINTERRUPTIBLE);
-		generic_unplug_device(q);
-		if (q->rq[rw].count == 0)
+		if (q->rq[rw].count == 0) {
+			/*
+			 * All we care about is not to stall if any request
+			 * is been released after we set TASK_UNINTERRUPTIBLE.
+			 * This is the most efficient place to unplug the queue
+			 * in case we hit the race and we can get the request
+			 * without waiting.
+			 */
+			generic_unplug_device(q);
 			schedule();
+		}
 		spin_lock_irq(q->queue_lock);
 		rq = get_request(q, rw);
 		spin_unlock_irq(q->queue_lock);
@@ -611,6 +619,17 @@ static struct request *__get_request_wai
 	return rq;
 }
 
+static void get_request_wait_wakeup(request_queue_t *q, int rw)
+{
+	/*
+	 * avoid losing an unplug if a second __get_request_wait did the
+	 * generic_unplug_device while our __get_request_wait was running
+	 * w/o the queue_lock held and w/ our request out of the queue.
+	 */
+	if (waitqueue_active(&q->wait_for_requests[rw]))
+		wake_up(&q->wait_for_requests[rw]);
+}
+
 /* RO fail safe mechanism */
 
 static long ro_bits[MAX_BLKDEV][8];
@@ -835,8 +854,11 @@ void blkdev_release_request(struct reque
 	 */
 	if (q) {
 		list_add(&req->queue, &q->rq[rw].free);
-		if (++q->rq[rw].count >= q->batch_requests)
-			wake_up(&q->wait_for_requests[rw]);
+		if (++q->rq[rw].count >= q->batch_requests) {
+			smp_mb();
+			if (waitqueue_active(&q->wait_for_requests[rw]))
+				wake_up(&q->wait_for_requests[rw]);
+		}
 	}
 }
 
@@ -954,7 +976,6 @@ static int __make_request(request_queue_
 	 */
 	max_sectors = get_max_sectors(bh->b_rdev);
 
-again:
 	req = NULL;
 	head = &q->queue_head;
 	/*
@@ -963,6 +984,7 @@ again:
 	 */
 	spin_lock_irq(q->queue_lock);
 
+again:
 	insert_here = head->prev;
 	if (list_empty(head)) {
 		q->plug_device_fn(q, bh->b_rdev); /* is atomic */
@@ -1048,6 +1070,9 @@ get_rq:
 			if (req == NULL) {
 				spin_unlock_irq(q->queue_lock);
 				freereq = __get_request_wait(q, rw);
+				head = &q->queue_head;
+				spin_lock_irq(q->queue_lock);
+				get_request_wait_wakeup(q, rw);
 				goto again;
 			}
 		}
@@ -1202,8 +1227,21 @@ void __submit_bh(int rw, struct buffer_h
 	bh->b_rdev = bh->b_dev;
 	bh->b_rsector = blocknr;
 
+	/*
+	 * we play with the bh wait queue below, need to keep a
+	 * reference so the buffer doesn't get freed after the
+	 * end_io handler runs
+	 */
+	get_bh(bh);
+
 	generic_make_request(rw, bh);
 
+	/* fix race condition with wait_on_buffer() */
+	smp_mb(); /* spin_unlock may have inclusive semantics */
+	if (waitqueue_active(&bh->b_wait))
+		wake_up(&bh->b_wait);
+	put_bh(bh);
+
 	switch (rw) {
 		case WRITE:
 			kstat.pgpgout += count;
diff -urNp --exclude CVS --exclude BitKeeper xx-ref/fs/buffer.c xx/fs/buffer.c
--- xx-ref/fs/buffer.c	2003-06-07 15:22:23.000000000 +0200
+++ xx/fs/buffer.c	2003-06-07 15:22:27.000000000 +0200
@@ -158,10 +158,23 @@ void __wait_on_buffer(struct buffer_head
 	get_bh(bh);
 	add_wait_queue(&bh->b_wait, &wait);
 	do {
-		run_task_queue(&tq_disk);
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 		if (!buffer_locked(bh))
 			break;
+		/*
+		 * We must read tq_disk in TQ_ACTIVE after the
+		 * add_wait_queue effect is visible to other cpus.
+		 * We could unplug some line above it wouldn't matter
+		 * but we can't do that right after add_wait_queue
+		 * without an smp_mb() in between because spin_unlock
+		 * has inclusive semantics.
+		 * Doing it here is the most efficient place so we
+		 * don't do a suprious unplug if we get a racy
+		 * wakeup that make buffer_locked to return 0, and
+		 * doing it here avoids an explicit smp_mb() we
+		 * rely on the implicit one in set_task_state.
+		 */
+		run_task_queue(&tq_disk);
 		schedule();
 	} while (buffer_locked(bh));
 	tsk->state = TASK_RUNNING;
@@ -1471,6 +1484,7 @@ static int __block_write_full_page(struc
 
 	if (!page->buffers)
 		create_empty_buffers(page, inode->i_dev, 1 << inode->i_blkbits);
+	BUG_ON(page_count(page) < 3);
 	head = page->buffers;
 
 	block = page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
@@ -1517,6 +1531,9 @@ static int __block_write_full_page(struc
 
 	/* Done - end_buffer_io_async will unlock */
 	SetPageUptodate(page);
+
+	wakeup_page_waiters(page);
+
 	return 0;
 
 out:
@@ -1548,6 +1565,7 @@ out:
 	} while (bh != head);
 	if (need_unlock)
 		UnlockPage(page);
+	wakeup_page_waiters(page);
 	return err;
 }
 
@@ -1721,6 +1739,7 @@ int block_read_full_page(struct page *pa
 	blocksize = 1 << inode->i_blkbits;
 	if (!page->buffers)
 		create_empty_buffers(page, inode->i_dev, blocksize);
+	BUG_ON(page_count(page) < 3);
 	head = page->buffers;
 
 	blocks = PAGE_CACHE_SIZE >> inode->i_blkbits;
@@ -1781,6 +1800,8 @@ int block_read_full_page(struct page *pa
 		else
 			submit_bh(READ, bh);
 	}
+
+	wakeup_page_waiters(page);
 	
 	return 0;
 }
@@ -2400,6 +2421,7 @@ int brw_page(int rw, struct page *page, 
 
 	if (!page->buffers)
 		create_empty_buffers(page, dev, size);
+	BUG_ON(page_count(page) < 3);
 	head = bh = page->buffers;
 
 	/* Stage 1: lock all the buffers */
@@ -2417,6 +2439,7 @@ int brw_page(int rw, struct page *page, 
 		submit_bh(rw, bh);
 		bh = next;
 	} while (bh != head);
+	wakeup_page_waiters(page);
 	return 0;
 }
 
diff -urNp --exclude CVS --exclude BitKeeper xx-ref/fs/reiserfs/inode.c xx/fs/reiserfs/inode.c
--- xx-ref/fs/reiserfs/inode.c	2003-06-07 15:22:11.000000000 +0200
+++ xx/fs/reiserfs/inode.c	2003-06-07 15:22:27.000000000 +0200
@@ -2048,6 +2048,7 @@ static int reiserfs_write_full_page(stru
     */
     if (nr) {
         submit_bh_for_writepage(arr, nr) ;
+	wakeup_page_waiters(page);
     } else {
         UnlockPage(page) ;
     }
diff -urNp --exclude CVS --exclude BitKeeper xx-ref/include/linux/pagemap.h xx/include/linux/pagemap.h
--- xx-ref/include/linux/pagemap.h	2003-06-07 15:22:23.000000000 +0200
+++ xx/include/linux/pagemap.h	2003-06-07 15:22:27.000000000 +0200
@@ -98,6 +98,8 @@ static inline void wait_on_page(struct p
 		___wait_on_page(page);
 }
 
+extern void FASTCALL(wakeup_page_waiters(struct page * page));
+
 /*
  * Returns locked page at given index in given cache, creating it if needed.
  */
diff -urNp --exclude CVS --exclude BitKeeper xx-ref/kernel/ksyms.c xx/kernel/ksyms.c
--- xx-ref/kernel/ksyms.c	2003-06-07 15:22:23.000000000 +0200
+++ xx/kernel/ksyms.c	2003-06-07 15:22:27.000000000 +0200
@@ -319,6 +319,7 @@ EXPORT_SYMBOL(filemap_fdatasync);
 EXPORT_SYMBOL(filemap_fdatawait);
 EXPORT_SYMBOL(lock_page);
 EXPORT_SYMBOL(unlock_page);
+EXPORT_SYMBOL(wakeup_page_waiters);
 
 /* device registration */
 EXPORT_SYMBOL(register_chrdev);
diff -urNp --exclude CVS --exclude BitKeeper xx-ref/mm/filemap.c xx/mm/filemap.c
--- xx-ref/mm/filemap.c	2003-06-07 15:22:23.000000000 +0200
+++ xx/mm/filemap.c	2003-06-07 15:22:27.000000000 +0200
@@ -779,6 +779,20 @@ inline wait_queue_head_t * page_waitqueu
 	return wait_table_hashfn(page, &pgdat->wait_table);
 }
 
+/*
+ * This must be called after every submit_bh with end_io
+ * callbacks that would result into the blkdev layer waking
+ * up the page after a queue unplug.
+ */
+void wakeup_page_waiters(struct page * page)
+{
+	wait_queue_head_t * head;
+
+	head = page_waitqueue(page);
+	if (waitqueue_active(head))
+		wake_up(head);
+}
+
 /* 
  * Wait for a page to get unlocked.
  *
diff -urNp --exclude CVS --exclude BitKeeper xx-ref/mm/swapfile.c xx/mm/swapfile.c
--- xx-ref/mm/swapfile.c	2003-06-07 15:22:23.000000000 +0200
+++ xx/mm/swapfile.c	2003-06-07 15:22:44.000000000 +0200
@@ -984,8 +984,10 @@ asmlinkage long sys_swapon(const char * 
 		goto bad_swap;
 	}
 
+	get_page(virt_to_page(swap_header));
 	lock_page(virt_to_page(swap_header));
 	rw_swap_page_nolock(READ, SWP_ENTRY(type,0), (char *) swap_header);
+	put_page(virt_to_page(swap_header));
 
 	if (!memcmp("SWAP-SPACE",swap_header->magic.magic,10))
 		swap_header_version = 1;

[-- Attachment #3: 9981_elevator-lowlatency-5 --]
[-- Type: text/plain, Size: 23407 bytes --]

Binary files x-ref/ID and x/ID differ
diff -urNp --exclude CVS --exclude BitKeeper x-ref/drivers/block/DAC960.c x/drivers/block/DAC960.c
--- x-ref/drivers/block/DAC960.c	2002-11-29 02:22:58.000000000 +0100
+++ x/drivers/block/DAC960.c	2003-06-07 12:37:50.000000000 +0200
@@ -19,8 +19,8 @@
 */
 
 
-#define DAC960_DriverVersion			"2.4.11"
-#define DAC960_DriverDate			"11 October 2001"
+#define DAC960_DriverVersion			"2.4.20aa1"
+#define DAC960_DriverDate			"4 December 2002"
 
 
 #include <linux/version.h>
@@ -2975,8 +2975,9 @@ static boolean DAC960_ProcessRequest(DAC
   Command->SegmentCount = Request->nr_segments;
   Command->BufferHeader = Request->bh;
   Command->RequestBuffer = Request->buffer;
+  Command->Request = Request;
   blkdev_dequeue_request(Request);
-  blkdev_release_request(Request);
+  /* blkdev_release_request(Request); */
   DAC960_QueueReadWriteCommand(Command);
   return true;
 }
@@ -3023,11 +3024,12 @@ static void DAC960_RequestFunction(Reque
   individual Buffer.
 */
 
-static inline void DAC960_ProcessCompletedBuffer(BufferHeader_T *BufferHeader,
+static inline void DAC960_ProcessCompletedBuffer(IO_Request_T *Req, BufferHeader_T *BufferHeader,
 						 boolean SuccessfulIO)
 {
-  blk_finished_io(BufferHeader->b_size >> 9);
+  blk_finished_io(Req, BufferHeader->b_size >> 9);
   BufferHeader->b_end_io(BufferHeader, SuccessfulIO);
+  
 }
 
 
@@ -3116,9 +3118,10 @@ static void DAC960_V1_ProcessCompletedCo
 	    {
 	      BufferHeader_T *NextBufferHeader = BufferHeader->b_reqnext;
 	      BufferHeader->b_reqnext = NULL;
-	      DAC960_ProcessCompletedBuffer(BufferHeader, true);
+	      DAC960_ProcessCompletedBuffer(Command->Request, BufferHeader, true);
 	      BufferHeader = NextBufferHeader;
 	    }
+  	  blkdev_release_request(Command->Request);
 	  if (Command->Completion != NULL)
 	    {
 	      complete(Command->Completion);
@@ -3161,7 +3164,7 @@ static void DAC960_V1_ProcessCompletedCo
 	    {
 	      BufferHeader_T *NextBufferHeader = BufferHeader->b_reqnext;
 	      BufferHeader->b_reqnext = NULL;
-	      DAC960_ProcessCompletedBuffer(BufferHeader, false);
+	      DAC960_ProcessCompletedBuffer(Command->Request, BufferHeader, false);
 	      BufferHeader = NextBufferHeader;
 	    }
 	  if (Command->Completion != NULL)
@@ -3169,6 +3172,7 @@ static void DAC960_V1_ProcessCompletedCo
 	      complete(Command->Completion);
 	      Command->Completion = NULL;
 	    }
+  	  blkdev_release_request(Command->Request);
 	}
     }
   else if (CommandType == DAC960_ReadRetryCommand ||
@@ -3180,12 +3184,12 @@ static void DAC960_V1_ProcessCompletedCo
 	Perform completion processing for this single buffer.
       */
       if (CommandStatus == DAC960_V1_NormalCompletion)
-	DAC960_ProcessCompletedBuffer(BufferHeader, true);
+	DAC960_ProcessCompletedBuffer(Command->Request, BufferHeader, true);
       else
 	{
 	  if (CommandStatus != DAC960_V1_LogicalDriveNonexistentOrOffline)
 	    DAC960_V1_ReadWriteError(Command);
-	  DAC960_ProcessCompletedBuffer(BufferHeader, false);
+	  DAC960_ProcessCompletedBuffer(Command->Request, BufferHeader, false);
 	}
       if (NextBufferHeader != NULL)
 	{
@@ -3203,6 +3207,7 @@ static void DAC960_V1_ProcessCompletedCo
 	  DAC960_QueueCommand(Command);
 	  return;
 	}
+        blkdev_release_request(Command->Request);
     }
   else if (CommandType == DAC960_MonitoringCommand ||
 	   CommandOpcode == DAC960_V1_Enquiry ||
@@ -4222,9 +4227,10 @@ static void DAC960_V2_ProcessCompletedCo
 	    {
 	      BufferHeader_T *NextBufferHeader = BufferHeader->b_reqnext;
 	      BufferHeader->b_reqnext = NULL;
-	      DAC960_ProcessCompletedBuffer(BufferHeader, true);
+	      DAC960_ProcessCompletedBuffer(Command->Request, BufferHeader, true);
 	      BufferHeader = NextBufferHeader;
 	    }
+  	  blkdev_release_request(Command->Request);
 	  if (Command->Completion != NULL)
 	    {
 	      complete(Command->Completion);
@@ -4267,9 +4273,10 @@ static void DAC960_V2_ProcessCompletedCo
 	    {
 	      BufferHeader_T *NextBufferHeader = BufferHeader->b_reqnext;
 	      BufferHeader->b_reqnext = NULL;
-	      DAC960_ProcessCompletedBuffer(BufferHeader, false);
+	      DAC960_ProcessCompletedBuffer(Command->Request, BufferHeader, false);
 	      BufferHeader = NextBufferHeader;
 	    }
+  	  blkdev_release_request(Command->Request);
 	  if (Command->Completion != NULL)
 	    {
 	      complete(Command->Completion);
@@ -4286,12 +4293,12 @@ static void DAC960_V2_ProcessCompletedCo
 	Perform completion processing for this single buffer.
       */
       if (CommandStatus == DAC960_V2_NormalCompletion)
-	DAC960_ProcessCompletedBuffer(BufferHeader, true);
+	DAC960_ProcessCompletedBuffer(Command->Request, BufferHeader, true);
       else
 	{
 	  if (Command->V2.RequestSense.SenseKey != DAC960_SenseKey_NotReady)
 	    DAC960_V2_ReadWriteError(Command);
-	  DAC960_ProcessCompletedBuffer(BufferHeader, false);
+	  DAC960_ProcessCompletedBuffer(Command->Request, BufferHeader, false);
 	}
       if (NextBufferHeader != NULL)
 	{
@@ -4319,6 +4326,7 @@ static void DAC960_V2_ProcessCompletedCo
 	  DAC960_QueueCommand(Command);
 	  return;
 	}
+        blkdev_release_request(Command->Request);
     }
   else if (CommandType == DAC960_MonitoringCommand)
     {
diff -urNp --exclude CVS --exclude BitKeeper x-ref/drivers/block/DAC960.h x/drivers/block/DAC960.h
--- x-ref/drivers/block/DAC960.h	2002-01-22 18:54:52.000000000 +0100
+++ x/drivers/block/DAC960.h	2003-06-07 12:37:50.000000000 +0200
@@ -2282,6 +2282,7 @@ typedef struct DAC960_Command
   unsigned int SegmentCount;
   BufferHeader_T *BufferHeader;
   void *RequestBuffer;
+  IO_Request_T *Request;
   union {
     struct {
       DAC960_V1_CommandMailbox_T CommandMailbox;
@@ -4265,12 +4266,4 @@ static void DAC960_Message(DAC960_Messag
 static void DAC960_CreateProcEntries(void);
 static void DAC960_DestroyProcEntries(void);
 
-
-/*
-  Export the Kernel Mode IOCTL interface.
-*/
-
-EXPORT_SYMBOL(DAC960_KernelIOCTL);
-
-
 #endif /* DAC960_DriverVersion */
diff -urNp --exclude CVS --exclude BitKeeper x-ref/drivers/block/cciss.c x/drivers/block/cciss.c
--- x-ref/drivers/block/cciss.c	2003-06-07 12:37:40.000000000 +0200
+++ x/drivers/block/cciss.c	2003-06-07 12:37:50.000000000 +0200
@@ -1990,14 +1990,14 @@ static void start_io( ctlr_info_t *h)
 	}
 }
 
-static inline void complete_buffers( struct buffer_head *bh, int status)
+static inline void complete_buffers(struct request * req, struct buffer_head *bh, int status)
 {
 	struct buffer_head *xbh;
 	
 	while(bh) {
 		xbh = bh->b_reqnext; 
 		bh->b_reqnext = NULL; 
-		blk_finished_io(bh->b_size >> 9);
+		blk_finished_io(req, bh->b_size >> 9);
 		bh->b_end_io(bh, status);
 		bh = xbh;
 	}
@@ -2140,7 +2140,7 @@ static inline void complete_command( ctl
 		pci_unmap_page(hba[cmd->ctlr]->pdev,
 			temp64.val, cmd->SG[i].Len, ddir);
 	}
-	complete_buffers(cmd->rq->bh, status);
+	complete_buffers(cmd->rq, cmd->rq->bh, status);
 #ifdef CCISS_DEBUG
 	printk("Done with %p\n", cmd->rq);
 #endif /* CCISS_DEBUG */ 
@@ -2224,7 +2224,7 @@ next:
                 printk(KERN_WARNING "doreq cmd for %d, %x at %p\n",
                                 h->ctlr, creq->rq_dev, creq);
                 blkdev_dequeue_request(creq);
-                complete_buffers(creq->bh, 0);
+                complete_buffers(creq, creq->bh, 0);
 		end_that_request_last(creq);
 		goto startio;
         }
diff -urNp --exclude CVS --exclude BitKeeper x-ref/drivers/block/cpqarray.c x/drivers/block/cpqarray.c
--- x-ref/drivers/block/cpqarray.c	2003-06-07 12:37:38.000000000 +0200
+++ x/drivers/block/cpqarray.c	2003-06-07 12:37:50.000000000 +0200
@@ -169,7 +169,7 @@ static void start_io(ctlr_info_t *h);
 
 static inline void addQ(cmdlist_t **Qptr, cmdlist_t *c);
 static inline cmdlist_t *removeQ(cmdlist_t **Qptr, cmdlist_t *c);
-static inline void complete_buffers(struct buffer_head *bh, int ok);
+static inline void complete_buffers(struct request * req, struct buffer_head *bh, int ok);
 static inline void complete_command(cmdlist_t *cmd, int timeout);
 
 static void do_ida_intr(int irq, void *dev_id, struct pt_regs * regs);
@@ -981,7 +981,7 @@ next:
 		printk(KERN_WARNING "doreq cmd for %d, %x at %p\n",
 				h->ctlr, creq->rq_dev, creq);
 		blkdev_dequeue_request(creq);
-		complete_buffers(creq->bh, 0);
+		complete_buffers(creq, creq->bh, 0);
 		end_that_request_last(creq);
 		goto startio;
 	}
@@ -1082,14 +1082,14 @@ static void start_io(ctlr_info_t *h)
 	}
 }
 
-static inline void complete_buffers(struct buffer_head *bh, int ok)
+static inline void complete_buffers(struct request * req, struct buffer_head *bh, int ok)
 {
 	struct buffer_head *xbh;
 	while(bh) {
 		xbh = bh->b_reqnext;
 		bh->b_reqnext = NULL;
 		
-		blk_finished_io(bh->b_size >> 9);
+		blk_finished_io(req, bh->b_size >> 9);
 		bh->b_end_io(bh, ok);
 
 		bh = xbh;
@@ -1131,7 +1131,7 @@ static inline void complete_command(cmdl
                         (cmd->req.hdr.cmd == IDA_READ) ? PCI_DMA_FROMDEVICE : PCI_DMA_TODEVICE);
         }
 
-	complete_buffers(cmd->rq->bh, ok);
+	complete_buffers(cmd->rq, cmd->rq->bh, ok);
 	DBGPX(printk("Done with %p\n", cmd->rq););
 	req_finished_io(cmd->rq);
 	end_that_request_last(cmd->rq);
diff -urNp --exclude CVS --exclude BitKeeper x-ref/drivers/block/ll_rw_blk.c x/drivers/block/ll_rw_blk.c
--- x-ref/drivers/block/ll_rw_blk.c	2003-06-07 12:37:48.000000000 +0200
+++ x/drivers/block/ll_rw_blk.c	2003-06-07 12:53:40.000000000 +0200
@@ -183,11 +183,12 @@ void blk_cleanup_queue(request_queue_t *
 {
 	int count = q->nr_requests;
 
-	count -= __blk_cleanup_queue(&q->rq[READ]);
-	count -= __blk_cleanup_queue(&q->rq[WRITE]);
+	count -= __blk_cleanup_queue(&q->rq);
 
 	if (count)
 		printk("blk_cleanup_queue: leaked requests (%d)\n", count);
+	if (atomic_read(&q->nr_sectors))
+		printk("blk_cleanup_queue: leaked sectors (%d)\n", atomic_read(&q->nr_sectors));
 
 	memset(q, 0, sizeof(*q));
 }
@@ -396,7 +397,7 @@ void generic_unplug_device(void *data)
  *
  * Returns the (new) number of requests which the queue has available.
  */
-int blk_grow_request_list(request_queue_t *q, int nr_requests)
+int blk_grow_request_list(request_queue_t *q, int nr_requests, int max_queue_sectors)
 {
 	unsigned long flags;
 	/* Several broken drivers assume that this function doesn't sleep,
@@ -406,21 +407,31 @@ int blk_grow_request_list(request_queue_
 	spin_lock_irqsave(q->queue_lock, flags);
 	while (q->nr_requests < nr_requests) {
 		struct request *rq;
-		int rw;
 
 		rq = kmem_cache_alloc(request_cachep, SLAB_ATOMIC);
 		if (rq == NULL)
 			break;
 		memset(rq, 0, sizeof(*rq));
 		rq->rq_status = RQ_INACTIVE;
-		rw = q->nr_requests & 1;
-		list_add(&rq->queue, &q->rq[rw].free);
-		q->rq[rw].count++;
+		list_add(&rq->queue, &q->rq.free);
+		q->rq.count++;
 		q->nr_requests++;
 	}
+	
+	/*
+	 * Wakeup waiters after both one quarter of the
+	 * max-in-fligh queue and one quarter of the requests
+	 * are available again.
+	 */
 	q->batch_requests = q->nr_requests / 4;
 	if (q->batch_requests > 32)
 		q->batch_requests = 32;
+	q->batch_sectors = max_queue_sectors / 4;
+
+	q->max_queue_sectors = max_queue_sectors;
+
+	BUG_ON(!q->batch_sectors);
+	atomic_set(&q->nr_sectors, 0);
 	spin_unlock_irqrestore(q->queue_lock, flags);
 	return q->nr_requests;
 }
@@ -429,23 +440,26 @@ static void blk_init_free_list(request_q
 {
 	struct sysinfo si;
 	int megs;		/* Total memory, in megabytes */
-	int nr_requests;
+	int nr_requests, max_queue_sectors = MAX_QUEUE_SECTORS;
 
-	INIT_LIST_HEAD(&q->rq[READ].free);
-	INIT_LIST_HEAD(&q->rq[WRITE].free);
-	q->rq[READ].count = 0;
-	q->rq[WRITE].count = 0;
+	INIT_LIST_HEAD(&q->rq.free);
+	q->rq.count = 0;
 	q->nr_requests = 0;
 
 	si_meminfo(&si);
 	megs = si.totalram >> (20 - PAGE_SHIFT);
-	nr_requests = 128;
-	if (megs < 32)
+	nr_requests = MAX_NR_REQUESTS;
+	if (megs < 30) {
 		nr_requests /= 2;
-	blk_grow_request_list(q, nr_requests);
+		max_queue_sectors /= 2;
+	}
+	/* notice early if anybody screwed the defaults */
+	BUG_ON(!nr_requests);
+	BUG_ON(!max_queue_sectors);
+
+	blk_grow_request_list(q, nr_requests, max_queue_sectors);
 
-	init_waitqueue_head(&q->wait_for_requests[0]);
-	init_waitqueue_head(&q->wait_for_requests[1]);
+	init_waitqueue_head(&q->wait_for_requests);
 }
 
 static int __make_request(request_queue_t * q, int rw, struct buffer_head * bh);
@@ -514,12 +528,19 @@ void blk_init_queue(request_queue_t * q,
  * Get a free request. io_request_lock must be held and interrupts
  * disabled on the way in.  Returns NULL if there are no free requests.
  */
+static struct request * FASTCALL(get_request(request_queue_t *q, int rw));
 static struct request *get_request(request_queue_t *q, int rw)
 {
 	struct request *rq = NULL;
-	struct request_list *rl = q->rq + rw;
+	struct request_list *rl;
 
-	if (!list_empty(&rl->free)) {
+	if (blk_oversized_queue(q))
+		goto out;
+
+	rl = &q->rq;
+	if (list_empty(&rl->free))
+		q->full = 1;
+	if (!q->full) {
 		rq = blkdev_free_rq(&rl->free);
 		list_del(&rq->queue);
 		rl->count--;
@@ -529,6 +550,7 @@ static struct request *get_request(reque
 		rq->q = q;
 	}
 
+ out:
 	return rq;
 }
 
@@ -596,10 +618,25 @@ static struct request *__get_request_wai
 	register struct request *rq;
 	DECLARE_WAITQUEUE(wait, current);
 
-	add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
+	add_wait_queue_exclusive(&q->wait_for_requests, &wait);
 	do {
 		set_current_state(TASK_UNINTERRUPTIBLE);
-		if (q->rq[rw].count == 0) {
+
+		/*
+		 * We must read rq.count and blk_oversized_queue()
+		 * and unplug the queue atomically (with the
+		 * spinlock being held for the whole duration of the
+		 * operation). Otherwise we risk to unplug the queue
+		 * before the request is visible in the I/O queue.
+		 *
+		 * On the __make_request side we depend on get_request,
+		 * get_request_wait_wakeup and blk_started_io to run
+		 * under the q->queue_lock and to never release it
+		 * until the request is visible in the I/O queue
+		 * (i.e. after add_request).
+		 */
+		spin_lock_irq(q->queue_lock);
+		if (q->full || blk_oversized_queue(q)) {
 			/*
 			 * All we care about is not to stall if any request
 			 * is been released after we set TASK_UNINTERRUPTIBLE.
@@ -607,14 +644,16 @@ static struct request *__get_request_wai
 			 * in case we hit the race and we can get the request
 			 * without waiting.
 			 */
-			generic_unplug_device(q);
+			__generic_unplug_device(q);
+
+			spin_unlock_irq(q->queue_lock);
 			schedule();
+			spin_lock_irq(q->queue_lock);
 		}
-		spin_lock_irq(q->queue_lock);
 		rq = get_request(q, rw);
 		spin_unlock_irq(q->queue_lock);
 	} while (rq == NULL);
-	remove_wait_queue(&q->wait_for_requests[rw], &wait);
+	remove_wait_queue(&q->wait_for_requests, &wait);
 	current->state = TASK_RUNNING;
 	return rq;
 }
@@ -626,8 +665,8 @@ static void get_request_wait_wakeup(requ
 	 * generic_unplug_device while our __get_request_wait was running
 	 * w/o the queue_lock held and w/ our request out of the queue.
 	 */
-	if (waitqueue_active(&q->wait_for_requests[rw]))
-		wake_up(&q->wait_for_requests[rw]);
+	if (waitqueue_active(&q->wait_for_requests))
+		wake_up(&q->wait_for_requests);
 }
 
 /* RO fail safe mechanism */
@@ -843,7 +882,6 @@ static inline void add_request(request_q
 void blkdev_release_request(struct request *req)
 {
 	request_queue_t *q = req->q;
-	int rw = req->cmd;
 
 	req->rq_status = RQ_INACTIVE;
 	req->q = NULL;
@@ -853,11 +891,13 @@ void blkdev_release_request(struct reque
 	 * assume it has free buffers and check waiters
 	 */
 	if (q) {
-		list_add(&req->queue, &q->rq[rw].free);
-		if (++q->rq[rw].count >= q->batch_requests) {
+		list_add(&req->queue, &q->rq.free);
+		if (++q->rq.count >= q->batch_requests && !blk_oversized_queue_batch(q)) {
+			if (q->full)
+				q->full = 0;
 			smp_mb();
-			if (waitqueue_active(&q->wait_for_requests[rw]))
-				wake_up(&q->wait_for_requests[rw]);
+			if (waitqueue_active(&q->wait_for_requests))
+				wake_up(&q->wait_for_requests);
 		}
 	}
 }
@@ -1003,7 +1043,7 @@ again:
 			req->bhtail->b_reqnext = bh;
 			req->bhtail = bh;
 			req->nr_sectors = req->hard_nr_sectors += count;
-			blk_started_io(count);
+			blk_started_io(req, count);
 			drive_stat_acct(req->rq_dev, req->cmd, count, 0);
 			req_new_io(req, 1, count);
 			attempt_back_merge(q, req, max_sectors, max_segments);
@@ -1025,7 +1065,7 @@ again:
 			req->current_nr_sectors = req->hard_cur_sectors = count;
 			req->sector = req->hard_sector = sector;
 			req->nr_sectors = req->hard_nr_sectors += count;
-			blk_started_io(count);
+			blk_started_io(req, count);
 			drive_stat_acct(req->rq_dev, req->cmd, count, 0);
 			req_new_io(req, 1, count);
 			attempt_front_merge(q, head, req, max_sectors, max_segments);
@@ -1058,7 +1098,7 @@ get_rq:
 		 * See description above __get_request_wait()
 		 */
 		if (rw_ahead) {
-			if (q->rq[rw].count < q->batch_requests) {
+			if (q->rq.count < q->batch_requests || blk_oversized_queue_batch(q)) {
 				spin_unlock_irq(q->queue_lock);
 				goto end_io;
 			}
@@ -1094,7 +1134,7 @@ get_rq:
 	req->rq_dev = bh->b_rdev;
 	req->start_time = jiffies;
 	req_new_io(req, 0, count);
-	blk_started_io(count);
+	blk_started_io(req, count);
 	add_request(q, req, insert_here);
 out:
 	if (freereq)
@@ -1391,7 +1431,7 @@ int end_that_request_first (struct reque
 
 	if ((bh = req->bh) != NULL) {
 		nsect = bh->b_size >> 9;
-		blk_finished_io(nsect);
+		blk_finished_io(req, nsect);
 		req->bh = bh->b_reqnext;
 		bh->b_reqnext = NULL;
 		bh->b_end_io(bh, uptodate);
diff -urNp --exclude CVS --exclude BitKeeper x-ref/drivers/scsi/scsi_lib.c x/drivers/scsi/scsi_lib.c
--- x-ref/drivers/scsi/scsi_lib.c	2003-06-07 12:37:47.000000000 +0200
+++ x/drivers/scsi/scsi_lib.c	2003-06-07 12:37:50.000000000 +0200
@@ -384,7 +384,7 @@ static Scsi_Cmnd *__scsi_end_request(Scs
 	do {
 		if ((bh = req->bh) != NULL) {
 			nsect = bh->b_size >> 9;
-			blk_finished_io(nsect);
+			blk_finished_io(req, nsect);
 			req->bh = bh->b_reqnext;
 			bh->b_reqnext = NULL;
 			sectors -= nsect;
diff -urNp --exclude CVS --exclude BitKeeper x-ref/include/linux/blkdev.h x/include/linux/blkdev.h
--- x-ref/include/linux/blkdev.h	2003-06-07 12:37:47.000000000 +0200
+++ x/include/linux/blkdev.h	2003-06-07 12:49:16.000000000 +0200
@@ -64,12 +64,6 @@ typedef int (make_request_fn) (request_q
 typedef void (plug_device_fn) (request_queue_t *q, kdev_t device);
 typedef void (unplug_device_fn) (void *q);
 
-/*
- * Default nr free requests per queue, ll_rw_blk will scale it down
- * according to available RAM at init time
- */
-#define QUEUE_NR_REQUESTS	8192
-
 struct request_list {
 	unsigned int count;
 	struct list_head free;
@@ -80,7 +74,7 @@ struct request_queue
 	/*
 	 * the queue request freelist, one for reads and one for writes
 	 */
-	struct request_list	rq[2];
+	struct request_list	rq;
 
 	/*
 	 * The total number of requests on each queue
@@ -93,6 +87,21 @@ struct request_queue
 	int batch_requests;
 
 	/*
+	 * The total number of 512byte blocks on each queue
+	 */
+	atomic_t nr_sectors;
+
+	/*
+	 * Batching threshold for sleep/wakeup decisions
+	 */
+	int batch_sectors;
+
+	/*
+	 * The max number of 512byte blocks on each queue
+	 */
+	int max_queue_sectors;
+
+	/*
 	 * Together with queue_head for cacheline sharing
 	 */
 	struct list_head	queue_head;
@@ -118,13 +127,20 @@ struct request_queue
 	/*
 	 * Boolean that indicates whether this queue is plugged or not.
 	 */
-	char			plugged;
+	int			plugged:1;
 
 	/*
 	 * Boolean that indicates whether current_request is active or
 	 * not.
 	 */
-	char			head_active;
+	int			head_active:1;
+
+	/*
+	 * Booleans that indicate whether the queue's free requests have
+	 * been exhausted and is waiting to drop below the batch_requests
+	 * threshold
+	 */
+	int                     full:1;
 
 	unsigned long		bounce_pfn;
 
@@ -137,7 +153,7 @@ struct request_queue
 	/*
 	 * Tasks wait here for free read and write requests
 	 */
-	wait_queue_head_t	wait_for_requests[2];
+	wait_queue_head_t	wait_for_requests;
 };
 
 #define blk_queue_plugged(q)	(q)->plugged
@@ -221,7 +237,7 @@ extern void blkdev_release_request(struc
 /*
  * Access functions for manipulating queue properties
  */
-extern int blk_grow_request_list(request_queue_t *q, int nr_requests);
+extern int blk_grow_request_list(request_queue_t *q, int nr_requests, int max_queue_sectors);
 extern void blk_init_queue(request_queue_t *, request_fn_proc *);
 extern void blk_cleanup_queue(request_queue_t *);
 extern void blk_queue_headactive(request_queue_t *, int);
@@ -245,6 +261,8 @@ extern char * blkdev_varyio[MAX_BLKDEV];
 
 #define MAX_SEGMENTS 128
 #define MAX_SECTORS 255
+#define MAX_QUEUE_SECTORS (4 << (20 - 9)) /* 4 mbytes when full sized */
+#define MAX_NR_REQUESTS 1024 /* 1024k when in 512 units, normally min is 1M in 1k units */
 
 #define PageAlignSize(size) (((size) + PAGE_SIZE -1) & PAGE_MASK)
 
@@ -271,8 +289,40 @@ static inline int get_hardsect_size(kdev
 	return retval;
 }
 
-#define blk_finished_io(nsects)	do { } while (0)
-#define blk_started_io(nsects)	do { } while (0)
+static inline int blk_oversized_queue(request_queue_t * q)
+{
+	return atomic_read(&q->nr_sectors) > q->max_queue_sectors;
+}
+
+static inline int blk_oversized_queue_batch(request_queue_t * q)
+{
+	return atomic_read(&q->nr_sectors) > q->max_queue_sectors - q->batch_sectors;
+}
+
+static inline void blk_started_io(struct request * req, int nsects)
+{
+	request_queue_t * q = req->q;
+
+	if (q)
+		atomic_add(nsects, &q->nr_sectors);
+	BUG_ON(atomic_read(&q->nr_sectors) < 0);
+}
+
+static inline void blk_finished_io(struct request * req, int nsects)
+{
+	request_queue_t * q = req->q;
+
+	/* special requests belongs to a null queue */
+	if (q) {
+		atomic_sub(nsects, &q->nr_sectors);
+		if (q->rq.count >= q->batch_requests && !blk_oversized_queue_batch(q)) {
+			smp_mb();
+			if (waitqueue_active(&q->wait_for_requests))
+				wake_up(&q->wait_for_requests);
+		}
+	}
+	BUG_ON(atomic_read(&q->nr_sectors) < 0);
+}
 
 static inline unsigned int blksize_bits(unsigned int size)
 {
diff -urNp --exclude CVS --exclude BitKeeper x-ref/include/linux/elevator.h x/include/linux/elevator.h
--- x-ref/include/linux/elevator.h	2002-11-29 02:23:18.000000000 +0100
+++ x/include/linux/elevator.h	2003-06-07 12:37:50.000000000 +0200
@@ -80,7 +80,7 @@ static inline int elevator_request_laten
 	return latency;
 }
 
-#define ELV_LINUS_SEEK_COST	16
+#define ELV_LINUS_SEEK_COST	1
 
 #define ELEVATOR_NOOP							\
 ((elevator_t) {								\
@@ -93,8 +93,8 @@ static inline int elevator_request_laten
 
 #define ELEVATOR_LINUS							\
 ((elevator_t) {								\
-	2048,				/* read passovers */		\
-	8192,				/* write passovers */		\
+	128,				/* read passovers */		\
+	512,				/* write passovers */		\
 									\
 	elevator_linus_merge,		/* elevator_merge_fn */		\
 	elevator_linus_merge_req,	/* elevator_merge_req_fn */	\
diff -urNp --exclude CVS --exclude BitKeeper x-ref/include/linux/nbd.h x/include/linux/nbd.h
--- x-ref/include/linux/nbd.h	2003-04-01 12:07:54.000000000 +0200
+++ x/include/linux/nbd.h	2003-06-07 12:37:50.000000000 +0200
@@ -48,7 +48,7 @@ nbd_end_request(struct request *req)
 	spin_lock_irqsave(&io_request_lock, flags);
 	while((bh = req->bh) != NULL) {
 		nsect = bh->b_size >> 9;
-		blk_finished_io(nsect);
+		blk_finished_io(req, nsect);
 		req->bh = bh->b_reqnext;
 		bh->b_reqnext = NULL;
 		bh->b_end_io(bh, uptodate);

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls
  2003-06-09 21:39                 ` [PATCH] io stalls (was: -rc7 Re: Linux 2.4.21-rc6) Chris Mason
  2003-06-09 22:19                   ` Andrea Arcangeli
@ 2003-06-09 23:51                   ` Nick Piggin
  2003-06-10  0:32                     ` Chris Mason
  2003-06-10  1:48                     ` Robert White
  2003-06-11  0:33                   ` [PATCH] io stalls (was: -rc7 Re: Linux 2.4.21-rc6) Andrea Arcangeli
  2 siblings, 2 replies; 109+ messages in thread
From: Nick Piggin @ 2003-06-09 23:51 UTC (permalink / raw)
  To: Chris Mason
  Cc: Andrea Arcangeli, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller



Chris Mason wrote:

>Ok, there are lots of different problems here, and I've spent a little
>while trying to get some numbers with the __get_request_wait stats patch
>I posted before.  This is all on ext2, since I wanted to rule out
>interactions with the journal flavors.
>
>Basically a dbench 90 run on ext2 rc6 vanilla kernels can generate
>latencies of over 2700 jiffies in __get_request_wait, with an average
>latency over 250 jiffies.
>
>No, most desktop workloads aren't dbench 90, but between balance_dirty()
>and the way we send stuff to disk during memory allocations, just about
>any process can get stuck submitting dirty buffers even if you've just
>got one process doing a dd if=/dev/zero of=foo.
>
>So, for the moment I'm going to pretend people seeing stalls in X are
>stuck in atime updates or memory allocations, or reading proc or some
>other silly spot.  
>
>For the SMP corner cases, I've merged Andrea's fix-pausing patch into
>rc7, along with an altered form of Nick Piggin's queue_full patch to try
>and fix the latency problems.
>
>The major difference from Nick's patch is that once the queue is marked
>full, I don't clear the full flag until the wait queue is empty.  This
>means new io can't steal available requests until every existing waiter
>has been granted a request.
>

Yes, this is probably a good idea.

>
>The latency results are better, with average time spent in
>__get_request_wait being around 28 jiffies, and a max of 170 jiffies. 
>The cost is throughput, further benchmarking needs to be done but, but I
>wanted to get this out for review and testing.  It should at least help
>us decide if the request allocation code really is causing our problems.
>

Well the latency numbers are good - is this with dbench 90?

snip

> 
>+static inline void set_queue_full(request_queue_t *q, int rw)
>+{
>+	wmb();
>+	if (rw == READ)
>+		q->read_full = 1;
>+	else
>+		q->write_full = 1;
>+}
>+
>+static inline void clear_queue_full(request_queue_t *q, int rw)
>+{
>+	wmb();
>+	if (rw == READ)
>+		q->read_full = 0;
>+	else
>+		q->write_full = 0;
>+}
>+
>+static inline int queue_full(request_queue_t *q, int rw)
>+{
>+	rmb();
>+	if (rw == READ)
>+		return q->read_full;
>+	else
>+		return q->write_full;
>+}
>+
>

I don't think you need the barriers here, do you?


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls (was: -rc7   Re: Linux 2.4.21-rc6)
  2003-06-09 22:19                   ` Andrea Arcangeli
@ 2003-06-10  0:27                     ` Chris Mason
  2003-06-10 23:13                     ` Chris Mason
  1 sibling, 0 replies; 109+ messages in thread
From: Chris Mason @ 2003-06-10  0:27 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nick Piggin, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller

On Mon, 2003-06-09 at 18:19, Andrea Arcangeli wrote:

> > Anyway, less talk, more code.  Treat this with care, it has only been
> > lightly tested.  Thanks to Andrea and Nick whose patches this is largely
> > based on:
> 
> I spent last Saturday working on this too. This is the status of my
> current patches, would be interesting to compare them. they're not very
> well tested yet though.
> 

I'll try to get some numbers in the morning.

> They would obsoletes the old fix-pausing and the old elevator-lowlatency
> (I was going to release a new tree today, but I delayed it so I fixed
> uml today too first [tested with skas and w/o skas]).
> 
> those backout the rc7 interactivity changes (the only one that wasn't in
> my tree was the add_wait_queue_exclusive, that IMHO would better stay
> for scalability reasons).
> 

I didn't test without _exclusive for the final iteration of my patch, but
in all the early ones using _exclusive improved latencies.  I think
people are reporting otherwise because they have hit the sweet spot for
the number of procs going after the requests.  With _exclusive they have a
higher chance of getting starved by a new process coming in; without the
_exclusive, each waiter has a fighting chance of getting to the free
request on their own.  Hopefully we can do better with the _exclusive,
since it does seem to scale much better.

Aside from the io in flight calculations, the major difference between
our patches is in __get_request_wait.  Once a process waits once, that
call to __get_request_wait ignores q->full in my code.

I found the q->full checks did help, but as you increased the number of
concurrent readers/writers things broke down to the old high latencies. 
By delaying the point where q->full was cleared, I could make the
latency benefit last for a higher number of procs.  Finally I gave up
and left it set until all the waiters were gone, which seems to have the
most consistent results.
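
Very roughly, the pattern that falls out of that (a sketch using the helper
names from the posted patches, i.e. clear_queue_full and wait_for_requests,
rather than the exact hunk) is:

	/* a freed request wakes one exclusive waiter; the full flag is
	 * only dropped once the wait queue has completely drained */
	if (waitqueue_active(&q->wait_for_requests[rw]))
		wake_up(&q->wait_for_requests[rw]);
	else
		clear_queue_full(q, rw);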

The interesting part was it didn't seem to change the hit in
throughput.  The cost was about the same between the original patch and
my final one, but I need to test more.

-chris



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls
  2003-06-09 23:51                   ` [PATCH] io stalls Nick Piggin
@ 2003-06-10  0:32                     ` Chris Mason
  2003-06-10  0:47                       ` Nick Piggin
  2003-06-10  1:48                     ` Robert White
  1 sibling, 1 reply; 109+ messages in thread
From: Chris Mason @ 2003-06-10  0:32 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrea Arcangeli, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller

On Mon, 2003-06-09 at 19:51, Nick Piggin wrote:

> >
> >The latency results are better, with average time spent in
> >__get_request_wait being around 28 jiffies, and a max of 170 jiffies. 
> >The cost is throughput, further benchmarking needs to be done but, but I
> >wanted to get this out for review and testing.  It should at least help
> >us decide if the request allocation code really is causing our problems.
> >
> 
> Well the latency numbers are good - is this with dbench 90?
> 

Yes, that number was dbench 90, but dbench 50,90, and 120 gave about the
same stats with the final patch.

> snip

> >+
> >+static inline int queue_full(request_queue_t *q, int rw)
> >+{
> >+	rmb();
> >+	if (rw == READ)
> >+		return q->read_full;
> >+	else
> >+		return q->write_full;
> >+}
> >+
> >
> 
> I don't think you need the barriers here, do you?
> 

I put the barriers in early on when almost all the calls were done
outside spin locks; the current flavor of the patch only does one
clear_queue_full without the io_request_lock held.  It should be enough
to toss a barrier in just that one spot.  But I wanted to leave them in
so I could move things around until the final version (if there ever is
one ;-)

-chris



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls
  2003-06-10  0:32                     ` Chris Mason
@ 2003-06-10  0:47                       ` Nick Piggin
  0 siblings, 0 replies; 109+ messages in thread
From: Nick Piggin @ 2003-06-10  0:47 UTC (permalink / raw)
  To: Chris Mason
  Cc: Andrea Arcangeli, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller



Chris Mason wrote:

>On Mon, 2003-06-09 at 19:51, Nick Piggin wrote:
>
>
>>>The latency results are better, with average time spent in
>>>__get_request_wait being around 28 jiffies, and a max of 170 jiffies. 
>>>The cost is throughput, further benchmarking needs to be done but, but I
>>>wanted to get this out for review and testing.  It should at least help
>>>us decide if the request allocation code really is causing our problems.
>>>
>>>
>>Well the latency numbers are good - is this with dbench 90?
>>
>>
>
>Yes, that number was dbench 90, but dbench 50,90, and 120 gave about the
>same stats with the final patch.
>

Great.

>
>>snip
>>
>
>>>+
>>>+static inline int queue_full(request_queue_t *q, int rw)
>>>+{
>>>+	rmb();
>>>+	if (rw == READ)
>>>+		return q->read_full;
>>>+	else
>>>+		return q->write_full;
>>>+}
>>>+
>>>
>>>
>>I don't think you need the barriers here, do you?
>>
>>
>
>I put the barriers in early on when almost all the calls were done
>outside spin locks, the current flavor of the patch only does one
>clear_queue_full without the io_request_lock held.  It should be enough
>to toss a barrier in just that one spot.  But I wanted to leave them in
>so I could move things around until the final version (if there ever is
>one ;-)
>

Yeah I see.


^ permalink raw reply	[flat|nested] 109+ messages in thread

* RE: [PATCH] io stalls
  2003-06-09 23:51                   ` [PATCH] io stalls Nick Piggin
  2003-06-10  0:32                     ` Chris Mason
@ 2003-06-10  1:48                     ` Robert White
  2003-06-10  2:13                       ` Chris Mason
  2003-06-10  3:22                       ` Nick Piggin
  1 sibling, 2 replies; 109+ messages in thread
From: Robert White @ 2003-06-10  1:48 UTC (permalink / raw)
  To: Nick Piggin, Chris Mason
  Cc: Andrea Arcangeli, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller

From: linux-kernel-owner@vger.kernel.org
[mailto:linux-kernel-owner@vger.kernel.org]On Behalf Of Nick Piggin

> Chris Mason wrote:

> >The major difference from Nick's patch is that once the queue is marked
> >full, I don't clear the full flag until the wait queue is empty.  This
> >means new io can't steal available requests until every existing waiter
> >has been granted a request.

> Yes, this is probably a good idea.


Err... wouldn't this subvert the spirit, if not the warrant, of real time
scheduling and time-critical applications?

After all we *do* want to all-but-starve lower priority tasks of IO in the
presence of higher priority tasks.  A select few applications absolutely
need to be pampered (think ProTools audio mixing suite on the Mac etc.) and
any solution that doesn't take this into account will have to be re-done by
the people who want to bring these kinds of tasks to Linux.

I am not the most familiar with this body of code, but wouldn't the people
trying to do audio sampling and gaming get really frosted if they had to
wait for a list of lower priority IO events to completely drain before they
could get back to work?  It would certainly produce really bad encoding of
live data streams (etc).

From a purely queue-theory standpoint, I'm not even sure why this queue can
become "full".  Shouldn't the bounding case come about primarily by lack of
resources (can't allocate a queue entry or a data block) out where the users
can see and cope with the problem before all the expensive blocking and
waiting.

Still from a pure-theory standpoint, it would be "better" to make the wait
queues priority queues and leave their sizes unbounded.

In practice it is expensive to maintain a fully "proper" priority queue for
a queue of non-trivial size.  Then again, IO isn't cheap over the domain of
time anyway.

The proposed solution, by limiting the queue size, sort-of turns the
scheduler's wakeup behavior into that priority queue sorting mechanism.
That in turn would (it seems to me) lead to some degenerate behaviors just
outside the zone of midline stability.  In short, several very-high-priority
tasks could completely starve out the system if they can consistently submit
enough requests to fill the queue.

[That is: consider a bunch of tasks sleeping in the scheduler because they
are waiting for the queue to empty.  When they are all woken up, they will
actually be scheduled in priority order.  So higher priority tasks get first
crack at the "empty" queue.  If there are "enough" such tasks (which are IO
bound on this device) they will keep getting serviced, and then keep going
back to sleep on the full queue.  (And god help you if they are runaways
8-).  The high priority tasks constantly butt in line (because the scheduler
is now the keeper of the IO queue) and the lower priority tasks could wait
forever.]

{please note: I write some fairly massively-threaded applications; it would
only take one such application running at a high priority to produce "a
substantial number" of high priority processes submitting IO requests, so
the scenario, while not common, is potentially real.}

(so just off the top of my head...)

I would think that the best theoretical solution would be a priority heap.
(Ignoring heap storage requirements for a moment.)  You keep the highest
priority items at the front of the heap, and any time a heap reorg passes a
node by, you jack that node's priority by one.  For an extremely busy queue
nothing is starved, but the incline remains high enough to make sure that
the truly desperate priorities (of which there should be few in a real world
system) will "never" wait behind some dd(1) of vanishingly close to no
import.

Clearly doing a full heap with only pointers is ugly almost beyond
comprehension, and doing a heap in an array would tend to be impractical for
a large list under variable conditions.  A red-black tree gets too expensive
if you use them that many times throughout a system.  (and so on)

While several possible sort-of-heapish or sort-of-priority-queueish data
structures come to mind, I don't have a replacement concept that I can really
promote just now...

I would say that at a MINIMUM there needs to be some threshold of priority
for requests that get to go on a "full list" no matter what.  There really
"ought to be" a way for requests from higher priority tasks to get  closer
to the front of the list.  There "should be" a priority floor where tasks
with lower priorities get their requests queued up with the current
first-come-first-served mentality (as we don't need to spend a lot of time
thinking about things that have been nice(d) into the noise floor).  And
then there should be a promotion mechanism to prevent complete starvation.

Anything simpler and it is safer from a system stability standpoint to keep
with the current high-latency-on-occasion simple queue solution.

Rob.


^ permalink raw reply	[flat|nested] 109+ messages in thread

* RE: [PATCH] io stalls
  2003-06-10  1:48                     ` Robert White
@ 2003-06-10  2:13                       ` Chris Mason
  2003-06-10 23:04                         ` Robert White
  2003-06-10  3:22                       ` Nick Piggin
  1 sibling, 1 reply; 109+ messages in thread
From: Chris Mason @ 2003-06-10  2:13 UTC (permalink / raw)
  To: Robert White
  Cc: Nick Piggin, Andrea Arcangeli, Marc-Christian Petersen,
	Jens Axboe, Marcelo Tosatti, Georg Nikodym, lkml,
	Matthias Mueller

On Mon, 2003-06-09 at 21:48, Robert White wrote:
> From: linux-kernel-owner@vger.kernel.org
> [mailto:linux-kernel-owner@vger.kernel.org]On Behalf Of Nick Piggin
> 
> > Chris Mason wrote:
> 
> > >The major difference from Nick's patch is that once the queue is marked
> > >full, I don't clear the full flag until the wait queue is empty.  This
> > >means new io can't steal available requests until every existing waiter
> > >has been granted a request.
> 
> > Yes, this is probably a good idea.
> 
> 
> Err... wouldn't this subvert the spirit, if not the warrant, of real time
> scheduling and time-critical applications?
> 

[ lots of interesting points ]

Heh, I didn't really make my goals for the patch clear.  They go:

1) quantify the stalls people are seeing with real numbers so we can
point at a section of code causing bad performance.

2) Provide a somewhat obvious patch that makes the current
__get_request_wait call significantly more fair, in hopes of either
blaming it for the stalls or removing it from the list of candidates

3) fix the stalls

Most of your suggestions are 2.5 discussion material, where real
experimental work is going on.  The 2.4 io request wait queue isn't
working on priorities; the current one tries to be fair to everyone and
provide good throughput to everyone at the same time.  It's failing on
at least one of those, and until we can fix that I don't even want to
think about more complex issues.

Current users of the vanilla 2.4 tree will hopefully benefit from a
lower latency io request wait queue. The next best thing to real time is
a consistently small wait, which is what my patch is trying for.

-chris



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls
  2003-06-10  1:48                     ` Robert White
  2003-06-10  2:13                       ` Chris Mason
@ 2003-06-10  3:22                       ` Nick Piggin
  2003-06-10 21:17                         ` Robert White
  1 sibling, 1 reply; 109+ messages in thread
From: Nick Piggin @ 2003-06-10  3:22 UTC (permalink / raw)
  To: Robert White
  Cc: Chris Mason, Andrea Arcangeli, Marc-Christian Petersen,
	Jens Axboe, Marcelo Tosatti, Georg Nikodym, lkml,
	Matthias Mueller



Robert White wrote:

>From: linux-kernel-owner@vger.kernel.org
>[mailto:linux-kernel-owner@vger.kernel.org]On Behalf Of Nick Piggin
>
>
>>Chris Mason wrote:
>>
>
>>>The major difference from Nick's patch is that once the queue is marked
>>>full, I don't clear the full flag until the wait queue is empty.  This
>>>means new io can't steal available requests until every existing waiter
>>>has been granted a request.
>>>
>
>>Yes, this is probably a good idea.
>>
>
>
>Err... wouldn't this subvert the spirit, if not the warrant, of real time
>scheduling and time-critical applications?
>

No, my patch (plus Chris' modification) changes request allocation
from an overloaded queue from a semi-random (timing-dependent) mixture
of LIFO and FIFO to plain FIFO.

As Chris has shown, the old behaviour can cause a task to be starved for
2.7s (and in theory indefinitely), where it would be woken in < 200ms
under similar conditions with the FIFO scheme.

>
>After all we *do* want to all-but-starve lower priority tasks of IO in the
>presence of higher priority tasks.  A select few applications absolutely
>need to be pampered (think ProTools audio mixing suite on the Mac etc.) and
>any solution that doesn't take this into account will have to be re-done by
>the people who want to bring these kinds of tasks to Linux.
>
>I am not most familiar with this body of code, but wouldn't the people
>trying to do audio sampling and gaming get really frosted if they had to
>wait for a list of lower priority IO events to completely drain before they
>could get back to work?  It would certainly produce really bad encoding of
>live data streams (etc).
>
>

Actually, there is no priority other than time (ie. FIFO), and
seek distance in the IO subsystem. I guess this is why your
arguments fall down ;)

>From a purely queue-theory standpoint, I'm not even sure why this queue can
>become "full".  Shouldn't the bounding case come about primarily by lack of
>resources (can't allocate a queue entry or a data block) out where the users
>can see and cope with the problem before all the expensive blocking and
>waiting.
>

In practice, the problems of having a memory size limited queue
outweigh the benefits.

>
>Still from a pure-theory standpoint, it would be "better" to make the wait
>queues priority queues and leave their sizes unbounded.
>
>In practice it is expensive to maintain a fully "proper" priority queue for
>a queue of non-trivial size.  Then again, IO isn't cheap over the domain of
>time anyway.
>

If IO priorities were implemented, you still have the problem of
starvation. It would be better to simply have a per process limit
on request allocation, and implement the priority scheduling in
the io scheduler.

I think you would find that most processes do just fine with
just a couple of requests each, though.

>
>
>The solution proposed, by limiting the queue size sort-of turns the
>scheduler's wakeup behavior into that priority queue sorting mechanism.
>That in turn would (it seems to me) lead to some degenerate behaviors just
>outside the zone of midline stability.  In short several very-high-priority
>tasks could completely starve out the system if they can consistently submit
>enough request to fill the queue.
>
>[That is: consider a bunch of tasks sleeping in the scheduler because they
>are waiting for the queue to empty.  When they are all woken up, they will
>actually be scheduled in priority order.  So higher priority tasks get first
>crack at the "empty" queue.  If there are "enough" such tasks (which are IO
>bound on this device) they will keep getting serviced, and then keep going
>back to sleep on the full queue.  (And god help you if they are runaways
>8-).  The high priority tasks constantly butt in line (because the scheduler
>is now the keeper of the IO queue) and the lower priority tasks could wait
>forever.]
>

No, they will be woken up one at a time as requests
become freed, and in FIFO order. It might be possible
for a higher (CPU) priority task to be woken up
before the previous has a chance to run, but this
scheme is no worse than before (the solution here is
per process request limits, but this is 2.4).



^ permalink raw reply	[flat|nested] 109+ messages in thread

* RE: [PATCH] io stalls
  2003-06-10  3:22                       ` Nick Piggin
@ 2003-06-10 21:17                         ` Robert White
  2003-06-11  0:40                           ` Nick Piggin
  0 siblings, 1 reply; 109+ messages in thread
From: Robert White @ 2003-06-10 21:17 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Chris Mason, Andrea Arcangeli, Marc-Christian Petersen,
	Jens Axboe, Marcelo Tosatti, Georg Nikodym, lkml,
	Matthias Mueller

From: Nick Piggin [mailto:piggin@cyberone.com.au]
Sent: Monday, June 09, 2003 8:23 PM
>
> Actually, there is no priority other than time (ie. FIFO), and
> seek distance in the IO subsystem. I guess this is why your
> arguments fall down ;)

I'll buy that for the most part, though one of the differences I read
elsewhere in the thread was the choice between add_wait_queue() and
add_wait_queue_exclusive().  You will, however, note that one of the factors
in play in this patch is process priority.

(If I understand correctly) The wait queue in question becomes your FIFOing
agent; it is a kind of pre-queue in front of the actual IO queue once you
reach a "full" condition.

In the latter case [add_wait_queue_exclusive()] you are strictly FIFO over
the set of processes, where the moment-of-order is determined by insertion
into the wait queue.

In the former case [add_wait_queue()] when the queue is woken up all the
waiters will be marked runnable in the scheduler, and the scheduler will
then (at least tend to) sort the submissions into task priority order.  So
the higher priority tasks will get to butt into line.  Worse, the FIFO is
essentially lost to the vagaries of the scheduler so without the _exclusive
you have no FIFO at all.
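
(For reference: with plain add_wait_queue() the sleepers are non-exclusive,
so a single wake_up() on the wait queue makes all of them runnable at once
and they race for the freed request; with add_wait_queue_exclusive() the
same wake_up() wakes only the task at the head of the queue.)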

I think that is the reason that Chris was saying the
add_wait_queue_exclusive() mode "does seem to scale much better."

So your "original new" batching agent is really order-of-arrival that
becomes anti-sorted by process priority.  Which can lead to scheduler
induced starvation (and the observed "improvements" by using the strict FIFO
created by a_w_q_exclusive).  The problem is that you get a little communist
about the FIFO-ness when you use a_w_q_exclusive() and that can *SERIOUSLY*
harm a task that must approach real-time behavior.

One solution would be to stick with the add_wait_queue() process-priority
influenced never-really-FIFO, but every time a process/task wakes up, and it
then doesn't get its request onto the queue, add a small fixed increment to
its priority before going back into the wait.  This gives you both the
process-priority mechanism and a fairness metric.

Something like (in pure pseudo-code since I don't have my references here):

int priority_delta = 0;
while (try_enqueing_io_request() == queue_full) {
  if (current->priority < priority_max) {
    current->priority += priority_increment;
    priority_delta += priority_increment;
  }
  wait_on_queue();
}
current->priority -= priority_delta;

(and still, of course, only wake the wait queue when the "full" queue
reaches empty.)

What that gets you is democratic entry into the io request queue when it is
non-full.  It gets you seniority-based (plutocratic?) access to the io queue
as your request "ages" in the full pool.  If the pool gets so large that all
the requests are making their tasks reach priority_max then you "degrade" to
the fairness of the scheduler, which is an arbitrary but workable metric.

You get all that, but you preserve (or invent) a relationship that lets the
task priority automagically factor in "for free" so that relative starvation
(which is a good thing for deliberately asymmetric task priorities, and
matches user expectations) can be achieved without ever having absolute
starvation.

Further if priority_max isn't priority_system_max you get the
real-time-trumps-all behavior that something like a live audio stream
encoder may need (for any priority >= priority_max).

Rob.


^ permalink raw reply	[flat|nested] 109+ messages in thread

* RE: [PATCH] io stalls
  2003-06-10  2:13                       ` Chris Mason
@ 2003-06-10 23:04                         ` Robert White
  2003-06-11  0:58                           ` Chris Mason
  0 siblings, 1 reply; 109+ messages in thread
From: Robert White @ 2003-06-10 23:04 UTC (permalink / raw)
  To: Chris Mason
  Cc: Nick Piggin, Andrea Arcangeli, Marc-Christian Petersen,
	Jens Axboe, Marcelo Tosatti, Georg Nikodym, lkml,
	Matthias Mueller

From: Chris Mason [mailto:mason@suse.com]
Sent: Monday, June 09, 2003 7:13 PM

> 2) Provide a somewhat obvious patch that makes the current
> __get_request_wait call significantly more fair, in hopes of either
> blaming it for the stalls or removing it from the list of candidates

Without the a_w_q_exclusive() on add_wait_queue the FIFO effect is lost when
all the members of the wait queue compete for their timeslice in the
scheduler.  For all intents and purposes the fairness goes up some (you stop
having the one guy sorted to the un-happy end of the disk) but low priority
tasks will still always end up stalled on the dirty end of the stick.
Basically each new round at the queue-empty moment is a mob rush for the
door.

With the a_w_q_exclusive(), you get past fair and well into anti-optimal.
Your FIFO becomes essentially mandatory with no regard for anything but the
order things hit the wait queue.  (Particularly on an SMP machine, however)
"new requestors" may/will jump to the head of the line because they were
never *in* the wait queue.  So you have only achieved "fairness" with
respect to requests that come in to an io queue that was full-at-the-time of
the initial entry into the driver.  This becomes exactly like the experience
of waiting patiently on line to get off the highway and watching all the
rude people driving by you only to cut over and nose into the queue just at
the exit sign.

So you need the _exclusive if you want any kind of predictable fairness
(without getting into anything obscure) but it is still only "fair" for
those that were unfortunate enough to end up on the wait queue originally.
There is a small window for tasks to butt in freely.

> 3) fix the stalls

Without the _exclusive() you can't have fixed the stalls, you can only have
moved the locus-of-blame to the scheduler which may (or may not) have some
way to compensate and "fake fairness" built in by coincidence.

The thing I suggest in my other email, where you use the non-exclusive
version of the routine but temporarily bump the process priority each time a
request gets foisted off on the wait_queue instead of the IO queue, actually
has semantic fairness built in.  This basically builds a fairness elevator
that functions both over time-in-queue and original process priority (when
built into your basic patch).

It's also quite space/time efficient and fairly clear to reader and
implementer alike.

Rob.


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls (was: -rc7   Re: Linux 2.4.21-rc6)
  2003-06-09 22:19                   ` Andrea Arcangeli
  2003-06-10  0:27                     ` Chris Mason
@ 2003-06-10 23:13                     ` Chris Mason
  2003-06-11  0:16                       ` Andrea Arcangeli
  1 sibling, 1 reply; 109+ messages in thread
From: Chris Mason @ 2003-06-10 23:13 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nick Piggin, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller

On Mon, 2003-06-09 at 18:19, Andrea Arcangeli wrote:

> I spent last Saturday working on this too. This is the status of my
> current patches, would be interesting to compare them. they're not very
> well tested yet though.
> 
> They would obsoletes the old fix-pausing and the old elevator-lowlatency
> (I was going to release a new tree today, but I delayed it so I fixed
> uml today too first [tested with skas and w/o skas]).
> 
> those backout the rc7 interactivity changes (the only one that wasn't in
> my tree was the add_wait_queue_exclusive, that IMHO would better stay
> for scalability reasons).
> 
> Of course I would be very interested to know if those two patches (or
> Chris's one, you also retained the exclusive wakeup) are still greatly
> improved by removing the _exclusive weakups and going wake-all (in
> theory they shouldn't).

Ok, I merged these into rc7 along with the __get_request_wait stats
patch.  All numbers below were on ext2...I'm calling your patches -aa,
even though it's just a small part of the real -aa ;-) After a dbench 50
run, the -aa __get_request_wait latencies look like this:

device 08:01: num_req 6029, total jiffies waited 213475
        844 forced to wait
        2 min wait, 806 max wait
        252 average wait
        357 < 100, 29 < 200, 110 < 300, 111 < 400, 82 < 500
        155 waits longer than 500 jiffies

I changed my patch to have q->nr_requests at 1024 like yours, and reran
the dbench 50:

device 08:01: num_req 11122, total jiffies waited 121573
        8782 forced to wait
        1 min wait, 237 max wait
        13 average wait
        8654 < 100, 126 < 200, 2 < 300, 0 < 400, 0 < 500
        0 waits longer than 500 jiffies

So, I had 5000 more requests for the same workload, and 8000 of my
requests were forced to wait (compared to 844 of yours).  But the total
number of jiffies spent waiting on my patch was lower, as were the
average and max waits.  Increasing the number of requests with my patch
make the system feel slower, even though the __get_request_wait latency
numbers didn't change.

On this dbench run, you got a throughput of 118mb/s and I got 90mb/s. 
The __get_request_wait latency numbers were reliable across runs, but I
might as well have thrown a dart to pick throughput numbers.  So, next
tests were done with iozone.

On aa after iozone -s 100M -i 0 -t 20 (20 procs each doing streaming
writes to a private 100M file)

device 08:01: num_req 167133, total jiffies waited 872566
        6424 forced to wait
        4 min wait, 507 max wait
        135 average wait
        2619 < 100, 2020 < 200, 1433 < 300, 325 < 400, 26 < 500
        1 waits longer than 500 jiffies

And the iozone throughput numbers looked like so (again -aa patches)

        Children see throughput for 20 initial writers  =   13824.22 KB/sec
        Parent sees throughput for 20 initial writers   =    6811.29 KB/sec
        Min throughput per process                      =     451.99 KB/sec
        Max throughput per process                      =     904.14 KB/sec
        Avg throughput per process                      =     691.21 KB/sec
        Min xfer                                        =   51136.00 KB

The avg throughput per process with vanilla rc7 is 3MB/s; the best I've
been able to do with nr_requests at higher levels was 1.3MB/s.  With
smaller numbers of iozone threads (10 and lower so far) I can match rc7
speeds, but not with 20 procs.

Anyway, my latency numbers for iozone -s 100M -i 0 -t 20:

device 08:01: num_req 146049, total jiffies waited 434025
        130670 forced to wait
        1 min wait, 65 max wait
        3 average wait
        130671 < 100, 0 < 200, 0 < 300, 0 < 400, 0 < 500
        0 waits longer than 500 jiffies

And the iozone reported throughput:

        Children see throughput for 20 initial writers  =   19828.92 KB/sec
        Parent sees throughput for 20 initial writers   =    7003.36 KB/sec
        Min throughput per process                      =     526.61 KB/sec
        Max throughput per process                      =    1353.45 KB/sec
        Avg throughput per process                      =     991.45 KB/sec
        Min xfer                                        =   39968.00 KB

The patch I was working on today was almost the same as the one I posted
yesterday, the only difference being the hunk below and changes to
nr_requests (256 balanced nicely on my box, all numbers above were at
1024).

This hunk against my patch yesterday just avoids an unplug in
__get_request_wait if there are still available requests.  A process
might be waiting in __get_request_wait just because the queue was full,
which has little to do with the queue needing an unplug.  He'll get
woken up later by get_request_wait_wakeup if nobody else manages to wake
him (I think).

diff -u edited/drivers/block/ll_rw_blk.c edited/drivers/block/ll_rw_blk.c
--- edited/drivers/block/ll_rw_blk.c	Mon Jun  9 17:13:16 2003
+++ edited/drivers/block/ll_rw_blk.c	Tue Jun 10 16:46:50 2003
@@ -661,7 +661,8 @@
 		set_current_state(TASK_UNINTERRUPTIBLE);
 		spin_lock_irq(&io_request_lock);
 		if ((!waited && queue_full(q, rw)) || q->rq[rw].count == 0) {
-			__generic_unplug_device(q);
+			if (q->rq[rw].count == 0)
+				__generic_unplug_device(q);
 			spin_unlock_irq(&io_request_lock);
 			schedule();
 			spin_lock_irq(&io_request_lock);





^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls (was: -rc7   Re: Linux 2.4.21-rc6)
  2003-06-10 23:13                     ` Chris Mason
@ 2003-06-11  0:16                       ` Andrea Arcangeli
  2003-06-11  0:44                         ` Chris Mason
  0 siblings, 1 reply; 109+ messages in thread
From: Andrea Arcangeli @ 2003-06-11  0:16 UTC (permalink / raw)
  To: Chris Mason
  Cc: Nick Piggin, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller

On Tue, Jun 10, 2003 at 07:13:45PM -0400, Chris Mason wrote:
> On Mon, 2003-06-09 at 18:19, Andrea Arcangeli wrote:
> The avg throughput per process with vanilla rc7 is 3MB/s, the best I've
> been able to do was with nr_requests at higher levels was 1.3MB/s.  With
> smaller of iozone threads (10 and lower so far) I can match rc7 speeds,
> but not with 20 procs.

at least with my patches, I also made this change:

-#define ELV_LINUS_SEEK_COST    16
+#define ELV_LINUS_SEEK_COST    1

 #define ELEVATOR_NOOP                                                  \
 ((elevator_t) {                                                                \
@@ -93,8 +93,8 @@ static inline int elevator_request_laten

 #define ELEVATOR_LINUS                                                 \
 ((elevator_t) {                                                                \
-       2048,                           /* read passovers */            \
-       8192,                           /* write passovers */           \
+       128,                            /* read passovers */            \
+       512,                            /* write passovers */           \
                                                                        \

you didn't change the I/O scheduler at all compared to mainline, so
there can be quite a lot of difference in the bandwidth average per
process between my patches and mainline and your patches (unless you run
elvtune or unless you backed out the above).

Anyways the 130671 < 100, 0 < 200, 0 < 300, 0 < 400, 0 < 500 from your
patch sounds perfectly fair and that's unrelated to I/O scheduler and
size of runqueue. I believe the most interesting difference is the
blocking of tasks until the waitqueue is empty (i.e. clearing the
waitqueue-full bit only when nobody is waiting). That is the right thing
to do of course; that was a bug in my patch I merged by mistake from
Nick's original patch, and I'm going to fix it immediately of course.

Andrea

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls (was: -rc7   Re: Linux 2.4.21-rc6)
  2003-06-09 21:39                 ` [PATCH] io stalls (was: -rc7 Re: Linux 2.4.21-rc6) Chris Mason
  2003-06-09 22:19                   ` Andrea Arcangeli
  2003-06-09 23:51                   ` [PATCH] io stalls Nick Piggin
@ 2003-06-11  0:33                   ` Andrea Arcangeli
  2003-06-11  0:48                     ` [PATCH] io stalls Nick Piggin
  2003-06-11  0:54                     ` [PATCH] io stalls (was: -rc7 Re: Linux 2.4.21-rc6) Chris Mason
  2 siblings, 2 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2003-06-11  0:33 UTC (permalink / raw)
  To: Chris Mason
  Cc: Nick Piggin, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller

On Mon, Jun 09, 2003 at 05:39:23PM -0400, Chris Mason wrote:
> +	if (!waitqueue_active(&q->wait_for_requests[rw]))
> +		clear_queue_full(q, rw);

you've an smp race above, the smp safe implementation is this:

	if (!waitqueue_active(&q->wait_for_requests[rw])) {
		clear_queue_full(q, rw);
		mb();
		if (unlikely(waitqueue_active(&q->wait_for_requests[rw])))
			wake_up(&q->wait_for_requests[rw]);
	}

I'm also unsure what the "waited" logic does, it doesn't seem necessary.

Andrea

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls
  2003-06-10 21:17                         ` Robert White
@ 2003-06-11  0:40                           ` Nick Piggin
  0 siblings, 0 replies; 109+ messages in thread
From: Nick Piggin @ 2003-06-11  0:40 UTC (permalink / raw)
  To: Robert White
  Cc: Chris Mason, Andrea Arcangeli, Marc-Christian Petersen,
	Jens Axboe, Marcelo Tosatti, Georg Nikodym, lkml,
	Matthias Mueller



Robert White wrote:

>From: Nick Piggin [mailto:piggin@cyberone.com.au]
>Sent: Monday, June 09, 2003 8:23 PM
>
>>Actually, there is no priority other than time (ie. FIFO), and
>>seek distance in the IO subsystem. I guess this is why your
>>arguments fall down ;)
>>
>
>I'll buy that for the most part, though one of the differences I read
>elsewhere in the thread was the choice between add_wait_queue() and
>add_wait_queue_exclusive().  You will, however, note that one of the factors
>that is playing in this patch is process priority.
>
>(If I understand correctly) The wait queue in question becomes your FIFOing
>agent, it is kind of a pre-queue on the actual IO queue, once you reach a
>"full" condition.
>

Right.

>
>In the later case [add_wait_queue_exclusive()] you are strictly FIFO over
>the set of processes, where the moment-of-order is determined by insertion
>into the wait queue.
>
>In the former case [add_wait_queue()] when the queue is woken up all the
>waiters will be marked executable on the scheduler, and the scheduler will
>then (at least tend to) sort the submissions into task priority order.  So
>the higher priority tasks will get to butt into line.  Worse, the FIFO is
>essentially lost to the vagaries of the scheduler so without the _exclusive
>you have no FIFO at all.
>
>I think that is the reason that Chris was saying the
>add_wait_queue_exclusive() mode "does seem to scale much better."
>

Yep

>
>So your "original new" batching agent is really order-of-arrival that
>becomes anti-sorted by process priority.  Which can lead to scheduler
>induced starvation (and the observed "improvements" by using the strict FIFO
>created by a_w_q_exclusive).  The problem is that you get a little communist
>about the FIFO-ness when you use a_w_q_exclusive() and that can *SERIOUSLY*
>harm a task that must approach real-time behavior.
>

I think it had better be FIFO for now. If it's not, you're
making the worst case latency worse. It requires a lot of
careful testing to get something like that working right.

You have some good ideas, and quite possibly they would be
worth implementing, but the behaviour of the code is quite
complex, especially when you take into account its effect
on the io scheduler.




^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls (was: -rc7   Re: Linux 2.4.21-rc6)
  2003-06-11  0:16                       ` Andrea Arcangeli
@ 2003-06-11  0:44                         ` Chris Mason
  0 siblings, 0 replies; 109+ messages in thread
From: Chris Mason @ 2003-06-11  0:44 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nick Piggin, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller

On Tue, 2003-06-10 at 20:16, Andrea Arcangeli wrote:
> On Tue, Jun 10, 2003 at 07:13:45PM -0400, Chris Mason wrote:
> > On Mon, 2003-06-09 at 18:19, Andrea Arcangeli wrote:
> > The avg throughput per process with vanilla rc7 is 3MB/s, the best I've
> > been able to do was with nr_requests at higher levels was 1.3MB/s.  With
> > smaller of iozone threads (10 and lower so far) I can match rc7 speeds,
> > but not with 20 procs.
> 
> at least with my patches, I also made this change:
> 
> -#define ELV_LINUS_SEEK_COST    16
> +#define ELV_LINUS_SEEK_COST    1
> 
>  #define ELEVATOR_NOOP                                                  \
>  ((elevator_t) {                                                                \
> @@ -93,8 +93,8 @@ static inline int elevator_request_laten
> 
>  #define ELEVATOR_LINUS                                                 \
>  ((elevator_t) {                                                                \
> -       2048,                           /* read passovers */            \
> -       8192,                           /* write passovers */           \
> +       128,                            /* read passovers */            \
> +       512,                            /* write passovers */           \
>                                                                         \
> 

Right, I had forgotten to elvtune these in before my runs.  It shouldn't
change the __get_request_wait numbers, except for changes in the
percentage of merged requests leading to a different number of requests
overall (which my numbers did show).

> you didn't change the I/O scheduler at all compared to mainline, so
> there can be quite a lot of difference in the bandwidth average per
> process between my patches and mainline and your patches (unless you run
> elvtune or unless you backed out the above).
> 
> Anyways the 130671 < 100, 0 < 200, 0 < 300, 0 < 400, 0 < 500 from your
> patch sounds perfectly fair and that's unrelated to I/O scheduler and
> size of runqueue. I believe the most interesting difference is the
> blocking of tasks until the waitqueue is empty (i.e. clearing the
> waitqueue-full bit only when nobody is waiting). That is the right thing
> to do of course, that was a bug in my patch I merged by mistake from
> Nick's original patch, and that I'm going to fix immediatly of course.

Ok, increasing q->nr_requests also changes the throughput in high merge
workloads.

Basically if we have 20 procs doing streaming buffered io, the buffers
end up mixed together on the dirty list.  So assuming we hit the hard
dirty limit and all 20 procs are running write_some_buffers() the only
way we'll be able to efficiently merge the end result is if we can get
in 20 * 32 requests before unplugging.

This is because write_some_buffers grabs 32 buffers at a time, and each
caller has to wait fairly in __get_request_wait.  With only 128 requests
in the request queue, the disk is unplugged before any of the 20 procs has
submitted each of their 32 buffers.
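
(Concretely: 20 procs x 32 buffers each is 640 buffers' worth of io to
merge, against a queue of only 128 requests, so the unplug comes before the
streams have had a chance to merge; at nr_requests = 1024 the whole burst
fits.)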

It might make sense to change write_some_buffers to work in smaller
units; 32 seems like a lot of times to wait in __get_request_wait just
for an atime update.

-chris



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls
  2003-06-11  0:33                   ` [PATCH] io stalls (was: -rc7 Re: Linux 2.4.21-rc6) Andrea Arcangeli
@ 2003-06-11  0:48                     ` Nick Piggin
  2003-06-11  1:07                       ` Andrea Arcangeli
  2003-06-11  0:54                     ` [PATCH] io stalls (was: -rc7 Re: Linux 2.4.21-rc6) Chris Mason
  1 sibling, 1 reply; 109+ messages in thread
From: Nick Piggin @ 2003-06-11  0:48 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Chris Mason, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller



Andrea Arcangeli wrote:

>On Mon, Jun 09, 2003 at 05:39:23PM -0400, Chris Mason wrote:
>
>>+	if (!waitqueue_active(&q->wait_for_requests[rw]))
>>+		clear_queue_full(q, rw);
>>
>
>you've an smp race above, the smp safe implementation is this:
>
>	if (!waitqueue_active(&q->wait_for_requests[rw])) {
>		clear_queue_full(q, rw);
>		mb();
>		if (unlikely(waitqueue_active(&q->wait_for_requests[rw])))
>			wake_up(&q->wait_for_requests[rw]);
>	}
>
>I'm also unsure what the "waited" logic does, it doesn't seem necessary.
>

When a task is woken up, it is quite likely that the
queue is still marked full.


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls (was: -rc7   Re: Linux 2.4.21-rc6)
  2003-06-11  0:33                   ` [PATCH] io stalls (was: -rc7 Re: Linux 2.4.21-rc6) Andrea Arcangeli
  2003-06-11  0:48                     ` [PATCH] io stalls Nick Piggin
@ 2003-06-11  0:54                     ` Chris Mason
  2003-06-11  1:06                       ` Andrea Arcangeli
  1 sibling, 1 reply; 109+ messages in thread
From: Chris Mason @ 2003-06-11  0:54 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nick Piggin, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller

On Tue, 2003-06-10 at 20:33, Andrea Arcangeli wrote:
> On Mon, Jun 09, 2003 at 05:39:23PM -0400, Chris Mason wrote:
> > +	if (!waitqueue_active(&q->wait_for_requests[rw]))
> > +		clear_queue_full(q, rw);
> 
> you've an smp race above, the smp safe implementation is this:
> 

clear_queue_full has a wmb() in my patch, and queue_full has an rmb(); I
thought that covered these cases?  I'd rather remove those though, since
the spot you point out is the only place it's done outside the
io_request_lock.

> 	if (!waitqueue_active(&q->wait_for_requests[rw])) {
> 		clear_queue_full(q, rw);
> 		mb();
> 		if (unlikely(waitqueue_active(&q->wait_for_requests[rw])))
> 			wake_up(&q->wait_for_requests[rw]);
> 	}
> 
I don't think we need the extra wake_up (this is in __get_request_wait,
right?), since it gets done by get_request_wait_wakeup()

> I'm also unsure what the "waited" logic does, it doesn't seem necessary.

Once a process has waited once, it is allowed to ignore the q->full flag.
This way existing waiters can make progress even when q->full is set.
Without the waited check, q->full would never get cleared, because the
last waiter wouldn't proceed until the wait queue was empty.  I had to
make the __get_request change for the same reason.
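
In loop form the idea was roughly this (just a sketch, simplified from the
posted patch; the locking and unplug handling are left out):

	waited = 0;
	add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
	do {
		set_current_state(TASK_UNINTERRUPTIBLE);
		/* newcomers respect q->full; anyone who has already slept
		 * once is allowed to compete for a request, so the wait
		 * queue can drain and q->full can eventually be cleared */
		if ((!waited && queue_full(q, rw)) || q->rq[rw].count == 0) {
			schedule();
			waited = 1;
		}
		rq = __get_request(q, rw);
	} while (rq == NULL);
	set_current_state(TASK_RUNNING);
	remove_wait_queue(&q->wait_for_requests[rw], &wait);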

-chris



^ permalink raw reply	[flat|nested] 109+ messages in thread

* RE: [PATCH] io stalls
  2003-06-10 23:04                         ` Robert White
@ 2003-06-11  0:58                           ` Chris Mason
  0 siblings, 0 replies; 109+ messages in thread
From: Chris Mason @ 2003-06-11  0:58 UTC (permalink / raw)
  To: Robert White
  Cc: Nick Piggin, Andrea Arcangeli, Marc-Christian Petersen,
	Jens Axboe, Marcelo Tosatti, Georg Nikodym, lkml,
	Matthias Mueller

On Tue, 2003-06-10 at 19:04, Robert White wrote:
> From: Chris Mason [mailto:mason@suse.com]
> Sent: Monday, June 09, 2003 7:13 PM
> 
> > 2) Provide a somewhat obvious patch that makes the current
> > __get_request_wait call significantly more fair, in hopes of either
> > blaming it for the stalls or removing it from the list of candidates
> 
> Without the a_w_q_exclusive() on add_wait_queue the FIFO effect is lost when
> all the members of the wait queue compete for their timeslice in the
> scheduler.  For all intents and purposes the fairness goes up some (you stop
> having the one guy sorted to the un-happy end of the disk) but low priority
> tasks will still always end up stalled on the dirty end of the stick.
> Basically each new round at the queue-empty moment is a mob rush for the
> door.
> 
> With the a_w_q_exclusive(), you get past fair and well into anti-optimal.
> Your FIFO becomes essentially mandatory with no regard for anything but the
> order things hit the wait queue.  (Particularly on an SMP machine, however)
> "new requestors" may/will jump to the head of the line because they were
> never *in* the wait queue.  

The patches flying around force new io into the wait queue any time
someone else is already waiting; nobody is allowed to jump to the head
of the line.

The rest of your ideas are interesting, we just can't smush them into
2.4.  Please consider doing some experiments on the 2.5 io schedulers
and making suggestions, it's a critical area.

-chris


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls (was: -rc7   Re: Linux 2.4.21-rc6)
  2003-06-11  0:54                     ` [PATCH] io stalls (was: -rc7 Re: Linux 2.4.21-rc6) Chris Mason
@ 2003-06-11  1:06                       ` Andrea Arcangeli
  2003-06-11  1:57                         ` Chris Mason
  0 siblings, 1 reply; 109+ messages in thread
From: Andrea Arcangeli @ 2003-06-11  1:06 UTC (permalink / raw)
  To: Chris Mason
  Cc: Nick Piggin, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller

On Tue, Jun 10, 2003 at 08:54:00PM -0400, Chris Mason wrote:
> On Tue, 2003-06-10 at 20:33, Andrea Arcangeli wrote:
> > On Mon, Jun 09, 2003 at 05:39:23PM -0400, Chris Mason wrote:
> > > +	if (!waitqueue_active(&q->wait_for_requests[rw]))
> > > +		clear_queue_full(q, rw);
> > 
> > you've an smp race above, the smp safe implementation is this:
> > 
> 
> clear_queue_full has a wmb() in my patch, and queue_full has a rmb(), I
> thought that covered these cases?  I'd rather remove those though, since
> the spot you point out is the only place done outside the
> io_request_lock.
> 
> > 	if (!waitqueue_active(&q->wait_for_requests[rw])) {
> > 		clear_queue_full(q, rw);
> > 		mb();
> > 		if (unlikely(waitqueue_active(&q->wait_for_requests[rw])))
> > 			wake_up(&q->wait_for_requests[rw]);
> > 	}
> > 
> I don't think we need the extra wake_up (this is in __get_request_wait,
> right?), since it gets done by get_request_wait_wakeup()

there's no get_request_wait_wakeup in blkdev_release_request. I put the
construct in both places though (I have the clear_queue_full written
explicitly as q->full = 0).

And I don't think any of your barriers are needed at all; I mean, we only
need to be careful to clear it right, we don't need to be careful to set
or read it right when it transits from 0 to 1. And the above seems
enough to me to get the clearing right.

> > I'm also unsure what the "waited" logic does, it doesn't seem necessary.
> 
> Once a process waits once, they are allowed to ignore the q->full flag. 
> This way existing waiters can make progress even when q->full is set. 
> Without the waited check, q->full will never get cleared because the
> last writer wouldn't proceed until the last writer was gone.  I had to
> make __get_request for the same reason.

__get_request makes perfect sense of course and it's needed; that is not
the issue.  My point about the waited check is that the last writer has
to get the wakeup (and the wakeup has nothing to do with the waited
check since waited == 0), and after the wakeup it will get the request
and it won't re-run the loop, so I don't see why waited is needed.
Furthermore, even if for whatever reason it doesn't get the request, it
will re-set full to 1 and it'll still be the first to get the wakeup,
and it will definitely get another wakeup if no request was available.

Andrea

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls
  2003-06-11  0:48                     ` [PATCH] io stalls Nick Piggin
@ 2003-06-11  1:07                       ` Andrea Arcangeli
  0 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2003-06-11  1:07 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Chris Mason, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller

On Wed, Jun 11, 2003 at 10:48:23AM +1000, Nick Piggin wrote:
> 
> 
> Andrea Arcangeli wrote:
> 
> >On Mon, Jun 09, 2003 at 05:39:23PM -0400, Chris Mason wrote:
> >
> >>+	if (!waitqueue_active(&q->wait_for_requests[rw]))
> >>+		clear_queue_full(q, rw);
> >>
> >
> >you've an smp race above, the smp safe implementation is this:
> >
> >	if (!waitqueue_active(&q->wait_for_requests[rw])) {
> >		clear_queue_full(q, rw);
> >		mb();
> >		if (unlikely(waitqueue_active(&q->wait_for_requests[rw])))
> >			wake_up(&q->wait_for_requests[rw]);
> >	}
> >
> >I'm also unsure what the "waited" logic does, it doesn't seem necessary.
> >
> 
> When a task is woken up, it is quite likely that the
> queue is still marked full.

but we don't care if it's marked full, see __get_request. If we cared
about full it would deadlock anyways (no matter the waited logic)

Andrea

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls (was: -rc7   Re: Linux 2.4.21-rc6)
  2003-06-11  1:06                       ` Andrea Arcangeli
@ 2003-06-11  1:57                         ` Chris Mason
  2003-06-11  2:10                           ` Andrea Arcangeli
  0 siblings, 1 reply; 109+ messages in thread
From: Chris Mason @ 2003-06-11  1:57 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nick Piggin, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller

On Tue, 2003-06-10 at 21:06, Andrea Arcangeli wrote:

> And I don't think any of your barriers is needed at all, I mean, we only
> need to be careful to clear it right, we don't need to be careful to set
> or read it right when it transits from 0 to 1. And the above seems
> enough to me to get right the clearing.
> 

The current form of the patch has way too many barriers.  When I first
added them the patch was really different; I left them in because it
seemed easier to remember to rip them out than to add them back ;-)

> > > I'm also unsure what the "waited" logic does, it doesn't seem necessary.
> > 
> > Once a process waits once, they are allowed to ignore the q->full flag. 
> > This way existing waiters can make progress even when q->full is set. 
> > Without the waited check, q->full will never get cleared because the
> > last writer wouldn't proceed until the last writer was gone.  I had to
> > make __get_request for the same reason.
> 
> __get_request makes perfect sense of course and it's needed, this is not
> the issue, my point about the waited check is that the last writer has
> to get the wakeup (and the wakeup has nothing to do with the waited
> check since waited == 0), and after the wakeup it will get the request
> and it won't re-run the loop, so I don't see why waited is needed.
> Furthmore even if for whatever reason it doesn't get the request, it
> will re-set full to 1 and it'll be still the first to get the wakeup,
> and it will definitely get another wakeup if none request was available.

Ok, I see your point, we don't strictly need the waited check.  I had
added it as an optimization at first, so that those who waited once were
not penalized by further queue_full checks. 

-chris



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls (was: -rc7   Re: Linux 2.4.21-rc6)
  2003-06-11  1:57                         ` Chris Mason
@ 2003-06-11  2:10                           ` Andrea Arcangeli
  2003-06-11 12:24                             ` Chris Mason
  2003-06-11 17:42                             ` Chris Mason
  0 siblings, 2 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2003-06-11  2:10 UTC (permalink / raw)
  To: Chris Mason
  Cc: Nick Piggin, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller

On Tue, Jun 10, 2003 at 09:57:11PM -0400, Chris Mason wrote:
> Ok, I see your point, we don't strictly need the waited check.  I had
> added it as an optimization at first, so that those who waited once were
> not penalized by further queue_full checks. 

I could taste the feeling of not penalizing them while reading the code,
but that's just a feeling; in reality, if they blocked it means they set
full themselves and there was no request, so they want to go to sleep no
matter ->full or not ;)

Andrea

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls (was: -rc7   Re: Linux 2.4.21-rc6)
  2003-06-11  2:10                           ` Andrea Arcangeli
@ 2003-06-11 12:24                             ` Chris Mason
  2003-06-11 17:42                             ` Chris Mason
  1 sibling, 0 replies; 109+ messages in thread
From: Chris Mason @ 2003-06-11 12:24 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nick Piggin, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller

On Tue, 2003-06-10 at 22:10, Andrea Arcangeli wrote:
> On Tue, Jun 10, 2003 at 09:57:11PM -0400, Chris Mason wrote:
> > Ok, I see your point, we don't strictly need the waited check.  I had
> > added it as an optimization at first, so that those who waited once were
> > not penalized by further queue_full checks. 
> 
> I could taste the feeling of not penalizing while reading the code but
> that's just a feeling, in reality if they blocked it means they set full
> by themself and there was no request so they want to go to sleep no
> matter ->full or not ;)

You're completely right, as the patch changed I didn't realize waited
wasn't needed anymore ;-)

Are you adding the hunk from yesterday to avoid unplugs when q->rq.count
!= 0?

-chris



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls (was: -rc7   Re: Linux 2.4.21-rc6)
  2003-06-11  2:10                           ` Andrea Arcangeli
  2003-06-11 12:24                             ` Chris Mason
@ 2003-06-11 17:42                             ` Chris Mason
  2003-06-11 18:12                               ` Andrea Arcangeli
  1 sibling, 1 reply; 109+ messages in thread
From: Chris Mason @ 2003-06-11 17:42 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nick Piggin, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller

[-- Attachment #1: Type: text/plain, Size: 493 bytes --]

Ok, here's an updated patch: it changes the barriers around, updates
comments, and gets rid of the waited check in __get_request_wait.  It is
still a combined patch with fix_pausing, queue_full and latency stats,
mostly because I want to make really sure any testers are using all
three.

So, if someone who saw io stalls in 2.4.21-rc could give this a try, I'd
be grateful.  If you still see stalls with this applied, run elvtune
/dev/xxx and send along the resulting console output.

-chris


[-- Attachment #2: io-stalls-5.diff --]
[-- Type: text/plain, Size: 15306 bytes --]

--- 1.9/drivers/block/blkpg.c	Sat Mar 30 06:58:05 2002
+++ edited/drivers/block/blkpg.c	Tue Jun 10 14:49:27 2003
@@ -261,6 +261,7 @@
 			return blkpg_ioctl(dev, (struct blkpg_ioctl_arg *) arg);
 			
 		case BLKELVGET:
+			blk_print_stats(dev);
 			return blkelvget_ioctl(&blk_get_queue(dev)->elevator,
 					       (blkelv_ioctl_arg_t *) arg);
 		case BLKELVSET:
--- 1.45/drivers/block/ll_rw_blk.c	Wed May 28 03:50:02 2003
+++ edited/drivers/block/ll_rw_blk.c	Wed Jun 11 13:36:10 2003
@@ -429,6 +429,8 @@
 	q->rq[READ].count = 0;
 	q->rq[WRITE].count = 0;
 	q->nr_requests = 0;
+	q->read_full = 0;
+	q->write_full = 0;
 
 	si_meminfo(&si);
 	megs = si.totalram >> (20 - PAGE_SHIFT);
@@ -442,6 +444,56 @@
 	spin_lock_init(&q->queue_lock);
 }
 
+void blk_print_stats(kdev_t dev) 
+{
+	request_queue_t *q;
+	unsigned long avg_wait;
+	unsigned long min_wait;
+	unsigned long high_wait;
+	unsigned long *d;
+
+	q = blk_get_queue(dev);
+	if (!q)
+		return;
+
+	min_wait = q->min_wait;
+	if (min_wait == ~0UL)
+		min_wait = 0;
+	if (q->num_wait) 
+		avg_wait = q->total_wait / q->num_wait;
+	else
+		avg_wait = 0;
+	printk("device %s: num_req %lu, total jiffies waited %lu\n", 
+	       kdevname(dev), q->num_req, q->total_wait);
+	printk("\t%lu forced to wait\n", q->num_wait);
+	printk("\t%lu min wait, %lu max wait\n", min_wait, q->max_wait);
+	printk("\t%lu average wait\n", avg_wait);
+	d = q->deviation;
+	printk("\t%lu < 100, %lu < 200, %lu < 300, %lu < 400, %lu < 500\n",
+               d[0], d[1], d[2], d[3], d[4]);
+	high_wait = d[0] + d[1] + d[2] + d[3] + d[4];
+	high_wait = q->num_wait - high_wait;
+	printk("\t%lu waits longer than 500 jiffies\n", high_wait);
+}
+
+static void reset_stats(request_queue_t *q)
+{
+	q->max_wait		= 0;
+	q->min_wait		= ~0UL;
+	q->total_wait		= 0;
+	q->num_req		= 0;
+	q->num_wait		= 0;
+	memset(q->deviation, 0, sizeof(q->deviation));
+}
+void blk_reset_stats(kdev_t dev) 
+{
+	request_queue_t *q;
+	q = blk_get_queue(dev);
+	if (!q)
+	    return;
+	printk("reset latency stats on device %s\n", kdevname(dev));
+	reset_stats(q);
+}
 static int __make_request(request_queue_t * q, int rw, struct buffer_head * bh);
 
 /**
@@ -491,6 +543,9 @@
 	q->plug_tq.routine	= &generic_unplug_device;
 	q->plug_tq.data		= q;
 	q->plugged        	= 0;
+
+	reset_stats(q);
+
 	/*
 	 * These booleans describe the queue properties.  We set the
 	 * default (and most common) values here.  Other drivers can
@@ -508,7 +563,7 @@
  * Get a free request. io_request_lock must be held and interrupts
  * disabled on the way in.  Returns NULL if there are no free requests.
  */
-static struct request *get_request(request_queue_t *q, int rw)
+static struct request *__get_request(request_queue_t *q, int rw)
 {
 	struct request *rq = NULL;
 	struct request_list *rl = q->rq + rw;
@@ -521,35 +576,48 @@
 		rq->cmd = rw;
 		rq->special = NULL;
 		rq->q = q;
-	}
+	} else
+		set_queue_full(q, rw);
 
 	return rq;
 }
 
 /*
- * Here's the request allocation design:
+ * get a free request, honoring the queue_full condition
+ */
+static inline struct request *get_request(request_queue_t *q, int rw)
+{
+	if (queue_full(q, rw))
+		return NULL;
+	return __get_request(q, rw);
+}
+
+/* 
+ * helper func to do memory barriers and wakeups when we finally decide
+ * to clear the queue full condition
+ */
+static inline void clear_full_and_wake(request_queue_t *q, int rw)
+{
+	clear_queue_full(q, rw);
+	mb();
+	if (unlikely(waitqueue_active(&q->wait_for_requests[rw])))
+		wake_up(&q->wait_for_requests[rw]);
+}
+
+/*
+ * Here's the request allocation design, low latency version:
  *
  * 1: Blocking on request exhaustion is a key part of I/O throttling.
  * 
  * 2: We want to be `fair' to all requesters.  We must avoid starvation, and
  *    attempt to ensure that all requesters sleep for a similar duration.  Hence
  *    no stealing requests when there are other processes waiting.
- * 
- * 3: We also wish to support `batching' of requests.  So when a process is
- *    woken, we want to allow it to allocate a decent number of requests
- *    before it blocks again, so they can be nicely merged (this only really
- *    matters if the process happens to be adding requests near the head of
- *    the queue).
- * 
- * 4: We want to avoid scheduling storms.  This isn't really important, because
- *    the system will be I/O bound anyway.  But it's easy.
- * 
- *    There is tension between requirements 2 and 3.  Once a task has woken,
- *    we don't want to allow it to sleep as soon as it takes its second request.
- *    But we don't want currently-running tasks to steal all the requests
- *    from the sleepers.  We handle this with wakeup hysteresis around
- *    0 .. batch_requests and with the assumption that request taking is much,
- *    much faster than request freeing.
+ *
+ * There used to be more here, attempting to allow a process to send in a
+ * number of requests once it has woken up.  But, there's no way to 
+ * tell if a process has just been woken up, or if it is a new process
+ * coming in to steal requests from the waiters.  So, we give up and force
+ * everyone to wait fairly.
  * 
  * So here's what we do:
  * 
@@ -561,50 +629,78 @@
  * 
  *  When a process wants a new request:
  * 
- *    b) If free_requests == 0, the requester sleeps in FIFO manner.
- * 
- *    b) If 0 <  free_requests < batch_requests and there are waiters,
- *       we still take a request non-blockingly.  This provides batching.
- *
- *    c) If free_requests >= batch_requests, the caller is immediately
- *       granted a new request.
+ *    b) If free_requests == 0, the requester sleeps in FIFO manner, and
+ *       the queue full condition is set.  The full condition is not
+ *       cleared until there are no longer any waiters.  Once the full
+ *       condition is set, all new io must wait, hopefully for a very
+ *       short period of time.
  * 
  *  When a request is released:
  * 
- *    d) If free_requests < batch_requests, do nothing.
- * 
- *    f) If free_requests >= batch_requests, wake up a single waiter.
- * 
- *   The net effect is that when a process is woken at the batch_requests level,
- *   it will be able to take approximately (batch_requests) requests before
- *   blocking again (at the tail of the queue).
+ *    c) If free_requests < batch_requests, do nothing.
  * 
- *   This all assumes that the rate of taking requests is much, much higher
- *   than the rate of releasing them.  Which is very true.
+ *    d) If free_requests >= batch_requests, wake up a single waiter.
  *
- * -akpm, Feb 2002.
+ *   As each waiter gets a request, he wakes another waiter.  We do this
+ *   to prevent a race where an unplug might get run before a request makes
+ *   it's way onto the queue.  The result is a cascade of wakeups, so delaying
+ *   its way onto the queue.  The result is a cascade of wakeups, so delaying
+ *   wakeups where there aren't any requests available yet.
  */
 
 static struct request *__get_request_wait(request_queue_t *q, int rw)
 {
 	register struct request *rq;
+	unsigned long wait_start = jiffies;
+	unsigned long time_waited;
 	DECLARE_WAITQUEUE(wait, current);
 
-	add_wait_queue(&q->wait_for_requests[rw], &wait);
+	add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
+
 	do {
 		set_current_state(TASK_UNINTERRUPTIBLE);
-		generic_unplug_device(q);
-		if (q->rq[rw].count == 0)
-			schedule();
 		spin_lock_irq(&io_request_lock);
-		rq = get_request(q, rw);
+		if (queue_full(q, rw) || q->rq[rw].count == 0) {
+			if (q->rq[rw].count == 0)
+				__generic_unplug_device(q);
+			spin_unlock_irq(&io_request_lock);
+			schedule();
+			spin_lock_irq(&io_request_lock);
+		}
+		rq = __get_request(q, rw);
 		spin_unlock_irq(&io_request_lock);
 	} while (rq == NULL);
 	remove_wait_queue(&q->wait_for_requests[rw], &wait);
 	current->state = TASK_RUNNING;
+
+	if (!waitqueue_active(&q->wait_for_requests[rw]))
+		clear_full_and_wake(q, rw);
+
+	time_waited = jiffies - wait_start;
+	if (time_waited > q->max_wait)
+		q->max_wait = time_waited;
+	if (time_waited && time_waited < q->min_wait)
+		q->min_wait = time_waited;
+	q->total_wait += time_waited;
+	q->num_wait++;
+	if (time_waited < 500) {
+		q->deviation[time_waited/100]++;
+	}
+
 	return rq;
 }
 
+static void get_request_wait_wakeup(request_queue_t *q, int rw)
+{
+	/*
+	 * avoid losing an unplug if a second __get_request_wait did the
+	 * generic_unplug_device while our __get_request_wait was running
+	 * w/o the queue_lock held and w/ our request out of the queue.
+	 */
+	if (waitqueue_active(&q->wait_for_requests[rw]))
+		wake_up(&q->wait_for_requests[rw]);
+}
+
 /* RO fail safe mechanism */
 
 static long ro_bits[MAX_BLKDEV][8];
@@ -829,8 +925,14 @@
 	 */
 	if (q) {
 		list_add(&req->queue, &q->rq[rw].free);
-		if (++q->rq[rw].count >= q->batch_requests)
-			wake_up(&q->wait_for_requests[rw]);
+		q->rq[rw].count++;
+		if (q->rq[rw].count >= q->batch_requests) {
+			smp_mb();
+			if (waitqueue_active(&q->wait_for_requests[rw]))
+				wake_up(&q->wait_for_requests[rw]);
+			else
+				clear_full_and_wake(q, rw);
+		}
 	}
 }
 
@@ -948,7 +1050,6 @@
 	 */
 	max_sectors = get_max_sectors(bh->b_rdev);
 
-again:
 	req = NULL;
 	head = &q->queue_head;
 	/*
@@ -957,6 +1058,7 @@
 	 */
 	spin_lock_irq(&io_request_lock);
 
+again:
 	insert_here = head->prev;
 	if (list_empty(head)) {
 		q->plug_device_fn(q, bh->b_rdev); /* is atomic */
@@ -1042,6 +1144,9 @@
 			if (req == NULL) {
 				spin_unlock_irq(&io_request_lock);
 				freereq = __get_request_wait(q, rw);
+				head = &q->queue_head;
+				spin_lock_irq(&io_request_lock);
+				get_request_wait_wakeup(q, rw);
 				goto again;
 			}
 		}
@@ -1063,6 +1168,7 @@
 	req->rq_dev = bh->b_rdev;
 	req->start_time = jiffies;
 	req_new_io(req, 0, count);
+	q->num_req++;
 	blk_started_io(count);
 	add_request(q, req, insert_here);
 out:
@@ -1196,8 +1302,15 @@
 	bh->b_rdev = bh->b_dev;
 	bh->b_rsector = bh->b_blocknr * count;
 
+	get_bh(bh);
 	generic_make_request(rw, bh);
 
+	/* fix race condition with wait_on_buffer() */
+	smp_mb(); /* spin_unlock may have inclusive semantics */
+	if (waitqueue_active(&bh->b_wait))
+		wake_up(&bh->b_wait);
+
+	put_bh(bh);
 	switch (rw) {
 		case WRITE:
 			kstat.pgpgout += count;
--- 1.83/fs/buffer.c	Wed May 14 12:51:00 2003
+++ edited/fs/buffer.c	Wed Jun 11 09:56:27 2003
@@ -153,10 +153,23 @@
 	get_bh(bh);
 	add_wait_queue(&bh->b_wait, &wait);
 	do {
-		run_task_queue(&tq_disk);
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 		if (!buffer_locked(bh))
 			break;
+		/*
+		 * We must read tq_disk in TQ_ACTIVE after the
+		 * add_wait_queue effect is visible to other cpus.
+		 * We could unplug some line above it wouldn't matter
+		 * but we can't do that right after add_wait_queue
+		 * without an smp_mb() in between because spin_unlock
+		 * has inclusive semantics.
+		 * Doing it here is the most efficient place so we
+		 * don't do a spurious unplug if we get a racy
+		 * wakeup that makes buffer_locked return 0, and
+		 * doing it here avoids an explicit smp_mb() we
+		 * rely on the implicit one in set_task_state.
+		 */
+		run_task_queue(&tq_disk);
 		schedule();
 	} while (buffer_locked(bh));
 	tsk->state = TASK_RUNNING;
@@ -1507,6 +1520,9 @@
 
 	/* Done - end_buffer_io_async will unlock */
 	SetPageUptodate(page);
+
+	wakeup_page_waiters(page);
+
 	return 0;
 
 out:
@@ -1538,6 +1554,7 @@
 	} while (bh != head);
 	if (need_unlock)
 		UnlockPage(page);
+	wakeup_page_waiters(page);
 	return err;
 }
 
@@ -1765,6 +1782,8 @@
 		else
 			submit_bh(READ, bh);
 	}
+
+	wakeup_page_waiters(page);
 	
 	return 0;
 }
@@ -2378,6 +2397,7 @@
 		submit_bh(rw, bh);
 		bh = next;
 	} while (bh != head);
+	wakeup_page_waiters(page);
 	return 0;
 }
 
--- 1.49/fs/super.c	Wed Dec 18 21:34:24 2002
+++ edited/fs/super.c	Tue Jun 10 14:49:27 2003
@@ -726,6 +726,7 @@
 	if (!fs_type->read_super(s, data, flags & MS_VERBOSE ? 1 : 0))
 		goto Einval;
 	s->s_flags |= MS_ACTIVE;
+	blk_reset_stats(dev);
 	path_release(&nd);
 	return s;
 
--- 1.45/fs/reiserfs/inode.c	Thu May 22 16:35:02 2003
+++ edited/fs/reiserfs/inode.c	Tue Jun 10 14:49:27 2003
@@ -2048,6 +2048,7 @@
     */
     if (nr) {
         submit_bh_for_writepage(arr, nr) ;
+	wakeup_page_waiters(page);
     } else {
         UnlockPage(page) ;
     }
--- 1.23/include/linux/blkdev.h	Fri Nov 29 17:03:01 2002
+++ edited/include/linux/blkdev.h	Wed Jun 11 09:56:55 2003
@@ -126,6 +126,14 @@
 	 */
 	char			head_active;
 
+	/*
+	 * Booleans that indicate whether the queue's free requests have
+	 * been exhausted and is waiting to drop below the batch_requests
+	 * threshold
+	 */
+	char			read_full;
+	char			write_full;
+
 	unsigned long		bounce_pfn;
 
 	/*
@@ -138,8 +146,17 @@
 	 * Tasks wait here for free read and write requests
 	 */
 	wait_queue_head_t	wait_for_requests[2];
+	unsigned long           max_wait;
+	unsigned long           min_wait;
+	unsigned long           total_wait;
+	unsigned long           num_req;
+	unsigned long           num_wait;
+	unsigned long           deviation[5];
 };
 
+void blk_reset_stats(kdev_t dev);
+void blk_print_stats(kdev_t dev);
+
 #define blk_queue_plugged(q)	(q)->plugged
 #define blk_fs_request(rq)	((rq)->cmd == READ || (rq)->cmd == WRITE)
 #define blk_queue_empty(q)	list_empty(&(q)->queue_head)
@@ -156,6 +173,30 @@
 	}
 }
 
+static inline void set_queue_full(request_queue_t *q, int rw)
+{
+	if (rw == READ)
+		q->read_full = 1;
+	else
+		q->write_full = 1;
+}
+
+static inline void clear_queue_full(request_queue_t *q, int rw)
+{
+	if (rw == READ)
+		q->read_full = 0;
+	else
+		q->write_full = 0;
+}
+
+static inline int queue_full(request_queue_t *q, int rw)
+{
+	if (rw == READ)
+		return q->read_full;
+	else
+		return q->write_full;
+}
+
 extern unsigned long blk_max_low_pfn, blk_max_pfn;
 
 #define BLK_BOUNCE_HIGH		(blk_max_low_pfn << PAGE_SHIFT)
@@ -217,6 +258,7 @@
 extern void generic_make_request(int rw, struct buffer_head * bh);
 extern inline request_queue_t *blk_get_queue(kdev_t dev);
 extern void blkdev_release_request(struct request *);
+extern void blk_print_stats(kdev_t dev);
 
 /*
  * Access functions for manipulating queue properties
--- 1.19/include/linux/pagemap.h	Sun Aug 25 15:32:11 2002
+++ edited/include/linux/pagemap.h	Wed Jun 11 08:57:12 2003
@@ -97,6 +97,8 @@
 		___wait_on_page(page);
 }
 
+extern void FASTCALL(wakeup_page_waiters(struct page * page));
+
 /*
  * Returns locked page at given index in given cache, creating it if needed.
  */
--- 1.68/kernel/ksyms.c	Fri May 23 17:40:47 2003
+++ edited/kernel/ksyms.c	Tue Jun 10 14:49:27 2003
@@ -295,6 +295,7 @@
 EXPORT_SYMBOL(filemap_fdatawait);
 EXPORT_SYMBOL(lock_page);
 EXPORT_SYMBOL(unlock_page);
+EXPORT_SYMBOL(wakeup_page_waiters);
 
 /* device registration */
 EXPORT_SYMBOL(register_chrdev);
--- 1.77/mm/filemap.c	Thu Apr 24 11:05:10 2003
+++ edited/mm/filemap.c	Tue Jun 10 14:49:28 2003
@@ -812,6 +812,20 @@
 	return &wait[hash];
 }
 
+/*
+ * This must be called after every submit_bh with end_io
+ * callbacks that would result into the blkdev layer waking
+ * up the page after a queue unplug.
+ */
+void wakeup_page_waiters(struct page * page)
+{
+	wait_queue_head_t * head;
+
+	head = page_waitqueue(page);
+	if (waitqueue_active(head))
+		wake_up(head);
+}
+
 /* 
  * Wait for a page to get unlocked.
  *

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls (was: -rc7   Re: Linux 2.4.21-rc6)
  2003-06-11 17:42                             ` Chris Mason
@ 2003-06-11 18:12                               ` Andrea Arcangeli
  2003-06-11 18:27                                 ` Chris Mason
  0 siblings, 1 reply; 109+ messages in thread
From: Andrea Arcangeli @ 2003-06-11 18:12 UTC (permalink / raw)
  To: Chris Mason
  Cc: Nick Piggin, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller

On Wed, Jun 11, 2003 at 01:42:41PM -0400, Chris Mason wrote:
> +		if (q->rq[rw].count >= q->batch_requests) {
> +			smp_mb();
> +			if (waitqueue_active(&q->wait_for_requests[rw]))
> +				wake_up(&q->wait_for_requests[rw]);

in my tree I also changed this to:

				wake_up_nr(&q->wait_for_requests[rw], q->rq[rw].count);

otherwise only one waiter will eat the requests, while multiple waiters
could instead eat requests in parallel, because we freed not just 1 request
but many of them.
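
In context, the release-side hunk with that change applied would read
roughly as follows (a reconstruction for illustration, not a quote from
either tree):

	if (q->rq[rw].count >= q->batch_requests) {
		smp_mb();
		if (waitqueue_active(&q->wait_for_requests[rw]))
			wake_up_nr(&q->wait_for_requests[rw], q->rq[rw].count);
		else
			clear_full_and_wake(q, rw);
	}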

I wonder if my above change is really the right way to implement the
removal of the _exclusive line that went in rc6. However, with your patch
the wake_up_nr (or the ~equivalent removal of the _exclusive wakeup in rc6)
should mostly improve cpu parallelism in smp and while waiting for I/O;
the amount of stuff in the I/O queue and the overall fairness shouldn't
change very significantly with this new completely fair FIFO request
allocator.

Andrea

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls (was: -rc7   Re: Linux 2.4.21-rc6)
  2003-06-11 18:12                               ` Andrea Arcangeli
@ 2003-06-11 18:27                                 ` Chris Mason
  2003-06-11 18:35                                   ` Andrea Arcangeli
  0 siblings, 1 reply; 109+ messages in thread
From: Chris Mason @ 2003-06-11 18:27 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nick Piggin, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller

On Wed, 2003-06-11 at 14:12, Andrea Arcangeli wrote:
> On Wed, Jun 11, 2003 at 01:42:41PM -0400, Chris Mason wrote:
> > +		if (q->rq[rw].count >= q->batch_requests) {
> > +			smp_mb();
> > +			if (waitqueue_active(&q->wait_for_requests[rw]))
> > +				wake_up(&q->wait_for_requests[rw]);
> 
> in my tree I also changed this to:
> 
> 				wake_up_nr(&q->wait_for_requests[rw], q->rq[rw].count);
> 
> otherwise only one waiter will eat the requests, while multiple waiters
> can eat requests in parallel instead because we freed not just 1 request
> but many of them.

I tried a few variations of this yesterday and they all led to horrible
latencies, but I couldn't really explain why.  I had a bunch of other
stuff in at the time to try and improve throughput though, so I'll try
it again.

I think part of the problem is the cascading wakeups from
get_request_wait_wakeup().  So if we wake up 32 procs, they in turn wake up
another 32, etc.

-chris



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls (was: -rc7   Re: Linux 2.4.21-rc6)
  2003-06-11 18:27                                 ` Chris Mason
@ 2003-06-11 18:35                                   ` Andrea Arcangeli
  2003-06-12  1:04                                     ` [PATCH] io stalls Nick Piggin
  0 siblings, 1 reply; 109+ messages in thread
From: Andrea Arcangeli @ 2003-06-11 18:35 UTC (permalink / raw)
  To: Chris Mason
  Cc: Nick Piggin, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller

On Wed, Jun 11, 2003 at 02:27:13PM -0400, Chris Mason wrote:
> On Wed, 2003-06-11 at 14:12, Andrea Arcangeli wrote:
> > On Wed, Jun 11, 2003 at 01:42:41PM -0400, Chris Mason wrote:
> > > +		if (q->rq[rw].count >= q->batch_requests) {
> > > +			smp_mb();
> > > +			if (waitqueue_active(&q->wait_for_requests[rw]))
> > > +				wake_up(&q->wait_for_requests[rw]);
> > 
> > in my tree I also changed this to:
> > 
> > 				wake_up_nr(&q->wait_for_requests[rw], q->rq[rw].count);
> > 
> > otherwise only one waiter will eat the requests, while multiple waiters
> > can eat requests in parallel instead because we freed not just 1 request
> > but many of them.
> 
> I tried a few variations of this yesterday and they all led to horrible
> latencies, but I couldn't really explain why.  I had a bunch of other

the I/O latency in theory shouldn't change: we're not reordering the
queue at all, and they'll go to sleep immediately again if __get_request
returns null.

> stuff in at the time to try and improve throughput though, so I'll try
> it again.
> 
> I think part of the problem is the cascading wakeups from
> get_request_wait_wakeup().  So if we wakeup 32 procs they in turn wakeup
> another 32, etc.

so maybe it's enough to wake up count / 2 to account for the double
wakeup? That will avoid some overscheduling indeed.
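
In the release path that would be roughly (illustrative only):

	wake_up_nr(&q->wait_for_requests[rw], q->rq[rw].count / 2);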

Andrea

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls
  2003-06-11 18:35                                   ` Andrea Arcangeli
@ 2003-06-12  1:04                                     ` Nick Piggin
  2003-06-12  1:12                                       ` Chris Mason
  2003-06-12  1:29                                       ` Andrea Arcangeli
  0 siblings, 2 replies; 109+ messages in thread
From: Nick Piggin @ 2003-06-12  1:04 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Chris Mason, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller



Andrea Arcangeli wrote:

>On Wed, Jun 11, 2003 at 02:27:13PM -0400, Chris Mason wrote:
>
>>On Wed, 2003-06-11 at 14:12, Andrea Arcangeli wrote:
>>
>>>On Wed, Jun 11, 2003 at 01:42:41PM -0400, Chris Mason wrote:
>>>
>>>>+		if (q->rq[rw].count >= q->batch_requests) {
>>>>+			smp_mb();
>>>>+			if (waitqueue_active(&q->wait_for_requests[rw]))
>>>>+				wake_up(&q->wait_for_requests[rw]);
>>>>
>>>in my tree I also changed this to:
>>>
>>>				wake_up_nr(&q->wait_for_requests[rw], q->rq[rw].count);
>>>
>>>otherwise only one waiter will eat the requests, while multiple waiters
>>>can eat requests in parallel instead because we freed not just 1 request
>>>but many of them.
>>>
>>I tried a few variations of this yesterday and they all led to horrible
>>latencies, but I couldn't really explain why.  I had a bunch of other
>>
>
>the I/O latency in theory shouldn't change, we're not reordering the
>queue at all, they'll go to sleep immediatly again if __get_request
>returns null.
>

And go to the end of the queue?

>
>>stuff in at the time to try and improve throughput though, so I'll try
>>it again.
>>
>>I think part of the problem is the cascading wakeups from
>>get_request_wait_wakeup().  So if we wakeup 32 procs they in turn wakeup
>>another 32, etc.
>>
>
>so maybe it's enough to wakeup count / 2 to account for the double
>wakeup? that will avoid some overscheduling indeed.
>
>

Andrea, this isn't needed because when the queue falls below
the batch limit, every retiring request will do a wake up and
another request will be put on (as long as the waitqueue is
active).



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls
  2003-06-12  1:04                                     ` [PATCH] io stalls Nick Piggin
@ 2003-06-12  1:12                                       ` Chris Mason
  2003-06-12  1:29                                       ` Andrea Arcangeli
  1 sibling, 0 replies; 109+ messages in thread
From: Chris Mason @ 2003-06-12  1:12 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrea Arcangeli, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller

On Wed, 2003-06-11 at 21:04, Nick Piggin wrote:
> Andrea Arcangeli wrote:
> 
> >On Wed, Jun 11, 2003 at 02:27:13PM -0400, Chris Mason wrote:
> >
> >>On Wed, 2003-06-11 at 14:12, Andrea Arcangeli wrote:
> >>
> >>>On Wed, Jun 11, 2003 at 01:42:41PM -0400, Chris Mason wrote:
> >>>
> >>>>+		if (q->rq[rw].count >= q->batch_requests) {
> >>>>+			smp_mb();
> >>>>+			if (waitqueue_active(&q->wait_for_requests[rw]))
> >>>>+				wake_up(&q->wait_for_requests[rw]);
> >>>>
> >>>in my tree I also changed this to:
> >>>
> >>>				wake_up_nr(&q->wait_for_requests[rw], q->rq[rw].count);
> >>>
> >>>otherwise only one waiter will eat the requests, while multiple waiters
> >>>can eat requests in parallel instead because we freed not just 1 request
> >>>but many of them.
> >>>
> >>I tried a few variations of this yesterday and they all led to horrible
> >>latencies, but I couldn't really explain why.  I had a bunch of other
> >>
> >
> >the I/O latency in theory shouldn't change, we're not reordering the
> >queue at all, they'll go to sleep immediatly again if __get_request
> >returns null.
> >
> 
> And go to the end of the queue?
> 

This got dragged into private mail for a few messages, but we figured
out the problem turns into scheduling fairness with wake_up_nr().

32 procs might get woken, but when the first of those procs gets a
request, he'll wake another, and so on.  But there's no promise that
getting woken fairly means you'll get scheduled fairly, so you might not
get scheduled in for quite a while, perhaps even after new waiters have
gone onto the wait queue and gotten woken up.

The real problem is get_request_wait_wakeup; Andrea is working on a few
changes to that.

-chris


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls
  2003-06-12  1:04                                     ` [PATCH] io stalls Nick Piggin
  2003-06-12  1:12                                       ` Chris Mason
@ 2003-06-12  1:29                                       ` Andrea Arcangeli
  2003-06-12  1:37                                         ` Andrea Arcangeli
  2003-06-12  2:22                                         ` Chris Mason
  1 sibling, 2 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2003-06-12  1:29 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Chris Mason, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller

On Thu, Jun 12, 2003 at 11:04:42AM +1000, Nick Piggin wrote:
> 
> 
> Andrea Arcangeli wrote:
> 
> >On Wed, Jun 11, 2003 at 02:27:13PM -0400, Chris Mason wrote:
> >
> >>On Wed, 2003-06-11 at 14:12, Andrea Arcangeli wrote:
> >>
> >>>On Wed, Jun 11, 2003 at 01:42:41PM -0400, Chris Mason wrote:
> >>>
> >>>>+		if (q->rq[rw].count >= q->batch_requests) {
> >>>>+			smp_mb();
> >>>>+			if (waitqueue_active(&q->wait_for_requests[rw]))
> >>>>+				wake_up(&q->wait_for_requests[rw]);
> >>>>
> >>>in my tree I also changed this to:
> >>>
> >>>				wake_up_nr(&q->wait_for_requests[rw], 
> >>>				q->rq[rw].count);
> >>>
> >>>otherwise only one waiter will eat the requests, while multiple waiters
> >>>can eat requests in parallel instead because we freed not just 1 request
> >>>but many of them.
> >>>
> >>I tried a few variations of this yesterday and they all led to horrible
> >>latencies, but I couldn't really explain why.  I had a bunch of other
> >>
> >
> >the I/O latency in theory shouldn't change, we're not reordering the
> >queue at all, they'll go to sleep immediatly again if __get_request
> >returns null.
> >
> 
> And go to the end of the queue?

they stay in queue, so they don't go to the end.

but as Chris found, since we have the get_request_wait_wakeup, even waking
free-requests/2 isn't enough, since that will generate free-requests * 1.5
wakeups, where the last free-requests/2 (implicitly generated by the
get_request_wait_wakeup) will become runnable and race with the
other tasks woken later by another request release.

I sort of fixed that by remembering an old suggestion from Andrew:

static void get_request_wait_wakeup(request_queue_t *q, int rw)
{
	/*
	 * avoid losing an unplug if a second __get_request_wait did the
	 * generic_unplug_device while our __get_request_wait was
	 * running
	 * w/o the queue_lock held and w/ our request out of the queue.
	 */
	if (waitqueue_active(&q->wait_for_requests))
		run_task_queue(&tq_disk);
}


this will avoid having get_request_wait_wakeup mess up the wakeups, so we
can wake_up_nr(rq.count) safely.

then there's the last issue raised by Chris, that is, if requests get
released faster than the tasks can run, we can still end up with
less-than-perfect fairness. My solution to that is to change wake_up to
have nr_exclusive not obey the try_to_wake_up retval. That should
guarantee exact FIFO then, but it's a minor issue because the requests
shouldn't be released systematically in a flood. So I'm leaving it
open for now; the ones already addressed should be the major issues.

> >>stuff in at the time to try and improve throughput though, so I'll try
> >>it again.
> >>
> >>I think part of the problem is the cascading wakeups from
> >>get_request_wait_wakeup().  So if we wakeup 32 procs they in turn wakeup
> >>another 32, etc.
> >>
> >
> >so maybe it's enough to wakeup count / 2 to account for the double
> >wakeup? that will avoid some overscheduling indeed.
> >
> >
> 
> Andrea, this isn't needed because when the queue falls below

actually the problem is that it isn't enough, not that it isn't needed.
I had to stop get_request_wait_wakeup from messing with the wakeups, so
now I can go back to doing /2.

> the batch limit, every retiring request will do a wake up and
> another request will be put on (as long as the waitqueue is
> active).

this was my argument for doing /2, but that will lead to count * 1.5
wakeups, where the last count/2 will race with further wakeups, screwing
up the FIFO ordering. As said, that's fixed completely now, and the last
issue is the one where a flood of request releases comes faster than the
tasks can become runnable, but it's hopefully a minor issue (I'm
not going to change how wake_up_nr works right now, maybe later).

Andrea

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls
  2003-06-12  1:29                                       ` Andrea Arcangeli
@ 2003-06-12  1:37                                         ` Andrea Arcangeli
  2003-06-12  2:22                                         ` Chris Mason
  1 sibling, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2003-06-12  1:37 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Chris Mason, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller

On Thu, Jun 12, 2003 at 03:29:51AM +0200, Andrea Arcangeli wrote:
> static void get_request_wait_wakeup(request_queue_t *q, int rw)
> {
> 	/*
> 	 * avoid losing an unplug if a second __get_request_wait did the
> 	 * generic_unplug_device while our __get_request_wait was
> 	 * running
> 	 * w/o the queue_lock held and w/ our request out of the queue.
> 	 */
> 	if (waitqueue_active(&q->wait_for_requests))
> 		run_task_queue(&tq_disk);

btw, that was the old version, Chris did it right
s/run_task_queue(&tq_disk)/__generic_unplug_device(q)/
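
With that substitution (and the per-rw waitqueue of Chris' patch), the
helper would read roughly like this, a sketch of the corrected version
rather than a quote from either tree:

static void get_request_wait_wakeup(request_queue_t *q, int rw)
{
	/*
	 * instead of waking a waiter directly (which cascades), just make
	 * sure the device is unplugged so the queued requests complete and
	 * the release path does the wakeups.
	 */
	if (waitqueue_active(&q->wait_for_requests[rw]))
		__generic_unplug_device(q);
}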

Andrea

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls
  2003-06-12  1:29                                       ` Andrea Arcangeli
  2003-06-12  1:37                                         ` Andrea Arcangeli
@ 2003-06-12  2:22                                         ` Chris Mason
  2003-06-12  2:41                                           ` Nick Piggin
  1 sibling, 1 reply; 109+ messages in thread
From: Chris Mason @ 2003-06-12  2:22 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nick Piggin, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller

On Wed, 2003-06-11 at 21:29, Andrea Arcangeli wrote:

> this will avoid get_request_wait_wakeup to mess the wakeup, so we can
> wakep_nr(rq.count) safely.
> 
> then there's the last issue raised by Chris, that is if we get request
> released faster than the tasks can run, still we can generate a not
> perfect fairness. My solution to that is to change wake_up to have a
> nr_exclusive not obeying to the try_to_wakeup retval. that should
> guarantee exact FIFO then, but it's a minor issue because the requests
> shouldn't be released systematically in a flood. So I'm leaving it
> opened for now, the others already addressed should be the major ones.

I think the only time we really need to wake up more than one waiter is
when we hit the q->batch_requests mark.  After that, each new request
that is freed can be matched with a single waiter, and we know that any
previously finished requests have probably already been matched to their
own waiters.
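
A sketch of what that could look like in blkdev_release_request,
illustrative only and reusing the fields from the patch earlier in this
thread (not from any posted patch):

	list_add(&req->queue, &q->rq[rw].free);
	q->rq[rw].count++;
	if (q->rq[rw].count == q->batch_requests) {
		/* just crossed the batch mark: release a batch of waiters */
		wake_up_nr(&q->wait_for_requests[rw], q->batch_requests);
	} else if (q->rq[rw].count > q->batch_requests &&
		   waitqueue_active(&q->wait_for_requests[rw])) {
		/* above the mark: match each freed request with one waiter */
		wake_up(&q->wait_for_requests[rw]);
	}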

-chris




^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls
  2003-06-12  2:22                                         ` Chris Mason
@ 2003-06-12  2:41                                           ` Nick Piggin
  2003-06-12  2:46                                             ` Andrea Arcangeli
  2003-06-12 11:57                                             ` Chris Mason
  0 siblings, 2 replies; 109+ messages in thread
From: Nick Piggin @ 2003-06-12  2:41 UTC (permalink / raw)
  To: Chris Mason
  Cc: Andrea Arcangeli, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller



Chris Mason wrote:

>On Wed, 2003-06-11 at 21:29, Andrea Arcangeli wrote:
>
>
>>this will avoid get_request_wait_wakeup to mess the wakeup, so we can
>>wakep_nr(rq.count) safely.
>>
>>then there's the last issue raised by Chris, that is if we get request
>>released faster than the tasks can run, still we can generate a not
>>perfect fairness. My solution to that is to change wake_up to have a
>>nr_exclusive not obeying to the try_to_wakeup retval. that should
>>guarantee exact FIFO then, but it's a minor issue because the requests
>>shouldn't be released systematically in a flood. So I'm leaving it
>>opened for now, the others already addressed should be the major ones.
>>
>
>I think the only time we really need to wakeup more than one waiter is
>when we hit the q->batch_request mark.  After that, each new request
>that is freed can be matched with a single waiter, and we know that any
>previously finished requests have probably already been matched to their
>own waiter.
>
>
Nope. Not even then. Each retiring request should submit
a wake up, and the woken process will submit another request.
So the number of free requests will be held at the batch_requests
mark until there are no more waiters.

Now that begs the question: why have batch_requests anymore?
It no longer does anything.


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls
  2003-06-12  2:41                                           ` Nick Piggin
@ 2003-06-12  2:46                                             ` Andrea Arcangeli
  2003-06-12  2:49                                               ` Nick Piggin
  2003-06-25 19:03                                               ` Chris Mason
  2003-06-12 11:57                                             ` Chris Mason
  1 sibling, 2 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2003-06-12  2:46 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Chris Mason, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller

On Thu, Jun 12, 2003 at 12:41:58PM +1000, Nick Piggin wrote:
> 
> 
> Chris Mason wrote:
> 
> >On Wed, 2003-06-11 at 21:29, Andrea Arcangeli wrote:
> >
> >
> >>this will avoid get_request_wait_wakeup to mess the wakeup, so we can
> >>wakep_nr(rq.count) safely.
> >>
> >>then there's the last issue raised by Chris, that is if we get request
> >>released faster than the tasks can run, still we can generate a not
> >>perfect fairness. My solution to that is to change wake_up to have a
> >>nr_exclusive not obeying to the try_to_wakeup retval. that should
> >>guarantee exact FIFO then, but it's a minor issue because the requests
> >>shouldn't be released systematically in a flood. So I'm leaving it
> >>opened for now, the others already addressed should be the major ones.
> >>
> >
> >I think the only time we really need to wakeup more than one waiter is
> >when we hit the q->batch_request mark.  After that, each new request
> >that is freed can be matched with a single waiter, and we know that any
> >previously finished requests have probably already been matched to their
> >own waiter.
> >
> >
> Nope. Not even then. Each retiring request should submit
> a wake up, and the process will submit another request.
> So the number of requests will be held at the batch_request
> mark until no more waiters.
> 
> Now that begs the question, why have batch_requests anymore?
> It no longer does anything.

it does nothing w/ _exclusive and w/o the wake_up_nr, that's why I added
the wake_up_nr.

Andrea

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls
  2003-06-12  2:46                                             ` Andrea Arcangeli
@ 2003-06-12  2:49                                               ` Nick Piggin
  2003-06-12  2:51                                                 ` Nick Piggin
  2003-06-12  2:58                                                 ` Andrea Arcangeli
  2003-06-25 19:03                                               ` Chris Mason
  1 sibling, 2 replies; 109+ messages in thread
From: Nick Piggin @ 2003-06-12  2:49 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Chris Mason, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller



Andrea Arcangeli wrote:

>On Thu, Jun 12, 2003 at 12:41:58PM +1000, Nick Piggin wrote:
>
>>
>>Chris Mason wrote:
>>
>>
>>>On Wed, 2003-06-11 at 21:29, Andrea Arcangeli wrote:
>>>
>>>
>>>
>>>>this will avoid get_request_wait_wakeup to mess the wakeup, so we can
>>>>wakep_nr(rq.count) safely.
>>>>
>>>>then there's the last issue raised by Chris, that is if we get request
>>>>released faster than the tasks can run, still we can generate a not
>>>>perfect fairness. My solution to that is to change wake_up to have a
>>>>nr_exclusive not obeying to the try_to_wakeup retval. that should
>>>>guarantee exact FIFO then, but it's a minor issue because the requests
>>>>shouldn't be released systematically in a flood. So I'm leaving it
>>>>opened for now, the others already addressed should be the major ones.
>>>>
>>>>
>>>I think the only time we really need to wakeup more than one waiter is
>>>when we hit the q->batch_request mark.  After that, each new request
>>>that is freed can be matched with a single waiter, and we know that any
>>>previously finished requests have probably already been matched to their
>>>own waiter.
>>>
>>>
>>>
>>Nope. Not even then. Each retiring request should submit
>>a wake up, and the process will submit another request.
>>So the number of requests will be held at the batch_request
>>mark until no more waiters.
>>
>>Now that begs the question, why have batch_requests anymore?
>>It no longer does anything.
>>
>
>it does nothing w/ _exclusive and w/o the wake_up_nr, that's why I added
>the wake_up_nr.
>
>
That is pretty pointless as well. You might as well just start
waking up at the queue full limit, and wake one at a time.

The purpose for batch_requests was I think for devices with a
very small request size, to reduce context switches.

>Andrea
>
>  
>


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls
  2003-06-12  2:49                                               ` Nick Piggin
@ 2003-06-12  2:51                                                 ` Nick Piggin
  2003-06-12  2:52                                                   ` Nick Piggin
  2003-06-12  3:04                                                   ` Andrea Arcangeli
  2003-06-12  2:58                                                 ` Andrea Arcangeli
  1 sibling, 2 replies; 109+ messages in thread
From: Nick Piggin @ 2003-06-12  2:51 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrea Arcangeli, Chris Mason, Marc-Christian Petersen,
	Jens Axboe, Marcelo Tosatti, Georg Nikodym, lkml,
	Matthias Mueller



Nick Piggin wrote:

>
>
> Andrea Arcangeli wrote:
>
>> On Thu, Jun 12, 2003 at 12:41:58PM +1000, Nick Piggin wrote:
>>
>>>
>>> Chris Mason wrote:
>>>
>>>
>>>> On Wed, 2003-06-11 at 21:29, Andrea Arcangeli wrote:
>>>>
>>>>
>>>>
>>>>> this will avoid get_request_wait_wakeup to mess the wakeup, so we can
>>>>> wakep_nr(rq.count) safely.
>>>>>
>>>>> then there's the last issue raised by Chris, that is if we get 
>>>>> request
>>>>> released faster than the tasks can run, still we can generate a not
>>>>> perfect fairness. My solution to that is to change wake_up to have a
>>>>> nr_exclusive not obeying to the try_to_wakeup retval. that should
>>>>> guarantee exact FIFO then, but it's a minor issue because the 
>>>>> requests
>>>>> shouldn't be released systematically in a flood. So I'm leaving it
>>>>> opened for now, the others already addressed should be the major 
>>>>> ones.
>>>>>
>>>>>
>>>> I think the only time we really need to wakeup more than one waiter is
>>>> when we hit the q->batch_request mark.  After that, each new request
>>>> that is freed can be matched with a single waiter, and we know that 
>>>> any
>>>> previously finished requests have probably already been matched to 
>>>> their
>>>> own waiter.
>>>>
>>>>
>>>>
>>> Nope. Not even then. Each retiring request should submit
>>> a wake up, and the process will submit another request.
>>> So the number of requests will be held at the batch_request
>>> mark until no more waiters.
>>>
>>> Now that begs the question, why have batch_requests anymore?
>>> It no longer does anything.
>>>
>>
>> it does nothing w/ _exclusive and w/o the wake_up_nr, that's why I added
>> the wake_up_nr.
>>
>>
> That is pretty pointless as well. You might as well just start
> waking up at the queue full limit, and wake one at a time.
>
> The purpose for batch_requests was I think for devices with a
> very small request size, to reduce context switches.


I guess you could fix this by having a "last woken" flag, and
allowing that process to allocate requests without blocking, from
the batch limit up to the queue full limit. That is how
batch_requests is supposed to work.


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls
  2003-06-12  2:51                                                 ` Nick Piggin
@ 2003-06-12  2:52                                                   ` Nick Piggin
  2003-06-12  3:04                                                   ` Andrea Arcangeli
  1 sibling, 0 replies; 109+ messages in thread
From: Nick Piggin @ 2003-06-12  2:52 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrea Arcangeli, Chris Mason, Marc-Christian Petersen,
	Jens Axboe, Marcelo Tosatti, Georg Nikodym, lkml,
	Matthias Mueller



Nick Piggin wrote:

>
> I guess you could fix this by having a "last woken" flag, and
> allow that process to allocate requests without blocking from
> the batch limit until the queue full limit. That is how
> batch_requests is supposed to work.


s/flag/pid maybe?
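
Combining the two ("last woken" plus pid), a purely hypothetical sketch;
the q->last_woken field is invented for illustration and is not in any
posted patch:

static inline struct request *get_request(request_queue_t *q, int rw)
{
	/*
	 * the task we last woke is allowed to keep allocating (batching)
	 * even while the queue is marked full; everyone else still honors
	 * queue_full().  The wakeup path would record the woken task's pid
	 * in q->last_woken[rw].
	 */
	if (queue_full(q, rw) && q->last_woken[rw] != current->pid)
		return NULL;
	return __get_request(q, rw);
}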


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls
  2003-06-12  2:49                                               ` Nick Piggin
  2003-06-12  2:51                                                 ` Nick Piggin
@ 2003-06-12  2:58                                                 ` Andrea Arcangeli
  2003-06-12  3:04                                                   ` Nick Piggin
  1 sibling, 1 reply; 109+ messages in thread
From: Andrea Arcangeli @ 2003-06-12  2:58 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Chris Mason, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller

On Thu, Jun 12, 2003 at 12:49:46PM +1000, Nick Piggin wrote:
> 
> 
> Andrea Arcangeli wrote:
> 
> >On Thu, Jun 12, 2003 at 12:41:58PM +1000, Nick Piggin wrote:
> >
> >>
> >>Chris Mason wrote:
> >>
> >>
> >>>On Wed, 2003-06-11 at 21:29, Andrea Arcangeli wrote:
> >>>
> >>>
> >>>
> >>>>this will avoid get_request_wait_wakeup to mess the wakeup, so we can
> >>>>wakep_nr(rq.count) safely.
> >>>>
> >>>>then there's the last issue raised by Chris, that is if we get request
> >>>>released faster than the tasks can run, still we can generate a not
> >>>>perfect fairness. My solution to that is to change wake_up to have a
> >>>>nr_exclusive not obeying to the try_to_wakeup retval. that should
> >>>>guarantee exact FIFO then, but it's a minor issue because the requests
> >>>>shouldn't be released systematically in a flood. So I'm leaving it
> >>>>opened for now, the others already addressed should be the major ones.
> >>>>
> >>>>
> >>>I think the only time we really need to wakeup more than one waiter is
> >>>when we hit the q->batch_request mark.  After that, each new request
> >>>that is freed can be matched with a single waiter, and we know that any
> >>>previously finished requests have probably already been matched to their
> >>>own waiter.
> >>>
> >>>
> >>>
> >>Nope. Not even then. Each retiring request should submit
> >>a wake up, and the process will submit another request.
> >>So the number of requests will be held at the batch_request
> >>mark until no more waiters.
> >>
> >>Now that begs the question, why have batch_requests anymore?
> >>It no longer does anything.
> >>
> >
> >it does nothing w/ _exclusive and w/o the wake_up_nr, that's why I added
> >the wake_up_nr.
> >
> >
> That is pretty pointless as well. You might as well just start
> waking up at the queue full limit, and wake one at a time.
> 
> The purpose for batch_requests was I think for devices with a
> very small request size, to reduce context switches.

batch_requests, at least in my tree, matters only when each request is
512 bytes and you have some thousands of them composing a 4M queue or so.
To maximize cpu cache usage etc., I try not to wake up a task every 512 bytes
written, but every 32*512 bytes written or so. Of course w/o the
wake_up_nr that I added, that wasn't really working w/ the _exclusive
wakeup.

If you check my tree you'll see that for sequential I/O with 512k in
each request (not 512 bytes!) batch_requests is already a noop.

Andrea

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls
  2003-06-12  2:51                                                 ` Nick Piggin
  2003-06-12  2:52                                                   ` Nick Piggin
@ 2003-06-12  3:04                                                   ` Andrea Arcangeli
  1 sibling, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2003-06-12  3:04 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Chris Mason, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller

On Thu, Jun 12, 2003 at 12:51:30PM +1000, Nick Piggin wrote:
> I guess you could fix this by having a "last woken" flag, and
> allow that process to allocate requests without blocking from
> the batch limit until the queue full limit. That is how
> batch_requests is supposed to work.

I see what you mean; I did care about the case of each request belonging
to a different task, but of course this doesn't work if there's just one
task. In that case there will be a single wakeup and one for each
request, so it won't be able to eat all the requests and it'll keep
hanging on the full bitflag. So yes, the ->full bit partly disabled the
batch sectors in the presence of only 1 task. With multiple tasks and the
wake_up_nr, batch_sectors will still work. However, I don't care about
that right now ;), it's a minor issue I guess: single-task I/O normally
doesn't seek heavily, so more likely it will run into the oversized queue
before being able to take advantage of the batch sectors.

Andrea

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls
  2003-06-12  2:58                                                 ` Andrea Arcangeli
@ 2003-06-12  3:04                                                   ` Nick Piggin
  2003-06-12  3:12                                                     ` Andrea Arcangeli
  0 siblings, 1 reply; 109+ messages in thread
From: Nick Piggin @ 2003-06-12  3:04 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Chris Mason, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller



Andrea Arcangeli wrote:

>On Thu, Jun 12, 2003 at 12:49:46PM +1000, Nick Piggin wrote:
>
>>
>>Andrea Arcangeli wrote:
>>
>>>it does nothing w/ _exclusive and w/o the wake_up_nr, that's why I added
>>>the wake_up_nr.
>>>
>>>
>>>
>>That is pretty pointless as well. You might as well just start
>>waking up at the queue full limit, and wake one at a time.
>>
>>The purpose for batch_requests was I think for devices with a
>>very small request size, to reduce context switches.
>>
>
>batch_requests at least in my tree matters only when each request is
>512btyes and you've some thousand of them to compose a 4M queue or so.
>To maximize cpu cache usage etc.. I try to wakeup a task every 512bytes
>written, but every 32*512bytes written or so. Of course w/o the
>wake_up_nr that I added, that wasn't really working w/ the _exlusive
>wakeup.
>
>if you check my tree you'll see that for sequential I/O with 512k in
>each request (not 512bytes!) batch_requests is already a noop.
>


You are waking up multiple tasks which will each submit
1 request. You want to be waking up 1 task which will
submit multiple requests - that is how you will save
context switches, cpu cache, etc, and that task's requests
will have a much better chance of being merged, or at
least serviced as a nice batch than unrelated tasks.


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls
  2003-06-12  3:04                                                   ` Nick Piggin
@ 2003-06-12  3:12                                                     ` Andrea Arcangeli
  2003-06-12  3:20                                                       ` Nick Piggin
  0 siblings, 1 reply; 109+ messages in thread
From: Andrea Arcangeli @ 2003-06-12  3:12 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Chris Mason, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller

On Thu, Jun 12, 2003 at 01:04:27PM +1000, Nick Piggin wrote:
> 
> 
> Andrea Arcangeli wrote:
> 
> >On Thu, Jun 12, 2003 at 12:49:46PM +1000, Nick Piggin wrote:
> >
> >>
> >>Andrea Arcangeli wrote:
> >>
> >>>it does nothing w/ _exclusive and w/o the wake_up_nr, that's why I added
> >>>the wake_up_nr.
> >>>
> >>>
> >>>
> >>That is pretty pointless as well. You might as well just start
> >>waking up at the queue full limit, and wake one at a time.
> >>
> >>The purpose for batch_requests was I think for devices with a
> >>very small request size, to reduce context switches.
> >>
> >
> >batch_requests at least in my tree matters only when each request is
> >512btyes and you've some thousand of them to compose a 4M queue or so.
> >To maximize cpu cache usage etc.. I try to wakeup a task every 512bytes
> >written, but every 32*512bytes written or so. Of course w/o the
> >wake_up_nr that I added, that wasn't really working w/ the _exlusive
> >wakeup.
> >
> >if you check my tree you'll see that for sequential I/O with 512k in
> >each request (not 512bytes!) batch_requests is already a noop.
> >
> 
> 
> You are waking up multiple tasks which will each submit
> 1 request. You want to be waking up 1 task which will
> submit multiple requests - that is how you will save
> context switches, cpu cache, etc, and that task's requests
> will have a much better chance of being merged, or at
> least serviced as a nice batch than unrelated tasks.

for fairness reasons, if there are multiple tasks I want to wake them
all and let the others be able to eat requests before the first
allocates all the batch_sectors. So the current code is fine, and
batch_sectors still works fine with multiple tasks queued in the
waitqueue; it still makes sense to wake more than one of them at the
same time to improve cpu utilization (regardless of them being different
tasks, for instance we take the waitqueue spinlocks less frequently
etc.).

What we disabled is only the batch_sectors as applied to a single
task, so if for example there's just 1 single task, we could let it go,
but it's quite a special case; if for example there were two tasks,
we wouldn't want to let them go ahead (unless we could distribute exactly
count/2 requests to each task w/o reentering __get_request_wait, which
is unlikely). So the current code looks ok to me with the
wake_up_nr to take advantage of the batch_sectors across different
tasks, still w/o penalizing fairness.

Andrea

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls
  2003-06-12  3:12                                                     ` Andrea Arcangeli
@ 2003-06-12  3:20                                                       ` Nick Piggin
  2003-06-12  3:33                                                         ` Andrea Arcangeli
  2003-06-12 16:06                                                         ` Chris Mason
  0 siblings, 2 replies; 109+ messages in thread
From: Nick Piggin @ 2003-06-12  3:20 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Chris Mason, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller



Andrea Arcangeli wrote:

>On Thu, Jun 12, 2003 at 01:04:27PM +1000, Nick Piggin wrote:
>
>>
>>Andrea Arcangeli wrote:
>>
>>
>>>On Thu, Jun 12, 2003 at 12:49:46PM +1000, Nick Piggin wrote:
>>>
>>>
>>>>Andrea Arcangeli wrote:
>>>>
>>>>
>>>>>it does nothing w/ _exclusive and w/o the wake_up_nr, that's why I added
>>>>>the wake_up_nr.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>That is pretty pointless as well. You might as well just start
>>>>waking up at the queue full limit, and wake one at a time.
>>>>
>>>>The purpose for batch_requests was I think for devices with a
>>>>very small request size, to reduce context switches.
>>>>
>>>>
>>>batch_requests at least in my tree matters only when each request is
>>>512btyes and you've some thousand of them to compose a 4M queue or so.
>>>To maximize cpu cache usage etc.. I try to wakeup a task every 512bytes
>>>written, but every 32*512bytes written or so. Of course w/o the
>>>wake_up_nr that I added, that wasn't really working w/ the _exlusive
>>>wakeup.
>>>
>>>if you check my tree you'll see that for sequential I/O with 512k in
>>>each request (not 512bytes!) batch_requests is already a noop.
>>>
>>>
>>
>>You are waking up multiple tasks which will each submit
>>1 request. You want to be waking up 1 task which will
>>submit multiple requests - that is how you will save
>>context switches, cpu cache, etc, and that task's requests
>>will have a much better chance of being merged, or at
>>least serviced as a nice batch than unrelated tasks.
>>
>
>for fairness reasons if there are multiple tasks, I want to wake them
>all and let the others be able to eat requests before the first
>allocates all the batch_sectors. So the current code is fine and
>batch_sectors still works fine with multiple tasks queued in the
>waitqueue, it still makes sense to wake more than one of them at the
>same time to improve cpu utilization (regardless they're different
>tasks, for istance we take less frequently the waitqueue spinlocks
>etc..).
>

It's no less fair this way: tasks will still be woken in fifo
order. They will just be given the chance to submit a batch
of requests.

I think the cpu utilization gain of waking a number of tasks
at once would be outweighed by the advantage of waking 1 task
and not putting it to sleep again for a number of requests.
You obviously are not claiming concurrency improvements, as
your method would also increase contention on the io lock
(or the queue lock in 2.5).

Then you have the cache gains of running each task for a
longer period of time. You also get possible IO scheduling
improvements.

Consider 8 requests, batch_requests at 4, 10 tasks writing
to different areas of disk.

Your method still only allows each task to have 1 request in
the elevator at once. Mine allows each to have a run of 4
requests in the elevator.


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls
  2003-06-12  3:20                                                       ` Nick Piggin
@ 2003-06-12  3:33                                                         ` Andrea Arcangeli
  2003-06-12  3:48                                                           ` Nick Piggin
  2003-06-12 16:06                                                         ` Chris Mason
  1 sibling, 1 reply; 109+ messages in thread
From: Andrea Arcangeli @ 2003-06-12  3:33 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Chris Mason, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller

On Thu, Jun 12, 2003 at 01:20:44PM +1000, Nick Piggin wrote:
> Its no less fair this way, tasks will still be woken in fifo
> order. They will just be given the chance to submit a batch
> of requests.

If you change the behaviour with queued_task_nr > batch_requests it is
less fair period. Whatever else thing I don't care about right now
because it is a minor cpu improvement anyways.

I'm not talking about performance, I'm talking about latency and
fairness only. This is the whole point of the ->full logic.

> I think the cpu utilization gain of waking a number of tasks
> at once would be outweighed by advantage of waking 1 task
> and not putting it to sleep again for a number of requests.
> You obviously are not claiming concurrency improvements, as
> your method would also increase contention on the io lock
> (or the queue lock in 2.5).

I'm claiming that with queued_task_nr > batch_requests the
batch_requests logic still has a chance to save some cpu, this is the
only reason I didn't nuke it completely as you suggested some email ago.

> Then you have the cache gains of running each task for a
> longer period of time. You also get possible IO scheduling
> improvements.
> 
> Consider 8 requests, batch_requests at 4, 10 tasks writing
> to different areas of disk.
> 
> Your method still only allows each task to have 1 request in
> the elevator at once. Mine allows each to have a run of 4
> requests in the elevator.

I definitely want 1 request in the elevator at once or we can as well
drop your ->full and return to be unfair. The whole point of ->full is
to get the total fairness, across the tasks in the wait queue, and for
tasks outside the queue calling get_request too. Since not all tasks
will fit in the I/O queue, providing a very fair FIFO in the
wait_for_request is fundamental to provide any sort of latency
guarantee IMHO (the fact an _exclusive wakeup removal that mixes stuff
and probably has the side effect of being more fair, made that much
difference to mainline users kind of confirms that).

Andrea

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls
  2003-06-12  3:33                                                         ` Andrea Arcangeli
@ 2003-06-12  3:48                                                           ` Nick Piggin
  2003-06-12  4:17                                                             ` Andrea Arcangeli
  0 siblings, 1 reply; 109+ messages in thread
From: Nick Piggin @ 2003-06-12  3:48 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Chris Mason, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller



Andrea Arcangeli wrote:

>On Thu, Jun 12, 2003 at 01:20:44PM +1000, Nick Piggin wrote:
>
>>Its no less fair this way, tasks will still be woken in fifo
>>order. They will just be given the chance to submit a batch
>>of requests.
>>
>
>If you change the behaviour with queued_task_nr > batch_requests it is
>less fair period. Whatever else thing I don't care about right now
>because it is a minor cpu improvement anyways.
>
>I'm not talking about performance, I'm talking about latency and
>fairness only. This is the whole point of the ->full logic.
>

I say each task getting 8 requests at a time is as fair
as each getting 1 request at a time. Yes, you may get a
worse latency, but _fairness_ is the same.

>
>>I think the cpu utilization gain of waking a number of tasks
>>at once would be outweighed by advantage of waking 1 task
>>and not putting it to sleep again for a number of requests.
>>You obviously are not claiming concurrency improvements, as
>>your method would also increase contention on the io lock
>>(or the queue lock in 2.5).
>>
>
>I'm claiming that with queued_task_nr > batch_requests the
>batch_requests logic still has a chance to save some cpu, this is the
>only reason I didn't nuke it completely as you suggested some email ago.
>

Well I'm not so sure that your method will do a great deal
of good. On SMP you would increase contention on the spinlock.
IMO it would be better to serialise them on the waitqueue
instead of a spinlock seeing as they are already sleeping.

>
>>Then you have the cache gains of running each task for a
>>longer period of time. You also get possible IO scheduling
>>improvements.
>>
>>Consider 8 requests, batch_requests at 4, 10 tasks writing
>>to different areas of disk.
>>
>>Your method still only allows each task to have 1 request in
>>the elevator at once. Mine allows each to have a run of 4
>>requests in the elevator.
>>
>
>I definitely want 1 request in the elevator at once or we can as well
>drop your ->full and return to be unfair. The whole point of ->full is
>to get the total fairness, across the tasks in the wait queue, and for
>tasks outside the queue calling get_request too. Since not all tasks
>will fit in the I/O queue, providing a very fair FIFO in the
>wait_for_request is fundamental to provide any sort of latency
>guarantee IMHO (the fact an _exclusive wakeup removal that mixes stuff
>and probably has the side effect of being more fair, made that much
>difference to mainline users kind of confirms that).
>
>
I think we'll just have to agree to disagree here. This
sort of thing came up in our CFQ discussion as well. Its
not that I think your way is without merits, but I think
in an overload situation its a better aim to attempt to
keep throughput up rather than attempt to provide the
lowest possible latency.



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls
  2003-06-12  3:48                                                           ` Nick Piggin
@ 2003-06-12  4:17                                                             ` Andrea Arcangeli
  2003-06-12  4:41                                                               ` Nick Piggin
  0 siblings, 1 reply; 109+ messages in thread
From: Andrea Arcangeli @ 2003-06-12  4:17 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Chris Mason, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller

On Thu, Jun 12, 2003 at 01:48:04PM +1000, Nick Piggin wrote:
> 
> 
> Andrea Arcangeli wrote:
> 
> >On Thu, Jun 12, 2003 at 01:20:44PM +1000, Nick Piggin wrote:
> >
> >>Its no less fair this way, tasks will still be woken in fifo
> >>order. They will just be given the chance to submit a batch
> >>of requests.
> >>
> >
> >If you change the behaviour with queued_task_nr > batch_requests it is
> >less fair period. Whatever else thing I don't care about right now
> >because it is a minor cpu improvement anyways.
> >
> >I'm not talking about performance, I'm talking about latency and
> >fairness only. This is the whole point of the ->full logic.
> >
> 
> I say each task getting 8 requests at a time is as fair
> as each getting 1 request at a time. Yes, you may get a
> worse latency, but _fairness_ is the same.

It is the worse latency that is the problem of course.  Fairness in this
case isn't affected (assuming you would write the batch_sectors fair),
but latency would definitely be affected.

> Well I'm not so sure that your method will do a great deal
> of good. On SMP you would increase contention on the spinlock.
> IMO it would be better to serialise them on the waitqueue
> instead of a spinlock seeing as they are already sleeping.

I think the worst part is the cacheline bouncing, but that might happen
anyways under load. At least it certainly makes sense for UP or if
you're lucky and the tasks run serially (possible if all cpus are
running).

> I think we'll just have to agree to disagree here. This
> sort of thing came up in our CFQ discussion as well. Its
> not that I think your way is without merits, but I think
> in an overload situation its a better aim to attempt to
> keep throughput up rather than attempt to provide the
> lowest possible latency.

Those changes (like the CFQ I/O scheduler in 2.5) are being developed
mostly due to the latency complaints we get as feedback on l-k. That's why
I care about latency first here. But we have to care about throughput too
of course. This isn't CFQ, it just tries to provide new requests to
different tasks with the minimal possible latency, which in turn also
guarantees fairness. That sounds like a good default to me, especially when I
see the removal of the _exclusive wakeup in mainline taken as a major
fairness/latency improvement.

Andrea

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls
  2003-06-12  4:17                                                             ` Andrea Arcangeli
@ 2003-06-12  4:41                                                               ` Nick Piggin
  0 siblings, 0 replies; 109+ messages in thread
From: Nick Piggin @ 2003-06-12  4:41 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Chris Mason, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller



Andrea Arcangeli wrote:

>On Thu, Jun 12, 2003 at 01:48:04PM +1000, Nick Piggin wrote:
>
>>
>>Andrea Arcangeli wrote:
>>
>>
>>>On Thu, Jun 12, 2003 at 01:20:44PM +1000, Nick Piggin wrote:
>>>
>>>
>>>>Its no less fair this way, tasks will still be woken in fifo
>>>>order. They will just be given the chance to submit a batch
>>>>of requests.
>>>>
>>>>
>>>If you change the behaviour with queued_task_nr > batch_requests it is
>>>less fair period. Whatever else thing I don't care about right now
>>>because it is a minor cpu improvement anyways.
>>>
>>>I'm not talking about performance, I'm talking about latency and
>>>fairness only. This is the whole point of the ->full logic.
>>>
>>>
>>I say each task getting 8 requests at a time is as fair
>>as each getting 1 request at a time. Yes, you may get a
>>worse latency, but _fairness_ is the same.
>>
>
>It is the worse latency that is the problem of course.  Fairness in this
>case isn't affected (assuming you would write the batch_sectors fair),
>but latency would definitely be affected.
>

Yep.

>
>>Well I'm not so sure that your method will do a great deal
>>of good. On SMP you would increase contention on the spinlock.
>>IMO it would be better to serialise them on the waitqueue
>>instead of a spinlock seeing as they are already sleeping.
>>
>
>I think the worst part is the cacheline bouncing, but that might happen
>anyways under load. At least it certainly makes sense for UP or if
>you're lucky and the tasks run serially (possible if all cpus are
>running).
>
>
>>I think we'll just have to agree to disagree here. This
>>sort of thing came up in our CFQ discussion as well. Its
>>not that I think your way is without merits, but I think
>>in an overload situation its a better aim to attempt to
>>keep throughput up rather than attempt to provide the
>>lowest possible latency.
>>
>
>Those changes (like the CFQ I/O scheduler in 2.5) are being developed
>mostly due to the latency complaints we get as feedback on l-k. That's why
>I care about latency first here. But we have to care about throughput too
>of course. This isn't CFQ, it just tries to provide new requests to
>different tasks with the minimal possible latency, which in turn also
>guarantees fairness. That sounds like a good default to me, especially when I
>see the removal of the _exclusive wakeup in mainline taken as a major
>fairness/latency improvement.
>

Throughput vs latency is always difficult I guess. In this
case, I think when there are few waiters, then latency should
not be much worse. When there are a lot of waiters it is
probably not an interactive load to start with and throughput
is more important.

Anyway, the ideas you are following are interesting and
worthwhile, so we'll each try our own thing :)


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls
  2003-06-12  2:41                                           ` Nick Piggin
  2003-06-12  2:46                                             ` Andrea Arcangeli
@ 2003-06-12 11:57                                             ` Chris Mason
  1 sibling, 0 replies; 109+ messages in thread
From: Chris Mason @ 2003-06-12 11:57 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrea Arcangeli, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller

On Wed, 2003-06-11 at 22:41, Nick Piggin wrote:

> >I think the only time we really need to wakeup more than one waiter is
> >when we hit the q->batch_request mark.  After that, each new request
> >that is freed can be matched with a single waiter, and we know that any
> >previously finished requests have probably already been matched to their
> >own waiter.
> >
> >
> Nope. Not even then. Each retiring request should submit
> a wake up, and the process will submit another request.
> So the number of requests will be held at the batch_request
> mark until no more waiters.
> 
> Now that begs the question, why have batch_requests anymore?
> It no longer does anything.
> 

We've got many flavors of the patch discussed in this thread, so this
needs a little qualification.  When get_request_wait_wakeup wakes one of
the waiters (as in the patch I sent yesterday), you want to make sure
that after you wake the first waiter there is a request available for
the process he is going to wake up, and so on for each other waiter.  

I did a quick test of this yesterday, and under the 20 proc iozone test,
turning off batch_requests more than doubled the number of context
switches hit during the run; I'm assuming this was from wakeups that
failed to find requests.
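
As a rough illustration (a userspace sketch with made-up names, not the
actual block layer code; the real check lives in blkdev_release_request
in the patches in this thread): freed requests only start producing
wakeups once batch_requests of them have built up, and from then on each
completion wakes at most one exclusive waiter, so a woken process nearly
always finds a request instead of going straight back to sleep.

/*
 * Hypothetical sketch of gating wakeups on batch_requests.  Purely
 * illustrative; the kernel does this on the request free list.
 */
#include <stdio.h>

#define BATCH_REQUESTS	32

struct queue {
	int free_count;		/* requests sitting on the free list */
	int nr_waiters;		/* tasks asleep in __get_request_wait() */
};

/* a request completes and goes back on the free list */
static int release_request(struct queue *q)
{
	q->free_count++;

	/*
	 * Below the threshold, let requests pile up instead of waking
	 * anyone; above it, one freed request pays for one wakeup, and
	 * the woken (exclusive) waiter immediately takes a request.
	 */
	if (q->free_count >= BATCH_REQUESTS && q->nr_waiters > 0) {
		q->nr_waiters--;
		q->free_count--;
		return 1;	/* a wakeup that found a request */
	}
	return 0;
}

int main(void)
{
	struct queue q = { 0, 100 };
	int completions = 200, wakeups = 0;
	int i;

	for (i = 0; i < completions; i++)
		wakeups += release_request(&q);

	printf("%d completions -> %d useful wakeups, %d waiters left\n",
	       completions, wakeups, q.nr_waiters);
	return 0;
}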

I'm doing a few tests with Andrea's new get_request_wait_wakeup ideas
and wake_up_nr.

-chris



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls
  2003-06-12  3:20                                                       ` Nick Piggin
  2003-06-12  3:33                                                         ` Andrea Arcangeli
@ 2003-06-12 16:06                                                         ` Chris Mason
  2003-06-12 16:16                                                           ` Nick Piggin
  1 sibling, 1 reply; 109+ messages in thread
From: Chris Mason @ 2003-06-12 16:06 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrea Arcangeli, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller

[-- Attachment #1: Type: text/plain, Size: 1782 bytes --]

On Wed, 2003-06-11 at 23:20, Nick Piggin wrote:

> 
> I think the cpu utilization gain of waking a number of tasks
> at once would be outweighed by advantage of waking 1 task
> and not putting it to sleep again for a number of requests.
> You obviously are not claiming concurrency improvements, as
> your method would also increase contention on the io lock
> (or the queue lock in 2.5).

I've been trying variations on this for a few days, none have been
thrilling but the end result is better dbench and iozone throughput
overall.  For the 20 writer iozone test, rc7 got an average throughput
of 3MB/s, and yesterday's latency patch got 500k/s or so.  Ouch.

This gets us up to 1.2MB/s.  I'm keeping yesterday's
get_request_wait_wakeup, which wakes up a waiter instead of unplugging.

The basic idea here is that after a process is woken up and grabs a
request, he becomes the batch owner.  Batch owners get to ignore the
q->full flag for either 1/5 second or 32 requests, whichever comes
first.  The timer part is an attempt at preventing memory pressure
writers (who go 1 req at a time) from holding onto batch ownership for
too long.  Latency stats after dbench 50:

device 08:01: num_req 120077, total jiffies waited 663231
        65538 forced to wait
        1 min wait, 175 max wait
        10 average wait
        65296 < 100, 242 < 200, 0 < 300, 0 < 400, 0 < 500
        0 waits longer than 500 jiffies

Good latency system wide comes from fair waiting, but it also comes from
how fast we can run write_some_buffers(), since that is the unit of
throttling.  Hopefully this patch decreases the time it takes for
write_some_buffers over the past latency patches, or gives someone else
a better idea ;-)

Attached is an incremental over yesterday's io-stalls-5.diff.

-chris


[-- Attachment #2: io-stalls-6-inc.diff --]
[-- Type: text/plain, Size: 3421 bytes --]

diff -u edited/drivers/block/ll_rw_blk.c edited/drivers/block/ll_rw_blk.c
--- edited/drivers/block/ll_rw_blk.c	Wed Jun 11 13:36:10 2003
+++ edited/drivers/block/ll_rw_blk.c	Thu Jun 12 11:53:03 2003
@@ -437,6 +437,12 @@
 	nr_requests = 128;
 	if (megs < 32)
 		nr_requests /= 2;
+	q->batch_owner[0] = NULL;
+	q->batch_owner[1] = NULL;
+	q->batch_remaining[0] = 0;
+	q->batch_remaining[1] = 0;
+	q->batch_jiffies[0] = 0;
+	q->batch_jiffies[1] = 0;
 	blk_grow_request_list(q, nr_requests);
 
 	init_waitqueue_head(&q->wait_for_requests[0]);
@@ -558,6 +564,31 @@
 	blk_queue_bounce_limit(q, BLK_BOUNCE_HIGH);
 }
 
+#define BATCH_JIFFIES (HZ/5)
+static void check_batch_owner(request_queue_t *q, int rw)
+{
+	if (q->batch_owner[rw] != current)
+		return;
+	if (--q->batch_remaining[rw] > 0 && 
+	    jiffies - q->batch_jiffies[rw] < BATCH_JIFFIES) {
+		return;
+	}
+	q->batch_owner[rw] = NULL;
+}
+
+static void set_batch_owner(request_queue_t *q, int rw)
+{
+	struct task_struct *tsk = current;
+	if (q->batch_owner[rw] == tsk)
+		return;
+	if (q->batch_owner[rw] && 
+	    jiffies - q->batch_jiffies[rw] < BATCH_JIFFIES)
+		return;
+	q->batch_jiffies[rw] = jiffies;
+	q->batch_owner[rw] = current;
+	q->batch_remaining[rw] = q->batch_requests;
+}
+
 #define blkdev_free_rq(list) list_entry((list)->next, struct request, queue);
 /*
  * Get a free request. io_request_lock must be held and interrupts
@@ -587,9 +618,13 @@
  */
 static inline struct request *get_request(request_queue_t *q, int rw)
 {
-	if (queue_full(q, rw))
+	struct request *rq;
+	if (queue_full(q, rw) && q->batch_owner[rw] != current)
 		return NULL;
-	return __get_request(q, rw);
+	rq = __get_request(q, rw);
+	if (rq)
+		check_batch_owner(q, rw);
+	return rq;
 }
 
 /* 
@@ -657,9 +692,9 @@
 
 	add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
 
+	spin_lock_irq(&io_request_lock);
 	do {
 		set_current_state(TASK_UNINTERRUPTIBLE);
-		spin_lock_irq(&io_request_lock);
 		if (queue_full(q, rw) || q->rq[rw].count == 0) {
 			if (q->rq[rw].count == 0)
 				__generic_unplug_device(q);
@@ -668,8 +703,9 @@
 			spin_lock_irq(&io_request_lock);
 		}
 		rq = __get_request(q, rw);
-		spin_unlock_irq(&io_request_lock);
 	} while (rq == NULL);
+	set_batch_owner(q, rw);
+	spin_unlock_irq(&io_request_lock);
 	remove_wait_queue(&q->wait_for_requests[rw], &wait);
 	current->state = TASK_RUNNING;
 
@@ -1010,6 +1046,7 @@
 	struct list_head *head, *insert_here;
 	int latency;
 	elevator_t *elevator = &q->elevator;
+	int need_unplug = 0;
 
 	count = bh->b_size >> 9;
 	sector = bh->b_rsector;
@@ -1145,8 +1182,8 @@
 				spin_unlock_irq(&io_request_lock);
 				freereq = __get_request_wait(q, rw);
 				head = &q->queue_head;
+				need_unplug = 1;
 				spin_lock_irq(&io_request_lock);
-				get_request_wait_wakeup(q, rw);
 				goto again;
 			}
 		}
@@ -1174,6 +1211,8 @@
 out:
 	if (freereq)
 		blkdev_release_request(freereq);
+	if (need_unplug)
+		get_request_wait_wakeup(q, rw);
 	spin_unlock_irq(&io_request_lock);
 	return 0;
 end_io:
diff -u edited/include/linux/blkdev.h edited/include/linux/blkdev.h
--- edited/include/linux/blkdev.h	Wed Jun 11 09:56:55 2003
+++ edited/include/linux/blkdev.h	Thu Jun 12 09:44:26 2003
@@ -92,6 +92,10 @@
 	 */
 	int batch_requests;
 
+	struct task_struct *batch_owner[2];
+	int		    batch_remaining[2];
+	unsigned long       batch_jiffies[2];
+
 	/*
 	 * Together with queue_head for cacheline sharing
 	 */

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls
  2003-06-12 16:06                                                         ` Chris Mason
@ 2003-06-12 16:16                                                           ` Nick Piggin
  0 siblings, 0 replies; 109+ messages in thread
From: Nick Piggin @ 2003-06-12 16:16 UTC (permalink / raw)
  To: Chris Mason
  Cc: Andrea Arcangeli, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller



Chris Mason wrote:

>On Wed, 2003-06-11 at 23:20, Nick Piggin wrote:
>
>
>>I think the cpu utilization gain of waking a number of tasks
>>at once would be outweighed by advantage of waking 1 task
>>and not putting it to sleep again for a number of requests.
>>You obviously are not claiming concurrency improvements, as
>>your method would also increase contention on the io lock
>>(or the queue lock in 2.5).
>>
>
>I've been trying variations on this for a few days, none have been
>thrilling but the end result is better dbench and iozone throughput
>overall.  For the 20 writer iozone test, rc7 got an average throughput
>of 3MB/s, and yesterdays latency patch got 500k/s or so.  Ouch.
>
>This gets us up to 1.2MB/s.  I'm keeping yesterday's
>get_request_wait_wake, which wakes up a waiter instead of unplugging.
>
>The basic idea here is that after a process is woken up and grabs a
>request, he becomes the batch owner.  Batch owners get to ignore the
>q->full flag for either 1/5 second or 32 requests, whichever comes
>first.  The timer part is an attempt at preventing memory pressure
>writers (who go 1 req at a time) from holding onto batch ownership for
>too long.  Latency stats after dbench 50:
>

Yeah, I get ~50% more throughput and up to 20% better CPU
efficiency on tiobench 256 for sequential and random
writers by doing something similar.



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls
  2003-06-12  2:46                                             ` Andrea Arcangeli
  2003-06-12  2:49                                               ` Nick Piggin
@ 2003-06-25 19:03                                               ` Chris Mason
  2003-06-25 19:25                                                 ` Andrea Arcangeli
  2003-06-26  5:48                                                 ` [PATCH] io stalls Nick Piggin
  1 sibling, 2 replies; 109+ messages in thread
From: Chris Mason @ 2003-06-25 19:03 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nick Piggin, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller

[-- Attachment #1: Type: text/plain, Size: 2624 bytes --]

Hello all,

[ short version, the patch attached should fix io latencies in 2.4.21. 
Please review and/or give it a try ]
 
My last set of patches was directed at reducing the latencies in
__get_request_wait, which really helped reduce stalls when you had lots
of io to one device and balance_dirty() was causing pauses while you
tried to do io to other devices.

But, a streaming write could still starve reads to the same device,
mostly because the read would have to send down any huge merged writes
that were before it in the queue.

Andrea's kernel has a fix for this too, he limits the total number of
sectors that can be in the request queue at any given time.  But, his
patches change blk_finished_io, both in the arguments it takes and the
side effects of calling it.  I don't think we can merge his current form
without breaking external drivers.

So, I added a can_throttle flag to the queue struct, drivers can enable
it if they are going to call the new blk_started_sectors and
blk_finished_sectors funcs any time they call blk_{started,finished}_io,
and these do all the -aa style sector throttling.

There were a few other small changes to Andrea's patch, he wasn't
setting q->full when get_request decided there were too many sectors in
flight.  This resulted in large latencies in __get_request_wait.  He was
also unconditionally clearing q->full in blkdev_release_request, my code
only clears q->full when all the waiters are gone.

I changed generic_unplug_device to zero the elevator_sequence field of
the last request on the queue.  This means there won't be any merges
with requests pending once an unplug is done, and helps limit the number
of sectors that need to be sent down during the run_task_queue(&tq_disk)
in wait_on_buffer.

I lowered the -aa default limit on sectors in flight from 4MB to 2MB. 
We probably want an elvtune for it, large arrays with writeback cache
should be able to tolerate larger values.

There's still a little work left to do, this patch enables sector
throttling for scsi and IDE.  cciss, DAC960 and cpqarray need
modification too (99% done already in -aa).  No sense in doing that
until after the bulk of the patch is reviewed though.

As before, most of the code here is from Andrea and Nick, I've just
wrapped a lot of duct tape around it and done some tweaking.  The
primary pieces are:

fix-pausing (andrea, corner cases where wakeups are missed)
elevator-low-latency (andrea, limit sectors in flight)
queue_full (Nick, fairness in __get_request_wait)

I've removed my latency stats for __get_request_wait in hopes of making
it a better merging candidate.

-chris


[-- Attachment #2: io-stalls-7.diff --]
[-- Type: text/plain, Size: 24826 bytes --]

diff -urN --exclude '*.orig' --exclude '*.rej' parent/drivers/block/ll_rw_blk.c comp/drivers/block/ll_rw_blk.c
--- parent/drivers/block/ll_rw_blk.c	2003-06-25 14:12:09.000000000 -0400
+++ comp/drivers/block/ll_rw_blk.c	2003-06-25 14:11:56.000000000 -0400
@@ -176,11 +176,12 @@
 {
 	int count = q->nr_requests;
 
-	count -= __blk_cleanup_queue(&q->rq[READ]);
-	count -= __blk_cleanup_queue(&q->rq[WRITE]);
+	count -= __blk_cleanup_queue(&q->rq);
 
 	if (count)
 		printk("blk_cleanup_queue: leaked requests (%d)\n", count);
+	if (atomic_read(&q->nr_sectors))
+		printk("blk_cleanup_queue: leaked sectors (%d)\n", atomic_read(&q->nr_sectors));
 
 	memset(q, 0, sizeof(*q));
 }
@@ -215,6 +216,24 @@
 }
 
 /**
+ * blk_queue_throttle_sectors - indicates you will call sector throttling funcs
+ * @q:       The queue which this applies to.
+ * @active:  A flag indication if you want sector throttling on
+ *
+ * Description:
+ * The sector throttling code allows us to put a limit on the number of
+ * sectors pending io to the disk at a given time, sending @active nonzero
+ * indicates you will call blk_started_sectors and blk_finished_sectors in
+ * addition to calling blk_started_io and blk_finished_io in order to
+ * keep track of the number of sectors in flight.
+ **/
+ 
+void blk_queue_throttle_sectors(request_queue_t * q, int active)
+{
+	q->can_throttle = active;
+}
+
+/**
  * blk_queue_make_request - define an alternate make_request function for a device
  * @q:  the request queue for the device to be affected
  * @mfn: the alternate make_request function
@@ -360,8 +379,20 @@
 {
 	if (q->plugged) {
 		q->plugged = 0;
-		if (!list_empty(&q->queue_head))
+		if (!list_empty(&q->queue_head)) {
+			struct request *rq;
+
+			/* we don't want merges later on to come in 
+			 * and significantly increase the amount of
+			 * work during an unplug, it can lead to high
+			 * latencies while some poor waiter tries to
+			 * run an ever increasing chunk of io.
+			 * This does lower throughput some though.
+			 */
+			rq = blkdev_entry_prev_request(&q->queue_head),
+			rq->elevator_sequence = 0;
 			q->request_fn(q);
+		}
 	}
 }
 
@@ -389,7 +420,7 @@
  *
  * Returns the (new) number of requests which the queue has available.
  */
-int blk_grow_request_list(request_queue_t *q, int nr_requests)
+int blk_grow_request_list(request_queue_t *q, int nr_requests, int max_queue_sectors)
 {
 	unsigned long flags;
 	/* Several broken drivers assume that this function doesn't sleep,
@@ -399,21 +430,34 @@
 	spin_lock_irqsave(&io_request_lock, flags);
 	while (q->nr_requests < nr_requests) {
 		struct request *rq;
-		int rw;
 
 		rq = kmem_cache_alloc(request_cachep, SLAB_ATOMIC);
 		if (rq == NULL)
 			break;
 		memset(rq, 0, sizeof(*rq));
 		rq->rq_status = RQ_INACTIVE;
-		rw = q->nr_requests & 1;
-		list_add(&rq->queue, &q->rq[rw].free);
-		q->rq[rw].count++;
+ 		list_add(&rq->queue, &q->rq.free);
+ 		q->rq.count++;
+
 		q->nr_requests++;
 	}
+
+ 	/*
+ 	 * Wakeup waiters after both one quarter of the
+ 	 * max-in-fligh queue and one quarter of the requests
+ 	 * are available again.
+ 	 */
+
 	q->batch_requests = q->nr_requests / 4;
 	if (q->batch_requests > 32)
 		q->batch_requests = 32;
+ 	q->batch_sectors = max_queue_sectors / 4;
+ 
+ 	q->max_queue_sectors = max_queue_sectors;
+ 
+ 	BUG_ON(!q->batch_sectors);
+ 	atomic_set(&q->nr_sectors, 0);
+
 	spin_unlock_irqrestore(&io_request_lock, flags);
 	return q->nr_requests;
 }
@@ -422,23 +466,27 @@
 {
 	struct sysinfo si;
 	int megs;		/* Total memory, in megabytes */
-	int nr_requests;
-
-	INIT_LIST_HEAD(&q->rq[READ].free);
-	INIT_LIST_HEAD(&q->rq[WRITE].free);
-	q->rq[READ].count = 0;
-	q->rq[WRITE].count = 0;
+ 	int nr_requests, max_queue_sectors = MAX_QUEUE_SECTORS;
+  
+ 	INIT_LIST_HEAD(&q->rq.free);
+	q->rq.count = 0;
 	q->nr_requests = 0;
 
 	si_meminfo(&si);
 	megs = si.totalram >> (20 - PAGE_SHIFT);
-	nr_requests = 128;
-	if (megs < 32)
-		nr_requests /= 2;
-	blk_grow_request_list(q, nr_requests);
+ 	nr_requests = MAX_NR_REQUESTS;
+ 	if (megs < 30) {
+  		nr_requests /= 2;
+ 		max_queue_sectors /= 2;
+ 	}
+ 	/* notice early if anybody screwed the defaults */
+ 	BUG_ON(!nr_requests);
+ 	BUG_ON(!max_queue_sectors);
+ 
+ 	blk_grow_request_list(q, nr_requests, max_queue_sectors);
+
+ 	init_waitqueue_head(&q->wait_for_requests);
 
-	init_waitqueue_head(&q->wait_for_requests[0]);
-	init_waitqueue_head(&q->wait_for_requests[1]);
 	spin_lock_init(&q->queue_lock);
 }
 
@@ -491,6 +539,9 @@
 	q->plug_tq.routine	= &generic_unplug_device;
 	q->plug_tq.data		= q;
 	q->plugged        	= 0;
+	q->full			= 0;
+	q->can_throttle		= 0;
+
 	/*
 	 * These booleans describe the queue properties.  We set the
 	 * default (and most common) values here.  Other drivers can
@@ -508,12 +559,13 @@
  * Get a free request. io_request_lock must be held and interrupts
  * disabled on the way in.  Returns NULL if there are no free requests.
  */
-static struct request *get_request(request_queue_t *q, int rw)
+static struct request *__get_request(request_queue_t *q, int rw)
 {
 	struct request *rq = NULL;
-	struct request_list *rl = q->rq + rw;
+	struct request_list *rl;
 
-	if (!list_empty(&rl->free)) {
+	rl = &q->rq;
+	if (!list_empty(&rl->free) && !blk_oversized_queue(q)) {
 		rq = blkdev_free_rq(&rl->free);
 		list_del(&rq->queue);
 		rl->count--;
@@ -521,35 +573,47 @@
 		rq->cmd = rw;
 		rq->special = NULL;
 		rq->q = q;
-	}
-
+	} else
+		q->full = 1;
 	return rq;
 }
 
 /*
- * Here's the request allocation design:
+ * get a free request, honoring the queue_full condition
+ */
+static inline struct request *get_request(request_queue_t *q, int rw)
+{
+	if (q->full)
+		return NULL;
+	return __get_request(q, rw);
+}
+
+/* 
+ * helper func to do memory barriers and wakeups when we finally decide
+ * to clear the queue full condition
+ */
+static inline void clear_full_and_wake(request_queue_t *q)
+{
+	q->full = 0;
+	mb();
+	if (waitqueue_active(&q->wait_for_requests))
+		wake_up(&q->wait_for_requests);
+}
+
+/*
+ * Here's the request allocation design, low latency version:
  *
  * 1: Blocking on request exhaustion is a key part of I/O throttling.
  * 
  * 2: We want to be `fair' to all requesters.  We must avoid starvation, and
  *    attempt to ensure that all requesters sleep for a similar duration.  Hence
  *    no stealing requests when there are other processes waiting.
- * 
- * 3: We also wish to support `batching' of requests.  So when a process is
- *    woken, we want to allow it to allocate a decent number of requests
- *    before it blocks again, so they can be nicely merged (this only really
- *    matters if the process happens to be adding requests near the head of
- *    the queue).
- * 
- * 4: We want to avoid scheduling storms.  This isn't really important, because
- *    the system will be I/O bound anyway.  But it's easy.
- * 
- *    There is tension between requirements 2 and 3.  Once a task has woken,
- *    we don't want to allow it to sleep as soon as it takes its second request.
- *    But we don't want currently-running tasks to steal all the requests
- *    from the sleepers.  We handle this with wakeup hysteresis around
- *    0 .. batch_requests and with the assumption that request taking is much,
- *    much faster than request freeing.
+ *
+ * There used to be more here, attempting to allow a process to send in a
+ * number of requests once it has woken up.  But, there's no way to 
+ * tell if a process has just been woken up, or if it is a new process
+ * coming in to steal requests from the waiters.  So, we give up and force
+ * everyone to wait fairly.
  * 
  * So here's what we do:
  * 
@@ -561,28 +625,23 @@
  * 
  *  When a process wants a new request:
  * 
- *    b) If free_requests == 0, the requester sleeps in FIFO manner.
- * 
- *    b) If 0 <  free_requests < batch_requests and there are waiters,
- *       we still take a request non-blockingly.  This provides batching.
- *
- *    c) If free_requests >= batch_requests, the caller is immediately
- *       granted a new request.
+ *    b) If free_requests == 0, the requester sleeps in FIFO manner, and
+ *       the queue full condition is set.  The full condition is not
+ *       cleared until there are no longer any waiters.  Once the full
+ *       condition is set, all new io must wait, hopefully for a very
+ *       short period of time.
  * 
  *  When a request is released:
  * 
- *    d) If free_requests < batch_requests, do nothing.
- * 
- *    f) If free_requests >= batch_requests, wake up a single waiter.
+ *    c) If free_requests < batch_requests, do nothing.
  * 
- *   The net effect is that when a process is woken at the batch_requests level,
- *   it will be able to take approximately (batch_requests) requests before
- *   blocking again (at the tail of the queue).
- * 
- *   This all assumes that the rate of taking requests is much, much higher
- *   than the rate of releasing them.  Which is very true.
+ *    d) If free_requests >= batch_requests, wake up a single waiter.
  *
- * -akpm, Feb 2002.
+ *   As each waiter gets a request, he wakes another waiter.  We do this
+ *   to prevent a race where an unplug might get run before a request makes
+ *   it's way onto the queue.  The result is a cascade of wakeups, so delaying
+ *   the initial wakeup until we've got batch_requests available helps avoid
+ *   wakeups where there aren't any requests available yet.
  */
 
 static struct request *__get_request_wait(request_queue_t *q, int rw)
@@ -590,21 +649,40 @@
 	register struct request *rq;
 	DECLARE_WAITQUEUE(wait, current);
 
-	add_wait_queue(&q->wait_for_requests[rw], &wait);
+	add_wait_queue_exclusive(&q->wait_for_requests, &wait);
+
 	do {
 		set_current_state(TASK_UNINTERRUPTIBLE);
-		generic_unplug_device(q);
-		if (q->rq[rw].count == 0)
-			schedule();
 		spin_lock_irq(&io_request_lock);
-		rq = get_request(q, rw);
+		if (q->full || blk_oversized_queue(q)) {
+			__generic_unplug_device(q);
+			spin_unlock_irq(&io_request_lock);
+			schedule();
+			spin_lock_irq(&io_request_lock);
+		}
+		rq = __get_request(q, rw);
 		spin_unlock_irq(&io_request_lock);
 	} while (rq == NULL);
-	remove_wait_queue(&q->wait_for_requests[rw], &wait);
+	remove_wait_queue(&q->wait_for_requests, &wait);
 	current->state = TASK_RUNNING;
+
+	if (!waitqueue_active(&q->wait_for_requests))
+		clear_full_and_wake(q);
+
 	return rq;
 }
 
+static void get_request_wait_wakeup(request_queue_t *q, int rw)
+{
+	/*
+	 * avoid losing an unplug if a second __get_request_wait did the
+	 * generic_unplug_device while our __get_request_wait was running
+	 * w/o the queue_lock held and w/ our request out of the queue.
+	 */	
+	if (waitqueue_active(&q->wait_for_requests))
+		wake_up(&q->wait_for_requests);
+}
+
 /* RO fail safe mechanism */
 
 static long ro_bits[MAX_BLKDEV][8];
@@ -818,7 +896,6 @@
 void blkdev_release_request(struct request *req)
 {
 	request_queue_t *q = req->q;
-	int rw = req->cmd;
 
 	req->rq_status = RQ_INACTIVE;
 	req->q = NULL;
@@ -828,9 +905,19 @@
 	 * assume it has free buffers and check waiters
 	 */
 	if (q) {
-		list_add(&req->queue, &q->rq[rw].free);
-		if (++q->rq[rw].count >= q->batch_requests)
-			wake_up(&q->wait_for_requests[rw]);
+		int oversized_batch = 0;
+
+		if (q->can_throttle)
+			oversized_batch = blk_oversized_queue_batch(q);
+		q->rq.count++;
+		list_add(&req->queue, &q->rq.free);
+		if (q->rq.count >= q->batch_requests && !oversized_batch) {
+			smp_mb();
+			if (waitqueue_active(&q->wait_for_requests))
+				wake_up(&q->wait_for_requests);
+			else
+				clear_full_and_wake(q);
+		}
 	}
 }
 
@@ -908,6 +995,7 @@
 	struct list_head *head, *insert_here;
 	int latency;
 	elevator_t *elevator = &q->elevator;
+	int should_wake = 0;
 
 	count = bh->b_size >> 9;
 	sector = bh->b_rsector;
@@ -948,7 +1036,6 @@
 	 */
 	max_sectors = get_max_sectors(bh->b_rdev);
 
-again:
 	req = NULL;
 	head = &q->queue_head;
 	/*
@@ -957,7 +1044,9 @@
 	 */
 	spin_lock_irq(&io_request_lock);
 
+again:
 	insert_here = head->prev;
+
 	if (list_empty(head)) {
 		q->plug_device_fn(q, bh->b_rdev); /* is atomic */
 		goto get_rq;
@@ -976,6 +1065,7 @@
 			req->bhtail = bh;
 			req->nr_sectors = req->hard_nr_sectors += count;
 			blk_started_io(count);
+			blk_started_sectors(req, count);
 			drive_stat_acct(req->rq_dev, req->cmd, count, 0);
 			req_new_io(req, 1, count);
 			attempt_back_merge(q, req, max_sectors, max_segments);
@@ -998,6 +1088,7 @@
 			req->sector = req->hard_sector = sector;
 			req->nr_sectors = req->hard_nr_sectors += count;
 			blk_started_io(count);
+			blk_started_sectors(req, count);
 			drive_stat_acct(req->rq_dev, req->cmd, count, 0);
 			req_new_io(req, 1, count);
 			attempt_front_merge(q, head, req, max_sectors, max_segments);
@@ -1030,7 +1121,7 @@
 		 * See description above __get_request_wait()
 		 */
 		if (rw_ahead) {
-			if (q->rq[rw].count < q->batch_requests) {
+			if (q->rq.count < q->batch_requests || blk_oversized_queue_batch(q)) {
 				spin_unlock_irq(&io_request_lock);
 				goto end_io;
 			}
@@ -1042,6 +1133,9 @@
 			if (req == NULL) {
 				spin_unlock_irq(&io_request_lock);
 				freereq = __get_request_wait(q, rw);
+				head = &q->queue_head;
+				spin_lock_irq(&io_request_lock);
+				should_wake = 1;
 				goto again;
 			}
 		}
@@ -1064,10 +1158,13 @@
 	req->start_time = jiffies;
 	req_new_io(req, 0, count);
 	blk_started_io(count);
+	blk_started_sectors(req, count);
 	add_request(q, req, insert_here);
 out:
 	if (freereq)
 		blkdev_release_request(freereq);
+	if (should_wake)
+		get_request_wait_wakeup(q, rw);
 	spin_unlock_irq(&io_request_lock);
 	return 0;
 end_io:
@@ -1196,8 +1293,15 @@
 	bh->b_rdev = bh->b_dev;
 	bh->b_rsector = bh->b_blocknr * count;
 
+	get_bh(bh);
 	generic_make_request(rw, bh);
 
+	/* fix race condition with wait_on_buffer() */
+	smp_mb(); /* spin_unlock may have inclusive semantics */
+	if (waitqueue_active(&bh->b_wait))
+		wake_up(&bh->b_wait);
+
+	put_bh(bh);
 	switch (rw) {
 		case WRITE:
 			kstat.pgpgout += count;
@@ -1350,6 +1454,7 @@
 	if ((bh = req->bh) != NULL) {
 		nsect = bh->b_size >> 9;
 		blk_finished_io(nsect);
+		blk_finished_sectors(req, nsect);
 		req->bh = bh->b_reqnext;
 		bh->b_reqnext = NULL;
 		bh->b_end_io(bh, uptodate);
@@ -1509,6 +1614,7 @@
 EXPORT_SYMBOL(blk_get_queue);
 EXPORT_SYMBOL(blk_cleanup_queue);
 EXPORT_SYMBOL(blk_queue_headactive);
+EXPORT_SYMBOL(blk_queue_throttle_sectors);
 EXPORT_SYMBOL(blk_queue_make_request);
 EXPORT_SYMBOL(generic_make_request);
 EXPORT_SYMBOL(blkdev_release_request);
diff -urN --exclude '*.orig' --exclude '*.rej' parent/drivers/ide/ide-probe.c comp/drivers/ide/ide-probe.c
--- parent/drivers/ide/ide-probe.c	2003-06-25 14:12:09.000000000 -0400
+++ comp/drivers/ide/ide-probe.c	2003-06-25 14:11:55.000000000 -0400
@@ -971,6 +971,7 @@
 
 	q->queuedata = HWGROUP(drive);
 	blk_init_queue(q, do_ide_request);
+	blk_queue_throttle_sectors(q, 1);
 }
 
 #undef __IRQ_HELL_SPIN
diff -urN --exclude '*.orig' --exclude '*.rej' parent/drivers/scsi/scsi.c comp/drivers/scsi/scsi.c
--- parent/drivers/scsi/scsi.c	2003-06-25 14:12:09.000000000 -0400
+++ comp/drivers/scsi/scsi.c	2003-06-25 14:11:55.000000000 -0400
@@ -197,6 +197,7 @@
 
 	blk_init_queue(q, scsi_request_fn);
 	blk_queue_headactive(q, 0);
+	blk_queue_throttle_sectors(q, 1);
 	q->queuedata = (void *) SDpnt;
 }
 
diff -urN --exclude '*.orig' --exclude '*.rej' parent/drivers/scsi/scsi_lib.c comp/drivers/scsi/scsi_lib.c
--- parent/drivers/scsi/scsi_lib.c	2003-06-25 14:12:09.000000000 -0400
+++ comp/drivers/scsi/scsi_lib.c	2003-06-25 14:11:55.000000000 -0400
@@ -378,6 +378,7 @@
 		if ((bh = req->bh) != NULL) {
 			nsect = bh->b_size >> 9;
 			blk_finished_io(nsect);
+			blk_finished_sectors(req, nsect);
 			req->bh = bh->b_reqnext;
 			bh->b_reqnext = NULL;
 			sectors -= nsect;
diff -urN --exclude '*.orig' --exclude '*.rej' parent/fs/buffer.c comp/fs/buffer.c
--- parent/fs/buffer.c	2003-06-25 14:12:09.000000000 -0400
+++ comp/fs/buffer.c	2003-06-25 14:11:53.000000000 -0400
@@ -153,10 +153,23 @@
 	get_bh(bh);
 	add_wait_queue(&bh->b_wait, &wait);
 	do {
-		run_task_queue(&tq_disk);
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 		if (!buffer_locked(bh))
 			break;
+		/*
+		 * We must read tq_disk in TQ_ACTIVE after the
+		 * add_wait_queue effect is visible to other cpus.
+		 * We could unplug some line above it wouldn't matter
+		 * but we can't do that right after add_wait_queue
+		 * without an smp_mb() in between because spin_unlock
+		 * has inclusive semantics.
+		 * Doing it here is the most efficient place so we
+		 * don't do a suprious unplug if we get a racy
+		 * wakeup that make buffer_locked to return 0, and
+		 * doing it here avoids an explicit smp_mb() we
+		 * rely on the implicit one in set_task_state.
+		 */
+		run_task_queue(&tq_disk);
 		schedule();
 	} while (buffer_locked(bh));
 	tsk->state = TASK_RUNNING;
@@ -1523,6 +1536,9 @@
 
 	/* Done - end_buffer_io_async will unlock */
 	SetPageUptodate(page);
+
+	wakeup_page_waiters(page);
+
 	return 0;
 
 out:
@@ -1554,6 +1570,7 @@
 	} while (bh != head);
 	if (need_unlock)
 		UnlockPage(page);
+	wakeup_page_waiters(page);
 	return err;
 }
 
@@ -1781,6 +1798,8 @@
 		else
 			submit_bh(READ, bh);
 	}
+
+	wakeup_page_waiters(page);
 	
 	return 0;
 }
@@ -2394,6 +2413,7 @@
 		submit_bh(rw, bh);
 		bh = next;
 	} while (bh != head);
+	wakeup_page_waiters(page);
 	return 0;
 }
 
diff -urN --exclude '*.orig' --exclude '*.rej' parent/fs/reiserfs/inode.c comp/fs/reiserfs/inode.c
--- parent/fs/reiserfs/inode.c	2003-06-25 14:12:09.000000000 -0400
+++ comp/fs/reiserfs/inode.c	2003-06-25 14:11:53.000000000 -0400
@@ -2209,6 +2209,7 @@
     */
     if (nr) {
         submit_bh_for_writepage(arr, nr) ;
+	wakeup_page_waiters(page);
     } else {
         UnlockPage(page) ;
     }
diff -urN --exclude '*.orig' --exclude '*.rej' parent/include/linux/blkdev.h comp/include/linux/blkdev.h
--- parent/include/linux/blkdev.h	2003-06-25 14:12:09.000000000 -0400
+++ comp/include/linux/blkdev.h	2003-06-25 14:11:56.000000000 -0400
@@ -64,12 +64,6 @@
 typedef void (plug_device_fn) (request_queue_t *q, kdev_t device);
 typedef void (unplug_device_fn) (void *q);
 
-/*
- * Default nr free requests per queue, ll_rw_blk will scale it down
- * according to available RAM at init time
- */
-#define QUEUE_NR_REQUESTS	8192
-
 struct request_list {
 	unsigned int count;
 	struct list_head free;
@@ -80,7 +74,7 @@
 	/*
 	 * the queue request freelist, one for reads and one for writes
 	 */
-	struct request_list	rq[2];
+	struct request_list	rq;
 
 	/*
 	 * The total number of requests on each queue
@@ -93,6 +87,21 @@
 	int batch_requests;
 
 	/*
+	 * The total number of 512byte blocks on each queue
+	 */
+	atomic_t nr_sectors;
+
+	/*
+	 * Batching threshold for sleep/wakeup decisions
+	 */
+	int batch_sectors;
+
+	/*
+	 * The max number of 512byte blocks on each queue
+	 */
+	int max_queue_sectors;
+
+	/*
 	 * Together with queue_head for cacheline sharing
 	 */
 	struct list_head	queue_head;
@@ -118,13 +127,28 @@
 	/*
 	 * Boolean that indicates whether this queue is plugged or not.
 	 */
-	char			plugged;
+	int			plugged:1;
 
 	/*
 	 * Boolean that indicates whether current_request is active or
 	 * not.
 	 */
-	char			head_active;
+	int			head_active:1;
+
+	/*
+	 * Booleans that indicate whether the queue's free requests have
+	 * been exhausted and is waiting to drop below the batch_requests
+	 * threshold
+	 */
+	int			full:1;
+	
+	/*
+	 * Boolean that indicates you will use blk_started_sectors
+	 * and blk_finished_sectors in addition to blk_started_io
+	 * and blk_finished_io.  It enables the throttling code to 
+	 * help keep the size of the in sectors to a reasonable number
+	 */
+	int			can_throttle:1;
 
 	unsigned long		bounce_pfn;
 
@@ -137,7 +161,7 @@
 	/*
 	 * Tasks wait here for free read and write requests
 	 */
-	wait_queue_head_t	wait_for_requests[2];
+	wait_queue_head_t	wait_for_requests;
 };
 
 #define blk_queue_plugged(q)	(q)->plugged
@@ -217,14 +241,16 @@
 extern void generic_make_request(int rw, struct buffer_head * bh);
 extern inline request_queue_t *blk_get_queue(kdev_t dev);
 extern void blkdev_release_request(struct request *);
+extern void blk_print_stats(kdev_t dev);
 
 /*
  * Access functions for manipulating queue properties
  */
-extern int blk_grow_request_list(request_queue_t *q, int nr_requests);
+extern int blk_grow_request_list(request_queue_t *q, int nr_requests, int max_queue_sectors);
 extern void blk_init_queue(request_queue_t *, request_fn_proc *);
 extern void blk_cleanup_queue(request_queue_t *);
 extern void blk_queue_headactive(request_queue_t *, int);
+extern void blk_queue_throttle_sectors(request_queue_t *, int);
 extern void blk_queue_make_request(request_queue_t *, make_request_fn *);
 extern void generic_unplug_device(void *);
 extern inline int blk_seg_merge_ok(struct buffer_head *, struct buffer_head *);
@@ -243,6 +269,8 @@
 
 #define MAX_SEGMENTS 128
 #define MAX_SECTORS 255
+#define MAX_QUEUE_SECTORS (2 << (20 - 9)) /* 2 mbytes when full sized */
+#define MAX_NR_REQUESTS 1024 /* 1024k when in 512 units, normally min is 1M in 1k units */
 
 #define PageAlignSize(size) (((size) + PAGE_SIZE -1) & PAGE_MASK)
 
@@ -268,9 +296,51 @@
 	return retval;
 }
 
+static inline int blk_oversized_queue(request_queue_t * q)
+{
+	if (q->can_throttle)
+		return atomic_read(&q->nr_sectors) > q->max_queue_sectors;
+	return q->rq.count == 0;
+}
+
+static inline int blk_oversized_queue_batch(request_queue_t * q)
+{
+	return atomic_read(&q->nr_sectors) > q->max_queue_sectors - q->batch_sectors;
+}
+
 #define blk_finished_io(nsects)	do { } while (0)
 #define blk_started_io(nsects)	do { } while (0)
 
+static inline void blk_started_sectors(struct request *rq, int count)
+{
+	request_queue_t *q = rq->q;
+	if (q && q->can_throttle) {
+		atomic_add(count, &q->nr_sectors);
+		if (atomic_read(&q->nr_sectors) < 0) {
+			printk("nr_sectors is %d\n", atomic_read(&q->nr_sectors));
+			BUG();
+		}
+	}
+}
+
+static inline void blk_finished_sectors(struct request *rq, int count)
+{
+	request_queue_t *q = rq->q;
+	if (q && q->can_throttle) {
+		atomic_sub(count, &q->nr_sectors);
+		
+		smp_mb();
+		if (q->rq.count >= q->batch_requests && !blk_oversized_queue_batch(q)) {
+			if (waitqueue_active(&q->wait_for_requests))
+				wake_up(&q->wait_for_requests);
+		}
+		if (atomic_read(&q->nr_sectors) < 0) {
+			printk("nr_sectors is %d\n", atomic_read(&q->nr_sectors));
+			BUG();
+		}
+	}
+}
+
 static inline unsigned int blksize_bits(unsigned int size)
 {
 	unsigned int bits = 8;
diff -urN --exclude '*.orig' --exclude '*.rej' parent/include/linux/elevator.h comp/include/linux/elevator.h
--- parent/include/linux/elevator.h	2003-06-25 14:12:09.000000000 -0400
+++ comp/include/linux/elevator.h	2003-06-25 14:11:55.000000000 -0400
@@ -80,7 +80,7 @@
 	return latency;
 }
 
-#define ELV_LINUS_SEEK_COST	16
+#define ELV_LINUS_SEEK_COST	1
 
 #define ELEVATOR_NOOP							\
 ((elevator_t) {								\
@@ -93,8 +93,8 @@
 
 #define ELEVATOR_LINUS							\
 ((elevator_t) {								\
-	2048,				/* read passovers */		\
-	8192,				/* write passovers */		\
+	128,				/* read passovers */		\
+	512,				/* write passovers */		\
 									\
 	elevator_linus_merge,		/* elevator_merge_fn */		\
 	elevator_linus_merge_req,	/* elevator_merge_req_fn */	\
diff -urN --exclude '*.orig' --exclude '*.rej' parent/include/linux/pagemap.h comp/include/linux/pagemap.h
--- parent/include/linux/pagemap.h	2003-06-25 14:12:09.000000000 -0400
+++ comp/include/linux/pagemap.h	2003-06-25 14:11:53.000000000 -0400
@@ -97,6 +97,8 @@
 		___wait_on_page(page);
 }
 
+extern void FASTCALL(wakeup_page_waiters(struct page * page));
+
 /*
  * Returns locked page at given index in given cache, creating it if needed.
  */
diff -urN --exclude '*.orig' --exclude '*.rej' parent/kernel/ksyms.c comp/kernel/ksyms.c
--- parent/kernel/ksyms.c	2003-06-25 14:12:09.000000000 -0400
+++ comp/kernel/ksyms.c	2003-06-25 14:11:53.000000000 -0400
@@ -296,6 +296,7 @@
 EXPORT_SYMBOL(filemap_fdatawait);
 EXPORT_SYMBOL(lock_page);
 EXPORT_SYMBOL(unlock_page);
+EXPORT_SYMBOL(wakeup_page_waiters);
 
 /* device registration */
 EXPORT_SYMBOL(register_chrdev);
diff -urN --exclude '*.orig' --exclude '*.rej' parent/mm/filemap.c comp/mm/filemap.c
--- parent/mm/filemap.c	2003-06-25 14:12:09.000000000 -0400
+++ comp/mm/filemap.c	2003-06-25 14:11:53.000000000 -0400
@@ -812,6 +812,20 @@
 	return &wait[hash];
 }
 
+/*
+ * This must be called after every submit_bh with end_io
+ * callbacks that would result into the blkdev layer waking
+ * up the page after a queue unplug.
+ */
+void wakeup_page_waiters(struct page * page)
+{
+	wait_queue_head_t * head;
+
+	head = page_waitqueue(page);
+	if (waitqueue_active(head))
+		wake_up(head);
+}
+
 /* 
  * Wait for a page to get unlocked.
  *

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls
  2003-06-25 19:03                                               ` Chris Mason
@ 2003-06-25 19:25                                                 ` Andrea Arcangeli
  2003-06-25 20:18                                                   ` Chris Mason
  2003-06-26  5:48                                                 ` [PATCH] io stalls Nick Piggin
  1 sibling, 1 reply; 109+ messages in thread
From: Andrea Arcangeli @ 2003-06-25 19:25 UTC (permalink / raw)
  To: Chris Mason
  Cc: Nick Piggin, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller

On Wed, Jun 25, 2003 at 03:03:43PM -0400, Chris Mason wrote:
> Hello all,
> 
> [ short version, the patch attached should fix io latencies in 2.4.21. 
> Please review and/or give it a try ]
>  
> My last set of patches was directed at reducing the latencies in
> __get_request_wait, which really helped reduce stalls when you had lots
> of io to one device and balance_dirty() was causing pauses while you
> tried to do io to other devices.
> 
> But, a streaming write could still starve reads to the same device,
> mostly because the read would have to send down any huge merged writes
> that were before it in the queue.
> 
> Andrea's kernel has a fix for this too, he limits the total number of
> sectors that can be in the request queue at any given time.  But, his
> patches change blk_finished_io, both in the arguments it takes and the
> side effects of calling it.  I don't think we can merge his current form
> without breaking external drivers.
> 
> So, I added a can_throttle flag to the queue struct, drivers can enable
> it if they are going to call the new blk_started_sectors and
> blk_finished_sectors funcs any time they call blk_{started,finished}_io,
> and these do all the -aa style sector throttling.
> 
> There were a few other small changes to Andrea's patch, he wasn't
> setting q->full when get_request decided there were too many sectors in

"wasn't" really is in the past, because I'm doing it in 2.4.21rc8aa1 and
in my latest status.

> flight.  This resulted in large latencies in __get_request_wait.  He was
> also unconditionally clearing q->full in blkdev_release_request, my code
> only clears q->full when all the waiters are gone.

my current code including the older 2.4.21rc8aa1 does that too, merged
from your previous patches.

> I changed generic_unplug_device to zero the elevator_sequence field of
> the last request on the queue.  This means there won't be any merges
> with requests pending once an unplug is done, and helps limit the number
> of sectors that need to be sent down during the run_task_queue(&tq_disk)
> in wait_on_buffer.

this sounds like a hack: forbidding merges is pretty bad for
performance in general. Of course most of the merging happens in between
the unplugs, but during heavy I/O with frequent unplugs from many
readers this may hurt performance. And as you said this mostly has the
advantage of limiting the size of the queue, like I enforce in my tree
with the elevator-lowlatency. I doubt this is right.

> I lowered the -aa default limit on sectors in flight from 4MB to 2MB. 

I got a few complaints about performance slowdown; originally it was 2MB,
so I increased it to 4MB, which should be enough for most hardware.

> We probably want an elvtune for it, large arrays with writeback cache
> should be able to tolerate larger values.

Yes, it largely depends on the speed of the device.

> There's still a little work left to do, this patch enables sector
> throttling for scsi and IDE.  cciss, DAC960 and cpqarray need
> modification too (99% done already in -aa).  No sense in doing that
> until after the bulk of the patch is reviewed though.
> 
> As before, most of the code here is from Andrea and Nick, I've just
> wrapped a lot of duct tape around it and done some tweaking.  The
> primary pieces are:
> 
> fix-pausing (andrea, corner cases where wakeups are missed)
> elevator-low-latency (andrea, limit sectors in flight)
> queue_full (Nick, fairness in __get_request_wait)
> 
> I've removed my latency stats for __get_request_wait in hopes of making
> it a better merging candidate.

this is very similar to my status in -aa, except for the hack that
forbids merging (which I think is wrong), the fact that you miss the
wake_up_nr that I added to give a meaning to the batching again, and that
you don't avoid the unplugs in get_request_wait_wakeup until the queue
is empty. I mean this:

+static void get_request_wait_wakeup(request_queue_t *q, int rw)
+{
+	/*
+	 * avoid losing an unplug if a second __get_request_wait did the
+	 * generic_unplug_device while our __get_request_wait was running
+	 * w/o the queue_lock held and w/ our request out of the queue.
+	 */
+	if (q->rq[rw].count == 0 && waitqueue_active(&q->wait_for_requests[rw]))
+		__generic_unplug_device(q);
+}
+

you said last week the above is racy and it even hung your box; could
you elaborate? The above is in 2.4.21rc8aa1 and it works fine so far
(though the race in wait_for_request especially has never been known to
be reproducible).

thanks,

Andrea

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls
  2003-06-25 19:25                                                 ` Andrea Arcangeli
@ 2003-06-25 20:18                                                   ` Chris Mason
  2003-06-27  8:41                                                     ` write-caches, I/O stalls: MUST-FIX (was: [PATCH] io stalls) Matthias Andree
  0 siblings, 1 reply; 109+ messages in thread
From: Chris Mason @ 2003-06-25 20:18 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nick Piggin, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller

On Wed, 2003-06-25 at 15:25, Andrea Arcangeli wrote:

> > There were a few other small changes to Andrea's patch, he wasn't
> > setting q->full when get_request decided there were too many sectors in
> 
> "wasn't" really is in the past, because I'm doing it in 2.4.21rc8aa1 and
> in my latest status.
> 

Hmm, I thought I grabbed the patch from rc8aa1, clearly not though,
sorry about that.

> > I changed generic_unplug_device to zero the elevator_sequence field of
> > the last request on the queue.  This means there won't be any merges
> > with requests pending once an unplug is done, and helps limit the number
> > of sectors that need to be sent down during the run_task_queue(&tq_disk)
> > in wait_on_buffer.
> 
> > this sounds like a hack: forbidding merges is pretty bad for
> > performance in general. Of course most of the merging happens in between
> the unplugs, but during heavy I/O with frequent unplugs from many
> readers this may hurt performance. And as you said this mostly has the
> advantage of limiting the size of the queue, like I enforce in my tree
> with the elevator-lowlatency. I doubt this is right.
> 

Well, I would hit sysrq-t when I noticed read stalls, and the reader was
frequently in run_task_queue.  I kept the hunk because it made a
noticeable difference.  I agree there's a throughput tradeoff here; my
goal for the patch was to find the major places I could improve latency
and change them, then go back later and decide if each one was worth it.

Your elevator-lowlatency patch doesn't enforce sector limits for merged
requests, so a merger could constantly come in and steal space in the
sector limit from other waiters.  This led to high latency in
__get_request_wait.  That hunk for generic_unplug_device solves both of
those problems.
 
> > I lowered the -aa default limit on sectors in flight from 4MB to 2MB. 
> 
> I got a few complaints about performance slowdowns; originally it was 2MB,
> so I increased it to 4MB, which should be enough for most hardware.
> 

I've no preference really.  I didn't notice a throughput difference but
my scsi drives only have 2MB of cache.

> this is very similar to my status in -aa, except for the hack that
> forbids merging (which I think is wrong), the fact that you miss the
> wake_up_nr that I added to give the batching a meaning again, and the fact
> that you don't avoid the unplugs in get_request_wait_wakeup until the
> queue is empty. I mean this:
> 
> +static void get_request_wait_wakeup(request_queue_t *q, int rw)
> +{
> +	/*
> +	 * avoid losing an unplug if a second __get_request_wait did the
> +	 * generic_unplug_device while our __get_request_wait was running
> +	 * w/o the queue_lock held and w/ our request out of the queue.
> +	 */
> +	if (q->rq[rw].count == 0 && waitqueue_active(&q->wait_for_requests[rw]))
> +		__generic_unplug_device(q);
> +}
> +
> 
> you said last week that the above is racy and that it even hung your box;
> could you elaborate? The above is in 2.4.21rc8aa1 and it has worked fine so
> far (though the race in wait_for_request in particular has never been known
> to be reproducible)

It caused hangs/stalls, but I didn't have the sector throttling code at
the time and it really changes the interaction of things.  I think the
hang went a little like this:

Let's say all the pending io is done, but the wait queue isn't empty yet
because all the waiting tasks haven't yet been scheduled in.  Also, we
have fewer than nr_requests processes waiting to start io, so when they
do all get scheduled in they won't generate an unplug.

q->rq.count = q->nr_requests, q->full = 1

new io comes in, sees q->full = 1, unplugs and sleeps.  No io is done
because the queue is empty.

All the old waiters finally get scheduled in and grab their requests,
but get_request_wait_wakeup doesn't unplug because q->rq.count != 0.

If no additional io comes in, the queue never gets unplugged, and our
waiter never gets woken.
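
Spelled out step by step (just a compact restatement of the above, with the
functions involved):

  1. all in-flight io completes, but the tasks already woken up in
     __get_request_wait haven't been scheduled in yet, and there are fewer
     than nr_requests of them (q->rq.count == q->nr_requests, q->full == 1)
  2. a new submitter enters __get_request_wait, sees q->full == 1, unplugs
     (a no-op, the queue is empty) and goes to sleep
  3. the old waiters finally run and grab their requests, but
     get_request_wait_wakeup sees q->rq.count != 0 and never unplugs
  4. no more io arrives, so the queue is never unplugged and the sleeper
     from step 2 is never woken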

With the sector throttling on, we've got additional wakeups coming from
blk_finished_io (or blk_finished_sectors in my patch).  I left the
wake_up_nr idea out because I couldn't figure out how to keep
__get_request_wait fair with it in.

-chris



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls
  2003-06-25 19:03                                               ` Chris Mason
  2003-06-25 19:25                                                 ` Andrea Arcangeli
@ 2003-06-26  5:48                                                 ` Nick Piggin
  2003-06-26 11:48                                                   ` Chris Mason
  1 sibling, 1 reply; 109+ messages in thread
From: Nick Piggin @ 2003-06-26  5:48 UTC (permalink / raw)
  To: Chris Mason
  Cc: Andrea Arcangeli, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller



Chris Mason wrote:

>Hello all,
>
>[ short version, the patch attached should fix io latencies in 2.4.21. 
>Please review and/or give it a try ]
> 
>My last set of patches was directed at reducing the latencies in
>__get_request_wait, which really helped reduce stalls when you had lots
>of io to one device and balance_dirty() was causing pauses while you
>tried to do io to other devices.
>
>But, a streaming write could still starve reads to the same device,
>mostly because the read would have to send down any huge merged writes
>that were before it in the queue.
>
>Andrea's kernel has a fix for this too, he limits the total number of
>sectors that can be in the request queue at any given time.  But, his
>patches change blk_finished_io, both in the arguments it takes and the
>side effects of calling it.  I don't think we can merge his current form
>without breaking external drivers.
>
>So, I added a can_throttle flag to the queue struct, drivers can enable
>it if they are going to call the new blk_started_sectors and
>blk_finished_sectors funcs any time they call blk_{started,finished}_io,
>and these do all the -aa style sector throttling.
>
>There were a few other small changes to Andrea's patch, he wasn't
>setting q->full when get_request decided there were too many sectors in
>flight.  This resulted in large latencies in __get_request_wait.  He was
>also unconditionally clearing q->full in blkdev_release_request, my code
>only clears q->full when all the waiters are gone.
>
>I changed generic_unplug_device to zero the elevator_sequence field of
>the last request on the queue.  This means there won't be any merges
>with requests pending once an unplug is done, and helps limit the number
>of sectors that need to be sent down during the run_task_queue(&tq_disk)
>in wait_on_buffer.
>
>I lowered the -aa default limit on sectors in flight from 4MB to 2MB. 
>We probably want an elvtune for it, large arrays with writeback cache
>should be able to tolerate larger values.
>
>There's still a little work left to do, this patch enables sector
>throttling for scsi and IDE.  cciss, DAC960 and cpqarray need
>modification too (99% done already in -aa).  No sense in doing that
>until after the bulk of the patch is reviewed though.
>
>As before, most of the code here is from Andrea and Nick, I've just
>wrapped a lot of duct tape around it and done some tweaking.  The
>primary pieces are:
>
>fix-pausing (andrea, corner cases where wakeups are missed)
>elevator-low-latency (andrea, limit sectors in flight)
>queue_full (Nick, fairness in __get_request_wait)
>

I am hoping to go a slightly different way in 2.5 pending
inclusion of process io contexts. If you had time to look
over my changes there (in current mm tree) it would be
appreciated, but they don't help your problem for 2.4.

I found that my queue full fairness for 2.4 didn't address
the batching issue well. It does guarantee the lowest possible
maximum latency for singular requests, but due to lowered
throughput this can cause worse "high level" latency.

I couldn't find a really good, comprehensive method of
allowing processes to batch without resorting to very
complex wakeup methods unless process io contexts are used.
The other possibility would be to keep a list of "batching"
processes which should achieve the same as io contexts.

An easier approach would be to just allow the last woken
process to submit a batch of requests. This wouldn't have
as good guaranteed fairness, but not to say that it would
have starvation issues. I'll help you implement it if you
are interested.




^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls
  2003-06-26  5:48                                                 ` [PATCH] io stalls Nick Piggin
@ 2003-06-26 11:48                                                   ` Chris Mason
  2003-06-26 13:04                                                     ` Nick Piggin
  0 siblings, 1 reply; 109+ messages in thread
From: Chris Mason @ 2003-06-26 11:48 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrea Arcangeli, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller

On Thu, 2003-06-26 at 01:48, Nick Piggin wrote:

> I am hoping to go a slightly different way in 2.5 pending
> inclusion of process io contexts. If you had time to look
> over my changes there (in current mm tree) it would be
> appreciated, but they don't help your problem for 2.4.
> 
> I found that my queue full fairness for 2.4 didn't address
> the batching issue well. It does guarantee the lowest possible
> maximum latency for singular requests, but due to lowered
> throughput this can cause worse "high level" latency.
> 
> I couldn't find a really good, comprehensive method of
> allowing processes to batch without resorting to very
> complex wakeup methods unless process io contexts are used.
> The other possibility would be to keep a list of "batching"
> processes which should achieve the same as io contexts.
> 
> An easier approach would be to just allow the last woken
> process to submit a batch of requests. This wouldn't have
> as good guaranteed fairness, but not to say that it would
> have starvation issues. I'll help you implement it if you
> are interested.

One of the things I tried in this area was basically queue ownership. 
When each process woke up, he was given strict ownership of the queue
and could submit up to N number of requests.  One process waited for
ownership in a yield loop for a max limit of a certain number of
jiffies, all the others waited on the request queue.

It generally increased the latency in __get_request wait by a multiple
of N.  I didn't keep it because the current patch is already full of
subtle interactions, I didn't want to make things more confusing than
they already were ;-)

The real problem with this approach is that we're guessing about the
number of requests a given process wants to submit, and we're assuming
those requests are going to be highly mergable.  If the higher levels
pass these hints down to the elevator, we should be able to do a better
job of giving both low latency and high throughput.

Between bios and the pdflush daemons, I think 2.5 is in pretty good
shape to do what we need.  I'm not 100% sure we need batching when the
requests being submitted are not highly mergable, but I haven't put lots
of thought into that part yet.

Anyway for 2.4 I'm not sure there's much more we can do.  I'd like to
add tunables to the current patch, so userland can control the max io in
flight and a simple toggle between throughput mode and latency mode on a
per device basis.  It's not perfect but should tide us over until 2.6.

-chris



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls
  2003-06-26 11:48                                                   ` Chris Mason
@ 2003-06-26 13:04                                                     ` Nick Piggin
  2003-06-26 13:18                                                       ` Nick Piggin
  2003-06-26 15:55                                                       ` Chris Mason
  0 siblings, 2 replies; 109+ messages in thread
From: Nick Piggin @ 2003-06-26 13:04 UTC (permalink / raw)
  To: Chris Mason
  Cc: Andrea Arcangeli, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller



Chris Mason wrote:

>On Thu, 2003-06-26 at 01:48, Nick Piggin wrote:
>
>
>>I am hoping to go a slightly different way in 2.5 pending
>>inclusion of process io contexts. If you had time to look
>>over my changes there (in current mm tree) it would be
>>appreciated, but they don't help your problem for 2.4.
>>
>>I found that my queue full fairness for 2.4 didn't address
>>the batching issue well. It does guarantee the lowest possible
>>maximum latency for singular requests, but due to lowered
>>throughput this can cause worse "high level" latency.
>>
>>I couldn't find a really good, comprehensive method of
>>allowing processes to batch without resorting to very
>>complex wakeup methods unless process io contexts are used.
>>The other possibility would be to keep a list of "batching"
>>processes which should achieve the same as io contexts.
>>
>>An easier approach would be to just allow the last woken
>>process to submit a batch of requests. This wouldn't have
>>as good guaranteed fairness, but not to say that it would
>>have starvation issues. I'll help you implement it if you
>>are interested.
>>
>
>One of the things I tried in this area was basically queue ownership. 
>When each process woke up, he was given strict ownership of the queue
>and could submit up to N number of requests.  One process waited for
>ownership in a yield loop for a max limit of a certain number of
>jiffies, all the others waited on the request queue.
>

Not sure exactly what you mean by one process waiting for ownership
in a yield loop, but why don't you simply allow the queue "owner" to
submit up to a maximum of N requests within a time limit. Once either
limit expires (or, rarely, another might become owner -) the process
would just be put to sleep by the normal queue_full mechanism.

>
>It generally increased the latency in __get_request wait by a multiple
>of N.  I didn't keep it because the current patch is already full of
>subtle interactions, I didn't want to make things more confusing than
>they already were ;-)
>

Yeah, something like that. I think that in a queue full situation,
the processes are wanting to submit more than 1 request though. So
the better throughput you can achieve by batching translates to
better effective throughput. Read my recent debate with Andrea
about this though - I couldn't convince him!

I have seen much better maximum latencies, 2-3 times the
throughput, and an order of magnitude less context switching on
many threaded tiobench write loads when using batching.

In short, measuring get_request latency won't give you the full
story.

>
>The real problem with this approach is that we're guessing about the
>number of requests a given process wants to submit, and we're assuming
>those requests are going to be highly mergable.  If the higher levels
>pass these hints down to the elevator, we should be able to do a better
>job of giving both low latency and high throughput.
>

No, the numbers (batch # requests, batch time) are not highly scientific.
Simply when a process wakes up, we'll let them submit a small burst of
requests before they go back to sleep. Now in 2.5 (mm) we can cheat and
make this more effective, fair, and without possible missed wakes because
io contexts means that multiple processes can be batching at the same
time, and dynamically allocated requests means it doesn't matter if we
go a bit over the queue limit.

I think a decent solution for 2.4 would be to simply have the one queue
owner, but once he has allowed the queue to fall below the batch limit, wake
someone else and make them the owner. It can be a bit less fair, and
it doesn't work across queues, but they're less important cases.
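
Something like this, say (a rough sketch only -- batch_owner and
batch_remaining are hypothetical fields, this is not a patch):

	/*
	 * hand-off sketch: the current owner allocates freely; once it has
	 * used up its batch it hands the queue to the next sleeper instead
	 * of letting everyone race for requests again.
	 */
	static void batch_hand_off(request_queue_t *q)
	{
		if (q->batch_owner != current)
			return;
		if (--q->batch_remaining > 0)
			return;
		q->batch_owner = NULL;
		if (waitqueue_active(&q->wait_for_requests))
			wake_up(&q->wait_for_requests);
	}

called from get_request() with the io_request_lock held, and with the task
that wakes up in __get_request_wait() taking over as the new owner.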

>
>Between bios and the pdflush daemons, I think 2.5 is in pretty good
>shape to do what we need.  I'm not 100% sure we need batching when the
>requests being submitted are not highly mergable, but I haven't put lots
>of thought into that part yet.
>

No, there are a couple of problems here.
First, good locality != sequential. I saw tiobench 256 random write
throughput _doubled_ because each process is writing within its own
file.

Second, mergeable doesn't mean anything if your request size only
grows to say 128KB (IDE). I saw tiobench 256 sequential writes on IDE
go from ~ 25% peak throughput to ~70% (4.85->14.11 from 20MB/s disk)

Third, context switch rate. In the latest IBM regression tests,
tiobench 64 on ext2, 8xSMP (so don't look at throughput!), average
cs/s was about 2500 with mainline (FIFO request allocation), and
140 in mm (batching allocation). So nearly 20x better. This might
not be due to batching alone, but I didn't see any other obvious
change in mm.

>
>Anyway for 2.4 I'm not sure there's much more we can do.  I'd like to
>add tunables to the current patch, so userland can control the max io in
>flight and a simple toggle between throughput mode and latency mode on a
>per device basis.  It's not perfect but should tide us over until 2.6.
>
>

The changes do seem to be a critical fix due to the starvation issue,
but I'm worried that they take a big step back in performance under
high disk load. I found my FIFO mechanism to be unacceptably slow for
2.5.




^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls
  2003-06-26 13:04                                                     ` Nick Piggin
@ 2003-06-26 13:18                                                       ` Nick Piggin
  2003-06-26 15:55                                                       ` Chris Mason
  1 sibling, 0 replies; 109+ messages in thread
From: Nick Piggin @ 2003-06-26 13:18 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Chris Mason, Andrea Arcangeli, Marc-Christian Petersen,
	Jens Axboe, Marcelo Tosatti, Georg Nikodym, lkml,
	Matthias Mueller



Nick Piggin wrote:

snip

>
> Yeah, something like that. I think that in a queue full situation,
> the processes are wanting to submit more than 1 request though. So
> the better throughput you can achieve by batching translates to
> better effective throughput. Read my recent debate with Andrea 

                   ^^^^^^^^^^
Err, latency

snip

>
> No, the numbers (batch # requests, batch time) are not highly scientific.
> Simply when a process wakes up, we'll let them submit a small burst of
> requests before they go back to sleep.

by this, I mean that it's not a big problem that we don't know how many
requests a process wants to submit.

snip

>
> The changes do seem to be a critical fix due to the starvation issue,
> but I'm worried that they take a big step back in performance under
> high disk load. I found my FIFO mechanism to be unacceptably slow for
> 2.5.


BTW, sorry for the lack of better benchmark numbers; I couldn't
find good ones lying around. I found uniprocessor tiobench to
be quite helpful at queue_nr_requests * 0.5, 2 threads to
measure different types of overloadedness.

Also, I didn't see much gain in read performance in my testing -
probably due to AS. I expect 2.4 and 2.5 non AS read performance
to show bigger improvements from batching (ie. regressions).



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls
  2003-06-26 13:04                                                     ` Nick Piggin
  2003-06-26 13:18                                                       ` Nick Piggin
@ 2003-06-26 15:55                                                       ` Chris Mason
  2003-06-27  1:21                                                         ` Nick Piggin
  1 sibling, 1 reply; 109+ messages in thread
From: Chris Mason @ 2003-06-26 15:55 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrea Arcangeli, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller

[-- Attachment #1: Type: text/plain, Size: 6641 bytes --]

On Thu, 2003-06-26 at 09:04, Nick Piggin wrote:

> >One of the things I tried in this area was basically queue ownership. 
> >When each process woke up, he was given strict ownership of the queue
> >and could submit up to N number of requests.  One process waited for
> >ownership in a yield loop for a max limit of a certain number of
> >jiffies, all the others waited on the request queue.
> >
> 
> Not sure exactly what you mean by one process waiting for ownership
> in a yield loop, but why don't you simply allow the queue "owner" to
> submit up to a maximum of N requests within a time limit. Once either
> limit expires (or, rarely, another might become owner -) the process
> would just be put to sleep by the normal queue_full mechanism.
> 

You need some way to wakeup the queue after that time limit has expired,
in case the owner never submits another request.  This can either be a
timer or a process in a yield loop.  Given that very short expire time I
set (10 jiffies), I went for the yield loop.

> >
> >It generally increased the latency in __get_request wait by a multiple
> >of N.  I didn't keep it because the current patch is already full of
> >subtle interactions, I didn't want to make things more confusing than
> >they already were ;-)
> >
> 
> Yeah, something like that. I think that in a queue full situation,
> the processes are wanting to submit more than 1 request though. So
> the better throughput you can achieve by batching translates to
> better effective throughput. Read my recent debate with Andrea
> about this though - I couldn't convince him!
> 

Well, it depends ;-) I think we've got 3 basic kinds of procs during a
q->full condition:

1) wants to submit lots of somewhat contiguous io
2) wants to submit a single io
3) wants to submit lots of random io

From a throughput point of view, we only care about giving batch
ownership to #1.  Giving batch ownership to #3 will help reduce context
switches, but if it helps throughput then the io wasn't really random
(you've got a good point about locality below, drive write caches make a
huge difference there).

The problem I see in 2.4 is the elevator can't tell any of these cases
apart, so any attempt at batch ownership is certain to be wrong at least
part of the time.

> I have seen much better maximum latencies, 2-3 times the
> throughput, and an order of magnitude less context switching on
> many threaded tiobench write loads when using batching.
> 
> In short, measuring get_request latency won't give you the full
> story.
> 

Very true.  But get_request latency is the minimum amount of time a
single read is going to wait (in 2.4.x anyway), and that is what we need
to focus on when we're trying to fix interactive performance.
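
For what it's worth, that latency is easy to see from userspace with
something as dumb as the sketch below -- time random 4k reads from a file
bigger than RAM while a streaming writer hammers the same disk (the file
name argument and the 100-sample count are arbitrary):

	#include <stdio.h>
	#include <stdlib.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/time.h>
	#include <sys/stat.h>

	/* time random 4k reads; run it against a file larger than RAM
	 * while something else writes heavily to the same disk */
	int main(int argc, char **argv)
	{
		char buf[4096];
		struct stat st;
		struct timeval t0, t1;
		long blocks, ms;
		int i, fd = open(argv[1], O_RDONLY);

		if (fd < 0 || fstat(fd, &st) < 0 || st.st_size < 4096)
			return 1;
		blocks = st.st_size / 4096;
		srand(getpid());
		for (i = 0; i < 100; i++) {
			gettimeofday(&t0, NULL);
			lseek(fd, ((off_t)(rand() % blocks)) * 4096, SEEK_SET);
			read(fd, buf, sizeof(buf));
			gettimeofday(&t1, NULL);
			ms = (t1.tv_sec - t0.tv_sec) * 1000 +
			     (t1.tv_usec - t0.tv_usec) / 1000;
			printf("read %d: %ld ms\n", i, ms);
			sleep(1);
		}
		return 0;
	}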

> >
> >The real problem with this approach is that we're guessing about the
> >number of requests a given process wants to submit, and we're assuming
> >those requests are going to be highly mergable.  If the higher levels
> >pass these hints down to the elevator, we should be able to do a better
> >job of giving both low latency and high throughput.
> >
> 
> No, the numbers (batch # requests, batch time) are not highly scientific.
> Simply when a process wakes up, we'll let them submit a small burst of
> requests before they go back to sleep. Now in 2.5 (mm) we can cheat and
> make this more effective, fair, and without possible missed wakes because
> io contexts means that multiple processes can be batching at the same
> time, and dynamically allocated requests means it doesn't matter if we
> go a bit over the queue limit.
> 

I agree 2.5 has a lot more room for the contexts to be effective, and I
think they are a really good idea.

> I think a decent solution for 2.4 would be to simply have the one queue
> owner, but once he has allowed the queue to fall below the batch limit, wake
> someone else and make them the owner. It can be a bit less fair, and
> it doesn't work across queues, but they're less important cases.
> 
> >
> >Between bios and the pdflush daemons, I think 2.5 is in pretty good
> >shape to do what we need.  I'm not 100% sure we need batching when the
> >requests being submitted are not highly mergable, but I haven't put lots
> >of thought into that part yet.
> >
> 
> No, there are a couple of problems here.
> First, good locality != sequential. I saw tiobench 256 random write
> throughput _doubled_ because each process is writing within its own
> file.
> 
> Second, mergeable doesn't mean anything if your request size only
> grows to say 128KB (IDE). I saw tiobench 256 sequential writes on IDE
> go from ~ 25% peak throughput to ~70% (4.85->14.11 from 20MB/s disk)

Well, play around with raw io, my box writes at roughly disk speed with
128k synchronous requests (contiguous writes).

> Third, context switch rate. In the latest IBM regression tests,
> tiobench 64 on ext2, 8xSMP (so don't look at throughput!), average
> cs/s was about 2500 with mainline (FIFO request allocation), and
> 140 in mm (batching allocation). So nearly 20x better. This might
> not be due to batching alone, but I didn't see any other obvious
> change in mm.
> 

Makes sense.

> >
> >Anyway for 2.4 I'm not sure there's much more we can do.  I'd like to
> >add tunables to the current patch, so userland can control the max io in
> >flight and a simple toggle between throughput mode and latency mode on a
> >per device basis.  It's not perfect but should tide us over until 2.6.
> >
> >
> 
> The changes do seem to be a critical fix due to the starvation issue,
> but I'm worried that they take a big step back in performance under
> high disk load. I found my FIFO mechanism to be unacceptably slow for
> 2.5.

Me too, but I'm not sure how to fix it other than a userspace knob to
turn off the q->full checks for server workloads.  Andrea's
elevator-lowlatency alone has pretty good throughput numbers, since it
still allows request stealing.  Its get_request_wait latency numbers
aren't horrible either, it only suffers in a few corner cases.

But, if someone wants to play with this more, I've attached a quick
remerge of my batch ownership code.  I made a read and write owner, so
that a reader doing a single request doesn't grab ownership and make all
the writes wait.

It does make throughput better overall, and it also makes latencies
worse overall.  We'll probably get similar results just by disabling
q->full in io-stalls-7, but the batch ownership does a better job of
limiting get_request latencies at a fixed (although potentially large)
number.

lat-stat-5.diff goes on top of io-stalls-7.diff from yesterday
batch_owner.diff goes on top of lat-stat-5.diff.

-chris


[-- Attachment #2: batch_owner.diff --]
[-- Type: text/plain, Size: 4506 bytes --]

===== drivers/block/ll_rw_blk.c 1.47 vs edited =====
--- 1.47/drivers/block/ll_rw_blk.c	Thu Jun 26 09:20:08 2003
+++ edited/drivers/block/ll_rw_blk.c	Thu Jun 26 10:52:17 2003
@@ -592,6 +592,8 @@
 	q->full			= 0;
 	q->can_throttle		= 0;
 
+	memset(q->batch, 0, sizeof(struct queue_batch) * 2);
+
 	reset_stats(q);
 
 	/*
@@ -606,6 +608,48 @@
 	blk_queue_bounce_limit(q, BLK_BOUNCE_HIGH);
 }
 
+#define MSEC(x) ((x) * 1000 / HZ)
+#define BATCH_MAX_AGE 100
+int grab_batch_ownership(request_queue_t *q, int rw)
+{
+	struct task_struct *tsk = current;
+	unsigned long age;
+	struct queue_batch *batch = q->batch + rw;
+
+	if (batch->batch_waiter)
+		return 0;
+	if (!batch->batch_owner)
+		goto grab;
+	batch->batch_waiter = tsk;
+	while(1) {
+		age = jiffies - batch->batch_jiffies;
+		if (!batch->batch_owner || MSEC(age) > BATCH_MAX_AGE)
+			break;
+		set_current_state(TASK_RUNNING);
+		spin_unlock_irq(&io_request_lock);
+		schedule();
+		spin_lock_irq(&io_request_lock);
+	}
+	batch->batch_waiter = NULL;
+grab:
+	batch->batch_owner = tsk;
+	batch->batch_jiffies = jiffies;
+	batch->batch_remaining = q->batch_requests;
+	return 1;
+}
+
+void decrement_batch_request(request_queue_t *q, int rw)
+{
+	struct queue_batch *batch = q->batch + rw;
+	if (batch->batch_owner == current) {
+		batch->batch_remaining--;
+		if (!batch->batch_remaining || 
+		    MSEC(jiffies - batch->batch_jiffies) > BATCH_MAX_AGE) {
+			batch->batch_owner = NULL;
+		}
+	}
+}
+
 #define blkdev_free_rq(list) list_entry((list)->next, struct request, queue);
 /*
  * Get a free request. io_request_lock must be held and interrupts
@@ -625,6 +669,7 @@
 		rq->cmd = rw;
 		rq->special = NULL;
 		rq->q = q;
+		decrement_batch_request(q, rw);
 	} else
 		q->full = 1;
 	return rq;
@@ -635,7 +680,7 @@
  */
 static inline struct request *get_request(request_queue_t *q, int rw)
 {
-	if (q->full)
+	if (q->full && q->batch[rw].batch_owner != current)
 		return NULL;
 	return __get_request(q, rw);
 }
@@ -698,25 +743,28 @@
 
 static struct request *__get_request_wait(request_queue_t *q, int rw)
 {
-	register struct request *rq;
+	register struct request *rq = NULL;
 	unsigned long wait_start = jiffies;
 	unsigned long time_waited;
 	DECLARE_WAITQUEUE(wait, current);
 
 	add_wait_queue_exclusive(&q->wait_for_requests, &wait);
 
+	spin_lock_irq(&io_request_lock);
 	do {
 		set_current_state(TASK_UNINTERRUPTIBLE);
-		spin_lock_irq(&io_request_lock);
 		if (q->full || blk_oversized_queue(q)) {
-			__generic_unplug_device(q);
+			if (blk_oversized_queue(q))
+				__generic_unplug_device(q);
 			spin_unlock_irq(&io_request_lock);
 			schedule();
 			spin_lock_irq(&io_request_lock);
+			if (!grab_batch_ownership(q, rw))
+				continue;
 		}
 		rq = __get_request(q, rw);
-		spin_unlock_irq(&io_request_lock);
 	} while (rq == NULL);
+	spin_unlock_irq(&io_request_lock);
 	remove_wait_queue(&q->wait_for_requests, &wait);
 	current->state = TASK_RUNNING;
 
@@ -978,9 +1026,9 @@
 		list_add(&req->queue, &q->rq.free);
 		if (q->rq.count >= q->batch_requests && !oversized_batch) {
 			smp_mb();
-			if (waitqueue_active(&q->wait_for_requests))
+			if (waitqueue_active(&q->wait_for_requests)) {
 				wake_up(&q->wait_for_requests);
-			else
+			} else
 				clear_full_and_wake(q);
 		}
 	}
===== include/linux/blkdev.h 1.25 vs edited =====
--- 1.25/include/linux/blkdev.h	Thu Jun 26 09:20:08 2003
+++ edited/include/linux/blkdev.h	Thu Jun 26 10:50:17 2003
@@ -69,6 +69,15 @@
 	struct list_head free;
 };
 
+struct queue_batch
+{
+	struct task_struct 	*batch_owner;
+	struct task_struct 	*batch_waiter;
+	unsigned long		batch_jiffies;
+	int			batch_remaining;
+	
+};
+
 struct request_queue
 {
 	/*
@@ -141,7 +150,7 @@
 	 * threshold
 	 */
 	int			full:1;
-	
+
 	/*
 	 * Boolean that indicates you will use blk_started_sectors
 	 * and blk_finished_sectors in addition to blk_started_io
@@ -162,6 +171,9 @@
 	 * Tasks wait here for free read and write requests
 	 */
 	wait_queue_head_t	wait_for_requests;
+
+	struct queue_batch	batch[2];
+
 	unsigned long           max_wait;
 	unsigned long           min_wait;
 	unsigned long           total_wait;
@@ -278,7 +290,7 @@
 
 #define MAX_SEGMENTS 128
 #define MAX_SECTORS 255
-#define MAX_QUEUE_SECTORS (2 << (20 - 9)) /* 4 mbytes when full sized */
+#define MAX_QUEUE_SECTORS (4 << (20 - 9)) /* 4 mbytes when full sized */
 #define MAX_NR_REQUESTS 1024 /* 1024k when in 512 units, normally min is 1M in 1k units */
 
 #define PageAlignSize(size) (((size) + PAGE_SIZE -1) & PAGE_MASK)

[-- Attachment #3: lat-stat-5.diff --]
[-- Type: text/plain, Size: 4473 bytes --]

reverted:
--- b/drivers/block/blkpg.c	Thu Jun 26 09:12:14 2003
+++ a/drivers/block/blkpg.c	Thu Jun 26 09:12:14 2003
@@ -261,6 +261,7 @@
 			return blkpg_ioctl(dev, (struct blkpg_ioctl_arg *) arg);
 			
 		case BLKELVGET:
+			blk_print_stats(dev);
 			return blkelvget_ioctl(&blk_get_queue(dev)->elevator,
 					       (blkelv_ioctl_arg_t *) arg);
 		case BLKELVSET:
reverted:
--- b/drivers/block/ll_rw_blk.c	Thu Jun 26 09:12:14 2003
+++ a/drivers/block/ll_rw_blk.c	Thu Jun 26 09:12:14 2003
@@ -490,6 +490,56 @@
 	spin_lock_init(&q->queue_lock);
 }
 
+void blk_print_stats(kdev_t dev) 
+{
+	request_queue_t *q;
+	unsigned long avg_wait;
+	unsigned long min_wait;
+	unsigned long high_wait;
+	unsigned long *d;
+
+	q = blk_get_queue(dev);
+	if (!q)
+		return;
+
+	min_wait = q->min_wait;
+	if (min_wait == ~0UL)
+		min_wait = 0;
+	if (q->num_wait) 
+		avg_wait = q->total_wait / q->num_wait;
+	else
+		avg_wait = 0;
+	printk("device %s: num_req %lu, total jiffies waited %lu\n", 
+	       kdevname(dev), q->num_req, q->total_wait);
+	printk("\t%lu forced to wait\n", q->num_wait);
+	printk("\t%lu min wait, %lu max wait\n", min_wait, q->max_wait);
+	printk("\t%lu average wait\n", avg_wait);
+	d = q->deviation;
+	printk("\t%lu < 100, %lu < 200, %lu < 300, %lu < 400, %lu < 500\n",
+               d[0], d[1], d[2], d[3], d[4]);
+	high_wait = d[0] + d[1] + d[2] + d[3] + d[4];
+	high_wait = q->num_wait - high_wait;
+	printk("\t%lu waits longer than 500 jiffies\n", high_wait);
+}
+
+static void reset_stats(request_queue_t *q)
+{
+	q->max_wait		= 0;
+	q->min_wait		= ~0UL;
+	q->total_wait		= 0;
+	q->num_req		= 0;
+	q->num_wait		= 0;
+	memset(q->deviation, 0, sizeof(q->deviation));
+}
+void blk_reset_stats(kdev_t dev) 
+{
+	request_queue_t *q;
+	q = blk_get_queue(dev);
+	if (!q)
+	    return;
+	printk("reset latency stats on device %s\n", kdevname(dev));
+	reset_stats(q);
+}
 static int __make_request(request_queue_t * q, int rw, struct buffer_head * bh);
 
 /**
@@ -542,6 +592,8 @@
 	q->full			= 0;
 	q->can_throttle		= 0;
 
+	reset_stats(q);
+
 	/*
 	 * These booleans describe the queue properties.  We set the
 	 * default (and most common) values here.  Other drivers can
@@ -647,6 +699,8 @@
 static struct request *__get_request_wait(request_queue_t *q, int rw)
 {
 	register struct request *rq;
+	unsigned long wait_start = jiffies;
+	unsigned long time_waited;
 	DECLARE_WAITQUEUE(wait, current);
 
 	add_wait_queue_exclusive(&q->wait_for_requests, &wait);
@@ -669,6 +723,17 @@
 	if (!waitqueue_active(&q->wait_for_requests))
 		clear_full_and_wake(q);
 
+	time_waited = jiffies - wait_start;
+	if (time_waited > q->max_wait)
+		q->max_wait = time_waited;
+	if (time_waited && time_waited < q->min_wait)
+		q->min_wait = time_waited;
+	q->total_wait += time_waited;
+	q->num_wait++;
+	if (time_waited < 500) {
+		q->deviation[time_waited/100]++;
+	}
+
 	return rq;
 }
 
@@ -1157,6 +1222,7 @@
 	req->rq_dev = bh->b_rdev;
 	req->start_time = jiffies;
 	req_new_io(req, 0, count);
+	q->num_req++;
 	blk_started_io(count);
 	blk_started_sectors(req, count);
 	add_request(q, req, insert_here);
reverted:
--- b/fs/super.c	Thu Jun 26 09:12:14 2003
+++ a/fs/super.c	Thu Jun 26 09:12:14 2003
@@ -726,6 +726,7 @@
 	if (!fs_type->read_super(s, data, flags & MS_VERBOSE ? 1 : 0))
 		goto Einval;
 	s->s_flags |= MS_ACTIVE;
+	blk_reset_stats(dev);
 	path_release(&nd);
 	return s;
 
reverted:
--- b/include/linux/blkdev.h	Thu Jun 26 09:12:14 2003
+++ a/include/linux/blkdev.h	Thu Jun 26 09:12:14 2003
@@ -162,8 +162,17 @@
 	 * Tasks wait here for free read and write requests
 	 */
 	wait_queue_head_t	wait_for_requests;
+	unsigned long           max_wait;
+	unsigned long           min_wait;
+	unsigned long           total_wait;
+	unsigned long           num_req;
+	unsigned long           num_wait;
+	unsigned long           deviation[5];
 };
 
+void blk_reset_stats(kdev_t dev);
+void blk_print_stats(kdev_t dev);
+
 #define blk_queue_plugged(q)	(q)->plugged
 #define blk_fs_request(rq)	((rq)->cmd == READ || (rq)->cmd == WRITE)
 #define blk_queue_empty(q)	list_empty(&(q)->queue_head)
@@ -269,7 +278,7 @@
 
 #define MAX_SEGMENTS 128
 #define MAX_SECTORS 255
+#define MAX_QUEUE_SECTORS (2 << (20 - 9)) /* 4 mbytes when full sized */
-#define MAX_QUEUE_SECTORS (2 << (20 - 9)) /* 2 mbytes when full sized */
 #define MAX_NR_REQUESTS 1024 /* 1024k when in 512 units, normally min is 1M in 1k units */
 
 #define PageAlignSize(size) (((size) + PAGE_SIZE -1) & PAGE_MASK)

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls
  2003-06-26 15:55                                                       ` Chris Mason
@ 2003-06-27  1:21                                                         ` Nick Piggin
  2003-06-27  1:39                                                           ` Chris Mason
  0 siblings, 1 reply; 109+ messages in thread
From: Nick Piggin @ 2003-06-27  1:21 UTC (permalink / raw)
  To: Chris Mason
  Cc: Andrea Arcangeli, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller



Chris Mason wrote:

>On Thu, 2003-06-26 at 09:04, Nick Piggin wrote:
>
>
>>>One of the things I tried in this area was basically queue ownership. 
>>>When each process woke up, he was given strict ownership of the queue
>>>and could submit up to N number of requests.  One process waited for
>>>ownership in a yield loop for a max limit of a certain number of
>>>jiffies, all the others waited on the request queue.
>>>
>>>
>>Not sure exactly what you mean by one process waiting for ownership
>>in a yield loop, but why don't you simply allow the queue "owner" to
>>submit up to a maximum of N requests within a time limit. Once either
>>limit expires (or, rarely, another might become owner -) the process
>>would just be put to sleep by the normal queue_full mechanism.
>>
>>
>
>You need some way to wakeup the queue after that time limit has expired,
>in case the owner never submits another request.  This can either be a
>timer or a process in a yield loop.  Given that very short expire time I
>set (10 jiffies), I went for the yield loop.
>
>
>>>It generally increased the latency in __get_request wait by a multiple
>>>of N.  I didn't keep it because the current patch is already full of
>>>subtle interactions, I didn't want to make things more confusing than
>>>they already were ;-)
>>>
>>>
>>Yeah, something like that. I think that in a queue full situation,
>>the processes are wanting to submit more than 1 request though. So
>>the better throughput you can achieve by batching translates to
>>better effective throughput. Read my recent debate with Andrea
>>about this though - I couldn't convince him!
>>
>>
>
>Well, it depends ;-) I think we've got 3 basic kinds of procs during a
>q->full condition:
>
>1) wants to submit lots of somewhat contiguous io
>2) wants to submit a single io
>3) wants to submit lots of random io
>
>From a throughput point of view, we only care about giving batch
>ownership to #1.  Giving batch ownership to #3 will help reduce context
>switches, but if it helps throughput then the io wasn't really random
>(you've got a good point about locality below, drive write caches make a
>huge difference there).
>
>The problem I see in 2.4 is the elevator can't tell any of these cases
>apart, so any attempt at batch ownership is certain to be wrong at least
>part of the time.
>

I think though, for fairness, if we allow one to submit a batch
of requests, we have to give that opportunity to the others.
And yeah, it does reduce context switches, and it does improve
throughput for "random" localised IO.

>
>
>>I have seen much better maximum latencies, 2-3 times the
>>throughput, and an order of magnitude less context switching on
>>many threaded tiobench write loads when using batching.
>>
>>In short, measuring get_request latency won't give you the full
>>story.
>>
>>
>
>Very true.  But get_request latency is the minimum amount of time a
>single read is going to wait (in 2.4.x anyway), and that is what we need
>to focus on when we're trying to fix interactive performance.
>

The read situation is different to write. To fill the read queue,
you need queue_nr_requests / 2-3 (for readahead) reading processes
to fill the queue, more if the reads are random.
If this kernel is being used interactively, it's not our fault we
might not give quite as good interactive performance. I'm sure
the fileserver admin would rather take the tripled bandwidth ;)

That said, I think a lot of interactive programs will want to do
more than 1 request at a time anyway.

>
>>>The real problem with this approach is that we're guessing about the
>>>number of requests a given process wants to submit, and we're assuming
>>>those requests are going to be highly mergable.  If the higher levels
>>>pass these hints down to the elevator, we should be able to do a better
>>>job of giving both low latency and high throughput.
>>>
>>>
>>No, the numbers (batch # requests, batch time) are not highly scientific.
>>Simply when a process wakes up, we'll let them submit a small burst of
>>requests before they go back to sleep. Now in 2.5 (mm) we can cheat and
>>make this more effective, fair, and without possible missed wakes because
>>io contexts means that multiple processes can be batching at the same
>>time, and dynamically allocated requests means it doesn't matter if we
>>go a bit over the queue limit.
>>
>>
>
>I agree 2.5 has a lot more room for the contexts to be effective, and I
>think they are a really good idea.
>
>
>>I think a decent solution for 2.4 would be to simply have the one queue
>>owner, but once he has allowed the queue to fall below the batch limit, wake
>>someone else and make them the owner. It can be a bit less fair, and
>>it doesn't work across queues, but they're less important cases.
>>
>>
>>>Between bios and the pdflush daemons, I think 2.5 is in pretty good
>>>shape to do what we need.  I'm not 100% sure we need batching when the
>>>requests being submitted are not highly mergable, but I haven't put lots
>>>of thought into that part yet.
>>>
>>>
>>No, there are a couple of problems here.
>>First, good locality != sequential. I saw tiobench 256 random write
>>throughput _doubled_ because each process is writing within its own
>>file.
>>
>>Second, mergeable doesn't mean anything if your request size only
>>grows to say 128KB (IDE). I saw tiobench 256 sequential writes on IDE
>>go from ~ 25% peak throughput to ~70% (4.85->14.11 from 20MB/s disk)
>>
>
>Well, play around with raw io, my box writes at roughly disk speed with
>128k synchronous requests (contiguous writes).
>

Yeah, I'm not talking about request overhead - I think a 128K sized
request is just fine. But when there are 256 threads writing, with
FIFO method, 128 threads will each have 1 request in the queue. If
they are sequential writers, each request will probably be 128K.
That isn't enough to get good disk bandwidth. The elevator _has_ to
make a suboptimal decision.

With batching, say 8 processes have 16 sequential requests on the
queue each. The elevator can make good choices.
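
Back-of-the-envelope, assuming the head loses ~20ms to seek/settle every
time it has to move to another process's file, on the 20MB/s disk above:

	128K between seeks:  128K / (20ms + ~6ms transfer)   ~=  5 MB/s
	2M between seeks:    2M   / (20ms + ~100ms transfer) ~= 17 MB/s

which lands in the same ballpark as the 4.85 -> 14.11 numbers I measured.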



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls
  2003-06-27  1:21                                                         ` Nick Piggin
@ 2003-06-27  1:39                                                           ` Chris Mason
  2003-06-27  9:45                                                             ` Nick Piggin
  0 siblings, 1 reply; 109+ messages in thread
From: Chris Mason @ 2003-06-27  1:39 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrea Arcangeli, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller

On Thu, 2003-06-26 at 21:21, Nick Piggin wrote:

> >Very true.  But get_request latency is the minimum amount of time a
> >single read is going to wait (in 2.4.x anyway), and that is what we need
> >to focus on when we're trying to fix interactive performance.
> >
> 
> The read situation is different to write. To fill the read queue,
> you need queue_nr_requests / 2-3 (for readahead) reading processes
> to fill the queue, more if the reads are random.
> If this kernel is being used interactively, it's not our fault we
> might not give quite as good interactive performance. I'm sure
> the fileserver admin would rather take the tripled bandwidth ;)
> 
> That said, I think a lot of interactive programs will want to do
> more than 1 request at a time anyway.
> 

My intuition agrees with yours, but if this is true then andrea's old
elevator-lowlatency patch alone is enough, and we don't need q->full at
all.  Users continued to complain of bad latencies even with his code
applied.

From a practical point of view his old code is the same as the batch
wakeup code for get_request latencies and provides good throughput. 
There are a few cases where batch wakeup has shorter overall latencies,
but I don't think people were in those heavy workloads while they were
complaining of stalls in -aa.

> >>Second, mergeable doesn't mean anything if your request size only
> >>grows to say 128KB (IDE). I saw tiobench 256 sequential writes on IDE
> >>go from ~ 25% peak throughput to ~70% (4.85->14.11 from 20MB/s disk)
> >>
> >
> >Well, play around with raw io, my box writes at roughly disk speed with
> >128k synchronous requests (contiguous writes).
> >
> 
> Yeah, I'm not talking about request overhead - I think a 128K sized
> request is just fine. But when there are 256 threads writing, with
> FIFO method, 128 threads will each have 1 request in the queue. If
> they are sequential writers, each request will probably be 128K.
> That isn't enough to get good disk bandwidth. The elevator _has_ to
> make a suboptimal decision.
> 
> With batching, say 8 processes have 16 sequential requests on the
> queue each. The elevator can make good choices.

I agree here too, it just doesn't match the user reports we've been
getting in 2.4 ;-)  If 2.5 can dynamically allocate requests now, then
you can get much better results with io contexts/dynamic wakeups,
but I can't see how to make it work in 2.4 without larger backports.

So, the way I see things, we've got a few choices.

1) do nothing.  2.6 isn't that far off.

2) add elevator-lowlatency without q->full.  It solves 90% of the
problem

3) add q->full as well and make it the default.  Great latencies, not so
good throughput.  Add userland tunables so people can switch.

4) back port some larger chunk of 2.5 and find a better overall
solution.

I vote for #3, don't care much if q->full is on or off by default, as
long as we make an easy way for people to set it.

-chris



^ permalink raw reply	[flat|nested] 109+ messages in thread

* write-caches, I/O stalls: MUST-FIX (was: [PATCH] io stalls)
  2003-06-25 20:18                                                   ` Chris Mason
@ 2003-06-27  8:41                                                     ` Matthias Andree
  0 siblings, 0 replies; 109+ messages in thread
From: Matthias Andree @ 2003-06-27  8:41 UTC (permalink / raw)
  To: lkml

On Wed, 25 Jun 2003, Chris Mason wrote:

> I've no preference really.  I didn't notice a throughput difference but
> my scsi drives only have 2MB of cache.

You shouldn't be using the drive's write cache in the first place!

The write cache, whether ATA or SCSI, cannot, as far as I know, be used
safely with any Linux file system (and my questions as to whether 2.6
will finally change that have gone unanswered so far), because the write
reordering the cache can do can seriously damage file systems,
journalling or not.

Please conduct all your tests with write caches turned off, because
that's what matters in REAL systems; in that case, these latencies
become a REAL pain in the back, since writing is so much slower with
all the seeks.

Optimizing for write-cached behaviour must not happen a single second
before:

1. the file systems know how to queue "ordered tags" in the right places
   (write barrier to enforce proper ordering for on-disk consistency,
   which I assume will make for a lot of ordered tags for writing to the
   journal itself)

2. the device driver knows how to map "ordered tags" to flush or
   whatever operations for drives that don't do tagged command queueing
   (ATA mostly, or SCSI when TCQ is switched off).
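
To make the ordering concrete, the sequence a journalling commit needs
looks roughly like this (pseudo-code only; no barrier primitive like this
exists in the 2.4 block layer today):

	write journal descriptor + data blocks
	BARRIER			(ordered tag with TCQ, cache flush otherwise)
	write commit record
	BARRIER
	write metadata in place (checkpoint)

Without the barriers the drive's write cache is free to let the commit
record hit the platter before the journal blocks it commits, which is
exactly how the broken files below happen.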

All these "0-bytes in file" problems with XFS, ReiserFS, JFS, ext2 and
ext3 in data=writeback mode happen because the kernel doesn't care about
write ordering, and these broken files are a) occasionally hard to find,
b) another PITA.

I consider proper write ordering and enforcing thereof a MUST-FIX. This
is much more important than getting some extra latencies squished. It
must do the right thing in the first place, and then it can do the right
thing faster.

I am aware that you're not the only person responsible for the state
Linux is in, and I'd like to see the write barriers revived ASAP for at
least ext2/ext3/reiserfs, sym53c8xx, aic7xxx, tmscsim and IDE. I am
sorry that I am not able to offer any help on that; I'm not acquainted with
the kernel internals and I can't donate money to have someone do it for me.

SCNR.

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls
  2003-06-27  1:39                                                           ` Chris Mason
@ 2003-06-27  9:45                                                             ` Nick Piggin
  2003-06-27 12:41                                                               ` Chris Mason
  0 siblings, 1 reply; 109+ messages in thread
From: Nick Piggin @ 2003-06-27  9:45 UTC (permalink / raw)
  To: Chris Mason
  Cc: Andrea Arcangeli, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller



Chris Mason wrote:

>On Thu, 2003-06-26 at 21:21, Nick Piggin wrote:
>
>
>>>Very true.  But get_request latency is the minimum amount of time a
>>>single read is going to wait (in 2.4.x anyway), and that is what we need
>>>to focus on when we're trying to fix interactive performance.
>>>
>>>
>>The read situation is different to write. To fill the read queue,
>>you need queue_nr_requests / 2-3 (for readahead) reading processes
>>to fill the queue, more if the reads are random.
>>If this kernel is being used interactively, it's not our fault we
>>might not give quite as good interactive performance. I'm sure
>>the fileserver admin would rather take the tripled bandwidth ;)
>>
>>That said, I think a lot of interactive programs will want to do
>>more than 1 request at a time anyway.
>>
>>
>
>My intuition agrees with yours, but if this is true then andrea's old
>elevator-lowlatency patch alone is enough, and we don't need q->full at
>all.  Users continued to complain of bad latencies even with his code
>applied.
>

Didn't that still have the starvation issues in get_request that
my patch addressed though? This batching is needed due to the
strict FIFO behaviour that my "q->full" thing did.

>
>From a practical point of view his old code is the same as the batch
>wakeup code for get_request latencies and provides good throughput. 
>There are a few cases where batch wakeup has shorter overall latencies,
>but I don't think people were in those heavy workloads while they were
>complaining of stalls in -aa.
>
>
>>>>Second, mergeable doesn't mean anything if your request size only
>>>>grows to say 128KB (IDE). I saw tiobench 256 sequential writes on IDE
>>>>go from ~ 25% peak throughput to ~70% (4.85->14.11 from 20MB/s disk)
>>>>
>>>>
>>>Well, play around with raw io, my box writes at roughly disk speed with
>>>128k synchronous requests (contiguous writes).
>>>
>>>
>>Yeah, I'm not talking about request overhead - I think a 128K sized
>>request is just fine. But when there are 256 threads writing, with
>>FIFO method, 128 threads will each have 1 request in the queue. If
>>they are sequential writers, each request will probably be 128K.
>>That isn't enough to get good disk bandwidth. The elevator _has_ to
>>make a suboptimal decision.
>>
>>With batching, say 8 processes have 16 sequential requests on the
>>queue each. The elevator can make good choices.
>>
>
>I agree here too, it just doesn't match the user reports we've been
>getting in 2.4 ;-)  If 2.5 can dynamically allocate requests now, then
>you can get much better results with io contexts/dynamic wakeups,
>but I can't see how to make it work in 2.4 without larger backports.
>
>So, the way I see things, we've got a few choices.
>
>1) do nothing.  2.6 isn't that far off.
>
>2) add elevator-lowlatency without q->full.  It solves 90% of the
>problem
>
>3) add q->full as well and make it the default.  Great latencies, not so
>good throughput.  Add userland tunables so people can switch.
>
>4) back port some larger chunk of 2.5 and find a better overall
>solution.
>
>I vote for #3, don't care much if q->full is on or off by default, as
>long as we make an easy way for people to set it.
>

5) include the "q->full" starvation fix; add the concept of a
   queue owner, the batching process.

I'm a bit busy at the moment and so I won't test this, unfortunately.
I would prefer that if something like #5 doesn't get in, then nothing
be done for .22 unless it's backed up by a few decent benchmarks. But
it's not my call anyway.

Cheers,
Nick



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] io stalls
  2003-06-27  9:45                                                             ` Nick Piggin
@ 2003-06-27 12:41                                                               ` Chris Mason
  0 siblings, 0 replies; 109+ messages in thread
From: Chris Mason @ 2003-06-27 12:41 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrea Arcangeli, Marc-Christian Petersen, Jens Axboe,
	Marcelo Tosatti, Georg Nikodym, lkml, Matthias Mueller

On Fri, 2003-06-27 at 05:45, Nick Piggin wrote:
> Chris Mason wrote:
> >>>
> >>The read situation is different to write. To fill the read queue,
> >>you need queue_nr_requests / 2-3 (for readahead) reading processes
> >>to fill the queue, more if the reads are random.
> >>If this kernel is being used interactively, its not our fault we
> >>might not give quite as good interactive performance. I'm sure
> >>the fileserver admin would rather take the tripled bandwidth ;)
> >>
> >>That said, I think a lot of interactive programs will want to do
> >>more than 1 request at a time anyway.
> >>
> >>
> >
> >My intuition agrees with yours, but if this is true then andrea's old
> >elevator-lowlatency patch alone is enough, and we don't need q->full at
> >all.  Users continued to complain of bad latencies even with his code
> >applied.
> >
> 
> Didn't that still have the starvation issues in get_request that
> my patch addressed though? This batching is needed due to the
> strict FIFO behaviour that my "q->full" thing did.
> 

Sure, but even though the batch wakeup code didn't have starvation
issues, the overall get_request latency was still high.  The end result
was basically the same: without q->full we've got a higher max wait and
a lower average wait, while with batch wakeup we've got a higher average
(300-400 jiffies) and a lower max (800-900 jiffies).

Especially for things like directory listings, where 2.4 generally does
io a few blocks at a time, the get_request latency is a big part of the
latency an interactive user sees.

> >So, the way I see things, we've got a few choices.
> >
> >1) do nothing.  2.6 isn't that far off.
> >
> >2) add elevator-lowlatency without q->full.  It solves 90% of the
> >problem
> >
> >3) add q->full as well and make it the default.  Great latencies, not so
> >good throughput.  Add userland tunables so people can switch.
> >
> >4) back port some larger chunk of 2.5 and find a better overall
> >solution.
> >
> >I vote for #3, don't care much if q->full is on or off by default, as
> >long as we make an easy way for people to set it.
> >
> 
> 5) include the "q->full" starvation fix; add the concept of a
>    queue owner, the batching process.
> 

I've tried two different approaches to #5.  The first was just a
batch_owner where other procs were still allowed to grab requests and the
owner was allowed to ignore q->full.  The end result was low latencies
but not much better throughput.  With a small number of procs, you've
got a good chance bdflush is going to get ownership and the throughput
is pretty good.  With more procs the probability of that goes down and
the throughput benefit goes away.

My second attempt was the batch wakeup patch from yesterday.  Overall I
don't feel the latencies are significantly better with that patch than
with Andrea's elevator-lowlatency and q->full disabled.

> I'm a bit busy at the moment and so I won't test this, unfortunately.
> I would prefer that if something like #5 doesn't get in, then nothing
> be done for .22 unless it's backed up by a few decent benchmarks. But
> it's not my call anyway.
> 

Andrea's code without q->full is a good starting point regardless.  The
throughput is good and the latencies are better overall.  q->full is
simple enough that making it available via a tunable is pretty easy.  I
really do wish I could make one patch that works well for both, but I've
honestly run out of ideas ;-)

-chris



^ permalink raw reply	[flat|nested] 109+ messages in thread

end of thread, other threads:[~2003-06-27 12:28 UTC | newest]

Thread overview: 109+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-05-29  0:55 Linux 2.4.21-rc6 Marcelo Tosatti
2003-05-29  1:22 ` Con Kolivas
2003-05-29  5:24   ` Marc Wilson
2003-05-29  5:34     ` Riley Williams
2003-05-29  5:57       ` Marc Wilson
2003-05-29  7:15         ` Riley Williams
2003-05-29  8:38         ` Willy Tarreau
2003-05-29  8:40           ` Willy Tarreau
2003-06-03 16:02         ` Marcelo Tosatti
2003-06-03 16:13           ` Marc-Christian Petersen
2003-06-04 21:54             ` Pavel Machek
2003-06-05  2:10               ` Michael Frank
2003-06-03 16:30           ` Michael Frank
2003-06-03 16:53             ` Matthias Mueller
2003-06-03 16:59             ` Marc-Christian Petersen
2003-06-03 17:03               ` Marc-Christian Petersen
2003-06-03 18:02                 ` Anders Karlsson
2003-06-03 21:12                   ` J.A. Magallon
2003-06-03 21:18                     ` Marc-Christian Petersen
2003-06-03 17:23               ` Michael Frank
2003-06-04 14:56             ` Jakob Oestergaard
2003-06-04  4:04           ` Marc Wilson
2003-05-29 10:02 ` Con Kolivas
2003-05-29 18:00 ` Georg Nikodym
2003-05-29 19:11   ` -rc7 " Marcelo Tosatti
2003-05-29 19:56     ` Krzysiek Taraszka
2003-05-29 20:18       ` Krzysiek Taraszka
2003-06-04 18:17         ` Marcelo Tosatti
2003-06-04 21:41           ` Krzysiek Taraszka
2003-06-04 22:37             ` Alan Cox
2003-06-04 10:22     ` Andrea Arcangeli
2003-06-04 10:35       ` Marc-Christian Petersen
2003-06-04 10:42         ` Jens Axboe
2003-06-04 10:46           ` Marc-Christian Petersen
2003-06-04 10:48             ` Andrea Arcangeli
2003-06-04 11:57               ` Nick Piggin
2003-06-04 12:00                 ` Jens Axboe
2003-06-04 12:09                   ` Andrea Arcangeli
2003-06-04 12:20                     ` Jens Axboe
2003-06-04 20:50                       ` Rob Landley
2003-06-04 12:11                   ` Nick Piggin
2003-06-04 12:35                 ` Miquel van Smoorenburg
2003-06-09 21:39                 ` [PATCH] io stalls (was: -rc7 Re: Linux 2.4.21-rc6) Chris Mason
2003-06-09 22:19                   ` Andrea Arcangeli
2003-06-10  0:27                     ` Chris Mason
2003-06-10 23:13                     ` Chris Mason
2003-06-11  0:16                       ` Andrea Arcangeli
2003-06-11  0:44                         ` Chris Mason
2003-06-09 23:51                   ` [PATCH] io stalls Nick Piggin
2003-06-10  0:32                     ` Chris Mason
2003-06-10  0:47                       ` Nick Piggin
2003-06-10  1:48                     ` Robert White
2003-06-10  2:13                       ` Chris Mason
2003-06-10 23:04                         ` Robert White
2003-06-11  0:58                           ` Chris Mason
2003-06-10  3:22                       ` Nick Piggin
2003-06-10 21:17                         ` Robert White
2003-06-11  0:40                           ` Nick Piggin
2003-06-11  0:33                   ` [PATCH] io stalls (was: -rc7 Re: Linux 2.4.21-rc6) Andrea Arcangeli
2003-06-11  0:48                     ` [PATCH] io stalls Nick Piggin
2003-06-11  1:07                       ` Andrea Arcangeli
2003-06-11  0:54                     ` [PATCH] io stalls (was: -rc7 Re: Linux 2.4.21-rc6) Chris Mason
2003-06-11  1:06                       ` Andrea Arcangeli
2003-06-11  1:57                         ` Chris Mason
2003-06-11  2:10                           ` Andrea Arcangeli
2003-06-11 12:24                             ` Chris Mason
2003-06-11 17:42                             ` Chris Mason
2003-06-11 18:12                               ` Andrea Arcangeli
2003-06-11 18:27                                 ` Chris Mason
2003-06-11 18:35                                   ` Andrea Arcangeli
2003-06-12  1:04                                     ` [PATCH] io stalls Nick Piggin
2003-06-12  1:12                                       ` Chris Mason
2003-06-12  1:29                                       ` Andrea Arcangeli
2003-06-12  1:37                                         ` Andrea Arcangeli
2003-06-12  2:22                                         ` Chris Mason
2003-06-12  2:41                                           ` Nick Piggin
2003-06-12  2:46                                             ` Andrea Arcangeli
2003-06-12  2:49                                               ` Nick Piggin
2003-06-12  2:51                                                 ` Nick Piggin
2003-06-12  2:52                                                   ` Nick Piggin
2003-06-12  3:04                                                   ` Andrea Arcangeli
2003-06-12  2:58                                                 ` Andrea Arcangeli
2003-06-12  3:04                                                   ` Nick Piggin
2003-06-12  3:12                                                     ` Andrea Arcangeli
2003-06-12  3:20                                                       ` Nick Piggin
2003-06-12  3:33                                                         ` Andrea Arcangeli
2003-06-12  3:48                                                           ` Nick Piggin
2003-06-12  4:17                                                             ` Andrea Arcangeli
2003-06-12  4:41                                                               ` Nick Piggin
2003-06-12 16:06                                                         ` Chris Mason
2003-06-12 16:16                                                           ` Nick Piggin
2003-06-25 19:03                                               ` Chris Mason
2003-06-25 19:25                                                 ` Andrea Arcangeli
2003-06-25 20:18                                                   ` Chris Mason
2003-06-27  8:41                                                     ` write-caches, I/O stalls: MUST-FIX (was: [PATCH] io stalls) Matthias Andree
2003-06-26  5:48                                                 ` [PATCH] io stalls Nick Piggin
2003-06-26 11:48                                                   ` Chris Mason
2003-06-26 13:04                                                     ` Nick Piggin
2003-06-26 13:18                                                       ` Nick Piggin
2003-06-26 15:55                                                       ` Chris Mason
2003-06-27  1:21                                                         ` Nick Piggin
2003-06-27  1:39                                                           ` Chris Mason
2003-06-27  9:45                                                             ` Nick Piggin
2003-06-27 12:41                                                               ` Chris Mason
2003-06-12 11:57                                             ` Chris Mason
2003-06-04 10:43         ` -rc7 Re: Linux 2.4.21-rc6 Andrea Arcangeli
2003-06-04 11:01           ` Marc-Christian Petersen
2003-06-03 19:45 ` Config issue (CONFIG_X86_TSC) " Paul
2003-06-03 20:18   ` Jan-Benedict Glaw
