* linux-2.4.10-pre5
@ 2001-09-08  4:18 Linus Torvalds
  2001-09-08  6:32 ` linux-2.4.10-pre5: drivers/net/wan compile fixes Eyal Lebedinsky
                   ` (4 more replies)
  0 siblings, 5 replies; 94+ messages in thread
From: Linus Torvalds @ 2001-09-08  4:18 UTC (permalink / raw)
  To: Kernel Mailing List


Up on the regular places..

		Linus

-----
pre5:
 - Merge with Alan
 - Trond Myklebust: NFS fixes - kmap and root inode special case
 - Al Viro: more superblock cleanups, inode leak in rd.c, minix
   directories in page cache
 - Paul Mackerras: clean up rubbish from sl82c105.c
 - Neil Brown: md/raid cleanups, NFS filehandles
 - Johannes Erdfelt: USB update (usb-2.0 support, visor fix, Clie fix,
   pl2303 driver update)
 - David Miller: sparc and net update
 - Eric Biederman: simplify and correct bootdata allocation - don't
   overwrite ramdisks
 - Tim Waugh: support multiple SuperIO devices, parport doc updates

pre4:
 - Hugh Dickins: swapoff cleanups and speedups
 - Matthew Dharm: USB storage update
 - Keith Owens: Makefile fixes
 - Tom Rini: MPC8xx build fix
 - Nikita Danilov: reiserfs update
 - Jakub Jelinek: ELF loader fix for ET_DYN
 - Andrew Morton: reparent_to_init() for kernel threads
 - Christoph Hellwig: VxFS and SysV updates, vfs_permission fix

pre3:
 - Johannes Erdfelt, Oliver Neukum: USB printer driver race fix
 - John Byrne: fix stupid i386-SMP irq stack layout bug
 - Andreas Bombe, me: yenta IO window fix
 - Neil Brown: raid1 buffer state fix
 - David Miller, Paul Mackerras: fix up sparc and ppc respectively for kmap/kbd_rate
 - Matija Nalis: umsdos fixes, and make it possible to boot up with umsdos
 - Francois Romieu: fix bugs in dscc4 driver
 - Andy Grover: new PCI config space access functions (eventually for ACPI)
 - Albert Cranford: fix incorrect e2fsprog data from ver_linux script
 - Dave Jones: re-sync x86 setup code, fix macsonic kmalloc use
 - Johannes Erdfelt: remove obsolete plusb USB driver
 - Andries Brouwer: fix USB compact flash version info, add blksize ioctls

pre2:
 - Al Viro: block device cleanups
 - Marcelo Tosatti: make bounce buffer allocations more robust (it's ok
   for them to do IO, just not cause recursive bounce IO. So allow them)
 - Anton Altaparmakov: NTFS update (1.1.17)
 - Paul Mackerras: PPC update (big re-org)
 - Petko Manolov: USB pegasus driver fixes
 - David Miller: networking and sparc updates
 - Trond Myklebust: Export atomic_dec_and_lock
 - OGAWA Hirofumi: find and fix umsdos "filldir" users that were broken
   by the 64-bit-cleanups. Fix msdos warnings.
 - Al Viro: superblock handling cleanups and race fixes
 - Johannes Erdfelt++: USB updates

pre1:
 - Jeff Hartmann: DRM AGP/alpha cleanups
 - Ben LaHaise: highmem user pagecopy/clear optimization
 - Vojtech Pavlik: VIA IDE driver update
 - Herbert Xu: make cramfs work with HIGHMEM pages
 - David Fennell: awe32 ram size detection improvement
 - Istvan Varadi: umsdos EMD filename bug fix
 - Keith Owens: make min/max work for pointers too
 - Jan Kara: quota initialization fix
 - Brad Hards: Kaweth USB driver update (enable, and fix endianness)
 - Ralf Baechle: MIPS updates
 - David Gibson: airport driver update
 - Rogier Wolff: firestream ATM driver multi-phy support
 - Daniel Phillips: swap read page referenced set - avoid swap thrashing



* Re: linux-2.4.10-pre5: drivers/net/wan compile fixes
  2001-09-08  4:18 linux-2.4.10-pre5 Linus Torvalds
@ 2001-09-08  6:32 ` Eyal Lebedinsky
  2001-09-08  6:36 ` 2.4.9-ac10 (not 2.4.10-pre5!) wan fixes Eyal Lebedinsky
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 94+ messages in thread
From: Eyal Lebedinsky @ 2001-09-08  6:32 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 90 bytes --]

A number of modules fail to include <linux/module.h> but use the new
MODULE_LICENSE macro.
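
For context, MODULE_LICENSE() is defined in <linux/module.h>. In 2.4-era
kernels it expands to roughly the following (simplified here), so a file
that uses the macro without that include fails to build:

/* Roughly what <linux/module.h> provides in 2.4-era kernels (simplified):
 * the macro emits a tagged string into the .modinfo section.  Without the
 * header the preprocessor never expands MODULE_LICENSE("...") and the
 * compiler rejects the stray statement at file scope. */
#define MODULE_LICENSE(license)						\
static const char __module_license[] __attribute__((section(".modinfo"))) = \
	"license=" license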

[-- Attachment #2: 2.4.10-pre5-wan.patch --]
[-- Type: text/plain, Size: 3110 bytes --]

diff -u -r linux/drivers/net/wan-orig/sdla_chdlc.c linux/drivers/net/wan/sdla_chdlc.c
--- linux/drivers/net/wan-orig/sdla_chdlc.c	Sat Sep  8 16:22:28 2001
+++ linux/drivers/net/wan/sdla_chdlc.c	Sat Sep  8 16:17:52 2001
@@ -48,6 +48,7 @@
 * Aug 07, 1998	David Fong	Initial version.
 *****************************************************************************/
 
+#include <linux/module.h>
 #include <linux/version.h>
 #include <linux/kernel.h>	/* printk(), and other useful stuff */
 #include <linux/stddef.h>	/* offsetof(), etc. */
diff -u -r linux/drivers/net/wan-orig/sdla_fr.c linux/drivers/net/wan/sdla_fr.c
--- linux/drivers/net/wan-orig/sdla_fr.c	Sat Sep  8 16:23:04 2001
+++ linux/drivers/net/wan/sdla_fr.c	Sat Sep  8 16:23:17 2001
@@ -138,6 +138,7 @@
 * Jan 02, 1997	Gene Kozin	Initial version.
 *****************************************************************************/
 
+#include <linux/module.h>
 #include <linux/version.h>
 #include <linux/kernel.h>	/* printk(), and other useful stuff */
 #include <linux/stddef.h>	/* offsetof(), etc. */
diff -u -r linux/drivers/net/wan-orig/sdla_ft1.c linux/drivers/net/wan/sdla_ft1.c
--- linux/drivers/net/wan-orig/sdla_ft1.c	Sat Sep  8 16:13:18 2001
+++ linux/drivers/net/wan/sdla_ft1.c	Sat Sep  8 16:13:34 2001
@@ -20,6 +20,7 @@
 * Aug 07, 1998	David Fong	Initial version.
 *****************************************************************************/
 
+#include <linux/module.h>
 #include <linux/version.h>
 #include <linux/kernel.h>	/* printk(), and other useful stuff */
 #include <linux/stddef.h>	/* offsetof(), etc. */
diff -u -r linux/drivers/net/wan-orig/sdla_ppp.c linux/drivers/net/wan/sdla_ppp.c
--- linux/drivers/net/wan-orig/sdla_ppp.c	Sat Sep  8 16:22:13 2001
+++ linux/drivers/net/wan/sdla_ppp.c	Sat Sep  8 16:18:50 2001
@@ -90,6 +90,7 @@
 * Jan 06, 1997	Gene Kozin	Initial version.
 *****************************************************************************/
 
+#include <linux/module.h>
 #include <linux/version.h>
 #include <linux/kernel.h>	/* printk(), and other useful stuff */
 #include <linux/stddef.h>	/* offsetof(), etc. */
diff -u -r linux/drivers/net/wan-orig/sdla_x25.c linux/drivers/net/wan/sdla_x25.c
--- linux/drivers/net/wan-orig/sdla_x25.c	Sat Sep  8 16:15:33 2001
+++ linux/drivers/net/wan/sdla_x25.c	Sat Sep  8 16:15:12 2001
@@ -81,6 +81,7 @@
  * 	Includes 
  *=====================================================*/
 
+#include <linux/module.h>
 #include <linux/version.h>
 #include <linux/kernel.h>	/* printk(), and other useful stuff */
 #include <linux/stddef.h>	/* offsetof(), etc. */
diff -u -r linux/drivers/net/wan-orig/wanpipe_multppp.c linux/drivers/net/wan/wanpipe_multppp.c
--- linux/drivers/net/wan-orig/wanpipe_multppp.c	Sat Sep  8 16:24:02 2001
+++ linux/drivers/net/wan/wanpipe_multppp.c	Sat Sep  8 16:19:59 2001
@@ -17,6 +17,7 @@
 *  		module.
 *****************************************************************************/
 
+#include <linux/module.h>
 #include <linux/version.h>
 #include <linux/kernel.h>	/* printk(), and other useful stuff */
 #include <linux/stddef.h>	/* offsetof(), etc. */


* 2.4.9-ac10 (not 2.4.10-pre5!) wan fixes
  2001-09-08  4:18 linux-2.4.10-pre5 Linus Torvalds
  2001-09-08  6:32 ` linux-2.4.10-pre5: drivers/net/wan compile fixes Eyal Lebedinsky
@ 2001-09-08  6:36 ` Eyal Lebedinsky
  2001-09-08  8:32 ` 2.4.10-pre5 compile error George Bonser
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 94+ messages in thread
From: Eyal Lebedinsky @ 2001-09-08  6:36 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List, Cox, Alan

[-- Attachment #1: Type: text/plain, Size: 90 bytes --]

A number of modules fail to include <linux/module.h> but use the new
MODULE_LICENSE macro.

[-- Attachment #2: 2.4.9-ac10-wan.patch --]
[-- Type: text/plain, Size: 3110 bytes --]

diff -u -r linux/drivers/net/wan-orig/sdla_chdlc.c linux/drivers/net/wan/sdla_chdlc.c
--- linux/drivers/net/wan-orig/sdla_chdlc.c	Sat Sep  8 16:22:28 2001
+++ linux/drivers/net/wan/sdla_chdlc.c	Sat Sep  8 16:17:52 2001
@@ -48,6 +48,7 @@
 * Aug 07, 1998	David Fong	Initial version.
 *****************************************************************************/
 
+#include <linux/module.h>
 #include <linux/version.h>
 #include <linux/kernel.h>	/* printk(), and other useful stuff */
 #include <linux/stddef.h>	/* offsetof(), etc. */
diff -u -r linux/drivers/net/wan-orig/sdla_fr.c linux/drivers/net/wan/sdla_fr.c
--- linux/drivers/net/wan-orig/sdla_fr.c	Sat Sep  8 16:23:04 2001
+++ linux/drivers/net/wan/sdla_fr.c	Sat Sep  8 16:23:17 2001
@@ -138,6 +138,7 @@
 * Jan 02, 1997	Gene Kozin	Initial version.
 *****************************************************************************/
 
+#include <linux/module.h>
 #include <linux/version.h>
 #include <linux/kernel.h>	/* printk(), and other useful stuff */
 #include <linux/stddef.h>	/* offsetof(), etc. */
diff -u -r linux/drivers/net/wan-orig/sdla_ft1.c linux/drivers/net/wan/sdla_ft1.c
--- linux/drivers/net/wan-orig/sdla_ft1.c	Sat Sep  8 16:13:18 2001
+++ linux/drivers/net/wan/sdla_ft1.c	Sat Sep  8 16:13:34 2001
@@ -20,6 +20,7 @@
 * Aug 07, 1998	David Fong	Initial version.
 *****************************************************************************/
 
+#include <linux/module.h>
 #include <linux/version.h>
 #include <linux/kernel.h>	/* printk(), and other useful stuff */
 #include <linux/stddef.h>	/* offsetof(), etc. */
diff -u -r linux/drivers/net/wan-orig/sdla_ppp.c linux/drivers/net/wan/sdla_ppp.c
--- linux/drivers/net/wan-orig/sdla_ppp.c	Sat Sep  8 16:22:13 2001
+++ linux/drivers/net/wan/sdla_ppp.c	Sat Sep  8 16:18:50 2001
@@ -90,6 +90,7 @@
 * Jan 06, 1997	Gene Kozin	Initial version.
 *****************************************************************************/
 
+#include <linux/module.h>
 #include <linux/version.h>
 #include <linux/kernel.h>	/* printk(), and other useful stuff */
 #include <linux/stddef.h>	/* offsetof(), etc. */
diff -u -r linux/drivers/net/wan-orig/sdla_x25.c linux/drivers/net/wan/sdla_x25.c
--- linux/drivers/net/wan-orig/sdla_x25.c	Sat Sep  8 16:15:33 2001
+++ linux/drivers/net/wan/sdla_x25.c	Sat Sep  8 16:15:12 2001
@@ -81,6 +81,7 @@
  * 	Includes 
  *=====================================================*/
 
+#include <linux/module.h>
 #include <linux/version.h>
 #include <linux/kernel.h>	/* printk(), and other useful stuff */
 #include <linux/stddef.h>	/* offsetof(), etc. */
diff -u -r linux/drivers/net/wan-orig/wanpipe_multppp.c linux/drivers/net/wan/wanpipe_multppp.c
--- linux/drivers/net/wan-orig/wanpipe_multppp.c	Sat Sep  8 16:24:02 2001
+++ linux/drivers/net/wan/wanpipe_multppp.c	Sat Sep  8 16:19:59 2001
@@ -17,6 +17,7 @@
 *  		module.
 *****************************************************************************/
 
+#include <linux/module.h>
 #include <linux/version.h>
 #include <linux/kernel.h>	/* printk(), and other useful stuff */
 #include <linux/stddef.h>	/* offsetof(), etc. */


* 2.4.10-pre5 compile error
  2001-09-08  4:18 linux-2.4.10-pre5 Linus Torvalds
  2001-09-08  6:32 ` linux-2.4.10-pre5: drivers/net/wan compile fixes Eyal Lebedinsky
  2001-09-08  6:36 ` 2.4.9-ac10 (not 2.4.10-pre5!) wan fixes Eyal Lebedinsky
@ 2001-09-08  8:32 ` George Bonser
  2001-09-08 17:19 ` linux-2.4.10-pre5 Andrea Arcangeli
  2001-09-08 22:01 ` linux-2.4.10-pre5 Thiago Vinhas de Moraes
  4 siblings, 0 replies; 94+ messages in thread
From: George Bonser @ 2001-09-08  8:32 UTC (permalink / raw)
  To: Linus Torvalds, Kernel Mailing List

gcc -D__KERNEL__ -I/usr/src/linux/include -Wall -Wstrict-prototypes
 -Wno-trigraphs -O2 -fomit-frame-pointer -fno-strict-aliasing -fno-common
 -pipe -mpreferred-stack-boundary=2 -march=i686 -c -o dquot.o dquot.c
dquot.c: In function `add_dquot_ref':
dquot.c:673: `file' undeclared (first use in this function)
dquot.c:673: (Each undeclared identifier is reported only once
dquot.c:673: for each function it appears in.)
make[3]: *** [dquot.o] Error 1
make[3]: Leaving directory `/usr/src/linux/fs'
make[2]: *** [first_rule] Error 2
make[2]: Leaving directory `/usr/src/linux/fs'
make[1]: *** [_dir_fs] Error 2
make[1]: Leaving directory `/usr/src/linux'
make: *** [stamp-build] Error 2
:/usr/src/linux#

:/usr/src/linux# gcc --version
2.95.4
:/usr/src/linux#
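
The error points at a reference to an undeclared variable named `file'
inside add_dquot_ref() in fs/dquot.c, which in 2.4-era kernels walks the
superblock's list of open files. A hypothetical sketch of that loop, with
the undeclared name replaced by the loop's own variable (names here are
illustrative, not the actual pre5 source):

/*
 * Illustrative sketch only -- the general shape of add_dquot_ref() in
 * 2.4-era fs/dquot.c, not the actual 2.4.10-pre5 code.  The build error
 * above suggests a stray reference to "file"; the repair is to use the
 * variable the loop actually declares (called "filp" here).
 */
static void add_dquot_ref(struct super_block *sb, short type)
{
	struct list_head *p;

	file_list_lock();
	for (p = sb->s_files.next; p != &sb->s_files; p = p->next) {
		struct file *filp = list_entry(p, struct file, f_list);
		struct inode *inode = filp->f_dentry->d_inode;

		/* presumably was "file->f_mode", with no "file" in scope */
		if ((filp->f_mode & FMODE_WRITE) && dqinit_needed(inode, type)) {
			/* ... attach dquot references to the inode ... */
		}
	}
	file_list_unlock();
}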



* Re: linux-2.4.10-pre5
  2001-09-08  4:18 linux-2.4.10-pre5 Linus Torvalds
                   ` (2 preceding siblings ...)
  2001-09-08  8:32 ` 2.4.10-pre5 compile error George Bonser
@ 2001-09-08 17:19 ` Andrea Arcangeli
  2001-09-08 17:30   ` linux-2.4.10-pre5 Linus Torvalds
  2001-09-08 22:01 ` linux-2.4.10-pre5 Thiago Vinhas de Moraes
  4 siblings, 1 reply; 94+ messages in thread
From: Andrea Arcangeli @ 2001-09-08 17:19 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List, Alexander Viro

On Fri, Sep 07, 2001 at 09:18:47PM -0700, Linus Torvalds wrote:
> pre5:
>  - Merge with Alan
>  - Trond Myklebust: NFS fixes - kmap and root inode special case
>  - Al Viro: more superblock cleanups, inode leak in rd.c, minix
>    directories in page cache

"inode leak in rd.c" is correct description, what's wrong is that pre5
is the one introducing the leak, and there was nothing to fix there. The
inode of the initrd must be released with iput because it's a virtual
inode, so if we don't iput it from there we'll lose all references to
it.

There are instead a few initrd bugs that I corrected in the
blkdev-pagecache patch while rewriting the ramdisk/initrd pagecache
backed but I'm too lazy to extract them since they're quite low prio,
I'll just wait for the blkdev in pagecache to be merged instead.

Andrea


* Re: linux-2.4.10-pre5
  2001-09-08 17:19 ` linux-2.4.10-pre5 Andrea Arcangeli
@ 2001-09-08 17:30   ` Linus Torvalds
  2001-09-08 17:57     ` linux-2.4.10-pre5 Andrea Arcangeli
  2001-09-08 21:15     ` linux-2.4.10-pre5 Andreas Dilger
  0 siblings, 2 replies; 94+ messages in thread
From: Linus Torvalds @ 2001-09-08 17:30 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Kernel Mailing List, Alexander Viro


On Sat, 8 Sep 2001, Andrea Arcangeli wrote:
>
> There are instead a few initrd bugs that I corrected in the
> blkdev-pagecache patch while rewriting the ramdisk/initrd code to be
> pagecache backed, but I'm too lazy to extract them since they're quite low
> priority; I'll just wait for the blkdev-in-pagecache work to be merged instead.

I'll merge the blkdev in pagecache very early in 2.5.x, but I'm a bit
nervous about merging it in 2.4.x.

That said, if you'll give a description of how you fixed the aliasing
issues etc, maybe I'd be less nervous. Putting it in the page cache is
100% the right thing to do, so in theory I'd really like to merge it
earlier rather than later, but...

		Linus



* Re: linux-2.4.10-pre5
  2001-09-08 17:30   ` linux-2.4.10-pre5 Linus Torvalds
@ 2001-09-08 17:57     ` Andrea Arcangeli
  2001-09-08 18:01       ` linux-2.4.10-pre5 Linus Torvalds
  2001-09-08 21:15     ` linux-2.4.10-pre5 Andreas Dilger
  1 sibling, 1 reply; 94+ messages in thread
From: Andrea Arcangeli @ 2001-09-08 17:57 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List, Alexander Viro

On Sat, Sep 08, 2001 at 10:30:30AM -0700, Linus Torvalds wrote:
> That said, if you'll give a description of how you fixed the aliasing
> issues etc, maybe I'd be less nervous. Putting it in the page cache is

First of all I just __block_fsync + truncate_inode_pages(inode->i_mapping, 0) so
all pagecache updates are committed to disk after that, so the latest uptodate
data is on disk and nothing uptodate is in memory. 

If the fs mounted under us appears to be mounted read-only [I don't
even try to synchronize if the fs was mounted rw], I invalidate_device to
possibly shrink its higher-level caches as well (this call is
non-destructive so it's safe, it's just for sanity). Then I lock_super [to
synchronize against a ->remount that could otherwise remount the
device rw under me], recheck that the fs is still read-only with the super
lock acquired, and if it still is I do the real work (aka
update_buffers).

Let's see the update_buffers in detail:

[..]
#define update_buffers(dev)			\
do {						\
	__invalidate_buffers((dev), 0, 1);	\
	__invalidate_buffers((dev), 0, 2);	\
} while (0)
[..]
   For handling cache coherency with the blkdev pagecache the 'update' case
   has been introduced. It is needed to re-read from disk any pinned
   buffer. NOTE: re-reading from disk is destructive so we can do it only
   when we assume nobody is changing the buffercache under our I/O and when
   we think the disk contains more recent information than the buffercache.
   The update == 1 pass marks the buffers we need to update, the update == 2
   pass does the actual I/O. */
void __invalidate_buffers(kdev_t dev, int destroy_dirty_buffers, int update)
[..]

What update_buffers does is identify the pinned buffers in the buffercache
and re-read them from disk underneath the filesystem (we know the filesystem is
mounted readonly so we can [modulo the fs getting confused if it sees
incoherent information while we do the update; but with the user app writing to
the buffercache directly things were even worse, due to the potentially larger
window of time for the race to trigger, so it's certainly as acceptable as the
current code is]).
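
Putting the above together, a minimal sketch of the ->release ordering being
described -- an illustration using the names from this mail (__block_fsync,
update_buffers), not the actual patch; the helper name, its arguments and
the invalidate_device() call are assumptions:

/*
 * Illustrative sketch of the blkdev ->release path described above, NOT
 * the actual blkdev-in-pagecache patch.  How the mounted superblock is
 * obtained and the exact signatures are assumptions; the step ordering
 * follows the mail, and everything runs under the bdev semaphore.
 */
static void sync_blkdev_on_last_release(struct inode *bd_inode, kdev_t dev,
					struct super_block *sb)
{
	/* 1. flush our pagecache changes, then drop the blkdev pagecache */
	__block_fsync(bd_inode);
	truncate_inode_pages(bd_inode->i_mapping, 0);

	/* 2. non-destructive sanity pass: shrink higher level caches */
	invalidate_device(dev, 0);

	/* 3. resync the pinned buffercache only under a ro-mounted fs */
	if (sb && (sb->s_flags & MS_RDONLY)) {
		lock_super(sb);		/* keep ->remount from going rw under us */
		if (sb->s_flags & MS_RDONLY)
			update_buffers(dev);	/* mark, then re-read pinned buffers */
		unlock_super(sb);
	}
}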

Another fairly tricky part of the patch [not described here] is the
pagecache sharing between different inodes:

	mknod hda .. A B ..
	mknod hda.new .. A B ..

then start using hda and hda.new at the same time and have them share the same
pagecache; that case has been taken care of too.

Andrea


* Re: linux-2.4.10-pre5
  2001-09-08 17:57     ` linux-2.4.10-pre5 Andrea Arcangeli
@ 2001-09-08 18:01       ` Linus Torvalds
  2001-09-09  1:09         ` linux-2.4.10-pre5 Andrea Arcangeli
  0 siblings, 1 reply; 94+ messages in thread
From: Linus Torvalds @ 2001-09-08 18:01 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Kernel Mailing List, Alexander Viro


On Sat, 8 Sep 2001, Andrea Arcangeli wrote:
>
> First of all I just __block_fsync + truncate_inode_pages(inode->i_mapping, 0) so
> all pagecache updates are committed to disk after that, so the latest uptodate
> data is on disk and nothing uptodate is in memory.

Hmm. And if two openers have the device open at the same time? One dirties
data after the first one has done __block_fsync? And the truncate will
throw the dirtied page away?

Now, I don't think we need to be _too_ careful about cache coherency: it's
probably ok to do something like

	__block_fsync(dev) - sync _our_ changes
	invalidate_inode_pages(dev) - this will only invalidate unused pages
	invalidate_device(dev)

But I'd just like to feel really comfortable about it. And part of that is
probably to be simpler rather than be clever..

		Linus



* Re: linux-2.4.10-pre5
  2001-09-08 17:30   ` linux-2.4.10-pre5 Linus Torvalds
  2001-09-08 17:57     ` linux-2.4.10-pre5 Andrea Arcangeli
@ 2001-09-08 21:15     ` Andreas Dilger
  2001-09-09  0:59       ` linux-2.4.10-pre5 Andrea Arcangeli
  1 sibling, 1 reply; 94+ messages in thread
From: Andreas Dilger @ 2001-09-08 21:15 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andrea Arcangeli, Kernel Mailing List, Alexander Viro

On Sep 08, 2001  10:30 -0700, Linus Torvalds wrote:
> I'll merge the blkdev in pagecache very early in 2.5.x, but I'm a bit
> nervous about merging it in 2.4.x.
> 
> That said, if you'll give a description of how you fixed the aliasing
> issues etc, maybe I'd be less nervous. Putting it in the page cache is
> 100% the right thing to do, so in theory I'd really like to merge it
> earlier rather than later, but...

I think this may have bad interactions with the filesystems, which still
use buffer cache for metadata.  If the block devices move to page cache,
so should the filesystems.

For example, the "tune2fs" program will modify parts of the superblock
from user space (fields that are read-only from the kernel, e.g. label,
reserved blocks count, etc), because it knows that the data read/written
on /dev/hda1 is coherent with that in the kernel for the filesystem
on /dev/hda1.  The same is true with e2fsck - the metadata should be
kept coherent from user-space and kernel-space or bad things happen.
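
To make the kind of access concrete: a simplified user-space sketch of what
a label change through the block device node amounts to. Offsets follow the
classic ext2 on-disk layout (superblock at byte 1024, 16-byte volume label
at offset 120 within it); this is not the actual tune2fs code and error
handling is minimal:

/* Simplified illustration of a tune2fs-style superblock edit through the
 * block device node; not the real tune2fs source. */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define EXT2_SB_OFFSET	1024	/* superblock lives 1KB into the device */
#define EXT2_LABEL_OFF	120	/* s_volume_name offset inside the superblock */
#define EXT2_LABEL_LEN	16

int set_ext2_label(const char *blkdev, const char *label)
{
	char sb[1024];
	int fd = open(blkdev, O_RDWR);

	if (fd < 0)
		return -1;
	/* read the on-disk superblock, patch the label, write it back;
	 * this only works if the kernel's view of the device stays
	 * coherent with what we read and write here */
	if (lseek(fd, EXT2_SB_OFFSET, SEEK_SET) < 0 ||
	    read(fd, sb, sizeof(sb)) != sizeof(sb))
		goto fail;
	memset(sb + EXT2_LABEL_OFF, 0, EXT2_LABEL_LEN);
	strncpy(sb + EXT2_LABEL_OFF, label, EXT2_LABEL_LEN - 1);
	if (lseek(fd, EXT2_SB_OFFSET, SEEK_SET) < 0 ||
	    write(fd, sb, sizeof(sb)) != sizeof(sb))
		goto fail;
	return close(fd);
fail:
	close(fd);
	return -1;
}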

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert



* Re: linux-2.4.10-pre5
  2001-09-08  4:18 linux-2.4.10-pre5 Linus Torvalds
                   ` (3 preceding siblings ...)
  2001-09-08 17:19 ` linux-2.4.10-pre5 Andrea Arcangeli
@ 2001-09-08 22:01 ` Thiago Vinhas de Moraes
  4 siblings, 0 replies; 94+ messages in thread
From: Thiago Vinhas de Moraes @ 2001-09-08 22:01 UTC (permalink / raw)
  To: Linus Torvalds, Kernel Mailing List


Linus,

What's the current state of merges with Alan? I'm sorry for asking this, but
it's becoming too difficult to choose which tree to use, and IMHO I don't see any
reason for not merging all the known-stable code from Alan's tree.
It would probably make life easier for a lot of people. ;-)

Best Regards,
Thiago Vinhas



On Sat, 08 Sep 2001 at 01:18, Linus Torvalds wrote:
> -----
> pre5:
>  - Merge with Alan


* Re: linux-2.4.10-pre5
  2001-09-08 21:15     ` linux-2.4.10-pre5 Andreas Dilger
@ 2001-09-09  0:59       ` Andrea Arcangeli
  0 siblings, 0 replies; 94+ messages in thread
From: Andrea Arcangeli @ 2001-09-09  0:59 UTC (permalink / raw)
  To: Linus Torvalds, Kernel Mailing List, Alexander Viro

On Sat, Sep 08, 2001 at 03:15:39PM -0600, Andreas Dilger wrote:
> On Sep 08, 2001  10:30 -0700, Linus Torvalds wrote:
> > I'll merge the blkdev in pagecache very early in 2.5.x, but I'm a bit
> > nervous about merging it in 2.4.x.
> > 
> > That said, if you'll give a description of how you fixed the aliasing
> > issues etc, maybe I'd be less nervous. Putting it in the page cache is
> > 100% the right thing to do, so in theory I'd really like to merge it
> > earlier rather than later, but...
> 
> I think this may have bad interactions with the filesystems, which still
> use buffer cache for metadata.  If the block devices move to page cache,
> so should the filesystems.
> 
> For example, the "tune2fs" program will modify parts of the superblock
> from user space (fields that are read-only from the kernel, e.g. label,
> reserved blocks count, etc), because it knows that the data read/written
> on /dev/hda1 is coherent with that in the kernel for the filesystem
> on /dev/hda1.  The same is true with e2fsck - the metadata should be
> kept coherent from user-space and kernel-space or bad things happen.

The patch takes care of that transparently, of course; if it didn't, you
would keep doing long fscks of the root filesystem forever.

Andrea


* Re: linux-2.4.10-pre5
  2001-09-08 18:01       ` linux-2.4.10-pre5 Linus Torvalds
@ 2001-09-09  1:09         ` Andrea Arcangeli
  2001-09-09  1:20           ` linux-2.4.10-pre5 Linus Torvalds
  0 siblings, 1 reply; 94+ messages in thread
From: Andrea Arcangeli @ 2001-09-09  1:09 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List, Alexander Viro

On Sat, Sep 08, 2001 at 11:01:59AM -0700, Linus Torvalds wrote:
> 
> On Sat, 8 Sep 2001, Andrea Arcangeli wrote:
> >
> > First of all I just __block_fsync + truncate_inode_pages(inode->i_mapping, 0) so
> > all pagecache updates are committed to disk after that, so the latest uptodate
> > data is on disk and nothing uptodate is in memory.
> 
> Hmm. And if two openers have the device open at the same time? One dirties

of course I described what happens under the bdev semaphore at the very
latest release, so there is no "two" opener case here. If some reference
of the file is still open I don't even attempt to sync anything. (if the
user didn't ask for O_SYNC of course, in such a case the
generic_file_write would take care of it)

> data after the first one has done __block_fsync? And the truncate will
> throw the dirtied page away?

There can't be any truncate because the blkdev isn't a regular file.

> Now, I don't think we need to be _too_ careful about cache coherency: it's
> probably ok to do something like
> 
> 	__block_fsync(dev) - sync _our_ changes
> 	invalidate_inode_pages(dev) - this will only invalidate unused pages
> 	invalidate_device(dev)

that's definitely not enough, see the other issue mentioned by Andreas
in this thread, the reason I wrote the algorithm I explained in the
previous email was, as a first thing, to avoid endlessly long fscks
of the root fs.

Andrea


* Re: linux-2.4.10-pre5
  2001-09-09  1:09         ` linux-2.4.10-pre5 Andrea Arcangeli
@ 2001-09-09  1:20           ` Linus Torvalds
  2001-09-09  1:38             ` linux-2.4.10-pre5 Andrea Arcangeli
  0 siblings, 1 reply; 94+ messages in thread
From: Linus Torvalds @ 2001-09-09  1:20 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Kernel Mailing List, Alexander Viro



On Sun, 9 Sep 2001, Andrea Arcangeli wrote:
> >
> > Hmm. And if two openers have the device open at the same time? One dirties
>
> of course I described what happens under the bdev semaphore at the very
> latest release, so there is no "two" opener case here. If some reference
> of the file is still open I don't even attempt to sync anything. (if the
> user didn't ask for O_SYNC of course, in such a case the
> generic_file_write would take care of it)

That's also a bug.

So imagine that there is another user keeping the bdev active. Implying
that you never even try to sync it at all. That sounds like a bad thing to
do.

> > data after the first one has done __block_fsync? And the truncate will
> > throw the dirtied page away?
>
> There can't be any truncate because the blkdev isn't a regular file.

You said you used truncate_inode_pages(), which _does_ throw the pages
away.  Dirty or not.

What I'm saying is that you at _every_ close (sane semantics for block
devices really do expect the writes to be flushed by close time - how
would they otherwise ever be flushed reliably?) do something like

	fsync(inode);
	invalidate_inode_pages(inode);
	invalidate_device(inode->b_dev);

and be done with it. That syncs the pages that we've dirtied, and it
invalidates all pages that aren't pinned some way. Which is exactly what
you want.

> that's definitely not enough, see the other issue mentioned by Andreas
> in this thread, the reason I wrote the algorithm I explained in the
> previous email was, as a first thing, to avoid endlessly long fscks
> of the root fs.

Ehh? Why? The above writes back exactly the same thing that our current
block filesystem writes back. While "invalidate_device()" also throws away
all buffers that aren't pinned.

And the superblock isn't in the buffer cache - it's cached separately, so
invalidate_device() will throw away the buffer associated with it - to be
re-read and re-written by the rw remount.

Will it be different than the current behaviour wrt some other metadata?
Yes. So you could make invalidate_device() stronger, trying to re-read
buffers that aren't dirty. But that doesn't mean that you should act
differently on FS mounted vs not-mounted vs some-other-user.

		Linus



* Re: linux-2.4.10-pre5
  2001-09-09  1:20           ` linux-2.4.10-pre5 Linus Torvalds
@ 2001-09-09  1:38             ` Andrea Arcangeli
  2001-09-09  1:53               ` linux-2.4.10-pre5 Linus Torvalds
  0 siblings, 1 reply; 94+ messages in thread
From: Andrea Arcangeli @ 2001-09-09  1:38 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List, Alexander Viro

On Sat, Sep 08, 2001 at 06:20:51PM -0700, Linus Torvalds wrote:
> 
> 
> On Sun, 9 Sep 2001, Andrea Arcangeli wrote:
> > >
> > > Hmm. And if two openers have the device open at the same time? One dirties
> >
> > of course I described what happens under the bdev semaphore at the very
> > latest release, so there is no "two" opener case here. If some reference
> > of the file is still open I don't even attempt to sync anything. (if the
> > user didn't ask for O_SYNC of course, in such a case the
> > generic_file_write would take care of it)
> 
> That's also a bug.

it's not.

> 
> So imagine that there is another user keeping the bdev active. Implying
> that you never even try to sync it at all. That sounds like a bad thing to
> do.

If you were right it would also be a bug when two users keep open
the same file at the same time on any filesystem. Tell me where we do
the fsync of the file when the first user closes it.

The whole point is that the two users obviously are sharing the same
cache so there's no issue there.

The only case we need to care specially is the coherency between
filesystems and blkdev users and that's what I do with the algorithm
explained earlier.

> > > data after the first one has done __block_fsync? And the truncate will
> > > throw the dirtied page away?
> >
> > There can't be any truncate because the blkdev isn't a regular file.
> 
> You said you used truncate_inode_pages(), which _does_ throw the pages
> away.  Dirty or not.

I said I do __block_fsync and then truncate_inode_pages. By the time I
drop the pagecache on the blkdev inode I have already committed all the
changes to disk, so all the uptodate data is on disk by then.

No other user can be messing under us because we're serialized by the
bdev semaphore.

> What I'm saying is that you at _every_ close (sane semantics for block
> devices really do expect the writes to be flushed by close time - how
> would they otherwise ever be flushed reliably?) do something like

We don't flush at every close even in the current buffercache-backed 2.4.
Even more obviously, when there's more than one fd open on the file,
->release isn't even called.

> 	fsync(inode);
> 	invalidate_inode_pages(inode);
> 	invalidate_device(inode->b_dev);
> 
> and be done with it. That syncs the pages that we've dirtied, and it
> invalidates all pages that aren't pinned some way. Which is exactly what
> you want.

That's not enough.

> > that's definitely not enough, see the other issue mentioned by Andreas
> > in this thread, the reason I wrote the algorithm I explained in the
> > previous email was, as a first thing, to avoid endlessly long fscks
> > of the root fs.
> 
> Ehh? Why? The above writes back exactly the same thing that our current
> block filesystem writes back. While "invalidate_device()" also throws away
> all buffers that aren't pinned.
> 
> And the superblock isn't in the buffer cache - it's cached separately, so

The superblock is cached in pinned buffer cache.

> invalidate_device() will throw away the buffer associated with it - to be
> re-read and re-written by the rw remount.
> 
> Will it be different than the current behaviour wrt some other metadata?
> Yes. So you could make invalidate_device() stronger, trying to re-read
> buffers that aren't dirty. But that doesn't mean that you should act
> differently on FS mounted vs not-mounted vs some-other-user.

Trying to resync a rw-mounted fs would corrupt the fs, for example; we
definitely want to allow reads on a rw-mounted fs, but we don't want to
corrupt the fs once we close, so we cannot treat the two cases the same
way.

Also the only case where we need to do the synchronization of the caches
is the blkdev_close, so I don't think it makes sense to change the other
calls that are currently used by code that doesn't need to synchronize.

Andrea


* Re: linux-2.4.10-pre5
  2001-09-09  1:38             ` linux-2.4.10-pre5 Andrea Arcangeli
@ 2001-09-09  1:53               ` Linus Torvalds
  2001-09-09  2:22                 ` linux-2.4.10-pre5 Andrea Arcangeli
  0 siblings, 1 reply; 94+ messages in thread
From: Linus Torvalds @ 2001-09-09  1:53 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Kernel Mailing List, Alexander Viro


On Sun, 9 Sep 2001, Andrea Arcangeli wrote:
> >
> > So imagine that there is another user keeping the bdev active. Implying
> > that you never even try to sync it at all. That sounds like a bad thing to
> > do.
>
> If you were right it would also be a bug when two users keep open
> the same file at the same time on any filesystem. Tell me where we do
> the fsync of the file when the first user closes it.

Look again - for regular files we _never_ fsync it. Not on the first
close, not on the last one.

And I'm telling you that device files are different. Things like fdisk and
fsck expect the data to be on the disk after they've closed the device.

They do not expect the data to be on disk only if they were the only
opener, or other special cases.

End of story.

> The whole point is that the two users obviously are sharing the same
> cache so there's no issue there.

There may not be any issue for _them_, but there may easily be an issue
between the user that does the first close, and the next
non-page-cache-user.

This is not about coherency between those two users. That coherency is
something we can take for granted if they both use the page cache, and as
such you shouldn't even bring that issue up - it's not interesting.

What _is_ interesting is the assumption of the disk state of _one_ of the
users, and its expectations about the disk contents etc being up-to-date
after a close(). Up-to-date enough that some other user that does NOT use
the page cache (ie a mount) should see the correct data.

And it sounds very much like your current approach completely misses this.

> The only case we need to care specially is the coherency between
> filesystems and blkdev users and that's what I do with the algorithm
> explained earlier.

Yeah, you explained that you had multiple different approaches depending
on whether it was mounted read-only or not etc etc. Which sounds really
silly to me.

I'm telling you that ONE approach should be the right one. And I'm further
telling you that that one approach is obviously a superset of your "many
different modes of behaviour depending on what is going on".

Which sounds a lot cleaner, and a lot simpler than your special case
heaven.

> We don't flush at every close even in the current buffercache-backed 2.4.

Because we don't need to. There are no coherency issues wrt metadata
(which is the only thing that fsck changes).

If fsck changed data contents, we'd better flush dirty buffers, believe
me.

> > 	fsync(inode);
> > 	invalidate_inode_pages(inode);
> > 	invalidate_device(inode->b_dev);
>
> That's not enough.

Tell me why. Yes, you need the extension that I pointed out:

> > Yes. So you could make invalidate_device() stronger, trying to re-read
> > buffers that aren't dirty. But that doesn't mean that you should act
> > differently on FS mounted vs not-mounted vs some-other-user.

		Linus



* Re: linux-2.4.10-pre5
  2001-09-09  1:53               ` linux-2.4.10-pre5 Linus Torvalds
@ 2001-09-09  2:22                 ` Andrea Arcangeli
  2001-09-09  2:31                   ` linux-2.4.10-pre5 Linus Torvalds
  0 siblings, 1 reply; 94+ messages in thread
From: Andrea Arcangeli @ 2001-09-09  2:22 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List, Alexander Viro

On Sat, Sep 08, 2001 at 06:53:39PM -0700, Linus Torvalds wrote:
> 
> On Sun, 9 Sep 2001, Andrea Arcangeli wrote:
> > >
> > > So imagine that there is another user keeping the bdev active. Implying
> > > that you never even try to sync it at all. That sounds like a bad thing to
> > > do.
> >
> > If you were right it would also be a bug when two users keep open
> > the same file at the same time on any filesystem. Tell me where we do
> > the fsync of the file when the first user closes it.
> 
> Look again - for regular files we _never_ fsync it. Not on the first
> close, not on the last one.

of course.

> 
> And I'm telling you that device files are different. Things like fdisk and
> fsck expect the data to be on the disk after they've closed the device.

Yes, provided they were the only openers. Nitpicking: fsck holds two
references on the device (breakpoint at blkdev_open and you'll see), not
just one like a 'cat', and only the second close(2) will trigger the
fsync. All is fine here.

> They do not expect the data to be on disk only if they were the only
							     ^not
> opener, or other special cases.
> 
> End of story.
> 
> > The whole point is that the two users obviously are sharing the same
> > cache so there's no issue there.
> 
> There may not be any issue for _them_, but there may easily be an issue
> between the user that does the first close, and the next
> non-page-cache-user.

The only non page cache user can be the fs.

If an fs is mounted rw while any pagecache user is writing we cannot do
anything about that.

> This is not about coherency between those two users. That coherency is
> something we can take for granted if they both use the page cache, and as
> such you shouldn't even bring that issue up - it's not interesting.

Exactly.

> What _is_ interesting is the assumption of the disk state of _one_ of the
> users, and its expectations about the disk contents etc being up-to-date
> after a close(). Up-to-date enough that some other user that does NOT use
> the page cache (ie a mount) should see the correct data.
> 
> And it sounds very much like your current approach completely misses this.
>
> > The only case we need to care specially is the coherency between
> > filesystems and blkdev users and that's what I do with the algorithm
> > explained earlier.
> 
> Yeah, you explained that you had multiple different approaches depending
> on whether it was mounted read-only or not etc etc. Which sounds really
> silly to me.

It is not silly: if we didn't need to synchronize the caches it would
not be necessary, but we must synchronize the caches; see Andreas's
post. And we can safely synchronize the caches only if we have a ro-mounted
fs under us while we hold the super lock.

> I'm telling you that ONE approach should be the right one. And I'm further
> telling you that that one approach is obviously a superset of your "many
> different modes of behaviour depending on what is going on".
> 
> Which sounds a lot cleaner, and a lot simpler than your special case
> heaven.

Simpler, and broken: without my special-case heaven it would either corrupt
the fs at the first hdparm or alternatively fsck the root forever at boot.

> 
> > We don't flush at every close even in the current buffercache-backed 2.4.
> 
> Because we don't need to. There are no coherency issues wrt metadata
> (which is the only thing that fsck changes).

and so I'm not flushing at every close with the blkdev in pagecache
either. You're the one who said I should flush at every close.

> 
> If fsck changed data contents, we'd better flush dirty buffers, believe
> me.

After an fsck I flush, provided fsck is the only opener of the blkdev.
Fsck will flush and throw away the pagecache, then all the uptodate data
will be on disk. After that, _only_ if there's a ro fs under us, I bring
the pinned metadata in the buffer cache up to date to make sure the fs won't
clobber the changes made by fsck.

> 
> > > 	fsync(inode);
> > > 	invalidate_inode_pages(inode);
> > > 	invalidate_device(inode->b_dev);
> >
> > That's not enough.
> 
> Tell me why. Yes, you need the extension that I pointed out:
> 
> > > Yes. So you could make invalidate_device() stronger, trying to re-read
> > > buffers that aren't dirty. But that doesn't mean that you should act
> > > differently on FS mounted vs not-mounted vs some-other-user.


	last blkdev close()
	-	write pagecache to disk
	-	drop pagecache
	-	now all new data is on disk
	-	all the above and the below is done atomically with the
	 	bdev semaphore (no new openers or anything can play
		with the pagecache under us, only thing that could
		play under us on the disk is the fs if mounted rw)

	-	here _if_ we have an fs under us, it still has obsolete
	 	data in pinned buffer cache we need to fix it

	-	so start reading from disk the new data and writing it
		to the pinned buffer cache so the fs won't clobber our
		changes later and will run with the uptodate data, of
		course we can do this _only_ on a ro mounted fs, or we
		could corrupt the fs if the fs were mounted rw, but
		we don't care about the rw mounted fs case, if the
		fs was mounted rw and somebody wrote to the fs via
		blkdev we cannot do anything to save such box.

As far as I can tell I'm completely backwards compatible and I'm covering
all the possible cases correctly.

Andrea


* Re: linux-2.4.10-pre5
  2001-09-09  2:22                 ` linux-2.4.10-pre5 Andrea Arcangeli
@ 2001-09-09  2:31                   ` Linus Torvalds
  2001-09-09  3:30                     ` linux-2.4.10-pre5 Andrea Arcangeli
  0 siblings, 1 reply; 94+ messages in thread
From: Linus Torvalds @ 2001-09-09  2:31 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Kernel Mailing List, Alexander Viro


On Sun, 9 Sep 2001, Andrea Arcangeli wrote:
>
> 	last blkdev close()

I repeat: why last? What if there's a read-only user that has the thing
open for some reason, like gathering statistics every once in a while with
ioctl's or whatever?

That "last" is a bug.

> 	-	write pagecache to disk
> 	-	drop pagecache
> 	-	now all new data is on disk
> 	-	all the above and the below is done atomically with the
> 	 	bdev semaphore (no new openers or anything can play
> 		with the pagecache under us, only thing that could
> 		play under us on the disk is the fs if mounted rw)
>
> 	-	here _if_ we have an fs under us, it still has obsolete
> 	 	data in pinned buffer cache we need to fix it

Agreed. But you should NOT make that a special case.

I'm saying that you can, and should, just _unconditionally_ call

	invalidate_device()

which in turn will walk the buffer cache for that device, and try to throw
it away.

With the simple (again unconditional) addition of: If it cannot throw it
away because it is pinned through bh->b_count, it knows somebody is using
the buffer cache (obviously), so regardless of _what_ kind of user it is,
be it a direct-io one or a magic kernel module or whatever, it does

	lock_buffer(bh);
	if (!buffer_dirty(bh))
		submit_bh(bh, READ);
	else {
		/* we just have to assume that the aliasing is ok */
		unlock_buffer(bh);
	}

UNCONDITIONALLY.

Without caring about things like "is there a filesystem mounted". No
silly rules. The _only_ rule becomes: "try to make the buffer cache as
up-to-date as possible".

See? I'm saying that your approach tries to make decisions that it just
should not make. It shouldn't care or know about things like "a filesystem
has this device mounted".  It should do _one_ thing, and one thing only:
"somebody has written to the device through the page cache, let's try to
invalidate or re-validate as much of the buffer cache as humanly
possible".

Special-case code is bad. Always doing the same thing is good.

		Linus



* Re: linux-2.4.10-pre5
  2001-09-09  2:31                   ` linux-2.4.10-pre5 Linus Torvalds
@ 2001-09-09  3:30                     ` Andrea Arcangeli
  2001-09-09  3:58                       ` linux-2.4.10-pre5 Linus Torvalds
  0 siblings, 1 reply; 94+ messages in thread
From: Andrea Arcangeli @ 2001-09-09  3:30 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List, Alexander Viro

On Sat, Sep 08, 2001 at 07:31:57PM -0700, Linus Torvalds wrote:
> 
> On Sun, 9 Sep 2001, Andrea Arcangeli wrote:
> >
> > 	last blkdev close()
> 
> I repeat: why last? What if there's a read-only user that has the thing
> open for some reason, like gathering statistics every once in a while with
> ioctl's or whatever?
> 
> That "last" is a bug.

The only case where we must synchronize is for tune2fs, fsck, etc.;
there's no ioctl-gathering issue running under those. The ioctl
gathering will be started after fsck has returned, and then it doesn't matter
if the ioctl user stays there forever.

In fact it would be particularly bad to synchronize at every close for
the real users of the blkdev, which really just want to work on a file
as transparently and as fast as possible.

However, in theory we could do the fsync + synchronize at every ->release, but I
doubt we really need that. In any case it wasn't a "bug"; it is
very intentional behaviour.

> > 	-	write pagecache to disk
> > 	-	drop pagecache
> > 	-	now all new data is on disk
> > 	-	all the above and the below is done atomically with the
> > 	 	bdev semaphore (no new openers or anything can play
> > 		with the pagecache under us, only thing that could
> > 		play under us on the disk is the fs if mounted rw)
> >
> > 	-	here _if_ we have an fs under us, it still has obsolete
> > 	 	data in pinned buffer cache we need to fix it
> 
> Agreed. But you should NOT make that a special case.
> 
> I'm saying that you can, and should, just _unconditionally_ call
> 
> 	invalidate_device()
> 
> which in turn will walk the buffer cache for that device, and try to throw
> it away.
> 
> With the simple (again unconditional) addition of: If it cannot throw it
> away because it is pinned through bh->b_count, it knows somebody is using
> the buffer cache (obviously), so regardless of _what_ kind of user it is,
> be it a direct-io one or a magic kernel module or whatever, it does
> 
> 	lock_buffer(bh);
> 	if (!buffer_dirty(bh))
> 		submit_bh(bh, READ);
> 	else {
> 		/* we just have to assume that the aliasing is ok */
> 		unlock_buffer(bh);
> 	}
> 
> UNCONDITIONALLY.

This will corrupt the fs at the first hdparm -t on a rw-mounted disk
_unless_ the fs always does lock_buffer(superblock_bh) before modifying any
metadata _and_ marks the buffer dirty before the unlock_super, which is
not the case. We could possibly rely solely on the lock_super around the
update_buffers(), but that would be too weak as well, since the fs is allowed
not to take lock_super around all its metadata updates, while it is
well defined where we set/reset MS_RDONLY: that always happens with the
super lock acquired. So I rely only on MS_RDONLY plus the super lock being
acquired, which is certainly a safe approach (and I think the only one)
that is completely backwards compatible.

> Without caring about things like "is there a filesystem mounted". No
> silly rules. The _only_ rule becomes: "try to make the buffer cache as
> up-to-date as possible".
> 
> See? I'm saying that your approach tries to make decisions that it just
> should not make. It shouldn't care or know about things like "a filesystem

I wish I didn't need to make those decisions; in fact if you pick the first
patches you'll see they were not making those decisions, and the
early patches were less safe than the latest ones (though the window for
the race was small).

I wish the cache coherency logic were simpler, but just doing
something unconditionally is going to break things in one way or
another as far as I can tell.

Andrea


* Re: linux-2.4.10-pre5
  2001-09-09  3:30                     ` linux-2.4.10-pre5 Andrea Arcangeli
@ 2001-09-09  3:58                       ` Linus Torvalds
  2001-09-09  4:16                         ` linux-2.4.10-pre5 Andrea Arcangeli
  2001-09-09  4:29                         ` linux-2.4.10-pre5 Andreas Dilger
  0 siblings, 2 replies; 94+ messages in thread
From: Linus Torvalds @ 2001-09-09  3:58 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Kernel Mailing List, Alexander Viro


On Sun, 9 Sep 2001, Andrea Arcangeli wrote:
>
> I wish the cache coherency logic were simpler, but just doing
> something unconditionally is going to break things in one way or
> another as far as I can tell.

I'd rather fix that, then.

Otherwise we'll just end up carrying broken baggage around forever. Which
is not the way to do things.

Anyway, at this point this definitely sounds like a 2.5.x patch. Which I
always pretty much assumed it would be anyway.

		Linus



* Re: linux-2.4.10-pre5
  2001-09-09  3:58                       ` linux-2.4.10-pre5 Linus Torvalds
@ 2001-09-09  4:16                         ` Andrea Arcangeli
  2001-09-09  4:28                           ` linux-2.4.10-pre5 Linus Torvalds
  2001-09-09  4:29                         ` linux-2.4.10-pre5 Andreas Dilger
  1 sibling, 1 reply; 94+ messages in thread
From: Andrea Arcangeli @ 2001-09-09  4:16 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List, Alexander Viro

On Sat, Sep 08, 2001 at 08:58:26PM -0700, Linus Torvalds wrote:
> I'd rather fix that, then.

we'd just need to define a new kind of communication API between a ro
mounted fs and the blkdev layer to avoid the special cases. I
intentionally didn't change the API and I didn't break the rules, so I
could use the code in 2.4 with all the filesystems transparently (including
the ones not in mainline, of course).

> Otherwise we'll just end up carrying broken baggage around forever. Which
> is not the way to do things.

It's not broken, nor even very complex; it could be made even cleaner, but such
a cleanup would involve all the filesystems out there and wasn't
actually high on my priority list (and certainly not something I would like
to do in 2.4 either).

It's like having numbers to name devices when everybody (apps included)
only knows the names, not the numbers: the API totally sucks, obviously,
but that doesn't affect the core code that you benchmark etc. at all,
and it doesn't affect how you read or write to the device. This
is why nobody has been forced to clean it up yet: the pain was more
than the gain (and in that case the pain was even bigger of course, because
the change would be visible to userspace rather than only breaking all the
fses).

Andrea


* Re: linux-2.4.10-pre5
  2001-09-09  4:16                         ` linux-2.4.10-pre5 Andrea Arcangeli
@ 2001-09-09  4:28                           ` Linus Torvalds
  2001-09-09 12:09                             ` linux-2.4.10-pre5 Rik van Riel
                                               ` (2 more replies)
  0 siblings, 3 replies; 94+ messages in thread
From: Linus Torvalds @ 2001-09-09  4:28 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Kernel Mailing List, Alexander Viro


On Sun, 9 Sep 2001, Andrea Arcangeli wrote:

> On Sat, Sep 08, 2001 at 08:58:26PM -0700, Linus Torvalds wrote:
> > I'd rather fix that, then.
>
> we'd just need to define a new kind of communication API between a ro
> mounted fs and the blkdev layer to avoid the special cases.

No, notice how the simple code _should_ work for ro-mounted filesystems
as-is. AND it should work for read-only opens of disks with rw-mounted
filesystems. The only case it doesn't like is the "rw-open of a device
that is rw-mounted".

It's only filesystems that have modified buffers without marking them
dirty (by virtue of having pointers to buffers and delaying the dirtying
until later) that are broken by the "try to make sure all buffers are
up-to-date by reading them in" approach.

And as I don't think the concurrent "rw-open + rw-mount" case makes any
sense, I think the sane solution is simply to disallow it completely.

It shouldn't break any sane uses, and which is really the only way to
guarantee that you cannot have virtual and physical dirty pages/buffers to
the same device at the same time.

So if we simply disallow that case, then we do not have any problem:
either the device was opened read-only (in which case it obviously doesn't
need to flush anything to disk, or invalidate/revalidate existing disk
buffers), or the device was opened with write permissions, in which case
there mustn't be any non-ro mounts to the same device.

Notice? Sane, and doesn't really imply any need to introduce any new
communication between the filesystem and the block layer (only a very
thin channel between "mount" and the block layer). Wouldn't you say?
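
As an illustration of how small that "thin channel" could be, a hypothetical
open-time check along these lines -- the function name, the
get_super()/drop_super() pairing and the error code are assumptions, not
code from the thread:

/*
 * Hypothetical sketch of the "refuse rw-open of a rw-mounted device"
 * policy, as a check at block device open time.
 */
static int check_rw_open_vs_mount(kdev_t dev, int fmode)
{
	struct super_block *sb;
	int err = 0;

	if (!(fmode & FMODE_WRITE))
		return 0;		/* read-only opens are always allowed */

	sb = get_super(dev);		/* filesystem mounted on this device? */
	if (sb) {
		if (!(sb->s_flags & MS_RDONLY))
			err = -EBUSY;	/* rw-open of a rw-mounted device: refuse */
		drop_super(sb);
	}
	return err;
}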

[ start future rambling ]

Now, I actually believe that we would eventually be better off by having a
real filesystem interface for doing kernel-assisted fsck, backup and
de-fragmentation (all just three different sides of the same thing:
filesystem administration), but that's a separate issue, and right now we
have absolutely no clue about what such an interface would look like. So
that's a long-term thing (other OS's have such interfaces, but they tend
to be specialized for only a subset of possible filesystems. Think
Windows defrag).

That kind of explicit filesystem maintenance interface would make all the
remaining issues go away completely, and would at least in theory allow
on-line fixing of already mounted filesystems. Assuming a good interface
and algorithm for it exists, of course.

[ end future rambling ]

		Linus



* Re: linux-2.4.10-pre5
  2001-09-09  3:58                       ` linux-2.4.10-pre5 Linus Torvalds
  2001-09-09  4:16                         ` linux-2.4.10-pre5 Andrea Arcangeli
@ 2001-09-09  4:29                         ` Andreas Dilger
  2001-09-09  4:54                           ` linux-2.4.10-pre5 Linus Torvalds
                                             ` (2 more replies)
  1 sibling, 3 replies; 94+ messages in thread
From: Andreas Dilger @ 2001-09-09  4:29 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andrea Arcangeli, Kernel Mailing List

On Sep 08, 2001  20:58 -0700, Linus Torvalds wrote:
> On Sun, 9 Sep 2001, Andrea Arcangeli wrote:
> > I wish the cache coherency logic were simpler, but just doing
> > something unconditionally is going to break things in one way or
> > another as far as I can tell.
> 
> I'd rather fix that, then.
> 
> Otherwise we'll just end up carrying broken baggage around forever. Which
> is not the way to do things.
> 
> Anyway, at this point this definitely sounds like a 2.5.x patch. Which I
> always pretty much assumed it would be anyway.

So basically - when we move block devices to the page cache, get rid of
buffer cache usage in the filesystems as well?  Ext2 is nearly there at
least.  One alternative is as Daniel Phillips did in the indexed-ext2-
directory patch, where he kept the "bread" interface, but backed it
with the page cache, so it required relatively little change to the
filesystem.

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert



* Re: linux-2.4.10-pre5
  2001-09-09  4:29                         ` linux-2.4.10-pre5 Andreas Dilger
@ 2001-09-09  4:54                           ` Linus Torvalds
  2001-09-09  6:17                             ` linux-2.4.10-pre5 Andreas Dilger
  2001-09-09  9:05                           ` linux-2.4.10-pre5 Christoph Hellwig
  2001-09-09 13:14                           ` linux-2.4.10-pre5 Anton Altaparmakov
  2 siblings, 1 reply; 94+ messages in thread
From: Linus Torvalds @ 2001-09-09  4:54 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Andrea Arcangeli, Kernel Mailing List


On Sat, 8 Sep 2001, Andreas Dilger wrote:
>
> So basically - when we move block devices to the page cache, get rid of
> buffer cache usage in the filesystems as well?  Ext2 is nearly there at
> least.  One alternative is as Daniel Phillips did in the indexed-ext2-
> directory patch, where he kept the "bread" interface, but backed it
> with the page cache, so it required relatively little change to the
> filesystem.

This might be a really easy solution. We might just make sure that the
buffer manipulation interfaces we export to filesystems (and there aren't
actually all that many of them - it's mainly bread and getblk) always end
up using the page cache, and just return the buffer head that is embedded
inside the page cache.

That way we don't have any new aliasing issues _at_all_. The user-mode
accesses to the block devices would always end up using the same buffers
that the low-level filesystem does.

Hmm.. That actually would have another major advantage too: the whole
notion of an "anonymous buffer page" would just disappear. Because there
would be no interfaces to even create them - buffer pages would always be
associated with a mapping.

Andrea(s) - interested in pursuing this particular approach? In fact,
since "bread()" uses "getblk()", it is almost sufficient to just make
getblk()  use the page cache, and the rest will follow... You can even get
rid of the buffer hash etc, and make the buffer head noticeably smaller.

[ Yeah, I'm being a bit optimistic - you also end up having to re-write
  "get_hash_table()" to use a page cache lookup etc. So it's definitely
  some major surgery in fs/buffer.c, but "major" might actually be just a
  couple of hundred lines ]
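
For illustration only, a rough sketch of what a page-cache-backed getblk()
could look like (blkdev_mapping() is an assumed helper returning the block
device's address_space, create_empty_buffers() is the existing static helper
in fs/buffer.c which would need reusing/exporting, and locking, blocksize
changes and error handling are all hand-waved):

#include <linux/fs.h>
#include <linux/pagemap.h>

/* Sketch, not a patch: find or create the page cache page that covers
 * "block" on the block device, attach buffer heads to it if needed, and
 * return the embedded buffer head for that block. */
struct buffer_head *getblk_pagecache(kdev_t dev, int block, int size)
{
	struct address_space *mapping = blkdev_mapping(dev);	/* assumed */
	int per_page = PAGE_CACHE_SIZE / size;
	struct page *page;
	struct buffer_head *bh;
	int i;

	page = grab_cache_page(mapping, block / per_page);	/* returns it locked */
	if (!page)
		return NULL;
	if (!page->buffers)
		create_empty_buffers(page, dev, size);

	/* walk the per-page buffer ring to the buffer we want */
	bh = page->buffers;
	for (i = block % per_page; i > 0; i--)
		bh = bh->b_this_page;
	atomic_inc(&bh->b_count);

	UnlockPage(page);
	page_cache_release(page);	/* the elevated b_count keeps the buffers pinned */
	return bh;
}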

The good news here is that once it works (and you've destroyed your
filesystem about fifty times debugging it :), it's pretty much guaranteed
not to introduce any new and "interesting" interactions between
filesystems and user-level programs accessing the device.

And no filesystem should ever notice. They can still access the buffer
head as if it was just a buffer head, and wouldn't care about the fact
that it happens to be part of a mapping.

Any pitfalls?

[ I can see at least one already: __invalidate_buffers() and
  set_blocksize() would both have to be re-done, probably along the lines
  of "invalidate_inode_pages()" and "fsync+truncate_inode_pages()"
  respectively. ]

Comments?

		Linus


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-09  4:54                           ` linux-2.4.10-pre5 Linus Torvalds
@ 2001-09-09  6:17                             ` Andreas Dilger
  2001-09-09 17:31                               ` linux-2.4.10-pre5 Linus Torvalds
  2001-09-09 19:51                               ` linux-2.4.10-pre5 Daniel Phillips
  0 siblings, 2 replies; 94+ messages in thread
From: Andreas Dilger @ 2001-09-09  6:17 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andrea Arcangeli, Kernel Mailing List, Daniel Phillips

On Sep 08, 2001  21:54 -0700, Linus Torvalds wrote:
> On Sat, 8 Sep 2001, Andreas Dilger wrote:
> > So basically - when we move block devices to the page cache, get rid of
> > buffer cache usage in the filesystems as well?  Ext2 is nearly there at
> > least.  One alternative is as Daniel Phillips did in the indexed-ext2-
> > directory patch, where he kept the "bread" interface, but backed it
> > with the page cache, so it required relatively little change to the
> > filesystem.
> 
> This might be a really easy solution. We might just make sure that the
> buffer manipulation interfaces we export to filesystems (and there aren't
> actually all that many of them - it's mainly bread and getblk) always end
> up using the page cache, and just return the buffer head that is embedded
> inside the page cache.
> 
> That way we don't have any new aliasing issues _at_all_. The user-mode
> accesses to the block devices would always end up using the same buffers
> that the low-level filesystem does.

> Andrea(s) - interested in pursuing this particular approach? In fact,
> since "bread()" uses "getblk()", it is almost sufficient to just make
> getblk()  use the page cache, and the rest will follow... You can even get
> rid of the buffer hash etc, and make the buffer head noticeably smaller.
> 
> [ Yeah, I'm being a bit optimistic - you also end up having to re-write
>   "get_hash_table()" to use a page cache lookup etc. So it's definitely
>   some major surgery in fs/buffer.c, but "major" might actually be just a
>   couple of hundred lines ]

Well, Daniel probably has the best handle on the state of this code (it
may be that he has already done 90% of the work).  I've CC'd him on this
to get him in the loop.

> And no filesystem should ever notice. They can still access the buffer
> head as if it was just a buffer head, and wouldn't care about the fact
> that it happens to be part of a mapping.
> 
> Any pitfalls?
> 
> [ I can see at least one already: __invalidate_buffers() and
>   set_blocksize() would both have to be re-done, probably along the lines
>   of "invalidate_inode_pages()" and "fsync+truncate_inode_pages()"
>   respectively. ]
> 
> Comments?

I think this fits in with your overall strategy as well - remove the buffer
as a "cache" object, and only use it as an I/O object, right?  With this
change, all of the cache functionality is in the page cache, and the buffers
are only used as handles for I/O.

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-09  4:29                         ` linux-2.4.10-pre5 Andreas Dilger
  2001-09-09  4:54                           ` linux-2.4.10-pre5 Linus Torvalds
@ 2001-09-09  9:05                           ` Christoph Hellwig
  2001-09-09 13:14                           ` linux-2.4.10-pre5 Anton Altaparmakov
  2 siblings, 0 replies; 94+ messages in thread
From: Christoph Hellwig @ 2001-09-09  9:05 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: torvalds, andrea, linux-kernel

In article <20010908222923.H32553@turbolinux.com> you wrote:
> On Sep 08, 2001  20:58 -0700, Linus Torvalds wrote:
>> On Sun, 9 Sep 2001, Andrea Arcangeli wrote:
>> > I wish the cache coherency logic would be simpler but just doing
>> > something unconditionally it's going to break things in one way or
>> > another as far I can tell.
>> 
>> I'd rather fix that, then.
>> 
>> Otherwise we'll just end up carrying broken baggage around forever. Which
>> is not the way to do things.
>> 
>> Anyway, at this point this definitely sounds like a 2.5.x patch. Which I
>> always pretty much assumed it would be anyway.

> So basically - when we move block devices to the page cache, get rid of
> buffer cache usage in the filesystems as well?  Ext2 is nearly there at
> least.

IBM's Linux port of JFS2 already does not use the buffer cache at all.
It has a special struct address_space (and inode, due to the braindamaged
assumption that ->host must be an inode, introduced in 2.4.0-test) that
covers the whole filesystem.


> One alternative is as Daniel Phillips did in the indexed-ext2-
> directory patch, where he kept the "bread" interface, but backed it
> with the page cache, so it required relatively little change to the
> filesystem.

I'd rather prefer to use a different structure for this kind of accesses,
so that we can get rid of struct buffer_head altogether (especially
with Jens' bio rewrite that nukes it out of the lowlevel drivers.)

An example of such an interface is the fbuf mechanism used for directories
in SVR4/SVR5.  A header file that should explain it is attached.

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.


/*
 * Copyright (c) 2001 Caldera International, Inc.. All Rights Reserved.  
 *                                                                       
 *        THIS IS UNPUBLISHED PROPRIETARY SOURCE CODE OF                 
 *                  CALDERA INTERNATIONAL, INC.                          
 *                                                                       
 *   The copyright notice above does not evidence any actual or intended 
 *   publication of such source code.                                    
 */

#ifndef _FS_FBUF_H	/* wrapper symbol for kernel use */
#define _FS_FBUF_H	/* subject to change without notice */

#ident	"@(#)unixsrc:usr/src/common/uts/fs/fbuf.h /main/uw7_nj/1"
#ident	"$Header: $"

#if defined(__cplusplus)
extern "C" {
#endif

#ifdef _KERNEL_HEADERS

#include <mem/seg.h> /* REQUIRED */
#include <util/types.h>	/* REQUIRED */

#elif defined(_KERNEL) || defined(_KMEMUSER)

#include <vm/seg.h> /* REQUIRED */
#include <sys/types.h>	/* REQUIRED */

#endif /* _KERNEL_HEADERS */

#if defined(_KERNEL) || defined(_KMEMUSER)

/*
 * A struct fbuf is used to get a mapping to part of a file using the
 * segkmap facilities.  After you get a mapping, you can fbrelse() it
 * (giving a seg code to pass back to segmap_release), you can fbwrite()
 * it (causes a synchronous write back using the file mapping information),
 * or you can fbiwrite it (causing indirect synchronous write back to
 * the block number given without using the file mapping information).
 */

typedef struct fbuf {
	char	*fb_addr;
	size_t	 fb_count;
} fbuf_t;

#endif /* _KERNEL || _KMEMUSER */

#ifdef _KERNEL

struct vnode;

#ifdef _FSKI
extern int fbread(struct vnode *vp, off_t off, size_t len, enum seg_rw rw,
		  fbuf_t **fbpp);
#else
extern int fbread(struct vnode *vp, off64_t off, size_t len, enum seg_rw rw,
		  fbuf_t **fbpp);
#endif

extern int fbrelse(fbuf_t *fbp, uint_t sm_flags);

#endif /* _KERNEL */

#if defined(__cplusplus)
	}
#endif

#endif /* _FS_FBUF_H */
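
For the sake of illustration, a hypothetical caller of the interface above
might look like this (S_READ from the SVR4 seg_rw enum and the zero segmap
flags passed to fbrelse() are guesses at the conventions, not taken from
real code):

/* Hypothetical SVR4-style usage: map a chunk of a directory, scan it,
 * then drop the mapping again.  This is not Linux code. */
static int scan_directory_chunk(struct vnode *vp, off64_t off, size_t len)
{
	fbuf_t *fbp;
	int err;

	err = fbread(vp, off, len, S_READ, &fbp);	/* map the file data */
	if (err)
		return err;

	/* ... parse directory entries in fbp->fb_addr[0 .. fb_count) ... */

	return fbrelse(fbp, 0);				/* segmap flags elided */
}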

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-09  4:28                           ` linux-2.4.10-pre5 Linus Torvalds
@ 2001-09-09 12:09                             ` Rik van Riel
  2001-09-09 14:53                               ` linux-2.4.10-pre5 Andrea Arcangeli
  2001-09-09 18:17                               ` linux-2.4.10-pre5 Linus Torvalds
  2001-09-09 14:47                             ` linux-2.4.10-pre5 Andrea Arcangeli
  2001-09-09 23:56                             ` linux-2.4.10-pre5 Daniel Phillips
  2 siblings, 2 replies; 94+ messages in thread
From: Rik van Riel @ 2001-09-09 12:09 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andrea Arcangeli, Kernel Mailing List, Alexander Viro

On Sat, 8 Sep 2001, Linus Torvalds wrote:

> It's only filesystems that have modified buffers without marking them
> dirty (by virtue of having pointers to buffers and delaying the dirtying
> until later) that are broken by the "try to make sure all buffers are
> up-to-date by reading them in" approach.

Think of the inode and dentry caches.  I guess we need
some way to invalidate those.

Rik
--
IA64: a worthy successor to the i860.

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-09  4:29                         ` linux-2.4.10-pre5 Andreas Dilger
  2001-09-09  4:54                           ` linux-2.4.10-pre5 Linus Torvalds
  2001-09-09  9:05                           ` linux-2.4.10-pre5 Christoph Hellwig
@ 2001-09-09 13:14                           ` Anton Altaparmakov
  2001-09-09 14:31                             ` linux-2.4.10-pre5 Andrea Arcangeli
  2 siblings, 1 reply; 94+ messages in thread
From: Anton Altaparmakov @ 2001-09-09 13:14 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andreas Dilger, Andrea Arcangeli, Kernel Mailing List

At 05:54 09/09/2001, Linus Torvalds wrote:
>On Sat, 8 Sep 2001, Andreas Dilger wrote:
> > So basically - when we move block devices to the page cache, get rid of
> > buffer cache usage in the filesystems as well?  Ext2 is nearly there at
> > least.  One alternative is as Daniel Phillips did in the indexed-ext2-
> > directory patch, where he kept the "bread" interface, but backed it
> > with the page cache, so it required relatively little change to the
> > filesystem.
>
>This might be a really easy solution. We might just make sure that the
>buffer manipulation interfaces we export to filesystems (and there aren't
>actually all that many of them - it's mainly bread and getblk) always end
>up using the page cache, and just return the buffer head that is embedded
>inside the page cache.
>
>That way we don't have any new aliasing issues _at_all_. The user-mode
>accesses to the block devices would always end up using the same buffers
>that the low-level filesystem does.
>
>Hmm.. That actually would have another major advantage too: the whole
>notion of an "anonymous buffer page" would just disappear. Because there
>would be no interfaces to even create them - buffer pages would always be
>associated with a mapping.
>
>Andrea(s) - interested in pursuing this particular approach? In fact,
>since "bread()" uses "getblk()", it is almost sufficient to just make
>getblk()  use the page cache, and the rest will follow... You can even get
>rid of the buffer hash etc, and make the buffer head noticeably smaller.
>
>[ Yeah, I'm being a bit optimistic - you also end up having to re-write
>   "get_hash_table()" to use a page cache lookup etc. So it's definitely
>   some major surgery in fs/buffer.c, but "major" might actually be just a
>   couple of hundred lines ]
>
>The good news here is that once it works (and you've destroyed your
>filesystem about fifty times debugging it :), it's pretty much guaranteed
>not to introduce any new and "interesting" interactions between
>filesystems and user-level programs accessing the device.
>
>And no filesystem should ever notice. They can still access the buffer
>head as if it was just a buffer head, and wouldn't care about the fact
>that it happens to be part of a mapping.
>
>Any pitfalls?

Not a pitfall, but a question: what happens with the get_block() callback 
passed to block_read_full_page() by the readpage() address space method of 
individual filesystems with respect to reading sparse files?

The problem I see is as follows: when reading a sparse region of a sparse 
file, the get_block() callback will return having just set bh->b_blocknr = 
-1UL but _not_ set BH_Mapped on the buffer. (That is what NTFS TNG does 
anyway; probably not ideal, but I wasn't sure what to do in this case, as 
bh->b_blocknr = 0UL is quite valid and returns the first data block of 
the $Boot system file...)

Will this continue to work as expected? Or is it fundamentally broken, and I 
shouldn't be using block_read_full_page() for NTFS at all but have my own 
replacement? - I only use block_read_full_page() for non-resident files 
already, as resident files are not stored in disk blocks, so the concept of 
a get_block() or using buffer heads for the final i/o is not applicable to 
them at all... - But I still need to have buffer heads without a disk 
mapping for sparse files. At least I would like to allocate the actual 
on-disk storage only when someone actually writes to a hole, not when 
someone reads from it.
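
(As a sketch of the pattern described above: lookup_physical_block() is an 
invented helper, and what makes the read path work is that 
block_read_full_page() zero-fills and marks uptodate any buffer its 
get_block callback leaves unmapped, so no i/o is ever issued for the hole.)

#include <linux/fs.h>
#include <linux/errno.h>

/* Illustrative get_block-style callback for a filesystem with sparse
 * files: map the buffer when backing store exists, otherwise leave it
 * unmapped and let block_read_full_page() zero-fill it. */
static int example_get_block(struct inode *inode, long iblock,
			     struct buffer_head *bh, int create)
{
	long phys = lookup_physical_block(inode, iblock);	/* assumed helper */

	if (phys >= 0) {
		bh->b_dev = inode->i_dev;
		bh->b_blocknr = phys;
		bh->b_state |= (1UL << BH_Mapped);
		return 0;
	}
	if (!create)
		return 0;	/* read from a hole: no BH_Mapped, no i/o */

	/* write into a hole: this is where on-disk space would be allocated */
	return -ENOSPC;		/* placeholder only */
}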

Best regards,

Anton


-- 
   "Nothing succeeds like success." - Alexandre Dumas
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS Maintainer / WWW: http://linux-ntfs.sf.net/
ICQ: 8561279 / WWW: http://www-stu.christs.cam.ac.uk/~aia21/


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-09 13:14                           ` linux-2.4.10-pre5 Anton Altaparmakov
@ 2001-09-09 14:31                             ` Andrea Arcangeli
  0 siblings, 0 replies; 94+ messages in thread
From: Andrea Arcangeli @ 2001-09-09 14:31 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: Linus Torvalds, Andreas Dilger, Kernel Mailing List

On Sun, Sep 09, 2001 at 02:14:54PM +0100, Anton Altaparmakov wrote:
> Not a pitfall, but a question: what happens with the get_block() callback 
> passed to block_read_full_page() by the readpage() address space method of 

Anything related to file data uses the "real" pagecache, which has
nothing to do with this getblk-backed-by-pagecache logic; this is only a
trick to share the RAM used for the physically indexed disk cache and
not to run into the aliasing issues. I cannot see anything obviously
wrong with this logic, so I believe it is worth a try to implement, but I
will do that after I finish other things. Also, since Daniel may have
just finished implementing it, he could just send me the buffer.c diff
that I can check and integrate into the blkdev-pagecache work in the
future.

Andrea

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-09  4:28                           ` linux-2.4.10-pre5 Linus Torvalds
  2001-09-09 12:09                             ` linux-2.4.10-pre5 Rik van Riel
@ 2001-09-09 14:47                             ` Andrea Arcangeli
  2001-09-09 16:24                               ` linux-2.4.10-pre5 Linus Torvalds
  2001-09-09 23:56                             ` linux-2.4.10-pre5 Daniel Phillips
  2 siblings, 1 reply; 94+ messages in thread
From: Andrea Arcangeli @ 2001-09-09 14:47 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List, Alexander Viro

On Sat, Sep 08, 2001 at 09:28:43PM -0700, Linus Torvalds wrote:
> 
> On Sun, 9 Sep 2001, Andrea Arcangeli wrote:
> 
> > On Sat, Sep 08, 2001 at 08:58:26PM -0700, Linus Torvalds wrote:
> > > I'd rather fix that, then.
> >
> > we'd just need to define a new kind of communication API between a ro
> > mounted fs and the blkdev layer to avoid the special cases.
> 
> No, notice how the simple code _should_ work for ro-mounted filesystems
> as-is. AND it should work for read-only opens of disks with rw-mounted

correct.

> filesystems. The only case it doesn't like is the "rw-open of a device
> that is rw-mounted".

it also doesn't work for ro-open of a device that is rw-mounted, hdparm
-t, as said a million times now.

> It's only filesystems that have modified buffers without marking them
> dirty (by virtue of having pointers to buffers and delaying the dirtying
> until later) that are broken by the "try to make sure all buffers are
> up-to-date by reading them in" approach.

All filesystems mark the buffers dirty after modifying them. The
problem is that they don't always do lock_buffer(); modify; mark_dirty();
unlock_buffer(), so we cannot safely do the update-I/O from disk to
buffer cache on the pinned buffers, or we risk screwing up the
filesystem while it's in the "modify; mark_dirty()" part.
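
To spell the window out (illustration only, using the pattern filesystems
follow today):

	bh = getblk(dev, block, size);
	/* ... filesystem modifies bh->b_data in place ... */
	memcpy(bh->b_data, new_data, size);
	/*
	 * Window: a blkdev coherency pass that "refreshes" clean buffers
	 * from disk here sees a buffer that is not yet dirty and silently
	 * overwrites the in-memory modification.
	 */
	mark_buffer_dirty(bh);

A safe update-from-disk would require the stricter discipline

	lock_buffer(bh);
	memcpy(bh->b_data, new_data, size);
	mark_buffer_dirty(bh);
	unlock_buffer(bh);

which is exactly the discipline filesystems don't always follow.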

> And as I don't think the concurrent "rw-open + rw-mount" case makes any
> sense, I think the sane solution is simply to disallow it completely.

I don't disallow it, I just don't guarantee coherency after that; the fs
could be screwed up if you rw-open + rw-mount, and that's ok.

The case where it is not ok is hdparm: ro-open + rw-mounted.

> So if we simply disallow that case, then we do not have any problem:
> either the device was opened read-only (in which case it obviously doesn't
> need to flush anything to disk, or invalidate/revalidate existing disk

Then you are making a special case: "the device was opened read-only, in
which case it obviously doesn't need to flush anything to disk" means
you are not running the update_buffers() if the open was O_RDONLY (if
you make this special case you would be safe on that side).  But also
the user may do O_RDWR open and not modify the device, and still we must
not screwup a rw-mountd fs under it.

As far as I can tell my synchronization code is the simplest possible and
it's completely safe, without changing the synchronization API for the
pinned buffers and without making getblk entirely backed by the blkdev
pagecache, which has a larger impact on the kernel.

Making getblk blkdev-pagecache backed seems the cleaner way to get rid of
those aliasing issues: it would allow limiting the very last close to a
__block_fsync + truncate_inode_pages() [which is a 5-line cleanup compared
to the current state of my patch].  With it, not even the last blkdev
release would run the block_fsync, since the fs still holds a reference on
the same side; the fs would be just another blkdev user, as if it came from
userspace. However, this logic is potentially more invasive than the
blkdev-pagecache code that matters, and it doesn't provide a single
advantage to the user, so this isn't at the top of my list at the moment;
it sounds like 2.5 cleanup material.

Andrea

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-09 12:09                             ` linux-2.4.10-pre5 Rik van Riel
@ 2001-09-09 14:53                               ` Andrea Arcangeli
  2001-09-09 18:17                               ` linux-2.4.10-pre5 Linus Torvalds
  1 sibling, 0 replies; 94+ messages in thread
From: Andrea Arcangeli @ 2001-09-09 14:53 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Linus Torvalds, Kernel Mailing List, Alexander Viro

On Sun, Sep 09, 2001 at 09:09:53AM -0300, Rik van Riel wrote:
> On Sat, 8 Sep 2001, Linus Torvalds wrote:
> 
> > It's only filesystems that have modified buffers without marking them
> > dirty (by virtue of having pointers to buffers and delaying the dirtying
> > until later) that are broken by the "try to make sure all buffers are
> > up-to-date by reading them in" approach.
> 
> Think of the inode and dentry caches.  I guess we need
> some way to invalidate those.

I call invalidate_device for that reason before starting the update
(this is independent of the blkdev-pagecache patch though; the problem
with the higher-level caches applies to mainline as well at the last
blkdev close).

That's meant to give the ro-mounted fs a better chance to notice the
modifications done by userspace. Probably invalidate_device should also
call shrink_dcache_sb before running invalidate_inodes though...
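
Sketching it (assuming invalidate_device() is roughly "sync, drop inodes,
drop buffers" as in current 2.4; the shrink_dcache_sb() call is the only new
line):

	void invalidate_device(kdev_t dev, int do_sync)
	{
		struct super_block *sb = get_super(dev);

		if (do_sync)
			fsync_dev(dev);
		if (sb) {
			shrink_dcache_sb(sb);	/* proposed: drop cached dentries first */
			invalidate_inodes(sb);
		}
		invalidate_buffers(dev);
	}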

Andrea

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-09 14:47                             ` linux-2.4.10-pre5 Andrea Arcangeli
@ 2001-09-09 16:24                               ` Linus Torvalds
  2001-09-09 17:29                                 ` linux-2.4.10-pre5 Andrea Arcangeli
  0 siblings, 1 reply; 94+ messages in thread
From: Linus Torvalds @ 2001-09-09 16:24 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Kernel Mailing List, Alexander Viro


On Sun, 9 Sep 2001, Andrea Arcangeli wrote:
>
> > filesystems. The only case it doesn't like is the "rw-open of a device
> > that is rw-mounted".
>
> it also doesn't work for ro-open of a device that is rw-mounted, hdparm
> -t, as said a million times now.

It _does_ work for that case, and you just aren't reading my emails.

You only need to invalidate the device if the open was a read-write open.

It would be _stupid_ to force a writeback and device invalidate for
read-only opens, now wouldn't it?

The fact that you cannot know the difference between a read-only and a
read-write open is _entirely_ due to the fact that you leave the flush
until the last close. If you do it at every close (like I have said for
the last twohundred mails or so), you can trivially see if the open was a
read-only or not.

		Linus


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-09 16:24                               ` linux-2.4.10-pre5 Linus Torvalds
@ 2001-09-09 17:29                                 ` Andrea Arcangeli
  0 siblings, 0 replies; 94+ messages in thread
From: Andrea Arcangeli @ 2001-09-09 17:29 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List, Alexander Viro

On Sun, Sep 09, 2001 at 09:24:51AM -0700, Linus Torvalds wrote:
> 
> On Sun, 9 Sep 2001, Andrea Arcangeli wrote:
> >
> > > filesystems. The only case it doesn't like is the "rw-open of a device
> > > that is rw-mounted".
> >
> > it also doesn't work for ro-open of a device that is rw-mounted, hdparm
> > -t, as said a million times now.
> 
> It _does_ work for that case, and you just aren't reading my emails.
> 
> You only need to invalidate the device if the open was a read-write open.
> 
> It would be _stupid_ to force a writeback and device invalidate for
> read-only opens, now wouldn't it?

1) you add a special case just for that, and you argued for ages that you
   don't want special cases in the blkdev close path, and I agree with that
   (in fact I don't have special cases except where strictly needed to be
   safe)

> The fact that you cannot know the difference between a read-only and a
> read-write open is _entirely_ due to the fact that you leave the flush
> until the last close. If you do it at every close (like I have said for
> the last two hundred mails or so), you can trivially see if the open was a
> read-only or not.

If you read my previous email completely:

	you make this special case you would be safe on that side).  But also
								     ^^^^^^^^
	the user may do O_RDWR open and not modify the device, and still we must
	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	not screwup a rw-mountd fs under it.
	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

you would now know why even adding the O_RDONLY special case isn't
enough. I can rephrase it: the user is allowed to open(O_RDWR), to only
read(2) and never write(2), and still expect not to corrupt the
underlying fs. We cannot check all the tools out there to verify that they
open the device with O_RDONLY when they're not going to write to it. So
if some tool opens with O_RDWR and never uses write(2), we shouldn't
corrupt the underlying rw-mounted fs, in my vocabulary. I considered all
these details while writing the current code.

Andrea

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-09  6:17                             ` linux-2.4.10-pre5 Andreas Dilger
@ 2001-09-09 17:31                               ` Linus Torvalds
  2001-09-09 19:19                                 ` linux-2.4.10-pre5 Daniel Phillips
  2001-09-09 19:51                               ` linux-2.4.10-pre5 Daniel Phillips
  1 sibling, 1 reply; 94+ messages in thread
From: Linus Torvalds @ 2001-09-09 17:31 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Andrea Arcangeli, Kernel Mailing List, Daniel Phillips


On Sun, 9 Sep 2001, Andreas Dilger wrote:
>
> I think this fits in with your overall strategy as well - remove the buffer
> as a "cache" object, and only use it as an I/O object, right?  With this
> change, all of the cache functionality is in the page cache, and the buffers
> are only used as handles for I/O.

Absolutely. It would be wonderful to get rid of the buffer hashes, and
just replace them with walking the page hash instead (plus walking the
per-page bh list if we have non-page-sized buffers at non-zero offset). It
would also clearly make the page cache be the allocation unit and VM
entity.
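
Concretely, the hash-less lookup could be little more than a page cache
probe plus a walk of the per-page buffer ring; a sketch (blkdev_mapping() is
an assumed helper, and locking and blocksize checks are omitted):

struct buffer_head *get_hash_table(kdev_t dev, int block, int size)
{
	struct address_space *mapping = blkdev_mapping(dev);	/* assumed */
	int per_page = PAGE_CACHE_SIZE / size;
	struct page *page = find_get_page(mapping, block / per_page);
	struct buffer_head *bh = NULL;
	int i;

	if (!page)
		return NULL;
	if (page->buffers) {
		bh = page->buffers;
		for (i = block % per_page; i > 0; i--)
			bh = bh->b_this_page;
		atomic_inc(&bh->b_count);
	}
	page_cache_release(page);
	return bh;
}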

The interesting thing is that once you remove the buffer hash entries,
there's not a lot inside "struct buffer_head" that isn't required for IO
anyway. Maybe we could do without bh->b_count, but it is at least
currently required for backwards compatibility with all the code that
thinks buffer heads are autonomous entities. But I actually suspect it
makes a lot of sense even for a stand-alone IO entity (I'm a firm believer
in reference counting as a way to avoid memory management trouble).

The LRU list and page list is needed for VM stuff, and could be cleanly
separated out (nothing to do with actual IO). Same goes for b_inode and
b_inode_buffers.

And b_data could be removed, as the information is implied in b_page and
the position there-in, but at the same time it's probably useful enough
for low-level IO to leave.

So I'd like to see this kind of cleanup, especially as it would apparently
both clean up a higher-level memory management issue _and_ make it much
easier to make the transition to a page-cache for the user accesses (which
is just _required_ for mmap and friends to work on physical devices).

Dan, how much of this do you have?

		Linus


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-09 12:09                             ` linux-2.4.10-pre5 Rik van Riel
  2001-09-09 14:53                               ` linux-2.4.10-pre5 Andrea Arcangeli
@ 2001-09-09 18:17                               ` Linus Torvalds
  2001-09-09 20:18                                 ` linux-2.4.10-pre5 H. Peter Anvin
  1 sibling, 1 reply; 94+ messages in thread
From: Linus Torvalds @ 2001-09-09 18:17 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrea Arcangeli, Kernel Mailing List, Alexander Viro


On Sun, 9 Sep 2001, Rik van Riel wrote:
> On Sat, 8 Sep 2001, Linus Torvalds wrote:
>
> > It's only filesystems that have modified buffers without marking them
> > dirty (by virtue of having pointers to buffers and delaying the dirtying
> > until later) that are broken by the "try to make sure all buffers are
> > up-to-date by reading them in" approach.
>
> Think of the inode and dentry caches.  I guess we need
> some way to invalidate those.

Note that we've never done it before. The inode and dentry caches have
never been coherent with the buffer cache, and we have in fact
historically not even tried to shrink them (to try to minimize the impact
of the non-coherency).

The inode data in memory doesn't even show _up_ in the buffer cache, for
example. Much less dentries, which are so virtualized inside the kernel
that they have absolutely no information on whether they exist on a disk
at all, much less any way to map them to a disk location.

I agree that coherency wrt fsck is something that theoretically would be a
GoodThing(tm). And this is, in fact, why I believe that filesystem
management _must_ have a good interface to the low-level filesystem.
Because you cannot do it any other way.

This is not a fsck-only issue. I am a total non-believer in the "dump"
program, for example. I always found it to be a totally ridiculous and
idiotic way to make backups. It would be much better to have good
(filesystem-independent) interfaces to do what "dump" wants to do (ie have
ways of explicitly bypassing the accessed bits and get the full inode
information etc).

Nobody does a "read()" on directories any more to parse the directory
structure. Similarly, nobody should have done a "dump" on the raw device
any more for the last 20 years or so. But some backwards places still do.

Backup of a live filesystem should be much easier than fsck of a live
filesystem, but I believe that there are actually lots of common issues
there, and such an interface should eventually be designed with both in
mind. Both want to get raw inode lists, for example. Both potentially want
to be able to read (and specify) block positions on the disk. Etc etc.

And notice how de-fragmentation falls neatly into this hole too: I bet
that if you had good management interfaces for fsck and backup, you'd
pretty much automatically be able to do defrag through the same
interfaces.

			Linus


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-09 17:31                               ` linux-2.4.10-pre5 Linus Torvalds
@ 2001-09-09 19:19                                 ` Daniel Phillips
  2001-09-09 23:24                                   ` linux-2.4.10-pre5 Linus Torvalds
  0 siblings, 1 reply; 94+ messages in thread
From: Daniel Phillips @ 2001-09-09 19:19 UTC (permalink / raw)
  To: Linus Torvalds, Andreas Dilger; +Cc: Andrea Arcangeli, Kernel Mailing List

On September 9, 2001 07:31 pm, Linus Torvalds wrote:
> On Sun, 9 Sep 2001, Andreas Dilger wrote:
> >
> > I think this fits in with your overall strategy as well - remove the buffer
> > as a "cache" object, and only use it as an I/O object, right?  With this
> > change, all of the cache functionality is in the page cache, and the buffers
> > are only used as handles for I/O.
> 
> Absolutely. It would be wonderful to get rid of the buffer hashes, and
> just replace them with walking the page hash instead (plus walking the
> per-page bh list if we have non-page-sized buffers at non-zero offset). It
> would also clearly make the page cache be the allocation unit and VM
> entity.
> 
> The interesting thing is that once you remove the buffer hash entries,
> there's not a lot inside "struct buffer_head" that isn't required for IO
> anyway.

Let me take inventory here, grouping the fields:

Needed for basic IO functionality:
        atomic_t b_count;               /* users using this block */
        unsigned long b_state;          /* buffer state bitmap (see above) */
        unsigned long b_blocknr;        /* block number */
        unsigned long b_flushtime;      /* Time when (dirty) buffer should be written */
        struct page *b_page;            /* the page this bh is mapped to */
        struct buffer_head *b_reqnext;  /* request queue */
        wait_queue_head_t b_wait;

Can get from mapping:
        unsigned short b_size;          /* block size */
        kdev_t b_dev;                   /* device (B_FREE = free) */
        void (*b_end_io)(struct buffer_head *bh, int uptodate); /* I/O completion */
        struct inode * b_inode;

Not needed with variable page size:
        unsigned short b_size;          /* block size */
        struct buffer_head *b_this_page;/* circular list of buffers in one page */
        char * b_data;                  /* pointer to data block */
        struct list_head b_inode_buffers;   /* doubly linked list of inode dirty buffers */

Could possibly get rid of (with a page cache mapping):
        struct buffer_head *b_next;     /* Hash queue list */
        struct buffer_head **b_pprev;   /* doubly linked list of hash-queue */

Used by raid, loop and highmem, could move to request struct:
        void *b_private;                /* reserved for b_end_io */

Should die:
        kdev_t b_rdev;                  /* Real device */
        unsigned long b_rsector;        /* Real buffer location on disk */
        struct buffer_head *b_next_free;/* lru/free list linkage */
        struct buffer_head *b_prev_free;/* doubly linked list of buffers */

(Note b_size appears twice in the list above.)  So it's about evenly split
between fields we needed even if the buffer just becomes an IO tag, and
fields that could be gotten rid of.  The b_wait field could go since we're
really waiting on an IO request, which has its own wait field.

> Maybe we could do without bh->b_count, but it is at least
> currently required for backwards compatibility with all the code that
> thinks buffer heads are autonomous entities. But I actually suspect it
> makes a lot of sense even for a stand-alone IO entity (I'm a firm believer
> in reference counting as a way to avoid memory management trouble).

Maybe.  We might be able to tell from the state flags and the page use count
that the buffer head is really freeable.

> The LRU list and page list is needed for VM stuff, and could be cleanly
> separated out (nothing to do with actual IO). Same goes for b_inode and
> b_inode_buffers.

We can easily get rid of b_inode, since it's in the page->mapping.

> And b_data could be removed, as the information is implied in b_page and
> the position there-in, but at the same time it's probably useful enough
> for low-level IO to leave.
>
> So I'd like to see this kind of cleanup, especially as it would apparently
> both clean up a higher-level memory management issue _and_ make it much
> easier to make the transition to a page-cache for the user accesses (which
> is just _required_ for mmap and friends to work on physical devices).
> 
> Dan, how much of this do you have?

Working code?  Just the page cache version of ext2_getblk in the directory
indexing patch.  This seems to have worked out fairly well.  Though Al
finds it distasteful on philosophical grounds it does seem to be a pragmatic
way to cut through the current complexity, and combines well with Al's
straight-up page cache code without apparent breakage.

I have put considerable thought into how to move all the rest of the
remaining Ext2 buffer code into page cache, but this is still at the design
stage.  Most of it is easy: group descriptors, block/inode bitmaps,
superblocks.  The hard part is ext2_get_block and specifically the indirect
blocks, if we want "page cache style" usage and not just transplanted
buffer-style coding.  I've started on this but don't have working code yet.

One observation: the buffer hash link is currently unused for page cache
buffers.  We could possibly use that for reverse mapping from logical inode
blocks to physical device blocks, to combat aliasing.  A spin-off benefit
is, the same mechanism could be used to implement a physical readahead
cache which can do things that logical readahead can't.  For example, it
could do readahead through a group of small files without knowing anything
about the metadata, which we might not have read yet.

--
Daniel

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-09  6:17                             ` linux-2.4.10-pre5 Andreas Dilger
  2001-09-09 17:31                               ` linux-2.4.10-pre5 Linus Torvalds
@ 2001-09-09 19:51                               ` Daniel Phillips
  1 sibling, 0 replies; 94+ messages in thread
From: Daniel Phillips @ 2001-09-09 19:51 UTC (permalink / raw)
  To: Andreas Dilger, Linus Torvalds; +Cc: Andrea Arcangeli, Kernel Mailing List

On September 9, 2001 08:17 am, Andreas Dilger wrote:
> On Sep 08, 2001  21:54 -0700, Linus Torvalds wrote:
> > On Sat, 8 Sep 2001, Andreas Dilger wrote:
> > > So basically - when we move block devices to the page cache, get rid of
> > > buffer cache usage in the filesystems as well?  Ext2 is nearly there at
> > > least.  One alternative is as Daniel Phillips did in the indexed-ext2-
> > > directory patch, where he kept the "bread" interface, but backed it
> > > with the page cache, so it required relatively little change to the
> > > filesystem.
> > 
> > This might be a really easy solution. We might just make sure that the
> > buffer manipulation interfaces we export to filesystems (and there aren't
> > actually all that many of them - it's mainly bread and getblk) always end
> > up using the page cache, and just return the buffer head that is embedded
> > inside the page cache.
> > 
> > That way we don't have any new aliasing issues _at_all_. The user-mode
> > accesses to the block devices would always end up using the same buffers
> > that the low-level filesystem does.
> 
> > Andrea(s) - interested in pursuing this particular approach? In fact,
> > since "bread()" uses "getblk()", it is almost sufficient to just make
> > getblk()  use the page cache, and the rest will follow... You can even get
> > rid of the buffer hash etc, and make the buffer head noticeably smaller.
> > 
> > [ Yeah, I'm being a bit optimistic - you also end up having to re-write
> >   "get_hash_table()" to use a page cache lookup etc. So it's definitely
> >   some major surgery in fs/buffer.c, but "major" might actually be just a
> >   couple of hundred lines ]
> 
> Well, Daniel probably has the best handle on the state of this code (it
> may be that he has already done 90% of the work).  I've CC'd him on this
> to get him in the loop.

Hi, Andreas.  Yes, I've had some success with that approach and it's nice to
see this particular wheel being reinvented.  The original suggestion came
from SCT by the way.

> > And no filesystem should ever notice. They can still access the buffer
> > head as if it was just a buffer head, and wouldn't care about the fact
> > that it happens to be part of a mapping.
> > 
> > Any pitfalls?
> > 
> > [ I can see at least one already: __invalidate_buffers() and
> >   set_blocksize() would both have to be re-done, probably along the lines
> >   of "invalidate_inode_pages()" and "fsync+truncate_inode_pages()"
> >   respectively. ]

Set_blocksize looks like a malformed idea anyway.  I guess its purpose in life
is to get around the fact that we don't know the filesystem blocksize until
after reading the superblock.  So why doesn't the filesystem just fill in the
field in struct super_block and let the vfs worry about it?

If we are going to integrate buffers with the page cache then we have
buffer->page->mapping->inode->super->blocksize and we usually have most of
that chain dereferenced in all the interesting places.
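
In 2.4 field names the chain is literally (illustration only, NULL checks
omitted):

	static inline unsigned long mapping_blocksize(struct buffer_head *bh)
	{
		return bh->b_page->mapping->host->i_sb->s_blocksize;
	}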

> > Comments?
> 
> I think this fits in with your overall strategy as well - remove the buffer
> as a "cache" object, and only use it as an I/O object, right?  With this
> change, all of the cache functionality is in the page cache, and the buffers
> are only used as handles for I/O.

It's a step in that direction.  Because of the page vs block size mismatch I
find myself still using the buffer_head as a data object, typically because
of locking granularity problems.  If we are really headed towards variable
page size then this problem goes away, and the struct page can be the data
handle in every case.  This is looking more likely all the time, which is one
reason I took the side trip in mm hacking - to understand that code well
enough to know if there are any real stoppers.  At this point I don't think
there are any.  Looking through what little literature is available on this,
fragmentation is the only real problem mentioned, and we're already busily
chewing away at that one.  Rik's rmaps would help a lot there.

--
Daniel

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-09 18:17                               ` linux-2.4.10-pre5 Linus Torvalds
@ 2001-09-09 20:18                                 ` H. Peter Anvin
  2001-09-10  0:39                                   ` linux-2.4.10-pre5 Simon Kirby
  2001-09-10 21:22                                   ` linux-2.4.10-pre5 Stephen C. Tweedie
  0 siblings, 2 replies; 94+ messages in thread
From: H. Peter Anvin @ 2001-09-09 20:18 UTC (permalink / raw)
  To: linux-kernel

Followup to:  <Pine.LNX.4.33.0109091105380.14479-100000@penguin.transmeta.com>
By author:    Linus Torvalds <torvalds@transmeta.com>
In newsgroup: linux.dev.kernel
> 
> I agree that coherency wrt fsck is something that theoretically would be a
> GoodThing(tm). And this is, in fact, why I believe that filesystem
> management _must_ have a good interface to the low-level filesystem.
> Because you cannot do it any other way.
> 
> This is not a fsck-only issue. I am a total non-believer in the "dump"
> program, for example. I always found it to be a totally ridiculous and
> idiotic way to make backups. It would be much better to have good
> (filesystem-independent) interfaces to do what "dump" wants to do (ie have
> ways of explicitly bypassing the accessed bits and get the full inode
> information etc).
> 
> Nobody does a "read()" on directories any more to parse the directory
> structure. Similarly, nobody should have done a "dump" on the raw device
> any more for the last 20 years or so. But some backwards places still do.
> 

The main reason people still seem to give to justify using dump/restore is
-- believe it or not -- the inability to set atime.  One would think this
would be a trivial extension to the VFS, even if protected by a
capability (CAP_BACKUP?).

The ideal way to run backups I have found is on filesystems which
support atomic snapshots -- that way, your backup set becomes not only
safe (since it goes through the kernel etc. etc.) but totally
coherent, since it is guaranteed to be unchanging.  This is a major
win for filesystems which can do atomic snapshots, and I'd highly
encourage filesystem developers to consider this feature.

	-hpa



-- 
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt	<amsp@zytor.com>

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-09 19:19                                 ` linux-2.4.10-pre5 Daniel Phillips
@ 2001-09-09 23:24                                   ` Linus Torvalds
  2001-09-09 23:54                                     ` linux-2.4.10-pre5 Alan Cox
                                                       ` (2 more replies)
  0 siblings, 3 replies; 94+ messages in thread
From: Linus Torvalds @ 2001-09-09 23:24 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Andreas Dilger, Andrea Arcangeli, Kernel Mailing List


On Sun, 9 Sep 2001, Daniel Phillips wrote:
>
> Needed for basic IO functionality:
>         atomic_t b_count;               /* users using this block */
>         unsigned long b_state;          /* buffer state bitmap (see above) */
>         unsigned long b_blocknr;        /* block number */
>         unsigned long b_flushtime;      /* Time when (dirty) buffer should be written */
>         struct page *b_page;            /* the page this bh is mapped to */
>         struct buffer_head *b_reqnext;  /* request queue */
>         wait_queue_head_t b_wait;
>
> Can get from mapping:
>         unsigned short b_size;          /* block size */

No.

Don't fall into the trap of thinking that blocksizes are constant for a
mapping.

Yes, they are _right_now_, if only because of limitations of the old
buffer cache. But it's very very wrong to think you can get the buffer
size from the mapping.

Besides - we want to be able to do IO _without_ any mappings at all. The
"iobuf" thing should be completely mapping-independent, in case somebody
wants to just do raw IO.

Which means that these two:

>         kdev_t b_dev;                   /* device (B_FREE = free) */
>         void (*b_end_io)(struct buffer_head *bh, int uptodate); /* I/O completion */

also have to be in the iobuf anyway, quite regardless of any other issues.

> Could possibly get rid of (with a page cache mapping):
>         struct buffer_head *b_next;     /* Hash queue list */
>         struct buffer_head **b_pprev;   /* doubly linked list of hash-queue */

Absolutely.

> Used by raid, loop and highmem, could move to request struct:
>         void *b_private;                /* reserved for b_end_io */

No. We only have one request struct, and we can have hundreds of buffers
associated with it. This is very much an iobuf thing.

> Should die:
>         kdev_t b_rdev;                  /* Real device */
>         unsigned long b_rsector;        /* Real buffer location on disk */

No - these are what all the indirect IO code uses.


>         struct buffer_head *b_next_free;/* lru/free list linkage */
>         struct buffer_head *b_prev_free;/* doubly linked list of buffers */

Right now we keep the dirty list in these. We should drop it eventually,
and just every time we mark a buffer dirty we also mark the page dirty,
and then we use the _real_ dirty lists (which are nicely per-inode etc).
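
Presumably something along these lines (sketch only; how the mapping's
dirty tracking is organized at that point is an open detail):

	static inline void mark_buffer_and_page_dirty(struct buffer_head *bh)
	{
		mark_buffer_dirty(bh);
		if (bh->b_page)
			set_bit(PG_dirty, &bh->b_page->flags);	/* page-level dirty bit */
	}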

> > Maybe we could do without bh->b_count, but it is at least
> > currently required for backwards compatibility with all the code that
> > thinks buffer heads are autonomous entities. But I actually suspect it
> > makes a lot of sense even for a stand-alone IO entity (I'm a firm believer
> > in reference counting as a way to avoid memory management trouble).
>
> Maybe.  We might be able to tell from the state flags and the page use count
> that the buffer head is really freeable.

I doubt it. Almost all data structures that are reachable many ways need
to have a _count_.

People who think that locking is a way of "pinning" data structures are
invariably wrong, stupid, and need to get spanked. Reference counts rule.

> > Dan, how much of this do you have?
>
> Working code?  Just the page cache version of ext2_getblk in the directory
> indexing patch.

Which is probably fine - we could replace the existing getblk with it, and
do only minor fixups. Maybe.

> One observation: the buffer hash link is currently unused for page cache
> buffers.  We could possibly use that for reverse mapping from logical inode
> blocks to physical device blocks, to combat aliasing.

That still assumes that we have to look them up, which I'd rather avoid.
Also, I'd rather get rid of the buffer hash link completely, instead of
making it more confusing..

>						  A spin-off benefit
> is, the same mechanism could be used to implement a physical readahead
> cache which can do things that logical readahead can't.

Considering that 99.9% of all disks do this on a lower hardware layer
anyway, I very much doubt it has any good properties to make software more
complex by having that kind of readahead in sw.

		Linus


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-09 23:24                                   ` linux-2.4.10-pre5 Linus Torvalds
@ 2001-09-09 23:54                                     ` Alan Cox
  2001-09-10  0:04                                       ` linux-2.4.10-pre5 Daniel Phillips
  2001-09-10  0:23                                       ` linux-2.4.10-pre5 Linus Torvalds
  2001-09-10  0:23                                     ` linux-2.4.10-pre5 Daniel Phillips
  2001-09-10 21:18                                     ` linux-2.4.10-pre5 Daniel Phillips
  2 siblings, 2 replies; 94+ messages in thread
From: Alan Cox @ 2001-09-09 23:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Daniel Phillips, Andreas Dilger, Andrea Arcangeli, Kernel Mailing List

How do you plan to handle the situation where we have multiple instances
of the same 4K disk block, each of which contains 1K of data at the
start of the page copy and 3K of zeroes?

This isn't idle speculation - think about BSD UFS fragments.

> anyway, I very much doubt it has any good properties to make software more
> complex by having that kind of readahead in sw.

Even the complex stuff like the i2o raid controllers seems to benefit
primarily from file-level, not physical, readahead; that gives it enough to
do intelligent scheduling and to keep the drive firmware busy making good
decisions (hopefully).

Even with software raid, the physical readahead decisions belong in the raid
code, as they are tied to stripe alignment and sizes, not high-level policy.

Alan

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-09  4:28                           ` linux-2.4.10-pre5 Linus Torvalds
  2001-09-09 12:09                             ` linux-2.4.10-pre5 Rik van Riel
  2001-09-09 14:47                             ` linux-2.4.10-pre5 Andrea Arcangeli
@ 2001-09-09 23:56                             ` Daniel Phillips
  2 siblings, 0 replies; 94+ messages in thread
From: Daniel Phillips @ 2001-09-09 23:56 UTC (permalink / raw)
  To: Linus Torvalds, Andrea Arcangeli; +Cc: Kernel Mailing List, Alexander Viro

On September 9, 2001 06:28 am, Linus Torvalds wrote:
> [ start future rambling ]
> 
> Now, I actually believe that we would eventually be better off by having a
> real filesystem interface for doing kernel-assisted fsck, backup and
> de-fragmentation (all just three different sides of the same thing:
> filesystem administration), but that's a separate issue, and right now we
> have absolutely no clue about what such an interface would look like. So
> that's a long-term thing (other OS's have such interfaces, but they tend
> to be specialized for only a subset of possible filesystems. Think
> Windows defrag).
> 
> That kind of explicit filesystem maintenance interface would make all the
> remaining issues go away completely, and would at least in theory allow
> on-line fixing of already mounted filesystems. Assuming a good interface
> and algorithm for it exists, of course.

A more modest goal, which I'm simply restating from earlier in the thread, 
would be to achieve a coherent view of a volume at the block level.  This is 
just a cache coherency problem.  The trouble is, every solution that's been 
looked at so far has a non-zero cost associated with it for the normal case 
of no sharing.

*However*.  If we could find a method of implementing the cache coherency so 
it actually accelerates the common case then it would become an attractive 
thing to do.

Physically-based readahead might be such a magic bullet.  The structural cost 
of the cache coherency would be a reverse-lookup hash chain on each page 
cache page, which we already almost have with the unused buffer hash links.  
The operational cost would be an extra hash probe for each block read, write, 
create, or delete.  Against this, physical readahead can sometimes do a 
better job than logical readahead because it can optimize across files and 
metadata structures, something that is difficult to do at the logical level.  
Spending a few extra CPU cycles to buy improved filesystem latency in the 
common case would be a good trade.

While I'm in here between the [future] brackets, I'll mention a few points
in favor:

  - Some of the reverse probes can be optimized away, for example, when we
    find a page in the pagecache we can skip the reverse probe because we
    know it's not in the physical (aka buffer) cache.  This should be the
    common case.

  - Scsi disks, and now some IDE disks, already do physical readahead, but 
    they typically have far less cache memory than the processor does.  If
    physical readahead is good, then more must be better ;-)  In any
    event, moving the physical readahead onto the processor side of the
    bus reduces latency.

  - Physical readahead and logical readahead could conceivably work
    together, for example, by using an interface where the vfs provides
    hints to a physical readahead process, rather than taking on the job
    of intiating IO itself.

  - There are opportunities to do zero-copy moves from the physical
    cache to the page cache, by relinking the underlying page instead
    of copying.

  - This isn't a huge change to the way we already do things.  The
    interface to the physical cache is simply the existing buffer
    API.  We repurpose the currently unused hash chain for page cache 
    buffers, and that's transparent.  The vm scanner has to know a
    few more rules about freeing buffers.  The actual cache
    consistency is implemented transparently by the block IO library.

  - If we ever do get to the point of having variable sized page
    objects the implementation gets simpler.   Not that it's
    particularly complex as it is.

> [ end future rambling ]

--
Daniel

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-09 23:54                                     ` linux-2.4.10-pre5 Alan Cox
@ 2001-09-10  0:04                                       ` Daniel Phillips
  2001-09-10  0:23                                       ` linux-2.4.10-pre5 Linus Torvalds
  1 sibling, 0 replies; 94+ messages in thread
From: Daniel Phillips @ 2001-09-10  0:04 UTC (permalink / raw)
  To: Alan Cox, Linus Torvalds
  Cc: Andreas Dilger, Andrea Arcangeli, Kernel Mailing List

On September 10, 2001 01:54 am, Alan Cox wrote:
> How do you plan to handle the situation where we have multiple instances
> of the same 4K disk block, each of which contains 1K of data at the
> start of the page copy and 3K of zeroes?
> 
> This isn't idle speculation - think about BSD UFS fragments.

This is handled the same way ReiserFS handles tail merging now - by 
instantiating multiple pages in the respective inode mappings with copies
of the respective parts of the shared block, which stays in the physical
block cache (aka buffer cache).

The dirty work is done by xxx_get_block.

--
Daniel

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-09 23:24                                   ` linux-2.4.10-pre5 Linus Torvalds
  2001-09-09 23:54                                     ` linux-2.4.10-pre5 Alan Cox
@ 2001-09-10  0:23                                     ` Daniel Phillips
  2001-09-10  0:38                                       ` linux-2.4.10-pre5 Linus Torvalds
  2001-09-10 21:18                                     ` linux-2.4.10-pre5 Daniel Phillips
  2 siblings, 1 reply; 94+ messages in thread
From: Daniel Phillips @ 2001-09-10  0:23 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andreas Dilger, Andrea Arcangeli, Kernel Mailing List

On September 10, 2001 01:24 am, Linus Torvalds wrote:
> On Sun, 9 Sep 2001, Daniel Phillips wrote:
> >
> > Needed for basic IO functionality:
> >         atomic_t b_count;               /* users using this block */
> >         unsigned long b_state;          /* buffer state bitmap (see above) */
> >         unsigned long b_blocknr;        /* block number */
> >         unsigned long b_flushtime;      /* Time when (dirty) buffer should be written */
> >         struct page *b_page;            /* the page this bh is mapped to */
> >         struct buffer_head *b_reqnext;  /* request queue */
> >         wait_queue_head_t b_wait;
> >
> > Can get from mapping:
> >         unsigned short b_size;          /* block size */
> 
> No.
> 
> Don't fall into the trap of thinking that blocksizes are constant for a
> mapping.
> 
> Yes, they are _right_now_, if only because of limitations of the old
> buffer cache. But it's very very wrong to think you can get the buffer
> size from the mapping.

Well, I'm happy enough to get this far down the list before getting the first
Bzzt.  But why do you think we need different block sizes in the same 
mapping?  Maybe NTFS is bizarre enough to need that, but then why not just 
have special mappings for the "misc" sizes?

> Besides - we want to be able to do IO _without_ any mappings at all. The
> "iobuf" thing should be completely mapping-independent, in case somebody
> wants to just do raw IO.
> 
> Which means that these two:
> 
> >         kdev_t b_dev;                   /* device (B_FREE = free) */
> >         void (*b_end_io)(struct buffer_head *bh, int uptodate); /* I/O completion */
> 
> also have to be in the iobuf anyway, quite regardless of any other issues.
> 
> > Could possibly get rid of (with a page cache mapping):
> >         struct buffer_head *b_next;     /* Hash queue list */
> >         struct buffer_head **b_pprev;   /* doubly linked list of hash-queue */
> 
> Absolutely.
> 
> > Used by raid, loop and highmem, could move to request struct:
> >         void *b_private;                /* reserved for b_end_io */
> 
> No. We only have one request struct, and we can have hundreds of buffers
> associated with it. This is very much a iobuf thing.

Yes, duh.

> > Should die:
> >         kdev_t b_rdev;                  /* Real device */
> >         unsigned long b_rsector;        /* Real buffer location on disk */
> 
> No - these are what all the indirect IO code uses.

They're still warts.  It would be nice if the various block remappers could
provide their own pool of remapping structs.  Note that bounce buffers are
never in the page cache, so there are a 

> 
> >         struct buffer_head *b_next_free;/* lru/free list linkage */
> >         struct buffer_head *b_prev_free;/* doubly linked list of buffers */
> 
> Right now we keep the dirty list in these. We should drop it eventually,
> and just every time we mark a buffer dirty we also mark the page dirty,
> and then we use the _real_ dirty lists (which are nicely per-inode etc).
> 
> > > Maybe we could do without bh->b_count, but it is at least
> > > currently required for backwards compatibility with all the code that
> > > thinks buffer heads are autonomous entities. But I actually suspect it
> > > makes a lot of sense even for a stand-alone IO entity (I'm a firm believer
> > > in reference counting as a way to avoid memory management trouble).
> >
> > Maybe.  We might be able to tell from the state flags and the page use count
> > that the buffer head is really freeable.
> 
> I doubt it. Almost all data structures that are reachable many ways need
> to have a _count_.
> 
> People who think that locking is a way of "pinning" data structures are
> invariably wrong, stupid, and need to get spanked. Reference counts rule.

I didn't want to get rid of it, it's in line with how all the other objects
are handled.  It's robust.

> > > Dan, how much of this do you have?
> >
> > Working code?  Just the page cache version of ext2_getblk in the directory
> > indexing patch.
> 
> Which is probably fine - we could replace the existing getblk with it, and
> do only minor fixups. Maybe.

Yep, maybe.  Christoph was looking at trying this idea with vxfs, I should
ping him on that.

> > One observation: the buffer hash link is currently unused for page cache
> > buffers.  We could possibly use that for reverse mapping from logical inode
> > blocks to physical device blocks, to combat aliasing.
> 
> That still assumes that we have to look them up, which I'd rather avoid.
> Also, I'd rather get rid of the buffer hash link completely, instead of
> making it more confusing..

Just to start you thinking...
 
> >						  A spin-off benefit
> > is, the same mechanism could be used to implement a physical readahead
> > cache which can do things that logical readahead can't.
> 
> Considering that 99.9% of all disks do this on a lower hardware layer
> anyway, I very much doubt it has any good properties to make software more
> complex by having that kind of readahead in sw.

Well, I already wrote a longer screed on the subject for your Sunday night
reading pleasure ;-)

--
Daniel

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-09 23:54                                     ` linux-2.4.10-pre5 Alan Cox
  2001-09-10  0:04                                       ` linux-2.4.10-pre5 Daniel Phillips
@ 2001-09-10  0:23                                       ` Linus Torvalds
  1 sibling, 0 replies; 94+ messages in thread
From: Linus Torvalds @ 2001-09-10  0:23 UTC (permalink / raw)
  To: Alan Cox
  Cc: Daniel Phillips, Andreas Dilger, Andrea Arcangeli, Kernel Mailing List


On Mon, 10 Sep 2001, Alan Cox wrote:
>
> How do you plan to handle the situation where we have multiple instances
> of the same 4K disk block each of which contains 1K of data in the
> start of the page copy and 3K of zeroes.

Note that the file data is indexed by a completely different index, and
has been since the page cache was introduced. I am _not_ suggesting that
we get rid of _that_ alias. We've always had data aliasing with the page
cache, and it doesn't matter.

Think of it as an address space issue - where each address space has its
own indexing. One of the advantages of the page cache is the fact that it
has the notion of completely independent address spaces, after all.

So we have one index which is the "physical index", which is the one that
the getblk/bread interfaces use, and which is also the one that the
raw-device-in-pagecache code uses. There are no aliasing issues (as long
as we make sure that we have a "meta-inode" to create a single address
space for this index, and do not try to use different inodes for different
representations of the same block device major/minor number).

The other index is the one we already have, namely the file-virtual index.

And switching the first index over from the purely physically indexed
buffer cache to a new page-cache address space doesn't impact this
already-existing index _at_all_.

So in the physical index, you see one 4kB block.

In the virtual index, you have 4 4kB blocks, where the start contents just
happen to be gotten from different parts of the physical 4kB block..

And notice how this is NOT something new at all - this is _exactly_ what
we do now, except our physical index is currently not "wrapped" in any
page cache address space.
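
In (sketch-only) code, the two indexes are then just two address spaces -
nothing below is real code, and the per-device "meta-inode" is the
hypothetical part:

	/* assumes linux/fs.h, linux/pagemap.h; caller does
	 * page_cache_release() on a non-NULL result */

	/* file-virtual index: page N of this file */
	struct page *lookup_virtual(struct inode *file, unsigned long index)
	{
		return find_get_page(file->i_mapping, index);
	}

	/* physical index: page N of the block device's meta-inode */
	struct page *lookup_physical(struct inode *meta_inode, unsigned long index)
	{
		return find_get_page(meta_inode->i_mapping, index);
	}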

> > anyway, I very much doubt it has any good properties to make software more
> > complex by having that kind of readahead in sw.
>
> Even the complex stuff like the i2o raid controllers seems to benefit
> primarily from file level not physical readahead, that gives it enough to
> do intelligent scheduling and to keep the drive firmware busy making good
> decisions (hopefully)

Right. The advantages of read-ahead are two-fold:
 - give the disk (and low-level IO layer) more information about future
   patterns (so that they can generate better schedules for fetching the
   data)
 - overlap IO (especially seek and rotational delays) and computation.

HOWEVER, neither of those advantages actually exist for physical
read-ahead, simply because current disks will always end up doing
buffering anyway, which means that limited physical read-ahead is pretty
much guaranteed by the disk - and doing it in software is only going to
slow things down by generating more traffic over the control/data
channels.

Sure, you could do _extensive_ physical read-ahead (ie more than what
disks tend to do on their own), but that implies a fair number of sectors
(modern disks tend to do something akin to "track buffering" in the 64+kB
range, so you'd have to do noticeably more than that in software to have
any chance of making a difference), and then you'd probably often end up
reading too much and loading the line with data you may not actually need.

		Linus


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-10  0:23                                     ` linux-2.4.10-pre5 Daniel Phillips
@ 2001-09-10  0:38                                       ` Linus Torvalds
  2001-09-10  1:04                                         ` linux-2.4.10-pre5 Andrea Arcangeli
  2001-09-10  1:45                                         ` linux-2.4.10-pre5 Chris Mason
  0 siblings, 2 replies; 94+ messages in thread
From: Linus Torvalds @ 2001-09-10  0:38 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Andreas Dilger, Andrea Arcangeli, Kernel Mailing List


On Mon, 10 Sep 2001, Daniel Phillips wrote:
>
> Well, I'm happy enough to get this far down the list before getting the first
> Bzzt.  But why do you think we need different block sizes in the same
> mapping?  Maybe NTFS is bizarre enough to need that, but then why not just
> have special mappings for the "misc" sizes?

Oh, any extent-based filesystem is going to be _much_ happier just trying
to create the buffer head patterns the way the extents say, rather than
always having to break it down to the smallest common denominator.

The simplest case of this is a tail, where there is nothing wrong with
having a 5kB file, where the first page is a single 4kB buffer, and the
second page is one mapped 1kB buffer and three unmapped buffers. Which
means that you end up reading the file in just two buffers. And they are
of different sizes.

And that's the _simple_ case.
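
As a sketch of that second page (this is roughly what the generic
readpage path already arranges for unmapped blocks; the helper is made
up and assumes four 1kB buffer heads are already attached to the page):

	/* assumes linux/fs.h, linux/mm.h */
	static void map_tail_page(struct page *page, kdev_t dev,
				  unsigned long tail_block)
	{
		struct buffer_head *bh = page->buffers;	/* first 1kB buffer */

		bh->b_dev = dev;
		bh->b_blocknr = tail_block;
		set_bit(BH_Mapped, &bh->b_state);
		/* bh->b_this_page links the other three 1kB buffers, which
		 * stay unmapped: no IO, they just read back as zeroes */
	}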

There are other cases - there are filesystems with unaligned buffers,
which can mean that the pattern for a "struct page" actually ends up being

	|      page 1          |           page 2      |

	1kB |    2kB     | 1kB  1kB |    2kB    |  1kB


So each page actually ends up being _three_ buffers, of different sizes.

Sure, you _could_ force them all to be the smallest common size, but that
only means that the elevator (and the controller) have more work to do.
Why would you do that?

Finally, a true extent-based filesystem might mean that the first 4 pages
were all allocated a single 16kB physical block, and we obviously want to
do those pages as full pages. The next 10kB of the file might have been
allocated scattered 1kB blocks - maybe the disk is really fragmented, and
we just had to do it that way. Should the fact that we have some 1kB
extents mean that we do _all_ IO in 1kB chunks? I don't think so.

> > > Should die:
> > >         kdev_t b_rdev;                  /* Real device */
> > >         unsigned long b_rsector;        /* Real buffer location on disk */
> >
> > No - these are what all the indirect IO code uses.
>
> They're still warts.  It would be nice if the various block remappers could
> provide their own pool of remapping structs.

Fair enough - although at least one block remapper we want to see is the
"partition remapper", which right now is a special case at a the device
driver level, but if you think about it it's really nothing more than a
really simple form of remapping.

And a lot of people _would_ prefer to think about it that way - it would
make the partitioning something that the device driver wouldn't see or
know about. Which is a good thing. But it would also mean that just about
all IO gets _some_ remapping.

And requiring us to have another pool of buffers for every single bh in
flight would be painful, I think you'll agree. So while I agree with you
in theory, I also think that the current setup actually has advantages.
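
Just to show how trivial that remapping is as a stacking step (sketch
only - today the equivalent offset is still applied down in the drivers
from the partition tables, and the helper name is made up):

	/* assumes linux/fs.h */
	static void remap_to_whole_disk(struct buffer_head *bh,
					kdev_t whole_disk,
					unsigned long start_sect)
	{
		bh->b_rdev = whole_disk;	/* real device */
		bh->b_rsector += start_sect;	/* shift into absolute sectors */
	}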

> > Which is probably fine - we could replace the existing getblk with it, and
> > do only minor fixups. Maybe.
>
> Yep, maybe.  Christoph was looking at trying this idea with vxfs, I should
> ping him on that.

I'd love to hear if somebody can do something like this. It still sounds
very much like a 2.5.x thing (and hopefully I and Alan can merge enough
that it's going to be RSN), but if done right it should be trivial to
back-port to 2.4.x after testing.

It would definitely make all the issues with Andrea's pagecache code just
go away completely.

		Linus


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-09 20:18                                 ` linux-2.4.10-pre5 H. Peter Anvin
@ 2001-09-10  0:39                                   ` Simon Kirby
  2001-09-10  8:30                                     ` linux-2.4.10-pre5 Kai Henningsen
  2001-09-10 21:22                                   ` linux-2.4.10-pre5 Stephen C. Tweedie
  1 sibling, 1 reply; 94+ messages in thread
From: Simon Kirby @ 2001-09-10  0:39 UTC (permalink / raw)
  To: linux-kernel

On Sun, Sep 09, 2001 at 01:18:57PM -0700, H. Peter Anvin wrote:

> The main reason people seem to still justify using dump/restore is --
> believe it or not -- the inability to set atime.  One would think this
> would be a trivial extension to the VFS, even if protected by a
> capability (CAP_BACKUP?).

What do people actually use atime for, anyway?  I've always
noatime/nodiratime'd most servers I've set up because it saves so much
disk I/O, and I have yet to see anything really use it.  I can see that
in some cases it would be useful to turn it _on_ (perhaps for debugging /
removal of unused files, etc.), but it seems silly that the default case
is a situation which on the surface seems dumb (opening a file for read
causes a disk write).

Simon-

[  Stormix Technologies Inc.  ][  NetNation Communications Inc. ]
[       sim@stormix.com       ][       sim@netnation.com        ]
[ Opinions expressed are not necessarily those of my employers. ]

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-10  0:38                                       ` linux-2.4.10-pre5 Linus Torvalds
@ 2001-09-10  1:04                                         ` Andrea Arcangeli
  2001-09-10  1:45                                         ` linux-2.4.10-pre5 Chris Mason
  1 sibling, 0 replies; 94+ messages in thread
From: Andrea Arcangeli @ 2001-09-10  1:04 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Daniel Phillips, Andreas Dilger, Kernel Mailing List

On Sun, Sep 09, 2001 at 05:38:14PM -0700, Linus Torvalds wrote:
> It would definitely make all the issues with Andrea's pagecache code just
> go away completely.

I also recommend to write it on top of the blkdev in pagecache patch
since there I just implemented the "physical address space" abstraction;
I had to write it to make mknod hda and mknod hda.new share the
same cache transparently.

Andrea

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-10  0:38                                       ` linux-2.4.10-pre5 Linus Torvalds
  2001-09-10  1:04                                         ` linux-2.4.10-pre5 Andrea Arcangeli
@ 2001-09-10  1:45                                         ` Chris Mason
  2001-09-10  1:55                                           ` linux-2.4.10-pre5 Andrea Arcangeli
                                                             ` (2 more replies)
  1 sibling, 3 replies; 94+ messages in thread
From: Chris Mason @ 2001-09-10  1:45 UTC (permalink / raw)
  To: Andrea Arcangeli, Linus Torvalds
  Cc: Daniel Phillips, Andreas Dilger, Kernel Mailing List



On Monday, September 10, 2001 03:04:05 AM +0200 Andrea Arcangeli
<andrea@suse.de> wrote:

> On Sun, Sep 09, 2001 at 05:38:14PM -0700, Linus Torvalds wrote:
>> It would definitely make all the issues with Andrea's pagecache code just
>> go away completely.
> 
> I also recommend to write it on top of the blkdev in pagecache patch
> since there I just implemented the "physical address space" abstraction;
> I had to write it to make mknod hda and mknod hda.new share the
> same cache transparently.
> 

Hi guys,

I wrote some code on top of the writepage for all io patch (2.4.2 timeframe)
that implemented getblk_mapping, get_hash_table_mapping and bread_mapping,
which gave the same features as the originals but took an address space as
one of the args.  

The idea is more or less what has been discussed, but it did assume one
blocksize per mapping.  Of course, set_blocksize and invalidate_buffers
were on the todo list ;-)  The only other gotcha was calling
filemap_fdatasync and truncate_inode_pages in put_super, to make sure
things got flushed right.
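
To give an idea of the shape of it - these are from memory rather than
cut from the patch, so treat them as approximate:

	/* assumes linux/fs.h; same semantics as getblk()/get_hash_table(),
	 * but keyed on an address_space instead of a kdev_t */
	struct buffer_head *getblk_mapping(struct address_space *mapping,
					   unsigned long block, int size);
	struct buffer_head *get_hash_table_mapping(struct address_space *mapping,
						   unsigned long block, int size);

	/* and bread_mapping is then the obvious wrapper, just like bread() */
	struct buffer_head *bread_mapping(struct address_space *mapping,
					  unsigned long block, int size)
	{
		struct buffer_head *bh = getblk_mapping(mapping, block, size);

		if (buffer_uptodate(bh))
			return bh;
		ll_rw_block(READ, 1, &bh);
		wait_on_buffer(bh);
		if (buffer_uptodate(bh))
			return bh;
		brelse(bh);
		return NULL;
	}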

Anyway, the whole thing can be cut down to a smallish patch, either alone
or on top of andrea's stuff.  Daniel, if you want to work together on it,
I'm game.

-chris


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-10  1:45                                         ` linux-2.4.10-pre5 Chris Mason
@ 2001-09-10  1:55                                           ` Andrea Arcangeli
  2001-09-10  2:15                                             ` linux-2.4.10-pre5 Daniel Phillips
  2001-09-10  2:02                                           ` linux-2.4.10-pre5 Chris Mason
  2001-09-10  2:03                                           ` linux-2.4.10-pre5 Linus Torvalds
  2 siblings, 1 reply; 94+ messages in thread
From: Andrea Arcangeli @ 2001-09-10  1:55 UTC (permalink / raw)
  To: Chris Mason
  Cc: Linus Torvalds, Daniel Phillips, Andreas Dilger, Kernel Mailing List

On Sun, Sep 09, 2001 at 09:45:01PM -0400, Chris Mason wrote:
> 
> 
> On Monday, September 10, 2001 03:04:05 AM +0200 Andrea Arcangeli
> <andrea@suse.de> wrote:
> 
> > On Sun, Sep 09, 2001 at 05:38:14PM -0700, Linus Torvalds wrote:
> >> It would definitely make all the issues with Andrea's pagecache code just
> >> go away completely.
> > 
> > I also recommend to write it on top of the blkdev in pagecache patch
> > since there I just implemented the "physical address space" abstraction;
> > I had to write it to make mknod hda and mknod hda.new share the
> > same cache transparently.
> > 
> 
> Hi guys,
> 
> I wrote some code on top of the writepage for all io patch (2.4.2 timeframe)
> that implemented getblk_mapping, get_hash_table_mapping and bread_mapping,
> which gave the same features as the originals but took an address space as
> one of the args.  
> 
> The idea is more or less what has been discussed, but it did assume one
> blocksize per mapping.  Of course, set_blocksize and invalidate_buffers

hmm, this sounds a bit odd, there must be only one "physical address
space"; the whole point of the change is to have only one and to kill
all the aliasing issues this way. In turn set_blocksize will also become
nearly a noop and the granularity of the I/O won't matter anymore:
no aliases will be allowed, because the backing store of the
cache (the "physical address space") will be shared by all the users
regardless of the blocksize.

getblk should unconditionally alloc a new bh entity and only care to map
it to the right cache backing store with a pagecache hash lookup.
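
Something like this for the binding step, once getblk has allocated the
bh (only a sketch, ignoring highmem, locking details and refcounting):

	/* assumes linux/fs.h, linux/mm.h, linux/pagemap.h */
	static int bind_bh_to_physical_cache(struct buffer_head *bh,
					     struct address_space *mapping,
					     unsigned long block, unsigned bits)
	{
		unsigned long index  = block >> (PAGE_CACHE_SHIFT - bits);
		unsigned long offset = (block << bits) & ~PAGE_CACHE_MASK;
		struct page *page = grab_cache_page(mapping, index);

		if (!page)
			return -ENOMEM;
		/* the same page backs every user, whatever the blocksize,
		 * so there is nothing left to alias */
		bh->b_page = page;
		bh->b_data = (char *)page_address(page) + offset;
		bh->b_size = 1 << bits;
		bh->b_blocknr = block;
		UnlockPage(page);	/* grab_cache_page returned it locked */
		return 0;
	}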

Andrea

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-10  1:45                                         ` linux-2.4.10-pre5 Chris Mason
  2001-09-10  1:55                                           ` linux-2.4.10-pre5 Andrea Arcangeli
@ 2001-09-10  2:02                                           ` Chris Mason
  2001-09-10  2:06                                             ` linux-2.4.10-pre5 Andrea Arcangeli
  2001-09-10  2:03                                           ` linux-2.4.10-pre5 Linus Torvalds
  2 siblings, 1 reply; 94+ messages in thread
From: Chris Mason @ 2001-09-10  2:02 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Daniel Phillips, Andreas Dilger, Kernel Mailing List



On Monday, September 10, 2001 03:55:19 AM +0200 Andrea Arcangeli
<andrea@suse.de> wrote:

>> The idea is more or less what has been discussed, but it did assume one
>> blocksize per mapping.  Of course, set_blocksize and invalidate_buffers
> 
> hmm, this sounds a bit odd, there must be only one "physical address
> space", the whole point of the change is to have only one and to kill
> all the aliasing issues this way. 

Sorry, I wasn't clear...when I wrote the first patch, your stuff didn't
exist ;-)  The original idea was to generalize it out so any mapping could
be used, yours will do nicely.

> In turn also set_blocksize will be
> nearly a noop and it won't matter anymore the granularity of the I/O,
> there won't be alias allowed anymore because the backing store of the
> cache (the "physical address space") will be shared by all the users
> regardless of the blocksize.
> 
> getblk should unconditionally alloc a new bh entity and only care to map
> it to the right cache backing store with a pagecache hash lookup.

Nods.

-chris


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-10  1:45                                         ` linux-2.4.10-pre5 Chris Mason
  2001-09-10  1:55                                           ` linux-2.4.10-pre5 Andrea Arcangeli
  2001-09-10  2:02                                           ` linux-2.4.10-pre5 Chris Mason
@ 2001-09-10  2:03                                           ` Linus Torvalds
  2001-09-10  2:41                                             ` linux-2.4.10-pre5 Chris Mason
  2 siblings, 1 reply; 94+ messages in thread
From: Linus Torvalds @ 2001-09-10  2:03 UTC (permalink / raw)
  To: Chris Mason
  Cc: Andrea Arcangeli, Daniel Phillips, Andreas Dilger, Kernel Mailing List


On Sun, 9 Sep 2001, Chris Mason wrote:
>
> On Monday, September 10, 2001 03:04:05 AM +0200 Andrea Arcangeli
> <andrea@suse.de> wrote:
> >
> > I also recommend to write it on top of the blkdev in pagecache patch
> > since there I just implemented the "physical address space" abstraction;
> > I had to write it to make mknod hda and mknod hda.new share the
> > same cache transparently.

Agreed - they do share a _lot_ of the issues.

> I some code on top of the writepage for all io patch (2.4.2 timeframe),
> that implemented getblk_mapping, get_hash_table_mapping and bread_mapping,
> which gave the same features as the original but took an address space as
> one of the args.
>
> Anyway, the whole thing can be cut down to a smallish patch, either alone
> or on top of andrea's stuff.  Daniel, if you want to work together on it,
> I'm game.

If you'd be willing to integrate your patches on top of Andrea's (and
massage them to be the _only_ getblk interface) and look what the
resulting thing looks like, I bet that there will be a lot of people
interested in working on any final remaining issues.

The devil is always in the details, but at the same time it ends up being
rather important to get some first "mostly-working" code to base further
refinements on, unless we want to just talk the problem to death ;)

		Linus


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-10  2:02                                           ` linux-2.4.10-pre5 Chris Mason
@ 2001-09-10  2:06                                             ` Andrea Arcangeli
  0 siblings, 0 replies; 94+ messages in thread
From: Andrea Arcangeli @ 2001-09-10  2:06 UTC (permalink / raw)
  To: Chris Mason
  Cc: Linus Torvalds, Daniel Phillips, Andreas Dilger, Kernel Mailing List

On Sun, Sep 09, 2001 at 10:02:52PM -0400, Chris Mason wrote:
> Sorry, I wasn't clear...when I wrote the first patch, your stuff didn't
> exist ;-)  The original idea was to generalize it out so any mapping could
> be used, yours will do nicely.

ok ;)

Andrea

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-10  1:55                                           ` linux-2.4.10-pre5 Andrea Arcangeli
@ 2001-09-10  2:15                                             ` Daniel Phillips
  2001-09-10  2:20                                               ` linux-2.4.10-pre5 Chris Mason
  2001-09-10  2:22                                               ` linux-2.4.10-pre5 Daniel Phillips
  0 siblings, 2 replies; 94+ messages in thread
From: Daniel Phillips @ 2001-09-10  2:15 UTC (permalink / raw)
  To: Andrea Arcangeli, Chris Mason
  Cc: Linus Torvalds, Andreas Dilger, Kernel Mailing List

On September 10, 2001 03:55 am, Andrea Arcangeli wrote:
> getblk should unconditionally alloc a new bh entity and only care to map
> it to the right cache backing store with a pagecache hash lookup.

To feel anything like the original the new getblk has to be idempotent: 
subsequent calls return the same block.

Another issue: we need to be able to find the appropriate mapping for the 
getblk.  That means either looking it up in yet another [major][minor] vector 
or changing the type of the argument to getblk, which is not as painful as it 
sounds because it isn't used in very many places any more.
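
For the vector approach, something as dumb as this would do (purely
hypothetical, just to show the shape of it):

	/* assumes linux/fs.h, linux/major.h; filled in at set_blocksize or
	 * device-open time */
	static struct address_space *blkdev_mapping[MAX_BLKDEV][256];

	static inline struct address_space *mapping_for(kdev_t dev)
	{
		return blkdev_mapping[MAJOR(dev)][MINOR(dev)];
	}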

--
Daniel

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-10  2:15                                             ` linux-2.4.10-pre5 Daniel Phillips
@ 2001-09-10  2:20                                               ` Chris Mason
  2001-09-10  2:40                                                 ` linux-2.4.10-pre5 Daniel Phillips
  2001-09-10  3:02                                                 ` linux-2.4.10-pre5 Chris Mason
  2001-09-10  2:22                                               ` linux-2.4.10-pre5 Daniel Phillips
  1 sibling, 2 replies; 94+ messages in thread
From: Chris Mason @ 2001-09-10  2:20 UTC (permalink / raw)
  To: Daniel Phillips, Andrea Arcangeli
  Cc: Linus Torvalds, Andreas Dilger, Kernel Mailing List



On Monday, September 10, 2001 04:22:25 AM +0200 Daniel Phillips
<phillips@bonn-fries.net> wrote:

> On September 10, 2001 04:15 am, Daniel Phillips wrote:
>> On September 10, 2001 03:55 am, Andrea Arcangeli wrote:
>> > getblk should unconditionally alloc a new bh entity and only care to
>> > map it to the right cache backing store with a pagecache hash lookup.
>> 
>> To feel anything like the original the new getblk has to be idempotent: 
>> subsequent calls return the same block.
> 
> Err, buffer_head

How about subsequent calls for the same offset with the same blocksize need
to return the same buffer head?

-chris



^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-10  2:15                                             ` linux-2.4.10-pre5 Daniel Phillips
  2001-09-10  2:20                                               ` linux-2.4.10-pre5 Chris Mason
@ 2001-09-10  2:22                                               ` Daniel Phillips
  1 sibling, 0 replies; 94+ messages in thread
From: Daniel Phillips @ 2001-09-10  2:22 UTC (permalink / raw)
  To: Andrea Arcangeli, Chris Mason
  Cc: Linus Torvalds, Andreas Dilger, Kernel Mailing List

On September 10, 2001 04:15 am, Daniel Phillips wrote:
> On September 10, 2001 03:55 am, Andrea Arcangeli wrote:
> > getblk should unconditionally alloc a new bh entity and only care to map
> > it to the right cache backing store with a pagecache hash lookup.
> 
> To feel anything like the original the new getblk has to be idempotent: 
> subsequent calls return the same block.

Err, buffer_head

--
Daniel

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-10  2:20                                               ` linux-2.4.10-pre5 Chris Mason
@ 2001-09-10  2:40                                                 ` Daniel Phillips
  2001-09-10  3:02                                                 ` linux-2.4.10-pre5 Chris Mason
  1 sibling, 0 replies; 94+ messages in thread
From: Daniel Phillips @ 2001-09-10  2:40 UTC (permalink / raw)
  To: Chris Mason, Andrea Arcangeli
  Cc: Linus Torvalds, Andreas Dilger, Kernel Mailing List

On September 10, 2001 04:20 am, Chris Mason wrote:
> On Monday, September 10, 2001 04:22:25 AM +0200 Daniel Phillips
> <phillips@bonn-fries.net> wrote:
> 
> > On September 10, 2001 04:15 am, Daniel Phillips wrote:
> >> On September 10, 2001 03:55 am, Andrea Arcangeli wrote:
> >> > getblk should unconditionally alloc a new bh entity and only care to
> >> > map it to the right cache backing store with a pagecache hash lookup.
> >> 
> >> To feel anything like the original the new getblk has to be idempotent: 
> >> subsequent calls return the same block.
> > 
> > Err, buffer_head
> 
> How about subsequent calls for the same offset with the same blocksize need
> to return the same buffer head?

Are we picking nits?  Better add "the same dev" and "until the buffer head is 
freed" ;-)

<Attempting to add some content> I've always felt that passing the blocksize 
and hashing on it is just a little bizarre.  This means for example that you 
can have two 2K buffers overlaid on a 4K buffer, and they will be entirely 
non-coherent. If we substitute "mapping" where "dev" goes we wouldn't need 
the size any more.  We could still take it and just BUG if it doesn't match 
the mapping.

This is bearing in mind Linus's comments about the likelihood that we might 
see mixed block sizes in a single mapping sometime in the future.

--
Daniel

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-10  2:03                                           ` linux-2.4.10-pre5 Linus Torvalds
@ 2001-09-10  2:41                                             ` Chris Mason
  0 siblings, 0 replies; 94+ messages in thread
From: Chris Mason @ 2001-09-10  2:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrea Arcangeli, Daniel Phillips, Andreas Dilger, Kernel Mailing List



On Sunday, September 09, 2001 07:03:46 PM -0700 Linus Torvalds
<torvalds@transmeta.com> wrote:

>> Anyway, the whole thing can be cut down to a smallish patch, either alone
>> or on top of andrea's stuff.  Daniel, if you want to work together on it,
>> I'm game.
> 
> If you'd be willing to integrate your patches on top of andreas (and
> massage them to be the _only_ getblk interface) and look what the
> resulting thing looks like, I bet that there will be a lot of people
> interested in working on any final remaining issues.

Ok, the getblk interface wasn't really tied to the rest of the patch, so it
should be easy to cut out.  I'll give it a go a little later in the week,
I'm still catching up from sick days last week ;-)

-chris


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-10  2:20                                               ` linux-2.4.10-pre5 Chris Mason
  2001-09-10  2:40                                                 ` linux-2.4.10-pre5 Daniel Phillips
@ 2001-09-10  3:02                                                 ` Chris Mason
  2001-09-10  3:36                                                   ` linux-2.4.10-pre5 Daniel Phillips
  2001-09-10 19:06                                                   ` linux-2.4.10-pre5 Chris Mason
  1 sibling, 2 replies; 94+ messages in thread
From: Chris Mason @ 2001-09-10  3:02 UTC (permalink / raw)
  To: Daniel Phillips, Andrea Arcangeli
  Cc: Linus Torvalds, Andreas Dilger, Kernel Mailing List



On Monday, September 10, 2001 04:40:21 AM +0200 Daniel Phillips
<phillips@bonn-fries.net> wrote:

>> How about subsequent calls for the same offset with the same blocksize
>> need to return the same buffer head?
> 
> Are we picking nits?  Better add "the same dev" and "until the buffer
> head is  freed" ;-)

;-)  Really, wasn't trying for that.  If we just say later calls for the
same offset, we get in trouble later on if we also want variable, very
large blocksizes.  If we relax the rules to allow multiple buffer heads for
the same physical spot on disk, things get easier, and the FS is
responsible for not doing something stupid with it.  

The data is still consistent either way, there are just multiple io handles.

-chris


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-10  3:02                                                 ` linux-2.4.10-pre5 Chris Mason
@ 2001-09-10  3:36                                                   ` Daniel Phillips
  2001-09-10 19:06                                                   ` linux-2.4.10-pre5 Chris Mason
  1 sibling, 0 replies; 94+ messages in thread
From: Daniel Phillips @ 2001-09-10  3:36 UTC (permalink / raw)
  To: Chris Mason, Andrea Arcangeli
  Cc: Linus Torvalds, Andreas Dilger, Kernel Mailing List

On September 10, 2001 05:02 am, Chris Mason wrote:
> On Monday, September 10, 2001 04:40:21 AM +0200 Daniel Phillips
> <phillips@bonn-fries.net> wrote:
> 
> >> How about subsequent calls for the same offset with the same blocksize
> >> need to return the same buffer head?
> > 
> > Are we picking nits?  Better add "the same dev" and "until the buffer
> > head is  freed" ;-)
> 
> ;-)  Really, wasn't trying for that.  If we just say later calls for the
> same offset, we get in trouble later on if we also want variable, very
> large blocksizes.

Well, I really wonder if buffers are the right transport medium for variable, 
large blocks, aka extents.  Personally, I think buffers will have disappeared 
or mutated unrecognizably by the time we get around to adding extents to ext2 
or its descendants.  Note that XFS already implements extents on Linux, by 
mapping them onto the pagecache I believe.

> If we relax the rules to allow multiple buffer heads for
> the same physical spot on disk, things get easier, and the FS is
> responsible for not doing something stupid with it.  
> 
> The data is still consistent either way, there are just multiple io handles.

Were you thinking of one mapping for all buffers on a given partition?  If 
so, how did you plan to handle different buffer sizes?  Were you planning to 
keep the existing buffer hash chain or use the page cache hash chain, as I 
did for ext2_getblk?

--
Daniel

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-10  0:39                                   ` linux-2.4.10-pre5 Simon Kirby
@ 2001-09-10  8:30                                     ` Kai Henningsen
  2001-09-11  5:29                                       ` linux-2.4.10-pre5 Peter Samuelson
  0 siblings, 1 reply; 94+ messages in thread
From: Kai Henningsen @ 2001-09-10  8:30 UTC (permalink / raw)
  To: linux-kernel

sim@netnation.com (Simon Kirby)  wrote on 09.09.01 in <20010909173947.A20202@netnation.com>:

> What do people actually use atime for, anyway?  I've always
> noatime/nodiratime'd most servers I've set up because it saves so much
> disk I/O, and I have yet to see anything really use it.  I can see that
> in some cases it would be useful to turn it _on_ (perhaps for debugging /
> removal of unused files, etc.), but it seems silly that the default case
> is a situation which on the surface seems dumb (opening a file for read
> causes a disk write).

I see two possible atime uses:

1. Cleaning up /tmp (mtime is *not* a good indicator that a file is no  
longer used)
2. Swapping out files to slower storage

Essentially, both use the "do we still need this thing" aspect.

Of course, for this to be useful, we really need programs to be able to  
say "ignore my use of this file". tar --atime-preserve, for example  
(which, incidentally, notes a technical problem with doing this). I'd also  
add stuff like files or md5sum or similar diagnostic tools to the "might  
want to not affect atime" list, and possibly also stuff like inn ("atime  
on /var/spool/news is just silly").

HOWEVER, I'd think what would really be nice for this would be an  
O_NOATIME flag (which does enforce read-only operation, of course), and  
not fixing the atime back afterwards (inherently racy, but even more so by  
design of utimes and ctime).

That flag would also be fairly easy to detect with autoconf or similar  
functionality.
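
In userspace it would end up looking roughly like this - the flag doesn't
exist today, which is exactly why the #ifdef doubles as the autoconf-style
detection (the fallback semantics are my assumption):

	#include <errno.h>
	#include <fcntl.h>

	int open_for_backup(const char *path)
	{
	#ifdef O_NOATIME
		int fd = open(path, O_RDONLY | O_NOATIME);
		/* fall back if the kernel refuses it (e.g. not our file) */
		if (fd >= 0 || errno != EPERM)
			return fd;
	#endif
		return open(path, O_RDONLY);	/* atime gets touched, as today */
	}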

MfG Kai

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-10  3:02                                                 ` linux-2.4.10-pre5 Chris Mason
  2001-09-10  3:36                                                   ` linux-2.4.10-pre5 Daniel Phillips
@ 2001-09-10 19:06                                                   ` Chris Mason
  1 sibling, 0 replies; 94+ messages in thread
From: Chris Mason @ 2001-09-10 19:06 UTC (permalink / raw)
  To: Daniel Phillips, Andrea Arcangeli
  Cc: Linus Torvalds, Andreas Dilger, Kernel Mailing List



On Monday, September 10, 2001 05:36:07 AM +0200 Daniel Phillips
<phillips@bonn-fries.net> wrote:

> Well, I really wonder if buffers are the right transport medium for
> variable,  large blocks, aka extents.  Personally, I think buffers will
> have disappeared  or mutated unrecognizably by the time we get around to
> adding extents to ext2  or its descendants.  Note that XFS already
> implements extents on Linux, by  mapping them onto the pagecache I
> believe.

Yes, JFS as well, but this is somewhat unrelated to using buffers as io
handles.

> 
>> If we relax the rules to allow multiple buffer heads for
>> the same physical spot on disk, things get easier, and the FS is
>> responsible for not doing something stupid with it.  
>> 
>> The data is still consistent either way, there are just multiple io
>> handles.
> 
> Were you thinking of one mapping for all buffers on a given partition?

one mapping for all buffers accessed through getblk, get_hash_table, bread,
and the block device.

> If  so, how did you plan to handle different buffer sizes?  Were you
> planning to  keep the existing buffer hash chain or use the page cache
> hash chain, as I  did for ext2_getblk?

It would be through page->buffers.  Of course, this part is entirely
uncoded, but the nice thing about an address space on top of the physical
device is that bh->b_blocknr and bh->b_size directly compute to the offset
in the device.  This means the order of the buffers on page->buffers does
not have to be related to the location of the data they point to.

So, along the lines of what Andrea suggested, buffer heads can be allocated
as needed, regardless of the blocksizes already in use in the page.  The
current blkdev page cache code would need to be updated to use this idea,
but it should work.
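
In other words (sketch):

	/* With one physical address space per device, a bh's place in the
	 * device follows from its own fields, independent of where it sits
	 * on the page->buffers ring (assumes linux/fs.h). */
	static inline loff_t bh_dev_offset(struct buffer_head *bh)
	{
		return (loff_t)bh->b_blocknr * bh->b_size;
	}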

-chris


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-09 23:24                                   ` linux-2.4.10-pre5 Linus Torvalds
  2001-09-09 23:54                                     ` linux-2.4.10-pre5 Alan Cox
  2001-09-10  0:23                                     ` linux-2.4.10-pre5 Daniel Phillips
@ 2001-09-10 21:18                                     ` Daniel Phillips
  2001-09-10 21:23                                       ` linux-2.4.10-pre5 Alex Bligh - linux-kernel
  2001-09-10 22:15                                       ` linux-2.4.10-pre5 Linus Torvalds
  2 siblings, 2 replies; 94+ messages in thread
From: Daniel Phillips @ 2001-09-10 21:18 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andreas Dilger, Andrea Arcangeli, Kernel Mailing List

On September 10, 2001 01:24 am, Linus Torvalds wrote:
> On Sun, 9 Sep 2001, Daniel Phillips wrote:
> >						  A spin-off benefit
> > is, the same mechanism could be used to implement a physical readahead
> > cache which can do things that logical readahead can't.
> 
> Considering that 99.9% of all disks do this on a lower hardware layer
> anyway, I very much doubt it has any good properties to make software more
> complex by having that kind of readahead in sw.

Here's some anecdotal evidence to the contrary.

This machine requires about 1.5 seconds to diff two kernel trees if both 
trees are in cache.  If neither tree is in cache it takes 90 seconds.  It's a 
total of about 300M of source - reading that into memory should take about 10 
seconds at 30M/sec, taking one pass across the disk and assuming no extensive 
fragmentation.

We lost 78.5 seconds somewhere.  From the sound of the disk drives, I'd say 
we lost it to seeking, which physical readahead with a large cache would be 
able to largely eliminate in this case.

--
Daniel

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-09 20:18                                 ` linux-2.4.10-pre5 H. Peter Anvin
  2001-09-10  0:39                                   ` linux-2.4.10-pre5 Simon Kirby
@ 2001-09-10 21:22                                   ` Stephen C. Tweedie
  1 sibling, 0 replies; 94+ messages in thread
From: Stephen C. Tweedie @ 2001-09-10 21:22 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-kernel, Stephen Tweedie

Hi,

On Sun, Sep 09, 2001 at 01:18:57PM -0700, H. Peter Anvin wrote:

> The ideal way to run backups I have found is on filesystems which
> support atomic snapshots -- that way, your backup set becomes not only
> safe (since it goes through the kernel etc. etc.) but totally
> coherent, since it is guaranteed to be unchanging.  This is a major
> win for filesystems which can do atomic snapshots, and I'd highly
> encourage filesystem developers to consider this feature.

It's already done.  LVM's snapshot facility has the ability to ask the
filesystem to quiesce itself into a consistent state before the
snapshot is taken, and both Reiserfs and ext3 support that function.

Cheers,
 Stephen

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-10 21:18                                     ` linux-2.4.10-pre5 Daniel Phillips
@ 2001-09-10 21:23                                       ` Alex Bligh - linux-kernel
  2001-09-10 21:54                                         ` linux-2.4.10-pre5 Daniel Phillips
  2001-09-10 22:15                                       ` linux-2.4.10-pre5 Linus Torvalds
  1 sibling, 1 reply; 94+ messages in thread
From: Alex Bligh - linux-kernel @ 2001-09-10 21:23 UTC (permalink / raw)
  To: Daniel Phillips, Linus Torvalds
  Cc: Andreas Dilger, Andrea Arcangeli, Kernel Mailing List,
	Alex Bligh - linux-kernel



--On Monday, 10 September, 2001 11:18 PM +0200 Daniel Phillips 
<phillips@bonn-fries.net> wrote:

> We lost 78.5 seconds somewhere.  From the sound of the disk drives, I'd
> say  we lost it to seeking, which physical readahead with a large cache
> would be  able to largely eliminate in this case.

So you only get a large % of your 78.5 seconds back
from pure physical readahead if the files and metadata
are (a) ordered in monitonically increasing disk
locations w.r.t. access order, and (b) physically close (no time
wasted reading in irrelevant data), or you apply some
form of clairvoyance patch :-) An alternative benchmark
would be to dd the /entire/ disk into RAM, then
run your diff on that, and I bet you get the
opposite result.

More serious point: if we retain readahead at a
logical level, you get it for non-physical
files too (e.g. NFS) - I presume this is
the intention. If so, what advantage does
additional physical readahead give you,
given physical ordering is surely no better
(and probably worse than) logical ordering.

--
Alex Bligh

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-10 21:23                                       ` linux-2.4.10-pre5 Alex Bligh - linux-kernel
@ 2001-09-10 21:54                                         ` Daniel Phillips
  2001-09-10 22:39                                           ` linux-2.4.10-pre5 Alex Bligh - linux-kernel
  0 siblings, 1 reply; 94+ messages in thread
From: Daniel Phillips @ 2001-09-10 21:54 UTC (permalink / raw)
  To: Alex Bligh - linux-kernel, Linus Torvalds
  Cc: Andreas Dilger, Andrea Arcangeli, Kernel Mailing List,
	Alex Bligh - linux-kernel

On September 10, 2001 11:23 pm, Alex Bligh - linux-kernel wrote:
> --On Monday, 10 September, 2001 11:18 PM +0200 Daniel Phillips 
> <phillips@bonn-fries.net> wrote:
> 
> > We lost 78.5 seconds somewhere.  From the sound of the disk drives, I'd
> > say  we lost it to seeking, which physical readahead with a large cache
> > would be  able to largely eliminate in this case.
> 
> So you only get a large % of your 78.5 seconds back
> from pure physical readahead if the files and metadata
> are (a) ordered in monotonically increasing disk
> locations w.r.t. access order, and

Well, no, that's one of the benefits of physical readahead: it is far less 
sensitive to disk layout than file-based readahead.

> (b) physically close (no time wasted reading in irrelevant data),

You mean, well clustered, which it is.  These trees were created by an untar 
onto an otherwise relatively young filetree.  I could repeat the experiment 
on a partition with nothing but this source on it.  I would expect similar 
results.

> or you apply some
> form of clairvoyance patch :-) An alternative benchmark
> would be do dd the /entire/ disk into RAM, then
> run your diff on that, and I bet you get the
> opposite result.

Doesn't this just support what I'm saying?  Anyway, the dd trick won't work 
because the page cache has no way of using the contents of the buffer cache.  
It has to re-read it even though it's already in memory - this is ugly, don't 
you agree?  When we make dd use the page cache as with Andrea's patch it 
still won't work because the page cache will still ignore blocks that already 
exist in the buffer cache mapping.

Because we're still using the buffer cache for metadata there will be some 
improvement with the dd trick, but not enormous.  I can't take the time to 
test it right now, but I will.

If we implement the physical block reverse mapping as I described in an 
earlier post then the dd technique will work fine.

> More serious point: if we retain readahead at a
> logical level, you get it for non-physical
> files too (e.g. NFS) - I presume this is
> the intention. If so, what advantage does
> additional physical readahead give you,

Well, you can see from my demonstration that logical readahead sucks 
badly for this common case.  I did mention that the two can coexist, didn't 
I?  In fact they already do on any system with a caching disk controller.  
Just not enough to handle two kernel trees.

> given physical ordering is surely no better
> (and probably worse than) logical ordering.

I'd have a good think about that 'surely'.

--
Daniel

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-10 21:18                                     ` linux-2.4.10-pre5 Daniel Phillips
  2001-09-10 21:23                                       ` linux-2.4.10-pre5 Alex Bligh - linux-kernel
@ 2001-09-10 22:15                                       ` Linus Torvalds
  2001-09-10 22:26                                         ` linux-2.4.10-pre5 Linus Torvalds
                                                           ` (2 more replies)
  1 sibling, 3 replies; 94+ messages in thread
From: Linus Torvalds @ 2001-09-10 22:15 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Andreas Dilger, Andrea Arcangeli, Kernel Mailing List


On Mon, 10 Sep 2001, Daniel Phillips wrote:
>
> Here's some anecdotal evidence to the contrary.
>
> This machine requires about 1.5 seconds to diff two kernel trees if both
> trees are in cache.  If neither tree is in cache it takes 90 seconds.  It's a
> total of about 300M of source - reading that into memory should take about 10
> seconds at 30M/sec, taking one pass across the disk and assuming no extensive
> fragmentation.
>
> We lost 78.5 seconds somewhere.  From the sound of the disk drives, I'd say
> we lost it to seeking, which physical readahead with a large cache would be
> able to largely eliminate in this case.

Yes, we could have a huge physical read-ahead, and hope that the logical
layout is such that consecutive files in the directory are also
consecutive on disk (which is quite often true).

And yes, doing a cold "diff" is about the worst case - we can't take
advantage of logical read-ahead within the files themselves (they tend to
be too small for read-ahead to matter on that level), and the IO is
bouncing back and forth between two different trees - and thus most likely
two very different places on the disk.

And even when the drive does physical read-ahead, a drive IO window of
64kB-256kB (and let's assume about 50% of that is actually _ahead_ of the
read) is not going to avoid the constant back-and-forth seeking when the
combined size of the two kernel trees is in the 50MB region.

[ There are also drives that just aren't very good at handling their
  internal caches. You'll see drives that have a 2MB on-board buffer, but
  the way the buffer is managed it might be used in fixed chunks. Some of
  the really worst ones only have a single buffer - so seeking back and
  forth just trashes the drive buffer completely. That's rather unusual,
  though. ]

However, physical read-ahead really isn't the answer here. I bet you could
cut your time down with it, agreed. But you'd hurt a lot of other loads,
and it really depends on nice layout on disk. Plus you wouldn't even know
where to put the data you read-ahead: you only have the physical address,
not the logical address, and the page-cache is purely logically indexed..

The answer to this kind of thing is to try to make the "diff" itself be
nicer on the cache. I bet you could speed up diff a lot by having it read
in multiple files in one go if you really wanted to. It probably isn't
worth most peoples time.

(Ugly secret: because I tend to have tons of memory, I sometimes do

	find tree1 tree2 -type f | xargs cat > /dev/null

just after I have rebooted, just so that I always have my kernel trees in
the cache - after that they tend to stay there.. Having a gig of ram
makes you do stupid things).

			Linus


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-10 22:15                                       ` linux-2.4.10-pre5 Linus Torvalds
@ 2001-09-10 22:26                                         ` Linus Torvalds
  2001-09-10 22:39                                         ` linux-2.4.10-pre5 Rik van Riel
  2001-09-10 23:20                                         ` linux-2.4.10-pre5 Daniel Phillips
  2 siblings, 0 replies; 94+ messages in thread
From: Linus Torvalds @ 2001-09-10 22:26 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Andreas Dilger, Andrea Arcangeli, Kernel Mailing List


Btw, with IDE disks, you might try to enable the drive lookahead by hand -
whether it is enabled by default or not may depend on the drive. Try if
"hdparm -A1 /dev/hdX" makes a difference for you.

(It probably doesn't, simply because it's probably already enabled, and
even if it isn't, the lookahead buffer isn't likely to be big enough to
help that much with the fundamental seeking problem in doing a recursive
diff).

		Linus


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-10 22:15                                       ` linux-2.4.10-pre5 Linus Torvalds
  2001-09-10 22:26                                         ` linux-2.4.10-pre5 Linus Torvalds
@ 2001-09-10 22:39                                         ` Rik van Riel
  2001-09-10 23:14                                           ` linux-2.4.10-pre5 Daniel Phillips
  2001-09-11 10:02                                           ` linux-2.4.10-pre5 Daniel Phillips
  2001-09-10 23:20                                         ` linux-2.4.10-pre5 Daniel Phillips
  2 siblings, 2 replies; 94+ messages in thread
From: Rik van Riel @ 2001-09-10 22:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Daniel Phillips, Andreas Dilger, Andrea Arcangeli, Kernel Mailing List

On Mon, 10 Sep 2001, Linus Torvalds wrote:

> (Ugly secret: because I tend to have tons of memory, I sometimes do
>
> 	find tree1 tree2 -type f | xargs cat > /dev/null

This suggests we may want to do aggressive readahead on the
inode blocks.

They are small enough to - mostly - cache and should reduce
the amount of disk seeks quite a bit. In an 8 MB block group
with one 128 byte inode every 8 kB, we have a total of 128 kB
of inodes...
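
A quick, untested sketch of what that could look like for ext2 (helper
name made up, locking and error handling ignored):

	/* assumes linux/fs.h, linux/ext2_fs.h; pulls one block group's inode
	 * table into the buffer cache with READA so we don't block on it */
	static void ext2_readahead_inode_table(struct super_block *sb, int group)
	{
		struct ext2_group_desc *gdp = ext2_get_group_desc(sb, group, NULL);
		unsigned long i, blocks, start;

		if (!gdp)
			return;
		blocks = EXT2_INODES_PER_GROUP(sb) / EXT2_INODES_PER_BLOCK(sb);
		start  = le32_to_cpu(gdp->bg_inode_table);

		for (i = 0; i < blocks; i++) {
			struct buffer_head *bh = getblk(sb->s_dev, start + i,
							EXT2_BLOCK_SIZE(sb));
			if (!buffer_uptodate(bh))
				ll_rw_block(READA, 1, &bh);
			brelse(bh);	/* leave it cached */
		}
	}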

regards,

Rik
--
IA64: a worthy successor to the i860.

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-10 21:54                                         ` linux-2.4.10-pre5 Daniel Phillips
@ 2001-09-10 22:39                                           ` Alex Bligh - linux-kernel
  2001-09-10 23:13                                             ` linux-2.4.10-pre5 Daniel Phillips
  0 siblings, 1 reply; 94+ messages in thread
From: Alex Bligh - linux-kernel @ 2001-09-10 22:39 UTC (permalink / raw)
  To: Daniel Phillips, Alex Bligh - linux-kernel, Linus Torvalds
  Cc: Andreas Dilger, Andrea Arcangeli, Kernel Mailing List,
	Alex Bligh - linux-kernel

Daniel,

(apologies for reorder)

> Doesn't this just support what I'm saying?  Anyway, the dd trick won't
> work  because the page cache has no way of using the contents of the
> buffer cache.

You have nicely illustrated an assumption of mine which was:

>   It has to re-read it even though it's already in memory -
> this is ugly, don't  you agree?  When we make dd use the page cache as
> with Andrea's patch it  still won't work because the page cache will
> still ignore blocks that already  exist in the buffer cache mapping.

Doing logical as opposed to physical readahead is at least theoretically
orthogonal to whether or not caching 'product' is available per
reader/file or on a system wide basis.

You could look at this problem the other way around, and firstly
say "what we have read, we shall cache, and we shall not restrict
this to the reader who caused it to be read; and what we guess
is going to be read by file we shall read in preparation [1]". And
/then/ say, once you've elevator sorted the lot, that if we
have spare I/O bandwidth etc. etc., that it might make sense
to extend each portion of the elevator vector, especially if
it coalesces entries together. (But no point caching stuff twice.)

[1]=because I am assuming that this has wider applicability (NFS etc.)
and better results (/alone/) than physical readahead, for most
applications, not least as you can apply more intelligence at
that level. And I am assuming any extra benefits of physical readahead
(i.e. beyond those that would be given by an intelligent form
of logical readahead whose results were shared) are done by the disk.

> If we implement the physical block reverse mapping as I described in an
> earlier post then the dd technique will work fine.

I think that is nearly isomorphic to what I'm saying, but I'm not
talking about improving just the benchmark.

Example: when writing read-ahead caching for a different OS (admittedly
a while ago), we discovered that physical readahead was useful
(surprise surprise) but ended up eating memory. [This was an
environment where the readahead cache (and I/O activity) could
live on a different processor(s) and disk I/O b/w didn't reduce
other I/O b/w. In some senses this is a little bit similar to
SMP. We had some memory to give away, but limited]. We discovered,
somewhat to our surprise, that if you trapped directory enumeration,
and read (in the background) the first few blocks of every file therein,
you got vast performance increases over straight readahead. This
turned out to be because this was exactly what many applications
did. And we also saw the gain over non-physical filesystems. Now
I have no idea whether this is the case under Linux too (almost
a decade later), but all I'm saying is by the time you've got down
to the block number layer, (a) you've thrown away important
information, and (b) the device manufacturer has already thought
of that, and gone and done it in hardware. There's no way, however,
they could have done the directory readahead in h/w.

>> More serious point: if we retain readahead at a
>> logical level, you get it for non-physical
>> files too (e.g. NFS) - I presume this is
>> the intention. If so, what advantage does
>> additional physical readahead give you,
>
> Well, you can see from my demonstration that logical readahead sucks
> badly for this common case.

No, I see you correctly pointing out that /current/
logical readahead can suck as the page cache cannot
use the contents of the buffer cache.

> I did mention that the two can coexist,
> didn't  I?  In fact they already do on any system with a caching disk
> controller.   Just not enough to handle two kernel trees.

All I'm saying is we already have 2 layers of cache: disk
controller, and logical readahead (though it seems the
latter needs fixing). If you add a third between them
(in-kernel physical caching), it had better have some intelligence
(i.e. information), or ability (your suggestion: proximity
of well clustered files) sufficient to justify its presence.

>> given physical ordering is surely no better
>> (and probably worse than) logical ordering.
>
> I'd have a good think about that 'surely'.

By 'better' I meant 'gives no more information to a
caching algorithm, given we have it at the physical
layer (disk) anyway'.

What your benchmark seems to argue is that the disk's
physical readahead cache needs to be supplemented. I don't
necessarily disagree. But what readahead, and what cache?

Extraneous nits:

>> (a) ordered in monotonically increasing disk
>> locations w.r.t. access order, and
>
> Well, no, that's one of the benefits of physical readahead: it is far
> less  sensitive to disk layout than file-based readahead.

1. Hang on, if it's laid out like that on disk, the disk will have
   (physically) read ahead into its onboard cache, but...

2. Assume no (or ineffective) disk cache. You miss the point. If the
   data is stored on disk in a different physical order from that in
   which it is accessed, not only will you seek, but you will now
   read unnecessary data. As we are picking the kernel as our example
   du jour, consider untar linux kernel to disk, reboot machine,
   compile. Now are all those .c files, and (worse) .h files read
   in the same order they were written to disk? Sure, you may accidentally
   end up reading ahead data that you need later, but 150Mb is a pretty
   large data set to iterate over; how much of the job does the on-disk
   cache do for you? And where is the rest best done? We have finite
   amounts of memory, and I/O b/w, and using them maximally intelligently
   is a good thing. Your argument before was that I/O b/w was cheap,
   and memory was cheap. So should we make a RAM image of the entire
   disk? No. Selecting the bits to transfer with maximum intelligence
   is the key.

>> (b) physically close (no time wasted reading in irrelevant data),
>
> You mean, well clustered, which it is.

Because (as you say below) you untarred it clean. My point
was that it needs to be both clustered, and accessed in the
same order, more than once. And I wonder how often this
occurs (yes, tar then diff is a valid example).

--
Alex Bligh

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-10 22:39                                           ` linux-2.4.10-pre5 Alex Bligh - linux-kernel
@ 2001-09-10 23:13                                             ` Daniel Phillips
  2001-09-10 23:25                                               ` linux-2.4.10-pre5 Alex Bligh - linux-kernel
  0 siblings, 1 reply; 94+ messages in thread
From: Daniel Phillips @ 2001-09-10 23:13 UTC (permalink / raw)
  To: Alex Bligh - linux-kernel, Linus Torvalds
  Cc: Andreas Dilger, Andrea Arcangeli, Kernel Mailing List,
	Alex Bligh - linux-kernel

On September 11, 2001 12:39 am, Alex Bligh - linux-kernel wrote:
> You could look at this problem the other way around, and firstly
> say "what we have read, we shall cache, and we shall not restrict
> this to the reader who caused it to be read; and what we guess
> is going to be read by file we shall read in preparation [1]". And
> /then/ say, once you've elevator sorted the lot, that if we
> have spare I/O bandwidth etc. etc., that it might make sense
> to extend each portion of the elevator vector, especially if
> it coalesces entries together. (But no point caching stuff twice.)

Yes, that's basically what I had in mind.  The extension of the elevator 
vector is the physical readahead.  Except I had in mind to turn it around, 
and have the logical readahead not do the readahead itself, but just guide 
the operation of the physical readahead.

> [1]=because I am assuming that this has wider applicability (NFS etc.)
> and better results (/alone/) than physical readahead, for most
> applications, not least as you can apply more intelligence at
> that level. And I am assuming any extra benefits of physical readahead
> (i.e. beyond those that would be given by an intelligent form
> of logical readahead whose results were shared) are done by the disk.
> 
> > If we implement the physical block reverse mapping as I described in an
> > earlier post then the dd technique will work fine.
> 
> I think that is nearly isomorphic to what I'm saying, but I'm not
> talking about improving just the benchmark.

We're on the same wavelength now.

> Example: writing read-ahead caching on a different OS (admittedly
> a while ago) we discovered that physical readahead was useful
> (surprise surprise) but ended up eating memory. [This was an
> environment where the readahead cache (and I/O activity) could
> live on a different processor(s) and disk I/O b/w didn't reduce
> other I/O b/w. In some senses this is a little bit similar to
> SMP. We had some memory to give away, but limited]. We discovered,
> somewhat to our surprise, that if you trapped directory enumeration,
> and read (in the background) the first few blocks of every file therein,
> you got vast performance increases over straight readahead. This
> turned out to be because this was exactly what many applications
> did. And we also saw the gain over non-physical filesystems. Now
> I have no idea whether this is the case under Linux too (almost
> a decade later), but all I'm saying is by the time you've got down
> to the block number layer, (a) you've thrown away important
> information,

Don't throw away that information then, use it as hints to guide the physical 
readahead.

> and (b) the device manufacturer has already thought
> of that, and gone done it in hardware. There's no way, however,
> they could have done the directory readahead in h/w.
> 
> >> More serious point: if we retain readahead at a
> >> logical level, you get it for non-physical
> >> files too (e.g. NFS) - I presume this is
> >> the intention. If so, what advantage does
> >> additional physical readahead give you,
> >
> > Well, you can see from my demonstration that logical readahead sucks
> > badly for this common case.
> 
> No, I see you correctly pointing out that /current/
> logical readahead can suck as the page cache cannot
> use the contents of the buffer cache.

OK, now to shorten this up, if you've reached the conclusion that the page 
cache needs to be able to take advantage of blocks already read into the 
buffer cache then we're done.  That was my point all along.

--
Daniel

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-10 22:39                                         ` linux-2.4.10-pre5 Rik van Riel
@ 2001-09-10 23:14                                           ` Daniel Phillips
  2001-09-10 23:16                                             ` linux-2.4.10-pre5 Linus Torvalds
  2001-09-11 10:02                                           ` linux-2.4.10-pre5 Daniel Phillips
  1 sibling, 1 reply; 94+ messages in thread
From: Daniel Phillips @ 2001-09-10 23:14 UTC (permalink / raw)
  To: Rik van Riel, Linus Torvalds
  Cc: Andreas Dilger, Andrea Arcangeli, Kernel Mailing List

On September 11, 2001 12:39 am, Rik van Riel wrote:
> On Mon, 10 Sep 2001, Linus Torvalds wrote:
> 
> > (Ugly secret: because I tend to have tons of memory, I sometimes do
> >
> > 	find tree1 tree2 -type f | xargs cat > /dev/null
> 
> This suggests we may want to do agressive readahead on the
> inode blocks.

While that is most probably true, that wasn't his point.  He preloaded the 
page cache.

--
Daniel

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-10 23:14                                           ` linux-2.4.10-pre5 Daniel Phillips
@ 2001-09-10 23:16                                             ` Linus Torvalds
  2001-09-11  0:53                                               ` linux-2.4.10-pre5 Rik van Riel
  0 siblings, 1 reply; 94+ messages in thread
From: Linus Torvalds @ 2001-09-10 23:16 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Rik van Riel, Andreas Dilger, Andrea Arcangeli, Kernel Mailing List


On Tue, 11 Sep 2001, Daniel Phillips wrote:

> On September 11, 2001 12:39 am, Rik van Riel wrote:
> > On Mon, 10 Sep 2001, Linus Torvalds wrote:
> >
> > > (Ugly secret: because I tend to have tons of memory, I sometimes do
> > >
> > > 	find tree1 tree2 -type f | xargs cat > /dev/null
> >
> > This suggests we may want to do agressive readahead on the
> > inode blocks.
>
> While that is most probably true, that wasn't his point.  He preloaded the
> page cache.

Well, yes, but more importantly, I pre-loaded my page cache _with_io_that_
_is_more_likely_to_be_consecutive_.

This means that the preload + subsequent diff is already likely to be
faster than doing just one "diff" would have been.

So it's not just about preloading. It's also about knowing about access
patterns beforehand - something that the kernel really cannot do.

Pre-loading your cache always depends on some limited portion of
prescience. If you preload too aggressively compared to your knowledge of
future usage patterns, you're _always_ going to lose. I think that the
kernel doing physical read-ahead is "too aggressive", while doing it
inside "diff" (that _knows_ the future patterns) is not.

And doing it by hand depends on the user knowing what he will do in the
future. In some cases that's fairly easy for the user to guess ;)

		Linus


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-10 22:15                                       ` linux-2.4.10-pre5 Linus Torvalds
  2001-09-10 22:26                                         ` linux-2.4.10-pre5 Linus Torvalds
  2001-09-10 22:39                                         ` linux-2.4.10-pre5 Rik van Riel
@ 2001-09-10 23:20                                         ` Daniel Phillips
  2001-09-11  0:20                                           ` linux-2.4.10-pre5 Linus Torvalds
  2 siblings, 1 reply; 94+ messages in thread
From: Daniel Phillips @ 2001-09-10 23:20 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andreas Dilger, Andrea Arcangeli, Kernel Mailing List

On September 11, 2001 12:15 am, Linus Torvalds wrote:
> However, physical read-ahead really isn't the answer here. I bet you could
> cut your time down with it, agreed. But you'd hurt a lot of other loads,
> and it really depends on nice layout on disk. Plus you wouldn't even know
> where to put the data you read-ahead: you only have the physical address,
> not the logical address, and the page-cache is purely logically indexed..

You leave it in the buffer cache and the page cache checks for it there 
before deciding it has to hit the disk.

For coherency, getblk has to check for the block in the page cache.  We can
implement this using the buffer hash chain links, currently unused for page
cache buffers.
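
(To illustrate the shape of it: a toy userspace model - made-up names, not
2.4 code - of one physical-block hash that both paths probe before doing
any real I/O. Whichever path pulled a block in first, the other finds it by
block number instead of re-reading it:)

/* Toy model of the proposal, not kernel code: one hash keyed by physical
 * block number; entries may belong to the buffer cache or the page cache,
 * and each lookup path probes it before going to the disk. */
#include <stdio.h>
#include <stdlib.h>

#define NHASH   64
#define BLKSIZE 1024

struct block {
        unsigned long blocknr;          /* physical block number */
        int in_page_cache;              /* who owns the data right now */
        char data[BLKSIZE];
        struct block *next;             /* hash chain shared by both caches */
};

static struct block *hash[NHASH];

static struct block *lookup(unsigned long blocknr)
{
        struct block *b;

        for (b = hash[blocknr % NHASH]; b; b = b->next)
                if (b->blocknr == blocknr)
                        return b;
        return NULL;
}

static struct block *read_block(unsigned long blocknr, int for_page_cache)
{
        struct block *b = lookup(blocknr);

        if (!b) {
                /* miss in both caches: the only place we'd really do I/O */
                b = calloc(1, sizeof(*b));
                b->blocknr = blocknr;
                snprintf(b->data, BLKSIZE, "disk contents of block %lu", blocknr);
                b->next = hash[blocknr % NHASH];
                hash[blocknr % NHASH] = b;
        }
        /* hit: data may have been pulled in by the other path (physical
         * readahead, dd, ...); claim it instead of re-reading it */
        b->in_page_cache = for_page_cache;
        return b;
}

int main(void)
{
        struct block *b;

        read_block(42, 0);              /* filled by dd / physical readahead */
        b = read_block(42, 1);          /* page cache path: hit, no second read */
        printf("%s (now in page cache: %d)\n", b->data, b->in_page_cache);
        return 0;
}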

--
Daniel

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-10 23:13                                             ` linux-2.4.10-pre5 Daniel Phillips
@ 2001-09-10 23:25                                               ` Alex Bligh - linux-kernel
  0 siblings, 0 replies; 94+ messages in thread
From: Alex Bligh - linux-kernel @ 2001-09-10 23:25 UTC (permalink / raw)
  To: Daniel Phillips, Alex Bligh - linux-kernel, Linus Torvalds
  Cc: Andreas Dilger, Andrea Arcangeli, Kernel Mailing List,
	Alex Bligh - linux-kernel



--On Tuesday, 11 September, 2001 1:13 AM +0200 Daniel Phillips 
<phillips@bonn-fries.net> wrote:

> OK, now to shorten this up, if you've reached the conclusion that the
> page  cache needs to be able to take advantage of blocks already read
> into the  buffer cache then we're done.  That was my point all along.

ACK. Maybe what I am arguing is that all pages read should live in the
page/unified cache, i.e. the current buffer cache should at most contain
references to data already read and influence expiry of those pages (as
opposed to holding private copies thereof) - see the previous readahead
debate ref best/worst pages to expire - BUT STILL thus drive what is
cached at the layers below (i.e. drive the file-based readahead). I'm not
saying there is necessarily no place for physical readahead even with
disks being able to do it, just that it is difficult to measure whilst
logical readahead pages put in the buffer cache cannot be taken advantage
of by the page cache.

It may well be that I am lagging well behind on the thought process.
Need lkml readahead obviously :-)

--
Alex Bligh

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-10 23:20                                         ` linux-2.4.10-pre5 Daniel Phillips
@ 2001-09-11  0:20                                           ` Linus Torvalds
  2001-09-11  1:16                                             ` linux-2.4.10-pre5 Daniel Phillips
  0 siblings, 1 reply; 94+ messages in thread
From: Linus Torvalds @ 2001-09-11  0:20 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Andreas Dilger, Andrea Arcangeli, Kernel Mailing List


On Tue, 11 Sep 2001, Daniel Phillips wrote:
> On September 11, 2001 12:15 am, Linus Torvalds wrote:
> > However, physical read-ahead really isn't the answer here. I bet you could
> > cut your time down with it, agreed. But you'd hurt a lot of other loads,
> > and it really depends on nice layout on disk. Plus you wouldn't even know
> > where to put the data you read-ahead: you only have the physical address,
> > not the logical address, and the page-cache is purely logically indexed..
>
> You leave it in the buffer cache and the page cache checks for it there
> before deciding it has to hit the disk.

Umm..

Ehh.. So now you have two cases:
 - you hit in the cache, in which case you've done an extra allocation,
   and will have to do an extra memcpy.
 - you don't hit in the cache, in which case you did IO that was
   completely useless and just made the system slower.

Considering that the extra allocation and memcpy is likely to seriously
degrade performance on any high-end hardware if it happens any noticeable
percentage of the time, I don't see how your suggestion can _ever_ be a win.
The only time you avoid the memcpy is when you wasted the IO completely.

So please explain to me how you think this is all a good idea? Or explain
why you think the memcpy is not going to be noticeable in disk throughput
for normal loads?

		Linus


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-10 23:16                                             ` linux-2.4.10-pre5 Linus Torvalds
@ 2001-09-11  0:53                                               ` Rik van Riel
  2001-09-11  6:39                                                 ` linux-2.4.10-pre5 Hua Zhong
  2001-09-11 15:12                                                 ` linux-2.4.10-pre5 Linus Torvalds
  0 siblings, 2 replies; 94+ messages in thread
From: Rik van Riel @ 2001-09-11  0:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Daniel Phillips, Andreas Dilger, Andrea Arcangeli, Kernel Mailing List

On Mon, 10 Sep 2001, Linus Torvalds wrote:
> > On September 11, 2001 12:39 am, Rik van Riel wrote:

> > > This suggests we may want to do agressive readahead on the
> > > inode blocks.

> So it' snot just about preloading. It's also about knowing about access
> patterns beforehand - something that the kernel really cannot do.
>
> Pre-loading your cache always depends on some limited portion of
> prescience.

OTOH, aggressively pre-loading metadata should be ok in a lot
of cases, because metadata is very small, but wastes about
half of our disk time because of the seeks ...

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-11  0:20                                           ` linux-2.4.10-pre5 Linus Torvalds
@ 2001-09-11  1:16                                             ` Daniel Phillips
  2001-09-11  2:27                                               ` linux-2.4.10-pre5 Linus Torvalds
  0 siblings, 1 reply; 94+ messages in thread
From: Daniel Phillips @ 2001-09-11  1:16 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andreas Dilger, Andrea Arcangeli, Kernel Mailing List

On September 11, 2001 02:20 am, Linus Torvalds wrote:
> On Tue, 11 Sep 2001, Daniel Phillips wrote:
> > On September 11, 2001 12:15 am, Linus Torvalds wrote:
> > > However, physical read-ahead really isn't the answer here. I bet you could
> > > cut your time down with it, agreed. But you'd hurt a lot of other loads,
> > > and it really depends on nice layout on disk. Plus you wouldn't even know
> > > where to put the data you read-ahead: you only have the physical address,
> > > not the logical address, and the page-cache is purely logically indexed..
> >
> > You leave it in the buffer cache and the page cache checks for it there
> > before deciding it has to hit the disk.
> 
> Umm..
> 
> Ehh.. So now you have two cases:
>  - you hit in the cache, in which case you've done an extra allocation,
>    and will have to do an extra memcpy.
>  - you don't hit in the cache, in which case you did IO that was
>    completely useless and just made the system slower.

If the read comes from block_read then the data goes into the page cache.  If
it comes from getblk (because of physical readahead or "dd xxx >null") the 
data goes into the buffer cache and may later be moved to the page cache, 
once we know what the logical mapping is.

> Considering that the extra allocation and memcpy is likely to seriously
> degrade performance on any high-end hardware if it happens any noticeable
> percentage of the time, I don't see how your suggest can _ever_ be a win.
> The only time you avoid the memcpy is when you wasted the IO completely.

We don't have any extra allocation in the cases we're already handling now, 
it works exactly the same.  The only overhead is an extra hash probe on cache 
miss.

> So please explain to me how you think this is all a good idea? Or explain
> why you think the memcpy is not going to be noticeable in disk throughput
> for normal loads?

When we're forced to do a memcpy it's for a case where we're saving a read or 
a seek.  Even then, the memcpy can be optimized away in the common case that 
the blocksize matches page_size.  Sometimes, even when the blocksize doesn't 
match this optimization would be possible.  But the memcpy optimization isn't 
important, the goal is to save reads and seeks by combining reads and reading 
blocks in physical order as opposed to file order.

An observation: logical readahead can *never* read a block before it knows 
what the physical mapping is, whereas physical readahead can.

--
Daniel

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-11  1:16                                             ` linux-2.4.10-pre5 Daniel Phillips
@ 2001-09-11  2:27                                               ` Linus Torvalds
  2001-09-11  7:45                                                 ` linux-2.4.10-pre5 Helge Hafting
  2001-09-11 10:27                                                 ` linux-2.4.10-pre5 Daniel Phillips
  0 siblings, 2 replies; 94+ messages in thread
From: Linus Torvalds @ 2001-09-11  2:27 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Andreas Dilger, Andrea Arcangeli, Kernel Mailing List


On Tue, 11 Sep 2001, Daniel Phillips wrote:
> >
> > Ehh.. So now you have two cases:
> >  - you hit in the cache, in which case you've done an extra allocation,
> >    and will have to do an extra memcpy.
> >  - you don't hit in the cache, in which case you did IO that was
> >    completely useless and just made the system slower.
>
> If the read comes from block_read then the data goes into the page cache.  If
> it comes from getblk (because of physical readahead or "dd xxx >null") the
> data goes into the buffer cache and may later be moved to the page cache,
> once we know what the logical mapping is.

Note that practically all accesses will be through the logical mapping,
and the virtual index.

While the physical mapping and the physical index is the only one you can
do physical read-ahead into.

And the two DO NOT OVERLAP. They never will. You can't just swizzle
pointers around - you will have to memcpy.

In short, you'll end up doing a memcpy() pretty much every single time you
hit.

> > Considering that the extra allocation and memcpy is likely to seriously
> > degrade performance on any high-end hardware if it happens any noticeable
> > percentage of the time, I don't see how your suggest can _ever_ be a win.
> > The only time you avoid the memcpy is when you wasted the IO completely.
>
> We don't have any extra allocation in the cases we're already handling now,
> it works exactly the same.  The only overhead is an extra hash probe on cache
> miss.

No it doesn't. We're already handling _nothing_: the only thing we're
handling right now is:

 - swap cache read-ahead (which is _not_ a physical read-ahead, but a
   logical one, and as such is not at all the case you're handling)

 - we're invalidating buffer heads that we find to be aliasing the virtual
   page we create, which is _also_ not at all the same thing, it's simply
   an issue of making sure that we haven't had the (unlikely) case of a
   meta-data block being free'd, but not yet written back.

So in the first case we don't have an aliasing issue at all (it's really a
virtual index, although in the case of a swap partition the virtual
address ends up being a 1:1 mapping of the physical address), and in the
second case we do not intend to use the physical mapping we find, we just
intend to get rid of it.

> > So please explain to me how you think this is all a good idea? Or explain
> > why you think the memcpy is not going to be noticeable in disk throughput
> > for normal loads?
>
> When we're forced to do a memcpy it's for a case where we're saving a read or
> a seek.

No.

The above is assuming that the disk doesn't already have the data in its
buffers. In which case the only thing we're doing is making the IO command
and the DMA that filled the page happen earlier.

Which can be good for latency, but considering that the read-ahead is at
least as likely to be _bad_ for latency, I don't believe in that argument
very much. Especially not when you've dirtied the cache and introduced an
extra memcpy.

>	  Even then, the memcpy can be optimized away in the common case that
> the blocksize matches page_size.

Well, you actually have to be very very careful: if you do that there is
just a _ton_ of races there (you'd better be _really_ sure that nobody
else has already found either of the pages, the "move page" operation is
not exactly completely painless).

>				  Sometimes, even when the blocksize doesn't
> match this optimization would be possible.  But the memcpy optimization isn't
> important, the goal is to save reads and seeks by combining reads and reading
> blocks in physical order as opposed to file order.

Try it. I'll be convinced by numbers, and I'll bet you won't have the
numbers for the common cases to prove yourself right.

You're making the (in my opinion untenable) argument that the logical
read-ahead is flawed enough of the time that you win by doing an
opportunistic physical read-ahead, even when that implies more memory
pressure both from an allocation standpoint (if you fail 10% of the time,
you'll have 10% more page cache pages you need to get rid of gracefully,
that never had _any_ useful information in them) and from a memory bus
standpoint.

> An observation: logical readahead can *never* read a block before it knows
> what the physical mapping is, whereas physical readahead can.

Sure. But the meta-data is usually on the order of 1% or less of the data,
which means that you tend to need to read a meta-data block only 1% of the
time you need to read a real data block.

Which makes _that_ optimization not all that useful.

So I'm claiming that in order to get any useful performance improvments,
your physical read-ahead has to be _clearly_ better than the logical one.
I doubt it is.

Basically, the things where physical read-ahead might win:
 - it can be done without metadata (< 1%)
 - it can be done "between files" (very uncommon, especially as you have
   to assume that different files are physically close)

Can you see any others? I doubt you can get physical read-ahead that gets
more than a few percentage points better hits than the logical one. AND I
further claim that you'll get a _lot_ more misses on physical read-ahead,
which means that your physical read-ahead window should probably be larger
than our current logical read-ahead.

I have seen no arguments from you that might imply anything else, really.

Which in turn also implies that the _overhead_ of physical read-ahead is
larger. And that is without even the issue of memcpy and/or switching
pages around, _and_ completely ignoring all the complexity-issues.

But hey, at the end of the day, numbers rule.

		Linus


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-10  8:30                                     ` linux-2.4.10-pre5 Kai Henningsen
@ 2001-09-11  5:29                                       ` Peter Samuelson
  2001-09-11 11:29                                         ` linux-2.4.10-pre5 Kai Henningsen
  0 siblings, 1 reply; 94+ messages in thread
From: Peter Samuelson @ 2001-09-11  5:29 UTC (permalink / raw)
  To: kaih, linux-kernel

> I see two possible atime uses:
>
> 1. Cleaning up /tmp (mtime is *not* a good indicator that a file is no
> longer used)
> 2. Swapping out files to slower storage
>
> Essentially, both use the "do we still need this thing" aspect.

The Debian 'popularity-contest' package has an interesting use for
atime.  It runs weekly (if you install it, of course -- nobody *has* to
do this!) and determines when each of your installed packages was last
referenced -- then mails this anonymously to a drop box somewhere.  The
Higher Purpose here is to determine what the "most useful" Debian
packages are.  These go on the first volume of the Debian CD set, to
make a one-volume Debian CD as useful as possible.
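
(The check itself is trivial - a hypothetical sketch, not what
popularity-contest actually runs, just stat() and st_atime:)

/* Hypothetical sketch of the atime check: report how long ago each file
 * named on the command line was last read. */
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

int main(int argc, char **argv)
{
        struct stat st;
        time_t now = time(NULL);
        int i;

        for (i = 1; i < argc; i++) {
                if (stat(argv[i], &st) == 0)
                        printf("%s: last read %ld days ago\n", argv[i],
                               (long)((now - st.st_atime) / 86400));
        }
        return 0;
}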

The Windows 2000 install subsystem seems to collect similar data.
Heaven only knows whether *that* data ever makes it back to its
owners....

Peter

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-11  0:53                                               ` linux-2.4.10-pre5 Rik van Riel
@ 2001-09-11  6:39                                                 ` Hua Zhong
  2001-09-11 15:12                                                 ` linux-2.4.10-pre5 Linus Torvalds
  1 sibling, 0 replies; 94+ messages in thread
From: Hua Zhong @ 2001-09-11  6:39 UTC (permalink / raw)
  To: Rik van Riel, Linus Torvalds
  Cc: Daniel Phillips, Andreas Dilger, Andrea Arcangeli, Kernel Mailing List

It's not about the absolute size of metadata - it's about having a limited
cache size.  Whatever you read ahead, if it doesn't do any good it
wastes space and pushes (more) useful data out of the cache, and thus what
you gain from the readahead will be outweighed by further re-reading.

readahead is good, but doing it too aggressively is not.

----- Original Message -----
From: "Rik van Riel" <riel@conectiva.com.br>
To: "Linus Torvalds" <torvalds@transmeta.com>
Cc: "Daniel Phillips" <phillips@bonn-fries.net>; "Andreas Dilger"
<adilger@turbolabs.com>; "Andrea Arcangeli" <andrea@suse.de>; "Kernel
Mailing List" <linux-kernel@vger.kernel.org>
Sent: Monday, September 10, 2001 5:53 PM
Subject: Re: linux-2.4.10-pre5


> On Mon, 10 Sep 2001, Linus Torvalds wrote:
> > > On September 11, 2001 12:39 am, Rik van Riel wrote:
>
> > > > This suggests we may want to do agressive readahead on the
> > > > inode blocks.
>
> > So it' snot just about preloading. It's also about knowing about access
> > patterns beforehand - something that the kernel really cannot do.
> >
> > Pre-loading your cache always depends on some limited portion of
> > prescience.
>
> OTOH, agressively pre-loading metadata should be ok in a lot
> of cases, because metadata is very small, but wastes about
> half of our disk time because of the seeks ...
>
> regards,
>
> Rik
> --
> IA64: a worthy successor to i860.
>
> http://www.surriel.com/ http://distro.conectiva.com/
>
> Send all your spam to aardvark@nl.linux.org (spam digging piggy)
>


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-11  2:27                                               ` linux-2.4.10-pre5 Linus Torvalds
@ 2001-09-11  7:45                                                 ` Helge Hafting
  2001-09-11 10:27                                                 ` linux-2.4.10-pre5 Daniel Phillips
  1 sibling, 0 replies; 94+ messages in thread
From: Helge Hafting @ 2001-09-11  7:45 UTC (permalink / raw)
  To: Linus Torvalds, linux-kernel

Linus Torvalds wrote:

> > An observation: logical readahead can *never* read a block before it knows
> > what the physical mapping is, whereas physical readahead can.
> 
> Sure. But the meta-data is usually on the order of 1% or less of the data,
> which means that you tend to need to read a meta-data block only 1% of the
> time you need to read a real data block.

Seems to me a readahead without metadata doesn't buy very much.  Sure,
you get the file page early without looking up metadata on disk.  But
oops - it cannot be used yet, as we don't yet know that it _is_ part of
the file!  When the process gets around to asking for that part of the
file we still have to wait for the metadata.

Physical readahead may or may not help - but I cannot see that this
particular aspect helps anything.

Helge Hafting

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-10 22:39                                         ` linux-2.4.10-pre5 Rik van Riel
  2001-09-10 23:14                                           ` linux-2.4.10-pre5 Daniel Phillips
@ 2001-09-11 10:02                                           ` Daniel Phillips
  2001-09-11 20:07                                             ` linux-2.4.10-pre5 Rik van Riel
  1 sibling, 1 reply; 94+ messages in thread
From: Daniel Phillips @ 2001-09-11 10:02 UTC (permalink / raw)
  To: Rik van Riel, Linus Torvalds
  Cc: Andreas Dilger, Andrea Arcangeli, Kernel Mailing List

On September 11, 2001 12:39 am, Rik van Riel wrote:
> On Mon, 10 Sep 2001, Linus Torvalds wrote:
> 
> > (Ugly secret: because I tend to have tons of memory, I sometimes do
> >
> > 	find tree1 tree2 -type f | xargs cat > /dev/null
> 
> This suggests we may want to do agressive readahead on the
> inode blocks.
> 
> They are small enough to - mostly - cache and should reduce
> the amount of disk seeks quite a bit. In an 8 MB block group
> with one 128 byte inode every 8 kB, we have a total of 128 kB
> of inodes...

I tested this idea by first doing a ls -R on the tree, then Linus's find 
command:

    time ls -R linux >/dev/null
    time find linux -type f | xargs cat > /dev/null

the plan being that the ls command would read all the inode blocks and hardly 
any of the files would be big enough to have an index block, so we would 
effectively have all metadata in cache.

According to your theory the total time for the two commands should be less
than the time for the second command alone.  But it wasn't: the two commands
together took almost exactly the same time as the second command by itself.

There goes that theory.

--
Daniel

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-11  2:27                                               ` linux-2.4.10-pre5 Linus Torvalds
  2001-09-11  7:45                                                 ` linux-2.4.10-pre5 Helge Hafting
@ 2001-09-11 10:27                                                 ` Daniel Phillips
  2001-09-11 15:39                                                   ` linux-2.4.10-pre5 Linus Torvalds
  1 sibling, 1 reply; 94+ messages in thread
From: Daniel Phillips @ 2001-09-11 10:27 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andreas Dilger, Andrea Arcangeli, Kernel Mailing List

[long, interesting only to hardcore readahead freaks]

On September 11, 2001 04:27 am, Linus Torvalds wrote:
> On Tue, 11 Sep 2001, Daniel Phillips wrote:
> > >
> > > Ehh.. So now you have two cases:
> > >  - you hit in the cache, in which case you've done an extra allocation,
> > >    and will have to do an extra memcpy.
> > >  - you don't hit in the cache, in which case you did IO that was
> > >    completely useless and just made the system slower.
> >
> > If the read comes from block_read then the data goes into the page cache.  If
> > it comes from getblk (because of physical readahead or "dd xxx >null") the
> > data goes into the buffer cache and may later be moved to the page cache,
> > once we know what the logical mapping is.
> 
> Note that practically all accesses will be through the logical mapping,
> and the virtual index.
> 
> While the physical mapping and the physical index is the only one you can
> do physical read-ahead into.
> 
> And the two DO NOT OVERLAP. They never will. You can't just swizzle
> pointers around - you will have to memcpy.
> 
> In short, you'll end up doing a memcpy() pretty much every single time you
> hit.

And *only* when we hit.  Even if we don't optimize away the memcpy, it's a
win, so long as we get enough hits to make up for the cost of any wasted
readaheads.  Any time physical readahead correctly hits a block of metadata
then chances are good we've eliminated a synchronous stall.

[...]
>  - we're invalidating buffer heads that we find to be aliasing the virtual
>    page we create, which is _also_ not at all the same thing, it's simply
>    an issue of making sure that we haven't had the (unlikely) case of a
>    meta-data block being free'd, but not yet written back.

Aha.  So we are going to do the buffer cache hash probe anyway, the thing I
call the reverse lookup.  Now I'm just suggesting we drop the other shoe
and chain all the page cache blocks to the buffer hash.  The only extra cost 
will be the buffer hash insert and delete, and in return we get complete 
coherency.
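
(Again purely a toy model with made-up names, not the real buffer cache
code, just to show that the extra bookkeeping is one hash insert when a
page-cache block gets its physical mapping and one hash remove when the
page is evicted:)

#include <stdio.h>

#define NHASH 64

struct pblock {
        unsigned long blocknr;          /* physical block number */
        struct pblock *next, **pprev;   /* chain shared by both caches */
};

static struct pblock *hash[NHASH];

/* called when a page-cache block gets its physical mapping */
static void hash_insert(struct pblock *b)
{
        struct pblock **h = &hash[b->blocknr % NHASH];

        b->next = *h;
        if (*h)
                (*h)->pprev = &b->next;
        b->pprev = h;
        *h = b;
}

/* called when the page is evicted */
static void hash_remove(struct pblock *b)
{
        *b->pprev = b->next;
        if (b->next)
                b->next->pprev = b->pprev;
}

static struct pblock *hash_lookup(unsigned long blocknr)
{
        struct pblock *b;

        for (b = hash[blocknr % NHASH]; b; b = b->next)
                if (b->blocknr == blocknr)
                        return b;
        return NULL;                    /* miss: only now do real I/O */
}

int main(void)
{
        struct pblock b = { .blocknr = 42 };

        hash_insert(&b);
        printf("block 42 %s\n", hash_lookup(42) ? "cached" : "not cached");
        hash_remove(&b);
        printf("block 42 %s\n", hash_lookup(42) ? "cached" : "not cached");
        return 0;
}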

> > > So please explain to me how you think this is all a good idea? Or explain
> > > why you think the memcpy is not going to be noticeable in disk throughput
> > > for normal loads?
> >
> > When we're forced to do a memcpy it's for a case where we're saving a read or
> > a seek.
> 
> No.
> 
> The above is assuming that the disk doesn't already have the data in its
> buffers. In which case the only thing we're doing is making the IO command
> and the DMA that filled the page happen earlier.
> 
> Which can be good for latency, but considering that the read-ahead is at
> least as likely to be _bad_ for latency, I don't believe in that argument
> very much. Especially not when you've dirtied the cache and introduced an
> extra memcop.

Wait, does dma dirty the cache?  I'd hope not.

> >	  Even then, the memcpy can be optimized away in the common case that
> > the blocksize matches page_size.
> 
> Well, you actually have to be very very careful: if you do that there is
> just a _ton_ of races there (you'd better be _really_ sure that nobody
> else has already found either of the pages, the "move page" operation is
> not exactly completely painless).

It doesn't look that bad.  The buffer hash link doesn't change so we don't
need the hash_table_lock.  We basically do an add_to_page_cache less the
lru_cache_add and flags initialization.

> [...]
> You're making the (in my opinion untenable) argument that the logical
> read-ahead is flawed enough of the time that you win by doing an
> opportunistic physical read-ahead, even when that implies ore memory
> pressure both from an allocation standpoint (if you fail 10% of the time,
> you'll have 10% more page cache pages you need to get rid of gracefully,
> that never had _any_ useful information in them) and from a memory bus
> standpoint.

Yes, that's the main danger indeed.  So we'd better be sensitive to other 
disk traffic and not hog it for readahead.  The same with cache memory.  When
we do need to recover cache, the readahead blocks better be easy to recover.
It helps that they're clean.  To recover them efficiently they can live on
their own lru list, which can be readily cannibalized.

> > An observation: logical readahead can *never* read a block before it knows
> > what the physical mapping is, whereas physical readahead can.
> 
> Sure. But the meta-data is usually on the order of 1% or less of the data,
> which means that you tend to need to read a meta-data block only 1% of the
> time you need to read a real data block.
> 
> Which makes _that_ optimization not all that useful.

Oh no, the metadata blocks have a far greater impact than that: they are
serializers in the sense that you have to read the metadata before reading
the data blocks.  This means the head has to return to locations it has
already visited, because it didn't know that the blocks at that location
would be needed soon.  I'm tentatively adopting this as the explanation for
the 5X gap between current read performance and what we know to be possible.

> So I'm claiming that in order to get any useful performance improvments,
> your physical read-ahead has to be _clearly_ better than the logical one.
> I doubt it is.

Even marginally better will satisfy me, because what I'm really after is the
coherency across buffer and page cache.  Also, I'm not presenting it as an
either/or.  I see physical readahead as complementary to logical readahead.

> Basically, the things where physical read-ahead might win:
>  - it can be done without metadata (< 1%)

See above.  It's not the actual size of the metadata that matters, it's
the way it serializes access to the data blocks.

>  - it can be done "between files" (very uncommon, especially as you have
>    to assume that different files are physically close)

Entire files can lie "between files".  That is, you have files A B C D in 
order on the disk and the directory happens to order them A D C B.  Relying
only on logical readahead there is a good chance of incurring an extra seek.

Disk fragmentation makes the out-of-order problem a lot more common.

> Can you see any others?

Yes:

  - Lots of ide drives don't have any cache at all.  It's not in question
    that *some* physical readahead is good, right?  At least we should be
    able to buffer a couple of tracks even for those dumb disks.  This is
    particularly helpful in the case where we're reading lots of little
    files.  With logical readahead, even if the inode happens to be in
    memory by virtue of being on the same block (physical readahead for
    free) we have to set up a new set of reads for each file, plenty of
    time for the file data to pass under the read head.  The result is
    one complete revolution per file, 6 ms.  Whereas with physical
    readahead we could pick those files up at a rate of 1 ms each.

  - The disk firmware ought to be smart enough to evict already-read
    blocks from its cache early, in which case our readahead
    effectively frees up space in its cache.

And not connected with readahead per se, but with the associated cache
coherency:

  - It lets us use dd for preload, which takes maximum advantage of
    the user's prescience.  It's also a natural interface for programs
    that know more about expected use patterns than the kernel.  For
    example: a GUI, we could certainly use some improvement there.  Or
    a jvm, or even a browser (hello mozilla).

We have the case of the diffs to show us that, with a little prescience,
we can do a *lot* better with physical readahead.  Now what remains is to 
approximate that prescience with useful heuristics.

The point about dd doesn't rely on readahead at all, it just requires that
the page cache snoop and copy from the buffer cache.  I'm quite prepared to 
implement that, just to see how it works.  Then I can imagine a trivial 
algorithm that traverses a directory tree bmapping all the files, sorts the
result, then dd's all the heavily clustered parts into the cache in one pass.
That should run quite a bit faster than your find command.
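
(Roughly, and only as a sketch: the hypothetical program below does it for
a single directory using the FIBMAP ioctl, which needs root; error handling
and recursive traversal omitted:)

/* Sketch: map block 0 of each file with FIBMAP, sort by physical block,
 * then read the files back in physical order to warm the cache. */
#include <dirent.h>
#include <fcntl.h>
#include <linux/fs.h>           /* FIBMAP */
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>

struct ent {
        unsigned long phys;     /* physical block of the file's block 0 */
        char path[4096];
};

static int by_phys(const void *a, const void *b)
{
        unsigned long x = ((const struct ent *)a)->phys;
        unsigned long y = ((const struct ent *)b)->phys;

        return x < y ? -1 : x > y;
}

int main(int argc, char **argv)
{
        struct ent *v = NULL;
        struct dirent *de;
        struct stat st;
        static char buf[65536];
        DIR *d;
        int n = 0, i, fd, blk;

        if (argc < 2 || !(d = opendir(argv[1])))
                return 1;
        while ((de = readdir(d)) != NULL) {
                v = realloc(v, (n + 1) * sizeof(*v));
                snprintf(v[n].path, sizeof(v[n].path), "%s/%s",
                         argv[1], de->d_name);
                if (stat(v[n].path, &st) || !S_ISREG(st.st_mode))
                        continue;
                if ((fd = open(v[n].path, O_RDONLY)) < 0)
                        continue;
                blk = 0;                                /* logical block 0 ... */
                if (!ioctl(fd, FIBMAP, &blk) && blk) {  /* ... now physical */
                        v[n].phys = blk;
                        n++;
                }
                close(fd);
        }
        closedir(d);
        if (n > 0)
                qsort(v, n, sizeof(*v), by_phys);       /* physical order */
        for (i = 0; i < n; i++) {                       /* the "dd" pass */
                if ((fd = open(v[i].path, O_RDONLY)) < 0)
                        continue;
                while (read(fd, buf, sizeof(buf)) > 0)
                        ;
                close(fd);
        }
        return 0;
}

On a freshly untarred tree the second pass should be nearly sequential; on
a fragmented tree, rather less so.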

Something else we get for free is the ability to getblk a block that's
already in page cache, with full cache coherency.  You mentioned this could
be useful for on-the-fly fsck (it requires cooperation between the fs and
the fsck).  It could also be useful for distributed filesystems.

As far as robustness goes, we get rid of the aliasing problem and gain the
ability to detect cross-linked data blocks.

So we should see some benefit even before attempting to improve readahead.

> I doubt you can get physical read-ahead that gets
> more than a few percentage points better hits than the logical one. AND I
> further claim that you'll get a _lot_ more misses on physical read-ahead,
> which means that your physical read-ahead window should probably be larger
> than our current logical read-ahead.

Yes, that's true.  More misses balanced against fewer seeks and disk 
rotations, and improved latency.

> I have seen no arguments from you that might imply anything else, really.
> 
> Which in turn also implies that the _overhead_ of physical read-ahead is
> larger. And that is without even the issue of memcpy and/or switching
> pages around, _and_ completely ignoring all the complexity-issues.
> 
> But hey, at the end of the day, numbers rule.

Yes, for sure.  As far as I'm concerned, the diff test and your find example 
prove that there is gold to be mined here.

From my point of view, the advantages of increased robustness and error
detection are by themselves worth the price of admission.  So if I
only managed to get, say, a 2% improvement it would be a solid victory.  It
just has to cover the cost of the extra hash probes.

We already know we can do much better than that, a factor of 5-6 for the
diff under good conditions.  Surely over time we'll learn how to come
closer to that with a predictive as opposed to prescient algorithm.

--
Daniel

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-11  5:29                                       ` linux-2.4.10-pre5 Peter Samuelson
@ 2001-09-11 11:29                                         ` Kai Henningsen
  0 siblings, 0 replies; 94+ messages in thread
From: Kai Henningsen @ 2001-09-11 11:29 UTC (permalink / raw)
  To: linux-kernel

peter@cadcamlab.org (Peter Samuelson)  wrote on 11.09.01 in <20010911002956.D582@cadcamlab.org>:

> > I see two possible atime uses:
> >
> > 1. Cleaning up /tmp (mtime is *not* a good indicator that a file is no
> > longer used)
> > 2. Swapping out files to slower storage
> >
> > Essentially, both use the "do we still need this thing" aspect.
>
> The Debian 'popularity-contest' package has an interesting use for
> atime.

Oh yes, forgot about that one. It's still the same idea, though.

>  These go on the first volume of the Debian CD set, to
> make a one-volume Debian CD as useful as possible.

And the others are pushed to the second CD ... "swapping out files to  
slower storage".


MfG Kai

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-11  0:53                                               ` linux-2.4.10-pre5 Rik van Riel
  2001-09-11  6:39                                                 ` linux-2.4.10-pre5 Hua Zhong
@ 2001-09-11 15:12                                                 ` Linus Torvalds
  2001-09-11 15:44                                                   ` linux-2.4.10-pre5 Daniel Phillips
  1 sibling, 1 reply; 94+ messages in thread
From: Linus Torvalds @ 2001-09-11 15:12 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Daniel Phillips, Andreas Dilger, Andrea Arcangeli, Kernel Mailing List


On Mon, 10 Sep 2001, Rik van Riel wrote:
> >
> > Pre-loading your cache always depends on some limited portion of
> > prescience.
>
> OTOH, agressively pre-loading metadata should be ok in a lot
> of cases, because metadata is very small, but wastes about
> half of our disk time because of the seeks ...

I actually agree to some degree here. The main reason I agree is that
meta-data often is (a) known to be physically contiguous (so pre-fetching
is easy and cheap on most hardware) and (b) meta-data _is_ small, so you
don't have to prefetch very much (so pre-fetching is not all that
expensive).

Trying to read two or more pages of inode data whenever we fetch an inode
might not be a bad idea, for example. Either we fetch a _lot_ of inodes
(in which case the prefetching is very likely to get a hit anyway), or we
don't (in which case the prefetching is unlikely to hurt all that much
either). You don't easily get into a situation where you prefetch a lot
without gaining anything.

We might do other kinds of opportunistic pre-fetching, like have "readdir"
start prefetching for the inode data it finds. That might be a win for many
loads (and it might be a loss too - there _are_ loads that really only
care about the filename, although I suspect they are fairly rare).

		Linus


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-11 10:27                                                 ` linux-2.4.10-pre5 Daniel Phillips
@ 2001-09-11 15:39                                                   ` Linus Torvalds
  2001-09-11 16:52                                                     ` linux-2.4.10-pre5 Daniel Phillips
  0 siblings, 1 reply; 94+ messages in thread
From: Linus Torvalds @ 2001-09-11 15:39 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Andreas Dilger, Andrea Arcangeli, Kernel Mailing List


On Tue, 11 Sep 2001, Daniel Phillips wrote:
> >
> > In short, you'll end up doing a memcpy() pretty much every single time you
> > hit.
>
> And *only* when we hit.  Even if we don't optimize away the memcpy, it's a
> win, so long as we get enough hits to make up for the cost of any wasted
> readaheads.  Any time physical readahead correctly hits a block of metadata
> then chances are good we've eliminated a synchronous stall.

Ehh..

Your argument seems to be "it won't cost us anything if we don't hit".

But hey, if we don't hit, then it _will_ cost us - that implies that the
read-ahead was _completely_ wasted. That's the worst case, not the "good"
case. You've just wasted CPU cycles in setting up the bogus IO, and disk
and bus cycles in executing it.

> >  - we're invalidating buffer heads that we find to be aliasing the virtual
> >    page we create, which is _also_ not at all the same thing, it's simply
> >    an issue of making sure that we haven't had the (unlikely) case of a
> >    meta-data block being free'd, but not yet written back.
>
> Aha.  So we are going to do the buffer cache hash probe anyway, the thing I
> call the the reverse lookup.  Now I'm just suggesting we drop the other shoe
> and chain all the page cache blocks to the buffer hash.  The only extra cost
> will be the buffer hash insert and delete, and in return we get complete
> coherency.

"complete"? No. The coherency is very much one-way: we'd better not have
anything that actually dirties the buffer cache, because that dirtying
will _not_ be seen by a virtual cache.

Note that this is not anything new, though.

> > The above is assuming that the disk doesn't already have the data in its
> > buffers. In which case the only thing we're doing is making the IO command
> > and the DMA that filled the page happen earlier.
> >
> > Which can be good for latency, but considering that the read-ahead is at
> > least as likely to be _bad_ for latency, I don't believe in that argument
> > very much. Especially not when you've dirtied the cache and introduced an
> > extra memcop.
>
> Wait, does dma dirty the cache?  I'd hope not.

It can do so (many architectures like the ARM actually do DMA in
software), but no, I wasn't talking about the DMA, but about the memcpy.

> > Well, you actually have to be very very careful: if you do that there is
> > just a _ton_ of races there (you'd better be _really_ sure that nobody
> > else has already found either of the pages, the "move page" operation is
> > not exactly completely painless).
>
> Is doesn't look that bad.  The buffer hash link doesn't change so we don't
> need the hash_table_lock.  We basically do an add_to_page_cache less the
> lru_cache_add and flags intialization.

Wrong.

You need the hash_table_lock for _another_ reason: you need to make really
sure that nobody is traversing the hash table right then - because you can
NOT afford to have somebody else find one of the pages or buffers while
you're doing the operation (you have to test for all the counts being
zero, and you have to do that test when you can guarantee that nobody else
suddenly finds it).

> > > An observation: logical readahead can *never* read a block before it knows
> > > what the physical mapping is, whereas physical readahead can.
> >
> > Sure. But the meta-data is usually on the order of 1% or less of the data,
> > which means that you tend to need to read a meta-data block only 1% of the
> > time you need to read a real data block.
> >
> > Which makes _that_ optimization not all that useful.
>
> Oh no, the metadata blocks have a far greater impact than that: they are
> serializers in the sense that you have to read the metadata before reading
> the data blocks.

Ehh.. You _have_ to read them anyway before you can use your data.

Your argument is fundamentally flawed: remember that you cannot actually
_use_ the data you read ahead before you actually have the linear mapping.
And you have to get the meta-data information before you can _create_ the
linear mapping.

Doing physical read-ahead does not mean you can avoid reading meta-data,
and if you claim that as a "win", you're just lying to yourself.

Doing physical read-ahead only means that you can read-ahead _before_
reading meta-data: it does not remove the need for meta-data, it only
potentially removes a ordering dependency.

But it potentially removes that ordering dependency through speculation
(only if the speculation ends up being correct, of course), and
speculation has a cost. This is nothing new - people have been doing
things like this at other levels for a long time (doing things like data
speculation inside CPU's, for example).

Basically, you do not decrease IO - you only try to potentially make it
more parallel. Which can be a win.

But it can be a big loss too. Speculation always ends up depending on the
fact that you have more "throughput" than you actually take advantage of.
By implication, you can also clearly see that speculation is bound to be
provably _bad_ in any case where you can already saturate your resources.

> > So I'm claiming that in order to get any useful performance improvments,
> > your physical read-ahead has to be _clearly_ better than the logical one.
> > I doubt it is.
>
> Even marginally better will satisfy me, because what I'm really after is the
> coherency across buffer and page cache.  Also, I'm not presenting it as an
> either/or.  I see physical readahead as complementary to logical readahead.

Marginally better is not good enough if other loads are marginally worse.

And I will bet that you _will_ see marginally worse numbers. Which is why
I think you want "clearly better" just to offset the marginally worse
numbers.

> > Basically, the things where physical read-ahead might win:
> >  - it can be done without metadata (< 1%)
>
> See above.  It's not the actual size of the metadata that matters, it's
> the way it serializes access to the data blocks.

See above. It always will. There is NO WAY you can actually return the
file data to the user before having read the metadata. QED. You ONLY
remove ordering constraints, nothing more.

> > Can you see any others?
>
> Yes:
>
>   - Lots of ide drives don't have any cache at all.  It's not in question
>     that *some* physical readahead is good, right?

Every single IDE drive I know of has cache. Some of them only have a single
buffer, but quite frankly, I've not seen a new IDE drive in the last few
years with less than 1MB of cache. And that's not a "single 1MB buffer".

And this trend hasn't been going backwards. Disks have _always_ been
getting more intelligent, rather than less.

In short, you're betting against history here. And against technology
getting better. It's not a bet that I would ever do..

>   - The disk firmware ought to be smart enough to evict already-read
>     blocks from its cache early, in which case our readahead
>     effectively frees up space in its cache.

With 2MB of disk cache (which is what all the IDE disks _I_ have access to
have), the disk will have more of a read-ahead buffer than you'd
reasonably do in software. How much did you imagine you'd read ahead
physically? Megabytes? I don't think so..

		Linus


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-11 15:12                                                 ` linux-2.4.10-pre5 Linus Torvalds
@ 2001-09-11 15:44                                                   ` Daniel Phillips
  2001-09-11 15:48                                                     ` linux-2.4.10-pre5 Linus Torvalds
  0 siblings, 1 reply; 94+ messages in thread
From: Daniel Phillips @ 2001-09-11 15:44 UTC (permalink / raw)
  To: Linus Torvalds, Rik van Riel
  Cc: Andreas Dilger, Andrea Arcangeli, Kernel Mailing List

On September 11, 2001 05:12 pm, Linus Torvalds wrote:
> On Mon, 10 Sep 2001, Rik van Riel wrote:
> > >
> > > Pre-loading your cache always depends on some limited portion of
> > > prescience.
> >
> > OTOH, agressively pre-loading metadata should be ok in a lot
> > of cases, because metadata is very small, but wastes about
> > half of our disk time because of the seeks ...
> 
> I actually agree to some degree here. The main reason I agree is that
> meta-data often is (a) known to be physically contiguous (so pre-fecthing
> is easy and cheap on most hardwate) and (b) meta-data _is_ small, so you
> don't have to prefetch very much (so pre-fetching is not all that
> expensive).
> 
> Trying to read two or more pages of inode data whenever we fetch an inode
> might not be a bad idea, for example. Either we fetch a _lot_ of inodes
> (in which case the prefetching is very likely to get a hit anyway), or we
> don't (in which case the prefetching is unlikely to hurt all that much
> either). You don't easily get into a situation where you prefetch a lot
> without gaining anything.
> 
> We might do other kinds of opportunistic pre-fetching, like have "readdir"
> start prefetching for the inode data it finds. That miht be a win for many
> loads (and it might be a loss too - there _are_ loads that really only
> care about the filename, although I suspect they are fairly rare).

But see my post in this thread where I created a simple test to show that, 
even when we pre-read *all* the inodes in a directory, there is no great 
performance win.

Thinking about it some more, I can see that some optimization is possible for 
the 20% or so of the work that was devoted to reading in the inodes, but such 
optimization has the nature of physical readahead in the inode table.  Which 
would apply equally well to the file data itself.

--
Daniel

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-11 15:44                                                   ` linux-2.4.10-pre5 Daniel Phillips
@ 2001-09-11 15:48                                                     ` Linus Torvalds
  2001-09-11 16:05                                                       ` linux-2.4.10-pre5 Alex Bligh - linux-kernel
                                                                         ` (2 more replies)
  0 siblings, 3 replies; 94+ messages in thread
From: Linus Torvalds @ 2001-09-11 15:48 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Rik van Riel, Andreas Dilger, Andrea Arcangeli, Kernel Mailing List


On Tue, 11 Sep 2001, Daniel Phillips wrote:
>
> But see my post in this thread where I created a simple test to show that,
> even when we pre-read *all* the inodes in a directory, there is no great
> performance win.

Note that I suspect that because the inode tree _is_ fairly dense, you
don't actually need to do much read-ahead in most cases. Simply because
you automatically do read-ahead _always_: when somebody reads a 128-byte
inode, you (whether you like it or not) always "read-ahead" the 31 inodes
around it on a 4kB filesystem.

So we _already_ do read-ahead by a "factor of 31". Whether we can improve
that or not by increasing it to 63 inodes, who knows?

I actually think that the "start read-ahead for inode blocks when you do
readdir" might be a bigger win, because that would be a _new_ kind of
read-ahead that we haven't done before, and might improve performance for
things like "ls -l" in the cold-cache situation..

(Although again, because the inode is relatively small compared to the IO
cache size, it's probably fairly _hard_ to get a fully cold-cache inode case. So
I'm not sure even that kind of read-ahead would actually make any
difference at all).

		Linus


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-11 15:48                                                     ` linux-2.4.10-pre5 Linus Torvalds
@ 2001-09-11 16:05                                                       ` Alex Bligh - linux-kernel
  2001-09-11 16:07                                                       ` linux-2.4.10-pre5 Daniel Phillips
  2001-09-11 17:17                                                       ` linux-2.4.10-pre5 Daniel Phillips
  2 siblings, 0 replies; 94+ messages in thread
From: Alex Bligh - linux-kernel @ 2001-09-11 16:05 UTC (permalink / raw)
  To: Linus Torvalds, Daniel Phillips
  Cc: Rik van Riel, Andreas Dilger, Andrea Arcangeli,
	Kernel Mailing List, Alex Bligh - linux-kernel



--On Tuesday, September 11, 2001 8:48 AM -0700 Linus Torvalds 
<torvalds@transmeta.com> wrote:

> I actually think that the "start read-ahead for inode blocks when you do
> readdir" might be a bigger win, because that would be a _new_ kind of
> read-ahead that we haven't done before, and might improve performance for
> things like "ls -l" in the cold-cache situation..

This was exactly what I did in the example I gave Daniel, except
we went slightly further (ONLY if I/O was quiescent) and effectively
went through the readdir: if we found a directory, we preloaded
(effectively) the blocks we'd need for a readdir() on that, and if we
found a file, we preloaded the first few blocks (this will be harder if
you've no obvious logical map). Before actually reading any of the stuff,
we'd get a list of the blocks to read, and sort it as per physical
layout. We also logically read ahead files as soon as they were
opened (we generally had the first few blocks from the above
operation). For directories with short files in them, and/or recursive tree
operations, it was a big win. [It might be less of a win under Linux
as the peculiar environment this lived on had resource-fork-esque
stuff represented by directory-like things.]
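
A rough userspace illustration of that kind of pass (this is not Alex's
original code, which ran on a different system; sorting by inode number is
used here as a cheap stand-in for "sort as per physical layout", since ext2
keeps its inode table in inode-number order):

  #include <dirent.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/stat.h>
  #include <sys/types.h>

  struct ent { ino_t ino; char path[512]; };

  static int by_ino(const void *a, const void *b)
  {
      const struct ent *x = a, *y = b;
      return (x->ino > y->ino) - (x->ino < y->ino);
  }

  int main(int argc, char **argv)
  {
      const char *dir = argc > 1 ? argv[1] : ".";
      DIR *d = opendir(dir);
      struct ent *v = NULL;
      size_t n = 0, cap = 0;
      struct dirent *de;
      struct stat st;

      if (!d) { perror("opendir"); return 1; }

      /* Pass 1: readdir only, collect names and inode numbers. */
      while ((de = readdir(d)) != NULL) {
          if (n == cap) {
              cap = cap ? cap * 2 : 64;
              v = realloc(v, cap * sizeof(*v));
              if (!v) return 1;
          }
          v[n].ino = de->d_ino;
          snprintf(v[n].path, sizeof(v[n].path), "%s/%s", dir, de->d_name);
          n++;
      }
      closedir(d);

      /* Pass 2: sort by inode number, then touch the inodes in that
       * order so the inode table is read roughly sequentially.      */
      qsort(v, n, sizeof(*v), by_ino);
      for (size_t i = 0; i < n; i++)
          stat(v[i].path, &st);

      free(v);
      return 0;
  }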

--
Alex Bligh

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-11 15:48                                                     ` linux-2.4.10-pre5 Linus Torvalds
  2001-09-11 16:05                                                       ` linux-2.4.10-pre5 Alex Bligh - linux-kernel
@ 2001-09-11 16:07                                                       ` Daniel Phillips
  2001-09-11 16:07                                                         ` linux-2.4.10-pre5 Alex Bligh - linux-kernel
  2001-09-11 17:17                                                       ` linux-2.4.10-pre5 Daniel Phillips
  2 siblings, 1 reply; 94+ messages in thread
From: Daniel Phillips @ 2001-09-11 16:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, Andreas Dilger, Andrea Arcangeli, Kernel Mailing List

On September 11, 2001 05:48 pm, Linus Torvalds wrote:
> On Tue, 11 Sep 2001, Daniel Phillips wrote:
> >
> > But see my post in this thread where I created a simple test to show that,
> > even when we pre-read *all* the inodes in a directory, there is no great
> > performance win.
> 
> Note that I suspect that because the inode tree _is_ fairly dense, you
> don't actually need to do much read-ahead in most cases. Simply because
> you automatically do read-ahead _always_: when somebody reads a 128-byte
> inode, you (whether you like it or not) always "read-ahead" the 31 inodes
> around it on a 4kB filesystem.
> 
> So we _already_ do read-ahead by a "factor of 31". Whether we can improve
> that or not by increasing it to 63 inodes, who knows?
> 
> I actually think that the "start read-ahead for inode blocks when you do
> readdir" might be a bigger win, because that would be a _new_ kind of
> read-ahead that we haven't done before, and might improve performance for
> things like "ls -l" in the cold-cache situation..

But wait, if our theories are correct then the disk is doing physical 
readahead anyway, and it's a nice SCSI disk with lots of cache, so *why does 
it take so long to read the metadata*?  It's about 11,000 files, that's 
1.3 MB of inodes and just a couple of hundred K of directories.

There is clearly something nonoptimal about the hardware readahead and/or 
caching.

> (Although again, because the inode is relatively small compared to the IO
> cache size, it's probably fairly _hard_ to get a fully cold-cache inode
> case. So I'm not sure even that kind of read-ahead would actually make any
> difference at all).

--
Daniel

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-11 16:07                                                       ` linux-2.4.10-pre5 Daniel Phillips
@ 2001-09-11 16:07                                                         ` Alex Bligh - linux-kernel
  2001-09-11 16:13                                                           ` linux-2.4.10-pre5 Martin Dalecki
  0 siblings, 1 reply; 94+ messages in thread
From: Alex Bligh - linux-kernel @ 2001-09-11 16:07 UTC (permalink / raw)
  To: Daniel Phillips, Linus Torvalds
  Cc: Rik van Riel, Andreas Dilger, Andrea Arcangeli,
	Kernel Mailing List, Alex Bligh - linux-kernel



--On Tuesday, September 11, 2001 6:07 PM +0200 Daniel Phillips 
<phillips@bonn-fries.net> wrote:

> There is clearly something nonoptimal about the hardware readahead and/or
> caching.

Partly I guess that the disk h/w cache is on the wrong side of a potential
IO bottleneck - i.e. you can read faster from main memory than from
any disk cache.

--
Alex Bligh

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-11 16:07                                                         ` linux-2.4.10-pre5 Alex Bligh - linux-kernel
@ 2001-09-11 16:13                                                           ` Martin Dalecki
  0 siblings, 0 replies; 94+ messages in thread
From: Martin Dalecki @ 2001-09-11 16:13 UTC (permalink / raw)
  To: Alex Bligh - linux-kernel
  Cc: Daniel Phillips, Linus Torvalds, Rik van Riel, Andreas Dilger,
	Andrea Arcangeli, Kernel Mailing List

Alex Bligh - linux-kernel wrote:
> 
> --On Tuesday, September 11, 2001 6:07 PM +0200 Daniel Phillips
> <phillips@bonn-fries.net> wrote:
> 
> > There is clearly something nonoptimal about the hardware readahead and/or
> > caching.
> 
> Partly I guess that the disk h/w cache is the wrong side of a potential
> IO bottleneck - i.e. you can read faster from main memory than from
> any disk cache.


But during the read from the disk you can do different things in
parallel. Think DMA! And then please remember my posts from a year ago:

We already have 3 different read-ahead layers in Linux!

Read it twice: THREE, one on top of the other:

1. Physical.

2. Inside the driver.

2a. Block device layer.

3. File system layer.

And in addition to this there is:

2a, or maybe even 2b, if you take MD and RAID into account as well.

And please remember that each of them can fight against the others, because
they all presume different ordering mechanisms.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-11 15:39                                                   ` linux-2.4.10-pre5 Linus Torvalds
@ 2001-09-11 16:52                                                     ` Daniel Phillips
  0 siblings, 0 replies; 94+ messages in thread
From: Daniel Phillips @ 2001-09-11 16:52 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andreas Dilger, Andrea Arcangeli, Kernel Mailing List

On September 11, 2001 05:39 pm, Linus Torvalds wrote:
> On Tue, 11 Sep 2001, Daniel Phillips wrote:
> > > In short, you'll end up doing a memcpy() pretty much every single time you
> > > hit.
> >
> > And *only* when we hit.  Even if we don't optimize away the memcpy, it's a
> > win, so long as we get enough hits to make up for the cost of any wasted
> > readaheads.  Any time physical readahead correctly hits a block of metadata
> > then chances are good we've eliminated a synchronous stall.
> 
> Ehh..
> 
> Your argument seems to be "it won't cost us anything if we don't hit".
> 
> But hey, if we don't hit, then it _will_ cost us - that implies that the
> read-ahead was _completely_ wasted. That's the worst case, not the "good"
> case. You've just wasted CPU cycles in setting up the bogus IO, and disk
> and bus cycles in executing it.

My fault for being imprecise.  By "hit" I meant "hit readahead data in the 
buffer cache".  Hits on data already in the page cache will work exactly as 
they do now, and page cache data that is originally accessed through the page 
cache (i.e., whenever we know its mapping) will not take a side trip through 
the buffer cache.

> > Aha.  So we are going to do the buffer cache hash probe anyway, the thing I
> > call the reverse lookup.  Now I'm just suggesting we drop the other shoe
> > and chain all the page cache blocks to the buffer hash.  The only extra cost
> > will be the buffer hash insert and delete, and in return we get complete
> > coherency.
> 
> "complete"? No. The coherency is very much one-way: we'd better not have
> anything that actually dirties the buffer cache, because that dirtying
> will _not_ be seen by a virtual cache.

Then I haven't communicated what I had in mind to do.  Dirtying data through 
the buffer cache would be ok.  There are two cases: 1) the block is in the 
page cache, but we get to it through the buffer hash chain instead of the 
page hash chain.  Fine, we can dirty it.  2) the block is not in the page 
cache.  We can still dirty it, because if the page cache tries to add that 
block it will first go look for it in the "buffer cache" (which is really 
just a way of saying "not in the page cache") and move it into the page
cache.
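
A toy userspace model of that two-way lookup (all names are invented for
illustration; this is not the actual 2.4 buffer-cache or page-cache code):
the cache is probed first by the logical (mapping, index) key, then by the
physical (dev, blocknr) key, and a block known only physically is adopted
into the page cache on its first logical access.

  #include <stdio.h>

  #define NBUF 8

  /* One cached block, findable by its physical address (dev, blocknr)
   * and, once a file has claimed it, by its logical address too.    */
  struct block {
      int dev, blocknr;    /* physical key: always valid             */
      int mapping, index;  /* logical key: -1 until page-cache owned */
      int dirty;
      char data[16];
  };

  static struct block cache[NBUF];
  static int nblocks;

  /* Probe by physical address: the "buffer hash" side. */
  static struct block *find_phys(int dev, int blocknr)
  {
      for (int i = 0; i < nblocks; i++)
          if (cache[i].dev == dev && cache[i].blocknr == blocknr)
              return &cache[i];
      return NULL;
  }

  /* Probe by logical address: the "page hash" side. */
  static struct block *find_logical(int mapping, int index)
  {
      for (int i = 0; i < nblocks; i++)
          if (cache[i].mapping == mapping && cache[i].index == index)
              return &cache[i];
      return NULL;
  }

  /* Case 1: logical hit.  Case 2: adopt a block that physical
   * read-ahead already brought in under its (dev, blocknr) key.     */
  static struct block *page_cache_get(int mapping, int index,
                                      int dev, int blocknr)
  {
      struct block *b = find_logical(mapping, index);
      if (b)
          return b;
      b = find_phys(dev, blocknr);
      if (b) {
          b->mapping = mapping;
          b->index = index;
          return b;
      }
      /* Miss everywhere: a real implementation would do I/O here.   */
      b = &cache[nblocks++];
      *b = (struct block){ dev, blocknr, mapping, index, 0, "" };
      return b;
  }

  int main(void)
  {
      /* Physical read-ahead drops block (dev 1, block 42) into the
       * cache before any file has mapped it.                        */
      cache[nblocks++] = (struct block){ 1, 42, -1, -1, 0, "readahead" };

      /* Later the page cache wants (mapping 7, index 0), which maps
       * to that same physical block: it adopts it, no second read.  */
      struct block *b = page_cache_get(7, 0, 1, 42);
      printf("got '%s', now owned by mapping %d\n", b->data, b->mapping);
      return 0;
  }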

> [...]
> Your argument is fundamentally flawed: remember that you cannot actually
> _use_ the data you read ahead before you actually have the linear mapping.
> And you have to get the meta-data information before you can _create_ the
> linear mapping.

You are seeing an inconsistency I'm not seeing.

> Doing physical read-ahead does not mean you can avoid reading meta-data,
> and if you claim that as a "win", you're just lying to yourself.
> 
> Doing physical read-ahead only means that you can read-ahead _before_
> reading meta-data: it does not remove the need for meta-data, it only
> potentially removes a ordering dependency.

Right.

> But it potentially removes that ordering dependency through speculation
> (only if the speculation ends up being correct, of course), and
> speculation has a cost. This is nothing new - people have been doing
> things like this at other levels for a long time (doing things like data
> speculation inside CPU's, for example).
> 
> Basically, you do not decrease IO - you only try to potentially make it
> more parallel. Which can be a win.
> 
> But it can be a big loss too. Speculation always ends up depending on the
> fact that you have more "throughput" than you actually take advantage of.
> By implication, you can also clearly see that speculation is bound to be
> provably _bad_ in any case where you can already saturate your resources.

Absolutely, so such a facility must shut itself down when there are other 
demands for the bandwidth.
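
A minimal sketch of that kind of gate, with invented names (a real
implementation would presumably hang off the request queue rather than a
global counter): speculative read-ahead is simply refused while any demand
I/O is in flight.

  #include <stdatomic.h>
  #include <stdbool.h>

  /* Count of demand (non-speculative) requests currently in flight. */
  static atomic_uint demand_in_flight;

  static void demand_io_start(void) { atomic_fetch_add(&demand_in_flight, 1); }
  static void demand_io_done(void)  { atomic_fetch_sub(&demand_in_flight, 1); }

  /* Speculative read-ahead is only allowed while the device is
   * otherwise idle, so it never steals bandwidth from real work.    */
  static bool may_speculate(void)
  {
      return atomic_load(&demand_in_flight) == 0;
  }

  int main(void)
  {
      demand_io_start();
      bool during = may_speculate();   /* false: demand I/O pending  */
      demand_io_done();
      bool after = may_speculate();    /* true: queue is idle again  */
      return (!during && after) ? 0 : 1;
  }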

> [...]
> Marginally better is not good enough if other loads are marginally worse.

Of course I did not mean that.  I meant marginally better on average.

> And I will bet that you _will_ see marginally worse numbers. Which is why
> I think you want "clearly better" just to offset the marginally worse
> numbers.

OK, I'll buy that.
 
> > > Basically, the things where physical read-ahead might win:
> > >  - it can be done without metadata (< 1%)
> >
> > See above.  It's not the actual size of the metadata that matters, it's
> > the way it serializes access to the data blocks.
> 
> See above. It always will. There is NO WAY you can actually return the
> file data to the user before having read the metadata. QED.

Fine, but you may already have read the *data* by the time you read the
metadata, and so don't have to wait for it - it's already in cache.  That's
where the win is.

> You ONLY remove ordering constraints, nothing more.

Agreed.  But that's a big deal.

> [...]
> With 2MB of disk cache (which is what all the IDE disks _I_ have access to
> have), the disk will have more of a read-ahead buffer than you'd
> reasonably do in software. How much did you imagine you'd read ahead
> physically? Megabytes? I don't think so..

Yes, I was thinking megabytes in order to get anything close to the 5X 
improvement that appears to be possible.  But first I was planning to start 
more modestly, just pick up a few preemptive data reads here and there that 
the logical readahead mechanism would miss, in addition to servicing the 
logical readahead just as we do now.

To do the more ambitious multi-megabyte readahead, I had in mind something
that would combine information from some high-level sources - directory
operations, generic_read, etc. - with information picked up at the
physical level, massaging it all into a kind of display from which we'd
be able to determine easily that we have a trend going on like "reading
the whole damn directory".  Once we know that's what we're doing, it's not
hard to see what to do about it.

Just to do a reality check on that: if somebody reads the first 1/10 of
the files in a directory tree in getdents order, it's a pretty safe bet
they're going to read the other 9/10ths, right?
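
A hedged sketch of what such a trend detector might look like (the
structure, the hook and the 1/10 threshold are all hypothetical, not an
existing interface):

  #include <stdbool.h>
  #include <stdio.h>

  struct dir_trend {
      unsigned total_entries;    /* entries the readdir pass reported  */
      unsigned opened_in_order;  /* opens that matched getdents order  */
      unsigned last_pos;         /* getdents position of previous open */
  };

  /* Hypothetical hook, called whenever a file in the directory is
   * opened; 'pos' is the entry's position in getdents order.        */
  static bool note_open(struct dir_trend *t, unsigned pos)
  {
      if (pos >= t->last_pos) {
          t->opened_in_order++;
          t->last_pos = pos;
      }
      /* "read the first 1/10 in order", so bet on the other 9/10ths */
      return t->opened_in_order * 10 >= t->total_entries;
  }

  int main(void)
  {
      struct dir_trend t = { .total_entries = 11000 };

      for (unsigned pos = 0; pos < 11000; pos++) {
          if (note_open(&t, pos)) {
              printf("trend after %u in-order opens: start "
                     "whole-directory read-ahead\n", t.opened_in_order);
              break;
          }
      }
      return 0;
  }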

--
Daniel

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-11 15:48                                                     ` linux-2.4.10-pre5 Linus Torvalds
  2001-09-11 16:05                                                       ` linux-2.4.10-pre5 Alex Bligh - linux-kernel
  2001-09-11 16:07                                                       ` linux-2.4.10-pre5 Daniel Phillips
@ 2001-09-11 17:17                                                       ` Daniel Phillips
  2 siblings, 0 replies; 94+ messages in thread
From: Daniel Phillips @ 2001-09-11 17:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, Andreas Dilger, Andrea Arcangeli, Kernel Mailing List

On September 11, 2001 05:48 pm, Linus Torvalds wrote:
> I actually think that the "start read-ahead for inode blocks when you do
> readdir" might be a bigger win, because that would be a _new_ kind of
> read-ahead that we haven't done before, and might improve performance for
> things like "ls -l" in the cold-cache situation..
> 
> (Although again, because the inode is relatively small compared to the IO
> cache size, it's probably fairly _hard_ to get a fully cold-cache inode
> case. So I'm not sure even that kind of read-ahead would actually make any
> difference at all).

Having it all in cache makes a huge difference.  For:

  ls -R linux >/dev/null

it's about 5 seconds cold, about .1 second the second time.  A factor of 50!

--
Daniel

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: linux-2.4.10-pre5
  2001-09-11 10:02                                           ` linux-2.4.10-pre5 Daniel Phillips
@ 2001-09-11 20:07                                             ` Rik van Riel
  0 siblings, 0 replies; 94+ messages in thread
From: Rik van Riel @ 2001-09-11 20:07 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Linus Torvalds, Andreas Dilger, Andrea Arcangeli, Kernel Mailing List

On Tue, 11 Sep 2001, Daniel Phillips wrote:

> I tested this idea by first doing a ls -R on the tree, then
> Linus's find command:
>
>     time ls -R linux >/dev/null
>     time find linux -type f | xargs cat > /dev/null

> According to your theory the total time for the two commands
> should be less than the second command alone.  But it wasn't,
> the two commands together took almost exactly the same time as
> the second command by itself.

Well DUH, your first command (the ls -R) isn't doing any readahead on the
inodes either.

> There goes that theory.

You might want to test it first.

cheers,

Rik
--
IA64: a worthy successor to the i860.

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 94+ messages in thread

end of thread, other threads:[~2001-09-11 20:07 UTC | newest]

Thread overview: 94+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-09-08  4:18 linux-2.4.10-pre5 Linus Torvalds
2001-09-08  6:32 ` linux-2.4.10-pre5: drivers/net/wan compile fixes Eyal Lebedinsky
2001-09-08  6:36 ` 2.4.9-ac10 (not 2.4.10-pre5!) wan fixes Eyal Lebedinsky
2001-09-08  8:32 ` 2.4.10-pre5 compile error George Bonser
2001-09-08 17:19 ` linux-2.4.10-pre5 Andrea Arcangeli
2001-09-08 17:30   ` linux-2.4.10-pre5 Linus Torvalds
2001-09-08 17:57     ` linux-2.4.10-pre5 Andrea Arcangeli
2001-09-08 18:01       ` linux-2.4.10-pre5 Linus Torvalds
2001-09-09  1:09         ` linux-2.4.10-pre5 Andrea Arcangeli
2001-09-09  1:20           ` linux-2.4.10-pre5 Linus Torvalds
2001-09-09  1:38             ` linux-2.4.10-pre5 Andrea Arcangeli
2001-09-09  1:53               ` linux-2.4.10-pre5 Linus Torvalds
2001-09-09  2:22                 ` linux-2.4.10-pre5 Andrea Arcangeli
2001-09-09  2:31                   ` linux-2.4.10-pre5 Linus Torvalds
2001-09-09  3:30                     ` linux-2.4.10-pre5 Andrea Arcangeli
2001-09-09  3:58                       ` linux-2.4.10-pre5 Linus Torvalds
2001-09-09  4:16                         ` linux-2.4.10-pre5 Andrea Arcangeli
2001-09-09  4:28                           ` linux-2.4.10-pre5 Linus Torvalds
2001-09-09 12:09                             ` linux-2.4.10-pre5 Rik van Riel
2001-09-09 14:53                               ` linux-2.4.10-pre5 Andrea Arcangeli
2001-09-09 18:17                               ` linux-2.4.10-pre5 Linus Torvalds
2001-09-09 20:18                                 ` linux-2.4.10-pre5 H. Peter Anvin
2001-09-10  0:39                                   ` linux-2.4.10-pre5 Simon Kirby
2001-09-10  8:30                                     ` linux-2.4.10-pre5 Kai Henningsen
2001-09-11  5:29                                       ` linux-2.4.10-pre5 Peter Samuelson
2001-09-11 11:29                                         ` linux-2.4.10-pre5 Kai Henningsen
2001-09-10 21:22                                   ` linux-2.4.10-pre5 Stephen C. Tweedie
2001-09-09 14:47                             ` linux-2.4.10-pre5 Andrea Arcangeli
2001-09-09 16:24                               ` linux-2.4.10-pre5 Linus Torvalds
2001-09-09 17:29                                 ` linux-2.4.10-pre5 Andrea Arcangeli
2001-09-09 23:56                             ` linux-2.4.10-pre5 Daniel Phillips
2001-09-09  4:29                         ` linux-2.4.10-pre5 Andreas Dilger
2001-09-09  4:54                           ` linux-2.4.10-pre5 Linus Torvalds
2001-09-09  6:17                             ` linux-2.4.10-pre5 Andreas Dilger
2001-09-09 17:31                               ` linux-2.4.10-pre5 Linus Torvalds
2001-09-09 19:19                                 ` linux-2.4.10-pre5 Daniel Phillips
2001-09-09 23:24                                   ` linux-2.4.10-pre5 Linus Torvalds
2001-09-09 23:54                                     ` linux-2.4.10-pre5 Alan Cox
2001-09-10  0:04                                       ` linux-2.4.10-pre5 Daniel Phillips
2001-09-10  0:23                                       ` linux-2.4.10-pre5 Linus Torvalds
2001-09-10  0:23                                     ` linux-2.4.10-pre5 Daniel Phillips
2001-09-10  0:38                                       ` linux-2.4.10-pre5 Linus Torvalds
2001-09-10  1:04                                         ` linux-2.4.10-pre5 Andrea Arcangeli
2001-09-10  1:45                                         ` linux-2.4.10-pre5 Chris Mason
2001-09-10  1:55                                           ` linux-2.4.10-pre5 Andrea Arcangeli
2001-09-10  2:15                                             ` linux-2.4.10-pre5 Daniel Phillips
2001-09-10  2:20                                               ` linux-2.4.10-pre5 Chris Mason
2001-09-10  2:40                                                 ` linux-2.4.10-pre5 Daniel Phillips
2001-09-10  3:02                                                 ` linux-2.4.10-pre5 Chris Mason
2001-09-10  3:36                                                   ` linux-2.4.10-pre5 Daniel Phillips
2001-09-10 19:06                                                   ` linux-2.4.10-pre5 Chris Mason
2001-09-10  2:22                                               ` linux-2.4.10-pre5 Daniel Phillips
2001-09-10  2:02                                           ` linux-2.4.10-pre5 Chris Mason
2001-09-10  2:06                                             ` linux-2.4.10-pre5 Andrea Arcangeli
2001-09-10  2:03                                           ` linux-2.4.10-pre5 Linus Torvalds
2001-09-10  2:41                                             ` linux-2.4.10-pre5 Chris Mason
2001-09-10 21:18                                     ` linux-2.4.10-pre5 Daniel Phillips
2001-09-10 21:23                                       ` linux-2.4.10-pre5 Alex Bligh - linux-kernel
2001-09-10 21:54                                         ` linux-2.4.10-pre5 Daniel Phillips
2001-09-10 22:39                                           ` linux-2.4.10-pre5 Alex Bligh - linux-kernel
2001-09-10 23:13                                             ` linux-2.4.10-pre5 Daniel Phillips
2001-09-10 23:25                                               ` linux-2.4.10-pre5 Alex Bligh - linux-kernel
2001-09-10 22:15                                       ` linux-2.4.10-pre5 Linus Torvalds
2001-09-10 22:26                                         ` linux-2.4.10-pre5 Linus Torvalds
2001-09-10 22:39                                         ` linux-2.4.10-pre5 Rik van Riel
2001-09-10 23:14                                           ` linux-2.4.10-pre5 Daniel Phillips
2001-09-10 23:16                                             ` linux-2.4.10-pre5 Linus Torvalds
2001-09-11  0:53                                               ` linux-2.4.10-pre5 Rik van Riel
2001-09-11  6:39                                                 ` linux-2.4.10-pre5 Hua Zhong
2001-09-11 15:12                                                 ` linux-2.4.10-pre5 Linus Torvalds
2001-09-11 15:44                                                   ` linux-2.4.10-pre5 Daniel Phillips
2001-09-11 15:48                                                     ` linux-2.4.10-pre5 Linus Torvalds
2001-09-11 16:05                                                       ` linux-2.4.10-pre5 Alex Bligh - linux-kernel
2001-09-11 16:07                                                       ` linux-2.4.10-pre5 Daniel Phillips
2001-09-11 16:07                                                         ` linux-2.4.10-pre5 Alex Bligh - linux-kernel
2001-09-11 16:13                                                           ` linux-2.4.10-pre5 Martin Dalecki
2001-09-11 17:17                                                       ` linux-2.4.10-pre5 Daniel Phillips
2001-09-11 10:02                                           ` linux-2.4.10-pre5 Daniel Phillips
2001-09-11 20:07                                             ` linux-2.4.10-pre5 Rik van Riel
2001-09-10 23:20                                         ` linux-2.4.10-pre5 Daniel Phillips
2001-09-11  0:20                                           ` linux-2.4.10-pre5 Linus Torvalds
2001-09-11  1:16                                             ` linux-2.4.10-pre5 Daniel Phillips
2001-09-11  2:27                                               ` linux-2.4.10-pre5 Linus Torvalds
2001-09-11  7:45                                                 ` linux-2.4.10-pre5 Helge Hafting
2001-09-11 10:27                                                 ` linux-2.4.10-pre5 Daniel Phillips
2001-09-11 15:39                                                   ` linux-2.4.10-pre5 Linus Torvalds
2001-09-11 16:52                                                     ` linux-2.4.10-pre5 Daniel Phillips
2001-09-09 19:51                               ` linux-2.4.10-pre5 Daniel Phillips
2001-09-09  9:05                           ` linux-2.4.10-pre5 Christoph Hellwig
2001-09-09 13:14                           ` linux-2.4.10-pre5 Anton Altaparmakov
2001-09-09 14:31                             ` linux-2.4.10-pre5 Andrea Arcangeli
2001-09-08 21:15     ` linux-2.4.10-pre5 Andreas Dilger
2001-09-09  0:59       ` linux-2.4.10-pre5 Andrea Arcangeli
2001-09-08 22:01 ` linux-2.4.10-pre5 Thiago Vinhas de Moraes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).